AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
67 papers today
8h update frequency
7 days of history
A Simple Plug-in for Improving Eviction-Based KV Cache Compression
NLP
Large Language Models
Efficient ML
- VECTOR introduces a three-way token allocation strategy that considers both importance and reconstructability.
- The method employs a lightweight Ordinary Least Squares (OLS) calibration for efficient value reconstruction.
- VECTOR can be integrated into existing eviction-based methods with minimal adaptations.
- Empirical results show significant performance improvements in high-compression scenarios.
Summary
The paper addresses the challenge of key-value (KV) cache growth in large language models (LLMs), which becomes a bottleneck for long-context inference. Traditional methods typically rely on binary eviction or representation approximation, leading to inefficiencies in memory utilization. The authors introduce VECTOR, a plug-and-play augmentation for eviction-based pipelines that implements a three-way token routing strategy: retention, approximation, and eviction. This approach combines an importance signal from a base scorer with a reconstructability signal derived from an offline-calibrated regression-based value estimation. By leveraging reconstructability, VECTOR recovers valuable information that would otherwise be lost during binary eviction while maintaining the stability of key vectors for attention routing. Experimental results demonstrate that VECTOR enhances the quality-memory trade-offs under medium-to-high compression, particularly in strict budget scenarios, thus improving the performance of existing eviction methods.
Methodology
VECTOR utilizes a three-way allocation strategy that categorizes tokens into retention, approximation, and eviction based on their importance and reconstructability. It employs an offline-calibrated OLS method to reconstruct value representations from keys, allowing for efficient online inference without architectural changes. This method enhances existing token-importance eviction strategies by incorporating a reconstructability dimension.
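The three-way allocation described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name `route_tokens`, the two budgets, and the rule "keep the most important tokens first, then approximate the most reconstructable of the rest" are all assumptions made here, since the summary does not state how the two signals are combined.

```python
def route_tokens(importance, recon_error, keep_budget, approx_budget):
    """Split token indices into (retain, approximate, evict).

    importance[i]  : score from a base eviction scorer (higher = more important)
    recon_error[i] : error of a calibrated value estimate (lower = easier to rebuild)
    The sequential rule below is a hypothetical stand-in for the paper's routing.
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    retain = sorted(order[:keep_budget])
    rest = order[keep_budget:]
    # Among the remaining tokens, approximate the most reconstructable ones.
    rest_by_error = sorted(rest, key=lambda i: recon_error[i])
    approximate = sorted(rest_by_error[:approx_budget])
    evict = sorted(rest_by_error[approx_budget:])
    return retain, approximate, evict


importance = [0.9, 0.1, 0.5, 0.2, 0.05]
recon_error = [0.3, 0.02, 0.4, 0.8, 0.01]
retain, approximate, evict = route_tokens(
    importance, recon_error, keep_budget=2, approx_budget=2
)
print(retain, approximate, evict)
```

In this toy input, a token with low importance but low reconstruction error lands in the approximation tier instead of being evicted, which is the behaviour the reconstructability signal is meant to enable.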
Results
The experimental evaluation on long-context benchmarks indicates that VECTOR significantly improves downstream performance under strict memory budgets, particularly in high-compression settings. The results validate the effectiveness of the reconstructability-aware allocation strategy in enhancing the quality of KV cache compression.
Implications
VECTOR's approach to KV cache compression has potential applications in optimizing memory usage for large language models, particularly in scenarios requiring long-context inference. This could lead to more efficient deployments in various applications, such as retrieval-heavy question answering and multi-turn reasoning tasks.
Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning
NLP
Large Language Models
Theory
- SymNoise outperforms NEFTune on AlpacaEval by about 6.7% in relative terms (69.04% vs. 64.69%).
- The study clarifies the functional equivalence of Gaussian and uniform noise when scaled appropriately.
- SymNoise enhances model performance by regulating local curvature during training.
- The method demonstrates significant improvements across multiple instruction datasets.
Summary
This paper investigates the use of noise injection in the fine-tuning of language models, specifically focusing on the NEFTune method, which employs uniform noise. Despite NEFTune's success, the reasons for its superiority over Gaussian noise remain unclear. The author conducts a comprehensive theoretical and empirical analysis, revealing that both noise types can achieve comparable performance when appropriately scaled. The paper introduces a novel fine-tuning method called Symmetric Noise Fine Tuning (SymNoise), which applies symmetric Bernoulli noise to embedding vectors. This method aims to enhance model performance by regulating the local curvature of the learned function. The results demonstrate that fine-tuning the LLaMA-2-7B model with SymNoise significantly improves performance on AlpacaEval, achieving a score of 69.04%, compared to 29.79% with standard techniques and 64.69% with NEFTune. The findings suggest that SymNoise consistently outperforms NEFTune across various datasets, establishing a new benchmark for instruction fine-tuning in large language models.
Methodology
The paper employs a theoretical and empirical analysis to compare the effects of Gaussian and uniform noise in fine-tuning language models. It introduces SymNoise, which uses symmetric Bernoulli noise applied to embedding vectors, aiming to regulate the curvature of the learned function without additional computational costs.
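A rough sketch of symmetric Bernoulli noise on embeddings is given below. The summary does not specify the noise magnitude, so the `alpha / sqrt(L * d)` scale is borrowed from NEFTune as an assumption, and the function name is hypothetical. The actual SymNoise procedure may differ.

```python
import math
import random


def add_symmetric_bernoulli_noise(embeddings, alpha, rng):
    """Add +scale or -scale to every embedding entry with equal probability.

    embeddings: list of L token vectors, each a list of d floats.
    scale = alpha / sqrt(L * d) is the NEFTune-style scaling (an assumption here).
    """
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    noisy = [[x + scale * rng.choice((-1.0, 1.0)) for x in row] for row in embeddings]
    return noisy, scale


rng = random.Random(0)
embeddings = [[0.0] * 4 for _ in range(3)]  # toy input: 3 tokens, 4 dimensions
noisy, scale = add_symmetric_bernoulli_noise(embeddings, alpha=5.0, rng=rng)
print(scale, noisy[0])
```

Unlike uniform or Gaussian noise, every perturbation here has the same magnitude, so only the sign pattern is random.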
Results
SymNoise raises the AlpacaEval score from 29.79% to 69.04%, a 39.25 percentage point improvement over standard fine-tuning. NEFTune scored 64.69%, so SymNoise leads it by 4.35 percentage points, which is about 6.7% in relative terms. SymNoise consistently outperformed NEFTune across various datasets and models.
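The reported figures can be checked directly from the three AlpacaEval scores. The 6.7% figure matches a relative gain over NEFTune, not a percentage point difference:

```latex
\begin{align*}
69.04 - 29.79 &= 39.25 \quad \text{(points over standard fine-tuning)} \\
69.04 - 64.69 &= 4.35 \quad \text{(points over NEFTune)} \\
\frac{4.35}{64.69} &\approx 0.067 \quad \text{(about 6.7\% relative to NEFTune)}
\end{align*}
```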
Implications
The findings suggest that incorporating noise-based strategies, particularly SymNoise, can significantly enhance the performance of large language models during instruction fine-tuning. This advancement may lead to more effective applications of language models in real-world scenarios, improving their robustness and accuracy.
Archimedean Copula Inference via Taylor-Mode AD
Theory
Efficient ML
Optimization
- ACOPULA is a JAX-native framework that handles arbitrary per-variable censoring in nested Archimedean copulas.
- It computes exact likelihoods and parameter gradients in polynomial time, overcoming limitations of existing tools.
- The framework supports both classical and neural copula generators, making it family-agnostic.
- ACOPULA demonstrates significant speed improvements, achieving a ∼650× speedup over existing R implementations.
Summary
This paper introduces ACOPULA, a novel framework for nested Archimedean copula inference that addresses significant limitations of existing tools. Existing implementations struggle with arbitrary per-variable censoring, complex nesting structures, and exact parameter gradients, often being restricted to low-dimensional or bivariate cases. ACOPULA leverages Taylor-mode automatic differentiation to compute exact nested-copula likelihoods and parameter gradients efficiently, accommodating any Archimedean generator, whether classical or neural. The framework operates in polynomial time, enabling it to handle high-dimensional datasets with complex censoring patterns. The authors validate ACOPULA through extensive simulations and real-world applications, demonstrating its capability to fit models to large datasets, including ICU admissions and financial returns, while achieving significant speed improvements over existing methods. This work not only provides a robust tool for statistical inference in survival analysis and finance but also opens avenues for further research in copula modeling and applications in various fields.
Methodology
The authors developed ACOPULA using Taylor-mode automatic differentiation to compute higher-order derivatives of copula generator compositions in a single forward pass. This approach replaces traditional hand-derived methods for calculating mixed partial derivatives, enabling the framework to handle arbitrary nesting structures and per-variable censoring directly.
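The idea of obtaining all higher-order derivatives in one forward pass can be illustrated with truncated Taylor series arithmetic. The sketch below is not ACOPULA's code or API. It propagates Taylor coefficients through a single power primitive and applies it to the Clayton generator psi(t) = (1 + t)^(-1/theta), chosen here purely as a familiar example. The function names are hypothetical.

```python
def taylor_power(x_coeffs, a):
    """Taylor coefficients of y = x**a from those of x (requires x_coeffs[0] != 0).

    Uses the recurrence obtained from x * y' = a * x' * y:
        k * x_0 * y_k = sum_{j=1..k} (a*j - (k - j)) * x_j * y_{k-j}
    """
    order = len(x_coeffs)
    y = [x_coeffs[0] ** a] + [0.0] * (order - 1)
    for k in range(1, order):
        total = sum(
            (a * j - (k - j)) * x_coeffs[j] * y[k - j] for j in range(1, k + 1)
        )
        y[k] = total / (k * x_coeffs[0])
    return y


def clayton_generator_derivatives(t0, theta, max_order):
    """Derivatives of order 0..max_order of psi(t) = (1 + t)**(-1/theta) at t = t0."""
    # Input series for x(t0 + s) = (1 + t0) + s, truncated at max_order.
    x_coeffs = ([1.0 + t0, 1.0] + [0.0] * max_order)[: max_order + 1]
    coeffs = taylor_power(x_coeffs, -1.0 / theta)
    derivs = []
    factorial = 1.0
    for k, c in enumerate(coeffs):
        if k > 0:
            factorial *= k
        derivs.append(c * factorial)  # k-th derivative = k! * k-th Taylor coefficient
    return derivs


derivs = clayton_generator_derivatives(t0=0.0, theta=2.0, max_order=3)
print(derivs)
```

A full Taylor-mode system applies the same kind of coefficient recurrence to every primitive in a composition, which is what removes the need for hand-derived mixed partial derivatives.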
Results
ACOPULA was validated through simulations and real-world applications, including fitting models to 85,229 ICU admissions and an 11-sector hierarchical model of S&P 500 daily returns. The framework successfully handled high dimensions (up to d=98) and demonstrated a significant speed advantage over existing methods, scaling effectively to datasets with up to d=8,000.
Implications
The development of ACOPULA has significant implications for statistical modeling in survival analysis, finance, and clinical research, where hierarchical multivariate data with partial observations are common. Its ability to efficiently compute likelihoods and gradients opens new possibilities for complex data analysis and enhances the interpretability of dependence structures in various applications.
Preisach Attention: A Hysteretic Model of Sequential Memory
Theory
Efficient ML
Time Series
- Introduction of the Preisach Attention Layer (PAL) as a new sequence modeling architecture.
- PAL achieves Turing-completeness with O(1) depth, compared with the O(log n) depth required by standard hard-attention transformers.
- Establishes expressiveness separation between PAL and transformer attention based on rate-independence.
- The extremum stack in PAL acts as a minimal sufficient statistic for input history.
Summary
This paper introduces the Preisach Attention Layer (PAL), a novel architecture for sequence modeling that leverages the classical Preisach hysteresis operator. Unlike traditional softmax attention mechanisms, PAL employs a binary relay operator characterized by learned activation and deactivation thresholds, allowing it to maintain a stack of local extrema as its internal state. The author demonstrates that a single-layer PAL-Transformer is Turing-complete with O(1) depth, contrasting with the O(log n) depth required by standard hard-attention transformers. The paper establishes that PAL and transformer attention are expressively incomparable; PAL can compute historical range statistics efficiently, while transformers excel in random-access retrieval. The extremum stack in PAL serves as a minimal sufficient statistic for input history, providing a natural forgetting mechanism based on significance rather than recency. The proposed architecture is particularly suited for tasks requiring long episodic memory and weak positional dependence, achieving an inference cost of O(n log n) compared to O(n²) for standard attention mechanisms.
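The binary relay at the heart of the classical Preisach operator can be sketched in a few lines. This is the textbook relay with fixed thresholds, not the learned PAL layer. The function and parameter names are illustrative assumptions.

```python
def relay(inputs, on_threshold, off_threshold, state=-1):
    """Classical hysteresis relay.

    Switches to +1 when the input is at or above on_threshold, to -1 when it is
    at or below off_threshold, and otherwise keeps its previous state.
    """
    assert on_threshold > off_threshold
    outputs = []
    for x in inputs:
        if x >= on_threshold:
            state = 1
        elif x <= off_threshold:
            state = -1
        outputs.append(state)
    return outputs


signal = [0.0, 0.6, 0.3, -0.1, 0.3]
outputs = relay(signal, on_threshold=0.5, off_threshold=0.0)

# Rate independence: repeating every sample (a slower traversal of the same
# path) does not change the states the relay passes through.
slow_signal = [x for x in signal for _ in range(3)]
slow_outputs = relay(slow_signal, on_threshold=0.5, off_threshold=0.0)
print(outputs, slow_outputs[::3])
```

The same input value (0.3 here) produces different outputs depending on the path taken to reach it, which is the memory effect the architecture builds on.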
Methodology
The paper defines the Preisach Attention Layer and its multi-head variant, connecting them to the classical Preisach operator. It proves Turing-completeness through simulation of a two-stack pushdown automaton and explores the expressiveness of PAL compared to standard transformers, focusing on rate-independence and the properties of the extremum stack.
Results
The main results include the establishment of Turing-completeness for PAL at O(1) depth, a formal expressiveness separation between PAL and transformers, and the characterization of PAL's function class through its connection to rate-independent operators.
Implications
The findings suggest that PAL could be a more efficient alternative for sequence modeling tasks that require long-term memory and are less sensitive to the exact positioning of inputs, potentially impacting areas such as natural language processing and time series analysis.
Assessing Predictive Models for Fairness Based on Movement Patterns
Theory
- Introduces the concept of assessing fairness in predictive models based on individuals' movement patterns.
- Challenges the traditional assumption of spatial fairness tied to a single geographical location.
- Proposes a method that incorporates multiple locations, visit frequency, and duration into fairness assessments.
- Demonstrates the effectiveness of the approach through experiments on synthetic datasets.
Summary
This paper addresses the issue of fairness in predictive models by extending the concept of spatial fairness to include individuals' movement patterns. Traditional assessments of spatial fairness assume that individuals are tied to a single geographical location, which overlooks the complexities of human mobility. The authors propose a novel approach that associates individuals' movements with multiple geographic regions, taking into account the frequency and duration of visits to these locations. By employing spatial scan statistics, the study evaluates whether predictive models treat individuals unfairly based on their movement patterns. The experimental results demonstrate the effectiveness of this approach in detecting unfairness across various synthetic datasets, highlighting a consistent trade-off in localization performance across different spatial resolutions. This work emphasizes the need to consider movement patterns as a protected attribute in fairness assessments, thereby broadening the scope of fairness in machine learning.
Methodology
The authors developed a method that first associates individuals' movements with relevant geographic regions, considering various spatial partitions. They then apply spatial scan statistics to assess fairness based on these movement patterns, allowing for a nuanced evaluation of predictive models.
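One way to picture the two steps is a Bernoulli scan statistic computed over regions, where each person contributes to every region they visit in proportion to the time spent there. The fractional weighting and all names below are assumptions made for illustration. The paper's exact association scheme and test statistic are not given in this summary.

```python
import math


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_loglik(positives, total):
    """Maximised Bernoulli log-likelihood for `positives` out of `total`."""
    if total <= 0:
        return 0.0
    positives = min(max(positives, 0.0), total)  # guard against rounding drift
    rate = positives / total
    return xlogy(positives, rate) + xlogy(total - positives, 1.0 - rate)


def region_scan_scores(visit_weights, outcomes):
    """Log-likelihood ratio per region (inside vs. outside the region).

    visit_weights[i]: dict region -> share of person i's time there (sums to 1).
    outcomes[i]: the model's binary decision for person i.
    """
    total_mass = float(len(outcomes))
    total_pos = float(sum(outcomes))
    mass, pos = {}, {}
    for weights, y in zip(visit_weights, outcomes):
        for region, w in weights.items():
            mass[region] = mass.get(region, 0.0) + w
            pos[region] = pos.get(region, 0.0) + w * y
    null_ll = bernoulli_loglik(total_pos, total_mass)
    scores = {}
    for region in mass:
        inside = bernoulli_loglik(pos[region], mass[region])
        outside = bernoulli_loglik(total_pos - pos[region], total_mass - mass[region])
        scores[region] = inside + outside - null_ll
    return scores


visit_weights = [
    {"A": 1.0}, {"A": 0.5, "B": 0.5}, {"B": 1.0},
    {"B": 0.5, "C": 0.5}, {"C": 1.0}, {"C": 1.0},
]
outcomes = [0, 0, 1, 1, 1, 1]
scores = region_scan_scores(visit_weights, outcomes)
most_anomalous = max(scores, key=scores.get)
print(most_anomalous, scores)
```

A region whose visitors receive systematically different decisions from everyone else gets a high score. In practice the significance of the top score would be assessed by a permutation or Monte Carlo test, which is omitted here.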
Results
The experimental evaluation showed that the proposed approach effectively detects unfairness related to movement patterns in thousands of synthetic datasets. The results indicated a consistent trade-off in localization performance across different resolutions, confirming the method's robustness in identifying unfair treatment.
Implications
This research has significant implications for the design and assessment of predictive models in various domains, such as finance and law enforcement, where movement patterns may influence decision-making. It encourages the integration of movement data into fairness assessments, promoting more equitable outcomes in machine learning applications.
The Attribution Contract: Feature Attribution for Generative Language Models
NLP
Large Language Models
Interpretability
- Introduces the Attribution Contract to clarify feature attribution claims in generative language models.
- Identifies contract ambiguity as a significant issue in feature attribution for GLMs.
- Highlights the self-attribution fallacy, where attributions to generated tokens are misinterpreted as prompt-level explanations.
- Proposes that feature attribution methods should be assessed as method-contract pairs rather than in isolation.
Summary
This paper addresses the challenges of feature attribution in generative language models (GLMs), where the definition of 'features' is ambiguous due to the nature of these models. In autoregressive models, previously generated tokens serve as both inputs and outputs, complicating the attribution process. Similarly, diffusion models generate outputs through iterative processes rather than fixed predictions. The author introduces the concept of the 'Attribution Contract', which clarifies the conditions under which feature attribution claims are made, including the target output, eligible features, generative processes, and fixed elements. The paper argues that many disagreements in feature attribution stem from unstated contracts rather than the algorithms themselves. Through case studies of autoregressive and diffusion models, it demonstrates when attributions to earlier tokens or intermediate states are informative or misleading. The paper concludes that feature attribution methods should be evaluated as pairs of methods and contracts, emphasizing the importance of context in interpreting attribution scores.
Methodology
The paper employs a conceptual framework to analyze feature attribution in generative language models, using case studies of autoregressive and diffusion models to illustrate the implications of different attribution settings and contracts.
Results
The analysis reveals that the same attribution method can yield different interpretations based on the specified contract, demonstrating the need for clarity in defining what is being explained and how. It shows that attributions can be misleading if the explanatory context is not properly defined.
Implications
The findings have significant implications for practitioners using generative language models in various applications, such as debugging, safety auditing, and understanding model behavior. By clarifying the attribution process, the paper aims to improve the reliability of feature attribution in these contexts.
Learning partially observed systems with neural Hamiltonian ordinary differential equations
Time Series
Theory
Robotics
- NHODE effectively learns dynamics of partially observed systems by combining HNNs and neural ODEs.
- The framework allows for training with loss defined only on observed variables, enabling inference of latent states.
- Incorporating physical structure improves prediction accuracy and stability in complex dynamical systems.
- The method is evaluated on various systems, demonstrating robustness in challenging scenarios.
Summary
The paper introduces a novel framework called neural Hamiltonian ordinary differential equations (NHODE) for learning partially observed dynamical systems. Traditional physics-informed models often require complete access to system states, which limits their applicability in scenarios where some variables are unobserved. NHODE combines Hamiltonian neural networks (HNNs) with neural ordinary differential equations (neural ODEs) to effectively learn the dynamics of systems with missing data. The framework enforces energy conservation through its Hamiltonian structure while allowing flexible training that focuses on observed variables. The authors demonstrate the effectiveness of NHODE on various systems, including linear and nonlinear mass-spring systems and the chaotic three-body problem. The results indicate that incorporating physical structure enhances prediction accuracy and stability, particularly in challenging scenarios where purely data-driven approaches struggle. The study highlights the importance of embedding physical knowledge in machine learning models to improve their ability to infer latent dynamics in partially observed systems.
Methodology
The NHODE framework integrates Hamiltonian neural networks to enforce energy conservation and utilizes neural ordinary differential equations for flexible training. The training process involves rolling out predictions from initial conditions, allowing the model to learn latent dynamics while evaluating loss only on observed state components.
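The training setup can be illustrated on a mass-spring system with a loss that only sees positions. In this sketch the Hamiltonian is a one-parameter quadratic instead of a neural network, the integrator is a leapfrog scheme, and the fit is a grid search. All three are simplifying assumptions, not the paper's implementation.

```python
def rollout(q0, p0, stiffness, mass, dt, steps):
    """Leapfrog rollout of H(q, p) = p**2 / (2 * mass) + stiffness * q**2 / 2.

    Hamilton's equations: dq/dt = dH/dp = p / mass, dp/dt = -dH/dq = -stiffness * q.
    Returns the position trajectory only.
    """
    q, p = q0, p0
    qs = [q]
    for _ in range(steps):
        p -= 0.5 * dt * stiffness * q  # half step in momentum
        q += dt * p / mass             # full step in position
        p -= 0.5 * dt * stiffness * q  # half step in momentum
        qs.append(q)
    return qs


def observed_loss(stiffness, observed_q, q0, p0, mass, dt):
    """Mean squared error on positions only; momentum is never observed."""
    predicted = rollout(q0, p0, stiffness, mass, dt, steps=len(observed_q) - 1)
    return sum((a - b) ** 2 for a, b in zip(predicted, observed_q)) / len(observed_q)


# Synthetic "observations" of q from a system with true stiffness 2.0.
observed_q = rollout(q0=1.0, p0=0.0, stiffness=2.0, mass=1.0, dt=0.05, steps=100)
candidates = [1.0, 1.5, 2.0, 2.5, 3.0]
losses = {
    k: observed_loss(k, observed_q, q0=1.0, p0=0.0, mass=1.0, dt=0.05)
    for k in candidates
}
best = min(losses, key=losses.get)
print(best, losses)
```

Because the rollout carries the momentum internally, the latent variable is recovered as a by-product of fitting the observed positions, which mirrors the role of the Hamiltonian structure described above.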
Results
The NHODE framework successfully captures both observed and latent dynamics across various test cases, showing improved accuracy and stability compared to purely data-driven models. The results indicate that prediction performance improves significantly as more physical structure is embedded in the model, even in complex and chaotic scenarios.
Implications
The NHODE framework has potential applications in fields where dynamical systems are partially observed, such as robotics, physics simulations, and any domain requiring accurate modeling of complex systems with missing data. By embedding physical knowledge into machine learning models, practitioners can achieve better generalization and robustness in predictions.
Coupling-Robust Accuracy in Multiphysics Physics Informed Neural Networks via Kronecker-Preconditioned Optimization
Optimization
Theory
- The spectral radius of the standard NTK increases with the square of the coupling strength in linearly coupled systems.
- Block-diagonal Gauss-Newton preconditioning can stabilize the learning process by bounding the spectral radius of the preconditioned NTK.
- SOAP+GN optimizer maintains coupling-robust accuracy across various multiphysics systems, outperforming traditional optimizers.
- The method is validated through 234 experiments, demonstrating its effectiveness in both 1D and 2D systems.
Summary
This paper addresses the accuracy degradation of Physics-Informed Neural Networks (PINNs) when applied to coupled multiphysics systems, particularly as the inter-equation coupling strength increases. The authors provide a theoretical framework using neural tangent kernel (NTK) analysis, demonstrating that the spectral radius of the standard NTK grows with the square of the coupling strength, which limits the stable learning rate. They propose a block-diagonal Gauss-Newton (GN) preconditioning approach that bounds the spectral radius of the preconditioned NTK, making it independent of the coupling strength. The effectiveness of their method, termed SOAP+GN, is validated through extensive numerical experiments across various coupled systems, including 1D and 2D benchmarks. The results show that SOAP+GN maintains a coupling degradation of less than 1.1 times, significantly outperforming traditional methods like Adam+GN, which exhibited degradation exceeding 100 times. This work not only provides a theoretical basis for the observed phenomena but also demonstrates practical solutions for enhancing the robustness of PINNs in complex multiphysics scenarios.
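The link between the NTK spectral radius and the stable learning rate follows from the standard linearized analysis of gradient descent. As a sketch, with r_t denoting the residual vector, K the (positive semi-definite) NTK, eta the learning rate, and kappa the coupling strength (symbols chosen here, not taken from the paper):

```latex
\begin{align*}
r_{t+1} &= (I - \eta K)\, r_t, \\
|1 - \eta \lambda_i| < 1 \ \text{for all eigenvalues } \lambda_i > 0 \ &\iff\ 0 < \eta < \frac{2}{\rho(K)}, \\
\rho(K) \propto \kappa^2 \ &\implies\ \eta_{\max} \propto \frac{1}{\kappa^2}.
\end{align*}
```

Under this reading, a preconditioner that keeps the spectral radius bounded independently of kappa also keeps the admissible learning rate from shrinking as the coupling grows.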
Methodology
The authors utilize neural tangent kernel analysis to derive theoretical insights into the behavior of PINNs under varying coupling strengths. They implement a block-diagonal Gauss-Newton preconditioning technique to stabilize the learning process and combine it with the SOAP optimizer. The performance of the proposed method is evaluated through a factorial comparison across multiple systems and balancing schemes, with a focus on maintaining accuracy in the presence of strong coupling.
Results
The SOAP+GN method demonstrated a coupling degradation of less than 1.1 times across all tested regimes, while traditional methods like Adam+GN showed degradation exceeding 100 times. The method was also successfully applied to a complex 2D, 6-PDE electroosmotic flow system, where it outperformed existing approaches that relied on simplified physics.
Implications
This research has significant implications for the application of PINNs in real-world multiphysics problems, particularly in fields such as fluid dynamics, material science, and electrokinetics. The findings suggest that with appropriate preconditioning techniques, PINNs can be made more robust and accurate, enabling their use in more complex and coupled systems.
Multi-Gate Residuals
NLP
Large Language Models
Efficient ML
- MGR stabilizes activation scales without incurring additional communication overhead.
- The architecture combines features of existing methods to improve efficiency and performance.
- Empirical results show tangible performance improvements over traditional architectures.
- MGR addresses the challenges of information dilution and unbounded magnitude drift in deep networks.
Summary
The paper introduces Multi-Gate Residuals (MGR), a novel approach designed to address the limitations of existing residual connection methods in deep neural networks, particularly in the context of large-scale training and deployment. Traditional methods, such as Attention Residuals, have been effective in stabilizing activation scales but incur significant communication overhead. MGR proposes a simplified architecture that employs a scoring and gating mechanism to maintain multi-stream context while utilizing Attention Pooling to extract hidden states from these streams. This approach mitigates the issues of unbounded activation growth and information dilution across layers, which are prevalent in deeper networks. The authors demonstrate through empirical experiments that MGR not only simplifies the architecture but also enhances model performance compared to existing methods, making it practical for large-scale applications.
Methodology
The methodology involves a straightforward scoring and gating mechanism to manage multi-stream residuals, along with Attention Pooling to extract relevant hidden states. This design simplifies the architecture while retaining the advantages of advanced residual connections, effectively balancing performance and efficiency.
Results
Empirical experiments indicate that MGR achieves significant performance improvements over existing architectures, demonstrating its effectiveness in large-scale training scenarios without the drawbacks of increased communication and memory overhead.
Implications
The proposed MGR architecture has the potential to enhance the scalability and efficiency of deep learning models, particularly in resource-constrained environments, making it suitable for various applications in natural language processing and other domains requiring deep neural networks.
Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization
Theory
- Exact certification can be exponentially hard even with minimal overparametrization.
- Adding a single gate to threshold circuits of depth ≥2 can exponentially increase certification size.
- Log-precision Transformers exhibit similar certification hardness with slight architectural changes.
- Approximate certification still requires large certificates despite allowing polynomial mistakes.
Summary
This paper investigates the challenges of exact certification in machine learning, particularly focusing on neural networks such as circuits and Transformers. The authors define exact certification as the process of determining the minimum number of labeled examples required to confirm that a learned hypothesis matches the target function. They demonstrate that even minimal overparametrization can lead to exponential difficulty in certification across various hypothesis classes. Specifically, they show that for threshold circuits of depth ≥2, adding just one extra gate can result in certificate sizes that grow exponentially with the input dimension. A similar hardness is observed in log-precision Transformers, where slight overparametrization can complicate certification. The authors also explore approximate certification, revealing that allowing a limited number of mistakes still necessitates large certificates. Empirical studies on circuits and trained Transformers for binary addition illustrate these theoretical findings, showing that imperfect models can evade detection by large uniformly sampled certificate candidates. Overall, the work highlights the sensitivity of certification to the surrounding hypothesis class and the implications for ensuring reliable reasoning in neural networks.
Methodology
The authors formalize the certification problem using a teaching set perspective, analyzing the minimum number of labeled examples needed to certify hypotheses in various classes. They provide theoretical results for threshold circuits and Transformers, and conduct empirical evaluations on circuits and trained models for binary addition to validate their findings.
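The teaching-set view can be made concrete with a brute-force search on a toy hypothesis class. The example below uses two-bit Boolean functions rather than threshold circuits or Transformers, and the function name is hypothetical. It only illustrates how enlarging the class can enlarge the smallest certificate.

```python
from itertools import combinations, product


def min_certificate(target, hypothesis_class, inputs):
    """Smallest set of inputs on which every other hypothesis in the class
    disagrees with the target at least once. Hypotheses are dicts mapping
    input -> label. Exhaustive search, so only suitable for tiny classes."""
    rivals = [h for h in hypothesis_class if h != target]
    for size in range(len(inputs) + 1):
        for subset in combinations(inputs, size):
            if all(any(h[x] != target[x] for x in subset) for h in rivals):
                return list(subset)
    return None  # not reached: the full input set separates distinct truth tables


inputs = list(product((0, 1), repeat=2))
AND = {x: x[0] & x[1] for x in inputs}
OR = {x: x[0] | x[1] for x in inputs}
XOR = {x: x[0] ^ x[1] for x in inputs}
ZERO = {x: 0 for x in inputs}

cert_small = min_certificate(AND, [AND, OR, XOR], inputs)
cert_large = min_certificate(AND, [AND, OR, XOR, ZERO], inputs)
print(len(cert_small), len(cert_large))
```

Adding a single rival hypothesis (the constant-zero function) forces the certificate for AND to include an extra example, a small-scale analogue of the sensitivity to the surrounding hypothesis class that the paper studies.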
Results
The study reveals that even minimal increases in model capacity can lead to exponential certification difficulties. For threshold circuits, the addition of one gate results in exponential certificate sizes. In Transformers, slight overparametrization similarly complicates certification. The empirical analysis shows that many incorrect circuits can remain consistent with sampled certificate candidates, indicating challenges in detecting non-exact models.
Implications
The findings underscore the importance of exact certification in neural networks, particularly for applications requiring reliable reasoning and algorithmic behavior. The results suggest that practitioners must be cautious about model overparametrization and its effects on certification, which could impact the deployment of neural networks in critical tasks.
RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases
Graph Learning
- RelPrism is a multi-faceted self-supervised learning framework tailored for relational databases.
- It constructs intrinsic, relational, and hybrid attributes to capture diverse information for predictive tasks.
- The framework utilizes multi-granularity clustering to enhance representation learning.
- Experimental results show a 4.15% improvement in ROC-AUC for classification tasks and a 10.75% reduction in MAE for regression tasks compared to existing methods.
Summary
The paper introduces RelPrism, a novel multi-faceted self-supervised learning framework designed specifically for relational databases (RDBs). RDBs are crucial for various predictive tasks, yet existing self-supervised learning (SSL) methods often fail to capture the diverse and multi-faceted nature of information required for effective representation learning. RelPrism addresses this challenge by constructing intrinsic, relational, and hybrid attributes from different perspectives and applying multi-granularity clustering to create pseudo-task pools. This approach allows for comprehensive exposure to various perspectives and granularities during pre-training, enhancing the adaptability of learned representations for downstream tasks. The authors validate the effectiveness of RelPrism through experiments on 14 tasks across 5 real-world datasets, demonstrating significant improvements in performance metrics such as ROC-AUC and MAE compared to state-of-the-art baselines.
Methodology
RelPrism employs a multi-faceted approach to self-supervised learning by generating intrinsic, relational, and hybrid attributes from relational databases. It utilizes multi-granularity clustering to form pseudo-task pools, which are then used for pre-training the model. This method ensures that the learned representations are exposed to a wide range of information across different perspectives and granularities, facilitating better adaptation for downstream tasks.
Results
The experimental evaluation of RelPrism on 14 tasks across 5 real-world datasets revealed that it outperforms state-of-the-art baselines, achieving an improvement of 4.15% in ROC-AUC for classification tasks and a reduction of 10.75% in MAE for regression tasks, demonstrating its effectiveness in enhancing predictive performance.
Implications
The development of RelPrism has significant implications for enhancing predictive modeling in relational databases, particularly in domains such as finance and healthcare where multi-faceted data is prevalent. Its ability to effectively leverage diverse information can lead to more accurate predictions and better decision-making processes in various applications.
Robust OT-Guided Generative Residual Domain Adaptation for Bike-Sharing Demand Prediction under Temporal Domain Shift
Time Series
Optimization
Generative Models
- Introduces Gen-ROTDA, a robust framework for bike-sharing demand prediction under temporal shifts.
- Focuses on residual domain adaptation rather than raw demand transfer for improved prediction accuracy.
- Utilizes robust optimal transport to enhance stability against abnormal data records.
- Demonstrates superior performance in mean absolute error compared to various baseline methods.
Summary
This paper addresses the challenge of bike-sharing demand prediction in the context of temporal domain shifts, where travel patterns evolve over time. Specifically, it focuses on predicting demand for Citi Bike from 2021 to 2026, proposing a novel framework called Gen-ROTDA (Generative Robust Optimal Transport-guided Residual Domain Adaptation). The framework operates by decomposing demand into stable anchor components and residual components, allowing for more effective adaptation of the residuals rather than the raw demand. Gen-ROTDA employs a deterministic label-preserving residual feature generator and incorporates robust optimal transport (OT) techniques to mitigate the influence of abnormal or poorly matched samples during training. The methodology is evaluated against various baseline methods, including source-only and target-only approaches, as well as other OT-based adaptations. The results demonstrate that Gen-ROTDA achieves the lowest mean absolute error (MAE) in the primary task of predicting demand from 2025 to 2026 and performs best on average across multiple years. Additionally, it shows greater stability under conditions of noisy target data compared to non-robust OT variants, highlighting the effectiveness of robust transport in this domain adaptation context.
Methodology
The Gen-ROTDA framework decomposes bike-sharing demand into anchor and residual components. A deterministic label-preserving generator adapts source residual features towards the target domain, and robust optimal transport alignment then reduces the influence of abnormal or poorly matched samples. The final demand prediction combines the anchor prediction with the adapted residual prediction.
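As a rough illustration of the anchor-plus-residual decomposition, the sketch below assumes the anchor is simply the mean historical demand per time slot. That choice, and the helper names `fit_anchor`, `residuals`, and `combine`, are hypothetical and not taken from the paper; the generator and the robust OT alignment are not shown.

```python
from collections import defaultdict
from statistics import mean


def fit_anchor(slots, demand):
    """Assumed anchor: mean historical demand for each time slot (e.g. hour of week)."""
    by_slot = defaultdict(list)
    for slot, value in zip(slots, demand):
        by_slot[slot].append(value)
    return {slot: mean(values) for slot, values in by_slot.items()}


def residuals(slots, demand, anchor):
    """Residual = observed demand minus the anchor for that slot."""
    return [value - anchor[slot] for slot, value in zip(slots, demand)]


def combine(slots, anchor, residual_pred):
    """Final prediction = anchor prediction + (adapted) residual prediction."""
    return [anchor[slot] + r for slot, r in zip(slots, residual_pred)]


# Toy data: two time slots observed twice each.
slots = [0, 1, 0, 1]
demand = [10.0, 20.0, 14.0, 24.0]

anchor = fit_anchor(slots, demand)
res = residuals(slots, demand, anchor)
recon = combine(slots, anchor, res)
```

In this toy case the residuals are used unchanged, so `recon` reproduces `demand`; in a domain-adaptation setting the residual predictions would instead come from a model adapted to the target period.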
Results
Gen-ROTDA achieved the lowest MAE on the primary task of predicting bike-sharing demand from 2025 to 2026 and outperformed other OT-family methods on average across multiple years. It also demonstrated increased stability under conditions of noisy target data compared to non-robust OT methods.
Implications
The findings suggest that robust transport methods can significantly enhance the reliability of domain adaptation techniques in time-sensitive applications like bike-sharing demand prediction. This approach could be applied to other domains facing similar temporal shifts, improving predictive accuracy and operational efficiency.
Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws
NLP
Large Language Models
Optimization
- Optimizers significantly affect the spectral scaling laws of Transformer architectures.
- AdamW and Muon optimizers yield markedly different scaling behaviors, particularly in rare-token representations.
- Matched validation loss does not guarantee similar representation structures between different optimizers.
- Optimizer-induced spectral shifts can surpass architectural effects in shaping representation capacity.
Summary
This paper investigates the impact of different optimizers on the spectral scaling laws of feed-forward networks (FFNs) within Transformer architectures. While traditional scaling laws have focused on model size, data, and compute, this study emphasizes the optimizer's role in shaping the representation capacity of models. The authors demonstrate that the same architecture can exhibit significantly different spectral scaling behaviors depending on the optimizer used. Specifically, they compare AdamW and Muon optimizers, revealing that AdamW shows weak hard-rank scaling on rare-token representations, while Muon achieves near-linear scaling. This indicates that the optimizer not only influences convergence speed but also the structure and effectiveness of the learned representations. The findings suggest that optimization should be considered a critical factor in representation scaling, advocating for a co-design approach that integrates both optimizer and architecture.
Methodology
The authors analyze the eigenspectra of FFN representations by measuring soft and hard spectral ranks across different optimizers (AdamW, Muon, NorMuon, and Dion variants) while keeping the architecture and width schedule fixed. They assess how effectively added FFN width translates into utilized spectral capacity and compare the spectral scaling across various token frequency regimes.
Results
The results indicate that the choice of optimizer leads to significant differences in spectral scaling laws. AdamW exhibits weak hard-rank scaling (scaling exponent β=0.29), while Muon achieves near-linear scaling (β=0.82). The study also finds that the hard-soft rank asymmetry is optimizer-dependent, with AdamW showing the largest asymmetry. Furthermore, the optimizer affects how representation capacity is allocated across token frequencies, with AdamW struggling more with rare-token representations than Muon.
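One way to read an exponent such as β is as the slope of a log-log fit of spectral rank against FFN width, i.e. rank ∝ width^β. The sketch below illustrates that reading only; the helper name and the synthetic data are hypothetical, and the paper's own fitting procedure may differ.

```python
import math


def fit_power_law_exponent(widths, ranks):
    """Least-squares slope of log(rank) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic ranks generated from a known exponent (not data from the paper).
widths = [256, 512, 1024, 2048]
ranks = [3.0 * w ** 0.82 for w in widths]

beta = fit_power_law_exponent(widths, ranks)
```

An exponent near 1 would mean that added width is converted almost fully into utilized rank, while a small exponent means most of the added width goes unused.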
Implications
These findings suggest that the choice of optimizer is crucial in determining the effectiveness of model training and representation learning. This could lead to more informed practices in model design, where optimizers are selected not just for efficiency but also for their impact on representation structure. The results advocate for a co-design approach that integrates optimizers and architectures to enhance model performance.
Value-Gradient Hypothesis of RL for LLMs
Large Language Models
Reinforcement Learning
Theory
- Critic-free RL methods like PPO and GRPO can effectively enhance LLMs despite theoretical concerns about long-horizon credit assignment.
- The actor update in these methods behaves like a value-gradient signal, allowing for effective credit transport.
- Empirical costates in discrete transformers approximate theoretical value gradients, with controlled error margins.
- A decomposition of RL impact into value-gradient signals and reward headroom provides a practical criterion for RL effectiveness.
Summary
This paper explores the effectiveness of reinforcement learning (RL) in enhancing pretrained large language models (LLMs) without the use of critics, specifically focusing on methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). The authors propose a value-gradient perspective to explain why these critic-free methods perform well despite classical RL theory suggesting they should struggle with long-horizon credit assignment. They demonstrate that the actor updates in these methods can be interpreted as carrying a value-gradient-like signal, which is propagated through differentiable rollouts. The paper further establishes that in discrete transformer policies, the empirical costates derived from autodifferentiation approximate the theoretical value gradients, with errors influenced by sampling gaps and policy entropy. This leads to a decomposition of RL's impact into usable value-gradient signals and reachable reward headroom, providing a criterion for identifying when RL is most beneficial during pretraining. The findings offer insights into checkpoint selection for pretraining, suggesting that RL is most effective when checkpoints are near the value-gradient regime yet far enough from saturation to allow for reward-enhancing trajectories.
Methodology
The authors utilize a theoretical framework that combines differentiable rollouts with additive-noise parameterization to analyze actor updates in critic-free RL. They derive relationships between empirical costates and value gradients in discrete transformer architectures, employing autodifferentiation techniques to validate their claims.
Results
The study confirms that the actor updates in PPO and GRPO are value-gradient-like in expectation. It also shows that the empirical costates computed through autodifferentiation closely approximate the theoretical value gradients, with errors manageable through sampling strategies. The proposed RL-impact decomposition effectively predicts when RL will yield the most significant improvements during pretraining.
Implications
The findings suggest that RL can be strategically applied to LLMs to enhance their performance, particularly by optimizing checkpoint selection based on the value-gradient framework. This could lead to more efficient training processes and improved model capabilities in various NLP tasks.
Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection
Theory
- Random feature selection is proposed as a necessary baseline for evaluating unsupervised feature selection methods.
- Many state-of-the-art methods are shown to perform worse than random feature selection.
- The absence of a proper baseline complicates the assessment of the value added by new feature selection methods.
- The paper provides empirical evidence supporting the need for consistent evaluation standards in unsupervised feature selection.
Summary
This paper addresses the critical issue of evaluating unsupervised feature selection methods by proposing the use of random feature selection as a baseline. The authors argue that many existing state-of-the-art unsupervised feature selection methods are often evaluated without a proper baseline, making it difficult to assess their true effectiveness. Through empirical analysis, they demonstrate that several of these methods can perform worse than random feature selection in terms of both performance and efficiency. The paper emphasizes the necessity of establishing random feature selection as a baseline to ensure that new methods provide meaningful improvements over trivial approaches. The authors provide a literature review on unsupervised feature selection, discuss the methodology for using random selection as a baseline, and present experimental results comparing various methods against this baseline, ultimately highlighting the shortcomings of current approaches in the field.
Methodology
The authors conducted empirical evaluations of various unsupervised feature selection methods against a baseline of random feature selection. They reviewed existing literature, outlined the methodology for establishing random selection as a baseline, and detailed the experimental setup used to compare the performance of different methods.
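A random-selection baseline of this kind is cheap to set up. The following is a minimal sketch, assuming the baseline is reported as the mean and standard deviation of a downstream score over repeated random subsets; the function name, the number of repeats, and the toy scorer are illustrative assumptions rather than the paper's protocol.

```python
import random
from statistics import mean, stdev


def random_selection_baseline(n_features, k, score_fn, n_repeats=30, seed=0):
    """Score `n_repeats` uniformly random subsets of k features."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        subset = sorted(rng.sample(range(n_features), k))
        scores.append(score_fn(subset))
    return mean(scores), stdev(scores)


def toy_score(subset):
    """Toy scorer: fraction of selected features that are informative (indices 0-4)."""
    return sum(1 for j in subset if j < 5) / len(subset)


base_mean, base_std = random_selection_baseline(20, 5, toy_score)
```

A proposed method's score on the same data can then be compared against `base_mean` plus or minus `base_std`; a method that lands inside or below that band has not shown an advantage over picking features at random.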
Results
The experimental results indicated that several established unsupervised feature selection methods did not outperform random feature selection, raising concerns about their effectiveness and efficiency. This finding underscores the importance of using a robust baseline for evaluation.
Implications
The findings suggest that researchers in the field of unsupervised feature selection should adopt random feature selection as a baseline to ensure that new methods demonstrate genuine improvements. This could lead to more reliable evaluations and advancements in feature selection techniques.
No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation
Theory
Time Series
- Climate emulation is fundamentally an out-of-distribution prediction task.
- Seasonal variation can effectively serve as a proxy for long-term climate shifts.
- Current hybrid-ML emulators show significant performance degradation under realistic distribution shifts.
- A novel evaluation framework is proposed that does not require additional data collection.
Summary
This paper addresses the critical challenge of climate emulation in the context of out-of-distribution (OOD) generalization, highlighting the inadequacies of current machine learning (ML) methods when faced with the distribution shifts caused by climate change. The authors establish that while existing ML emulators perform well within the distribution of present climate data, their reliability for future projections is uncertain. They confirm that climate change leads to significant shifts in atmospheric state distributions, rendering standard evaluation methods insufficient. By using seasonal variation as a proxy for long-term climate shifts, the authors introduce a novel evaluation framework that allows for rigorous testing of emulator robustness without the need for synthetic perturbations. Their systematic analysis reveals that state-of-the-art hybrid-ML emulators exhibit notable performance degradation under these realistic shifts. The paper concludes by advocating for compositional generalization as a pathway to enhance robustness, demonstrating that physically motivated model decompositions can improve OOD performance with minimal trade-offs against in-distribution accuracy.
Methodology
The authors analyze 40 years of observation-constrained reanalysis data to characterize climate emulation as an OOD task. They empirically validate the use of seasonal variation as a proxy for long-term climate shifts and introduce a zero-overhead evaluation framework to assess emulator robustness. The performance of existing hybrid-ML emulators is systematically evaluated under realistic distribution shifts, and the impact of physically motivated model decompositions on OOD performance is examined.
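The idea of using seasons as a stand-in for distribution shift can be sketched as a month-based holdout: train on some months and test on months the model never saw. The specific months below and the helper name `seasonal_split` are hypothetical; the summary does not state which split the authors use.

```python
def seasonal_split(months, train_months, test_months):
    """Return sample indices for training and for the held-out season."""
    if set(train_months) & set(test_months):
        raise ValueError("train and test months must not overlap")
    train_idx = [i for i, m in enumerate(months) if m in train_months]
    test_idx = [i for i, m in enumerate(months) if m in test_months]
    return train_idx, test_idx


# Month (1-12) of each sample in a toy dataset.
months = [1, 4, 7, 10, 12, 6, 8, 2]

# Assumed split: hold out June-August entirely.
train_idx, test_idx = seasonal_split(
    months,
    train_months={9, 10, 11, 12, 1, 2, 3, 4, 5},
    test_months={6, 7, 8},
)
```

Because the split reuses existing samples, an evaluation of this kind needs no extra data collection or synthetic perturbation.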
Results
The study finds that current state-of-the-art hybrid-ML emulators significantly degrade in performance when subjected to realistic seasonal distribution shifts. The introduction of physically motivated decompositions leads to substantial improvements in OOD performance, indicating a viable path towards developing more robust climate emulators.
Implications
The findings suggest that improving the robustness of climate emulators is essential for reliable climate projections, which are critical for climate policy and risk management. The proposed evaluation framework and emphasis on compositional generalization could inform future research and development in climate modeling and machine learning applications in environmental science.
Valid and Expressive Copulas for Irregular Multivariate Time Series
Time Series
- Introduces CoPFITi, a copula model for irregular multivariate time series (IMTS), claimed to be the first that is marginalization-consistent by design.
- Ensures marginalization consistency by decoupling marginal distributions from the dependency structure.
- Achieves better joint likelihoods than non-copula baselines and matches or exceeds the earlier copula model TACTiS-2.
- MargFlow, a model for univariate marginals, achieves the best marginal likelihood in evaluations.
Summary
This paper introduces CoPFITi, a novel copula model designed for probabilistic forecasting of irregular multivariate time series (IMTS). The authors highlight the challenges posed by IMTS, such as asynchronous observations and the need for marginalization consistency in predictions. CoPFITi combines the expressivity of normalizing flows for univariate marginals with the flexibility of a Gaussian Mixture Copula for the joint dependency structure. The model is constructed to ensure that predictions over any subset of variables agree with those obtained from larger joint predictions, thus maintaining marginalization consistency. The paper demonstrates that copula-based approaches outperform traditional methods that fit full joint distributions directly. CoPFITi is claimed to be the first copula-based framework for IMTS that is marginalization consistent by design, achieving state-of-the-art performance in joint IMTS density modeling.
Methodology
The authors developed CoPFITi by constructing the dependency structure in a latent space using a Gaussian mixture model, while independently training univariate marginals with a model called MargFlow, which utilizes Deep Sigmoidal Flows. This approach allows for the separation of marginal and joint modeling, ensuring that the model adheres to the necessary properties of copulas.
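The separation of marginals from dependency structure rests on the standard copula factorization (Sklar's theorem), restated below in generic notation. The symbols are not taken from the paper, and the last line is one common way to write marginalization consistency for a subset S of variables with complement S̄.

```latex
\begin{align*}
% Sklar's theorem: joint CDF F, marginal CDFs F_i, copula C
F(x_1, \dots, x_d) &= C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr) \\
% Density form: copula density c, marginal densities f_i
f(x_1, \dots, x_d) &= c\bigl(F_1(x_1), \dots, F_d(x_d)\bigr) \prod_{i=1}^{d} f_i(x_i) \\
% Marginalization consistency: the model for a subset S equals the marginal of the joint model
f_S(x_S) &= \int f(x_S, x_{\bar{S}}) \, \mathrm{d}x_{\bar{S}}
\end{align*}
```

Under this factorization the univariate marginals can be trained on their own, and the copula only has to describe how the variables depend on one another.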
Results
CoPFITi achieved significantly better joint likelihoods compared to non-copula baselines and performed on par with or better than the existing copula model TACTiS-2, while maintaining marginalization consistency. The evaluation was conducted on established benchmark tasks across four datasets, demonstrating the effectiveness of the proposed model.
Implications
The development of CoPFITi has significant implications for fields that rely on accurate probabilistic forecasting of irregular multivariate time series, such as healthcare, climate science, and sensor networks. The ability to provide coherent probabilistic forecasts can enhance decision-making processes in these domains.
AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems
Efficient ML
Large Language Models
- AutoMCU shifts the focus from proxy-driven hardware-aware search to a feasibility-first approach, prioritizing backend-verified deployability.
- The system integrates hardware-in-the-loop mechanisms to eliminate infeasible architecture candidates before training.
- AutoMCU reduces the customization time for neural networks on MCUs to 1-2 hours, compared with hundreds of GPU hours for existing HW-NAS methods.
- Real-device deployments confirm the practical applicability of the proposed system for edge intelligence applications.
Summary
The paper introduces AutoMCU, a novel system designed to automate the customization of neural networks for deployment on microcontroller units (MCUs), addressing the challenges posed by strict resource constraints. Traditional methods often rely on proxy metrics and involve high search costs, leading to inefficiencies in model deployment. AutoMCU employs a feasibility-first approach, utilizing a large language model (LLM) within a multi-agent system to iteratively generate and refine neural network architectures based on natural language task requirements and hardware specifications. The system incorporates hardware-in-the-loop architecture generation to filter out infeasible designs early in the process and employs a state-isolated multi-agent scheduling mechanism to coordinate the stages of proposal, training, evaluation, and deployment. Experimental results demonstrate that AutoMCU achieves competitive accuracy on datasets like CIFAR-10 and CIFAR-100 while significantly reducing customization time to 1-2 hours, compared to hundreds of GPU hours required by existing hardware-aware neural architecture search (HW-NAS) methods. The practical applicability of AutoMCU is validated through real-device deployments on STM32 microcontrollers.
Methodology
AutoMCU utilizes a large language model (LLM) to generate structured architecture candidates based on task requirements and hardware constraints. It employs a multi-agent system for orchestrating the stages of architecture proposal, training, evaluation, and deployment, ensuring that infeasible designs are filtered out early through hardware-in-the-loop analysis.
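The feasibility-first idea can be sketched as a filter that runs before any training. In the sketch below the resource figures are plain numbers supplied by hand, which is an assumption standing in for the backend-verified, hardware-in-the-loop measurements the paper describes; the class names and the budget values are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    flash_bytes: int  # assumed estimate of weights plus code footprint
    ram_bytes: int    # assumed estimate of peak activation memory


@dataclass(frozen=True)
class Budget:
    flash_bytes: int
    ram_bytes: int


def is_feasible(candidate, budget):
    """A candidate is deployable only if it fits both memory budgets."""
    return (candidate.flash_bytes <= budget.flash_bytes
            and candidate.ram_bytes <= budget.ram_bytes)


def filter_before_training(candidates, budget):
    """Drop infeasible architectures so no training time is spent on them."""
    return [c for c in candidates if is_feasible(c, budget)]


# Hypothetical device budget: 1 MiB flash, 256 KiB RAM.
budget = Budget(flash_bytes=1024 * 1024, ram_bytes=256 * 1024)
candidates = [
    Candidate("small", flash_bytes=400_000, ram_bytes=120_000),
    Candidate("wide", flash_bytes=900_000, ram_bytes=300_000),
    Candidate("deep", flash_bytes=1_500_000, ram_bytes=200_000),
]

survivors = filter_before_training(candidates, budget)
```

Only the surviving candidates would move on to the training and evaluation stages.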
Results
Experiments on CIFAR-10 and CIFAR-100 show that AutoMCU achieves competitive accuracy while reducing the customization time to approximately 1-2 hours, compared to hundreds of GPU hours for traditional HW-NAS methods. The effectiveness and stability of AutoMCU are further validated through comparisons with existing methods like ColabNAS and GENIUS on NAS-Bench-201.
Implications
The development of AutoMCU has significant implications for deploying neural networks on resource-constrained devices, particularly in IoT applications. It streamlines the workflow for model customization, making it more accessible for developers and enhancing the feasibility of edge intelligence solutions.
Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization
NLP
Large Language Models
Optimization
- Text-to-optimization requires both modeling and binding capabilities, with binding being the primary bottleneck.
- Text2Opt-Bench is introduced as a scalable benchmark for evaluating text-to-optimization models across diverse problem categories.
- The BIND method binds data programmatically at inference time, raising accuracy from 59.1% to 82.4% for GPT-5-Nano and from 86.2% to 95.8% for GPT-5.
- Training binding-specific models yields better results than end-to-end supervised fine-tuning or reinforcement learning.
Summary
This paper addresses the challenges of text-to-optimization, which requires two distinct capabilities: modeling the correct optimization structure and binding the problem data to the model. The authors introduce Text2Opt-Bench, a scalable benchmark of solver-verified optimization problems across 12 categories, revealing that existing models struggle with binding as instance data increases, leading to a phenomenon termed the 'effective binding limit.' To mitigate this issue, they propose BIND, an inference-time approach that externalizes numeric data to structured files, allowing models to bind data programmatically. This method significantly improves the accuracy of models like GPT-5-Nano and GPT-5, demonstrating that binding-aware inference is crucial for performance. Furthermore, the authors explore training binding-specific models, finding that supervised fine-tuning outperforms reinforcement learning in this context. The study concludes that decomposing training by binding yields stronger, more efficient models compared to traditional end-to-end approaches.
Methodology
The authors developed Text2Opt-Bench, a benchmark of solver-verified optimization problems, and evaluated over 10 models from various families. They introduced BIND, which externalizes numeric data for programmatic binding, and conducted experiments to compare the performance of binding-aware inference against traditional methods. Additionally, they explored the effectiveness of training models specifically for binding tasks.
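The distinction between modeling and binding can be illustrated with a toy knapsack problem: the model code refers to data only by key, and the numbers live in a structured document. This is a sketch of the general idea, not of BIND itself; the JSON schema, the inline string standing in for a data file, and the brute-force solver are all assumptions made for the example.

```python
import json
from itertools import combinations

# Stand-in for an external structured data file (hypothetical schema).
DATA_JSON = """
{
  "capacity": 10,
  "items": [
    {"name": "a", "weight": 6, "value": 30},
    {"name": "b", "weight": 3, "value": 14},
    {"name": "c", "weight": 4, "value": 16},
    {"name": "d", "weight": 2, "value": 9}
  ]
}
"""


def solve_knapsack(data):
    """Modeling step: the structure is fixed, every number is looked up in `data`."""
    items = data["items"]
    capacity = data["capacity"]
    best_value, best_names = 0, []
    # Brute force over subsets; fine for a toy instance.
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            weight = sum(item["weight"] for item in combo)
            value = sum(item["value"] for item in combo)
            if weight <= capacity and value > best_value:
                best_value = value
                best_names = [item["name"] for item in combo]
    return best_value, best_names


# Binding step: load the data programmatically rather than copying numbers by hand.
data = json.loads(DATA_JSON)
best_value, best_names = solve_knapsack(data)
```

Because no coefficient is transcribed into the model code, growing the instance changes the data file but not the program, which is the failure mode binding-aware inference is meant to avoid.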
Results
The introduction of BIND improved the accuracy of GPT-5-Nano from 59.1% to 82.4% and GPT-5 from 86.2% to 95.8%. The study found that binding was the main failure mode for models as instance data increased, and that binding-specific models outperformed end-to-end approaches across various optimization categories.
Implications
The findings suggest that improving binding capabilities in text-to-optimization can lead to significant advancements in operations research applications, enhancing decision-making processes in logistics, energy, and supply chains. The structured approach to grounding may also inform future developments in natural language processing and optimization tasks.
Learning Through Noise: Why Subliminal Learning Works and When It Fails
Theory
- Subliminal learning does not require shared initializations between teacher and student models.
- The compatibility of output heads (auxiliary and classification) is crucial for successful subliminal learning.
- Architectural differences between teacher and student models can be accommodated as long as expressiveness conditions are met.
- The study provides a theoretical basis for understanding subliminal learning and quantifies its limits.
Summary
This paper investigates the phenomenon of subliminal learning in artificial neural networks, where task-relevant knowledge is transferred from teacher to student models through distillation on task-unrelated input-output pairs. The authors challenge the existing notion that closely matched initializations between teacher and student models are necessary for subliminal learning to occur. Instead, they demonstrate that the compatibility of output heads is the critical factor. Using a controlled MNIST dataset, they separate outputs into an auxiliary head for task-unrelated noise and a class head for classification, showing that subliminal learning can happen even with random initializations and architectural differences. The study establishes a theoretical framework explaining the mechanisms behind subliminal learning and identifies conditions under which it succeeds or fails. The findings suggest that subliminal learning can be a predictable mechanism rather than a mere transfer effect, with implications for knowledge distillation and the potential for unintended bias transmission in model training.
Methodology
The authors utilized a controlled setting with a Multilayer Perceptron (MLP) on the MNIST dataset. They separated the model outputs into an auxiliary head for task-unrelated noise and a classification head for task-related outputs. The teacher model was trained on labeled MNIST data, and then the student model was trained solely on noise pairs generated from the teacher's outputs. Various configurations were tested, including random initializations and architectural changes, to assess the conditions for subliminal learning.
Results
The results showed that subliminal learning occurred even when the student model had random initializations and different architectures compared to the teacher model. The compatibility of the auxiliary and classification heads was identified as the main factor for successful knowledge transfer. In favorable conditions, students trained only on noise could achieve performance levels comparable to the teacher model.
Implications
The findings have significant implications for knowledge distillation practices, highlighting the risks of unintended bias transfer during model training. Understanding subliminal learning can help in designing more robust models and mitigating potential issues related to hidden biases or misalignments in AI systems.
Optimization of randomized neural networks for transfer operator approximation
Optimization
Theory
Efficient ML
- Introduction of RaNNDy, a randomized neural network for transfer operator approximation.
- Proposed algorithm optimizes activation functions while keeping hidden layer parameters fixed.
- Demonstrated effectiveness on benchmark problems like stochastic differential equations.
- Offers a computationally efficient alternative to fully trained neural networks.
Summary
This paper introduces RaNNDy, a randomized neural network architecture designed for the data-driven approximation of transfer operators in complex dynamical systems. Unlike traditional neural networks that require extensive training of all parameters, RaNNDy fixes the weights and biases of hidden layers, only training the output layer, which leads to lower computational costs and a closed-form solution for the output. However, the effectiveness of RaNNDy is limited by the initial random selection of weights and biases that define the basis functions for operator approximation. The authors propose a novel algorithm that optimizes the activation function of the hidden layers while keeping the weights and biases fixed, thereby enhancing the dictionary of basis functions used for approximation. The paper demonstrates the algorithm's efficacy through various benchmark problems, including stochastic differential equations and random walks on graphons, showcasing its potential to improve the accuracy of operator approximations without the computational burden of full neural network training.
Methodology
The authors developed an algorithm that optimizes hyperparameters of the activation functions in RaNNDy while maintaining fixed weights and biases in the hidden layers. This approach leverages the variational principle to enhance the randomized basis used for transfer operator approximation.
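The general pattern of fixed random hidden weights, a closed-form output layer, and a tunable activation hyperparameter can be sketched on a plain regression task. This is not RaNNDy and does not approximate a transfer operator: the `tanh` activation with a `scale` parameter, the grid search, and the in-sample error criterion are all assumptions made for illustration, standing in for the variational objective the summary mentions.

```python
import numpy as np


def hidden_features(x, W, b, scale):
    """Random basis functions: hidden weights W and biases b stay fixed."""
    return np.tanh(scale * (x @ W + b))


def fit_output_layer(H, y):
    """Closed-form output layer via linear least squares."""
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y = np.sin(x).ravel()

# Hidden parameters are drawn once and never trained.
W = rng.normal(size=(1, 50))
b = rng.uniform(-3.0, 3.0, size=50)

# Tune only the activation hyperparameter; the output layer is refit each time.
errors = {}
for scale in (0.25, 1.0, 4.0):
    H = hidden_features(x, W, b, scale)
    coef = fit_output_layer(H, y)
    errors[scale] = float(np.mean((H @ coef - y) ** 2))

best_scale = min(errors, key=errors.get)
```

Selecting `scale` on in-sample error is only for brevity; a held-out set or a problem-specific objective would be the more careful choice.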
Results
The proposed optimization algorithm significantly improves the performance of RaNNDy in approximating transfer operators, as evidenced by successful applications to various complex dynamical systems, including the Bickley jet and high-dimensional protein folding processes.
Implications
The findings suggest that optimizing activation functions in randomized neural networks can lead to more accurate and efficient approximations of transfer operators, which are crucial for analyzing complex dynamical systems. This approach may have broader applications in fields such as molecular dynamics and fluid mechanics.
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
Large Language Models
Graph Learning
Efficient ML
- GraphFlow introduces a unified graph representation (wGraph) for dynamic workflow management.
- The system enables adaptive workflow generation based on task-specific semantics.
- GraphFlow optimizes memory usage by managing KV caches more efficiently.
- Experiments on five benchmark datasets show an average improvement of about 4.95 percentage points and roughly 4× lower memory use compared with state-of-the-art methods.
Summary
The paper introduces GraphFlow, a novel workflow management system designed to enhance the efficiency of Large Language Model (LLM)-based agents in executing complex tasks. Traditional workflow-assisted systems often rely on static templates and shallow matching, which hinder their ability to adapt to new tasks and capture deep semantic relationships. GraphFlow addresses these limitations by utilizing a unified graph representation, termed wGraph, where each node represents an atomic operation. This allows for dynamic workflow generation tailored to specific task semantics and constraints. The system also incorporates an innovative workflow state management strategy that optimizes Key-Value (KV) cache usage, significantly reducing redundant computations. Experimental results demonstrate that GraphFlow outperforms existing methods, achieving an average performance improvement of approximately 4.95 percentage points and a 4× reduction in memory footprint across five benchmark datasets.
Methodology
GraphFlow employs a Graph Neural Network (GNN) for task-adaptive workflow generation, synthesizing workflows from the wGraph. It also implements a topology-aware state management mechanism to minimize redundant KV state storage, allowing shared operations to be efficiently managed across multiple workflows.
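To see why sharing state across workflows saves memory, consider the simplified case where workflows are linear chains of operations and the state after a node depends only on the path leading to it. Under that assumption (which is a simplification of, not a description of, GraphFlow's topology-aware mechanism), every distinct path prefix needs to be stored only once. The function name and the operation names are hypothetical.

```python
def count_states(workflows):
    """Compare per-workflow storage with storage shared across common prefixes."""
    naive = sum(len(w) for w in workflows)
    # One stored state per distinct prefix of operations.
    shared = len({tuple(w[: i + 1]) for w in workflows for i in range(len(w))})
    return naive, shared


workflows = [
    ["parse", "retrieve", "summarize"],
    ["parse", "retrieve", "answer"],
    ["parse", "plan"],
]

naive, shared = count_states(workflows)
```

Here the three workflows would hold eight states if stored separately but only five when common prefixes are shared.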
Results
GraphFlow consistently outperformed state-of-the-art methods, achieving an average performance improvement of about 4.95 percentage points and reducing memory usage by approximately 4× across five benchmark datasets.
Implications
The proposed framework has significant implications for enhancing the efficiency and adaptability of LLM-based agents in various applications, particularly in complex task execution scenarios where dynamic workflow management is crucial.
Multiple Neural Operators Achieve Near-Optimal Rates for Multi-Task Learning
Theory
- Derivation of near-optimal approximation rates for Multiple Neural Operators (MNO) in multi-task learning.
- Establishment of refined statistical learning rates that match those of single-task operator learning.
- Introduction of lower complexity bounds for multiple operator learning, indicating intrinsic complexity barriers.
- Comparison of MNO with a multi-task extension of DeepONet, showing similar asymptotic rates from a worst-case approximation-complexity perspective.
Summary
This paper investigates the approximation and statistical complexity of learning collections of operators in a shared multi-task setting, focusing on the Multiple Neural Operators (MNO) architecture. The authors derive near-optimal upper bounds for approximation and statistical generalization for broad classes of Lipschitz multiple operator maps. They establish that the complexity of multi-task operator learning aligns with the scaling laws of single-task operator learning, countering the notion that shared representations increase overall cost. The paper also compares MNO with a multi-task extension of DeepONet, demonstrating that both architectures exhibit similar asymptotic rates from a worst-case approximation-complexity perspective. The authors present refined approximation rates, improved statistical learning rates, and lower complexity bounds, clarifying the intrinsic barriers of multi-task operator learning.
Methodology
The authors utilize a theoretical framework to analyze the approximation and generalization capabilities of the MNO architecture. They derive upper bounds for approximation complexity through refined error analysis and establish statistical learning rates by integrating these bounds with existing generalization frameworks. Additionally, they extend lower-complexity frameworks to the multi-task setting, introducing concepts adapted to separable architectures.
Results
The paper presents several key results: (1) Near-optimal constructive approximation rates for MNO, with explicit bounds on network parameters; (2) Improved statistical learning rates that scale similarly to single-task operator learning; (3) Lower complexity bounds for multiple operator maps, indicating that the derived upper bounds are close to sharp and that some parametric complexity barriers are unavoidable.
Implications
The findings suggest that multi-task operator learning can be as efficient as single-task learning, which has significant implications for applications in areas such as parameterized kernel operators and solution operators for parameterized PDEs. The results may enhance the design of neural network architectures for multi-task learning scenarios.
Tabular foundation models for robust calibration of near-infrared chemical sensing data
Theory
Efficient ML
- TabPFN shows promise as a calibration strategy for NIR chemical sensing data.
- Preprocessing-optimized TabPFN outperforms traditional models like PLS and modern methods like CatBoost in regression tasks.
- In classification, TabPFN applied directly to raw spectra achieves top performance.
- Robustness analyses indicate limitations of TabPFN in handling spectral outliers and extrapolated samples.
Summary
This paper investigates the use of tabular foundation models, specifically TabPFN, for calibrating near-infrared (NIR) chemical sensing data. NIR spectroscopy is a valuable technique for analyzing various samples, but its practical application is hindered by challenges such as high-dimensional collinear spectra and limited sample sizes. The authors benchmark TabPFN against the traditional chemometric model PLS and against machine learning approaches including Ridge regression and CatBoost, across 66 NIR datasets. They employ a unified validation framework that keeps comparisons fair by optimizing preprocessing and model selection on calibration data before testing. The results indicate that preprocessing-optimized TabPFN outperforms PLS and CatBoost on raw spectra, while performing comparably to Ridge. In classification tasks, TabPFN applied directly to raw spectra achieves the best average rank. However, robustness analyses reveal that TabPFN's advantages diminish in the presence of spectral outliers and extrapolated samples, where classical models still perform well. The findings suggest that tabular foundation models can enhance existing chemometric workflows, particularly in small to medium-sized calibration scenarios, while emphasizing the need for tailored strategies for spectroscopy-specific challenges.
Methodology
The authors benchmarked TabPFN on 66 NIR datasets, covering 54 regression and 12 classification tasks. They utilized a unified validation framework where preprocessing and model selection were performed exclusively on calibration data before external testing. Comparisons were made against PLS, Ridge, CatBoost, and one-dimensional convolutional neural networks.
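The calibration-only selection protocol described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline: the synthetic "spectra", the ridge-only candidate set, the mean-centering step standing in for spectral preprocessing, and every function name are assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on mean-centered data; returns (weights, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, x_mean, y_mean

def ridge_predict(model, X):
    w, x_mean, y_mean = model
    return (X - x_mean) @ w + y_mean

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def select_alpha(X_cal, y_cal, alphas, n_folds=5, seed=0):
    """Pick alpha by K-fold cross-validation using calibration samples only."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_cal)), n_folds)
    scores = []
    for alpha in alphas:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = ridge_fit(X_cal[trn], y_cal[trn], alpha)
            errs.append(rmse(y_cal[val], ridge_predict(model, X_cal[val])))
        scores.append(np.mean(errs))
    return alphas[int(np.argmin(scores))]

# Synthetic stand-in for collinear spectra: few latent factors, many channels.
rng = np.random.default_rng(0)
latent = rng.normal(size=(120, 3))
loadings = rng.normal(size=(3, 200))
X = latent @ loadings + 0.05 * rng.normal(size=(120, 200))
y = latent @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=120)

X_cal, y_cal = X[:80], y[:80]      # calibration set: all tuning happens here
X_test, y_test = X[80:], y[80:]    # external test set: used once, at the end

alphas = [1e-3, 1e-1, 1e1, 1e3]
best_alpha = select_alpha(X_cal, y_cal, alphas)
final_model = ridge_fit(X_cal, y_cal, best_alpha)
test_rmse = rmse(y_test, ridge_predict(final_model, X_test))
```

The point of the design is that `X_test` and `y_test` never influence `best_alpha` or the centering statistics, so the reported test error is not biased by the tuning step.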
Results
Preprocessing-optimized TabPFN achieved the best overall average rank in regression tasks, significantly outperforming PLS, CatBoost, and CNN-1D, while being statistically comparable to Ridge. In classification tasks, TabPFN on raw spectra provided the best average rank, closely matching the optimized variant. However, its performance decreased in the presence of spectral outliers and extrapolated samples.
Implications
The findings suggest that tabular foundation models like TabPFN can enhance the calibration of NIR chemical sensing data, particularly in small to medium-sized datasets. This could lead to more robust and efficient analytical techniques in food, pharmaceutical, and environmental applications, while also indicating the need for specific strategies to address the unique challenges of spectroscopy.
The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler
Generative Models
Theory
Efficient ML
- Matching full posterior covariance reduces path-KL error from Ω(1/T) to O(1/T²).
- The Lanczos Gaussian Sampler (LGS) enables practical covariance matching without dense storage.
- LGS improves sample quality over strong diagonal-covariance baselines with minimal computational overhead.
- The method leverages covariance-vector products through Jacobian-vector products for efficient sampling.
Summary
This paper addresses the limitations of Gaussian Denoising Diffusion Probabilistic Models (DDPMs) regarding the path-space KL divergence between the exact reverse chain and the learned Gaussian reverse process. The authors demonstrate that standard isotropic reverse covariances incur a path-KL error of Ω(1/T), meaning it cannot decay faster than 1/T as the number of denoising steps T increases. They show that matching the full posterior covariance reduces this error to O(1/T²). To make full covariance matching practical, the authors introduce the Lanczos Gaussian Sampler (LGS), a matrix-free method that samples from the optimal reverse covariance using covariance-vector products derived from Jacobian-vector products of the posterior mean. LGS avoids the need for dense covariance storage and auxiliary models, and the authors prove that its approximation error decays exponentially with the number of Lanczos steps. Empirical results show that using just three Lanczos steps enhances sample quality compared to existing diagonal-covariance baselines, establishing full covariance matching as both theoretically beneficial and practically feasible for efficient DDPM sampling.
Methodology
The authors introduce the Lanczos Gaussian Sampler (LGS), which utilizes covariance-vector products computed via Jacobian-vector products to sample from the optimal reverse covariance without explicitly storing it. The method employs a small, fixed number of Lanczos iterations to achieve efficient sampling.
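As a rough illustration of matrix-free Gaussian sampling, the sketch below uses the standard Lanczos approximation of a matrix square root: if z ~ N(0, I), then Σ^{1/2} z ~ N(0, Σ), and k Lanczos steps started from z approximate Σ^{1/2} z using only products Σv. This is the textbook construction and may differ from the paper's sampler in its details. The function name, the dense test matrix, and the step count are assumptions; in the paper's setting the `matvec` callable would be a covariance-vector product obtained from a Jacobian-vector product, whereas here it is a plain matrix multiply.

```python
import numpy as np

def lanczos_sqrt_matvec(matvec, z, num_steps):
    """Approximate A^{1/2} z for symmetric PSD A using only products A @ v."""
    n = z.shape[0]
    beta0 = np.linalg.norm(z)
    Q = np.zeros((n, num_steps))
    alphas, betas = [], []
    q = z / beta0
    for j in range(num_steps):
        Q[:, j] = q
        w = matvec(q)
        alphas.append(q @ w)
        # Full reorthogonalization against all Lanczos vectors built so far.
        w = w - Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)
        beta = np.linalg.norm(w)
        if beta < 1e-12 or j == num_steps - 1:
            break
        betas.append(beta)
        q = w / beta
    k = len(alphas)
    # Tridiagonal projection T = Q^T A Q.
    T = np.diag(alphas) + np.diag(betas[: k - 1], 1) + np.diag(betas[: k - 1], -1)
    evals, evecs = np.linalg.eigh(T)
    # T^{1/2} e_1, computed through the eigendecomposition of the small matrix T.
    sqrt_T_e1 = evecs @ (np.sqrt(np.clip(evals, 0.0, None)) * evecs[0, :])
    return beta0 * (Q[:, :k] @ sqrt_T_e1)

# Dense SPD stand-in for the reverse covariance (hypothetical test problem).
rng = np.random.default_rng(0)
n = 30
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
eigs = np.linspace(0.5, 2.0, n)
A = (U * eigs) @ U.T
exact_sqrt = (U * np.sqrt(eigs)) @ U.T

z = rng.normal(size=n)
sample = lanczos_sqrt_matvec(lambda v: A @ v, z, num_steps=10)
rel_err = np.linalg.norm(sample - exact_sqrt @ z) / np.linalg.norm(exact_sqrt @ z)
```

Only `num_steps` matrix-vector products and an n-by-`num_steps` basis are stored, so the covariance itself is never formed by the sampler.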
Results
Using just three Lanczos steps, LGS significantly improves sample quality over existing methods such as OCM-DDPM. This is consistent with the theoretical reduction in path-KL error and indicates that full covariance matching can be obtained with little additional sampling cost.
Implications
The findings suggest that full covariance matching can enhance the performance of Gaussian DDPMs, particularly in applications requiring accurate reverse trajectories, such as classifier-guided sampling. This could lead to advancements in generative modeling and improved sampling techniques in various domains.
Convex Compositional Reasoning Models
Optimization
Theory
Efficient ML
- Identifies non-convex energy composition as a source of spurious minima in compositional energy-based models.
- Introduces Convex Compositional Energy Minimization (CCEM) framework that maintains convexity in energy composition.
- Develops a deterministic optimization pipeline for training and inference on tight convex relaxations.
- Provides theoretical guarantees that convex composition prevents spurious local minima and certifies global optimality.
Summary
This paper addresses the challenges of compositional reasoning in energy-based models, particularly the issue of non-convex energy landscapes that lead to spurious local minima. The authors introduce a novel framework called Convex Compositional Energy Minimization (CCEM), which utilizes input-convex neural networks to parameterize local energy factors. By ensuring that the composed energy remains convex, the framework allows for efficient deterministic optimization through projected first-order methods. The training process consists of two stages: first, factor-level contrastive learning shapes the local energy basins, followed by end-to-end refinement using an unrolled projected solver. The authors demonstrate that models trained on smaller problems can effectively transfer to larger instances without retraining, showcasing the practical benefits of their approach. The theoretical guarantees provided in the paper indicate that convex composition eliminates the risk of spurious minima, making the optimization process more reliable and efficient.
Methodology
The CCEM framework parameterizes local energy factors using input-convex neural networks, ensuring that the overall energy remains convex. The training process involves two stages: local contrastive learning for shaping energy basins and end-to-end refinement through an unrolled projected solver. The optimization is performed directly on tight convex relaxations of feasible sets.
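To make the convexity argument concrete, here is a minimal sketch of a sum of convex factors minimized by projected gradient descent. It assumes the simplest input-convex form (a non-negative combination of softplus units applied to affine maps) and a box constraint standing in for a convex relaxation of the feasible set; the class name, sizes, and step-size rule are illustrative assumptions, and no training is shown.

```python
import numpy as np

def softplus(t):
    return np.logaddexp(0.0, t)

def sigmoid(t):
    return 0.5 * (1.0 + np.tanh(0.5 * t))

class ConvexFactor:
    """One-hidden-layer input-convex energy: a . softplus(W x + b) with a >= 0."""

    def __init__(self, dim, hidden, rng):
        self.W = rng.normal(size=(hidden, dim))
        self.b = rng.normal(size=hidden)
        self.a = np.abs(rng.normal(size=hidden))   # non-negative output weights

    def energy(self, x):
        return float(self.a @ softplus(self.W @ x + self.b))

    def grad(self, x):
        return self.W.T @ (self.a * sigmoid(self.W @ x + self.b))

def composed_energy(factors, x):
    # A sum of convex functions is convex, so composition preserves convexity.
    return sum(f.energy(x) for f in factors)

def composed_grad(factors, x):
    return sum(f.grad(x) for f in factors)

def projected_gradient(factors, x0, iters=2000):
    """Minimize the composed energy over the box [0, 1]^d."""
    # softplus'' <= 1/4 gives an upper bound on the gradient's Lipschitz constant.
    lipschitz = sum(0.25 * f.a.max() * np.linalg.norm(f.W, 2) ** 2 for f in factors)
    step = 1.0 / lipschitz
    x = np.clip(x0, 0.0, 1.0)
    for _ in range(iters):
        x = np.clip(x - step * composed_grad(factors, x), 0.0, 1.0)
    return x

rng = np.random.default_rng(0)
factors = [ConvexFactor(dim=4, hidden=8, rng=rng) for _ in range(3)]
x_start = rng.uniform(size=4)
x_star = projected_gradient(factors, x_start)
```

Because the composed energy is convex and the box is a convex set, any stationary point the projected iteration reaches is a global minimizer over the box, which is the property the summary refers to when it says convex composition rules out spurious minima.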
Results
The experiments show that the CCEM models, trained on smaller subproblems, effectively transfer to larger instances without the need for retraining. The proposed method matches or improves the performance of existing particle-based compositional energy-based models while simplifying the inference process.
Implications
The findings suggest that convex compositional reasoning can enhance the efficiency and reliability of solving complex combinatorial problems in various applications, potentially impacting fields such as optimization, operations research, and artificial intelligence.
LLM-driven design of physics-constrained constitutive models: two agents are better than one
Large Language Models
Generative Models
Theory
- Introduction of a dual-agent system for constitutive model generation using LLMs.
- The Creator agent proposes models while the Inspector agent ensures compliance with physical laws.
- Significant improvement in model validity, reaching 100% compliance with physical constraints for one of the LLMs evaluated.
- Models maintain high accuracy and generalization to unseen loading conditions.
Summary
This paper presents a novel approach to developing constitutive models for materials using large language models (LLMs) that incorporates a dual-agent system. The traditional process of creating these models is complex and requires extensive expertise in multiple fields. The authors propose a two-agent framework consisting of a Creator agent that generates constitutive models based on data and an Inspector agent that audits these models against nine fundamental physical constraints. This dual approach addresses the limitations of existing single-agent systems, which often fail to ensure compliance with physical laws. The methodology is demonstrated using constitutive artificial neural networks (CANNs) and is benchmarked on various materials, including brain tissue and rubber. The results show that the inclusion of the Inspector agent significantly improves the physical validity of the generated models, achieving a 100% compliance rate for one LLM and a notable increase for another. The models not only meet physical constraints but also maintain high accuracy and generalization capabilities, making them practical for real-world applications. This technique-agnostic framework is scalable with advancements in LLMs, paving the way for automated and physics-aware model discovery.
Methodology
The authors developed a two-agent system where the Creator agent generates constitutive models tailored to specific datasets, and the Inspector agent audits these models against nine physical constraints. This process ensures that only models that meet all physical requirements are exported for practical use.
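The propose-then-audit loop can be sketched generically as below. Everything here is a hypothetical stand-in: the Creator is a stub function rather than an LLM call, a model is reduced to a scalar stretch-to-stress function, and the two toy checks only illustrate the idea of a constraint audit (the summary does not list the paper's nine constraints).

```python
from typing import Callable, Dict, List, Optional

Model = Callable[[float], float]   # toy constitutive model: stretch -> stress

def audit(model: Model, checks: Dict[str, Callable[[Model], bool]]) -> List[str]:
    """Return the names of all violated checks (an empty list means the model passes)."""
    return [name for name, check in checks.items() if not check(model)]

def create_and_inspect(propose, checks, max_rounds: int = 5) -> Optional[Model]:
    """Alternate proposal and audit; export a model only if every check passes."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        model = propose(feedback)          # Creator step
        feedback = audit(model, checks)    # Inspector step
        if not feedback:
            return model
    return None                            # nothing is exported if audits keep failing

# Two toy checks standing in for the physical constraints (hypothetical).
toy_checks = {
    "stress_free_reference": lambda m: abs(m(1.0)) < 1e-9,
    "monotone_stress": lambda m: all(m(s + 0.1) > m(s) for s in (0.8, 1.0, 1.2)),
}

def toy_creator(feedback: List[str]) -> Model:
    """Stub for an LLM call: repairs the model once it is told what failed."""
    if "stress_free_reference" in feedback:
        return lambda stretch: stretch - 1.0 / stretch**2
    return lambda stretch: stretch         # first draft: nonzero stress at rest

exported = create_and_inspect(toy_creator, toy_checks)
```

Returning `None` when the round budget is exhausted mirrors the statement that only models meeting all requirements are exported.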
Results
The implementation of the Inspector agent increased the percentage of models meeting all physical constraints from 91% to 100% for the Claude Opus 4.7 LLM and from 37% to 56% for the Kimi K2.5 LLM. The models generated were not only physically valid but also demonstrated high accuracy and the ability to generalize to new loading paths.
Implications
This research has significant implications for the field of material science and engineering, as it provides a reliable method for generating constitutive models that adhere to physical laws. The approach can facilitate the automated discovery of models, reducing the time and expertise required for model development and potentially leading to advancements in various applications, including biomedical engineering and materials design.
Cost-Effective Model Evaluation with Meta-Learning
Efficient ML
NLP
Computer Vision
- Introduction of MetaEvaluator, a cost-effective, model-agnostic evaluation framework.
- Utilization of meta-learning to transfer knowledge from reference models for performance estimation.
- Development of MetaDataset, a large-scale dataset for training and evaluating the framework.
- Demonstration of significant cost reduction in model evaluation compared to traditional methods.
Summary
The paper addresses the challenge of evaluating newly released machine learning models on unseen, unlabeled data, which is increasingly important due to the rapid growth of model ecosystems. Traditional evaluation methods often require expensive annotation or fine-tuning, making them impractical for organizations needing quick assessments. The authors introduce MetaEvaluator, a model-agnostic framework that utilizes meta-learning to provide rapid, label-free evaluations of diverse model architectures. By leveraging a pool of reference models, MetaEvaluator learns transferable performance patterns, allowing it to estimate the performance of new models without the need for retraining. This approach significantly reduces evaluation costs and enables scalable benchmarking across various modalities, including Text-to-SQL and Image Classification. The paper also presents MetaDataset, a large-scale corpus of model-shift pairs that supports the evaluation process. Overall, MetaEvaluator offers a novel solution for organizations to efficiently select models for unlabeled workloads, enhancing the deployment of machine learning systems.
Methodology
MetaEvaluator employs a meta-learning approach to learn performance patterns from a pool of previously evaluated models. It distills these patterns into compact context representations, enabling rapid adaptation to new models on unlabeled datasets without requiring extensive human annotation or per-model retraining.
Results
Extensive experiments show that MetaEvaluator provides stable and accurate performance estimates for unseen models while significantly reducing evaluation costs compared to conventional methods. The framework demonstrates its effectiveness across different tasks, including Text-to-SQL and Image Classification.
Implications
MetaEvaluator has the potential to streamline the model selection process in organizations by enabling quick and reliable evaluations of new models on unlabeled data. This can lead to faster deployment cycles and improved trust in machine learning systems, particularly in scenarios where labeled data is scarce or unavailable.
Three Costs of Amortizing Gaussian Process Inference with Neural Processes
Theory
Generative Models
Efficient ML
- Decomposes KL divergence between GP and LNP into three interpretable sources of error.
- Establishes bounds on the truncation component of the bottleneck term related to kernel smoothness.
- Identifies persistent costs of label contamination in neural process predictions.
- Provides architectural recommendations to enhance predictive variance estimation.
Summary
This paper investigates the amortization of Gaussian Process (GP) inference through Neural Processes (NP), specifically focusing on Latent Neural Processes (LNP). The author identifies three primary sources of error when approximating the GP posterior: label contamination, an information bottleneck, and amortization error. The label contamination arises because the neural process uses label values to estimate a label-independent quantity in the GP. The information bottleneck is due to the finite-dimensional representation that cannot fully capture the context geometry. Amortization error results from using a single encoder network across all contexts. The paper provides bounds on the Kullback-Leibler (KL) divergence between the GP and LNP predictives, revealing that the bottleneck truncation term decays exponentially with the representation dimension for squared-exponential kernels and polynomially for Matérn kernels. The label contamination term is shown to be O(1) in general, with the observation-noise component decaying as O(1/n). The findings lead to architectural recommendations for improving predictive variance estimation in the GP-amortization regime, suggesting the use of second-order pooling instead of mean aggregation.
Methodology
The paper employs theoretical analysis to derive bounds on the KL divergence between the predictive distributions of Gaussian Processes and Latent Neural Processes. It examines the sources of error in the amortization process and provides a decomposition of these errors into distinct components. The analysis includes mathematical formulations for the predictive distributions and their variances, as well as the impact of representation dimensions on approximation quality.
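For readers who want the quantity being bounded in concrete form, the sketch below evaluates the closed-form KL divergence between two univariate Gaussian predictives at a single test input. It does not reproduce the paper's three-term decomposition; the numerical means and variances are made up for illustration.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for univariate Gaussians."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Hypothetical predictives at one test input.
gp_mean, gp_var = 0.30, 0.04
np_mean, np_var = 0.30, 0.09     # same mean, inflated predictive variance

# A variance mismatch alone already contributes to the divergence.
kl_variance_only = gaussian_kl(gp_mean, gp_var, np_mean, np_var)

# A mean shift with matched variance contributes (shift^2) / (2 * variance).
kl_mean_shift = gaussian_kl(gp_mean, gp_var, gp_mean + 0.1, gp_var)
```

The first case shows why predictive variance estimation matters on its own: the divergence is nonzero even when the two predictive means agree exactly.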
Results
The analysis decomposes the KL divergence between the GP and LNP predictives into three terms: label contamination, information bottleneck, and amortization error. The bottleneck truncation term decays exponentially with the representation dimension for squared-exponential kernels and polynomially for Matérn kernels. The label contamination term is O(1) in general, and its observation-noise component decays as O(1/n). These results inform architectural choices for improving predictive performance in neural processes.
Implications
The findings have significant implications for the design of neural process architectures, particularly in applications requiring efficient and scalable GP inference. By understanding the sources of error in amortization, practitioners can make informed decisions about model architecture and representation dimensions, potentially leading to better performance in tasks such as sequential experimental design, robotics, and simulation-based inference.
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
NLP
Generative Models
Large Language Models
- DiLaDiff combines continuous latent representations with discrete decoding for improved language modeling.
- The model significantly accelerates inference while maintaining high sampling quality.
- Consistency distillation enables rapid generation of high-quality text outputs.
- DiLaDiff outperforms traditional masked diffusion models in both quality and efficiency.
Summary
The paper introduces DiLaDiff, a novel approach to language modeling that addresses the limitations of diffusion language models in capturing token correlations. DiLaDiff integrates a continuous latent space, a latent diffusion model, and a consistency model to enhance sampling quality and throughput. The continuous latent space is learned via an auto-encoder fine-tuned from a masked diffusion model, allowing for semantic representation. The latent diffusion model learns the prior over the encoder distribution, while the consistency model distills this prior into a few-step generative model. The authors demonstrate that DiLaDiff outperforms traditional masked diffusion models in terms of quality and inference speed, achieving a sevenfold acceleration in inference at a batch size of 32. Furthermore, the distillation process allows DiLaDiff to generate high-quality outputs in just five steps, compared to 200 steps required by its predecessor, LaDiff.
Methodology
The methodology involves training a text auto-encoder with a decoder initialized from a pre-trained discrete diffusion model. A continuous diffusion model is used to learn the latent prior, creating a hybrid continuous-discrete diffusion framework. The model captures global correlations through continuous diffusion while allowing for robust decoding in the token space. The consistency model distills the latent diffusion component into a few-step generative model.
Results
DiLaDiff demonstrates superior performance compared to masked diffusion baselines, achieving a sevenfold increase in inference speed at a batch size of 32. The model generates high-quality text in just five steps, closely matching the performance of the LaDiff model, which requires 200 steps.
Implications
The advancements presented in DiLaDiff could lead to more efficient and effective language models, enhancing applications in natural language processing, text generation, and other areas where high-quality language understanding and generation are critical.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Reinforcement Learning
Multimodal
Large Language Models
- MAESTRO reframes multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry.
- The orchestration policy is optimized using outcome-based reinforcement learning, eliminating the need for step-level supervision.
- MAESTRO achieves an average accuracy of 70.1%, outperforming state-of-the-art models like GPT-5.
- The framework demonstrates plug-and-play generalization to unseen models and skills without retraining.
Summary
The paper introduces MAESTRO, a novel orchestration framework that leverages reinforcement learning (RL) to dynamically manage heterogeneous multimodal tasks through a hierarchical registry of models and skills. Traditional frameworks often rely on monolithic large language models (LLMs) and fixed logic, which limits their ability to exploit the complementary strengths of various models and skills. MAESTRO addresses this by training a lightweight policy that orchestrates the selection of expert models and skills based on task requirements, optimizing the orchestration process without requiring step-level supervision. The framework is evaluated across ten multimodal benchmarks, demonstrating superior performance with an average accuracy of 70.1%, surpassing leading models such as GPT-5 and Gemini-2.5-Pro. Notably, MAESTRO's policy generalizes to unseen models and skills, maintaining high computational efficiency and low latency, thus providing a scalable solution for collaborative agentic ecosystems.
Methodology
The authors formalize model-skill coordination as a finite-horizon partially observable Markov decision process (POMDP) and train the orchestration policy using reinforcement learning. The framework employs a two-tier hierarchical skill library and a multi-expert model pool to facilitate dynamic task-specific orchestration.
Results
MAESTRO achieves an average accuracy of 70.1% across ten multimodal benchmarks, outperforming GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). The policy generalizes effectively to unseen models and skills, achieving a 59.5% average on four challenging benchmarks with out-of-domain experts.
Implications
The MAESTRO framework could significantly enhance the performance of autonomous agents in various domains by enabling more effective coordination of diverse models and skills. Its ability to generalize to new models and maintain efficiency suggests potential applications in real-time systems and collaborative AI environments.
Harnesses for Inference-Time Alignment over Execution Trajectories
NLP
Large Language Models
Optimization
- Harness design is framed as an inference-time alignment problem, focusing on workflow and guidance components.
- Optimal granularity in task decomposition must align with the agent's capabilities and retry budgets.
- Guidance improves performance only when it aligns with task evidence; misalignment can lead to hallucinations.
- Partial harnessing, which specifies only initial steps, can be more effective than fully structured workflows.
Summary
This paper investigates the design of harnesses for large language model (LLM) agents, focusing on inference-time alignment over execution trajectories. Harness engineering aims to enhance long-term performance by decomposing tasks into sub-goals and providing guidance during execution. However, the authors highlight that more complex harnesses do not always yield better results; excessive decomposition or guidance can lead to reduced task success. The study introduces a framework that separates harnesses into two components: task decomposition and guided execution, allowing for a detailed analysis of how these elements influence performance. The authors identify critical failure modes such as over-decomposition and hallucinated execution. Through controlled experiments and benchmarks, they validate the theory that partial harnessing—specifying only initial steps and allowing agents to determine subsequent actions—can outperform fully structured workflows. This approach emphasizes the importance of aligning harness design with the agent's capabilities and the task's evidence, suggesting a more nuanced strategy for harness development.
Methodology
The authors conducted theoretical modeling of harness design, separating it into workflow and guidance components. They performed controlled synthetic experiments and utilized real terminal agent benchmarks to validate their findings regarding performance limits and failure modes.
Results
The experiments demonstrated that partial harnessing outperformed fully specified workflows, supporting the notion that effective harnesses need not dictate the entire execution path. The findings also highlighted the critical alignment principles necessary for harness design.
Implications
The insights from this research can inform the development of more effective harnesses for LLM agents, potentially leading to improved performance in complex task execution. This could have applications in various domains, including software engineering, scientific discovery, and autonomous tool use.
Latent Cache Flow: Model-to-Model Communication Without Text
Large Language Models
Efficient ML
NLP
- Latent Cache Flow (LCF) reduces the size of communication adapters significantly compared to Cache-to-Cache (C2C).
- LCF allows for efficient model-to-model communication without the need for aligned contexts.
- The method improves accuracy and speed of communication between LLMs, outperforming traditional text-based methods.
- LCF-X extends the capabilities of LCF for cross-context communication by summarizing KV caches.
Summary
The paper introduces Latent Cache Flow (LCF), a novel approach to improve communication between large language models (LLMs) by bypassing text-based communication, which is often slow and lossy. Traditional methods like Cache-to-Cache (C2C) require aligned contexts and produce large adapters for translating key-value (KV) caches, making them inefficient for multi-model systems with differing contexts. LCF addresses these issues by jointly translating and compressing keys and values, significantly reducing the size of the adapter to about 4% of C2C's size. Furthermore, LCF is designed to summarize new information for the receiver model, allowing for effective communication even when contexts differ. The paper also presents an extension, LCF-X, which enhances cross-context communication by summarizing the sharer's KV cache into a fixed-size tensor. Experimental results indicate that LCF outperforms C2C in both shared and differing contexts, achieving higher accuracy and faster communication speeds compared to text-based methods.
Methodology
The authors developed LCF by replacing the independent processing of keys and values in C2C with a shared low-dimensional cache channel. This method compresses the KV states into a latent representation, allowing for efficient communication. LCF-X further processes the sharer's KV cache to support communication across different contexts by summarizing the information into a fixed-size tensor.
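The parameter saving from a shared low-dimensional channel can be illustrated with plain linear maps. This is a deliberately simplified sketch: the real adapters are not described here as single matrices, and all dimensions, weight names, and the random initialization are assumptions chosen only to show the shapes. The resulting ratio is not meant to reproduce the reported adapter sizes.

```python
import numpy as np

def count_params(*mats):
    return sum(m.size for m in mats)

rng = np.random.default_rng(0)
d_sharer, d_receiver, d_latent, seq_len = 64, 48, 8, 10   # hypothetical sizes

# Independent processing: one dense map for keys, another for values.
W_k = rng.normal(size=(d_sharer, d_receiver))
W_v = rng.normal(size=(d_sharer, d_receiver))

# Shared channel: compress [K, V] jointly, then expand to receiver keys and values.
W_down = rng.normal(size=(2 * d_sharer, d_latent)) / np.sqrt(2 * d_sharer)
W_up = rng.normal(size=(d_latent, 2 * d_receiver)) / np.sqrt(d_latent)

def translate_shared(K, V):
    latent = np.concatenate([K, V], axis=-1) @ W_down   # (seq_len, d_latent)
    out = latent @ W_up                                  # (seq_len, 2 * d_receiver)
    return out[:, :d_receiver], out[:, d_receiver:], latent

K = rng.normal(size=(seq_len, d_sharer))
V = rng.normal(size=(seq_len, d_sharer))
K_r, V_r, latent = translate_shared(K, V)

independent_params = count_params(W_k, W_v)     # 2 * 64 * 48
shared_params = count_params(W_down, W_up)      # 128 * 8 + 8 * 96
```

Because keys and values pass through the same bottleneck, the parameter count grows with `d_latent` rather than with the product of the two model widths.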
Results
Early experiments demonstrated that a 13 MB LCF adapter achieved higher accuracy than a 956 MB C2C adapter in shared-context scenarios. In differing contexts, LCF showed a 23% increase in accuracy and was 8.5 times faster than traditional text-based communication methods.
Implications
The findings suggest that LCF can enhance the efficiency of multi-model systems in various applications, such as collaborative AI tasks, where different models must communicate effectively without the constraints of text generation. This could lead to faster and more accurate AI systems in real-world applications.
Relevant Walk Search for Explaining Graph Neural Networks
Graph Learning
Interpretability
- Introduces polynomial-time algorithms for identifying relevant walks in GNNs, enhancing the scalability of GNN-LRP.
- Presents two algorithms: EMP-neu for exact neuron-level walk identification and AMP-ave for approximate node-level walk identification.
- Demonstrates superior performance of the proposed methods across multiple application domains with high accuracy.
- Addresses the computational challenges of GNN-LRP, reducing complexity from exponential to polynomial.
Summary
This paper addresses the challenge of explainability in Graph Neural Networks (GNNs), which are increasingly used for graph analysis but often operate as black boxes. The authors focus on Layer-wise Relevance Propagation for GNNs (GNN-LRP), which provides higher-order explanations by evaluating the relevance of walks within the network. However, the original GNN-LRP suffers from exponential computational complexity when identifying relevant walks, limiting its scalability. To overcome this, the authors propose two polynomial-time algorithms: an exact max-product search for neuron-level walks (EMP-neu) and an approximate max-product search for node-level walks (AMP-ave). These algorithms leverage the max-product algorithm, a common method in probabilistic graphical models, to efficiently identify the most relevant walks. The paper demonstrates the effectiveness of these algorithms through various experiments across different domains, including epidemiology, molecular data, and natural language processing, showing that they maintain high accuracy while significantly reducing computational costs. The authors provide their implementation, making their approach accessible for further research and application.
Methodology
The authors propose two algorithms based on the max-product algorithm to identify relevant walks in GNNs. EMP-neu is designed for exact identification of neuron-level walks, while AMP-ave approximates node-level walks through averaging. Both algorithms reduce computational complexity from exponential to polynomial, allowing for efficient processing of larger graphs.
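The reduction from exponential to polynomial cost can be seen in a stripped-down max-product search. The sketch assumes that a walk's score factorizes into a product of non-negative per-layer edge scores, which is a simplification and not the exact relevance quantity used by GNN-LRP; the function names and the random test graph are illustrative. A brute-force enumeration is included only to check the dynamic program.

```python
import itertools
import numpy as np

def best_walk_max_product(layer_scores):
    """Return (best_score, walk) maximizing the product of per-layer edge scores.

    layer_scores[l][i, j] >= 0 is the (hypothetical) score of stepping from
    node i at layer l to node j at layer l + 1; zero marks a missing edge.
    """
    n = layer_scores[0].shape[0]
    best = np.ones(n)          # best score of any partial walk ending at each node
    backptr = []
    for M in layer_scores:
        cand = best[:, None] * M                  # cand[i, j]: extend best walk at i to j
        backptr.append(np.argmax(cand, axis=0))   # best predecessor of each node j
        best = np.max(cand, axis=0)
    end = int(np.argmax(best))
    walk = [end]
    for ptr in reversed(backptr):                 # follow back-pointers to the start
        walk.append(int(ptr[walk[-1]]))
    walk.reverse()
    return float(best[end]), walk

def best_walk_brute_force(layer_scores):
    """Enumerate all n^(L+1) walks; exponential in the number of layers L."""
    n, L = layer_scores[0].shape[0], len(layer_scores)
    best_score, best_walk = -1.0, None
    for walk in itertools.product(range(n), repeat=L + 1):
        s = 1.0
        for l, M in enumerate(layer_scores):
            s *= M[walk[l], walk[l + 1]]
        if s > best_score:
            best_score, best_walk = float(s), list(walk)
    return best_score, best_walk

rng = np.random.default_rng(0)
n_nodes, n_layers = 5, 3
adjacency = (rng.uniform(size=(n_nodes, n_nodes)) < 0.5) | np.eye(n_nodes, dtype=bool)
layer_scores = [adjacency * rng.uniform(size=(n_nodes, n_nodes)) for _ in range(n_layers)]

dp_score, dp_walk = best_walk_max_product(layer_scores)
bf_score, bf_walk = best_walk_brute_force(layer_scores)
```

The dynamic program touches each layer's n-by-n score matrix once, so its cost is O(L n²) for L layers, while the enumeration visits n^(L+1) walks.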
Results
The experiments conducted demonstrate that both EMP-neu and AMP-ave effectively identify relevant walks with high accuracy across various benchmarks, including epidemiology, molecular, and natural language processing tasks. The results indicate that the approximation error in AMP-ave is negligible, confirming its utility in practical applications.
Implications
The proposed methods enhance the explainability of GNNs, making them more interpretable and applicable in critical areas such as healthcare, chemistry, and linguistics. By improving computational efficiency, these algorithms can be integrated into larger systems that require real-time explanations of GNN predictions.
Diffusion Domain Expansion: Learning to Coordinate Pre-trained Diffusion Models
Generative Models
Audio & Speech
Computer Vision
- DDE extends the generative capabilities of pre-trained diffusion models to larger objects and complex conditioning.
- The coordinator network is designed to be parameter-efficient and can generalize beyond training sizes.
- DDE outperforms existing coordinated generation methods in both qualitative and quantitative evaluations.
- The method is applicable to various domains, including audio and image generation.
Summary
This paper introduces Diffusion Domain Expansion (DDE), a novel method aimed at enhancing the capabilities of pre-trained diffusion models for generating larger objects and managing more complex conditioning tasks. The authors propose a compact, trainable coordinator network that synchronizes the outputs of existing diffusion models, allowing for the generation of larger and more intricate data samples than those the models were originally trained on. DDE is evaluated across various domains, including long audio track generation and conditional image generation, showcasing its versatility. The results indicate that DDE not only surpasses existing methods in both qualitative and quantitative measures but also generalizes to larger domains than those encountered during training. This advancement is particularly significant given the rising costs associated with training large models from scratch, emphasizing the importance of leveraging pre-trained models for diverse applications.
Methodology
The DDE framework employs a small coordinator network that learns to coordinate the outputs of pre-trained diffusion models. It operates by decomposing larger generated objects and conditioning inputs into smaller parts, which are then processed by the pre-trained models. The coordinator is trained on a dataset of larger objects and is capable of generating outputs of even larger sizes than those seen during training. The methodology includes a composite denoiser that integrates the outputs from the pre-trained models and the coordinator to produce coherent larger objects.
Results
DDE was evaluated on tasks such as music generation and conditional image generation, including complex scenarios like CLEVR scene generation and satellite image generation. The results demonstrated that DDE significantly outperformed other coordination methods, achieving higher quality outputs and better adherence to the desired conditions.
Implications
The implications of this research are substantial for fields that rely on generative models, as DDE allows for more efficient use of pre-trained models, reducing the need for extensive retraining. This can lead to advancements in various applications, including multimedia content generation, robotics, and any domain requiring complex data synthesis.
The Implicit Bias of Depth: From Neural Collapse to Softmax Codes
Theory
Optimization
- Depth induces an implicit low-rank bias that affects the emergence of Neural Collapse.
- Low-rank structures promote efficient norm propagation through matrix multiplications.
- The study connects depth-induced biases to softmax codes, revealing a relationship with max-margin solutions.
- Training dynamics show that increasing depth can destabilize NC, favoring low-rank alternatives.
The Implicit Bias of Depth: From Neural Collapse to Softmax Codes
Summary
This paper investigates the phenomenon of Neural Collapse (NC) in deep neural networks, particularly focusing on how depth influences implicit biases during training. The authors analyze the deep unconstrained feature model (UFM), which is akin to a deep linear network with orthogonal inputs, trained without explicit regularization. They demonstrate that depth introduces an implicit low-rank bias, which affects the propagation of norms through matrix multiplications, favoring low-rank alternatives to NC. The study reveals that these low-rank structures correspond to softmax codes, which are max-margin solutions typically observed in networks with width bottlenecks. The authors provide a comprehensive characterization of the dynamics of training under spectral initialization, highlighting an early-time repulsion among singular values that leads to low-rank emergence. They also note that while increasing depth tends to shrink the basin of attraction for NC, wider networks can bias training towards higher-rank solutions. Overall, the paper offers insights into the complex interplay between depth, optimization dynamics, and initialization in shaping the implicit biases that govern the training outcomes of deep networks.
Methodology
The authors utilize theoretical analysis of the deep UFM trained with cross-entropy loss without explicit regularization. They explore both asymptotic behavior and training dynamics, employing spectral initialization to study the evolution of singular values and their impact on convergence to NC. Empirical validation is conducted on deep UFMs and neural networks to support their theoretical findings.
Results
The paper establishes that in deep architectures, NC is not the global optimum due to the implicit low-rank bias introduced by depth. It identifies that low-rank structures can emerge as stable solutions during training, particularly in the presence of depth, while also showing that wider networks can lead to higher-rank solutions. The findings provide a dynamic characterization of how depth influences the implicit biases in deep learning models.
Implications
The insights from this study could inform the design of neural network architectures and training protocols, particularly in understanding how depth and width interact to shape model performance. This understanding may lead to improved strategies for achieving desired geometric configurations in deep learning, enhancing model robustness and interpretability.
Ternary Decision Trees with Locally-Adaptive Uncertainty Zones
Interpretability
Theory
Efficient ML
- Introduction of ternary decision trees with uncertainty zones to enhance decision boundary handling.
- Five methods for local computation of uncertainty zone width δ are proposed and evaluated.
- All proposed methods significantly outperform standard CART in terms of accuracy across multiple datasets.
- The margin method achieves the highest efficiency and requires no additional hyperparameters.
Ternary Decision Trees with Locally-Adaptive Uncertainty Zones
Summary
This paper introduces a novel approach to decision trees by proposing ternary decision trees that incorporate locally-adaptive uncertainty zones. Traditional decision trees utilize hard binary thresholds, which treat instances near decision boundaries with the same confidence as those far from them. The proposed ternary decision trees address this limitation by defining an uncertainty zone around the optimal threshold, allowing for a more nuanced prediction mechanism. Instances within this zone receive predictions based on a weighted blend of the outputs from both child subtrees and are flagged as boundary-uncertain. The half-width of the uncertainty zone, δ, is computed locally at each node using five different estimation methods derived from existing split statistics, eliminating the need for external noise specifications. The paper evaluates these methods across 72 datasets from OpenML-CC18 and demonstrates that all methods significantly outperform standard CART in terms of decided accuracy. The margin method, in particular, shows the best efficiency and self-calibration on clean data. Practical applications are illustrated through experiments on medical and financial datasets, showcasing the potential for improved accuracy in real-world scenarios.
Methodology
The methodology involves creating ternary decision trees that define an uncertainty zone around the optimal threshold at each split node. Five methods for estimating the width of this zone (δ) are proposed, including quality-plateau, class-overlap, gain-ratio, node-bootstrap, and margin. The performance of these methods is evaluated through extensive empirical testing on various datasets, utilizing 5-fold cross-validation.
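The routing at a single split node can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `TernaryNode` class, the linear blending ramp, and the fact that `delta` is passed in rather than estimated by one of the five methods are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TernaryNode:
    threshold: float    # split threshold found by the usual CART search
    delta: float        # half-width of the uncertainty zone (assumed given here)
    left_value: float   # prediction of the left subtree, e.g. P(class = 1)
    right_value: float  # prediction of the right subtree


def predict(node: TernaryNode, x: float) -> Tuple[float, bool]:
    """Return (prediction, boundary_uncertain) for one feature value."""
    lower = node.threshold - node.delta
    upper = node.threshold + node.delta
    if x <= lower:
        return node.left_value, False
    if x >= upper:
        return node.right_value, False
    # Inside the zone: blend both children and flag the instance.
    # The linear ramp below is an assumption, not the paper's weighting.
    w_right = (x - lower) / (upper - lower)
    blended = (1.0 - w_right) * node.left_value + w_right * node.right_value
    return blended, True
```

With `delta = 0` the zone is empty and the node behaves like an ordinary binary split, which is why the zone can be seen as a strict extension of CART routing.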
Results
The evaluation shows that all five δ-estimation methods with probabilistic routing significantly outperform standard CART in decided accuracy (p ≤ 0.001). The margin method achieves the best performance, winning on 42 out of 72 datasets and demonstrating self-calibration on clean data. In practical applications, the node-bootstrap method improves accuracy by 0.71% on mammography data by flagging 10.8% of cases as boundary-uncertain.
Implications
The findings suggest that ternary decision trees can provide more reliable predictions in scenarios where decision boundaries are uncertain, making them applicable in fields such as healthcare and finance where accurate decision-making is critical. The ability to flag boundary-uncertain instances allows for tailored downstream processing, potentially leading to better outcomes.
Anytime Training with Schedule-Free Spectral Optimization
Optimization
Theory
Efficient ML
- SF-NorMuon outperforms SF-AdamW and matches or exceeds tuned AdamW optimizers.
- The method allows for high-quality checkpoints at any training stage without predefined horizons.
- Theoretical guarantees for stability in long-horizon training are established.
- Weight decay is identified as crucial for maintaining performance during extended training periods.
Anytime Training with Schedule-Free Spectral Optimization
Summary
This paper addresses the limitations of standard neural network training methods that rely on fixed learning-rate schedules, which can lead to inefficiencies and require extensive retuning as data availability changes. The authors introduce SF-NorMuon, a novel schedule-free spectral optimizer that outperforms the current state-of-the-art SF-AdamW optimizer. SF-NorMuon achieves comparable or superior performance to tuned AdamW optimizers across various parameter sizes and training horizons without the need for explicit learning-rate schedules. The paper also provides theoretical guarantees for the stability of schedule-free spectral dynamics and emphasizes the importance of weight decay for maintaining long-horizon stability. By enabling practitioners to obtain high-quality model checkpoints at any point during training, SF-NorMuon facilitates more practical horizon-free optimization, paving the way for continual learning in dynamic environments.
Methodology
The authors propose SF-NorMuon, a schedule-free spectral optimizer that utilizes spectral norms to optimize weight matrices in neural networks. The method operates without explicit learning-rate schedules, relying instead on a single hyperparameter configuration. The optimization process is grounded in theoretical principles that ensure stability and convergence, particularly emphasizing the role of weight decay in long-horizon training.
Results
SF-NorMuon consistently outperformed SF-AdamW and matched the performance of well-tuned AdamW optimizers across various models (125M and 772M parameters) and training horizons (1–8× Chinchilla). The results demonstrate that the proposed method effectively closes the performance gap with traditional optimizers while maintaining the flexibility of schedule-free training.
Implications
The development of SF-NorMuon has significant implications for practitioners in machine learning, particularly in scenarios involving continual learning and dynamic data environments. By removing the need for fixed training horizons, this method allows for more adaptive and efficient training processes, which can lead to improved model performance in real-world applications.
Noise Schedule Design for Diffusion Models: An Optimal Control Perspective
Generative Models
Optimization
Theory
- The paper formulates noise schedule design as an optimal control problem, enhancing theoretical understanding.
- It establishes that a broader class of noise schedules can achieve O(d/n) sampling error bounds.
- The introduction of Affine-Coupled Schedules (ACS) allows for systematic tuning of noise schedules.
- Empirical results show that optimized schedules outperform traditional heuristic methods in image generation tasks.
Noise Schedule Design for Diffusion Models: An Optimal Control Perspective
Summary
This paper presents a novel framework for analyzing and designing noise schedules in diffusion models through the lens of optimal control theory. The authors reformulate the noise schedule design problem as an optimal control problem, where the state is represented by the Fisher information of the diffusion process, evolving according to an ordinary differential equation (ODE), and the control input corresponds to the noise schedule. The objective is to minimize a functional involving Fisher information, which serves as an upper bound on the Kullback-Leibler (KL) sampling error. By solving this optimal control problem, the authors derive sufficient conditions for noise schedules that can achieve a sampling error of O(d/n), where d is the data dimension and n is the number of discretization steps. This result extends existing theoretical guarantees to a broader class of noise schedules, including those commonly used in practice. The authors also introduce a family of noise schedules termed Affine-Coupled Schedules (ACS), which generalize existing empirical schedules and allow for parameter tuning. Empirical results demonstrate that these newly designed schedules yield improved Fréchet Inception Distance (FID) scores on image generation tasks compared to standard heuristic approaches.
Methodology
The authors utilize optimal control theory to reformulate the noise schedule design problem, employing tools from probability and information theory, such as Bochner's formula and the concavity of entropy power. They derive a one-dimensional ODE governing the Fisher information trajectory and analyze it under specific data distribution assumptions to obtain closed-form expressions for noise schedules.
Results
The study demonstrates that the proposed optimal control framework leads to noise schedules that achieve O(d/n) sampling error bounds. The Affine-Coupled Schedules (ACS) derived from this framework generalize existing empirical schedules and allow for parameter tuning, resulting in improved FID scores in image generation benchmarks.
Implications
The findings suggest that a principled approach to noise schedule design can bridge the gap between theoretical guarantees and practical implementations in diffusion models, potentially leading to advancements in generative modeling and applications in image synthesis and other domains.
What Linear Probes Miss: Multi-View Probing for Weight-Space Learning
Theory
Efficient ML
Computer Vision
- Identifies limitations of first-order single-view probing methods, which can lead to indistinguishable representations for distinct weight matrices.
- Introduces MVProbe, a multi-perspective framework that integrates first-order and Gram-based interaction views for enhanced representation learning.
- Demonstrates state-of-the-art performance on the Model Jungle benchmark across diverse architectures, including ResNet and Stable Diffusion LoRA.
- Provides a principled per-sample standardization scheme to balance contributions from different probing views.
What Linear Probes Miss: Multi-View Probing for Weight-Space Learning
Summary
The paper addresses the challenges of model selection in the rapidly growing landscape of open-source model repositories, where many models lack adequate documentation. It critiques existing probing methods that rely on single-view designs, which capture first-order structures but miss higher-order correlations in weight matrices. To overcome these limitations, the authors introduce MVProbe, a multi-view probing framework that combines first-order projections with interaction-aware Gram-based views. This approach is grounded in theoretical analysis and provides a principled standardization and fusion strategy to ensure balanced contributions from different probing branches. The authors validate MVProbe on the Model Jungle benchmark, demonstrating its superiority over the state-of-the-art ProbeX across various architectures, including both discriminative and generative models. The results indicate that MVProbe not only captures richer representations of weight matrices but also performs robustly across layers that are typically less informative for probing.
Methodology
The authors developed MVProbe, which incorporates multiple probing branches: direct projections from rows and columns, as well as Gram-based views that capture pairwise interactions. A theoretical framework was established to guide the standardization and fusion of these views, ensuring that the contributions from each perspective are balanced. The approach was validated using the Model Jungle dataset, focusing on single-layer probing.
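A rough sketch of how first-order and Gram-based views might be standardized and fused is shown below. The function names, the fixed probe matrices `U` and `V`, and the simple concatenation are assumptions for illustration; in the paper the probes are learned and the fusion scheme follows from the authors' analysis.

```python
import numpy as np


def standardize(v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-sample standardization: zero mean, unit variance within one view."""
    return (v - v.mean()) / (v.std() + eps)


def multi_view_features(W: np.ndarray, U: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Fuse first-order and Gram-based views of one weight matrix W (m x n).

    U (n x k) and V (m x k) are stand-ins for probe matrices (hypothetical
    fixed inputs here, learned in a real prober).
    """
    views = [
        W @ U,                # first-order view, shape m x k
        W.T @ V,              # first-order view, shape n x k
        U.T @ (W.T @ W) @ U,  # Gram view of pairwise column interactions, k x k
        V.T @ (W @ W.T) @ V,  # Gram view of pairwise row interactions, k x k
    ]
    # Standardize each view separately so that no branch dominates by scale.
    return np.concatenate([standardize(v.ravel()) for v in views])
```

Standardizing each view before concatenation keeps the quadratic Gram terms, which scale with the square of the weights, from swamping the linear projections.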
Results
MVProbe consistently outperformed existing methods, particularly ProbeX, across various architectures. It demonstrated robust performance even in layers where traditional probing methods struggled, achieving significant improvements in representation quality and model characterization.
Implications
The findings suggest that MVProbe can enhance model selection processes in open-source repositories by providing deeper insights into model properties directly from weight parameters. This could facilitate better utilization of shared models in practical applications, especially in scenarios where documentation is lacking.
Leveraging Foundation Models for Causal Generative Modeling
Generative Models
Computer Vision
Multimodal
- Introduction of FM-CGM, a modular framework for causal generative modeling using foundation models.
- Development of Causal Semantic Guidance (CSG) to ensure accurate propagation of semantic interventions.
- Demonstration of the framework's ability to perform zero-shot causal discovery and counterfactual generation.
- Empirical validation showing the framework's effectiveness in generating visually plausible counterfactual images.
Leveraging Foundation Models for Causal Generative Modeling
Summary
This paper introduces FM-CGM, a modular framework for causal generative modeling that leverages pretrained foundation models to enhance visual causal reasoning. The authors identify a gap in existing methods, which often lack a unified approach to integrate zero-shot reasoning capabilities of foundation models into causal generative modeling. FM-CGM consists of three main components: a concept extractor that identifies causal relationships from images, a concept manipulator that performs interventions on these concepts, and a counterfactual generator that creates images based on these interventions. The framework employs a reasoning model (Qwen3-VL) for causal inference and a text-to-image diffusion model (Stable Diffusion XL) for generating counterfactual images. A novel mechanism called Causal Semantic Guidance (CSG) is developed to ensure that interventions on concepts propagate correctly to related concepts while preserving invariant regions. The empirical results demonstrate that FM-CGM can effectively identify plausible causal structures and generate faithful counterfactual images, showcasing its potential for applications in reliable AI systems capable of counterfactual reasoning.
Methodology
The FM-CGM framework is structured around three core components: a concept extractor (Qwen3-VL) for inferring causal relationships, a concept manipulator for performing interventions, and a counterfactual generator (Stable Diffusion XL) for creating images based on these interventions. Causal Semantic Guidance (CSG) is employed to manage the propagation of changes across related concepts while maintaining the integrity of non-descendant concepts.
Results
The authors empirically demonstrate that FM-CGM can identify plausible causal structures and generate counterfactual images that are visually coherent and semantically aligned with the intended causal interventions. The results indicate that the framework is capable of minimal and faithful counterfactual edits.
Implications
The proposed framework has significant implications for developing AI systems that require reliable and transparent causal reasoning capabilities. It can be applied in various domains, including scientific discovery, automated reasoning, and any context where understanding causal relationships is critical.
It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
Large Language Models
NLP
- Geopolitical bias in LLMs originates in the post-training phase, not pre-training.
- Bias shifts are influenced by the nationality of the model developers.
- The language used to prompt the model can amplify existing biases.
- Significant bias shifts were observed across multiple LLMs from different labs.
It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
Summary
This paper challenges the prevailing assumption that geopolitical bias in large language models (LLMs) arises primarily from the pre-training data. The authors conducted experiments on seven pairs of open-weight LLMs, comparing their base models (pre-training only) with their chat models (post-training included) across three languages: English, French, and Chinese. The findings reveal that geopolitical biases are introduced during the post-training phase, with significant shifts in bias direction corresponding to the nationality of the model developers. For instance, Alibaba's Qwen 2.5 model exhibited an 18-fold increase in pro-China bias after post-training. Additionally, the language of the prompt was found to amplify these biases, as demonstrated by the Mistral model becoming pro-France only when prompted in French. These results underscore the importance of transparency and oversight in the alignment processes of LLMs, as biases are not merely inherited from training data but actively shaped during post-training.
Methodology
The study utilized a paired-scenario forced-choice probe across seven open-weight LLM pairs from various labs. Each pair consisted of a base model and a post-trained chat variant, tested on 28 geopolitical scenarios in English, French, and Chinese. The models were selected based on specific criteria to ensure comparability, and the scenarios were designed to cover a range of geopolitical issues.
Results
The results indicated that six out of seven models showed a shift in geopolitical bias after post-training, with the most pronounced shift observed in Alibaba's Qwen 2.5 model. The magnitude of bias change was also dependent on the language of the prompt, with the Mistral model displaying pro-France bias only when prompted in French. Overall, the study found that biases are actively shaped during post-training rather than being solely inherited from pre-training data.
Implications
These findings suggest that the development and deployment of LLMs require careful consideration of post-training processes and the potential for biases to be introduced or amplified. There is a pressing need for transparency and auditing in the alignment processes of LLMs to ensure fair and unbiased representation of nations and cultures.
Convex Low-resource Accent-Robust Language Detection in Speech Recognition
Audio & Speech
Optimization
Efficient ML
- Introduction of Convex Language Detection (CLD) framework for robust language identification.
- Utilization of convex optimization techniques to ensure global optimality and fast training.
- Theoretical guarantees of robustness and stability against feature perturbations.
- Empirical validation showing high accuracy in low-resource dialect identification tasks.
Convex Low-resource Accent-Robust Language Detection in Speech Recognition
Summary
This paper addresses the challenges faced by spoken dialogue systems in accurately detecting languages, particularly underrepresented dialects and accents, which often lead to misidentification and subsequent failures in downstream tasks. The authors propose a novel framework called Convex Language Detection (CLD) that utilizes convex optimization techniques to enhance language detection in low-resource settings. The CLD framework is implemented using the Alternating Direction Method of Multipliers (ADMM) in JAX, ensuring efficient training and global optimality. The theoretical contributions include proving certified margin stability and robustness against feature perturbations, which guarantees stable predictions within a defined radius. Empirical evaluations demonstrate that CLD achieves 97-98% accuracy across five languages and twenty-four sub-dialects, showcasing its effectiveness in handling dialectal variations. This work aims to democratize access to robust spoken dialogue systems and improve user experiences across multicultural backgrounds.
Methodology
The CLD framework employs convex optimization techniques, specifically reformulating the language detection task as a convex program. It is implemented using the Alternating Direction Method of Multipliers (ADMM) in JAX, allowing for efficient training and inference. The authors derive logit-Lipschitz constants to ensure certified robustness and stability of predictions against perturbations.
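To make the idea of a certified radius concrete, the sketch below computes one for a plain linear classifier, where the logit gap between two classes is Lipschitz in the input with constant equal to the norm of the difference of their weight rows. This is a simplified stand-in written with NumPy: the CLD model, its logit-Lipschitz constants, and its JAX/ADMM implementation are not reproduced here, and `certified_radius` is a hypothetical helper.

```python
from typing import Tuple

import numpy as np


def certified_radius(W: np.ndarray, b: np.ndarray, x: np.ndarray) -> Tuple[int, float]:
    """Predicted class and an L2 radius within which it cannot change.

    Assumes a linear classifier with logits W @ x + b and at least two classes.
    """
    logits = W @ x + b
    c = int(np.argmax(logits))
    radii = []
    for j in range(W.shape[0]):
        if j == c:
            continue
        gap = logits[c] - logits[j]           # margin over competing class j
        lip = np.linalg.norm(W[c] - W[j])     # Lipschitz constant of that margin
        radii.append(gap / lip if lip > 0 else np.inf)
    return c, float(min(radii))
```

Any perturbation with L2 norm strictly below the returned radius leaves the predicted class unchanged, because no competing logit can close its gap.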
Results
The CLD framework achieved 97-98% accuracy in language detection tasks across five languages and twenty-four sub-dialects, demonstrating significant improvements in sample efficiency and robustness to dialectal variations in low-resource settings.
Implications
The development of the CLD framework has the potential to enhance the performance of spoken dialogue systems in diverse linguistic contexts, improving accessibility and user trust in technology across multicultural societies. It also opens avenues for further research in robust language detection and optimization techniques in speech recognition.
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
Large Language Models
Reinforcement Learning
Generative Models
- CoSPlay is a GT-free and training-free framework for code generation.
- It employs a cooperative self-play mechanism to improve both code candidates and self-generated unit tests.
- The framework demonstrates significant performance improvements over existing RLVR models.
- CoSPlay shows scalability and generalizability across different model backbones.
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
Summary
The paper introduces CoSPlay, a novel framework designed to enhance code generation in large language models (LLMs) without relying on ground-truth unit tests (GT UTs). Traditional reinforcement learning with verifiable rewards (RLVR) and test-time scaling (TTS) methods depend heavily on GT UTs, which are costly to curate and limit scalability. CoSPlay addresses this by enabling a cooperative self-play mechanism that allows self-generated unit tests to evolve alongside code candidates. The framework operates in three stages: first, it generates diverse solution ideas and corresponding unit test ideas based on potential failure modes; second, it employs an execution matrix to iteratively refine both codes and unit tests through mutual evaluation; and finally, it selects the best code from clusters based on output consensus when multiple codes are tied. Experimental results demonstrate that CoSPlay significantly improves the average best of N (BoN) from 22.1% to 33.2% and unit test accuracy from 14.6% to 78.3% on challenging benchmarks, outperforming existing RLVR models and showing scalability across various backbones. This work presents a promising direction for competitive code generation without the need for GT data.
Methodology
CoSPlay operates through a three-stage pipeline: (1) Exploration-Attack-Guided Code-UT Idea Generation, which generates diverse solution and unit test ideas; (2) Execution-Matrix-Driven Iterative Self-Play, where codes and unit tests evaluate each other to refine their quality; and (3) Output-Consensus-Based Cluster Selection, which resolves ties among code candidates by selecting the most reliable based on output agreement.
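The interplay between an execution matrix and output-consensus tie-breaking can be sketched as below. This is a simplified illustration under assumptions: candidates are plain Python callables, unit tests are `(input, expected_output)` pairs, scoring is a raw pass count, and the iterative refinement of codes and tests is omitted. None of the names come from the paper.

```python
from collections import defaultdict
from typing import Any, Callable, List, Tuple

UnitTest = Tuple[Any, Any]  # (input, expected output), possibly wrong


def safe_run(code: Callable[[Any], Any], arg: Any) -> Any:
    """Run a candidate, mapping any exception to None."""
    try:
        return code(arg)
    except Exception:
        return None


def execution_matrix(codes: List[Callable], tests: List[UnitTest]) -> List[List[bool]]:
    """matrix[i][j] is True when candidate i passes self-generated test j."""
    return [[safe_run(code, arg) == expected for arg, expected in tests]
            for code in codes]


def select_by_consensus(codes: List[Callable], tests: List[UnitTest]) -> int:
    """Pick the top-scoring candidate, breaking ties by output agreement."""
    scores = [sum(row) for row in execution_matrix(codes, tests)]
    best = max(scores)
    tied = [i for i, s in enumerate(scores) if s == best]
    # Cluster tied candidates by their raw outputs on the test inputs.
    clusters = defaultdict(list)
    for i in tied:
        outputs = tuple(repr(safe_run(codes[i], arg)) for arg, _ in tests)
        clusters[outputs].append(i)
    largest = max(clusters.values(), key=len)
    return largest[0]
```

Clustering on raw outputs rather than on pass/fail matters when a self-generated test is itself wrong: candidates that fail the same faulty test can still be separated by what they actually returned.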
Results
CoSPlay improved average BoN from 22.1% to 33.2% and unit test accuracy from 14.6% to 78.3% on four benchmarks. It also outperformed the RLVR model CURE-7B and demonstrated continued performance gains with increased output-token budgets.
Implications
The findings suggest that CoSPlay could lead to more efficient and scalable code generation systems that do not require extensive ground-truth data, making it applicable in various coding and software development contexts.
Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation
Optimization
- RankMixer suffers from embedding collapse, limiting its expressivity and scalability.
- RankElastor introduces parameterized full mixing and GLU-improved P-FFNs to enhance representation quality.
- Empirical results show that RankElastor consistently outperforms RankMixer in CTR prediction tasks.
- The architecture demonstrates improved effective rank, indicating better mitigation of representation collapse.
Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation
Summary
This paper addresses the challenge of scaling recommendation models, focusing on the limitations of the RankMixer architecture, which suffers from embedding collapse due to low effective rank in learned representations. The authors identify that rigid token mixing and per-token feedforward networks (P-FFNs) contribute to a damped oscillatory trajectory in effective-rank evolution across layers. To overcome these issues, they propose RankElastor, a novel architecture that enhances spectral robustness and mitigates collapse. RankElastor introduces parameterized full mixing for expressive token mixing and GLU-improved P-FFNs to stabilize representation spectra. Extensive experiments on large-scale datasets demonstrate that RankElastor significantly improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior, outperforming RankMixer and other baselines in terms of accuracy and effective rank.
Methodology
The authors conducted a theoretical analysis of the RankMixer architecture to understand the causes of embedding collapse. They then designed RankElastor, incorporating parameterized full mixing and GLU-improved P-FFNs to enhance spectral robustness. Extensive experiments were performed on industrial-scale datasets (Criteo and Avazu) to evaluate the performance of RankElastor against strong baselines.
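Effective rank, the quantity tracked across layers here, can be computed as in the sketch below. It uses the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values); whether the paper uses exactly this definition is an assumption.

```python
import numpy as np


def effective_rank(X: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a representation matrix X."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, non-negative
    p = s / (s.sum() + eps)                 # normalize to a distribution
    p = p[p > eps]                          # drop numerically zero entries
    return float(np.exp(-(p * np.log(p)).sum()))
```

A matrix whose singular values are all equal reaches the maximum (its full rank), while a collapsed representation dominated by one direction scores close to 1.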
Results
RankElastor achieved an AUC gain of over 0.001 in CTR prediction compared to the strongest baseline, demonstrating significant improvements in recommendation accuracy. Additionally, it produced representations with higher effective rank, indicating effective mitigation of embedding collapse and better scaling behavior.
Implications
The findings suggest that RankElastor can be effectively utilized in large-scale recommender systems, enhancing their performance and scalability. This architecture could be applied in various domains requiring robust recommendation capabilities, such as e-commerce and content platforms.
MedExpMem: Adapting Experience Memory for Differential Diagnosis
Multimodal
- MedExpMem enables VLMs to accumulate differential diagnosis expertise through an experience memory framework.
- The framework employs a two-phase memory construction process that mimics clinical learning.
- It organizes knowledge around diagnosis pairs, enhancing the model's ability to differentiate between similar conditions.
- Evaluation on a radiology benchmark shows consistent accuracy improvements, validating the approach's effectiveness.
MedExpMem: Adapting Experience Memory for Differential Diagnosis
Summary
The paper introduces MedExpMem, an innovative experience memory framework designed to enhance the capabilities of vision-language models (VLMs) in the context of differential diagnosis in medical settings. Unlike traditional models that rely on static knowledge, MedExpMem allows diagnostic agents to accumulate and refine their expertise through a structured memory system that captures insights from their own diagnostic failures. The framework operates in two phases: the first phase identifies knowledge gaps through zero-shot diagnosis, while the second phase involves reflective re-diagnosis to consolidate learning and correct reasoning errors. This approach mirrors the clinical learning process of physicians, enabling agents to retrieve relevant experiences during new diagnostic encounters. The authors evaluate MedExpMem on a radiology benchmark, demonstrating consistent accuracy improvements across various models, highlighting its effectiveness in addressing the dynamic needs of medical diagnosis without requiring parameter updates.
Methodology
The methodology involves a two-phase construction process for experience memory. In Phase I, the diagnostic agent performs zero-shot diagnosis to identify reasoning blind spots. In Phase II, it revisits erroneous cases to refine its understanding and consolidate reliable patterns. The experience memory is structured around diagnosis pairs, capturing key discriminators, decision rules, and reasoning error patterns derived from the agent's own diagnostic failures.
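A memory organized around diagnosis pairs could look roughly like the sketch below. The class and field names are hypothetical, the entries are free-text notes, and the two model-driven phases (zero-shot diagnosis and reflective re-diagnosis) are left out; only the storage and retrieval structure is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class Experience:
    discriminators: List[str] = field(default_factory=list)
    decision_rules: List[str] = field(default_factory=list)
    error_patterns: List[str] = field(default_factory=list)


class ExperienceMemory:
    """Stores experience keyed by an unordered pair of diagnoses."""

    def __init__(self) -> None:
        self._store: Dict[FrozenSet[str], Experience] = {}

    def record(self, predicted: str, correct: str, discriminator: str,
               rule: str, error_pattern: str) -> None:
        """Add what was learned from one misdiagnosis (predicted != correct)."""
        entry = self._store.setdefault(frozenset((predicted, correct)), Experience())
        entry.discriminators.append(discriminator)
        entry.decision_rules.append(rule)
        entry.error_patterns.append(error_pattern)

    def retrieve(self, candidates: List[str]) -> List[Experience]:
        """Return experience for every stored pair fully covered by candidates."""
        pool = set(candidates)
        return [exp for pair, exp in self._store.items() if pair <= pool]
```

Keying on an unordered pair means the same entry is found whichever of the two diagnoses the agent is currently leaning towards.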
Results
The evaluation of MedExpMem on a radiology benchmark demonstrated consistent accuracy improvements of up to 7.0% across various VLMs. Analytical experiments confirmed the quality and robustness of the accumulated experience, showcasing the framework's effectiveness in enhancing differential reasoning capabilities.
Implications
MedExpMem has significant implications for the development of medical AI systems, particularly in enhancing diagnostic accuracy and adaptability in clinical settings. By enabling VLMs to learn from their experiences, it addresses the challenges of static knowledge representation and supports continuous learning in privacy-sensitive environments.
Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations
Interpretability
- Introduction of 'alike parts' for local explanations that highlights shared features between instances and prototypes.
- Augmentation of global prototype selection with a feature importance term to enhance diversity in feature attributions.
- Evaluation on six benchmark datasets showing that diversity in feature importance does not compromise model fidelity.
- Significant extension of previous work with a broader evaluation of algorithms and a more extensive experimental analysis.
Summary
This paper presents a novel framework for enhancing the interpretability of machine learning models through prototype-based explanations. The authors address the limitations of existing prototype methods, which often lack feature-level granularity. They introduce 'alike parts', a method that utilizes feature importance scores to identify and highlight the most relevant shared features between a classified instance and its nearest prototype, thereby guiding user attention for local explanations. Additionally, they augment the global prototype selection process by incorporating a feature importance term into the objective function, promoting diversity in the feature attributions of selected prototypes. The proposed methods were evaluated on six benchmark datasets, demonstrating that the augmented selection process can maintain or even improve the prediction fidelity of the surrogate model, indicating that feature diversity does not compromise model fidelity. This work significantly extends previous research by evaluating multiple prototype generation algorithms, exploring a broader range of feature importance algorithms, and conducting a comprehensive experimental analysis.
Methodology
The authors employed a model-agnostic explanation method to compute feature importance for black-box models. They developed 'alike parts' to identify overlapping features between instances and prototypes for local explanations. For global explanations, they modified the prototype selection objective function to include a feature importance term, promoting diversity in the selected prototypes.
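One way to picture the local 'alike parts' step is as a filter followed by a ranking. This is a minimal sketch, assuming numeric features on a comparable scale; the function name, the closeness tolerance `tol`, and the `top_k` cut-off are hypothetical choices rather than the paper's definitions.

```python
def alike_parts(instance, prototype, importance, top_k=3, tol=0.1):
    """Return up to top_k features on which instance and prototype agree,
    ranked by absolute feature importance (most important first)."""
    shared = [
        name
        for name in instance
        if name in prototype and abs(instance[name] - prototype[name]) <= tol
    ]
    # Python's sort is stable, so ties keep the instance's feature order.
    shared.sort(key=lambda name: abs(importance.get(name, 0.0)), reverse=True)
    return shared[:top_k]


# Hypothetical instance, nearest prototype, and importance scores.
instance = {"f1": 0.50, "f2": 0.30, "f3": 0.90}
prototype = {"f1": 0.55, "f2": 0.80, "f3": 0.88}
importance = {"f1": 0.2, "f2": 0.7, "f3": 0.5}
highlighted = alike_parts(instance, prototype, importance)
```

Here `f2` is the most important feature overall but is not highlighted, because the instance and the prototype disagree on it.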
Results
The experiments revealed that the proposed methods maintained or improved the prediction fidelity of the surrogate model across six benchmark datasets, demonstrating that incorporating feature diversity into prototype selection does not detract from the model's performance.
Implications
This research has significant implications for the field of Explainable Artificial Intelligence (XAI), particularly in enhancing the interpretability of black-box models in high-stakes applications. By improving the granularity of prototype explanations, it fosters greater trust and understanding in AI systems, which is crucial for their safe deployment in critical areas.
Adaptive Mass-Segmented KV Compression for Long-Context Reasoning
Large Language Models
NLP
Efficient ML
- Identifies 'Region Wipe-out' as a critical failure mode in KV cache compression.
- Proposes AMS KV Compression, focusing on region-aware quota allocation.
- Introduces EMA-based stabilization to maintain coherence in reasoning.
- Demonstrates strong performance improvements across diverse reasoning tasks.
Summary
The paper addresses the challenge of linear growth of the key-value (KV) cache during long-form reasoning in large language models (LLMs), which can significantly degrade inference performance. Existing KV compression methods often rely on global Top-k selection, leading to a phenomenon termed 'Region Wipe-out,' where essential reasoning segments are evicted, disrupting logical coherence. To counter this, the authors propose Adaptive Mass-Segmented (AMS) KV Compression, which shifts the focus from token-level competition to region-aware quota allocation. AMS partitions the KV cache based on the spatial distribution of attention mass, ensuring that critical reasoning segments are preserved. Additionally, an Exponential Moving Average (EMA)-based smoothing mechanism is introduced to stabilize segment boundaries during iterative decoding. The AMS framework is designed to be modular and can be integrated with existing KV compression methods without architectural changes. Extensive experiments demonstrate that AMS effectively reduces structural fragmentation and enhances model performance across various tasks, including mathematical reasoning, code completion, and open-domain question answering, particularly under constrained KV budgets.
Methodology
The AMS framework employs a region-aware quota allocation strategy, where the KV cache is adaptively segmented based on the distribution of attention mass. It allocates retention quotas to segments before performing local token selection, thus protecting important reasoning regions. The EMA-based smoothing mechanism is used to stabilize allocation decisions across compression events.
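The quota-then-select idea can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: segment boundaries are taken as given, quotas are proportional to (optionally EMA-smoothed) segment mass, and the per-segment `floor` and the decay `beta` are hypothetical parameters.

```python
def ema_update(previous, current, beta=0.9):
    """Exponential moving average of per-segment attention mass."""
    return [beta * p + (1.0 - beta) * c for p, c in zip(previous, current)]


def allocate_quotas(segment_mass, budget, floor=1):
    """Split a token budget across segments in proportion to their mass,
    guaranteeing each segment at least `floor` tokens."""
    n = len(segment_mass)
    if budget < floor * n:
        raise ValueError("budget too small for the per-segment floor")
    remaining = budget - floor * n
    total = sum(segment_mass)
    if total <= 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * m / total for m in segment_mass]
    quotas = [floor + int(s) for s in shares]
    # Hand leftover tokens to the largest fractional remainders.
    leftover = budget - sum(quotas)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        quotas[i] += 1
    return quotas


def select_tokens(scores, boundaries, quotas):
    """Keep the top-scoring tokens inside each segment, up to its quota.
    A quota larger than the segment simply keeps the whole segment."""
    kept = []
    for (start, end), quota in zip(boundaries, quotas):
        best = sorted(range(start, end), key=lambda i: scores[i], reverse=True)[:quota]
        kept.extend(sorted(best))
    return kept


# Toy cache: the second segment has uniformly low scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.01, 0.02, 0.03, 0.04]
boundaries = [(0, 4), (4, 8)]
global_top4 = sorted(sorted(range(8), key=lambda i: scores[i], reverse=True)[:4])
regional = select_tokens(scores, boundaries, allocate_quotas([3.0, 0.1], budget=4))
```

In this toy case, global Top-k keeps only tokens from the first segment, while the quota-based selection retains one token from the second segment, which is the kind of wipe-out the summary describes.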
Results
The AMS approach shows significant improvements in performance metrics across various tasks. For instance, on the MATH500 dataset, AMS-Expected outperforms previous methods by up to 16.0 points and enhances the performance of existing scorers like TriAttention by up to 6.4 points. AMS also maintains or improves latency compared to existing gather-based policies while reducing issues like repetition collapse.
Implications
The AMS framework has the potential to enhance the efficiency and effectiveness of long-context reasoning in LLMs, making it applicable in various domains requiring coherent and logical reasoning over extended text, such as automated tutoring systems, advanced code generation, and complex question answering systems.
Learning Individual Dynamics from Sparse Cross-Sectional Snapshots
Time Series
Theory
Efficient ML
- Introduces CADENCE, a framework for inferring individual dynamics from sparse data.
- Establishes identifiability guarantees for single-timepoint trajectory inference.
- Combines a score-based spatial encoder with a Soft Mixture-of-Experts router.
- Demonstrates superior performance compared to state-of-the-art sequential models.
Summary
This paper addresses the challenge of predicting individual dynamics from sparse cross-sectional data, which is common in various fields such as biomedical studies and engineering. Traditional methods either require dense longitudinal data or lose individual dynamics by focusing on aggregate populations. The authors introduce CADENCE, a probabilistic framework that recovers continuous individual trajectories by leveraging static individual-level contexts. CADENCE combines a score-based spatial encoder with a Soft Mixture-of-Experts (SMoE) router, allowing for the joint identification of individual dynamical parameters. The paper provides identifiability guarantees for single-timepoint trajectory inference and demonstrates that CADENCE can effectively learn from extremely sparse data, outperforming state-of-the-art sequential models that rely on dense datasets across multiple benchmarks, including epidemiology and ecology.
Methodology
The methodology involves a two-stage training scheme that decouples the learning process, reducing computational costs. CADENCE employs a score-based bijective Probability Flow ODE to eliminate ambiguities and a SMoE router to identify individual dynamics. The framework is grounded in theoretical guarantees that allow for effective inference from minimal observations.
Results
CADENCE matches or exceeds the performance of existing sequential models trained on dense, full-trajectory datasets across seven benchmarks. The framework successfully infers individual trajectories from as few as three observations per unit, demonstrating its effectiveness in handling sparse data.
Implications
The findings suggest that CADENCE can be applied in fields where longitudinal data is scarce or difficult to obtain, such as personalized medicine, epidemiology, and engineering diagnostics. This approach could enhance predictive modeling and decision-making in various quantitative sciences.
Uncovering the Latent Potential of Deep Intermediate Representations
NLP
Large Language Models
Multimodal
- Task-specific subspaces formed by intermediate layers often outperform the final layer in downstream tasks.
- LOES effectively identifies optimal layer combinations, enhancing performance by minimizing residual error under geometric constraints.
- The proposed GeoReg loss function stabilizes representation geometry during fine-tuning, preventing feature collapse.
- Performance improvements scale with model depth and are applicable across various architectures and data modalities.
Summary
This paper investigates the hierarchical nature of representations learned by foundation models and challenges the conventional practice of relying solely on the final layer for downstream tasks. The authors demonstrate that task-relevant information is distributed non-monotonically across layers, necessitating a more nuanced approach to layer selection. They introduce Layer-wise Optimal Embedding Selection (LOES), a spectral method that identifies task-discriminative subspaces by minimizing residual error while adhering to orthogonality and isotropy constraints. Additionally, they propose Geometric Regularization Loss (GeoReg) to stabilize representation geometry during fine-tuning. The empirical results show that LOES consistently outperforms standard baselines across various architectures and modalities, revealing the distribution of semantic factors across layers and enhancing interpretability in cross-lingual and cross-modal contexts. The findings suggest that the geometry of layer-wise embeddings is crucial for effective knowledge representation and transfer in deep learning models.
Methodology
The authors developed LOES, a constructive spectral framework that formulates layer selection as a ridge-residual optimization problem. This method analyzes the eigenspectrum of layer-wise representations to identify optimal combinations of embeddings. GeoReg is introduced as an auxiliary loss to enforce a simplicial structure on class manifolds, enhancing stability during fine-tuning.
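For orientation, the standard ridge residual for a single layer is written out below, along with its spectral form. The notation is an assumption made here, not the paper's: H_ℓ is the matrix of layer-ℓ embeddings, Y the target matrix, and λ the ridge strength. The orthogonality and isotropy constraints mentioned in the summary are not modelled.

```latex
\hat{W}_\ell = \arg\min_{W} \; \lVert Y - H_\ell W \rVert_F^2 + \lambda \lVert W \rVert_F^2,
\qquad
r_\ell(\lambda) = \lVert Y - H_\ell \hat{W}_\ell \rVert_F^2 .

% With the thin SVD H_\ell = U \Sigma V^\top (singular values \sigma_i, left singular vectors u_i):
r_\ell(\lambda)
  = \sum_i \left( \frac{\lambda}{\sigma_i^2 + \lambda} \right)^{2} \lVert u_i^\top Y \rVert_2^2
  + \lVert (I - U U^\top)\, Y \rVert_F^2 .
```

The second expression shows why the eigenspectrum matters: directions with large σ_i that align with Y contribute little residual, so a layer scores well when its dominant spectral directions carry the task signal.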
Results
LOES consistently outperformed standard baselines, with performance gains increasing with model depth. The method revealed how semantic factors are distributed across layers, enabling improved interpretability in cross-lingual and cross-modal analyses. The findings indicate that effective transfer learning relies on a careful selection of intermediate layers rather than solely the final layer.
Implications
The insights from this research could lead to more effective transfer learning strategies in various applications, enhancing model performance across different tasks and modalities. The methods proposed may also contribute to advancements in interpretability, allowing practitioners to better understand how models encode information.
Quantitative coronary calcification analysis for prediction of myocardial ischemia using non-contrast CT calcium scoring
Interpretability
- Developed a machine learning framework for predicting myocardial ischemia from non-contrast CT calcium scoring.
- Analyzed 1,375 patients, evaluating 74 variables including clinical data, Agatston scores, and calcium-omics features.
- Achieved high precision (98.9%) and significant improvement in predictive performance with calcium-omics features.
- Demonstrated the strong association of calcified arteries with myocardial ischemia through logistic regression analysis.
Summary
This study presents a novel machine learning framework aimed at predicting myocardial ischemia using non-contrast computed tomography calcium scoring (CTCS). The research analyzed data from 1,375 patients who underwent both CTCS and regadenoson stress cardiac positron emission tomography (PET) myocardial perfusion imaging. A total of 74 variables, including clinical data, Agatston scores, and calcium-omics features, were evaluated. The authors employed XGBoost with Shapley Additive exPlanations (SHAP) to identify relevant features for model training. The final predictive model included the Agatston score, eight calcium-omics features, and age, achieving a precision of 98.9%, sensitivity of 79.2%, and an F1 score of 87.7%. Notably, the inclusion of calcium-omics features significantly enhanced predictive performance compared to models using only clinical variables or the Agatston score. The study highlights the strong association between the number of calcified arteries and myocardial ischemia, despite its lower ranking in SHAP analysis. This research pioneers the use of quantitative coronary calcification analysis from routine CTCS scans for predicting myocardial ischemia, and suggests that calcium-omics features can improve cardiovascular risk stratification.
Methodology
The study utilized a retrospective analysis of patient data, employing XGBoost for feature selection and model training. Relevant features were identified using SHAP, and predictive models were evaluated through 5-fold cross-validation.
Results
The final model achieved a precision of 98.9%, sensitivity of 79.2%, and an F1 score of 87.7%. The addition of calcium-omics features significantly improved predictive performance compared to traditional models. The number of calcified arteries was found to have a strong association with myocardial ischemia.
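As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and sensitivity (recall). The helper below is a generic formula and is not code from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and sensitivity, expressed as fractions.
f1_from_reported = f1_score(0.989, 0.792)
```

Plugging in the reported values gives roughly 0.880, close to the reported 87.7%. A small gap like this could arise from rounding or from averaging metrics per fold under 5-fold cross-validation, though the summary does not say which.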
Implications
The findings suggest that quantitative analysis of coronary calcifications from routine CTCS scans can enhance cardiovascular risk stratification, potentially leading to more accessible and effective screening for myocardial ischemia.
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
NLP
Efficient ML
- Introduces a structured-sparse view of attention in entity tracking.
- Develops a blockwise inverse formulation that achieves subquadratic sequence complexity.
- Demonstrates practical speedups in inference time while maintaining accuracy.
- Identifies a limitation related to the capacity of attention heads in tracking multiple properties.
Summary
This paper addresses the challenge of entity tracking in long sequences, which involves maintaining and updating latent states for entities and their attributes. Traditional attention mechanisms in Transformers incur quadratic complexity due to dense evaluations over all token pairs. The authors observe that learned attention maps exhibit a structured and sparse nature, with significant mass concentrated in local block-diagonal neighborhoods. To leverage this structure, they propose a blockwise evaluation of a resolvent-style operator that maintains exact within-block interactions while efficiently routing cross-block interactions through a reduced system. This approach results in a complexity of O(n^(4/3)d) for sequence length n and dimension d, which is subquadratic in n, and O(n^(7/3)) when d is approximately equal to n; standard dense attention costs O(n^2 d) and O(n^3) in the same two settings. Empirical evaluations demonstrate that their method achieves comparable accuracy to dense operators while reducing wall-clock time by 12-29% and being up to 2.4 times faster than compact dense Transformers. The paper also discusses the limitations of the approach, particularly when the number of evolving properties exceeds the number of attention heads.
Methodology
The authors propose a blockwise evaluation scheme for attention that decomposes the attention matrix into a block-diagonal term capturing local interactions and an off-block residual for cross-block interactions. They utilize a resolvent-style operator to aggregate multi-hop state propagation within a single layer, allowing for efficient computation while preserving the semantics of attention.
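The decomposition step can be illustrated on a toy matrix. The sketch below only splits an attention matrix into its block-diagonal part and the off-block residual, and measures how much mass is local. It does not implement the resolvent-style operator or the reduced cross-block system, and the function names are hypothetical.

```python
def split_blocks(attn, block):
    """Split a square matrix into a block-diagonal part and an off-block residual."""
    n = len(attn)
    local = [
        [attn[i][j] if i // block == j // block else 0.0 for j in range(n)]
        for i in range(n)
    ]
    residual = [[attn[i][j] - local[i][j] for j in range(n)] for i in range(n)]
    return local, residual


def local_mass_fraction(attn, block):
    """Fraction of total attention mass that falls inside the diagonal blocks."""
    local, _ = split_blocks(attn, block)
    return sum(map(sum, local)) / sum(map(sum, attn))


# Toy row-stochastic attention map with mostly local mass (made-up numbers).
attn = [
    [0.6, 0.3, 0.1, 0.0],
    [0.2, 0.7, 0.0, 0.1],
    [0.0, 0.1, 0.5, 0.4],
    [0.1, 0.0, 0.3, 0.6],
]
local, residual = split_blocks(attn, block=2)
fraction = local_mass_fraction(attn, block=2)
```

When most of the mass is local, as in this toy example, the residual is sparse and small, which is the structure the blockwise scheme relies on. On the complexity arithmetic: substituting d = n into n^(4/3) · d gives n^(7/3).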
Results
The proposed method matches the accuracy of dense attention operators while achieving a 12-29% reduction in wall-clock time and up to 2.4 times faster inference compared to compact dense Transformers. Ablation studies reveal the impact of block size and model capacity on performance.
Implications
This work has significant implications for improving the efficiency of entity tracking in various applications such as document-level information extraction, dialogue state tracking, and long-context question answering. The structured-sparse attention mechanism can enhance the performance of models dealing with long sequences while reducing computational costs.
Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
Large Language Models
Reinforcement Learning
Optimization
- Memory-R2 addresses the challenge of fair credit assignment in memory-augmented LLM agents.
- The LoGo-GRPO algorithm combines local and global optimization for improved trajectory comparisons.
- A shared-parameter architecture enables efficient co-learning between memory extraction and management.
- Memory construction is formulated as a multi-step decision process, enhancing flexibility and accuracy.
Summary
The paper introduces Memory-R2, a training framework designed for long-horizon memory-augmented large language model (LLM) agents. The challenge addressed is the non-stationarity introduced by memory in multi-session environments, which complicates trajectory-level comparisons and credit assignment in reinforcement learning (RL). The core algorithm, LoGo-GRPO, combines local and global group-relative optimization to provide fairer credit assignment by ensuring that comparisons are made from identical intermediate memory states. This approach allows for more accurate supervision of memory operations. Additionally, Memory-R2 optimizes the entire memory lifecycle, including memory formation and evolution, through a shared-parameter architecture that consists of a fact extractor and a memory manager. The framework treats memory construction as a multi-step decision process, allowing for iterative refinement. To enhance training stability, a progressive curriculum is implemented, gradually increasing the training horizon from 8 to 32 sessions. The results demonstrate that Memory-R2 significantly improves the performance of memory-augmented LLM agents in long-horizon settings, achieving better accuracy and reduced inference latency.
Methodology
The methodology involves the development of the Memory-R2 framework, which utilizes the LoGo-GRPO algorithm for credit assignment. It incorporates a shared-parameter architecture for a fact extractor and memory manager, and employs a progressive curriculum to scale training sessions. The framework emphasizes local rerollouts for fair comparisons and joint optimization of memory operations.
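The group-relative part of GRPO-style training normalises each rollout's reward against the other rollouts in its group. The sketch below shows that normalisation only. How LoGo-GRPO combines the local and global groups is not specified in the summary, so no combination rule is shown, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one comparison group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Local group: rerollouts that all start from the same intermediate memory
# state, so differences in reward are attributable to the memory operation.
local_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Global group: full multi-session trajectories compared end to end.
global_adv = group_relative_advantages([0.2, 0.9, 0.4])
```

Comparing rollouts that share a starting memory state avoids crediting or blaming a memory operation for differences that were already present before it ran.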
Results
The implementation of Memory-R2 resulted in improved accuracy and reduced inference latency for memory-augmented LLM agents. The framework demonstrated high data efficiency and effective long-horizon performance, validating the proposed methods for credit assignment and memory optimization.
Implications
The findings suggest that Memory-R2 can enhance the capabilities of LLM agents in applications requiring long-term memory retention and complex decision-making across multiple interactions. This has potential implications for areas such as conversational agents, personalized AI systems, and any application where maintaining context over time is critical.
ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
Large Language Models
Efficient ML
Optimization
- Introduces ModeSwitch-LLM, a controller for optimizing LLM inference on a single GPU.
- Achieves 2.10× latency speedup and 51.7% lower energy per token compared to FP16.
- Maintains accuracy close to FP16 with minimal increase in error.
- Demonstrates that rule-based routing is more effective than learned routing policies.
Summary
ModeSwitch-LLM introduces a lightweight request-boundary controller designed to enhance the efficiency of large language model (LLM) inference on a single GPU by dynamically routing requests to appropriate fixed inference modes. The system evaluates various configurations, including FP16, quantized modes, speculative decoding, and hybrid approaches, based on simple workload-level features without modifying the model architecture or requiring retraining. The authors benchmark ModeSwitch-LLM using the Meta-Llama-3.1-8B-Instruct model on an NVIDIA A100 GPU, demonstrating a mean latency speedup of 2.10× over the FP16 baseline and a significant reduction in energy consumption per token (51.7% lower). The accuracy of the model remains comparable to FP16, with only a slight increase in error. The study also compares rule-based routing with learned routing policies, finding that the former outperforms the latter due to lower overhead and better adherence to quality and resource constraints. Overall, the paper highlights the potential of simple request-aware routing to significantly improve inference efficiency while maintaining model quality.
Methodology
The methodology involves benchmarking various fixed LLM inference modes under a common single-GPU setup, utilizing a lightweight request-boundary controller that selects the optimal mode based on workload features such as prompt length and memory pressure. The evaluation is conducted on synthetic deployment-style workloads for efficiency and automatic benchmark workloads for quality preservation.
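A rule-based request-boundary controller of this kind can be very small. The sketch below is purely illustrative: the mode names follow the summary, but the rule order, the thresholds, and the `Request` fields are hypothetical and are not the paper's policy.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    memory_pressure: float  # fraction of GPU memory in use, 0.0 to 1.0


def choose_mode(req, high_pressure=0.85, short_prompt=512):
    """Pick one fixed inference mode per request using simple rules."""
    if req.memory_pressure >= high_pressure:
        return "quantized"      # shrink the memory footprint first
    if req.prompt_tokens <= short_prompt:
        return "speculative"    # assumed rule: short prompts are decode-dominated
    return "fp16"               # default baseline mode


modes = [
    choose_mode(Request(prompt_tokens=4096, memory_pressure=0.90)),
    choose_mode(Request(prompt_tokens=128, memory_pressure=0.30)),
    choose_mode(Request(prompt_tokens=4096, memory_pressure=0.30)),
]
```

Because the decision uses only features available before generation starts, it adds almost no overhead, which is one plausible reading of why the summary reports rule-based routing beating learned policies.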
Results
The results indicate that ModeSwitch-LLM provides a substantial improvement in inference efficiency, achieving a 2.10× mean latency speedup and a 0.48× mean energy ratio compared to the FP16 baseline. The accuracy remains close to FP16, with a mean delta of +0.17 percentage points, suggesting that the controller effectively balances efficiency and quality.
Implications
The findings suggest that implementing request-aware routing can significantly enhance the efficiency of LLM inference systems without the need for architectural changes or retraining. This approach can be particularly beneficial in environments where resource constraints are critical, such as cloud-based services or edge computing.
A mathematical theory of balancing relational generalization and memorization
Theory
- Introduction of a novel task paradigm, transitive inference with exceptions, to study relational generalization and memorization.
- Analytical characterization of kernel ridge regression models shows sensitivity to representational geometry in generalization tasks.
- Validation of theoretical insights in pretrained language models indicates systematic errors aligned with the proposed theory.
- Emphasis on the need for task paradigms that capture the complexity of balancing generalization and memorization.
Summary
This paper addresses the challenge of how learning systems, including humans and machine learning models, balance between relational generalization and memorization of exceptions. The authors introduce a novel task paradigm called 'transitive inference with exceptions' to investigate this balance. They analytically characterize the behavior of a simple neural network learning model, specifically kernel ridge regression, across various representations and task parameters. The findings reveal that while these models can successfully generalize, their performance is sensitive to the representational geometry. The authors validate their theoretical insights using pretrained language models fine-tuned on relational tasks, demonstrating that these models can generalize according to the transitive rule but also exhibit systematic mistakes predicted by the theory. The study emphasizes the need for new task paradigms that probe the ability to balance generalization and memorization, highlighting the complexity of real-world learning scenarios.
Methodology
The authors developed a theoretical model using kernel ridge regression to analyze how learning systems can balance relational generalization and memorization. They introduced a new task paradigm and conducted experiments with pretrained language models to validate their theoretical findings.
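The relational-generalization half of the task can be reproduced with a tiny kernel ridge regression. This is a minimal sketch under assumptions chosen here, not the paper's setup: pairs are encoded as one-hot(i) minus one-hot(j), the kernel is linear, training uses only adjacent pairs, and exceptions are not modelled.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def encode(i, j, n_items):
    """Represent the ordered pair (i, j) as one-hot(i) minus one-hot(j)."""
    x = [0.0] * n_items
    x[i] += 1.0
    x[j] -= 1.0
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_transitive_krr(n_items=7, ridge=0.1):
    """Fit KRR on adjacent pairs only and return a predictor for any pair."""
    # Label +1 when the first item ranks above the second, -1 otherwise.
    train = [(i, i + 1, 1.0) for i in range(n_items - 1)]
    train += [(i + 1, i, -1.0) for i in range(n_items - 1)]
    xs = [encode(i, j, n_items) for i, j, _ in train]
    ys = [y for _, _, y in train]
    # Dual solution: alpha = (K + ridge * I)^-1 y with a linear kernel K.
    gram = [
        [dot(a, b) + (ridge if r == c else 0.0) for c, b in enumerate(xs)]
        for r, a in enumerate(xs)
    ]
    alpha = solve(gram, ys)

    def predict(i, j):
        query = encode(i, j, n_items)
        return sum(a * dot(query, x) for a, x in zip(alpha, xs))

    return predict


predict = fit_transitive_krr()
```

With this encoding the model ranks unseen non-adjacent pairs correctly, and pairs that are further apart receive larger outputs. Other representations need not behave this way, which is the sensitivity to representational geometry that the summary highlights.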
Results
The study found that kernel ridge regression models can balance relational generalization and memorization, but their success depends strongly on the representational geometry. The pretrained language models generalized according to the transitive rule while also making systematic errors, as the theoretical framework predicts.
Implications
The findings suggest that understanding the balance between generalization and memorization is crucial for developing more robust learning systems. The proposed task paradigm can serve as a foundation for future research aimed at improving machine learning models' performance in complex decision-making scenarios.
Implicit Regularization of Mini-Batch Training in Graph Neural Networks
Graph Learning
Optimization
Efficient ML
- Random Node Sampling (RNS) matches or outperforms full-graph training on 8 out of 10 datasets.
- Backward error analysis reveals that mini-batch SGD implicitly minimizes a modified objective with a gradient-variance regularizer.
- RNS offers significant computational savings, achieving 2× to 12× speedups and up to 3× lower peak GPU memory usage.
- The study highlights the importance of the sampler choice in shaping the effective learning objective in GNNs.
Summary
This paper investigates the implications of mini-batch training in Graph Neural Networks (GNNs), focusing on the Random Node Sampling (RNS) method. Unlike traditional i.i.d. data training, mini-batch training on sampled subgraphs alters the topology and introduces boundary effects, which can complicate the optimization process. The authors demonstrate that RNS, which samples nodes uniformly at random, can match or outperform full-graph training across various datasets while being computationally efficient. Through backward error analysis, the paper reveals that mini-batch Stochastic Gradient Descent (SGD) implicitly minimizes a modified objective that combines the sampled loss with a regularization term based on the variance of mini-batch gradients. This finding reframes the choice of graph sampler as a form of implicit regularization, positioning RNS as a theoretically sound and practical approach for scalable GNN training. The authors benchmark RNS against other sampling strategies, showing its superior performance in terms of speed and memory usage, making it a strong candidate for default GNN training methods.
Methodology
The authors applied backward error analysis to the mini-batch training of GNNs, specifically focusing on the Random Node Sampling (RNS) method. They compared RNS with other sampling strategies across various datasets and GNN architectures, analyzing the effects on optimization dynamics and predictive performance.
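Random node sampling itself is simple to state in code. The sketch below draws a uniform node sample and keeps only the edges whose endpoints were both sampled. The function name and the edge-list format are assumptions made for illustration.

```python
import random


def random_node_subgraph(num_nodes, edges, batch_size, rng):
    """Sample nodes uniformly without replacement and return the induced subgraph.

    Returns the sampled node ids (sorted) and the induced edges, re-indexed
    to positions within the sample.
    """
    nodes = sorted(rng.sample(range(num_nodes), batch_size))
    position = {node: i for i, node in enumerate(nodes)}
    # Edges with an endpoint outside the sample are dropped; this is the
    # source of the boundary effects mentioned in the summary.
    sub_edges = [
        (position[u], position[v])
        for u, v in edges
        if u in position and v in position
    ]
    return nodes, sub_edges


# Toy path graph 0-1-2-...-9 with a mini-batch of 5 nodes.
rng = random.Random(0)
edges = [(i, i + 1) for i in range(9)]
nodes, sub_edges = random_node_subgraph(10, edges, batch_size=5, rng=rng)
```

Each mini-batch therefore sees a sparser graph than the full one, so the sampler changes the objective being optimised and not just the cost per step.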
Results
RNS matched or outperformed full-graph training on 8 out of 10 benchmarks, demonstrating lower variance in per-batch gradients and a more stable implicit objective compared to structure-aware samplers. The method also provided substantial improvements in training speed and memory efficiency.
Implications
The findings suggest that RNS can be effectively utilized for scalable GNN training in practical applications, potentially leading to broader adoption of mini-batch training methods in graph learning tasks. The insights on implicit regularization may also influence future research on optimization strategies in structured data settings.
CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection
Time Series
- CALAD introduces a channel-aware approach to anomaly detection, enhancing the relevance of different channels.
- The framework uses reconstruction errors to estimate channel importance without requiring labeled anomalies.
- A channel-wise contrastive augmentation strategy is employed to align learning with anomaly semantics.
- Combining contrastive learning with an auxiliary reconstruction head allows for the preservation of normal patterns.
Summary
The paper presents CALAD, a novel framework for multivariate time series anomaly detection that addresses the limitations of existing unsupervised methods which often treat all channels equally. Recognizing that different channels contribute variably to anomaly detection, CALAD incorporates a channel-aware approach by estimating channel relevance based on reconstruction errors from a transformer-based autoencoder. This relevance guides the construction of contrastive samples, allowing the model to focus on anomaly-relevant channels while being invariant to irrelevant variations. The framework employs a channel-wise augmentation strategy to create positive and negative samples, enhancing the learning process by aligning it with anomaly semantics. Additionally, CALAD integrates contrastive learning with an auxiliary reconstruction head, ensuring that normal temporal structures are preserved. Experimental results demonstrate that CALAD outperforms existing methods, particularly in scenarios with distribution shifts, showcasing its effectiveness in real-world applications.
Methodology
CALAD employs a transformer-based autoencoder to estimate channel relevance through reconstruction errors. It constructs contrastive samples based on this relevance, using a channel-wise augmentation strategy that preserves or perturbs anomaly-relevant channels. The framework combines contrastive learning with an auxiliary reconstruction objective to ensure the retention of normal temporal structures.
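The relevance-guided augmentation can be sketched as follows. This is an assumption-laden illustration: the reconstruction is taken as given (no autoencoder is trained), relevance is the normalised per-channel squared error, and Gaussian noise stands in for whatever perturbation the paper uses. `top_k` and `noise` are hypothetical parameters.

```python
import random


def channel_relevance(window, reconstruction):
    """Mean squared reconstruction error per channel, normalised to sum to 1."""
    n_channels = len(window[0])
    errors = [
        sum((row[c] - rec[c]) ** 2 for row, rec in zip(window, reconstruction)) / len(window)
        for c in range(n_channels)
    ]
    total = sum(errors)
    if total <= 0:
        return [1.0 / n_channels] * n_channels
    return [e / total for e in errors]


def make_views(window, relevance, top_k, noise, rng):
    """Build a positive view (relevant channels preserved) and a negative
    view (relevant channels perturbed) from one window."""
    ranked = sorted(range(len(relevance)), key=lambda c: relevance[c], reverse=True)
    relevant = set(ranked[:top_k])
    others = set(range(len(relevance))) - relevant

    def perturb(channels):
        return [
            [v + rng.gauss(0.0, noise) if c in channels else v for c, v in enumerate(row)]
            for row in window
        ]

    return perturb(others), perturb(relevant)


# Toy window: 2 time steps x 3 channels; channel 1 is poorly reconstructed.
window = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
reconstruction = [[1.0, 0.0, 3.0], [1.0, 0.0, 3.1]]
relevance = channel_relevance(window, reconstruction)
positive, negative = make_views(window, relevance, top_k=1, noise=0.5, rng=random.Random(0))
```

The positive view teaches invariance to changes in low-relevance channels, while the negative view marks changes in high-relevance channels as meaningful.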
Results
Experiments conducted on various real-world datasets indicate that CALAD consistently outperforms existing anomaly detection methods, particularly in scenarios where there are shifts in data distribution. The results highlight the framework's ability to effectively identify anomalies by leveraging channel-specific information.
Implications
The proposed CALAD framework has significant implications for real-world applications requiring reliable anomaly detection in multivariate time series data, such as industrial monitoring, predictive maintenance, and system reliability enhancement. Its channel-aware approach could lead to more robust and interpretable models in various domains.
Strong Teacher Not Needed? On Distillation in LLM Pretraining
NLP
Large Language Models
Theory
- Weaker teachers can improve stronger students under certain conditions.
- Stronger teachers do not always lead to better student performance; overtraining can be detrimental.
- Distillation enhances generalization more effectively than in-domain fitting.
- Teacher-student compatibility is crucial for effective knowledge transfer.
Summary
This paper challenges the conventional wisdom in knowledge distillation, particularly in the context of large language model (LLM) pretraining, which posits that stronger teachers always yield better students. The authors systematically investigate various teacher-student configurations by varying model sizes and training token budgets to explore the effectiveness of distillation. They find that weaker teachers can still enhance the performance of stronger students when appropriate loss mixing is applied. Additionally, they demonstrate that overly strong teachers may degrade the learning process, contradicting the traditional strong-to-weak assumption. The study reveals that distillation is more effective for improving generalization performance on out-of-distribution tasks compared to in-domain fitting, suggesting that the knowledge transfer during distillation is beneficial for broader applications beyond mere model compression.
Methodology
The authors conducted a systematic study by varying the architecture sizes and training token budgets of both teachers and students to create different teacher-student relationships, including strong-to-weak, same-level, and weak-to-strong configurations. They analyzed the impact of these configurations on the effectiveness of knowledge distillation through empirical experiments.
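The summary mentions loss mixing without giving its form. A common way to mix a ground-truth objective with a teacher signal is a convex combination of cross-entropy and a KL term, sketched below. The weight `alpha` and the function names are assumptions for illustration, and the paper's exact objective may differ.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mixed_distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """(1 - alpha) * cross-entropy on labels + alpha * KL(teacher || student)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    ce = -s[np.arange(len(targets)), targets].mean()
    kl = (np.exp(t) * (t - s)).sum(axis=-1).mean()
    return (1.0 - alpha) * ce + alpha * kl

# Toy check with uniform logits over 4 classes.
logits = np.zeros((2, 4))
targets = np.array([0, 3])
loss_labels_only = mixed_distillation_loss(logits, logits, targets, alpha=0.0)
loss_teacher_only = mixed_distillation_loss(logits, logits, targets, alpha=1.0)
```

With `alpha=0` the loss reduces to plain cross-entropy, and with `alpha=1` it vanishes whenever student and teacher agree, so `alpha` controls how much a (possibly weaker) teacher is allowed to shape the student.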
Results
The results indicate that weak-to-strong and same-level distillation configurations can yield performance improvements. Conversely, stronger teachers, when pushed beyond a certain threshold in terms of parameters or training tokens, can saturate or reverse the expected gains. Furthermore, the study found that improvements in generalization performance were more pronounced than those in in-domain fitting, highlighting the potential for distillation to facilitate broader knowledge transfer.
Implications
These findings suggest that the selection of teachers in LLM pretraining should not solely focus on their strength but also consider their compatibility with the student model. This could lead to more efficient training practices and better performance in real-world applications, especially as the demand for smaller, efficient models grows.
The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning
Theory
- Introduces the Matching Principle, linking various robustness challenges in representation learning.
- Establishes a unified statistical framework for estimating nuisance covariances and regularizing encoder Jacobians.
- Presents empirical validation across multiple domains, confirming theoretical predictions on deployment drift.
- Introduces the Trajectory Deviation Index (tdi) as a new measure of embedding sensitivity.
Summary
This paper introduces the Matching Principle, a unified geometric theory that addresses various challenges in representation learning, such as robustness, domain adaptation, and invariance to nuisances. The core argument is that many of these issues can be framed as a single statistical problem: estimating the covariance of label-preserving deployment nuisances and regularizing the encoder Jacobian accordingly. The author identifies a specific covariance matrix (Σtask) that captures the deployment variations and proposes a corresponding regularization matrix (Σ′) that must cover the range of Σtask to prevent representation drift. The paper presents several theorems demonstrating the necessity of this matching for achieving optimal performance in various scenarios, including linear-Gaussian models and deep learning contexts. Additionally, the Trajectory Deviation Index (tdi) is introduced as a measure of embedding sensitivity. Empirical tests across multiple domains validate the theoretical predictions, showing that the proposed method outperforms traditional approaches in most cases, except for one specific instance where the theory also predicts failure. Overall, the paper provides a closed-form, falsifiable theory that reorganizes existing methods under a common framework, emphasizing the importance of the loss function in representation learning.
Methodology
The paper employs a theoretical framework grounded in statistical analysis, presenting the Matching Principle as a geometric approach to loss functions. It derives several theorems and lemmas to establish necessary conditions for optimal regularization of the encoder Jacobian. Empirical validation is conducted through thirteen pre-registered experiments across various machine learning tasks, testing the predictions of the theory against established methods.
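To make the range-coverage condition concrete, the sketch below uses a linear encoder with Jacobian `J`. It relies on the standard identity that, for a zero-mean perturbation with covariance Σ, the expected squared drift E‖Jδ‖² equals tr(J Σ Jᵀ). The function names and the toy matrices are assumptions for illustration, not the paper's code.

```python
import numpy as np

def expected_drift(J, sigma):
    """E||J d||^2 for zero-mean d with covariance sigma, i.e. trace(J sigma J^T)."""
    return float(np.trace(J @ sigma @ J.T))

def covers_range(sigma_reg, sigma_task, tol=1e-9):
    """True if range(sigma_task) is contained in range(sigma_reg)."""
    # sigma_reg @ pinv(sigma_reg) is the orthogonal projector onto range(sigma_reg).
    P = sigma_reg @ np.linalg.pinv(sigma_reg)
    return bool(np.allclose(P @ sigma_task, sigma_task, atol=tol))

sigma_task = np.diag([1.0, 0.0])   # deployment nuisance lives on axis 0
matched = np.diag([1.0, 0.0])      # regularizer penalizes the same direction
mismatched = np.diag([0.0, 1.0])   # regularizer penalizes an unrelated direction
J = np.array([[2.0, 0.0], [0.0, 3.0]])

drift = expected_drift(J, sigma_task)            # sensitivity along the nuisance axis
ok = covers_range(matched, sigma_task)
missed = covers_range(mismatched, sigma_task)
```

In this toy case the mismatched regularizer can shrink the second column of `J` as much as it likes without touching the drift, which depends only on the first column.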
Results
The results indicate that the proposed matched regularization significantly reduces deployment drift in most tested scenarios, outperforming traditional methods like CORAL and adversarial training in twelve out of thirteen empirical blocks. The only exception, Office-31, aligns with a predicted failure mode due to eigengap issues. The findings collectively reject the null hypothesis that different regularization strategies yield similar effects on deployment metrics.
Implications
The Matching Principle has the potential to streamline the development of robust machine learning models by providing a unified framework for addressing nuisance-related challenges. It encourages researchers to reconsider the design of loss functions and regularization strategies, potentially leading to more effective and interpretable models across various applications.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
NLP
Large Language Models
- CoT prompting is necessary for arithmetic tasks in small LMs, but its step order is less important than previously believed.
- Models often rely on a positional shortcut, copying the last number in the answer context, which significantly impacts accuracy.
- The presence of the correct answer can account for 54-92 percentage points of accuracy, demonstrating a strong dependency on positional information.
- Different models exhibit varying degrees of content gating, affecting their ability to reject distractor numbers.
Summary
This paper investigates the effectiveness of Chain-of-Thought (CoT) prompting in small language models (LMs) for arithmetic tasks, particularly focusing on the readout mechanism during answer generation. The author demonstrates that while CoT prompting is essential for achieving high accuracy, the logical sequencing of steps is less critical than previously thought. Instead, the models often rely on a positional shortcut where they copy the number in the trailing position before the answer delimiter, irrespective of the reasoning steps taken. This positional copying accounts for a significant portion of the models' accuracy, with the presence of the correct answer leading to a substantial increase in performance. The study employs three instruction-tuned models (Qwen, Llama, and Gemma) on the GSM8K dataset and reveals that the readout mechanism is heavily influenced by the last number in the context, which can lead to high accuracy even when intermediate reasoning is flawed. The findings suggest that the readout process can bypass genuine computation, raising questions about the faithfulness of CoT-based evaluations and the implications for model oversight.
Methodology
The study utilized three instruction-tuned language models (Qwen, Llama, and Gemma) and analyzed their performance on the GSM8K arithmetic dataset. The author isolated the answer-readout stage through prefix completion and conducted various experiments, including corruption decomposition, shuffle hierarchy, and head-level ablation, to understand the mechanisms behind the readout process.
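A distractor intervention of the kind described here can be illustrated with a small text-level helper that swaps the last number in a chain-of-thought prefix before the answer delimiter. The helper name, the regular expression, and the example prompt are assumptions for illustration. The paper's actual corruption procedure is not specified in this summary.

```python
import re

# Integers or decimals; a leading hyphen is treated as a minus sign.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def replace_trailing_number(cot_prefix, distractor):
    """Replace the last number in the prefix with a distractor value."""
    matches = list(NUMBER.finditer(cot_prefix))
    if not matches:
        return cot_prefix  # nothing to corrupt
    last = matches[-1]
    return cot_prefix[:last.start()] + str(distractor) + cot_prefix[last.end():]

prefix = "Tom has 3 apples and buys 4 more, so 3 + 4 = 7. The answer is"
corrupted = replace_trailing_number(prefix, 12)
```

Feeding `corrupted` to a model via prefix completion and checking whether it emits 12 or 7 would separate positional copying from genuine recomputation.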
Results
The results indicated that the models predominantly relied on the last number in the answer context for generating responses, achieving high accuracy even when intermediate reasoning was incorrect. The accuracy dropped significantly when the trailing number was replaced with a distractor, highlighting the importance of positional copying. The models showed varying levels of content gating, with Qwen and Llama copying distractors 87-95% of the time, while Gemma displayed stronger content-selective gating.
Implications
These findings have significant implications for the understanding of how small language models process arithmetic tasks and the reliability of CoT prompting as a method for evaluating model performance. The results suggest a need for more rigorous oversight mechanisms that account for the potential bypassing of genuine computation in model evaluations.
Hierarchical Variational Policies for Reward-Guided Diffusion
Generative Models
Computer Vision
Efficient ML
- Introduces a unified framework for test-time guidance in diffusion models using hierarchical variational policies.
- Develops Amortized HVP (AHVP) for efficient generation of high-quality reward-aligned samples.
- Presents Semi-Amortized HVP (SHVP) that combines amortized proposals with test-time refinement.
- Demonstrates significant improvements in perceptual quality and inference speed over state-of-the-art methods.
Summary
This paper presents a novel framework for adapting pretrained diffusion models to downstream tasks, particularly focusing on inverse problems, through the use of Hierarchical Variational Policies (HVP). The authors propose a method that significantly reduces the computational cost associated with test-time adaptation by formulating it as a hierarchical variational model. This model employs a lightweight stochastic policy that amortizes control, allowing for few-step diffusion sampling. The approach enables fast inference with large step sizes while maintaining sample quality through structured per-step control. The authors introduce two key methods: Amortized HVP (AHVP), which learns an initial noise distribution and per-step stochastic policies for efficient sampling, and Semi-Amortized HVP (SHVP), which combines amortized proposals with limited test-time optimization. Empirical results demonstrate that AHVP achieves superior perceptual quality and speed, outperforming existing baselines in terms of quality-speed tradeoff across various inverse problems.
Methodology
The proposed framework employs a hierarchical variational model that incorporates latent variables into the policy, allowing for flexible control over the denoising process. The method consists of a two-stage learning approach: first, it learns an initial noise distribution that maximizes the reward, followed by training per-step stochastic controllers. This setup enables a single forward pass for inference, replacing the need for expensive optimization at each step.
Results
The results indicate that the AHVP method achieves better perceptual quality in tasks such as 4× super-resolution compared to leading baselines, with more than 5× faster inference times. The SHVP method further enhances quality at a modest additional computational cost, establishing state-of-the-art performance on several challenging inverse problems.
Implications
The proposed methods have significant implications for real-time applications in computer vision and generative modeling, particularly in scenarios where computational resources are limited. The ability to efficiently adapt diffusion models to various tasks could enhance their applicability in fields such as image restoration, inpainting, and other inverse problems.
Understanding Multimodal Failure in Action-Chunking Behavioral Cloning
Robotics
Generative Models
Theory
- Multimodality in behavioral cloning presents significant challenges when multiple valid actions correspond to the same observation.
- Posterior-prior regularization can enhance sampling reliability but may also lead to loss of critical action-conditioned information.
- The Lipschitz constant of the mapping from base to action space affects the multimodality of action-space generative policies.
- Empirical experiments validate the theoretical findings regarding multimodality collapse and its implications for policy performance.
Summary
This paper investigates the challenges of multimodality in action-chunking behavioral cloning, particularly when the same observation can lead to multiple valid actions. The authors analyze how different parameterizations of multimodal policies fail in distinct ways, focusing on latent-variable and action-space generative methods. They establish that posterior-prior regularization in latent-variable policies can enhance deployment-time sampling reliability but may also suppress essential action-conditioned information, leading to a collapse of multimodality. Conversely, for action-space generative policies, the smoothness of the mapping from base to action space constrains multimodality, with a small Lipschitz constant potentially limiting the ability to assign probabilities to well-separated modes. The paper provides empirical validation through experiments on synthetic multimodal tasks and robotic simulation benchmarks, demonstrating the practical implications of their findings. The authors also release the code used for their experiments, contributing to the reproducibility of their results.
Methodology
The authors define multimodal behavioral cloning and analyze two main approaches: latent-variable methods and action-space generative methods. They derive theoretical results regarding the effects of posterior-prior regularization and Lipschitz continuity on multimodality. Empirical evaluations are conducted using controlled synthetic benchmarks and robotic simulation tasks to test their hypotheses.
Results
The study shows that excessive regularization in latent-variable policies can suppress multimodality, whereas weaker regularization can preserve modes, provided the prior covers the relevant latent regions. For action-space generative policies, a small Lipschitz constant limits the ability to assign probability to multiple well-separated modes. The experiments confirm these theoretical insights and demonstrate the practical failures of existing methods in multimodal settings.
Implications
The findings have significant implications for the design of imitation learning algorithms, particularly in robotics, where understanding and preserving multimodality can enhance the performance of learned policies. The insights on regularization and mapping smoothness can guide future research in developing more robust multimodal behavioral cloning techniques.
SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection
Efficient ML
- Introduction of SepsisAI Orchestrator as a modular, open-source platform for early sepsis detection.
- Integration of HL7 FHIR-inspired preprocessing, NoSQL storage, and a containerized LightGBM classifier.
- Empirical characterization of horizontal scaling under realistic concurrency, revealing a U-shaped latency curve.
- Provision of a reproducible deployment recipe for clinical prediction tasks beyond sepsis.
Summary
The SepsisAI Orchestrator is an open-source modular platform designed to bridge the gap between clinical machine learning models and their deployment in hospital settings, particularly for early sepsis detection. The authors identify significant barriers to the bedside use of predictive models, including heterogeneous data representations and the lack of standardized deployment workflows. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, all orchestrated with Docker and Kubernetes. The study empirically characterizes the platform's performance under load, revealing that the number of replicas must align with the physical CPU thread count to optimize latency and eliminate request failures. The findings demonstrate a U-shaped scaling behavior, which has not been previously quantified for clinical AI inference workloads. The authors provide a reproducible deployment recipe and emphasize that their contribution focuses on deployment systems rather than predictive modeling, with future work planned for prospective clinical validation.
Methodology
The methodology involves creating a containerized platform using Docker and Kubernetes, integrating various components such as HL7 FHIR-inspired CDA preprocessing, NoSQL database storage, and a LightGBM classifier served via REST APIs. Load testing was conducted using k6 to evaluate the system's performance under varying levels of concurrent users.
Results
The study found that scaling the number of replicas from 3 to 12 on a 12-thread CPU reduced the 95th percentile latency from 3.3 seconds to 1.41 seconds (a 57.3% reduction) and eliminated request failures. Over-provisioning beyond 12 replicas led to performance degradation due to scheduler contention, highlighting a U-shaped scaling behavior.
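The reported reduction can be checked directly from the two latency figures. The helper name below is an assumption. The inputs are the 95th percentile latencies quoted above.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# p95 latency in seconds at 3 replicas and at 12 replicas.
p95_reduction = percent_reduction(3.3, 1.41)
print(round(p95_reduction, 1))  # 57.3
```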
Implications
The SepsisAI Orchestrator provides a framework for deploying clinical AI models in real-time settings, potentially improving early sepsis detection and patient outcomes. The findings on scaling behavior can inform future deployments of clinical AI systems, ensuring optimal performance in hospital environments.
Optimal Dimension-Free Sampling for Regularized Classification
Theory
Optimization
Efficient ML
- Establishes optimal sampling bounds for Lipschitz continuous classification loss functions.
- Demonstrates k²/ε² and k/ε² bounds for different regularization terms.
- Identifies conditions under which linear sampling complexity can be achieved.
- Improves upon existing sensitivity sampling bounds through refined analytical techniques.
Summary
This paper addresses the problem of sample complexity in regularized classification, focusing on achieving optimal sampling bounds for a variety of Lipschitz continuous classification loss functions, including logistic, sigmoid, hinge, and ReLU losses. The authors establish upper and lower bounds for sampling complexity, specifically demonstrating k²/ε² bounds for L²/k regularization and k/ε² for L¹/k regularization. They highlight that for L²²/k regularization, the sampling complexity is influenced by the bounded derivative property of the loss function. If certain conditions on the derivative are met, linear sampling complexity can be achieved; otherwise, the general bound remains k²/ε². The paper also discusses the impossibility of dimension-free bounds when g(0) = 0. The authors improve upon previous sensitivity sampling bounds by employing refined arguments involving higher moment bounds and empirical process analyses, thus avoiding overcounting issues typical in VC-dimension frameworks. This work contributes significantly to the understanding of sample complexity in machine learning, particularly in the context of regularized classifiers.
Methodology
The authors utilize theoretical analysis to derive sampling bounds, employing concepts from empirical processes and higher moment bounds. They analyze the properties of Lipschitz continuous loss functions and their derivatives to establish conditions for optimal sampling complexity. The methodology includes both upper and lower bound proofs, ensuring that the results are robust and applicable across various scenarios.
Results
The paper presents matching upper and lower bounds for sampling complexity, confirming that for L²/k regularization, the bounds are k²/ε², while for L¹/k regularization, they are k/ε². The results indicate that linear sampling complexity can be achieved under specific conditions related to the bounded derivative of the loss function; otherwise, the general bound applies. The findings also establish that dimension-free bounds are not possible when g(0) = 0.
Implications
The results have significant implications for the design of efficient sampling strategies in machine learning, particularly for regularized classifiers. By understanding the conditions under which optimal sampling can be achieved, practitioners can better allocate resources and improve the performance of classification algorithms. This work also lays the groundwork for future research in sample complexity and regularization techniques.
Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
Optimization
Theory
- Proposes a new framework for SGD dynamics that accounts for finite learning rates and minibatch sampling.
- Derives a discrete Fokker-Planck equation that reveals discrepancies with standard Langevin approximations.
- Identifies distinct dynamical regimes based on the curvature of the loss landscape, with implications for optimization behavior.
- Provides empirical evidence supporting the theoretical framework through analysis of neural network models.
Summary
This paper challenges the conventional modeling of Stochastic Gradient Descent (SGD) as a Langevin process, which assumes that minibatch noise behaves like Brownian motion. The authors argue that this approximation is flawed, particularly at finite learning rates, and propose a new framework that treats SGD as deterministic dynamics within a fluctuating loss landscape caused by minibatch sampling. By deriving a discrete Fokker-Planck equation from the exact SGD update, they highlight that standard Langevin approximations neglect terms of order η², which can lead to incorrect predictions. The analysis reveals distinct dynamical regimes based on the curvature of the loss landscape, showing that nearly-flat directions do not reach a stationary distribution and exhibit diffusive behavior. Empirical validation on neural network models in computer vision and natural language processing supports the theoretical predictions, demonstrating a clear distinction between confined and diffusive modes in SGD dynamics.
Methodology
The authors start from the discrete update of SGD to derive a master equation for parameter distribution, leading to a discrete Fokker-Planck equation. They analyze SGD dynamics near critical points using a quadratic approximation and validate their findings through empirical studies on neural network models.
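The split between confined and diffusive directions, and the effect of dropping η² terms, can be seen in a one-dimensional toy model. The sketch assumes a quadratic loss with curvature `h` and additive gradient noise of standard deviation `sigma`, which is a simplification of the paper's fluctuating-landscape picture. Under that assumption the update x ← (1 − ηh)x − η·noise gives the exact variance recursion v ← (1 − ηh)²v + η²σ².

```python
def variance_trajectory(h, eta, sigma, steps):
    """Variance of x under x <- (1 - eta*h) * x - eta * noise, starting from x = 0."""
    v, out = 0.0, []
    for _ in range(steps):
        v = (1.0 - eta * h) ** 2 * v + (eta * sigma) ** 2
        out.append(v)
    return out

eta, sigma, h = 0.1, 1.0, 1.0
confined = variance_trajectory(h=h, eta=eta, sigma=sigma, steps=2000)
flat = variance_trajectory(h=0.0, eta=eta, sigma=sigma, steps=2000)

# Fixed point of the discrete recursion versus the continuous-time (Langevin) value.
exact_stationary = eta * sigma**2 / (h * (2.0 - eta * h))
langevin_stationary = eta * sigma**2 / (2.0 * h)
```

In the curved direction the variance settles at ησ²/(h(2 − ηh)), slightly above the Langevin value ησ²/(2h), with the gap growing with the learning rate. In the flat direction (h = 0) there is no fixed point, and the variance grows linearly as tη²σ².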
Results
The study finds that the behavior of SGD near critical points decomposes along the eigenbasis of the mean Hessian into distinct regimes. In nearly-flat directions, the variance of parameter trajectories grows over time, indicating effective diffusion. Empirical results corroborate these theoretical predictions, showing a separation between confined and diffusive modes in the dynamics of SGD.
Implications
This work provides a deeper understanding of SGD dynamics, which could inform the design of more effective optimization algorithms in machine learning. By recognizing the limitations of traditional models, researchers and practitioners can better tailor their approaches to training deep learning models, potentially improving convergence rates and model performance.
Non-normal spectral signatures of instability in neural network training dynamics
Optimization
Theory
- Linearized update operators for Adam and SGD with momentum are generically non-normal.
- Non-normality leads to transient amplification of perturbations during training.
- The eigenvector condition number κ(V) serves as a more effective stability measure than the spectral radius.
- Numerical experiments confirm κ(V) can separate stable and unstable training phases.
Summary
This paper addresses the training instabilities commonly observed in deep learning, such as loss spikes and oscillatory convergence, which have not been rigorously explained through existing optimization theories. The author demonstrates that the linearized update operators for popular optimizers like Adam and SGD with momentum are generically non-normal. This non-normality leads to transient amplification of perturbations, which can occur even when the spectral radius of the update operator remains below one, a condition typically associated with stability. The paper introduces a new stability measure, the eigenvector condition number κ(V), which serves as an early-warning indicator for transient amplification, outperforming the traditional spectral radius criterion. Numerical experiments validate that κ(V) effectively distinguishes stable from unstable training phases, providing a continuous measure of non-normal amplification severity. The findings suggest that non-Hermitian operator theory can enhance our understanding of optimization stability in neural networks, offering a new diagnostic framework for adaptive optimization stability.
Methodology
The author employs operator-theoretic analysis to study the linearized update dynamics of neural network optimizers. By deriving conditions for non-normality in the update operators and establishing a precursor bound for stability, the paper connects theoretical insights with numerical experiments on two-layer networks to validate the proposed measures.
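The gap between the spectral radius and the eigenvector condition number can be illustrated on the simplest momentum system. The sketch below assumes heavy-ball SGD on a one-dimensional quadratic with state (x, v), which is a toy stand-in for the linearized operators analyzed in the paper. The function names and parameter values are assumptions. For a diagonalizable operator M = VΛV⁻¹, the standard bound ‖Mᵗ‖₂ ≤ κ(V)·ρ(M)ᵗ shows why a large κ(V) leaves room for transient growth even when ρ(M) < 1.

```python
import numpy as np

def heavy_ball_operator(curvature, lr, momentum):
    """Update matrix for v' = momentum*v - lr*curvature*x, x' = x + v', acting on (x, v)."""
    a = lr * curvature
    return np.array([[1.0 - a, momentum], [-a, momentum]])

def stability_report(M, horizon=200):
    """Spectral radius, eigenvector condition number, and peak transient gain."""
    eigvals, V = np.linalg.eig(M)
    rho = float(np.max(np.abs(eigvals)))
    kappa = float(np.linalg.cond(V))  # 2-norm condition number of the eigenvector matrix
    peak = max(
        float(np.linalg.norm(np.linalg.matrix_power(M, t), 2))
        for t in range(1, horizon + 1)
    )
    return rho, kappa, peak

M = heavy_ball_operator(curvature=1.0, lr=0.01, momentum=0.99)
rho, kappa, peak = stability_report(M)
```

Here the spectral radius is below one, so the iteration is asymptotically stable, yet the peak of ‖Mᵗ‖₂ exceeds one: perturbations are amplified before they decay, and κ(V) caps how large that amplification can get.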
Results
The analysis reveals that the update operators for Adam and SGD with momentum are non-normal, which allows for transient amplification of perturbations. The eigenvector condition number κ(V) is shown to effectively predict instability before the spectral radius indicates a problem, providing a significant improvement in understanding training dynamics.
Implications
The findings suggest that incorporating non-normality into the analysis of neural network training could lead to better optimizer designs and improved stability in training deep learning models. This work opens avenues for further research into the application of operator theory in machine learning optimization.
MARS: Magnitude-Aware Rank Statistics
Theory
- MARS addresses the issue of magnitude-blindness in traditional rank statistics.
- It incorporates performance metric values into the ranking process for more accurate evaluations.
- MARS uses a dynamic regularization of the Critical Difference to reflect performance volatility.
- The methodology includes a non-parametric permutation test for stability assessment.
Summary
The paper introduces Magnitude-Aware Rank Statistics (MARS), a novel framework designed to enhance the evaluation of machine learning models by addressing the limitations of traditional Critical Difference (CD) diagrams. Traditional methods often suffer from 'magnitude-blindness', where the performance gaps between models are ignored, treating marginal victories as equivalent to significant ones. This can lead to misleading conclusions, particularly in high-stakes domains like medical diagnostics and autonomous systems. MARS incorporates a relative margin coefficient to weight discrete ranks, allowing for a more nuanced representation of model performance. By transforming discrete ranks into continuous magnitude-aware rank scores and employing a dynamic regularization of the CD value, MARS provides a more accurate statistical representation of performance differences. The methodology includes a non-parametric permutation test to evaluate the stability of these scores, bridging the gap between ordinal stability and metric sensitivity. The paper demonstrates that MARS can yield more realistic insights and improve the identification of the best-performing models in extensive experimental settings.
Methodology
MARS integrates performance metric values into the ranking process, transforming discrete ordinal ranks into continuous magnitude-aware rank scores. It employs a dynamic regularization of the Critical Difference based on observed performance gaps and utilizes a non-parametric permutation test to evaluate the stability of the scores.
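The contrast between ordinal and magnitude-aware ranking can be illustrated with a deliberately simple stand-in. The min-max weighting below is a hypothetical scheme chosen for clarity and is not the MARS formula, which this summary does not specify. Both function names are assumptions, and higher metric values are taken to be better.

```python
import numpy as np

def ordinal_ranks(scores):
    """Classic ranks on one dataset: 1 for the best score, k for the worst."""
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    return ranks

def margin_aware_ranks(scores):
    """Hypothetical continuous rank in [1, k] that reflects the gap to the best score."""
    s = np.asarray(scores, dtype=float)
    spread = s.max() - s.min()
    if spread == 0:
        return np.full(len(s), (len(s) + 1) / 2)  # all models tied
    return 1 + (len(s) - 1) * (s.max() - s) / spread

accuracies = [0.90, 0.899, 0.50]
plain = ordinal_ranks(accuracies)
aware = margin_aware_ranks(accuracies)
```

Ordinal ranking places the second model a full rank behind the first despite a 0.001 gap, while the margin-aware score keeps it almost level with the winner and far ahead of the third model.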
Results
The results indicate that MARS offers a more realistic and statistically significant representation of model performance differences compared to traditional methods. It successfully identifies true performance winners in scenarios where traditional rank statistics fail.
Implications
The implications of MARS are significant for the evaluation of machine learning models, particularly in high-stakes applications where accurate performance assessment is critical. It can lead to better decision-making in model selection and improve the reliability of machine learning systems.