AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 61
Update frequency: 8h
Days of history: 7
How Many Different Outputs Can a Transformer Generate?
Theory
Large Language Models
NLP
- Transformers can only generate a finite set of output sequences, with many sequences being fundamentally inaccessible.
- The maximum length of accessible sequences grows linearly with prompt length, but the proportion of accessible sequences decays exponentially beyond a critical threshold.
- The authors provide a theoretical upper bound for the relationship between prompt length and accessible sequence length.
- Experiments across several transformer architectures and model sizes validate the theoretical predictions.
Summary
This paper investigates the limitations of transformer architectures in generating distinct output sequences. The authors derive an upper bound on the number of different sequences a transformer can produce based on prompt length, demonstrating that this bound is empirically tight across various architectures and model sizes. They provide a theoretical framework explaining the observed failures of transformers on simple tasks, such as copying sequences, by proving that the maximum length of accessible sequences grows linearly with prompt length. Beyond a certain threshold, the proportion of accessible sequences decays exponentially with sequence length. The authors also establish a formula to predict these thresholds based on transformer architecture, which aligns closely with empirical observations. The study highlights intrinsic architectural constraints that affect all transformers, regardless of size or training data, and suggests that many sequences remain fundamentally inaccessible to these models.
Methodology
The authors conducted a theoretical analysis of transformer architectures, focusing on the constraints imposed by finite precision and the geometry of internal representations. They defined 'accessible sequences' and derived mathematical proofs to establish the relationships between prompt length and sequence accessibility. Additionally, they performed empirical experiments to validate their theoretical predictions across different transformer models.
Results
The study reveals that the maximum length of accessible sequences increases linearly with prompt length, while the proportion of accessible sequences declines exponentially after a certain threshold. The theoretical predictions closely match empirical performance across various transformer architectures, confirming the intrinsic limitations of these models in generating outputs.
Implications
The findings clarify what transformer models can and cannot produce in sequence generation tasks. They explain why transformers struggle with simple tasks as sequence lengths increase, and they can guide future research on model design and training strategies that mitigate these limitations.
Detecting Atypical Clients in Federated Learning via Representation-Level Divergence
Federated Learning
- Introduces a geometric perspective to analyze client behavior in Federated Learning.
- Proposes a lightweight metric for quantifying functional divergence between client and global models.
- Provides an interpretable signal for monitoring data heterogeneity in FL systems.
- Distinguishes between stable heterogeneous clients and those with atypical updates.
Summary
This paper addresses the challenges posed by heterogeneous data in Federated Learning (FL), which can lead to unstable updates and degraded global model performance. The authors propose a novel approach to detect atypical client behavior by introducing a lightweight geometric signal that quantifies the functional divergence of a client from the global model. Unlike traditional methods that compare model parameters or gradients, this approach focuses on how local training alters the activation-induced partition of the input space, evaluated on a shared probe set. This results in a permutation-invariant and interpretable metric that captures differences in data processing by the model. The proposed metric effectively distinguishes between stable heterogeneous clients and those whose updates significantly diverge from the global regime, providing a tool for monitoring client behavior and enabling risk-aware aggregation strategies in FL systems. The paper also discusses related work and evaluates the proposed metric on different models, demonstrating its utility in identifying atypical client contributions.
Methodology
The authors develop a geometric signal that measures how local client training alters the activation patterns of a neural network, leading to a representation-level divergence metric. This metric is evaluated on a shared probe set to assess client behavior in relation to the global model.
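The idea of comparing activation-induced partitions on a shared probe set can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' metric: it uses a single ReLU layer, treats two probe points as belonging to the same region when their on/off patterns match, and scores divergence as the fraction of probe pairs that one model groups together and the other does not. The function names are hypothetical.

```python
from itertools import combinations

def relu_pattern(weights, biases, x):
    """On/off pattern of one ReLU layer for input x, as a tuple of 0/1."""
    return tuple(
        1 if sum(w * xi for w, xi in zip(row, x)) + b > 0 else 0
        for row, b in zip(weights, biases)
    )

def partition_labels(weights, biases, probes):
    """Map each probe point to the activation region (pattern) it falls in."""
    return [relu_pattern(weights, biases, x) for x in probes]

def partition_divergence(labels_a, labels_b):
    """Fraction of probe pairs grouped together by one model but not the other."""
    pairs = list(combinations(range(len(labels_a)), 2))
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)

# Shared probe set and three toy one-layer models (all values are made up).
probes = [(1, 1), (1, -1), (-1, 1), (-1, -1), (2, 2)]
global_model = partition_labels([[1, 0], [0, 1]], [0, 0], probes)
permuted_client = partition_labels([[0, 1], [1, 0]], [0, 0], probes)
atypical_client = partition_labels([[1, 1], [1, 1]], [0, 0], probes)

print(partition_divergence(global_model, permuted_client))
print(partition_divergence(global_model, atypical_client))
```

Because the score depends only on which probe points share a region, swapping the order of hidden units (the permuted client) leaves it unchanged, which is one way to obtain the permutation invariance described above.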
Results
The proposed metric successfully identifies clients that induce atypical functional changes, demonstrating its effectiveness in distinguishing stable clients from those whose updates diverge significantly from the global model. The evaluation on different models confirms the utility of the metric in practical federated learning scenarios.
Implications
This work can improve the reliability and stability of federated learning systems by enabling closer monitoring of client behavior and more robust aggregation strategies. It may also strengthen the security and integrity of federated learning deployments in dynamic environments.
CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation
Time Series
Multimodal
- CogAdapt effectively bridges the gap between clinical ECG models and wearable devices for cognitive load assessment.
- LeadBridge transforms 3-lead ECG signals into anatomically-consistent 12-lead representations.
- ProFine allows for gradual fine-tuning of model layers, reducing the risk of catastrophic forgetting.
- The framework shows significant performance improvements over traditional models in cognitive load classification.
Summary
The paper presents CogAdapt, a novel framework designed to adapt clinical ECG foundation models for real-time cognitive load assessment using wearable devices. Traditional cognitive load assessment methods face challenges such as limited labeled data and poor generalization across subjects. The authors leverage pre-trained ECG models, which have been trained on extensive clinical datasets, and introduce two key innovations: LeadBridge, a learnable adapter that converts 3-lead wearable ECG signals into 12-lead representations, and ProFine, a progressive fine-tuning strategy that mitigates catastrophic forgetting during model adaptation. Evaluations on two public datasets, CLARE and CL-Drive, demonstrate that CogAdapt significantly outperforms baseline models trained from scratch, achieving macro-F1 scores of 0.626 and 0.768, respectively. These findings underscore the potential of adapting foundation models for effective cognitive load assessment in real-world scenarios.
Methodology
The methodology involves the development of LeadBridge, which adapts 3-lead ECG data to 12-lead formats, and ProFine, a fine-tuning approach that incrementally unfreezes model layers while managing learning rates to prevent forgetting. The framework is evaluated using leave-one-subject-out cross-validation on two datasets, CLARE and CL-Drive.
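A progressive unfreezing schedule of the kind described for ProFine might look like the sketch below. The schedule shape, the parameter names (`epochs_per_stage`, `decay`), and the rule that earlier layers get smaller learning rates are assumptions made for illustration; the summary does not give the actual schedule.

```python
def progressive_schedule(num_groups, epoch, epochs_per_stage, base_lr, decay):
    """Return {group_index: learning_rate} for the layer groups trainable at `epoch`.

    Group num_groups - 1 is the task head; group 0 is the earliest encoder block.
    One more group is unfrozen every `epochs_per_stage` epochs, from the head
    downward, and each earlier group gets a learning rate scaled by `decay`.
    """
    unfrozen = min(num_groups, 1 + epoch // epochs_per_stage)
    rates = {}
    for depth in range(unfrozen):
        group = num_groups - 1 - depth
        rates[group] = base_lr * (decay ** depth)
    return rates

# Hypothetical 4-group model, unfreezing one group every 2 epochs.
for epoch in (0, 2, 4, 6):
    print(epoch, progressive_schedule(4, epoch, 2, 1e-3, 0.5))
```

Keeping early layers frozen at first, and then updating them only with small steps, is a common way to limit how far pre-trained features drift during adaptation.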
Results
CogAdapt achieved macro-F1 scores of 0.626 on the CLARE dataset and 0.768 on the CL-Drive dataset, significantly outperforming baseline models that were trained from scratch. These results demonstrate the effectiveness of the proposed adaptations in enhancing cognitive load assessment.
Implications
The findings suggest that adapting clinical foundation models for wearable applications can improve real-time cognitive load assessment, which has implications for adaptive human-computer interaction in various fields such as education, training, and driving.
Abstraction for Offline Goal-Conditioned Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduces the concept of relativised options for better experience reuse in GCRL.
- Demonstrates that hierarchical policies can provide both temporal and absolute abstraction.
- Presents two algorithms that outperform traditional flat and hierarchical policies in offline settings.
- Highlights the importance of action similarity over value similarity in learning options.
Summary
This paper addresses the challenges in Offline Goal-Conditioned Reinforcement Learning (GCRL) by introducing a framework that leverages hierarchical policies for both temporal and absolute abstraction. The authors argue that existing methods struggle with low-quality transitions in datasets, which can hinder effective policy learning. They propose the concept of 'relativised options' that allow agents to reuse experiences across similar contexts in the state-space, thus improving generalization. The paper presents two algorithms that learn these relativised options and abstract from the absolute frame of reference, demonstrating that this approach significantly enhances performance in high-dimensional offline GCRL tasks. The findings suggest that hierarchical structures can effectively mitigate issues related to dataset coverage and distribution shift, leading to more robust policy extraction.
Methodology
The authors develop a framework called Abstractive Reinforcement Learning (ARL) that learns relativised options based on action similarity. They propose two algorithms: one that implicitly learns relativised options through action similarity, and another that explicitly enforces translational invariance in the low-level MDP to enhance generalization. The experiments are conducted in high-dimensional manipulation tasks to evaluate the effectiveness of these methods.
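One simple way to picture translational invariance in the low-level problem is to express each subgoal relative to the agent's current position, so that transitions recorded at different absolute locations can be pooled. The sketch below assumes states and subgoals are plain coordinate tuples; the helper names are hypothetical and this is not the authors' algorithm.

```python
def relativise(state, subgoal):
    """Express the subgoal as a displacement from the current state."""
    return tuple(g - s for s, g in zip(state, subgoal))

def group_by_relative_goal(transitions):
    """Pool (state, subgoal, action) tuples that share the same relative subgoal."""
    pooled = {}
    for state, subgoal, action in transitions:
        pooled.setdefault(relativise(state, subgoal), []).append(action)
    return pooled

# Two transitions far apart in absolute terms share the same displacement.
transitions = [
    ((0, 0), (1, 0), "right"),
    ((5, 5), (6, 5), "right"),
    ((2, 2), (2, 3), "up"),
]
print(group_by_relative_goal(transitions))
```

In this toy data the first two transitions end up in the same bucket, which is the kind of experience reuse across similar contexts that relativised options are meant to provide.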
Results
The experiments show that the proposed algorithms significantly improve performance in offline GCRL tasks compared to both flat policies and traditional hierarchical policies. The use of relativised options and the inductive biases introduced lead to better policy extraction and generalization, particularly in regions of the state space with low-quality transitions.
Implications
The findings suggest that incorporating hierarchical structures and relativised options can enhance the robustness of offline reinforcement learning agents, making them more effective in real-world applications where data may be sparse or of low quality. This work opens avenues for further research into flexible inductive biases and methods for learning options in complex environments.
ARC-STAR: Auditable Post-Hoc Correction for PDE Foundation Models
Efficient ML
Theory
Optimization
- ARC-STAR effectively addresses the spatial concentration of errors in PDE foundation models.
- The framework is structured into three stages: global correction, local refinement, and budget-aware routing.
- It achieves significant error reduction without the need for retraining the underlying model.
- ARC-STAR outperforms existing methods across multiple benchmarks, demonstrating its effectiveness.
Summary
The paper introduces ARC-STAR, a framework for correcting predictions made by frozen partial differential equation (PDE) foundation models, which often exhibit significant errors when deployed in unfamiliar flow regimes. The authors observe that traditional post-hoc correction methods, such as uniform correction across the entire field, fail to address the spatial concentration of errors and therefore waste compute. ARC-STAR organizes the correction process into three stages: a global corrector that addresses broad solver bias, a blockwise local refiner that targets specific high-error regions, and a label-free scoring mechanism that decides where to spend the refinement budget. This approach reduces error substantially without retraining the underlying model. The authors validate their framework on five flow benchmarks, where ARC-STAR is the best performer in most tested regime cells and reduces velocity rollout error by a factor of at least 36 relative to the raw predictions from the Poseidon model. The framework's design emphasizes auditability and budget-awareness, making it a practical option for real-world applications.
Methodology
ARC-STAR keeps the PDE foundation model frozen and corrects its predictions in three stages. A global corrector first reduces broad biases. A blockwise local refiner then focuses on high-error regions. A label-free scoring mechanism determines which blocks to refine based on their estimated error contribution, so that computational resources are allocated where they matter most.
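The routing step can be illustrated with a small sketch. It assumes a 2D prediction field split into square blocks, and it uses the variance of predicted values inside each block as the label-free score. That proxy is chosen only for illustration, since the summary does not say how ARC-STAR scores blocks, and the function names are hypothetical.

```python
def block_scores(field, block):
    """Score each block by the variance of the predicted values inside it."""
    rows, cols = len(field), len(field[0])
    scores = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            vals = [
                field[r][c]
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            mean = sum(vals) / len(vals)
            scores[(r0, c0)] = sum((v - mean) ** 2 for v in vals) / len(vals)
    return scores

def route(scores, budget):
    """Pick the `budget` highest-scoring blocks for local refinement."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]

# Toy 4x4 field: only the bottom-right 2x2 block has strong local variation.
field = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 9],
    [0, 0, 9, 0],
]
print(route(block_scores(field, 2), budget=1))
```

With a budget of one block, only the bottom-right block is sent to the local refiner; the rest of the field would receive just the global correction.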
Results
ARC-STAR reduced velocity rollout error by a factor of at least 36 relative to the raw Poseidon model across all tested regime cells. The global correction stage reduced raw host error by 91-99%, and the local refinement stage cut the remaining error by up to 94.4%. The framework was the best performer in seven of the ten regime cells.
Implications
ARC-STAR makes PDE foundation models more practical to deploy in fields that require accurate predictions of physical phenomena. Because it corrects errors efficiently without retraining, it is suited to real-time applications in engineering, meteorology, and other domains that rely on PDE models.
From Snapshots to Trajectories: Learning Single-Cell Gene Expression Dynamics via Conditional Flow Matching
Generative Models
Time Series
- Introduces Single-Cell Flow Matching (scFM) for modeling gene expression dynamics from sparse scRNA-seq data.
- Addresses challenges of ambiguous transitions and compounding errors in long-horizon predictions.
- Combines optimal transport alignment with generative modeling to enhance temporal coherence.
- Demonstrates improved accuracy in trajectory reconstruction and distributional predictions on real datasets.
Summary
This paper addresses the challenge of modeling single-cell gene expression dynamics from sparse time-resolved single-cell RNA sequencing (scRNA-seq) data, which typically consists of unpaired snapshot populations collected at discrete time points. The authors propose a novel framework called Single-Cell Flow Matching (scFM) that integrates optimal transport (OT) alignment with continuous-time generative modeling to infer trajectories at unmeasured time points. The scFM framework tackles two main challenges: the ambiguity in local transitions due to unpaired snapshots and the compounding errors in long-horizon predictions. By computing entropically regularized OT couplings between adjacent snapshots, scFM constructs soft, weighted flow-matching targets to learn time-dependent velocity fields. The framework also learns bidirectional velocity fields to enhance temporal coherence and introduces distribution-level alignment and latent dynamic regularization to stabilize long-term predictions. Experimental results on real-world scRNA-seq datasets demonstrate that scFM significantly improves the accuracy of distributional predictions for both interpolation and extrapolation, yielding more coherent trajectory reconstructions and better visualizations of gene expression dynamics, even in the absence of intermediate time points.
Methodology
The scFM framework utilizes entropically regularized optimal transport to compute couplings between adjacent snapshots, which are then used to create soft flow-matching targets for learning velocity fields. Bidirectional velocity fields are learned to ensure consistency and improve temporal coherence. Additionally, distribution-level alignment and latent dynamic regularization are employed to anchor long rollouts and mitigate drift during predictions.
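The coupling step can be sketched with a plain Sinkhorn iteration, the standard algorithm for entropically regularized optimal transport. The way the coupling is turned into velocity targets below (a plan-weighted average destination per source cell, divided by the time gap) is one simple choice made for illustration and is not necessarily the weighting scFM uses. Uniform marginals, the value of `eps`, and the function names are assumptions.

```python
import math

def sinkhorn(cost, eps=0.5, iters=200):
    """Entropic OT coupling between uniform marginals for a cost matrix."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def soft_velocity_targets(x0, x1, plan, dt):
    """Plan-weighted mean displacement of each source cell, divided by dt."""
    targets = []
    for i, xi in enumerate(x0):
        weight = sum(plan[i])
        mean_dest = [
            sum(plan[i][j] * x1[j][d] for j in range(len(x1))) / weight
            for d in range(len(xi))
        ]
        targets.append([(m - s) / dt for m, s in zip(mean_dest, xi)])
    return targets

# Two toy snapshots of 1D "expression" values at adjacent time points.
x0 = [[0.0], [1.0]]
x1 = [[0.1], [1.2]]
cost = [[(p[0] - q[0]) ** 2 for q in x1] for p in x0]
plan = sinkhorn(cost)
print(soft_velocity_targets(x0, x1, plan, dt=1.0))
```

Smaller values of `eps` make the coupling closer to a hard one-to-one matching, while larger values spread each cell's mass over more candidate destinations.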
Results
The experiments conducted on real-world scRNA-seq datasets show that scFM consistently outperforms existing methods in terms of distributional prediction accuracy for both temporal interpolation and extrapolation. The framework also achieves more accurate trajectory reconstructions and temporally coherent visualizations, indicating a better recovery of the underlying gene expression dynamics.
Implications
The scFM framework can help researchers study cellular dynamics and transitions in complex biological systems. By showing how gene expression changes over time, it can support the study of developmental processes and disease progression.
Symbolic Density Estimation for Discrete Distributions
Theory
Interpretability
Generative Models
- Introduction of Symbolic Density Estimation (SDE) for automatic recovery of closed-form PMFs.
- Integration of validity checks for PMFs into the symbolic discovery process.
- Development of SDEBench, a benchmark dataset for evaluating discrete distribution models.
- Demonstration of SDE's capability to recover both classical and complex distributions accurately.
Summary
This paper introduces Symbolic Density Estimation (SDE), an innovative unsupervised framework designed to automatically recover closed-form probability mass functions (PMFs) for discrete distributions. The authors highlight the limitations of traditional parametric modeling, which often suffers from model mis-specification due to the restricted catalog of interpretable distributions. SDE addresses this challenge by employing a structured search space that combines domain-specific structural priors with evolutionary search techniques and a validity-aware inference stage. The method is capable of extending to complex distribution families, including zero-inflated and finite mixtures. To facilitate systematic evaluation, the authors present SDEBench, a benchmark dataset encompassing a wide array of commonly used discrete distributions. Empirical results demonstrate that SDE effectively recovers both standard and non-standard discrete distributions, achieving accurate parameter estimates. Additionally, a real data application illustrates that SDE identifies concise and interpretable mixture models that enhance goodness-of-fit compared to conventional models. Overall, the proposed framework bridges the gap between rigid parametric approaches and flexible nonparametric methods, offering a powerful tool for modeling discrete data.
Methodology
The SDE framework operates by minimizing a weighted reconstruction error on the log-PMF domain, utilizing a custom operator set tailored for discrete functional forms. It incorporates automated validity checks to ensure non-negativity and normalization of proposed PMFs, guiding the symbolic search towards interpretable solutions. The method employs a symbolic discovery approach that generates candidate log-PMF expressions through the composition of basic arithmetic operators and log-domain combinatorial primitives, addressing the combinatorial complexity of the search space.
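The two ingredients named above, a validity check and a weighted error in the log-PMF domain, can be sketched for a single candidate expression. The Poisson form stands in for an expression the search might propose. The truncated support, the tolerance, and the choice of empirical frequencies as weights are assumptions for illustration, and the counts are made up.

```python
import math

def poisson_log_pmf(k, lam):
    """Candidate closed-form log-PMF: k*log(lam) - lam - log(k!)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def is_valid_pmf(log_pmf, support, tol=1e-6):
    """Check normalization on a truncated support; exp() keeps values non-negative."""
    total = sum(math.exp(log_pmf(k)) for k in support)
    return abs(total - 1.0) < tol

def weighted_log_error(log_pmf, counts):
    """Squared log-domain error, weighted by the empirical frequency of each value."""
    n = sum(counts.values())
    return sum(
        (c / n) * (math.log(c / n) - log_pmf(k)) ** 2
        for k, c in counts.items()
        if c > 0
    )

# Made-up counts that roughly follow a Poisson distribution with rate 3.
counts = {0: 5, 1: 15, 2: 22, 3: 22, 4: 17, 5: 10, 6: 5, 7: 2, 8: 1, 9: 1}
good = lambda k: poisson_log_pmf(k, 3.0)
unnormalized = lambda k: k * math.log(3.0) - math.lgamma(k + 1)  # missing "- lam"

print(is_valid_pmf(good, range(60)), is_valid_pmf(unnormalized, range(60)))
print(weighted_log_error(good, counts))
```

A candidate that fits the data but fails the normalization check, like the second expression here, would be rejected or repaired before being reported as a PMF.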
Results
The empirical evaluation indicates that SDE successfully recovers correct symbolic PMFs from synthetic data, including well-known distributions like Poisson and Negative Binomial. In more complex scenarios involving noisy or composite data, SDE identifies concise expressions that outperform standard parametric families, demonstrating its robustness and flexibility.
Implications
The SDE framework offers a way to derive interpretable models of discrete data automatically, without prespecified distribution families. This can enhance the analysis of count data and categorical outcomes in fields such as the social sciences, healthcare, and economics.
Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift
Theory
- Introduces the Expectation Consistency Condition for confidence calibration under covariate shifts.
- Proposes Expectation Consistency Loss (ECL) as a new unsupervised domain adaptation loss.
- ECL is compatible with various calibration types, enhancing flexibility in application.
- Demonstrates that ECL maintains sample complexity comparable to existing calibration methods.
Summary
This paper addresses the challenge of confidence calibration in classification models under covariate shift, a common issue where the distribution of input data changes between training and testing phases. Traditional calibration methods assume that training and test data are independent and identically distributed (i.i.d.), which limits their effectiveness when this assumption is violated. The authors introduce the Expectation Consistency Condition, a necessary and sufficient condition for confidence calibration that indicates covariate shifts do not necessarily lead to miscalibrated confidence. Building on this condition, they propose the Expectation Consistency Loss (ECL), an unsupervised domain adaptation loss that can be applied to various calibration types, including canonical, class-wise, and top-label calibration. The paper also provides a mini-batch training scheme for ECL, ensuring efficient computation. The effectiveness of ECL is validated through experiments on both simulated and real-world datasets exhibiting covariate shifts, demonstrating its robustness compared to existing methods.
Methodology
The authors derive the Expectation Consistency Condition to establish a framework for confidence calibration under covariate shifts. They then develop the Expectation Consistency Loss (ECL) based on this condition, which can be applied to different calibration scenarios. A mini-batch training scheme is also introduced to facilitate efficient computation of ECL during training.
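For readers unfamiliar with top-label calibration, the sketch below computes the standard binned expected calibration error (ECE), a common diagnostic for how far confidence is from accuracy. It is not the Expectation Consistency Loss, whose exact form is not given in this summary; the bin count and function name are assumptions.

```python
def top_label_ece(confidences, correct, num_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# A model that says 0.75 and is right 3 times out of 4 is calibrated in this bin.
print(top_label_ece([0.75] * 4, [1, 1, 1, 0]))
# A model that says 0.95 but is right only half the time is overconfident.
print(top_label_ece([0.95] * 4, [1, 0, 1, 0]))
```

Under covariate shift, a score like this would be evaluated on target-domain data, where labels are typically unavailable during training, which is why an unsupervised loss is needed.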
Results
The proposed ECL was shown to effectively calibrate confidence in classification models across various datasets with covariate shifts. The results indicate that ECL outperforms traditional calibration methods that rely on importance weighting, particularly in scenarios where the density ratios are large or unbounded.
Implications
The findings suggest that ECL can significantly improve the reliability of confidence estimates in safety-critical applications, such as medical diagnosis and autonomous systems, where accurate uncertainty quantification is essential. This method could enhance the integration of machine learning models with other probabilistic frameworks and improve decision-making processes.
Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding
Time Series
- Introduction of the Temporal Contrastive Transformer (TCT) for financial crime detection.
- TCT utilizes self-supervised contrastive learning to generate sequence embeddings.
- Embeddings achieved an AUC of 0.8644, indicating effective capture of temporal dynamics.
- No significant improvement was observed when embeddings were combined with engineered features.
Summary
This paper introduces the Temporal Contrastive Transformer (TCT), a novel representation learning framework aimed at capturing the contextual temporal dynamics in sequences of financial transactions. The TCT is trained using a self-supervised contrastive objective, which allows it to produce embeddings that encode behavioral patterns over time, facilitating downstream fraud detection tasks. The authors evaluate TCT by using the learned embeddings as input features for a gradient boosting classifier. The results indicate that the embeddings achieve a meaningful predictive performance (AUC 0.8644), demonstrating the model's ability to capture non-trivial temporal structures. However, when combined with domain-engineered features, there is no significant improvement over the baseline (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. This highlights the challenges in achieving additive value over strong domain features. The findings suggest that while TCT is a promising approach for representation learning in financial crime detection, further research is needed to enhance model architecture, training objectives, and integration strategies. The results indicate that learned representations can approximate domain-specific features without manual engineering, marking a significant step towards reducing reliance on feature engineering in this domain.
Methodology
The TCT framework integrates a Temporal Fusion Transformer for structured multi-horizon temporal modeling and employs a Contrastive Predictive Coding objective to learn embeddings from unlabeled transaction sequences. This self-supervised approach contrasts future latent representations against negative samples to capture predictive temporal structures.
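The contrastive step can be sketched with the InfoNCE loss that Contrastive Predictive Coding is built on: the model's prediction of a future latent is scored against the true future embedding and a set of negatives. The dot-product score, the toy vectors, and the function names are assumptions; the TCT encoder itself is not reproduced here.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(prediction, positive, negatives):
    """Negative log-softmax score of the true future embedding among all candidates."""
    logits = [dot(prediction, positive)] + [dot(prediction, n) for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]

# The loss is low when the prediction points at the true future embedding...
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# ...and high when a negative sample matches the prediction better.
misaligned = info_nce([1.0, 0.0], [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

Minimizing this loss over unlabeled transaction sequences pushes the encoder to keep whatever information helps predict what an account does next.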
Results
The TCT embeddings achieved an AUC of 0.8644 in fraud detection tasks, indicating meaningful predictive performance. However, when combined with domain-engineered features, the performance did not significantly exceed the baseline, suggesting substantial overlap between learned and engineered representations.
Implications
The TCT framework presents a promising direction for financial crime detection by potentially reducing the need for extensive manual feature engineering. It highlights the importance of self-supervised learning in capturing temporal patterns, which could lead to more adaptable and generalizable models in fraud detection and anti-money laundering efforts.
SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation
Generative Models
Optimization
Computer Vision
- SeqLoRA introduces a bilevel optimization framework for efficient multi-concept image generation.
- The method enforces orthogonality between adaptation subspaces to mitigate representation interference.
- Theoretical analyses confirm convergence and reduced catastrophic forgetting compared to frozen-basis methods.
- SeqLoRA demonstrates state-of-the-art performance in identity preservation across multiple concepts.
Summary
The paper introduces Sequential regularized LoRA (SeqLoRA), a novel framework for parameter-efficient fine-tuning of text-to-image diffusion models that addresses the challenge of generating images with multiple personalized concepts. Traditional methods face issues of representation interference, where adaptations for different concepts overlap in the model's parameter space, leading to attribute entanglement and degraded image quality. SeqLoRA employs a bilevel optimization approach that jointly optimizes both LoRA factors while enforcing orthogonality between subspaces, thus balancing expressiveness and interference. The authors provide theoretical guarantees for convergence and demonstrate that learning the LoRA basis from data significantly reduces residual interference compared to methods that freeze the basis. Experiments show that SeqLoRA achieves superior identity preservation and scalability, successfully generating images with up to 101 distinct concepts without the need for costly post-hoc fusion. This advancement opens new avenues for modular composition in multi-concept image generation, enhancing the capabilities of diffusion models in creative applications.
Methodology
SeqLoRA utilizes a constrained continual learning framework that optimizes LoRA factors through bilevel optimization. It models residual layer activations as a matrix sub-Gaussian process to derive high-probability bounds on catastrophic forgetting, ensuring that the learning process minimizes residual interference effectively.
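The orthogonality constraint between adaptation subspaces can be illustrated on its own, separately from the bilevel optimization. The sketch below measures the overlap between two sets of basis vectors and removes the overlapping component from a new direction. Representing each basis as a list of column vectors, and the function names, are assumptions for illustration only.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def overlap_penalty(basis_a, basis_b):
    """Squared Frobenius norm of A^T B, with each basis given as column vectors."""
    return sum(dot(a, b) ** 2 for a in basis_a for b in basis_b)

def project_out(vector, orthonormal_basis):
    """Remove the components of `vector` that lie in the span of an orthonormal basis."""
    residual = list(vector)
    for q in orthonormal_basis:
        coeff = dot(residual, q)
        residual = [r - coeff * qi for r, qi in zip(residual, q)]
    return residual

# Basis learned for an earlier concept, and a candidate direction for a new one.
old_basis = [[1.0, 0.0, 0.0]]
candidate = [1.0, 1.0, 0.0]

print(overlap_penalty(old_basis, [candidate]))
cleaned = project_out(candidate, old_basis)
print(cleaned, overlap_penalty(old_basis, [cleaned]))
```

A penalty of zero means updates for the new concept cannot change the model's response along directions already used by earlier concepts, which is the interference the method aims to limit.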
Results
The experiments conducted on multi-concept image generation reveal that SeqLoRA outperforms existing methods in terms of identity preservation and scalability, successfully generating coherent images with up to 101 concepts while avoiding the pitfalls of attribute entanglement.
Implications
SeqLoRA can make text-to-image diffusion systems more personalized and modular. It enables users to create complex visual narratives by combining multiple concepts seamlessly, which could benefit artistic creation, media production, and data augmentation.
VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation
Large Language Models
Theory
Optimization
- VeriScale is a framework that systematically expands and reduces test suites for verifiable code generation using adversarial implementations.
- The framework significantly increases the diversity of test cases, expanding the original test suites by a factor of more than 83.
- Experiments show that existing benchmarks overestimate model capabilities, as demonstrated by sharp performance drops on new test suites.
- VERINALITE offers a lightweight alternative that preserves discriminative power while reducing evaluation costs.
Summary
The paper introduces VeriScale, a novel framework aimed at enhancing the evaluation of verifiable code generation by addressing the limitations of existing benchmarks. Current benchmarks often lack sufficient positive and negative test cases, leading to an inflated perception of model capabilities in generating specifications and implementations. VeriScale operates in two stages: first, it expands test suites by generating diverse and challenging test cases through adversarial implementations; second, it reduces these test cases into a compact, discriminative suite. The framework was instantiated on VERINA, resulting in the creation of VERINAPLUS, which expands the original test suites by over 83 times, and VERINALITE, a lightweight variant that maintains discriminative power at a lower evaluation cost. Experiments with state-of-the-art large language models (LLMs) reveal that VERINAPLUS exposes significant weaknesses in model performance, with notable score drops on SpecGen and CodeGen tasks, while VERINALITE retains effectiveness with reduced resources. The findings underscore the importance of adversarial test scaling in accurately assessing model capabilities in verifiable code generation.
Methodology
VeriScale employs a two-stage approach: test-suite expansion and reduction. It generates a large pool of candidate inputs using LLM-based seed generation and type-aware mutation, classifying them into expected and unexpected outputs. Adversarial implementations are synthesized to expose weaknesses in LLM-generated specifications, and a reduction strategy is applied to distill the expanded test cases into a compact suite without losing discriminative power.
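The reduction stage can be pictured as a set-cover problem: keep a small set of tests that still rejects every adversarial implementation the full suite rejects. The greedy heuristic below is a standard approach to set cover and is offered only as a sketch; the summary does not state which reduction strategy VeriScale uses, and the test and implementation names are made up.

```python
def reduce_suite(kills):
    """Greedy set cover over a map of test name -> set of implementations it rejects."""
    remaining = set().union(*kills.values()) if kills else set()
    kept = []
    while remaining:
        # Sort names first so ties are broken deterministically.
        best = max(sorted(kills), key=lambda t: len(kills[t] & remaining))
        kept.append(best)
        remaining -= kills[best]
    return kept

# Hypothetical results of running three tests against adversarial implementations.
kills = {
    "t1": {"adv_a", "adv_b"},
    "t2": {"adv_b", "adv_c"},
    "t3": {"adv_c"},
}
print(reduce_suite(kills))
```

Every adversarial implementation rejected by the full suite is still rejected by the reduced one, which is one concrete reading of "without losing discriminative power".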
Results
The application of VeriScale to VERINA resulted in the creation of VERINAPLUS and VERINALITE benchmarks. Notably, the performance of GPT-5.5 dropped significantly on VERINAPLUS, indicating that the original benchmarks overestimated model capabilities. The performance on VERINALITE closely matched that of VERINAPLUS, confirming the effectiveness of the reduction strategy.
Implications
The findings suggest that adversarial test scaling is crucial for accurately evaluating the capabilities of LLMs in verifiable code generation. This approach can lead to more reliable benchmarks, ultimately enhancing the trustworthiness of LLM-generated software in high-stakes applications.
The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning
Theory
- The Neural Compiler translates symbolic physics programs into differentiable PyTorch modules, ensuring exact computation and gradients.
- Compiled models achieve superior parameter recovery and lower error rates compared to traditional methods like PINNs and neural ODEs.
- The system supports systematic composability, allowing for easy chaining and recombination of compiled modules.
- The compiler provides formal guarantees of correctness and error bounds, enhancing reliability in scientific machine learning applications.
Summary
The paper introduces The Neural Compiler, a novel system designed to translate programs written in a first-order expression language with Scheme syntax into frozen, differentiable PyTorch modules. This approach addresses the challenge in scientific machine learning of integrating known physics with unknown components, where traditional methods either discard known structures or require extensive manual coding. The Neural Compiler ensures exact computation and gradients, producing outputs that match the source program with floating-point precision and contributing zero approximation error within its safe domain. The system supports 51 primitive operations, allowing for complex operations like vector and matrix algebra, and demonstrates systematic composability, enabling the chaining and recombination of compiled modules without additional coding. The evaluation across six experiments, including algebraic equations and differential equations, shows that compiled models outperform traditional methods in parameter recovery and error accumulation, particularly in complex scenarios. The paper highlights the potential of using large language models to translate physics descriptions into compilable programs, paving the way for automated scientific model generation.
Methodology
The Neural Compiler translates first-order arithmetic expressions into PyTorch modules, ensuring exact computation and gradients. It evaluates its performance across various scientific domains, comparing against baseline methods like hand-coded PyTorch implementations and physics-informed neural networks.
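To make the program-to-callable idea concrete, here is a minimal sketch that compiles a Scheme-syntax arithmetic expression into a Python closure. It is an illustration only: the primitive table, the function names, and the example program are assumptions, and the sketch stops short of what the paper describes, since it emits plain Python callables rather than frozen, differentiable PyTorch modules.

```python
import math

# Hypothetical primitive table; the paper's compiler supports 51 primitives.
PRIMITIVES = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "sin": math.sin,
}


def parse(tokens):
    """Turn a token list into nested Python lists (consumes tokens in place)."""
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    try:
        return float(token)
    except ValueError:
        return token  # a variable or primitive name


def compile_expr(expr):
    """Translate a parsed expression into a closure over an environment dict."""
    if isinstance(expr, float):
        return lambda env: expr
    if isinstance(expr, str):
        return lambda env: env[expr]
    op = PRIMITIVES[expr[0]]
    args = [compile_expr(arg) for arg in expr[1:]]
    return lambda env: op(*(arg(env) for arg in args))


def compile_program(source):
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()
    return compile_expr(parse(tokens))


# Example program (an assumption): kinetic energy 0.5 * m * v^2.
kinetic_energy = compile_program("(* 0.5 (* m (* v v)))")
energy = kinetic_energy({"m": 2.0, "v": 3.0})
```

Because the closure evaluates the same arithmetic as the source expression, it adds no approximation error of its own, which is the property the summary attributes to compiled modules.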
Results
The experiments confirm that compiled models produce numerically identical results to hand-coded implementations, recover physical constants with less than 1% error using only a few trainable parameters, and maintain zero error in compositional settings, while traditional neural approximations accumulate significant errors.
Implications
The Neural Compiler could significantly streamline the integration of known physics into machine learning models, reducing the need for manual coding and enhancing the reliability of scientific predictions. Its potential integration with large language models may further facilitate the automatic generation of complex scientific models from natural language descriptions.
Disentanglement Beyond Generative Models with Riemannian ICA
Theory
Interpretability
Generative Models
- Introduces Riemannian ICA (RICA) as a local geometric alternative to traditional ICA.
- Defines pointwise disentanglement and introduces the disentanglement tensor.
- Demonstrates that RICA can recover sources effectively across different manifolds.
- Provides a theoretical basis for interpreting features learned by modern pretrained models.
Summary
This paper addresses the gap between the theoretical foundations of disentanglement and modern representation learning practices. Traditional methods like Independent Component Analysis (ICA) rely on generative models with independent latent variables, which can be limiting in practical applications. The author introduces Riemannian ICA (RICA), a novel approach that shifts the focus from a global generative model to local geometric structures. RICA utilizes Riemannian geometry to interpret factors of variation as radial curves emanating from data points. The main contribution is the disentanglement tensor, which captures a second-order notion of disentanglement termed pointwise disentanglement. This tensor is derived from the Hessian of the data log likelihood and the Ricci curvature of the model. In experiments, RICA demonstrates effective source recovery across various manifolds, outperforming traditional ICA methods that depend heavily on the choice of coordinates. This work lays the groundwork for understanding local disentanglement without the constraints of a global generative framework.
Methodology
The paper employs Riemannian geometry to redefine the factors of variation in data as local geometric structures. It introduces the disentanglement tensor, which is computed using the Hessian of the data log likelihood and Ricci curvature. The methodology includes diagonalizing this tensor to assess pointwise disentanglement, validated through a controlled source recovery setting.
Results
RICA successfully recovers sources in a controlled setting, showing robustness across various manifolds. The results indicate that RICA's performance is less dependent on coordinate representation compared to traditional ICA methods, which often struggle with rigid modeling assumptions.
Implications
The findings suggest that RICA could enhance the interpretability and utility of features learned by modern machine learning models, particularly in applications requiring disentangled representations, such as controllable editing and scientific analysis. This approach may also inspire new methods for representation learning that do not rely on strict generative assumptions.
Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series
Time Series
- PDFTime decouples temporal representation learning from decision-making, enhancing interpretability.
- The framework employs a hierarchical organization of prototypes for structured, similarity-driven inference.
- PDFTime achieves state-of-the-art results on 80 out of 128 datasets in the UCR archive.
- The proposed method significantly improves generalization capabilities compared to traditional TSC models.
Summary
The paper introduces PDFTime, a novel framework for Time Series Classification (TSC) that addresses the challenges of generalization and interpretability in existing models. Traditional TSC approaches often collapse high-dimensional temporal embeddings into class logits through a single linear projection, which obscures the decision-making process and limits interpretability. PDFTime reformulates TSC as a multi-stage decision process that utilizes learned prototypes to approximate class-conditional feature distributions in the latent space. This decoupling allows for progressive discrimination through classification sub-tasks of varying granularity. The framework not only enhances the model's generalization capabilities but also provides a more interpretable reasoning process by structuring the latent space into class-conditional prototype clusters. Extensive evaluations demonstrate that PDFTime achieves state-of-the-art performance across multiple benchmarks, significantly outperforming existing methods in both consistency and generalization.
Methodology
PDFTime utilizes a prototype-guided approach to reformulate time series classification as a multi-stage decision process. It employs a novel prototype-based classification head that organizes prototypes hierarchically, allowing for progressive refinement of classification tasks. This method contrasts with traditional direct feature-to-label mapping by structuring the latent space into class-conditional clusters, facilitating a more interpretable decision-making process.
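A rough sketch of coarse-to-fine prototype inference follows. It is not PDFTime's learned classification head: the two-level hierarchy, the Euclidean distance, and all prototype values are assumptions chosen to show how similarity to prototypes can replace a single linear projection.

```python
import math


def nearest(point, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(sorted(prototypes), key=lambda label: math.dist(point, prototypes[label]))


def classify(embedding, coarse_protos, fine_protos):
    """Two-stage decision: pick a coarse group, then a class within that group."""
    group = nearest(embedding, coarse_protos)
    return group, nearest(embedding, fine_protos[group])


# Made-up 2-D prototypes for illustration only.
coarse = {"periodic": (0.0, 0.0), "trend": (10.0, 10.0)}
fine = {
    "periodic": {"sine": (-1.0, 0.0), "square": (1.0, 0.0)},
    "trend": {"up": (10.0, 11.0), "down": (10.0, 9.0)},
}
first = classify((0.8, 0.1), coarse, fine)
second = classify((9.5, 10.6), coarse, fine)
```

Each stage returns the prototype it matched, so the path from embedding to label can be inspected step by step.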
Results
PDFTime achieved state-of-the-art performance on 80 out of 128 datasets in the UCR archive, demonstrating superior generalization capabilities and consistency compared to recent competitive baselines. The framework sets a new performance benchmark for the community in time series classification.
Implications
The proposed framework has significant implications for various applications of time series classification, including healthcare, human action recognition, and IoT, where interpretability and generalization are critical. By providing a more transparent decision-making process, PDFTime can enhance trust and usability in real-world applications.
Boundary-targeted Membership Inference Attacks on Safety Classifiers
NLP
Large Language Models
Theory
- Introduces a boundary-targeted selection strategy for membership inference attacks on safety classifiers.
- Demonstrates that low-confidence examples are more revealing for MIAs than high-confidence examples.
- Achieves a 19% recovery rate of flagged conversations at a 5% false-positive rate, significantly outperforming existing methods.
- Shows that content-based filtering is ineffective for protecting against MIAs in safety classifiers.
Summary
This paper addresses the privacy concerns surrounding safety classifiers used in generative AI systems, particularly in the context of membership inference attacks (MIAs). Safety classifiers play a crucial role in filtering harmful content and identifying at-risk users, yet they are trained on sensitive datasets, raising the risk of exposing personal information. The authors propose a novel boundary-targeted selection strategy that focuses on low-confidence examples, which are hypothesized to be more informative for adversaries attempting to infer membership. By analyzing these boundary cases, the study demonstrates that adversaries can recover a significant portion of flagged conversations indicating user distress, outperforming existing MIA methods. The findings reveal that traditional content-based filtering is ineffective as a defense, while noise-injection strategies can mitigate the vulnerability of these examples. This work represents the first systematic examination of MIAs against safety classifiers, highlighting the need for improved privacy protections in AI systems.
Methodology
The authors developed a boundary-targeted selection strategy that isolates membership evaluation to examples near the classifier's decision boundary. They evaluated this approach using two MIA scoring functions, four language models, and five datasets related to toxicity detection and mental health classification, comparing the effectiveness of their method against standard MIA techniques.
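The selection step and the reported metric can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the field names `prob`, `score`, and `member`, the `margin` parameter, and the toy records are all hypothetical.

```python
def boundary_subset(examples, margin=0.1):
    """Keep examples whose classifier probability lies within `margin` of 0.5."""
    return [ex for ex in examples if abs(ex["prob"] - 0.5) <= margin]


def tpr_at_fpr(examples, max_fpr=0.05):
    """True-positive rate at the strictest threshold whose FPR stays <= max_fpr."""
    members = [ex["score"] for ex in examples if ex["member"]]
    non_members = sorted((ex["score"] for ex in examples if not ex["member"]), reverse=True)
    if not members or not non_members:
        return 0.0
    # Number of non-members we may wrongly flag without exceeding max_fpr.
    allowed = min(int(max_fpr * len(non_members)), len(non_members) - 1)
    # Flag only scores strictly above the (allowed + 1)-th highest non-member.
    threshold = non_members[allowed]
    return sum(score > threshold for score in members) / len(members)


# Made-up records: `prob` is the safety classifier's output, `score` the
# membership-inference score, `member` the ground truth.
examples = [
    {"prob": 0.55, "score": 0.90, "member": True},
    {"prob": 0.45, "score": 0.80, "member": True},
    {"prob": 0.52, "score": 0.30, "member": False},
    {"prob": 0.48, "score": 0.20, "member": False},
    {"prob": 0.99, "score": 0.10, "member": True},
    {"prob": 0.01, "score": 0.95, "member": False},
]
near_boundary = boundary_subset(examples)
tpr_boundary = tpr_at_fpr(near_boundary)
tpr_all = tpr_at_fpr(examples)
```

In this toy data a confidently classified non-member with a high attack score forces a strict threshold on the full set, while the boundary subset avoids it. The numbers are contrived and say nothing about the paper's measured rates.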
Results
The experimental results indicated that the boundary-targeted approach significantly improved the effectiveness of membership inference attacks, allowing adversaries to recover 19% of conversations flagged by the safety classifier at a 5% false-positive rate. This performance was 3.5 times better than that achieved using state-of-the-art MIA methods alone.
Implications
The findings underscore the vulnerabilities of safety classifiers in generative AI systems, suggesting that privacy protections need to be enhanced. The proposed boundary-targeted strategy could inform future research on MIA defenses and the design of safer AI systems that handle sensitive user data.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals
Reinforcement Learning
Large Language Models
Optimization
- Identified hard clipping as a major source of instability in RLVR training.
- Proposed Near-boundary Stochastic Rescue (NSR) to recover lost signals near the clipping threshold.
- NSR improves training stability and convergence without altering the core optimization algorithm.
- Extensive experiments show NSR consistently outperforms strong baselines across various model sizes.
Summary
This paper addresses the challenges of training stability and convergence in Reinforcement Learning with Verifiable Rewards (RLVR), particularly focusing on the limitations imposed by hard clipping mechanisms in GRPO-style objectives. The authors identify that the hard clipping decision leads to the loss of informative signals that lie just beyond the clipping threshold, which contributes to training instability. To mitigate this issue, they propose a novel approach called Near-boundary Stochastic Rescue (NSR). NSR stochastically retains tokens that fall slightly outside the clipping boundary, transforming the rigid binary decision-making process into a probabilistic one. This allows the model to recover valuable learning signals that would otherwise be discarded. The authors validate NSR through extensive experiments across various model sizes and architectures, demonstrating that it significantly enhances training stability and outperforms existing baselines like DAPO and GSPO. The findings suggest that addressing the clipping decision is crucial for improving the robustness of RLVR setups.
Methodology
The authors conducted a systematic analysis of GRPO-style clipping objectives to diagnose the issues caused by hard clipping. They introduced NSR, which employs stochastic sampling to retain tokens that are slightly out-of-bound, thus preserving informative gradients. The effectiveness of NSR was validated through controlled experiments and extensive evaluations on models ranging from 7B to 30B parameters.
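One way to picture the move from a hard clipping decision to a probabilistic one is sketched below. The linear ramp, the `delta` band width, and the function names are assumptions; the summary does not give NSR's exact retention rule, so treat this as an illustration of the idea rather than the method itself.

```python
import random


def keep_probability(ratio, advantage, eps=0.2, delta=0.05):
    """Probability that a token's gradient is kept.

    Inside the clip range the token is always kept. Beyond the boundary by
    `delta` or more it is always dropped, as with hard clipping. In between,
    the keep probability falls linearly from 1 to 0 (an assumed schedule).
    """
    if advantage >= 0:
        overshoot = ratio - (1 + eps)   # upper bound matters for positive advantage
    else:
        overshoot = (1 - eps) - ratio   # lower bound matters for negative advantage
    if overshoot <= 0:
        return 1.0
    if overshoot >= delta:
        return 0.0
    return 1.0 - overshoot / delta


def rescue_mask(ratios, advantages, rng, eps=0.2, delta=0.05):
    """Sample a keep/drop decision per token from its keep probability."""
    return [
        rng.random() < keep_probability(r, a, eps, delta)
        for r, a in zip(ratios, advantages)
    ]


mask = rescue_mask([1.0, 1.3], [1.0, 1.0], random.Random(0))
```

Setting `delta` to a very small value approaches ordinary hard clipping, which makes the sketch easy to compare against a baseline.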
Results
NSR demonstrated substantial improvements in training stability and performance across various architectures, yielding consistent gains over strong baselines such as DAPO and GSPO. The stochastic mechanism of NSR was shown to be more effective than deterministic gradient decay methods, highlighting the importance of probabilistic decision-making near the clipping boundary.
Implications
The findings suggest that refining the clipping mechanism in RLVR can lead to more stable and effective training processes for large language models. This has potential applications in enhancing the performance of RLVR systems in complex reasoning tasks, such as mathematics and coding.
Toward Understanding Adversarial Distillation: Why Robust Teachers Fail
Theory
- Identifies the 'Robustly Unlearnable Set' as a key factor in the failure of robust teachers in Adversarial Distillation.
- Develops a theoretical framework explaining how teacher confidence on unlearnable samples leads to robust overfitting.
- Validates the theory empirically through experiments on synthetic and real-world datasets.
- Proposes predictive entropy on unlearnable samples as a criterion for selecting effective robust teachers.
Summary
This paper investigates the inconsistencies in Adversarial Distillation (AD), a method designed to enhance the robustness of student models by leveraging the soft labels from robust teacher models within a min-max adversarial training framework. The authors identify a critical issue: the misalignment between the teacher's supervisory confidence and the student's representational limitations, particularly concerning a subset of training data termed the 'Robustly Unlearnable Set.' Through a theoretical framework analyzing the dynamics of a two-layer neural network, the authors demonstrate that when a teacher provides confident supervision on unlearnable samples, it leads the student to memorize irrelevant noise patterns, resulting in robust overfitting. Conversely, when the teacher exhibits uncertainty on these samples, the student can focus on learnable signals, improving robust generalization. The paper empirically validates this theory through simulations and real-image classification datasets, establishing that a teacher's predictive entropy on unlearnable samples is a strong indicator of student robustness. This work not only clarifies the mechanisms behind the teacher-student relationship in AD but also offers practical guidelines for selecting effective robust teachers.
Methodology
The authors present a theoretical analysis of feature learning dynamics in a two-layer neural network and conduct extensive empirical experiments on both synthetic and real datasets to validate their findings regarding teacher-student interactions in adversarial training.
Results
The study confirms that robust overfitting is significantly influenced by how the teacher supervises unlearnable samples. It establishes a clear relationship between the teacher's predictive entropy on these samples and the robustness of the student model, providing a principled criterion for robust teacher selection.
Implications
The findings have significant implications for improving the robustness of machine learning models in adversarial settings, particularly in safety-critical applications. The proposed guidelines for teacher selection can enhance the effectiveness of Adversarial Distillation, leading to better generalization in adversarial environments.
From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal k-Sparse GLMs
Optimization
Efficient ML
Theory
- Introduction of a hybrid CPU-GPU framework for optimizing k-sparse GLMs.
- Development of GPU-efficient routines and a padding strategy for handling irregular data structures.
- Demonstration of significant speedups in runtime and optimality certification on complex instances.
- Ability to extend the framework for Rashomon set collection, facilitating model comparison and selection.
Summary
This paper addresses the challenges of optimizing cardinality-constrained generalized linear models (GLMs) using GPUs, which have been effective for continuous optimization but struggle with discrete variables and combinatorial structures. The authors propose a novel CPU-GPU framework that processes multiple branch and bound (BnB) nodes in batches on GPUs, overcoming the limitations of sequential node processing and frequent CPU-GPU data transfers. The framework includes GPU-efficient routines and a padding strategy to manage irregular node data structures. Key contributions include a hybrid CPU-GPU architecture, efficient GPU routines for multi-node BnB computation, and a method for selecting support variables directly from relaxed coefficient values. Experimental results demonstrate significant speedups of one to two orders of magnitude and zero optimality gap on challenging instances, showcasing the framework's potential for collecting the Rashomon set of near-optimal models for further statistical analysis.
Methodology
The authors developed a modular CPU-GPU framework where the CPU manages the irregular logic of the BnB search tree, while the GPU handles batched numerical computations. They introduced GPU routines for matrix operations and a padding strategy to standardize node data structures, allowing for efficient processing of multiple nodes simultaneously. The framework also includes a method for selecting branching variables directly from the relaxed coefficients, minimizing the need for CPU-side operations.
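The padding idea can be sketched without a GPU. In the sketch below, each branch-and-bound node carries a support of a different size, and padding with a sentinel plus a mask gives every node the same width so a batch can be processed as one rectangular array. The function names, the `-1` sentinel, and the largest-magnitude branching rule are assumptions, not details confirmed by the summary.

```python
def pad_batch(node_supports, pad_index=-1):
    """Pad variable-length supports to a common width and build a validity mask."""
    width = max((len(s) for s in node_supports), default=0)
    indices = [list(s) + [pad_index] * (width - len(s)) for s in node_supports]
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in node_supports]
    return indices, mask


def pick_branch_variable(relaxed_coefs, fixed):
    """Assumed rule: branch on the free variable with the largest |coefficient|."""
    free = [j for j in range(len(relaxed_coefs)) if j not in fixed]
    return max(free, key=lambda j: abs(relaxed_coefs[j]))


indices, mask = pad_batch([[0, 3], [1], [2, 4, 5]])
branch_var = pick_branch_variable([0.1, -2.0, 0.7], fixed={1})
```

The mask lets batched routines ignore padded slots, so irregular nodes do not need separate kernel launches.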
Results
The proposed framework achieved speedups of one to two orders of magnitude compared to traditional single-node processing methods. It successfully certified optimality on difficult sparse GLM instances where previous methods left significant optimality gaps. Additionally, the framework was adapted to collect the Rashomon set, enabling comprehensive model analysis.
Implications
This work has significant implications for fields requiring high-performance optimization of sparse models, such as scientific research, finance, and medical applications. The ability to efficiently certify optimal solutions and collect the Rashomon set can enhance model interpretability and facilitate better decision-making based on variable importance and performance metrics.
Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection
Optimization
Theory
- The choice of evaluation objective critically influences the search dynamics and quality of feature subsets in multiobjective UFS.
- Silhouette-based formulations exhibit a bias toward trivial low-cardinality solutions, making them less effective for predictive performance.
- The proposed PCA reconstruction loss objective produces compact feature subsets with test accuracy comparable to supervised methods.
- Subset-size regularization and initial population strategies significantly shape the structure of the Pareto front.
Summary
This paper investigates the dynamics of multiobjective unsupervised feature selection (UFS) by analyzing how different evaluation objectives and subset-size regularization strategies influence the search process and the quality of selected feature subsets. The authors conduct experiments using a synthetic dataset designed with known feature types—informative, redundant, and irrelevant—to explore the impact of various formulations on the Pareto front. They compare three evaluation objectives: accuracy, silhouette score, and PCA reconstruction loss, combined with subset-size minimization or maximization. The findings reveal that the choice of objective significantly affects both the search dynamics and the resulting Pareto front's quality. Specifically, silhouette-based formulations tend to favor low-cardinality solutions and poorly correlate with predictive performance, while the PCA loss objective yields compact subsets with competitive test accuracy. The study emphasizes the importance of objective design in effective multiobjective UFS and provides insights into how different objectives select feature types, highlighting the need for careful consideration of evaluation criteria in feature selection tasks.
Methodology
The authors employed a controlled experimental framework using a synthetic dataset that includes informative, redundant, and irrelevant features. They compared six formulations of multiobjective UFS by varying evaluation objectives (accuracy, silhouette score, PCA reconstruction loss) and subset-size regularization strategies (minimization vs. maximization). The performance of these formulations was assessed based on their impact on the Pareto front and the composition of selected feature subsets.
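Since the formulations are compared through their Pareto fronts, a small sketch of a non-dominated filter may help. It assumes each candidate subset is summarized as a hypothetical `(loss, size)` pair with both objectives minimized; the values are made up.

```python
def pareto_front(candidates):
    """Return the candidates not dominated by any other (both objectives minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse on both objectives and better on one.
        return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Each tuple is (evaluation loss, subset size) for one candidate feature subset.
subsets = [(0.30, 2), (0.10, 5), (0.12, 5), (0.25, 3), (0.40, 3)]
front = pareto_front(subsets)
```

For the size-maximization variants mentioned above, the same filter applies after negating the size component.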
Results
The study found that different formulations lead to varying search dynamics and quality of the Pareto front. Silhouette-based objectives were shown to favor low-cardinality solutions, while PCA loss objectives produced more compact subsets with competitive accuracy. The analysis also revealed how subset-size regularization and initial population sampling influence the exploration of the search space and the attainable trade-offs.
Implications
The findings underscore the necessity of careful objective design in multiobjective unsupervised feature selection, which could enhance the effectiveness of feature selection methods in practical applications. The insights gained from the study can inform the development of more robust UFS algorithms that better balance subset quality and size, potentially improving model performance in high-dimensional datasets.
Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins
Computer Vision
- Introduction of the Geometry-Optimised Epistemic Network (GOEN) for OOD detection.
- CenterLoss, a popular regularizer, is shown to hinder OOD detection performance.
- GOEN-NoCenterLoss achieves state-of-the-art OOD AUROC on CIFAR-10 benchmarks.
- The study emphasizes the importance of feature geometry in relation to epistemic uncertainty.
Summary
This paper addresses the critical issue of out-of-distribution (OOD) detection in machine learning systems, which is essential for their safe deployment in real-world applications. The author introduces the Geometry-Optimised Epistemic Network (GOEN), a novel pipeline that integrates multi-scale features, L2 normalization, Mahalanobis distance, and a calibration head trained on challenging OOD examples. A key finding of the study is that the commonly used CenterLoss regularizer, which aims to enhance feature compactness, actually degrades OOD detection performance, lowering the average OOD AUROC from 0.9483 to 0.9366, despite improving classification accuracy. The GOEN-NoCenterLoss variant achieves an impressive average OOD AUROC of 0.9483, outperforming established methods such as deep ensembles, KNN, and ODIN on CIFAR-10 benchmarks while maintaining competitive in-distribution accuracy. This research challenges the assumption that optimizing classification geometry inherently improves epistemic uncertainty and highlights the importance of maintaining inter-class margins and covariance structures for effective OOD detection. The GOEN pipeline is efficient, requiring less than 20 minutes of training on a single GPU, and serves as a practical framework for developing AI systems capable of recognizing their limitations.
Methodology
The methodology involves training a multi-scale ResNet-18 backbone without CenterLoss, fitting class-conditional Mahalanobis densities on L2-normalized features, and training a lightweight calibration head using a mix of in-distribution and hard OOD examples. Systematic ablation studies were conducted to evaluate the impact of various components.
Results
The GOEN-NoCenterLoss variant achieved an average OOD AUROC of 0.9483, outperforming several baseline methods including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870). The introduction of CenterLoss was found to decrease OOD detection performance despite improvements in classification accuracy.
Implications
The findings suggest that machine learning practitioners should reconsider the use of feature compactness regularizers like CenterLoss when developing models for OOD detection. The GOEN framework can be applied in various domains requiring reliable OOD detection, such as autonomous driving and medical diagnosis.
What are the Right Symmetries for Formal Theorem Proving?
Theory
Large Language Models
- Introduces rewriting categories as a framework for modeling transformations in formal theorem proving.
- Formalizes two symmetry concepts: proof equivariance and success invariance.
- Demonstrates that LLM-based provers exhibit significant performance variability across equivalent formulations.
- Proposes a model-agnostic test-time procedure to improve robustness and proof success rates.
Summary
This paper addresses the sensitivity of formal theorem provers based on large language models (LLMs) to variations in problem representation, which can lead to significant differences in proof success rates for semantically equivalent statements. The authors introduce a category-theoretic framework called rewriting categories to formalize two key symmetry notions: proof equivariance and success invariance. They demonstrate that while state-based next-tactic provers naturally satisfy proof equivariance, LLM-based provers do not, leading to performance variability. To address this issue, the authors propose test-time methods that aggregate over equivalent rewritings of input statements, theoretically proving that this approach can recover success invariance in the sampling limit. Empirical results show that these methods enhance robustness and performance under fixed inference budgets, highlighting the importance of symmetry as an inductive bias in LLM-based theorem proving and suggesting practical strategies for its approximation.
Methodology
The authors develop a category-theoretic framework called rewriting categories to model transformations between theorem statements. They analyze the performance of LLM-based provers on a benchmark of semantically equivalent reformulations and propose a test-time aggregation method to improve success invariance.
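A minimal sketch of test-time aggregation under a fixed budget is shown below. The `prover` callable, the round-robin schedule, and the toy rewrite are all hypothetical; a real system would also need to transport a proof of a rewritten statement back to the original one, which this sketch does not attempt.

```python
def prove_with_rewrites(statement, rewrites, prover, budget):
    """Spread a fixed attempt budget across equivalent forms of a statement.

    Returns (form, proof) for the first form the prover closes, else None.
    """
    forms = [statement] + [rewrite(statement) for rewrite in rewrites]
    for attempt in range(budget):
        form = forms[attempt % len(forms)]  # round-robin over equivalent forms
        proof = prover(form)
        if proof is not None:
            return form, proof
    return None


def swap_sides(statement):
    """Toy rewrite: exchange the two sides of an equation."""
    lhs, rhs = statement.split(" = ")
    return f"{rhs} = {lhs}"


def toy_prover(statement):
    """Stand-in for an LLM prover that only succeeds on one surface form."""
    return "by add_comm" if statement.startswith("a + b") else None


with_rewrites = prove_with_rewrites("b + a = a + b", [swap_sides], toy_prover, budget=4)
without_rewrites = prove_with_rewrites("b + a = a + b", [], toy_prover, budget=4)
```

The toy prover is deliberately sensitive to surface form, mirroring the variability the paper reports; aggregation succeeds where the single formulation fails with the same budget.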
Results
The study finds that LLM-based provers do not satisfy the properties of proof equivariance and success invariance, exhibiting large performance variations. The proposed test-time aggregation method recovers success invariance in the sampling limit and empirically improves robustness and performance under fixed inference budgets.
Implications
The findings suggest that incorporating symmetry as an inductive bias could enhance the performance of LLM-based theorem provers. The proposed methods may lead to more reliable and robust formal theorem proving systems, with potential applications in automated reasoning and formal verification.
Value-Gradient Hypothesis of RL for LLMs
Large Language Models
Reinforcement Learning
Theory
- Critic-free RL methods like PPO and GRPO can effectively leverage value-gradient-like signals during training.
- The actor update in these methods propagates costates that approximate the value gradient, enabling effective credit assignment.
- Empirical costates in discrete transformers approximate the continuous value gradient signal, with controlled error.
- The paper introduces a predictive decomposition of RL impact that aids in checkpoint selection during LLM pretraining.
Summary
This paper investigates the effectiveness of critic-free reinforcement learning (RL) methods, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), in the context of post-training large language models (LLMs). The authors propose a value-gradient perspective that explains why these methods perform well despite classical RL theory suggesting they should struggle with long-horizon credit assignment. They demonstrate that the actor update in these methods behaves like a value-gradient in expectation, particularly under a differentiable rollout and additive noise parameterization. Additionally, they show that for discrete transformer policies, the empirical costates derived from autodifferentiation approximate the value gradient signal, with the error influenced by sampling gaps and policy entropy. This leads to a decomposition of RL impact into usable value-gradient signals and reachable reward headroom, providing a criterion for identifying optimal points in the pretraining trajectory where RL is most beneficial. The findings offer practical insights into checkpoint selection during pretraining, suggesting that RL is most effective when checkpoints are close enough to the value-gradient regime while still allowing for reward improvement.
Methodology
The authors utilize a theoretical framework based on differentiable rollouts and additive noise parameterization to analyze the actor updates in PPO and GRPO. They derive relationships between empirical costates and value gradients in discrete transformer policies, employing autodifferentiation techniques to approximate these signals.
Results
The study confirms that the actor updates in critic-free RL methods are value-gradient-like in expectation. It also establishes that the empirical costates computed through autodifferentiation closely approximate the value gradient, with the accuracy of this approximation dependent on the sampling gap and policy entropy. Furthermore, the proposed RL-impact decomposition effectively predicts when RL will be most beneficial during the pretraining of LLMs.
Implications
The findings suggest that RL can be strategically applied to enhance LLM performance, particularly at specific checkpoints during pretraining. This could lead to more efficient training processes and improved model capabilities in reasoning and language tasks.
Holomorphic Neural ODEs with Kolmogorov-Arnold Networks for Interpretable Discovery of Complex Dynamics
Interpretability
- Introduces Holomorphic KAN-ODE, combining KANs with Neural ODEs for complex dynamics.
- Achieves high accuracy in modeling complex systems with significantly fewer parameters than MLPs.
- Demonstrates robustness to noise and effective transfer learning between different dynamical systems.
- Provides interpretable symbolic equations, enhancing understanding of the underlying dynamics.
Summary
This paper introduces Holomorphic KAN-ODE, a novel framework that integrates Kolmogorov-Arnold Networks (KANs) with Neural Ordinary Differential Equations (Neural ODEs) to model complex dynamical systems governed by holomorphic maps. Traditional Multi-Layer Perceptrons (MLPs) used in Neural ODEs do not respect the complex-analytic geometry and fail to enforce Cauchy-Riemann conditions, leading to opaque models that do not yield interpretable governing equations. The proposed framework replaces MLPs with KANs, which utilize learnable B-spline activations and incorporate Cauchy-Riemann equations as a differentiable regularization. The authors evaluate the framework on six families of complex dynamical systems, demonstrating its ability to reconstruct fractal boundaries, achieve high velocity-field R² scores, and recover symbolic governing equations. The results show that Holomorphic KAN-ODE is not only parameter-efficient but also exhibits superior noise resilience and transfer learning capabilities compared to MLPs, making it a promising approach for the interpretable discovery of complex dynamics.
Methodology
The methodology involves replacing the MLP in Neural ODEs with a Kolmogorov-Arnold Network (KAN) that uses learnable B-spline activations. The framework applies Cauchy-Riemann equations as a differentiable regularization to ensure the preservation of holomorphic properties during training. The evaluation is conducted across six families of complex dynamical systems, including both polynomial and transcendental classes.
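To make the Cauchy-Riemann regularizer concrete, here is a minimal sketch of the residual it would penalize. Writing f = u + iv and z = x + iy, a holomorphic map satisfies u_x = v_y and u_y = -v_x. The sketch estimates the partial derivatives with central finite differences as a stand-in for the automatic differentiation a trained model would use; the function names and step size `h` are assumptions for illustration.

```python
def cauchy_riemann_residual(f, z, h=1e-5):
    """Squared Cauchy-Riemann residual of f at z, via central differences."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)            # u_x + i v_x
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # u_y + i v_y
    u_x, v_x = df_dx.real, df_dx.imag
    u_y, v_y = df_dy.real, df_dy.imag
    # Holomorphy requires u_x = v_y and u_y = -v_x.
    return (u_x - v_y) ** 2 + (u_y + v_x) ** 2

def cr_penalty(f, points):
    """Mean residual over sample points; a regularizer would add a weighted copy to the loss."""
    return sum(cauchy_riemann_residual(f, z) for z in points) / len(points)

points = [0.3 + 0.2j, -0.5 + 0.1j, 0.1 - 0.7j]
holomorphic = cr_penalty(lambda z: z * z + 0.25, points)       # quadratic map
non_holomorphic = cr_penalty(lambda z: z.conjugate(), points)  # violates the conditions
```

The quadratic map yields a penalty that is zero up to rounding, while complex conjugation, which is nowhere holomorphic, yields a strictly positive one.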
Results
The Holomorphic KAN-ODE framework achieved velocity-field R² scores greater than 0.95 across all tested systems, successfully identified governing symbolic families, and reconstructed Julia set fractal boundaries with up to 98.0% agreement. Under 10% observation noise, its mean squared error degraded by only 4%, while the MLP baseline performed 15.2 times worse. Additionally, it demonstrated a 90.4% improvement in transfer learning from quadratic to cubic dynamics.
Implications
The findings suggest that Holomorphic KAN-ODE can serve as a powerful tool for the interpretable modeling of complex dynamical systems, with potential applications in fields such as fluid mechanics, chaos theory, and any domain requiring the understanding of complex dynamics. Its ability to provide symbolic representations of learned dynamics could facilitate further research and applications in physics-informed learning.
Implicit Regularization of Mini-Batch Training in Graph Neural Networks
Graph Learning
Optimization
Efficient ML
- Random Node Sampling (RNS) can outperform full-graph training in GNNs while being computationally efficient.
- Backward error analysis reveals that mini-batch SGD implicitly minimizes a modified objective that includes a gradient-variance regularizer.
- RNS produces lower-variance gradients compared to structure-aware samplers, leading to a more stable implicit objective.
- The method requires only a single hyperparameter, making it easy to implement in practice.
Summary
This paper investigates the mini-batch training of Graph Neural Networks (GNNs), highlighting the unique challenges posed by sampling subgraphs, which alters the topology and introduces boundary effects. The authors focus on Random Node Sampling (RNS), a simple yet effective method that samples nodes uniformly to create induced subgraphs for training. Surprisingly, RNS matches or outperforms full-graph training across 8 out of 10 datasets while significantly reducing wall-clock time and memory usage. The authors employ backward error analysis to demonstrate that mini-batch Stochastic Gradient Descent (SGD) implicitly minimizes a modified objective that combines the sampled loss with a regularizer based on the variance of mini-batch gradients. This analysis reframes the choice of graph sampler as a form of implicit regularization, positioning RNS as a theoretically grounded and practical approach for scalable GNN training. The findings suggest that despite its simplicity, RNS effectively captures the necessary statistical properties for training, making it a strong candidate for large-scale GNN applications.
Methodology
The authors applied backward error analysis to the mini-batch training of GNNs, specifically focusing on the effects of Random Node Sampling (RNS). They compared RNS with other sampling strategies across various datasets and GNN architectures to evaluate performance in terms of predictive accuracy, computational efficiency, and memory usage.
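Random Node Sampling as described here can be sketched in a few lines: draw a uniform sample of nodes, then keep only the edges whose endpoints were both sampled. The edge-list representation, the function name, and the toy graph below are assumptions for illustration, not the authors' implementation.

```python
import random

def random_node_subgraph(num_nodes, edges, batch_size, rng):
    """Sample nodes uniformly without replacement and return the induced subgraph."""
    nodes = sorted(rng.sample(range(num_nodes), batch_size))
    keep = set(nodes)
    # An edge survives only if both of its endpoints were sampled.
    induced = [(u, v) for (u, v) in edges if u in keep and v in keep]
    return nodes, induced

# Toy graph: a 6-cycle plus one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
nodes, induced = random_node_subgraph(6, edges, batch_size=4, rng=random.Random(0))
```

In this sketch the batch size is the only knob. Edges crossing the sample boundary are dropped, which is the topology change the summary refers to.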
Results
RNS matched or outperformed full-graph training on 8 out of 10 datasets, demonstrating its effectiveness across diverse graph sizes, densities, and homophily levels. The method provided 2× to 12× speedups in wall-clock time and up to 3× lower peak GPU memory usage compared to traditional full-graph training.
Implications
The findings suggest that RNS can serve as a robust and efficient alternative for training GNNs, particularly in scenarios involving large-scale graphs. This could lead to broader adoption of GNNs in real-world applications where computational resources are limited.
Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention
NLP
Large Language Models
Theory
- Introduction of Energy-Gated Attention (EGA) to improve transformer attention mechanisms.
- EGA leverages principles from turbulence theory to prioritize tokens based on their informational density.
- Achieved validation loss improvements of +0.103 on TinyShakespeare and +0.101 on Penn Treebank with minimal parameter overhead.
- Identified learned wavelet packets as a promising direction for optimizing energy gating.
Summary
This paper introduces Energy-Gated Attention (EGA), a novel modification to the standard transformer attention mechanism that incorporates spectral salience as an inductive bias. Traditional transformer attention treats all tokens equally, failing to account for their intrinsic informational content. Drawing parallels from turbulent fluid dynamics, where coherent structures dominate energy distribution, the author posits that certain tokens in language models (e.g., morphological boundaries and discourse markers) should attract more attention due to their higher informational density. EGA gates value aggregation based on the spectral energy of key token embeddings, computed through a learned linear projection that identifies the dominant spectral mode. The proposed method shows significant improvements in validation loss on datasets like TinyShakespeare and Penn Treebank, with minimal parameter overhead and no additional computational cost. The study also explores the effectiveness of different wavelet families for energy gating and identifies a stable linguistic property regarding the fraction of tokens with above-average spectral energy, suggesting that EGA can enhance long-context efficiency in future applications.
Methodology
The methodology involves modifying the standard transformer attention mechanism by introducing a learned energy gate that adjusts the attention weights based on the spectral energy of token embeddings. This is achieved through a linear projection that identifies the dominant spectral mode, allowing the model to suppress low-energy background tokens while amplifying high-energy informative tokens. The study also conducts systematic ablation studies across various wavelet families to assess their effectiveness in energy gating.
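One way to picture the gating step is the single-head sketch below. It scores each key's "energy" as the squared projection onto a direction `w_energy`, which stands in for the learned linear projection. It then turns energy relative to the sequence mean into a sigmoid gate and renormalizes the gated attention weights. The gate form, the renormalization, and all names are assumptions for illustration; the paper's exact formulation may differ.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def energy_gated_attention(queries, keys, values, w_energy):
    """Scaled dot-product attention with a per-key energy gate (illustrative form)."""
    d = len(keys[0])
    energies = [dot(w_energy, k) ** 2 for k in keys]  # energy along the projection
    mean_e = sum(energies) / len(energies)
    gates = [1.0 / (1.0 + math.exp(-(e - mean_e))) for e in energies]
    outputs, weights = [], []
    for q in queries:
        attn = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        gated = [a * g for a, g in zip(attn, gates)]  # damp low-energy keys
        total = sum(gated)
        row = [g / total for g in gated]              # renormalize to sum to 1
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = energy_gated_attention([[1.0, 1.0]], keys, values, w_energy=[2.0, 0.0])
```

Here the query scores both keys equally, so plain attention would split 50/50; the gate shifts weight toward the first key, whose energy is above the mean.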
Results
The implementation of EGA resulted in a validation loss improvement of +0.103 on the TinyShakespeare dataset and +0.101 on the Penn Treebank dataset, demonstrating consistent performance across different datasets and initializations. The method adds less than 0.26% to the parameter count and incurs no measurable computational cost. The study also reveals that the optimal energy direction is data-adaptive and non-sinusoidal, with a stable threshold of τ ≈ 0.35 for identifying salient tokens.
Implications
The findings suggest that incorporating spectral salience into transformer models can enhance their attention mechanisms, particularly in contexts with longer sequences. This could lead to more efficient language modeling and improved performance in various NLP tasks. The identification of learned wavelet packets as a promising direction opens avenues for further research in optimizing attention mechanisms.
Multiple Neural Operators Achieve Near-Optimal Rates for Multi-Task Learning
Theory
- Derivation of near-optimal approximation rates for MNO on Lipschitz multiple operator maps.
- Refinement of statistical learning rates for MNO, improving upon previously known rates.
- Establishment of lower complexity bounds for multiple operator learning, indicating intrinsic complexity barriers.
- Comparison of MNO with DeepONet, showing similar asymptotic rates in multi-task learning.
Summary
This paper investigates the approximation and statistical complexity of learning collections of operators in a shared multi-task setting, focusing on the Multiple Neural Operators (MNO) architecture. The authors derive near-optimal upper bounds for approximation and statistical generalization for broad classes of Lipschitz multiple operator maps. They establish a curse of parametric complexity and prove corresponding minimax rates, demonstrating that shared representations across tasks do not increase the overall cost, as multi-task operator learning follows the same scaling laws as single operator learning. The paper also compares MNO with a multi-task extension of DeepONet, showing that both architectures exhibit similar asymptotic rates from a worst-case approximation-complexity perspective. The authors present refined approximation-theoretic upper bounds and improved statistical learning rates, indicating that the complexity of multi-task operator learning aligns with single-task operator learning rates. Furthermore, they extend the lower-complexity framework to multiple operator learning, providing insights into the intrinsic barriers of the target class.
Methodology
The authors utilize a theoretical framework to analyze the approximation and statistical learning rates of the MNO architecture. They derive upper bounds through a refined error analysis and extend existing lower-complexity frameworks to the multi-task operator setting. The analysis includes the exploration of Lipschitz multiple operator maps and the construction of MNO approximators with explicit bounds on network parameters.
Results
The paper presents several key results: (1) Near-optimal constructive approximation rates for MNO, with explicit bounds on network complexity; (2) Improved statistical learning rates that align with single-task operator learning; (3) Lower complexity bounds for multiple operator learning, confirming that the derived upper bounds are close to sharp and indicating unavoidable parametric complexity barriers.
Implications
The findings suggest that multi-task operator learning can be effectively achieved without incurring additional complexity costs, which has implications for various applications in machine learning, including parameterized kernel operators and solution operators of parameterized PDEs. The results may influence the design of neural network architectures for multi-task learning scenarios.
Dynamic Mixture of Latent Memories for Self-Evolving Agents
NLP
Large Language Models
Generative Models
- Introduction of MoLEM, a dynamic mixture of latent memory framework for self-evolving agents.
- Utilization of multiple expert models to generate latent memory, avoiding catastrophic forgetting.
- Implementation of a task-ID-free domain-aware routing mechanism for enhanced adaptability.
- Significant improvement in accuracy over baseline models in continual learning settings.
Summary
This paper addresses the challenge of enabling intelligent agents to self-evolve by continually acquiring new knowledge across diverse tasks while avoiding catastrophic forgetting of previously learned abilities. The authors propose a novel framework called MoLEM (Mixture of Latent Memories), which utilizes a dynamic mixture-of-experts (MoE) architecture. In this framework, multiple expert models serve as independent memory carriers, and a router selects and weights these experts based on key-query matching to generate latent memory. This latent memory is then integrated into the reasoning process of a frozen base model, ensuring that the core reasoning capabilities remain intact while allowing for the internalization of new experiences. The framework employs lightweight autoencoders to facilitate task-ID-free domain-aware routing, enhancing flexibility and adaptability in continual learning scenarios. Experimental results demonstrate that MoLEM significantly improves average accuracy by 10.40% over a vanilla pretrained baseline across various task sequences in math, science, and code domains, while outperforming competing methods in preserving competence across different training orders.
Methodology
The MoLEM framework employs a dynamic mixture-of-experts architecture where multiple experts generate latent memory. A router selects relevant experts based on key-query matching, and a lightweight autoencoder is used to determine the most compatible routing group during inference. The base reasoning model remains frozen, allowing for the internalization of new knowledge without overwriting previous capabilities.
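The key-query routing step can be sketched as follows: score each expert by the dot product between its key and the query, keep the top-k, softmax-weight them, and mix their latent-memory vectors. The function names, the `top_k` parameter, and the toy vectors are hypothetical; the autoencoder-based choice of routing group is not modeled here.

```python
import math

def route_experts(query, expert_keys, top_k=2):
    """Pick top_k experts by key-query dot product and softmax-weight them."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in expert_keys]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    m = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

def mix_latent_memory(weights, expert_memories):
    """Weighted sum of the selected experts' latent-memory vectors."""
    dim = len(expert_memories[0])
    mixed = [0.0] * dim
    for i, w in weights.items():
        for j in range(dim):
            mixed[j] += w * expert_memories[i][j]
    return mixed

expert_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expert_memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = route_experts([2.0, 0.5], expert_keys, top_k=2)
latent_memory = mix_latent_memory(weights, expert_memories)
```

The mixed vector would then be handed to the frozen base model; unselected experts contribute nothing, which is what keeps their stored knowledge untouched.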
Results
The MoLEM framework achieved an average accuracy improvement of 10.40% over the vanilla pretrained baseline after a full continual-learning sequence. It consistently outperformed competing methods in terms of competence preservation across various training orders.
Implications
The proposed framework has significant implications for developing versatile intelligent agents capable of adapting to new tasks without losing previously acquired knowledge. This could enhance applications in fields requiring continual learning, such as robotics, natural language processing, and adaptive systems.
Three Costs of Amortizing Gaussian Process Inference with Neural Processes
Theory
Efficient ML
Generative Models
- Decomposes the KL divergence between GP and LNP into three interpretable components.
- Identifies label contamination, information bottleneck, and amortization error as key sources of approximation error.
- Provides upper bounds on the truncation component of the bottleneck term, linking it to kernel smoothness.
- Recommends architectural changes to improve predictive variance estimation in GP-amortization.
Summary
This paper explores the amortization of Gaussian Process (GP) inference using Neural Processes (NP), specifically focusing on Latent Neural Processes (LNP). The author identifies three main sources of error when approximating the exact GP posterior with a learned mapping: label contamination, an information bottleneck, and amortization error. The label contamination arises because the neural process uses label values to estimate a label-independent quantity, while the information bottleneck is due to the finite-dimensional representation that cannot fully capture the context geometry. The amortization error is introduced by using a single encoder network across all contexts. The paper provides bounds on the Kullback-Leibler (KL) divergence between the GP and LNP predictives, linking the representation dimension to kernel smoothness and input dimension. The results yield architectural recommendations for improving predictive variance estimation and suggest replacing mean aggregation with second-order pooling to mitigate the dominant amortization gap. Overall, the paper quantitatively characterizes the costs associated with amortization in the context of neural processes, offering insights into architectural choices for better performance.
Methodology
The paper employs a theoretical analysis of the KL divergence between the predictive distributions of Gaussian Processes and Latent Neural Processes. It decomposes this divergence into three components, providing bounds for each term based on the representation dimension and kernel properties. The analysis includes a focus on the architecture of neural processes and their implications for predictive performance.
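For intuition about the quantity being decomposed, the sketch below evaluates the closed-form KL divergence between two univariate Gaussians. Treating both the GP and LNP predictives at a single test input as Gaussians is an assumption made here for illustration; the three-way decomposition itself is the paper's contribution and is not reproduced.

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, var_p), q = N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# p plays the role of the exact GP predictive, q the amortized approximation.
mean_error_only = kl_gaussian(0.0, 1.0, 0.5, 1.0)      # right variance, shifted mean
variance_error_only = kl_gaussian(0.0, 1.0, 0.0, 0.5)  # right mean, overconfident variance
```

Both a wrong mean and a wrong variance cost KL, which is why the summary's recommendations single out predictive variance estimation.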
Results
The main results include a detailed decomposition of the KL divergence, revealing that the label contamination term is generally O(1) with a noise component decaying as O(1/n). The bottleneck truncation term decays exponentially for squared-exponential kernels and polynomially for Matérn kernels, establishing a direct relationship between representation dimension and kernel smoothness. The findings suggest that architectural choices significantly impact the performance of neural processes in approximating Gaussian processes.
Implications
The insights from this paper can guide the design of neural process architectures, particularly in applications requiring efficient and scalable GP inference. The recommendations for variance prediction and aggregation methods can enhance the performance of neural processes in real-time tasks, such as robotics and sequential experimental design.
ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data
Time Series
- Introduction of the ChronoMedicalWorld Model (CMWM) for predicting patient trajectories in chronic disease care.
- Integration of structured interventions and free-text communication as primary action inputs.
- Use of a six-term training objective with physiology-aware shape priors to enhance model stability and accuracy.
- Demonstrated superior performance in eGFR trajectory forecasting compared to a baseline model.
Summary
The paper introduces the ChronoMedicalWorld Model (CMWM), a novel framework designed to predict patient trajectories in chronic disease management using longitudinal care data. Unlike existing models that primarily focus on discriminative predictions, CMWM employs an action-conditioned latent world-model approach, which integrates both structured intervention indicators and free-text communication embeddings. The model is built on a joint-embedding state encoder and a wide action encoder, and it is trained using a six-term objective that includes next-observation supervision and physiology-aware shape priors. A closed-loop rollout-prefix protocol ensures that the model is optimized for multi-step predictions, addressing the common issue of drift seen in traditional models. The authors validate CMWM through a case study on annual estimated glomerular filtration rate (eGFR) forecasting in chronic kidney disease (CKD), demonstrating its effectiveness against a baseline model. The results indicate that CMWM outperforms a tuned GPT-5.5 structured-prompting baseline in terms of mean absolute error (MAE) and root-mean-square error (RMSE), particularly benefiting from the integration of patient-health-coach communication as a primary action input. This framework is not limited to CKD but is applicable to various chronic conditions, making it a versatile tool for long-term patient trajectory forecasting.
Methodology
The CMWM framework combines a joint-embedding state encoder with a wide action encoder to process both structured and unstructured data. It employs a six-term training objective that includes next-observation supervision and physiology-aware shape priors, and utilizes a closed-loop rollout-prefix protocol to align training with deployment conditions.
Results
In the case study on a 2,232-patient nephrology cohort, CMWM achieved a mean absolute error (MAE) of 7.384 and a root-mean-square error (RMSE) of 10.256 for eGFR forecasting, outperforming the tuned GPT-5.5 baseline, which had an MAE of 7.964 and an RMSE of 11.069. The gains were primarily attributed to the effective integration of communication data.
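To put the reported gap in relative terms, the short calculation below uses only the four error values quoted above; the helper name is an illustrative assumption.

```python
def relative_reduction(baseline, model):
    """Fractional error reduction of the model relative to the baseline."""
    return (baseline - model) / baseline

mae_gain = relative_reduction(7.964, 7.384)     # MAE: baseline vs. CMWM
rmse_gain = relative_reduction(11.069, 10.256)  # RMSE: baseline vs. CMWM
```

Both metrics improve by roughly 7% relative to the baseline.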
Implications
The CMWM framework has the potential to significantly enhance personalized chronic disease management by enabling accurate long-term trajectory forecasting and counterfactual intervention evaluation. Its adaptability to various chronic conditions could lead to broader applications in clinical settings.
Cross-Species RSA Reveals Conserved Early Visual Alignment but Divergent Higher-Area Rankings Across Human fMRI and Macaque Electrophysiology
Computer Vision
- Early visual alignment is conserved across human fMRI and macaque electrophysiology.
- Local learning rules (STDP, PC) outperform backpropagation in macaque V1/V2.
- No detectable correlation in higher-area (IT) learning rule rankings across species.
- Model capacity and training data richness significantly affect IT alignment.
Summary
This paper investigates the generalizability of learning rules and brain alignment across species by comparing human fMRI data with macaque electrophysiology. The study extends previous findings that untrained convolutional neural networks (CNNs) match backpropagation-trained ones in human V1 by testing five conditions against macaque data: backpropagation (BP), feedback alignment (FA), predictive coding (PC), spike-timing-dependent plasticity (STDP), and a random-weights baseline. Using Representational Similarity Analysis (RSA), the study finds that all models achieve higher alignment with macaque early visual cortex compared to human fMRI, indicating a stronger signal-to-noise ratio in electrophysiological data. Notably, STDP and PC yield the highest alignment scores in macaque V1/V2, while no correlation in learning rule rankings is observed at IT across species. Furthermore, a pretrained ResNet-50 model outperforms all custom CNN conditions at macaque IT, suggesting that alignment is influenced more by model capacity and training data than by the learning rule itself. The findings highlight the robustness of early visual alignment across species and measurement modalities, while indicating that higher-area alignment is modulated by factors such as model architecture and stimulus domain.
Methodology
The study employs Representational Similarity Analysis (RSA) to compare the alignment of neural responses from macaque electrophysiology with outputs from CNNs trained using various learning rules. The analysis utilizes two publicly available macaque datasets and identical model weights from a prior human study, ensuring consistency in evaluation across species.
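A generic RSA pipeline can be sketched as follows: build a representational dissimilarity matrix (RDM) for the model and for the neural data, then rank-correlate their upper triangles. Using 1 minus Pearson correlation as the dissimilarity and Spearman's ρ as the comparison are common choices assumed here; the summary does not state the paper's exact settings, and the toy responses are made up.

```python
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def rdm_upper(responses):
    """Upper triangle of the RDM: 1 - Pearson r between stimulus response patterns."""
    n = len(responses)
    return [1.0 - pearson(responses[i], responses[j])
            for i in range(n) for j in range(i + 1, n)]

def ranks(xs):
    """Average ranks, so tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def rsa_score(model_responses, neural_responses):
    """Spearman rho between the model RDM and the neural RDM."""
    return pearson(ranks(rdm_upper(model_responses)), ranks(rdm_upper(neural_responses)))

# Rows are stimuli, columns are units (model features or recording sites).
model = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0],
         [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]]
neural = [[2.0 * x + 1.0 for x in row] for row in model]  # same geometry, rescaled
score = rsa_score(model, neural)
```

Because RSA compares dissimilarity structure rather than raw activations, the model and the neural recording can have different numbers of units, which is what makes a cross-species, cross-modality comparison possible.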
Results
The results indicate that all tested models achieve higher alignment with macaque early visual cortex (ρ = 0.15–0.30) than with human fMRI (ρ = 0.01–0.08). STDP and PC yield the highest alignment scores in macaque V1/V2 (ρ ≈ 0.30 and 0.28). However, no correlation is found in learning rule rankings at IT across species (Kendall’s τ = 0.00, p = 1.00). The pretrained ResNet-50 model achieves a significantly higher alignment score (ρ = 0.25) at macaque IT compared to custom CNN conditions (ρ = 0.07–0.14).
Implications
The findings suggest that while early visual processing mechanisms are conserved across species, the divergence in higher-area rankings may inform future research on neural architecture and learning algorithms. This could lead to improved models for understanding visual processing in both biological and artificial systems.
Decomposing Ensemble Spread in Lorenz '96 With Learned Stochastic Parameterizations
Time Series
Theory
- The paper rigorously defines and decomposes the sources of uncertainty in weather forecasting.
- It systematically compares various parameterization strategies, highlighting the benefits of stochastic methods.
- Stochastic parameterizations with temporally persistent structures enhance spread growth and improve forecast accuracy.
- The study clarifies how different sources of uncertainty interact in chaotic systems.
Summary
This paper addresses the inherent uncertainties in weather and climate forecasting, which arise from chaotic dynamics, imperfect initial conditions, and model deficiencies. The authors utilize the Lorenz 1996 system as a controlled testbed to systematically analyze and decompose the sources of uncertainty contributing to ensemble spread. They categorize uncertainty into three components: intrinsic variability, initial-condition uncertainty, and model uncertainty, and propose a framework to evaluate their interactions. The study compares various parameterization strategies, including deterministic, autoregressive, Bayesian, and novel machine learning approaches. The findings reveal that ensemble perturbations do not enhance long-term variance but instead influence the rate at which trajectories decorrelate. Stochastic parameterizations, particularly those with persistent structures, significantly improve early spread growth and enhance the consistency between spread and forecast error. The work provides valuable insights into the interplay of different uncertainty sources and offers guidance for developing effective stochastic parameterizations in operational weather and climate models.
Methodology
The authors employed the two-scale Lorenz 1996 system to analyze ensemble configurations and parameterization strategies. They defined variance components for different uncertainty sources and conducted a systematic comparison of deterministic, autoregressive, Bayesian, and machine learning-based stochastic parameterizations. The evaluation included both stationary and dynamical diagnostics to assess spread growth, forecast skill, and ensemble calibration.
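The sketch below illustrates how initial-condition uncertainty alone produces ensemble spread. As a simplification it uses the single-scale Lorenz '96 equations, dx_k/dt = (x_{k+1} - x_{k-2}) x_{k-1} - x_k + F, rather than the two-scale system studied in the paper, and it includes no parameterization. The ensemble size, perturbation amplitude, forcing, and step size are assumed values.

```python
import random

def l96_tendency(x, forcing=8.0):
    """Single-scale Lorenz '96 tendency with cyclic boundary conditions."""
    n = len(x)
    return [(x[(k + 1) % n] - x[k - 2]) * x[k - 1] - x[k] + forcing for k in range(n)]

def rk4_step(x, dt, forcing=8.0):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(a, y):  # x + a * y
        return [xi + a * yi for xi, yi in zip(x, y)]
    k1 = l96_tendency(x, forcing)
    k2 = l96_tendency(shifted(dt / 2, k1), forcing)
    k3 = l96_tendency(shifted(dt / 2, k2), forcing)
    k4 = l96_tendency(shifted(dt, k3), forcing)
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def ensemble_spread(members):
    """Root-mean-square deviation of members from the ensemble mean."""
    n_mem, n_var = len(members), len(members[0])
    mean = [sum(m[k] for m in members) / n_mem for k in range(n_var)]
    var = sum((m[k] - mean[k]) ** 2
              for m in members for k in range(n_var)) / (n_mem * n_var)
    return var ** 0.5

rng = random.Random(0)
base = [8.0] * 8  # the uniform steady state x_k = F, which is unstable for F = 8
members = [[v + 1e-3 * rng.gauss(0, 1) for v in base] for _ in range(10)]
spread_start = ensemble_spread(members)
for _ in range(200):  # integrate to t = 2.0 with dt = 0.01
    members = [rk4_step(m, 0.01) for m in members]
spread_end = ensemble_spread(members)
```

Tiny initial differences are amplified by the chaotic dynamics, so the spread at the end of the run is far larger than at the start. The paper's decomposition asks how much of such growth is attributable to initial conditions versus intrinsic variability and model uncertainty.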
Results
The study found that ensemble perturbations regulate the decorrelation of trajectories rather than increasing long-term variance. Stochastic parameterizations, particularly those with temporally persistent structures, were shown to enhance early spread growth and improve the consistency between ensemble spread and forecast error, addressing the issue of underdispersion in operational forecasting systems.
Implications
The insights from this research can inform the development of more reliable weather and climate models by improving the representation of uncertainties through advanced stochastic parameterizations. This could lead to better forecasting accuracy and reliability in operational settings.
LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation
Optimization
Large Language Models
Efficient ML
- LABO integrates LLM predictions with real experiments to enhance Bayesian optimization.
- The framework employs a gating criterion to balance exploration and exploitation effectively.
- Theoretical guarantees demonstrate improved sample efficiency and robustness against misleading LLM signals.
- Empirical results indicate LABO outperforms traditional methods under fixed experimental budgets.
Summary
The paper introduces LABO, a novel framework that enhances Bayesian optimization (BO) by integrating large language models (LLMs) to improve sample efficiency in scientific exploration. Traditional BO methods often struggle with high costs and limited data, particularly in scientific domains where each evaluation is expensive. LABO addresses these challenges by employing a dual-fidelity approach that combines LLM predictions with real-world experimental observations. The framework utilizes a gating criterion to dynamically balance reliance on LLM predictions and actual experiments, allowing for broad exploration of the search space while reserving costly experiments for areas of high uncertainty. The authors provide a theoretical analysis that includes a cumulative regret bound, demonstrating the efficiency gains of LABO. Empirical results across various scientific tasks show that LABO consistently outperforms existing methods within the same experimental budget, suggesting it is a practical and theoretically sound approach for integrating LLMs into scientific discovery workflows.
Methodology
LABO implements a dual-fidelity Bayesian optimization workflow that leverages LLMs for low-cost predictions during the warm-start phase and integrates these predictions with real-fidelity observations. It uses a Kennedy–O’Hagan joint Gaussian process surrogate model to decompose the objective function and applies a discrepancy-dominance gating criterion to decide when to conduct real experiments.
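For readers unfamiliar with the Kennedy–O’Hagan construction, its standard two-fidelity form is written out below, with the LLM prediction playing the low-fidelity role as the summary describes. The symbols (ρ for the scaling factor, δ for the discrepancy, and the mean and kernel names) are notation assumed here, not necessarily the paper's.

```latex
f_{\mathrm{real}}(x) = \rho \, f_{\mathrm{LLM}}(x) + \delta(x), \qquad
f_{\mathrm{LLM}} \sim \mathcal{GP}(m_L, k_L), \quad
\delta \sim \mathcal{GP}(m_\delta, k_\delta), \quad
f_{\mathrm{LLM}} \perp \delta
```

Under this decomposition, uncertainty about the real objective splits into a part inherited from the LLM surrogate and a part due to the discrepancy δ. A discrepancy-based gate can compare these two parts when deciding whether a real experiment is worth its cost.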
Results
LABO demonstrated superior performance in identifying high-potential candidates compared to existing BO methods, achieving better results under the same real-fidelity experimental budget. The theoretical analysis confirmed that LABO maintains sample efficiency even when LLM predictions are noisy or misaligned.
Implications
LABO offers a promising approach for optimizing scientific discovery processes, particularly in fields where experimental evaluations are costly. Its integration of LLMs can facilitate more informed decision-making and efficient exploration in high-dimensional design spaces.
Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation
Efficient ML
Theory
Optimization
- Partial fusion allows for a flexible trade-off between ensemble accuracy and computational efficiency.
- The method utilizes neuron-level similarity and partial optimal transport for weight aggregation.
- Generalized pruning offers a new perspective on model aggregation by enabling linear combinations of neurons.
- Experimental results show that partial fusion achieves accuracy close to ensembles with significantly fewer parameters.
Summary
This paper introduces a novel approach called partial fusion of neural networks, which seeks to balance the trade-offs between the computational costs of ensemble methods and the accuracy of weight aggregation techniques. Traditional ensemble methods improve performance but require significant computational resources, while weight aggregation is less costly but often results in lower accuracy. The authors propose a method that selectively aggregates weights of neurons based on their similarity across different networks, utilizing partial optimal transport to achieve this. This approach allows for a flexible combination of ensemble and weight aggregation strategies, enabling a more efficient model that retains high accuracy with fewer parameters. The paper also discusses generalized pruning, which extends the concept of weight aggregation by allowing for the linear combination of neurons rather than merely deleting them. The results demonstrate that partial fusion can yield performance close to that of ensembles while maintaining a lower parameter count, thus providing a practical solution for model aggregation in neural networks.
Methodology
The authors developed a method for partial fusion that aggregates weights of similar neurons from different networks using partial optimal transport. They also explored generalized pruning, which allows for both deletion and linear combination of neurons based on similarity, providing a more flexible approach to model aggregation.
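As a rough illustration of the idea, the sketch below fuses two one-hidden-layer ReLU networks: neurons whose incoming weights are similar enough are averaged into one, and the rest are kept from both networks. It is not the authors' method. It swaps in greedy cosine-similarity matching with a threshold where the paper uses partial optimal transport, and the names `partial_fuse`, `forward`, and `threshold` are hypothetical. Incoming weight vectors are assumed to be nonzero.

```python
import math

def cosine(u, v):
    # Cosine similarity between two nonzero weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def partial_fuse(W_a, v_a, W_b, v_b, threshold=0.9):
    """W_*: incoming weights per hidden neuron, v_*: outgoing weights.

    Greedy matching is a stand-in for partial optimal transport.
    The 0.5 factors make the result mimic the average of both networks.
    """
    used_b = set()
    W_out, v_out = [], []
    for i, w in enumerate(W_a):
        free = [j for j in range(len(W_b)) if j not in used_b]
        j = max(free, key=lambda c: cosine(w, W_b[c])) if free else None
        if j is not None and cosine(w, W_b[j]) >= threshold:
            # Similar pair: merge into a single neuron.
            used_b.add(j)
            W_out.append([0.5 * (a + b) for a, b in zip(w, W_b[j])])
            v_out.append(0.5 * (v_a[i] + v_b[j]))
        else:
            # No close partner: keep the neuron as an ensemble member.
            W_out.append(list(w))
            v_out.append(0.5 * v_a[i])
    for j, w in enumerate(W_b):
        if j not in used_b:
            W_out.append(list(w))
            v_out.append(0.5 * v_b[j])
    return W_out, v_out

def forward(W, v, x):
    # One hidden ReLU layer followed by a linear readout.
    hidden = [max(0.0, sum(a * b for a, b in zip(row, x))) for row in W]
    return sum(o * h for o, h in zip(v, hidden))

W_a, v_a = [[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]
W_b, v_b = [[1.0, 0.0], [1.0, 1.0]], [3.0, 4.0]
W_f, v_f = partial_fuse(W_a, v_a, W_b, v_b)
```

In this toy case one neuron pair is identical and gets merged, so the fused network has three hidden neurons instead of the ensemble's four and still reproduces the ensemble output exactly. With merely similar neurons the match would only be approximate.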
Results
The experiments conducted on neural networks trained on MNIST demonstrated that the partial OT fusion method achieved test accuracies close to those of ensemble models while requiring only approximately 1.45 times the parameters of a single original network, compared to 2 times for a traditional ensemble. Generalized pruning also showed similar benefits, allowing for effective model size reduction while maintaining performance.
Implications
The findings suggest that partial fusion and generalized pruning can significantly enhance the efficiency of neural network models, making them more practical for deployment in resource-constrained environments. This approach could be particularly beneficial in applications where computational resources are limited, such as mobile devices or edge computing.
stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
Reinforcement Learning
Robotics
Optimization
- Introduction of stable-worldmodel (swm) as a unified platform for world modeling research.
- High-performance data layer that supports various dataset formats, eliminating I/O bottlenecks.
- Well-tested implementations of modern world model baselines and planning solvers.
- Comprehensive benchmarking suite for diverse environments, enabling systematic evaluation.
Summary
The paper introduces stable-worldmodel (swm), an open-source platform designed to enhance reproducibility in world modeling research. World models are essential for developing agents capable of reasoning, planning, and generalizing beyond their training data. However, the current landscape of world modeling research is fragmented, with various codebases, data pipelines, and evaluation protocols that hinder reproducibility and fair comparisons. The authors identify three primary bottlenecks: fragile codebases, slow data loading, and a lack of standardized benchmarks. To address these issues, swm provides a high-performance data layer with support for multiple dataset formats, clean implementations of state-of-the-art world model baselines, and a diverse suite of environments for systematic evaluation. The platform aims to unify the research pipeline, significantly reducing overhead and accelerating progress towards reliable world models. By offering a modular test-bed, swm facilitates the assessment of new algorithms and enables clear identification of model limitations, promoting trustworthy comparisons against established baselines.
Methodology
The stable-worldmodel platform is built on PyTorch and Gymnasium, providing a modular framework that encompasses the entire world model pipeline, from data collection and training to evaluation. It includes a high-performance data layer, implementations of world model baselines, and a suite of environments with controllable factors for evaluation.
Results
The platform successfully integrates various components necessary for world modeling research, allowing for efficient data handling and standardized evaluation protocols. It provides a robust environment for testing and comparing different algorithms, ultimately facilitating reproducible research.
Implications
The stable-worldmodel platform has the potential to significantly advance the field of world modeling by providing researchers with the tools needed for reproducible experiments and fair comparisons. This could lead to more reliable advancements in the development of intelligent agents capable of complex reasoning and planning.
DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models
NLP
Large Language Models
Optimization
- DualOptim+ introduces a shared base state and decoupled delta states for improved machine unlearning.
- The framework is adaptable to any optimizer with stored states, functioning as an intermediate between shared and decoupled states.
- Extensive experiments validate that DualOptim+ achieves a better trade-off between forgetting efficacy and model utility.
- The quantized version, DualOptim+ 8bit, significantly reduces memory overhead while maintaining performance.
Summary
The paper introduces DualOptim+, an innovative optimization framework designed to enhance machine unlearning in large language models (LLMs). The framework incorporates a shared base state to capture common representations from both forgetting and retaining objectives, alongside decoupled delta states that maintain objective-specific residuals. This architecture allows for adaptive bridging between shared and decoupled states, effectively addressing the directional conflicts between forgetting and retaining gradients. Additionally, the authors present DualOptim+ 8bit, a quantized version that minimizes memory usage without sacrificing performance. Through extensive experiments on various unlearning tasks, including fictitious and real-world datasets, the authors demonstrate that DualOptim+ achieves a superior balance between forgetting efficacy and model utility, outperforming existing methods. The findings suggest that DualOptim+ is a versatile framework applicable to broader optimization challenges beyond machine unlearning, such as multi-objective learning and LLM alignment.
Methodology
The authors propose DualOptim+, which utilizes a shared base state updated by gradients from both forgetting and retaining objectives, and decoupled delta states that capture the residuals of each objective. This allows the optimizer to adaptively adjust based on the conflict between gradients. The framework is compatible with existing optimizers and includes a quantized variant to reduce memory usage.
Results
The experiments conducted across various LLMs and datasets show that DualOptim+ consistently outperforms existing unlearning methods, achieving a superior balance between the efficacy of forgetting specific information and maintaining the overall utility of the model. The quantized version, DualOptim+ 8bit, effectively reduces memory consumption without compromising performance.
Implications
The proposed DualOptim+ framework has significant implications for the field of machine unlearning, particularly in the context of large language models. Its adaptability suggests potential applications in multi-objective optimization tasks and LLM alignment, making it a valuable tool for researchers and practitioners in optimizing model performance while ensuring compliance with data privacy requirements.
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
NLP
Large Language Models
Interpretability
- Developed a model-agnostic audit pipeline for analyzing language model failures.
- Identified feature 17,491 as a strong correlate of failure in the IOI task, but not a causal factor.
- Demonstrated the importance of robust controls in feature analysis to avoid misinterpretation of results.
- Provided a transparent report of findings, including negative results, to enhance understanding of model behavior.
Summary
This paper presents an audit of the GPT-2 Small model's performance on the Indirect Object Identification (IOI) task, focusing on the internal activations of the model to understand failure modes. The authors conducted a detailed analysis using a sparse autoencoder (SAE) to identify features that correlate with successful and failed task executions. They found that GPT-2 Small achieved an accuracy of 79.7% on 300 prompts, with 146 features showing significant differences in activation levels between successful and failed trials. Notably, feature 17,491, labeled 'cryptographic keys', was identified as a strong correlate of failure, particularly when the prompt involved the object 'the keys'. However, further controls revealed that this feature was not a sufficient cause of failure, as ablating it did not restore accuracy. The authors emphasize that their audit pipeline is model-agnostic and provides interpretable results, and they argue for the inclusion of robust controls in future audits to differentiate between genuine causal features and incidental correlations. The paper concludes with a candid report of negative results, highlighting the need for transparency in mechanistic interpretations of model behavior.
Methodology
The authors employed a sparse autoencoder (SAE) to analyze the residual stream activations of GPT-2 Small on the IOI task. They conducted statistical analyses to compare feature activations between successful and failed trials, and implemented three controls to validate their findings: causal ablation, representation baseline comparison, and seed robustness checks.
Results
The audit revealed that feature 17,491 had a significantly higher activation during failed trials compared to successful ones, with a Cohen's d of +2.93. However, ablating this feature did not improve accuracy, indicating it was a correlate rather than a cause. The logistic regression analysis showed that the predictive power of the raw residual stream matched that of the top SAE features, and the failure rate remained consistent across different random seeds, although the top feature varied.
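For readers unfamiliar with the effect size quoted above, the following sketch computes Cohen's d with a pooled standard deviation. The summary does not say which variant of d the authors used, so the pooled form is an assumption, and the activation values are made-up toy numbers rather than data from the paper.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_diff / math.sqrt(pooled_var)

# Toy feature activations (hypothetical): failed vs. successful trials.
failed = [3.0, 4.0, 5.0]
succeeded = [1.0, 2.0, 3.0]
d = cohens_d(failed, succeeded)  # means differ by 2, pooled SD is 1
```

A large positive d only says the two activation distributions are well separated. As the ablation control in the paper shows, it says nothing about whether the feature causes the failures.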
Implications
The findings suggest that while certain features may correlate with model failures, they do not necessarily indicate causation. This has implications for the interpretability of language models and the methodologies used to analyze their behavior. The audit pipeline developed could be applied to other models and tasks to enhance understanding of their internal workings.
Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale
Robotics
Optimization
Theory
- Introduction of deep-kernel pairwise learning (DKPL) for autonomous experimentation.
- DKPL integrates expert feedback into the active learning loop, moving beyond scalar metrics.
- Demonstrated effectiveness in learning nanoscale structures and analyzing ferroelectric domain walls.
- Addresses limitations of traditional Bayesian optimization in capturing complex phenomena.
Summary
This paper presents a novel approach to autonomous experimentation in scientific discovery, particularly at the nanoscale, by introducing deep-kernel pairwise learning (DKPL). Traditional Bayesian optimization methods in self-driving laboratories rely on predefined scalar metrics to guide experimentation, which can limit their effectiveness in capturing complex phenomena. The authors argue that many important scientific insights are not easily quantifiable and can be overlooked when relying solely on scalar descriptors. DKPL addresses this limitation by allowing experts to evaluate experimental outputs based on their interdisciplinary knowledge, rather than predefined metrics. This method learns a latent utility function from expert judgments to guide subsequent experiments. The authors demonstrate DKPL's effectiveness in identifying meaningful nanoscale structures and analyzing ferroelectric domain walls, showcasing its ability to prioritize high-information measurement regions. The development of DKPL marks a significant step towards integrating expert knowledge into autonomous experimentation, paving the way for self-driving laboratories that can tackle complex scientific problems beyond scalar-driven learning.
Methodology
The authors developed a preference-driven active learning framework called deep-kernel pairwise learning (DKPL). This framework incorporates human expertise into the experimental process by allowing experts to evaluate which experimental outputs are more promising. DKPL learns a latent utility function based on these evaluations to guide future experiments, rather than relying on predefined scalar metrics.
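Learning a latent utility from pairwise judgments can be sketched with a Bradley-Terry style model, where the probability that an expert prefers output A over output B is a logistic function of their utility difference. The code below is a simplified stand-in, not DKPL itself: it assumes a linear utility over hand-supplied feature vectors instead of a deep kernel, and the names `fit_utility`, `features`, and `prefs` are hypothetical.

```python
import math

def fit_utility(features, prefs, lr=0.5, steps=500):
    """Fit weights w so that utility(x) = w . x explains the preferences.

    features: list of feature vectors, one per experimental output.
    prefs: list of (winner_index, loser_index) pairs from the expert.
    """
    dim = len(features[0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for winner, loser in prefs:
            diff = [a - b for a, b in zip(features[winner], features[loser])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            p_win = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of log sigmoid(margin) with respect to w.
            for k in range(dim):
                grad[k] += (1.0 - p_win) * diff[k]
        w = [wi + lr * g / len(prefs) for wi, g in zip(w, grad)]
    return w

# Three candidate outputs described by a single toy feature.
features = [[0.0], [1.0], [2.0]]
prefs = [(2, 1), (1, 0), (2, 0)]  # expert prefers larger feature values
w = fit_utility(features, prefs)
utilities = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
best = max(range(len(features)), key=lambda i: utilities[i])
```

Picking the highest-utility candidate as the next measurement is the simplest possible acquisition rule. An active learning loop would normally also account for uncertainty in the learned utility.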
Results
DKPL was shown to effectively learn physically meaningful nanoscale structures and prioritize measurement regions with high information content. In specific applications, DKPL successfully distinguished between different characteristics of ferroelectric domain walls in materials such as bismuth ferrite and erbium manganite, demonstrating its capability to handle complex experimental data.
Implications
The integration of expert knowledge into autonomous experimentation could significantly enhance the discovery process in materials science and other fields, allowing for the exploration of complex phenomena that are not easily quantifiable. This approach could lead to more effective self-driving laboratories capable of addressing a broader range of scientific challenges.
CASE-NET: Deep Spatio-Temporal Representation Learning via Causal Attention and Channel Recalibration for Multivariate Time Series Classification
Time Series
- Introduces CASE-NET to address temporal non-causality and noise in MTS classification.
- Employs a Causal Temporal Encoder with masked self-attention and causal convolutions.
- Incorporates an Adaptive Channel Recalibration module to enhance feature purity.
- Sets new state-of-the-art results on four tasks across six diverse datasets.
Summary
The paper introduces CASE-NET, a novel architecture designed to enhance multivariate time series (MTS) classification by addressing two critical bottlenecks: temporal non-causality and the lack of explicit channel saliency mechanisms. The authors argue that existing models often suffer from temporal blindness, leading to confounding effects in non-stationary dynamics, and fail to adequately filter out noise from the latent space. CASE-NET combines a Causal Temporal Encoder, which utilizes masked self-attention and causal convolutions to enforce strict temporal causality, with an Adaptive Channel Recalibration module that acts as an information bottleneck to suppress noise. The architecture is evaluated across six diverse domains, demonstrating state-of-the-art performance on four tasks, including achieving a peak accuracy of 98.6% on the AWR dataset. The results indicate that CASE-NET not only improves classification accuracy but also enhances robustness in non-stationary environments, making it a significant advancement in MTS classification.
Methodology
CASE-NET employs a two-pronged approach: a Causal Temporal Encoder that applies masked self-attention and causal convolutions to enforce temporal causality, and an Adaptive Channel Recalibration module that filters out non-discriminative noise before the disentanglement stage. This architecture is designed to improve the fidelity of representations extracted from multivariate time series data.
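The temporal causality constraint can be illustrated with a minimal causal self-attention routine in which each time step attends only to itself and earlier steps. This is a generic sketch of masked self-attention, not the CASE-NET encoder: it omits learned query, key, and value projections, multiple heads, and the causal convolutions, and the function name is hypothetical.

```python
import math

def causal_self_attention(x):
    """x: list of T time steps, each a list of d channel values.

    Returns (outputs, weights). weights[t][s] is zero for every s > t,
    which is what an upper-triangular attention mask enforces.
    """
    T, d = len(x), len(x[0])
    outputs, weights = [], []
    for t in range(T):
        # Scores are computed only against positions s <= t.
        scores = [
            sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (T - t - 1)
        weights.append(row)
        outputs.append(
            [sum(row[s] * x[s][k] for s in range(T)) for k in range(d)]
        )
    return outputs, weights

series = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(series)

# Changing a future step must not change earlier outputs.
edited = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]
out_edited, _ = causal_self_attention(edited)
```

Here `out[:2]` and `out_edited[:2]` coincide even though the last time step differs, which is the property the mask is meant to guarantee.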
Results
Evaluated across six heterogeneous domains, CASE-NET set new state-of-the-art results on four classification tasks. Notably, it achieved a peak accuracy of 98.6% on the AWR dataset, showcasing its effectiveness in handling non-stationary dynamics and improving classification performance.
Implications
The advancements presented in CASE-NET have significant implications for various applications involving multivariate time series data, such as clinical diagnostics, financial analysis, and activity recognition in IoT systems. The architecture's ability to enhance representation fidelity and robustness could lead to improved decision-making processes in these fields.
PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation
Theory
Optimization
- PEARL addresses behavioral intensity imbalance in recommender systems.
- The framework utilizes nonparametric contrastive learning to derive unbiased percentile estimates.
- Theoretical justification supports the effectiveness of pairwise comparisons in preference modeling.
- Real-world deployment shows substantial improvements in user engagement metrics.
Summary
The paper introduces PEARL, a novel framework designed to address the issue of behavioral intensity imbalance in recommender systems, particularly in the context of industrial-scale livestream recommendations. Traditional recommender systems often suffer from skewed feedback signals due to heterogeneous engagement patterns among users, leading to biased models that favor highly active users. PEARL proposes a nonparametric contrastive percentile approximation approach that focuses on modeling relative preference signals rather than absolute engagement levels. By leveraging real contrastive interaction samples, PEARL approximates percentile relationships directly, eliminating the need for auxiliary distribution estimation models. The authors provide theoretical justification for the unbiased nature of these pairwise comparisons and introduce mechanisms such as prediction-based bootstrapping for handling sparse feedback. Extensive offline experiments validate the effectiveness of PEARL in mitigating behavioral bias, while online A/B testing in a production environment demonstrates significant improvements in key performance metrics, indicating its practical applicability in enhancing recommendation quality.
Methodology
PEARL employs a nonparametric contrastive learning approach to model relative preference signals through pairwise comparisons of user interactions. It avoids the complexity of auxiliary distribution estimation by directly using observed engagement data to approximate percentile relationships. The framework includes a prediction-based bootstrapping mechanism for smoothing percentile estimates and a co-training strategy to enhance representation learning.
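The link between pairwise comparisons and percentiles can be made concrete: the empirical percentile of an interaction is the fraction of comparison interactions it beats. The sketch below shows only this basic identity on toy watch durations. It is not the PEARL training objective, and the function name and the per-user comparison sets are assumptions for illustration.

```python
def pairwise_percentile(value, contrast):
    """Fraction of contrast samples that `value` beats; ties count half."""
    wins = sum(1 for c in contrast if c < value)
    ties = sum(1 for c in contrast if c == value)
    return (wins + 0.5 * ties) / len(contrast)

# Toy watch durations in seconds (hypothetical) for two kinds of users.
light_user_history = [10, 20, 30, 100]
heavy_user_history = [100, 200, 300, 1000]

light = pairwise_percentile(60, light_user_history)
heavy = pairwise_percentile(600, heavy_user_history)
```

Both interactions land at the same percentile (0.75) although their absolute durations differ by a factor of ten, which is the sense in which relative signals are less skewed toward highly active users.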
Results
In offline experiments, PEARL showed consistent improvements in recommendation performance across multiple ranking targets. In online A/B testing on a large-scale livestream platform, PEARL achieved a +2.10% increase in Watch Duration, +0.80% in Consumption Amount, +1.49% in Interaction Rate, and a 6.91% reduction in Report Rate, demonstrating its effectiveness in real-world applications.
Implications
The findings suggest that PEARL can significantly enhance the performance of recommender systems by addressing biases in user engagement data. Its application could lead to more equitable and effective recommendations across diverse user groups, improving overall user satisfaction and engagement on platforms with large user bases.
Predicting Performance of Symbolic and Prompt Programs with Examples
NLP
Large Language Models
Theory
- The paper formalizes performance prediction for both symbolic and prompt programs using a Bayesian inference framework.
- Empirical performance priors reveal stark differences between symbolic and prompt programs, impacting their reliability based on test case outcomes.
- RAP (Retrieved Approximate Prior) is introduced as a method to improve performance prediction for prompt programs by leveraging a corpus of tasks.
- RAP demonstrates superior performance compared to baseline predictors and adapts well with increasing test cases.
Summary
This paper investigates the reliability of performance prediction for symbolic programs (e.g., Python code) and prompt programs (natural-language instructions executed by large language models, LLMs). The authors propose a Bayesian approach to model the performance of these programs based on observed execution outcomes, treating them as Bernoulli random variables. They compile empirical performance priors from a diverse corpus of tasks and programs, revealing that symbolic programs exhibit an 'all-or-nothing' performance distribution, while prompt programs have a more diffuse distribution with many nearly-correct outcomes. This distinction is critical as it explains why a few passing test cases can certify the performance of symbolic programs but not that of prompt programs. To enhance performance prediction for prompt programs, the authors introduce RAP (Retrieved Approximate Prior), a method that retrieves similar tasks and prompt programs from an existing corpus to construct a domain-specific prior. The results demonstrate that RAP outperforms baseline predictors and shows desirable properties such as convergence to the in-domain posterior with more test cases and robustness to irrelevant information.
Methodology
The authors adopt a Bayesian inference model treating program execution outcomes as independent Bernoulli trials. They compile empirical performance priors for symbolic and prompt programs from a diverse set of tasks, and develop RAP to construct a domain-specific prior by retrieving similar tasks and prompt programs from a corpus.
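The effect of the prior's shape can be seen in a small Bayesian update over a discrete grid of possible accuracies. The two priors below are toy stand-ins chosen to mimic the 'all-or-nothing' and 'diffuse' shapes described above. They are not the paper's empirical priors, and the function names are hypothetical.

```python
def posterior(thetas, prior, passes, total):
    """Posterior over accuracy values after `passes` of `total` tests pass.

    Each test is an independent Bernoulli trial with success rate theta.
    """
    fails = total - passes
    unnormalized = [
        p * (t ** passes) * ((1.0 - t) ** fails)
        for t, p in zip(thetas, prior)
    ]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]

def mean(thetas, probs):
    return sum(t * p for t, p in zip(thetas, probs))

# Toy "all-or-nothing" prior: a program is either always wrong or always right.
symbolic_thetas = [0.0, 1.0]
symbolic_post = posterior(symbolic_thetas, [0.5, 0.5], passes=3, total=3)
symbolic_mean = mean(symbolic_thetas, symbolic_post)

# Toy diffuse prior: accuracy is equally likely anywhere on a coarse grid.
prompt_thetas = [k / 10 for k in range(11)]
prompt_post = posterior(prompt_thetas, [1 / 11] * 11, passes=3, total=3)
prompt_mean = mean(prompt_thetas, prompt_post)
```

Under the all-or-nothing prior, three passing tests push the expected accuracy to 1.0. Under the diffuse prior the same evidence leaves it noticeably below 0.9, so a few passing tests certify much less.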
Results
RAP outperforms baseline performance predictors, converges to the in-domain posterior as the number of test cases increases, and is robust to irrelevant information in the corpus. The empirical analysis shows that symbolic programs have concentrated performance distributions while prompt programs have diffuse distributions.
Implications
The findings suggest that performance prediction methods can significantly enhance the reliability of deploying prompt programs in real-world applications, particularly in critical settings where performance consistency is essential. The insights into the differences between symbolic and prompt programs can inform future research and development in program synthesis and evaluation.
Learning Causal Orderings for In-Context Tabular Prediction
Theory
Interpretability
- TABORDER integrates causal orderings into tabular prediction, improving robustness against distribution shifts.
- The model uses causal order-constrained attention to ensure predictions are based on causal relationships.
- It learns optimal variable orderings in an unsupervised manner, addressing the challenge of sample missingness.
- Empirical evaluations confirm TABORDER's effectiveness in recovering causal structures and performing well in predictive tasks.
Summary
This paper addresses the limitations of in-context learning for tabular data, which often relies on correlational structures that can fail under distribution shifts or interventions. The authors propose a novel model, TABORDER, which integrates causal orderings into tabular prediction tasks. Unlike traditional models that use all available features for predictions, TABORDER employs causal order-constrained attention, ensuring that predictions are based only on features that precede the target variable according to a learned causal order. This model learns the optimal variable ordering in an unsupervised manner using a likelihood-based objective, which is justified under standard functional model classes. The paper also explores the interaction between sample missingness and causal direction identification, demonstrating that missing samples can aid in learning causal orderings. Empirical results show that TABORDER effectively recovers accurate causal orderings while performing well in prediction and imputation tasks, particularly in real-world biological data scenarios involving interventions.
Methodology
The authors developed TABORDER, a transformer-based architecture that utilizes causal order-constrained attention to model the joint distribution of tabular data. The model learns the causal order of variables through a likelihood-based objective, allowing it to make predictions based on a learned ordering of features. The architecture is designed to handle missing data effectively, facilitating causal direction identification even in the presence of incomplete observations.
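One way to picture order-constrained attention is as a boolean mask derived from a variable ordering, where each variable may attend only to variables that come before it. The sketch below builds such a mask for a fixed ordering. How TABORDER learns the ordering and whether a variable may attend to itself are not described in this summary, so the strict "predecessors only" rule and the name `order_mask` are assumptions.

```python
def order_mask(order):
    """order: variable indices listed from causally first to last.

    Returns mask[i][j] == True when variable j precedes variable i,
    meaning i is allowed to attend to j.
    """
    position = {var: rank for rank, var in enumerate(order)}
    n = len(order)
    return [[position[j] < position[i] for j in range(n)] for i in range(n)]

# Hypothetical ordering: variable 2 first, then 0, then 1.
mask = order_mask([2, 0, 1])
# mask[0] -> [False, False, True]   (0 sees only 2)
# mask[1] -> [True, False, True]    (1 sees 0 and 2)
# mask[2] -> [False, False, False]  (2 has no predecessors)
```

A prediction for the last variable in the ordering can then use every other feature, while the first variable must be predicted without any inputs from the row.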
Results
TABORDER demonstrated the ability to accurately infer causal orderings and maintain competitive predictive performance across various tasks, including missing value imputation. The model's predictions remained robust under intervention scenarios, contrasting with traditional models that relied on non-causal associations, thus highlighting its effectiveness in real-world applications.
Implications
The findings suggest that incorporating causal structures into predictive models can enhance their robustness and reliability, particularly in settings where data distributions may shift or where interventions are present. This approach could have significant implications for fields such as healthcare, economics, and any domain where understanding causal relationships is crucial for accurate predictions.
The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity
Theory
Interpretability
- No feature ranking can be simultaneously faithful, stable, and complete under collinearity.
- The Rashomon property illustrates the instability of feature rankings due to multiple valid models.
- DASH is proposed as a robust ensemble method for feature attribution that mitigates the limitations of traditional methods.
- Quantitative bounds demonstrate varying degrees of violation of the desiderata across different model classes.
Summary
This paper addresses the challenges of feature ranking in machine learning, particularly under conditions of collinearity among features. The authors propose three desiderata for feature rankings: faithfulness, stability, and completeness. They introduce the concept of 'Attribution Impossibility,' demonstrating that no feature ranking can satisfy all three desiderata simultaneously when collinearity is present. The paper discusses the Rashomon property, which highlights the multiplicity of models that can explain the same data, leading to inherent instability in feature rankings. The authors provide quantitative bounds for various model classes, including gradient boosting, Lasso, neural networks, and random forests, illustrating how each model violates the desiderata to different extents. To address these limitations, they propose an ensemble attribution method called DASH, which aims to achieve Pareto optimality in feature ranking. The paper includes empirical comparisons of DASH with existing methods, showcasing its robustness and reduced information loss. The authors also explore extensions of their findings to areas such as fairness and causal discovery, ultimately contributing to a deeper understanding of the limitations of feature attribution in machine learning.
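The instability can be seen in the most extreme case of collinearity, two identical features. The toy example below builds two linear models that make exactly the same predictions yet rank the features in opposite order. Using the absolute coefficient as the attribution score is an assumption made for simplicity, and the example illustrates the Rashomon effect only. It is not the paper's proof or the DASH method.

```python
def predict(weights, rows):
    # Linear model without intercept.
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

def ranking(weights):
    # Feature indices sorted from most to least important by |coefficient|.
    return sorted(range(len(weights)), key=lambda k: -abs(weights[k]))

# Feature 1 is an exact copy of feature 0 (perfect collinearity).
x0 = [1.0, 2.0, 3.0, 4.0]
rows = [[v, v] for v in x0]

model_a = [1.0, 0.0]  # puts all weight on feature 0
model_b = [0.0, 1.0]  # puts all weight on feature 1

same_predictions = predict(model_a, rows) == predict(model_b, rows)
rank_a, rank_b = ranking(model_a), ranking(model_b)
```

Both models fit any target equally well, so a ranking that is faithful to one of them contradicts a ranking that is faithful to the other.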
Methodology
The authors employ theoretical analysis to establish the impossibility results regarding feature rankings under collinearity. They derive quantitative bounds for various machine learning models and propose the DASH method for ensemble attribution. Empirical validation is conducted through experiments on synthetic and real-world datasets to compare DASH with existing attribution methods.
Results
The main results indicate that traditional feature ranking methods fail to meet the criteria of faithfulness, stability, and completeness when features are collinear. The DASH method demonstrates improved performance in terms of robustness and reduced information loss compared to existing methods, with empirical validation across multiple datasets supporting its effectiveness.
Implications
The findings suggest that practitioners should be cautious when interpreting feature rankings in the presence of collinearity. The DASH method offers a promising alternative for feature attribution, potentially enhancing the interpretability of machine learning models. Additionally, the implications extend to fairness audits and causal discovery, highlighting the need for robust attribution methods in these contexts.
Minimum Description Length based Granular-Ball Tree Regularization for Spectral Clustering
Graph Learning
- MDL-GBTRSC improves local connectivity in spectral clustering by utilizing a granular-ball tree structure.
- The method integrates local representation learning with affinity graph construction.
- It introduces a shared-neighbor bridge code to enhance local bridge relations without user-defined thresholds.
- Experimental results show superior performance compared to classical and other granular-ball-based spectral clustering methods.
Summary
This paper presents a novel approach to spectral clustering, addressing the challenges of constructing an affinity graph that accurately reflects local connectivity in heterogeneous data structures. The proposed method, Minimum Description Length based Granular-Ball Tree Regularized Spectral Clustering (MDL-GBTRSC), utilizes a granular-ball tree constructed through local Minimum Description Length (MDL) model selection. This tree structure helps maintain reliable local connections by discouraging splits that disrupt these connections. The stable leaf balls from the tree provide essential coding-scale information for regularizing the sample-level affinity graph. Additionally, the introduction of a shared-neighbor bridge code allows for the adjustment of weak local bridge relations without the need for user-defined thresholds. The integration of local representation learning with affinity graph construction creates a unified framework for spectral clustering. Experimental results demonstrate that MDL-GBTRSC outperforms classical spectral clustering methods and other granular-ball-based approaches, achieving the highest average Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) across various datasets.
Methodology
The methodology involves constructing a granular-ball tree using local MDL model selection, which captures local connectivity and structural information. The stable leaf balls from this tree are used to regularize the sample-level affinity graph, while a shared-neighbor bridge code is introduced to refine local relations.
Results
MDL-GBTRSC achieved the best average ARI and NMI scores in experiments conducted on both real and synthetic datasets, outperforming classical spectral clustering baselines and other granular-ball, micro-cluster, and anchor-based methods.
Implications
The findings suggest that incorporating local region-level information into affinity graph construction can significantly enhance the performance of spectral clustering, making it more adaptable to complex data structures. This approach may have applications in various fields requiring effective clustering of heterogeneous data.
ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation
Computer Vision
Efficient ML
Theory
- Introduction of wavelet complexity field (WCF) for spatially adaptive frequency control in terrain representation.
- Implementation of gradient matching to enhance derivative fidelity in terrain models.
- Significant reduction in model size and training time compared to previous terrain INRs.
- Competitive performance in rate-distortion metrics against established DEM codecs.
Summary
The paper introduces ImplicitTerrainV2, an advanced implicit neural representation (INR) for digital elevation models (DEMs) that addresses limitations of previous terrain INRs. Traditional raster DEMs rely on interpolation and finite-difference operators, which can hinder accuracy and efficiency in terrain analysis. ImplicitTerrainV2 enhances terrain representation by integrating a wavelet complexity field (WCF) for spatially adaptive frequency control, derivative-aware supervision, and post-training model compression. The WCF generates frequency masks that focus high-frequency capacity on complex terrain areas, while gradient matching improves the fidelity of derivatives. The model achieves significant compression, reducing storage requirements to 1.23 bits per pixel (bpp) with minimal quality loss. Evaluated on 50 Swiss terrain tiles, ImplicitTerrainV2 achieves an end-to-end peak signal-to-noise ratio (PSNR) of 66.25 dB, outperforming previous models by 5.70 dB while using 3.2 times fewer parameters and training in just 55 seconds per tile on a single GPU. This new representation not only competes with established DEM codecs in rate-distortion performance but also supports off-grid point queries and closed-form derivative evaluations, making it beneficial for various GIS applications.
Methodology
The authors developed ImplicitTerrainV2 by enhancing a cascaded sinusoidal neural network (SIREN) architecture with wavelet-guided spatial adaptivity and derivative-aware supervision. The WCF is derived from wavelet coefficients to create spatial masks that optimize frequency localization. Gradient matching is employed to ensure the smooth manifold structure of terrain DEMs, and post-training techniques like mixed-precision quantization and entropy coding are used for model compression.
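As a rough illustration of how wavelet coefficients can be turned into a spatial mask, the sketch below computes single-level Haar detail energy on 2×2 blocks of a small elevation grid and rescales it to [0, 1]. The function names, the choice of the Haar wavelet, and the max-normalization are assumptions made for this example; the summary does not specify how the paper actually builds its WCF.

```python
def haar_detail_energy(dem):
    """Per-2x2-block Haar detail energy for a 2D list with even dimensions."""
    rows, cols = len(dem), len(dem[0])
    energy = []
    for r in range(0, rows, 2):
        row_out = []
        for c in range(0, cols, 2):
            a, b = dem[r][c], dem[r][c + 1]
            d, e = dem[r + 1][c], dem[r + 1][c + 1]
            horiz = (a - b + d - e) / 2.0  # left-right variation
            vert = (a + b - d - e) / 2.0   # top-bottom variation
            diag = (a - b - d + e) / 2.0   # diagonal variation
            row_out.append(horiz ** 2 + vert ** 2 + diag ** 2)
        energy.append(row_out)
    return energy


def complexity_mask(energy):
    """Rescale energies to [0, 1]; higher values mark rougher terrain (assumed rule)."""
    peak = max(max(row) for row in energy)
    if peak == 0:
        return [[0.0] * len(row) for row in energy]
    return [[value / peak for value in row] for row in energy]


# Toy elevation grid: flat everywhere except a rough bottom-right corner.
dem = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 4],
    [0, 0, 4, 0],
]
mask = complexity_mask(haar_detail_energy(dem))
print(mask)  # [[0.0, 0.0], [0.0, 1.0]]
```

In a setup like the one described, a mask of this kind would decide where the network is allowed to spend high-frequency capacity.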
Results
ImplicitTerrainV2 achieved a PSNR of 66.25 dB on 50 Swiss terrain tiles, a 5.70 dB improvement over the previous model with 3.2 times fewer parameters. Training took 55 seconds per tile on a single GPU. After post-training compression, the model required 1.23 bpp of storage at a cost of only 0.28 dB in PSNR, and its rate-distortion performance was competitive with established DEM codecs.
Implications
The advancements in ImplicitTerrainV2 could significantly enhance GIS applications, particularly in hydrology, geomorphology, urban planning, and autonomous navigation, by providing a more efficient and accurate terrain representation that supports continuous evaluation and derivative analysis.
F-TIS: Harnessing Diverse Models in Collaborative GRPO
Reinforcement Learning
Large Language Models
- F-TIS enables heterogeneous models to collaborate in decentralized RL training.
- The framework effectively utilizes off-policy samples without harming convergence.
- F-TIS demonstrates performance on par with on-policy training and improves generalization in certain cases.
- The communication overhead is minimized, making it efficient for collaborative training.
Summary
The paper introduces Filtered Truncated Importance Sampling (F-TIS), a novel framework designed to enhance the Group Relative Policy Optimization (GRPO) method in decentralized reinforcement learning (RL) settings. Traditional GRPO relies on homogeneous models, which limits its applicability in decentralized environments where diverse models with varying computational capabilities and preferences collaborate. F-TIS addresses this limitation by allowing heterogeneous models to work together while effectively utilizing off-policy samples, which are typically detrimental to GRPO convergence. The authors demonstrate that F-TIS maintains performance comparable to on-policy training and, in some cases, improves generalization on out-of-distribution tasks by up to 12%. The framework is communication-efficient, requiring minimal data exchange between nodes, thus facilitating collaborative training without significant overhead. The extensive evaluation across various heterogeneous setups confirms the robustness and effectiveness of F-TIS in both decentralized and single-cluster training scenarios.
Methodology
The authors propose F-TIS as a unified framework that leverages off-policy samples in decentralized heterogeneous RL. The method involves generating multiple completions for prompts, calculating their advantages, and updating policies based on these advantages while maintaining communication efficiency. The framework is evaluated in various setups to assess its performance against traditional on-policy methods.
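The steps named above (several completions per prompt, advantages, importance-weighted updates) can be sketched in a few lines. The group-standardized advantage is the usual GRPO form; the clipping constant `clip`, the filter threshold `min_ratio`, and the rule of zeroing out samples below that threshold are assumptions for illustration, not the paper's stated filter.

```python
import math
import statistics


def group_relative_advantages(rewards):
    """Standardize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def filtered_truncated_weights(logp_learner, logp_sampler, clip=2.0, min_ratio=0.1):
    """Importance weights for samples produced by a different (off-policy) model.

    Ratios are truncated at `clip`; samples whose ratio falls below
    `min_ratio` are dropped (weight 0). Both constants are hypothetical.
    """
    weights = []
    for lp_new, lp_old in zip(logp_learner, logp_sampler):
        ratio = math.exp(lp_new - lp_old)
        if ratio < min_ratio:
            weights.append(0.0)              # filtered out
        else:
            weights.append(min(ratio, clip))  # truncated
    return weights


rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
weights = filtered_truncated_weights(
    logp_learner=[math.log(0.5), math.log(0.9), math.log(0.01)],
    logp_sampler=[math.log(0.5), math.log(0.3), math.log(0.5)],
)
print(advantages, weights)
```

Each sample's contribution to the policy update would then be scaled by its weight times its advantage.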
Results
F-TIS converges on par with purely on-policy training and, in some scenarios, improves performance on out-of-distribution tasks by up to 12%. Communication is limited to 8 bytes per token, which keeps collaboration among diverse models efficient.
Implications
The introduction of F-TIS has significant implications for decentralized RL applications, allowing for more flexible collaboration among models with varying capabilities. This can lead to improved performance in real-world scenarios where model diversity is common, such as in federated learning environments.
When to Switch, Not Just What: Transition Quality Prediction in Clash Royale
Reinforcement Learning
- Frequent strategy switching in Clash Royale is associated with lower win rates.
- The Zero Switching Cost Assumption overlooks the behavioral costs of switching strategies.
- The Transition Quality Predictor (TQP) framework reformulates strategy recommendation as a transition-level decision problem.
- The TQP includes mechanisms to identify when and what strategies to recommend based on player behavior.
Summary
This paper investigates the phenomenon of strategy switching in competitive gaming, specifically in Clash Royale, where players often change strategies after losing streaks. Analyzing 926,334 match records from 34,619 players, the authors found that frequent switching correlates with lower win rates, challenging the assumption that switching is always beneficial. They introduce the concept of the Zero Switching Cost Assumption, which overlooks the behavioral costs associated with switching strategies. To address this, the authors propose a new framework called the Transition Quality Predictor (TQP), which consists of three stages: WHO (identifying the player), WHEN (determining the optimal timing for switching), and WHAT (selecting the best strategy). The TQP incorporates a PersonaGate to suppress recommendations for players who perform better with consistent strategies, a TimingGate to identify beneficial switching moments, and ScoreFusion to rank strategies based on their predicted transition quality. The paper also introduces the SwitchGap metric to evaluate the effectiveness of recommendations without assuming that observed player choices are optimal. The TQP pipeline demonstrates improved performance, particularly for loss-triggered switchers, who benefit significantly from subtype-conditioned guidance.
Methodology
The authors developed the Transition Quality Predictor (TQP) framework, which consists of three stages: PersonaGate to filter recommendations based on player consistency, TimingGate to identify optimal switching moments, and ScoreFusion to rank strategies by combining adoptability signals with predicted transition quality. They also introduced the SwitchGap metric for evaluation.
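The three stages can be read as two gates followed by a ranking step. The sketch below is one hypothetical way to wire them together; the field names, thresholds, and the linear blend controlled by `alpha` are illustrative assumptions, not the paper's implementation.

```python
def recommend(player, candidates, persona_threshold=0.5, timing_threshold=0.5, alpha=0.5):
    """Return a strategy name, or None when no recommendation should be made."""
    # PersonaGate: stay silent for players who do better with a consistent strategy.
    if player["consistency_benefit"] > persona_threshold:
        return None
    # TimingGate: recommend only when a switch looks beneficial right now.
    if player["switch_timing_score"] < timing_threshold:
        return None

    # ScoreFusion: blend adoptability with predicted transition quality.
    def fused(candidate):
        return alpha * candidate["adoptability"] + (1 - alpha) * candidate["transition_quality"]

    return max(candidates, key=fused)["name"]


candidates = [
    {"name": "deck_a", "adoptability": 0.9, "transition_quality": 0.2},
    {"name": "deck_b", "adoptability": 0.6, "transition_quality": 0.7},
]
switcher = {"consistency_benefit": 0.2, "switch_timing_score": 0.8}
stable = {"consistency_benefit": 0.9, "switch_timing_score": 0.8}
print(recommend(switcher, candidates))  # deck_b
print(recommend(stable, candidates))    # None
```

Returning `None` is what keeps the recommendation rate low: most player states never pass both gates.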
Results
The TQP pipeline achieved a SwitchGap improvement of +10.4 percentage points at a recommendation rate of 5.4%. Loss-triggered switchers, despite being the lowest-performing group, showed the most significant benefits from subtype-conditioned guidance.
Implications
The findings suggest that strategy recommendation systems in competitive games should account for individual player behavior and the costs associated with switching strategies. This could lead to more effective guidance and improved player performance in dynamic gaming environments.
Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning
Reinforcement Learning
- Transition-level similarity does not guarantee consistency in value updates in cross-domain offline RL.
- Target-Aligned Bellman Backup (TABB) selects and reweights source data based on Bellman backup consistency.
- TABB shows strong performance improvements across multiple environments and dataset combinations.
Summary
This paper addresses the challenges of Cross-domain Offline Reinforcement Learning (CDRL), which seeks to enhance policy learning in a target domain by utilizing data from a source domain. Traditional methods often rely on measuring the similarity of transitions between domains to assess the transferability of source-domain data. However, the authors argue that this approach can lead to misleading results, as transitions that appear similar may yield different long-term returns in the target domain. To overcome this limitation, the authors propose a novel method called Target-Aligned Bellman Backup (TABB). TABB evaluates the transferability of source-domain transitions based on their alignment with target-domain Bellman targets, rather than superficial transition similarity. By focusing on Bellman backup consistency, TABB selectively reweights source-domain data to improve policy learning in the target domain. The authors conduct extensive experiments across various environments and dataset combinations, demonstrating that TABB consistently outperforms existing methods, leading to significant performance improvements in cross-domain offline RL settings.
Methodology
The authors propose TABB, which assesses the transferability of source-domain transitions by measuring their alignment with target-domain Bellman targets. This method selectively reweights source transitions based on their contribution to accurate Bellman target estimation, rather than relying solely on transition similarity.
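To make the idea of weighting by Bellman-target agreement concrete, the sketch below compares the Bellman target computed from each source transition with an estimate of what the target domain would produce, and down-weights transitions that disagree. The exponential weighting, the `temperature` parameter, and all function names are assumptions for illustration; the summary does not give the paper's actual rule.

```python
import math


def bellman_target(reward, next_value, gamma=0.99):
    """One-step Bellman target r + gamma * V(s')."""
    return reward + gamma * next_value


def alignment_weights(source_targets, target_estimates, temperature=1.0):
    """Weight each source transition by how closely its target matches the target domain's."""
    return [
        math.exp(-abs(s - t) / temperature)
        for s, t in zip(source_targets, target_estimates)
    ]


def weighted_td_loss(q_values, targets, weights):
    """Weighted mean squared TD error over source transitions."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * (q - y) ** 2 for q, y, w in zip(q_values, targets, weights)) / total


source_targets = [bellman_target(1.0, 0.0), bellman_target(1.0, 4.0)]
target_estimates = [1.0, 2.0]  # hypothetical target-domain estimates
weights = alignment_weights(source_targets, target_estimates)
print(weights)
print(weighted_td_loss([0.5, 0.5], source_targets, weights))
```

Two transitions can look alike at the (state, action, next state) level yet receive very different weights here, which is the distinction the method draws.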
Results
The experimental results indicate that TABB consistently achieves superior performance compared to existing methods across six environments and sixteen dataset combinations, demonstrating its effectiveness in enhancing policy learning in cross-domain offline RL scenarios.
Implications
The findings suggest that TABB can significantly improve the efficiency and effectiveness of policy learning in real-world applications where data collection in the target domain is limited or costly, such as robotics and autonomous control systems.
One-Way Policy Optimization for Self-Evolving LLMs
NLP
Large Language Models
Reinforcement Learning
- OWPO decouples optimization direction from update magnitude to enhance training stability.
- Asymmetric reweighting strategies are employed to manage inferior and superior deviations effectively.
- Iterative reference updates create a 'Ratchet Effect' that consolidates performance gains.
- OWPO outperforms strong baseline methods, breaking the bottleneck of fixed priors.
Summary
This paper introduces One-Way Policy Optimization (OWPO), a novel approach to enhance the efficiency and stability of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). Traditional RLVR methods often struggle with sparse binary rewards, leading to low optimization efficiency and instability. Existing strategies typically impose token-level constraints based on a reference policy, which can inadvertently suppress performance gains by penalizing beneficial deviations. OWPO addresses this by decoupling the optimization direction from the update magnitude: the verifier determines the direction of updates while the reference policy adjusts the magnitude. This is achieved through asymmetric reweighting strategies: 'Accelerated Alignment' for inferior deviations and 'Gain Locking' for superior deviations. Additionally, OWPO incorporates iterative reference updates to create a 'Ratchet Effect' that consolidates gains over time. Experimental results demonstrate that OWPO significantly outperforms existing methods, enabling continuous self-evolution of LLMs without reliance on external reference models.
Methodology
The OWPO method utilizes a dual approach of asymmetric reweighting based on the verifier's signals and the reference policy's deviations. It categorizes policy behaviors into superior and inferior deviations and applies specific weighting rules to optimize updates. The method also incorporates iterative updates to the reference policy to maintain performance improvements.
Results
Experimental evaluations show that OWPO outperforms existing RLVR methods, including DAPO, OPD, and MOPD, demonstrating improved efficiency and stability in training LLMs. The results highlight the effectiveness of decoupling optimization direction from update magnitude and the benefits of iterative reference updates.
Implications
The findings suggest that OWPO can significantly enhance the training of LLMs, making them more efficient and stable. This has potential applications in various domains requiring advanced reasoning capabilities, such as automated reasoning systems, conversational agents, and other NLP tasks.
Hierarchical Variational Policies for Reward-Guided Diffusion
Generative Models
Computer Vision
Efficient ML
- Introduces a unified framework for test-time guidance in diffusion models using hierarchical variational policies.
- Develops Amortized HVP (AHVP) for high-quality reward-aligned samples with reduced inference cost.
- Presents Semi-Amortized HVP (SHVP) that combines amortized proposals with test-time refinement for improved perceptual quality.
- Demonstrates superior quality-speed tradeoff on inverse problems, achieving over 5× faster inference than leading methods.
Summary
This paper presents a novel framework for adapting pretrained diffusion models to various downstream tasks, particularly inverse problems, by utilizing hierarchical variational policies. The authors propose a method that significantly reduces the computational cost associated with test-time guidance or optimization, which is typically required for generating high-quality samples. By formulating test-time adaptation as a hierarchical variational model, the approach allows for amortized control through a lightweight stochastic policy. This enables few-step diffusion sampling, where larger step sizes facilitate faster inference while maintaining sample quality through structured control. The proposed Amortized Hierarchical Variational Policy (AHVP) achieves a strong quality-speed tradeoff, outperforming existing baselines in terms of perceptual quality and inference speed. Additionally, the Semi-Amortized Hierarchical Variational Policy (SHVP) combines inexpensive amortized proposals with limited test-time optimization, achieving state-of-the-art results across several challenging inverse problems. The framework is modular and can be applied to a wide range of tasks with differentiable likelihoods or rewards, demonstrating its versatility and effectiveness.
Methodology
The authors employ a hierarchical variational model to create a lightweight stochastic policy that guides the denoising process in diffusion models. This involves training an initial noise distribution and per-step stochastic controllers through variational inference, allowing for efficient sampling without the need for repeated gradient evaluations or inner-loop optimizations during inference.
Results
The proposed methods (AHVP and SHVP) achieve better perceptual quality and faster inference times compared to existing baselines on tasks such as 4× super-resolution. AHVP matches or exceeds the perceptual quality of leading test-time methods while being over 5× faster, and SHVP further enhances quality with modest additional computational cost.
Implications
The findings suggest that hierarchical variational policies can significantly improve the efficiency of diffusion models in practical applications, making them more viable for real-time and high-resolution tasks. This approach could be beneficial in fields such as computer vision, where rapid and high-quality image generation is crucial.
Manifold-Guided Attention Steering
NLP
Large Language Models
Interpretability
- MAGS introduces a dynamic, trajectory-aware intervention for correcting reasoning errors in LLMs.
- The method is grounded in the observation that correct and incorrect reasoning trajectories are geometrically separable.
- MAGS applies targeted corrections only when attention heads deviate from a learned correctness manifold.
- Empirical results show MAGS outperforms static steering approaches by up to 10.8% across multiple reasoning benchmarks.
Summary
The paper introduces Manifold-Guided Attention Steering (MAGS), a novel approach aimed at improving the reasoning consistency of large language models (LLMs) during inference. Traditional activation steering methods apply fixed correction vectors, which can disrupt correct reasoning steps while attempting to correct errors. MAGS addresses this by leveraging the geometric structure of attention head activations, proposing that reasoning errors correspond to deviations from a low-dimensional correctness manifold. The authors develop a mechanism that learns these manifolds from contrastive pairs of correct and incorrect reasoning traces. During inference, MAGS monitors the proximity of attention heads to this manifold and applies targeted corrections only when deviations exceed a learned threshold. This adaptive intervention allows for precise steering of attention outputs, preventing the propagation of errors. The authors validate their approach through extensive experiments across various benchmarks, demonstrating that MAGS consistently outperforms both unsteered baselines and static steering methods, suggesting that correctness manifolds are a fundamental aspect of LLM attention geometry.
Methodology
The authors conducted diagnostic experiments to confirm their hypothesis about the geometric structure of attention head activations. They developed MAGS, which learns low-dimensional subspaces from contrastive pairs of correct and incorrect reasoning traces. During inference, MAGS monitors the attention heads' outputs and applies corrections dynamically based on their proximity to the correctness manifold.
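The monitor-then-correct loop can be sketched as a distance check against a subspace followed by a conditional projection. The sketch assumes an orthonormal basis and mean have already been learned; the `threshold` and `strength` parameters and the projection-based correction are illustrative assumptions rather than the paper's exact mechanism.

```python
import math


def project_onto_subspace(x, mean, basis):
    """Project x onto the affine subspace mean + span(basis); basis must be orthonormal."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    projected = list(mean)
    for direction in basis:
        coeff = sum(c * d for c, d in zip(centered, direction))
        projected = [p + coeff * d for p, d in zip(projected, direction)]
    return projected


def steer(x, mean, basis, threshold, strength=1.0):
    """Leave x untouched if it is near the subspace; otherwise pull it toward it."""
    target = project_onto_subspace(x, mean, basis)
    distance = math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, target)))
    if distance <= threshold:
        return list(x)  # close enough: no intervention
    return [xi + strength * (ti - xi) for xi, ti in zip(x, target)]


mean = [0.0, 0.0, 0.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical learned subspace
print(steer([1.0, 2.0, 0.05], mean, basis, threshold=0.1))  # [1.0, 2.0, 0.05]
print(steer([1.0, 2.0, 3.0], mean, basis, threshold=0.1))   # [1.0, 2.0, 0.0]
```

The conditional branch is what separates this from static steering: activations that already sit near the subspace pass through unchanged.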
Results
MAGS demonstrated significant improvements in reasoning tasks, outperforming static steering baselines by up to 10.8% across various benchmarks, including mathematical reasoning, code generation, and molecular generation. The method showed consistent effectiveness across three model families: Llama, Gemma, and GPT-OSS.
Implications
The findings suggest that incorporating geometric insights into the design of inference-time interventions can enhance the reliability of LLMs in reasoning tasks. MAGS could be applied in various domains requiring accurate multi-step reasoning, such as automated code generation, scientific research, and complex decision-making systems.
SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection
Efficient ML
- Introduction of SepsisAI Orchestrator as an open-source platform for early sepsis detection.
- Integration of various technologies including HL7 FHIR, NoSQL, LightGBM, Docker, and Kubernetes.
- Empirical findings on optimal scaling behavior for clinical AI inference workloads.
- Provision of a reproducible deployment recipe for clinical prediction tasks.
Summary
The paper introduces the SepsisAI Orchestrator, an open-source modular platform designed to bridge the gap between clinical machine learning models and their bedside deployment for early sepsis detection. The authors identify key barriers to deployment, including heterogeneous data representations and the lack of standardized workflows. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, all orchestrated with Docker and Kubernetes. The study empirically characterizes the system's performance under load, revealing that the number of replicas should match the physical CPU thread count to optimize latency and eliminate request failures. The findings highlight a U-shaped scaling behavior, which has not been previously quantified for clinical AI inference workloads. The authors provide a reproducible deployment recipe and emphasize that their contribution focuses on deployment systems rather than predictive modeling, aligning with HIPAA and GDPR principles.
Methodology
The SepsisAI Orchestrator employs a modular architecture that preprocesses electronic health records into HL7 FHIR-inspired CDA structures, stores them in a NoSQL database, and serves a containerized LightGBM classifier via REST APIs. The system is orchestrated using Docker and Kubernetes, with load testing conducted using k6 to evaluate performance under varying levels of concurrent users.
Results
The study found that scaling the number of replicas from 3 to 12 on a 12-thread CPU reduced the 95th percentile latency from 3.3 seconds to 1.41 seconds (a 57.3% reduction) and eliminated request failures. Over-provisioning beyond 12 replicas degraded performance due to scheduler contention, producing a U-shaped scaling curve for clinical AI inference workloads.
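The reported reduction follows directly from the two latency figures. A quick check, where the helper name is just for illustration:

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# 95th percentile latency at 3 replicas vs. 12 replicas, in seconds.
print(round(percent_reduction(3.3, 1.41), 1))  # 57.3
```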
Implications
The SepsisAI Orchestrator provides a scalable and efficient framework for deploying AI models in clinical settings, particularly for high-stakes applications like sepsis detection. Its open-source nature allows for adaptation in various clinical prediction tasks, potentially improving patient outcomes through timely and accurate decision support.
BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series
Time Series
- Introduces 'spectral drift' as a new perspective to understand subject-specific variability in biomedical time-series.
- Proposes BioFormer, which includes a Frequency-Band Alignment Module (FBAM) to align spectral structures.
- Implements Sample Conditional Layer Normalization (SCLN) to stabilize cross-subject representations.
- Demonstrates a 6% absolute improvement in F1-score over 12 baseline models across six datasets.
Summary
The paper addresses the challenge of cross-subject generalization in biomedical time-series (BTS) data, where models trained on data from certain subjects perform poorly on unseen subjects due to subject-specific variability. The authors introduce the concept of 'spectral drift' to characterize this variability, which manifests as differences in magnitude and phase across subjects for the same physiological signals. To mitigate this issue, they propose BioFormer, which includes a Frequency-Band Alignment Module (FBAM) that aligns spectral structures by adjusting amplitude and phase based on the spectral distribution. Additionally, they introduce Sample Conditional Layer Normalization (SCLN) to stabilize representations by inferring normalization parameters from intrinsic signal statistics rather than subject identity. The proposed methods were evaluated on six datasets, demonstrating significant improvements in performance compared to 12 baseline models, highlighting the effectiveness of explicitly modeling subject-specific variability in BTS classification tasks.
Methodology
The methodology involves the introduction of the Frequency-Band Alignment Module (FBAM) that uses cross-attention mechanisms to derive modulation coefficients for amplitude and phase adjustments in the Fourier domain. This approach allows for adaptive frequency-band alignment to suppress subject-dependent variability while preserving task-relevant oscillatory structures. Additionally, Sample Conditional Layer Normalization (SCLN) is employed to calibrate feature statistics at the sample level, enhancing the model's robustness against distribution shifts.
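Amplitude and phase adjustment in the Fourier domain amounts to multiplying each frequency bin by a complex coefficient. The sketch below uses a plain discrete Fourier transform and takes the per-bin coefficients as given inputs; in the paper they come from cross-attention, which is not modeled here. Function and parameter names are assumptions for this example.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for short examples)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def modulate(signal, amp_scale, phase_shift):
    """Scale the amplitude and shift the phase of each frequency bin.

    For an exactly real output the coefficients should be conjugate-symmetric
    across bins; this sketch simply keeps the real part.
    """
    spectrum = dft(signal)
    adjusted = [
        a * value * cmath.exp(1j * p)
        for value, a, p in zip(spectrum, amp_scale, phase_shift)
    ]
    return [v.real for v in idft(adjusted)]


signal = [math.sin(2 * math.pi * t / 8) for t in range(8)]
unchanged = modulate(signal, [1.0] * 8, [0.0] * 8)
doubled = modulate(signal, [2.0] * 8, [0.0] * 8)
print(unchanged)
print(doubled)
```

With unit amplitude and zero phase shift the signal comes back unchanged, so any learned deviation from those values is where the alignment happens.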
Results
The experiments conducted on six biomedical time-series datasets showed that BioFormer outperformed 12 baseline models, achieving an absolute F1-score improvement of 6%. This indicates that the proposed methods effectively address the challenges of cross-subject generalization by explicitly modeling and aligning subject-specific variability.
Implications
The findings suggest that explicitly modeling subject-specific variability can lead to more robust and generalizable models in biomedical applications, potentially improving outcomes in areas such as disease screening and monitoring. The methodologies developed could be applied to other domains where time-series data exhibit similar variability challenges.
Adaptive Measurement Allocation for Learning Kernelized SVMs Under Noisy Observations
Theory
Efficient ML
- Introduces an adaptive measurement allocation strategy for kernelized SVMs under noisy observations.
- Combines geometric sensitivity and active-set instability to focus measurements on critical kernel entries.
- Demonstrates that adaptive allocation significantly improves classifier performance compared to uniform allocation.
- Provides a theoretical framework for understanding when adaptive strategies are preferable.
Summary
This paper addresses the challenge of learning kernelized Support Vector Machines (SVMs) in the presence of noisy observations, particularly in contexts such as quantum machine learning where kernel entries must be inferred from limited measurements. Traditional methods often employ uniform measurement allocation, which does not account for the varying importance of different kernel entries. The author proposes an adaptive measurement allocation strategy that focuses on the most critical regions of the kernel matrix, guided by two principles: geometric sensitivity and active-set instability. The method involves an initial pilot round of measurements followed by adaptive rounds that concentrate resources based on the sensitivity of the SVM margin and the likelihood of changes in support vector membership. Theoretical analysis indicates that the effectiveness of adaptive allocation is influenced by the heterogeneity of kernel importance, leading to scenarios where adaptive strategies outperform uniform ones. Empirical results demonstrate significant improvements in support vector recovery, margin estimation, and decision function accuracy, even under fixed measurement budgets. The findings suggest that adaptive measurement allocation is a promising alternative to uniform sampling, enhancing classifier fidelity and computational efficiency.
Methodology
The methodology consists of an initial pilot measurement phase to estimate the kernel matrix, followed by adaptive rounds where the measurement budget is reallocated based on the sensitivity of the SVM margin and the instability of the active set of support vectors. This task-aware allocation strategy allows for a more efficient use of limited measurement resources.
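One simple way to reallocate a fixed measurement budget is to give every kernel entry a small floor and split the rest in proportion to an importance score. The sketch below assumes such a score is already available per entry (for example, derived from margin sensitivity and active-set instability after the pilot round); how the paper combines the two signals is not stated in this summary, so the scores here are placeholders.

```python
def allocate_measurements(scores, budget, floor=1):
    """Split `budget` measurements across entries, proportional to `scores`.

    Every entry receives at least `floor` measurements; any remainder left
    by rounding down goes to the highest-scoring entries.
    """
    n = len(scores)
    remaining = budget - floor * n
    if remaining < 0:
        raise ValueError("budget too small for the requested floor")
    total = sum(scores)
    if total == 0:
        shares = [remaining // n] * n  # no signal: fall back to uniform
    else:
        shares = [int(remaining * s / total) for s in scores]
    allocation = [floor + share for share in shares]
    leftover = budget - sum(allocation)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    for i in ranked[:leftover]:
        allocation[i] += 1
    return allocation


# Hypothetical importance scores for four kernel entries.
print(allocate_measurements([3.0, 1.0, 0.0, 0.0], budget=12))  # [7, 3, 1, 1]
```

With all scores equal to zero the function degrades to uniform allocation, which matches the baseline the adaptive strategy is compared against.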
Results
The results indicate that the adaptive measurement allocation strategy leads to better support vector recovery, improved margin estimation, and enhanced decision function accuracy compared to uniform allocation. The empirical evaluations across synthetic datasets demonstrate the effectiveness of the proposed method under various noise regimes.
Implications
The findings suggest that adaptive measurement allocation can be effectively applied in quantum machine learning and other domains where kernel estimation is subject to noise. This approach could lead to more efficient learning algorithms that require fewer resources while maintaining high accuracy.
Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees
Theory
Efficient ML
- Lumberjack introduces a new differentially private random forest algorithm that enhances utility without compromising privacy.
- The algorithm employs a novel heavy hitter detection method to optimize tree structure and pruning.
- Empirical results show substantial improvements over existing differentially private random forest methods.
- The approach allows for deeper trees, improving expressiveness and performance under privacy constraints.
Summary
This paper presents Lumberjack, a novel differentially private random forest algorithm designed to enhance the utility of random forests while ensuring privacy. Traditional methods for enforcing differential privacy in random forests often lead to significant performance degradation. Lumberjack addresses this by constructing larger random decision trees and implementing a privacy-preserving pruning mechanism that retains only sufficiently populated nodes. A key innovation is the (ε, δ)-DP heavy hitter detection algorithm tailored for hierarchical data, which allows for deeper trees with improved expressiveness under privacy constraints. The empirical evaluation demonstrates that Lumberjack consistently outperforms existing differentially private random forest methods, achieving a new state of the art in the privacy-utility trade-off, particularly for practical privacy budgets. The findings suggest that well-designed differentially private random forests can significantly reduce the utility gap, indicating a promising avenue for future research in this domain.
Methodology
Lumberjack combines randomized tree construction with a privacy-aware mechanism for determining when to stop splitting nodes. It allocates part of the privacy budget to assess node density and prunes low-density branches. The heavy hitter detection algorithm is structured to perform overlapping noisy binary searches over the tree, leveraging the monotone structure of rooted trees to enhance utility while satisfying differential privacy constraints.
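The pruning idea (keep only nodes whose noisy population is large enough) can be illustrated with a much simpler mechanism than the one in the paper. The sketch below adds Laplace noise with scale 1/epsilon to each node count, the standard calibration for a single count query of sensitivity 1, and keeps nodes above a threshold. It is a swapped-in textbook technique, not Lumberjack's (ε, δ)-DP heavy hitter detection with overlapping noisy binary searches, and it says nothing about the privacy accounting of a full tree.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def prune_sparse_nodes(node_counts, epsilon, threshold, seed=None):
    """Return the nodes whose noisy count reaches `threshold`.

    `seed` exists only to make the demo reproducible; a real private
    mechanism must not use a fixed, known seed.
    """
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    kept = []
    for node, count in node_counts.items():
        noisy_count = count + laplace_noise(scale, rng)
        if noisy_count >= threshold:
            kept.append(node)
    return kept


# Hypothetical node populations at one depth of a random tree.
counts = {"root/left": 100, "root/right": 2}
print(prune_sparse_nodes(counts, epsilon=1.0, threshold=10, seed=0))
```

Smaller values of `epsilon` add more noise, so sparsely populated nodes become harder to distinguish from empty ones and are more likely to be pruned or kept by chance.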
Results
The empirical evaluation on benchmark datasets shows that Lumberjack outperforms prior differentially private random forest methods, establishing a new state of the art. The algorithm achieves significant improvements in the privacy-utility trade-off, particularly for practical privacy budgets, demonstrating that carefully designed differentially private random forests can close much of the utility gap.
Implications
The advancements presented in Lumberjack have significant implications for applications involving sensitive data, such as healthcare and finance, where maintaining privacy while maximizing predictive performance is crucial. The findings suggest that future research can further explore the potential of differentially private random forests in various domains.
Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines
Theory
Efficient ML
- AML demonstrates competitive performance against strong baselines in small to medium datasets.
- The framework does not require cross-validation or hyperparameter tuning, making it suitable for low-data scenarios.
- AML's generic algebraic inductive bias allows it to perform comparably to task-specific methods.
- The study provides empirical evidence that symbolic learning can be viable in realistic supervised tasks.
Summary
This paper evaluates Algebraic Machine Learning (AML), a framework that utilizes subdirect decomposition of algebraic structures instead of numerical optimization, in the context of supervised learning tasks. The authors challenge the prevailing assumption that symbolic methods are inferior to modern numerical learners by comparing AML against strong baselines, including convolutional neural networks (CNNs) for image classification and boosted-tree methods like XGBoost for tabular data. The study finds that, on small to medium datasets (50-2000 examples), AML is the best-performing method on image classification and is comparable to strong boosted-tree and random-forest baselines on tabular data, where XGBoost remains the top performer. It achieves this without cross-validation or hyperparameter tuning, which is particularly advantageous in low-data scenarios. The results suggest that symbolic learning can be effective when the structure is learned from data rather than manually specified.
Methodology
The authors conducted a systematic empirical evaluation of AML on supervised classification tasks using both image and tabular datasets. They compared AML's performance against strong cross-validated baselines, including CNNs and boosted-tree methods. AML operates by learning through subdirect decomposition and does not rely on task-dependent hyperparameters, allowing it to be trained solely on the training data.
Results
Across twelve standard image datasets, AML with a logistic regression readout was the best-performing method overall, statistically distinguishable from each baseline. In tabular datasets, while XGBoost was the top performer, AML was comparable to strong methods like LightGBM and random forests, achieving the best results on a plurality of datasets at 1000 training examples.
Implications
The findings suggest that symbolic learning frameworks like AML can be competitive alternatives to traditional numerical methods, particularly in scenarios with limited data. This could lead to broader applications of symbolic learning in various domains where data scarcity is a concern.
Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
Reinforcement Learning
Large Language Models
Theory
- Introduction of a multi-reward RLIF framework that combines answer-level and completion-level rewards.
- Implementation of GDPO normalization to balance reward scales and prevent optimization issues.
- Use of KL-Cov regularization to maintain exploration and prevent entropy collapse during training.
- Demonstrated improved performance and stability over existing single-reward RLIF methods.
Summary
This paper introduces a novel multi-reward framework for Reinforcement Learning with Internal Feedback (RLIF), addressing the limitations of existing single-reward methods that can lead to reward hacking and model collapse. The proposed framework decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To effectively combine these signals, the authors apply GDPO-based normalization to mitigate reward-scale imbalances. Additionally, they introduce KL-Cov regularization to target low-entropy token distributions, which helps preserve exploration and prevents late-stage collapse. The framework is evaluated across five mathematical reasoning benchmarks and two coding benchmarks, demonstrating improved stability and robustness compared to prior unsupervised RL approaches, while achieving performance levels close to supervised RL with verifiable rewards (RLVR). The findings suggest that using complementary internal rewards, along with targeted regularization, can enhance long-horizon reasoning capabilities without relying on external supervision.
Methodology
The authors propose a multi-reward RLIF framework that utilizes two types of internal rewards: an answer-level reward derived from cluster voting and a completion-level reward based on token-wise self-certainty. They employ GDPO normalization to balance the influence of these rewards and introduce KL-Cov regularization to target specific tokens that contribute to entropy collapse, thereby preserving diversity in the model's outputs.
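The way two internal rewards on different scales can be combined is easier to see in a small sketch. The code below is illustrative only: exact-match majority voting stands in for the paper's cluster voting, the `certainty` scores are made-up inputs rather than a real self-certainty computation, and normalizing each reward within the sampled group before summing is one plausible reading of the normalization step, not a confirmed reproduction of GDPO. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def vote_rewards(answers):
    """Answer-level reward: 1.0 if a completion's final answer equals the
    majority answer in the group, else 0.0 (exact match stands in for clustering)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def normalize(rewards, eps=1e-6):
    """Zero-mean, unit-variance normalization within one group of completions."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def combined_advantages(answers, certainty):
    """Normalize each reward separately, then sum, so neither scale dominates."""
    a = normalize(vote_rewards(answers))
    c = normalize(certainty)
    return [x + y for x, y in zip(a, c)]


# Four sampled completions for one prompt (made-up values).
answers = ["42", "42", "41", "42"]
certainty = [12.0, 15.0, 30.0, 11.0]  # hypothetical completion-level scores

adv = combined_advantages(answers, certainty)
raw = [v + c for v, c in zip(vote_rewards(answers), certainty)]
```

In this toy group, the raw sum ranks the minority answer "41" highest because the certainty scores are an order of magnitude larger than the vote reward. After per-reward normalization, the highest advantage goes to a majority-answer completion.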
Results
The proposed framework outperformed several baseline methods, including single-reward RLIF approaches, across five mathematical reasoning and two coding benchmarks. The results indicate that the multi-reward system enhances stability and robustness during training, maintaining performance gains over extended training periods.
Implications
This research suggests that multi-reward systems can significantly improve the reasoning capabilities of large language models in unsupervised settings, making it a promising approach for applications where external supervision is limited or unavailable. The findings could influence future developments in reinforcement learning methodologies and their applications in various AI domains.
A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models
Generative Models
Theory
Computer Vision
- Diffusion models can be represented using both ODE and SDE frameworks.
- The tutorial establishes a connection between score matching and noise-prediction objectives.
- Sampling methods for reverse dynamics are explored, including DPM-Solver and guided sampling.
- DDPM and DDIM are unified under the reverse SDE/ODE framework, highlighting their shared training objectives.
Summary
This tutorial provides a comprehensive overview of diffusion models through the lens of differential equations. It begins with the conditional Gaussian forward process, illustrating its representation as both an ordinary differential equation (ODE) and a stochastic differential equation (SDE). The authors derive marginalized forward ODE and SDE formulations that transport the data distribution to a Gaussian prior. The tutorial further explores reverse-time dynamics, including the reverse SDE and the reverse probability-flow ODE, which are governed by the marginal score. This leads to a training objective for score estimation, demonstrating that the conventional noise-prediction objective aligns with score matching, differing only by a constant. Additionally, the paper discusses various sampling methods for the learned reverse dynamics, such as DPM-Solver, and guided sampling techniques. A comparison between DDPM and DDIM within the reverse SDE/ODE framework reveals that both share the same training objective, with distinct sampling methods corresponding to reverse-SDE and reverse-ODE sampling. Overall, the tutorial serves as a foundational resource for understanding the mathematical underpinnings and practical applications of diffusion models in generative tasks.
Methodology
The authors develop diffusion models starting from the conditional Gaussian forward process, deriving both ODE and SDE representations. They then average the conditional process to obtain marginalized formulations and derive reverse dynamics. The tutorial also discusses the training objectives for score estimation and various sampling methods for the learned dynamics.
Results
The tutorial successfully illustrates the equivalence between standard noise-prediction objectives and score matching. It also clarifies the relationships between different sampling methods and their corresponding formulations in the reverse SDE/ODE framework.
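The equivalence between noise prediction and score matching can be written out in a few lines. The notation below ($\alpha_t$, $\sigma_t$, $\epsilon_\theta$, $s_\theta$) is the standard one for a conditional Gaussian forward process and is assumed here; the tutorial's own symbols may differ.

```latex
% Conditional Gaussian forward process (assumed notation)
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\quad\Longrightarrow\quad
p(x_t \mid x_0) = \mathcal{N}\!\left(\alpha_t x_0,\ \sigma_t^2 I\right)

% Conditional score expressed through the noise
\nabla_{x_t} \log p(x_t \mid x_0)
  = -\frac{x_t - \alpha_t x_0}{\sigma_t^2}
  = -\frac{\epsilon}{\sigma_t}

% Parameterize the score model through a noise predictor
s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}
\quad\Longrightarrow\quad
\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|^2
  = \frac{1}{\sigma_t^2} \left\| \epsilon_\theta(x_t, t) - \epsilon \right\|^2

% Denoising score matching equals marginal score matching up to a constant C_t
\mathbb{E}_{x_t}\!\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t) \right\|^2
  = \mathbb{E}_{x_0,\, x_t}\!\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|^2 + C_t,
\qquad C_t \text{ independent of } \theta
```

The last line is the sense in which the two objectives differ only by a constant. The factor $1/\sigma_t^2$ is a per-timestep weight, which simplified noise-prediction losses commonly drop.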
Implications
This work has significant implications for the development of generative models, particularly in enhancing the understanding and implementation of diffusion processes in various applications such as image generation and data synthesis.
A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction
Optimization
Interpretability
- Introduction of a log-driven AutoML framework for reproducible healthcare risk prediction.
- Large-scale evaluation reveals a structured and partially redundant AutoML search space.
- Ensemble models demonstrate strong performance: Macro-F1 is around 0.88 for diabetes prediction, while stroke prediction reaches a Weighted-F1 of 0.94 but a Macro-F1 of only 0.67 under class imbalance.
- Identifies key components influencing model performance, particularly in the context of class imbalance.
Summary
This paper presents a novel deterministic, log-driven AutoML framework, named yvsoucom-iterkit, aimed at enhancing reproducibility and interpretability in healthcare risk prediction. The framework addresses the challenges of heterogeneous features, limited samples, and severe class imbalance in clinical datasets by optimizing the entire pipeline, including preprocessing, data augmentation, imbalance handling, and classification models. Through extensive evaluation of over 18,000 pipeline configurations on the Pima Indians Diabetes and Stroke datasets, the authors reveal a structured yet partially redundant search space where a small subset of components significantly influences performance. Key findings indicate that ensemble models yield strong and stable performance, with a Macro-F1 score of approximately 0.88 on the Pima dataset and a Weighted-F1 score of 0.94 on the Stroke dataset, although Macro-F1 on Stroke drops to 0.67 under severe class imbalance. The study also highlights a performance-robustness trade-off, showing that ensemble models exhibit lower variability compared to SVM. The framework's design allows for transparent experimentation and comprehensive analysis, making it a valuable tool for healthcare data analysis and potentially applicable to other domains with categorical prediction tasks.
Methodology
The authors developed a pipeline-centric AutoML framework that jointly optimizes preprocessing and modeling components within a unified configuration space. The framework employs a log-driven execution paradigm to link pipeline configurations with their performance outcomes, enabling extensive experimentation and analysis across multiple configurations.
Results
The framework was evaluated on over 18,000 pipeline configurations, revealing that performance is primarily governed by a small subset of interacting components. Ensemble models achieved strong performance metrics, with a Weighted-F1 score of 0.89 and Macro-F1 of 0.88 on the Pima dataset, while the Stroke dataset showed a Weighted-F1 of 0.94 but a lower Macro-F1 of 0.67 due to class imbalance. The analysis also indicated a performance-robustness trade-off, with ensembles exhibiting lower variability than SVM.
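The gap between Weighted-F1 and Macro-F1 on the Stroke dataset is the pattern class imbalance typically produces: Weighted-F1 is dominated by the majority class, while Macro-F1 gives the minority class equal say. The sketch below computes both metrics from scratch on made-up labels (95 negatives, 5 positives). The data and function names are illustrative and are not taken from the paper or its framework.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by class frequency in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(support[c] / n * per_class_f1(y_true, y_pred, c) for c in support)


# Made-up toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# The classifier gets every negative right but finds only 1 of the 5 positives.
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

On this toy data the majority class has an F1 near 0.98 and the minority class an F1 of about 0.33, so Weighted-F1 stays close to the majority score while Macro-F1 falls well below it.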
Implications
The proposed framework enhances the reproducibility and interpretability of machine learning models in healthcare, potentially leading to better risk prediction and decision-making. Its design can be adapted for other domains requiring categorical predictions, making it a versatile tool in the field of automated machine learning.
PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting
Time Series
- PeakFocus unifies peak localization and intensity regression in a single framework.
- The framework employs a triple hybrid loss for joint supervision of peak timing and intensity.
- Multi-Scale Mixing Peak Locator mitigates misjudgment and timing misalignment.
- Location-Aware Decoder improves intensity estimation by incorporating peak timing context.
Summary
The paper introduces PeakFocus, a novel framework for Electricity Load Peak Forecasting (ELPF) that addresses significant limitations in existing forecasting methods. Traditional approaches often separate peak timing localization from intensity regression, leading to inefficiencies and inaccuracies. PeakFocus proposes a Unified Peak-Aware Pipeline (UPAP) that employs a triple hybrid loss to jointly supervise both tasks, enhancing their interdependence. Additionally, the Multi-Scale Mixing Peak Locator (MSM-PL) is introduced to resolve multi-scale representation conflicts by integrating coarse-grained features into fine-grained ones, thereby improving timing precision and reducing peak misjudgment. The Location-Aware Decoder (LAD) further enhances intensity estimation by incorporating peak timing context, counteracting the tendency for intensity smoothing. The framework is validated through extensive experiments on both public and industrial-scale datasets, demonstrating superior performance in timing precision and intensity estimation compared to baseline models.
Methodology
PeakFocus utilizes a unified framework that integrates three main components: the Unified Peak-Aware Pipeline (UPAP) for joint supervision of localization and regression, the Multi-Scale Mixing Peak Locator (MSM-PL) for resolving multi-scale representation conflicts, and the Location-Aware Decoder (LAD) for enhancing intensity regression by injecting peak timing context.
Results
The experiments conducted on the public Electricity (ELC) dataset and the World Large-scale Electricity Load (WLEL) dataset indicate that PeakFocus significantly outperforms baseline models in terms of both timing precision and intensity estimation, demonstrating its effectiveness in real-world applications.
Implications
The proposed framework has significant implications for electricity grid management, enabling more accurate forecasting of load peaks, which is crucial for resource allocation and risk management in power systems. It can potentially enhance operational efficiency and reliability in energy distribution.
Physics-Informed Generative Solver: Bridging Data-Driven Priors and Conservation Laws for Stable Spatiotemporal Field Reconstruction
Generative Models
Theory
Time Series
- Introduces a physics-informed generative framework for spatiotemporal field reconstruction.
- Decouples training and inference processes to enhance flexibility and stability.
- Demonstrates effectiveness in acoustic systems and generalizes to chaotic flows and meteorological fields.
- Addresses the challenge of observational sparsity in physical sciences.
Summary
This paper presents a novel framework called the Physics-Informed Generative Solver (PIGS) aimed at reconstructing spatiotemporal fields from sparse measurements, addressing the challenges posed by the ill-posed nature of inverse problems in physical sciences. The authors introduce a two-step approach: during training, they utilize Martingale-Regularized Score Matching (MRSM) to couple denoising score matching with a Score Fokker–Planck Equation constraint, ensuring a dynamically stable generative prior. During inference, the Physics-Informed Implicit Score Sampling (PI-ISS) method projects samples towards the physical manifold by back-propagating conservation-law residuals. This separation of training and inference allows for flexible reconstruction from incomplete and multimodal observations while maintaining physical consistency. The framework is validated in acoustic systems, where it successfully co-generates pressure and particle velocity fields from sparse data, and it is shown to generalize to chaotic flows and meteorological fields under extreme data sparsity. By combining dynamical stability with physics-constrained probabilistic inference, this work establishes a robust paradigm for solving high-dimensional inverse problems, bridging generative AI with first-principles science.
Methodology
The methodology involves two main components: Martingale-Regularized Score Matching (MRSM) for training, which combines denoising score matching with constraints from the Score Fokker–Planck Equation to ensure stability, and Physics-Informed Implicit Score Sampling (PI-ISS) for inference, which projects generated samples towards the physical manifold by incorporating conservation-law residuals.
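The idea of nudging a generated sample toward physical consistency by descending a conservation-law residual can be illustrated with a deliberately tiny example. This is not PI-ISS: the score-based sampling step is omitted entirely, the "conservation law" is a made-up constraint that the field values sum to a fixed total, and the gradient is written by hand rather than back-propagated. All names and numbers are hypothetical.

```python
def residual(x, total):
    """Toy conservation law: the field values must sum to a conserved total."""
    return sum(x) - total


def guide(x, total, step=0.1, iters=50):
    """Gradient descent on 0.5 * residual**2, starting from a generated sample."""
    x = list(x)
    for _ in range(iters):
        r = residual(x, total)
        # d/dx_i of 0.5 * r**2 is r for every component of this linear constraint.
        x = [xi - step * r for xi in x]
    return x


sample = [0.9, 2.2, 3.1, 4.3]  # made-up "generated" field, sums to 10.5
guided = guide(sample, total=10.0)
```

Each iteration shrinks the residual by a constant factor, so the guided field satisfies the constraint to numerical precision while the differences between neighboring values, which the constraint does not involve, are left unchanged.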
Results
The framework successfully reconstructs coupled pressure and particle velocity fields in acoustic systems from sparse measurements, effectively transforming sparse physical arrays into dense virtual arrays and mitigating spatial aliasing. It also demonstrates generalizability to chaotic Kolmogorov flows and large-scale meteorological fields, showcasing its robustness under extreme data sparsity.
Implications
The proposed framework has significant implications for various fields in physical sciences where data is often sparse and incomplete. It can enhance the accuracy of simulations and reconstructions in areas such as meteorology, fluid dynamics, and acoustics, potentially leading to improved predictive models and better understanding of complex physical systems.
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
Optimization
Computer Vision
Multimodal
- Identifies the limitations of linear scalarization in multi-task RRG optimization.
- Introduces the concept of a 'Double Dilemma' in gradient dynamics affecting RRG performance.
- Proposes CAME-Grad, a new optimizer that enhances gradient dynamics for better multi-task learning.
- Demonstrates substantial performance improvements in clinical efficacy across multiple RRG methods.
Summary
This paper addresses the challenges in multi-task learning for automatic radiology report generation (RRG), particularly the limitations of linear scalarization strategies that fail to balance clinical supervision with report generation smoothness. The authors analyze the failure mechanisms of these strategies through gradient dynamics, identifying a 'Double Dilemma' characterized by drift term deviation and diffusion term decay. To overcome these issues, they propose a novel optimizer called Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad), which enhances optimization dynamics without altering the core architecture of existing models. CAME-Grad employs three key mechanisms: Conflict-Averse Direction Rectification to mitigate destructive interference, Magnitude-Enhanced Energy Injection to restore gradient magnitude, and Adaptive Gradient Fusion to balance optimal directions with task-specific biases. Experimental results demonstrate that CAME-Grad significantly improves clinical efficacy across various RRG methods, achieving average performance gains of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray datasets. This work highlights the importance of gradient dynamics in multi-task optimization and provides a practical solution to enhance the performance of RRG systems.
Methodology
The authors utilize the stochastic differential equation (SDE) framework to analyze the gradient dynamics of multi-task RRG, revealing the intrinsic conflicts in optimization. They then develop CAME-Grad, which incorporates Conflict-Averse Direction Rectification, Magnitude-Enhanced Energy Injection, and Adaptive Gradient Fusion to reshape optimization dynamics and improve task performance.
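To make "conflicting gradients" concrete, the sketch below shows a generic two-task combination: a PCGrad-style projection removes the component of each task gradient that opposes the other, and the merged vector is then rescaled. This is a stand-in technique, not CAME-Grad's actual update rule. The rescaling target (the mean of the two original norms) and all names are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def project_out_conflict(g, other):
    """If g opposes `other` (negative dot product), remove g's component along it."""
    d = dot(g, other)
    if d >= 0:
        return list(g)
    scale = d / dot(other, other)
    return [a - scale * b for a, b in zip(g, other)]


def combine(g1, g2):
    """Project each gradient against the other, sum, then restore magnitude."""
    p1 = project_out_conflict(g1, g2)
    p2 = project_out_conflict(g2, g1)
    merged = [a + b for a, b in zip(p1, p2)]
    target = 0.5 * (norm(g1) + norm(g2))  # illustrative choice of target norm
    m = norm(merged)
    if m == 0:
        return merged
    return [target / m * a for a in merged]


# Made-up gradients for two tasks that partially oppose each other.
g_clinical = [1.0, 0.0]
g_language = [-0.5, 1.0]
update = combine(g_clinical, g_language)
```

With these inputs the two task gradients have a negative dot product, yet the combined update has a non-negative dot product with both, so a small step along it does not increase either task's loss to first order.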
Results
CAME-Grad was tested across eight diverse RRG methods, resulting in an average clinical efficacy improvement of 2.3% on the MIMIC-CXR dataset and 1.9% on the IU X-Ray dataset, demonstrating its effectiveness as a universal optimizer.
Implications
The findings suggest that optimizing gradient dynamics can significantly enhance the performance of multi-task learning systems in clinical applications, potentially leading to more accurate and reliable automated radiology report generation.