AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Local Motion Matters: A Deconstruct-Recompose Paradigm for Reinforcement Learning Pre-training from Videos
Reinforcement Learning
Robotics
- Introduction of the Deconstruct-Recompose Paradigm (DRP) for RL pre-training.
- Focus on local motion patterns rather than global motions for better transferability.
- Development of a Dual-Attention Encoder (DAE) to learn local motion representations.
- Significant improvements in sample efficiency and performance across various tasks.
Summary
This paper presents a novel approach to improve reinforcement learning (RL) efficiency through a Deconstruct-Recompose Paradigm (DRP) that focuses on local motion representations derived from large-scale video data. Traditional methods often treat agents as indivisible entities, modeling global motion patterns that are tightly coupled with morphology, which limits transferability across different domains. The authors argue that local motion components exhibit similarities across various agents, which can be leveraged for better cross-domain transfer. In the Deconstruct phase, the method identifies local points and tracks their frame-wise motions, defining these as Atomic Actions. A Dual-Attention Encoder (DAE) is then employed to learn local motion representations, capturing their spatiotemporal relationships. In the Recompose phase, these local representations are aggregated using a learnable Motion Aggregation Token (MAT) and enriched with dynamic semantics through latent dynamics model learning. The paper demonstrates that this approach significantly enhances sample efficiency and performance in diverse robotic control and manipulation tasks, indicating the effectiveness of local motion modeling in RL pre-training.
Methodology
The methodology consists of two main phases: Deconstruct and Recompose. In the Deconstruct phase, local motion components (Atomic Actions) are identified and tracked, and a Dual-Attention Encoder (DAE) is used to learn their representations. The Recompose phase involves aggregating these representations with a learnable Motion Aggregation Token (MAT) and applying a latent dynamics model to enhance them with dynamic semantics. An adapter is also utilized to connect local motion representations to specific action dynamics for policy learning.
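The aggregation step in the Recompose phase can be pictured as attention pooling, where a single query token attends over the local motion representations. The sketch below is an illustration only: it assumes single-head scaled dot-product attention with identity key/value projections, and the name `aggregate_with_token` and the toy vectors are hypothetical, not taken from the paper.

```python
import math

def aggregate_with_token(token, local_reps):
    """Pool local representations with one query token (single-head attention)."""
    dim = len(token)
    # Scaled dot-product score between the token and each local representation.
    scores = [sum(q * k for q, k in zip(token, rep)) / math.sqrt(dim)
              for rep in local_reps]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    attention = [e / norm for e in exps]
    # Attention-weighted sum of the local representations.
    pooled = [sum(a * rep[j] for a, rep in zip(attention, local_reps))
              for j in range(dim)]
    return pooled, attention

# Hypothetical 2-D representations of two local points.
token = [0.0, 0.0]
local_reps = [[1.0, 0.0], [0.0, 1.0]]
pooled, attention = aggregate_with_token(token, local_reps)
print(pooled, attention)
```

In a trained model the token and the projections would be learned parameters; here an all-zero token simply yields uniform attention, so the pooled vector is the mean of the inputs.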
Results
The experimental results show that the proposed DRP significantly outperforms existing methods in terms of sample efficiency and overall performance in various robotic control and manipulation tasks, validating the effectiveness of local motion representations in reinforcement learning.
Implications
The findings suggest that focusing on local motion patterns can lead to more efficient RL training and better generalization across different tasks and domains. This approach could be applied in various fields requiring robotic control and manipulation, potentially enhancing the adaptability of RL agents.
Spectroscopy Analysis with Machine Learning Regression for the Quantification of Carbon and Nitrogen Contents in Inceptisol and Oxisol Soil Types: Comparing Different Preprocessing and Validation methods as well as Feature Importance
Efficient ML
- NIR spectroscopy offers a rapid, cost-effective alternative to traditional soil analysis methods.
- Savitzky-Golay filter and NIPALS-based outlier removal were the most effective preprocessing techniques.
- Oxisols showed better predictive performance for C and N content compared to Inceptisols.
- The models reached RPD values above 2.0 with minimal overfitting, indicating reliable performance.
Summary
This paper explores the application of Near-Infrared (NIR) spectroscopy combined with machine learning (ML) techniques to quantify carbon (C) and nitrogen (N) contents in two prevalent Brazilian soil types: Inceptisols and Oxisols. The study highlights the advantages of NIR spectroscopy over traditional soil analysis methods, such as speed, cost-effectiveness, and non-destructive testing. Various preprocessing techniques were assessed, with the Savitzky-Golay filter and a robust outlier removal method based on the Non-linear Iterative Partial Least Squares (NIPALS) algorithm being the most effective. The authors compared multiple validation strategies, including 10-fold cross-validation, leave-one-out, and holdout via the Kennard-Stone method. Stacking ensemble learning models were utilized, incorporating Partial Least Squares (PLS), Support Vector Regression (SVR), and Ridge regression as base models, with linear regression serving as the meta-model. The evaluation metrics included R², Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Ratio of Performance to Deviation (RPD). The results demonstrated superior predictive performance for Oxisols (R² = 0.91 for C and R² = 0.89 for N) compared to Inceptisols (R² = 0.79 for C and R² = 0.77 for N), indicating the influence of soil characteristics on model performance. The models achieved an RPD greater than 2.0 with minimal overfitting, validating the effectiveness of this approach for rapid quantification of C and N, thus supporting sustainable agricultural practices.
Methodology
The study employed NIR spectroscopy to collect spectral data from soil samples, followed by various preprocessing techniques including the Savitzky-Golay filter and NIPALS for outlier removal. Stacking ensemble learning models were constructed using PLS, SVR, and Ridge regression as base models, with linear regression as the meta-model. Multiple validation strategies were applied to assess model performance.
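For readers unfamiliar with the Kennard-Stone holdout mentioned in the summary, the following is a minimal sketch of the selection rule: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The selected samples typically form the calibration set and the rest the test set. The function name and the toy one-dimensional "spectra" are assumptions for illustration, not the authors' code or data.

```python
import math

def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(samples)")
    # Pairwise Euclidean distances between all samples.
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while len(selected) < n_select:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(sorted(remaining),
                  key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Hypothetical one-feature samples, for illustration only.
spectra = [[0.0], [1.0], [2.0], [10.0], [5.0]]
calibration_idx = kennard_stone(spectra, 3)
print(calibration_idx)
```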
Results
The models achieved R² values of 0.91 for carbon and 0.89 for nitrogen in Oxisols, while Inceptisols yielded R² values of 0.79 for carbon and 0.77 for nitrogen. The results indicated a significant performance gap between the two soil types, attributed to their distinct pedological characteristics.
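The four evaluation metrics named in the summary can be computed as in the sketch below. It assumes the common convention that RPD is the standard deviation of the reference values divided by the RMSE, and uses the sample (n - 1) standard deviation; the paper may use a different variant. The data values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and RPD for paired reference/predicted values."""
    n = len(y_true)
    if n < 2 or n != len(y_pred):
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    # Sample standard deviation (n - 1 denominator) of the reference values.
    sd = math.sqrt(ss_tot / (n - 1))
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "mae": mae,
        "rpd": sd / rmse,  # undefined when predictions are perfect (rmse == 0)
    }

# Hypothetical carbon contents and predictions, for illustration only.
reference = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [10.5, 11.5, 14.5, 15.5, 18.5]
metrics = regression_metrics(reference, predicted)
print(metrics)
```

Under this convention, and when both quantities are computed on the same samples, RPD is approximately 1 / sqrt(1 - R²), so an R² of 0.77 corresponds to an RPD of roughly 2.1, which is consistent with the reported RPD above 2.0.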
Implications
The findings suggest that the developed machine learning approach can significantly enhance soil nutrient quantification, facilitating more efficient and environmentally friendly agricultural practices. This method can aid producers and consultants in making informed decisions based on soil fertility indicators.
Distributed Online Bandit Submodular Maximization with Bounded Sampling Violations
Optimization
Theory
- Developed a unified algorithmic framework for distributed online submodular maximization under partition matroid constraints.
- Achieved sublinear (1 - 1/e)-regret guarantees for both full-information and bandit feedback models.
- Introduced a bounded stochastic pipage rounding scheme to address sampling violations.
- Demonstrated that cumulative sampling violations remain sublinear in T.
Summary
This paper addresses the problem of distributed online submodular maximization under partition matroid constraints, where multiple agents sequentially select actions from their own subsets to maximize cumulative values of time-varying objective functions. The authors propose a unified algorithmic framework that works with both full-information and bandit feedback models. They demonstrate that their algorithms achieve sublinear (1 - 1/e)-regret guarantees, comparable to existing centralized methods. To mitigate sampling violations due to continuous relaxation and rounding, a bounded stochastic pipage rounding scheme is introduced, ensuring that the probability of sampling violations diminishes asymptotically. The cumulative sampling violation remains sublinear in T, and the authors show that under certain conditions this bound cannot be improved. The theoretical findings are supported by numerical results, validating the effectiveness of the proposed methods.
Methodology
The authors developed a unified algorithmic framework that integrates both full-information and bandit feedback models for distributed online submodular maximization. They introduced a bounded stochastic pipage rounding scheme to handle sampling violations, ensuring that the probability of such violations decreases asymptotically. The algorithms were analyzed for their regret guarantees and performance through theoretical proofs and numerical simulations.
Results
The proposed algorithms achieved sublinear (1 - 1/e)-regret guarantees, which are on par with existing centralized algorithms. The bounded stochastic pipage rounding scheme effectively reduced sampling violations, maintaining them at a sublinear level in relation to the time horizon T. Numerical results corroborated the theoretical findings, demonstrating the practical applicability of the proposed methods.
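For reference, the (1 - 1/e)-regret mentioned above is usually defined against a discounted best fixed feasible set in hindsight. The form below is the standard definition from the online submodular maximization literature, written with assumed notation (f_t for the round-t objective, S_t for the set played, and I for the family of sets feasible under the partition matroid); the paper's exact definition may differ in details such as where the expectation is taken.

```latex
\[
R_{\alpha}(T) \;=\; \alpha \max_{S \in \mathcal{I}} \sum_{t=1}^{T} f_t(S)
\;-\; \sum_{t=1}^{T} \mathbb{E}\big[ f_t(S_t) \big],
\qquad \alpha = 1 - \frac{1}{e},
\qquad R_{\alpha}(T) = o(T).
\]
```

The factor (1 - 1/e) appears because monotone submodular maximization cannot in general be approximated better than this ratio in polynomial time, so the comparator is discounted accordingly.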
Implications
The findings of this paper have significant implications for large-scale decision-making in multi-agent systems, particularly in applications such as resource allocation, sensor selection, and task coordination. The ability to achieve near-optimal performance with bounded sampling violations can enhance the efficiency and effectiveness of distributed systems in real-world scenarios.
How Early Is Early Enough? Design-Dependent Observation-Window Sufficiency in Subscription Churn Prediction
Time Series
- The sufficiency of early observation windows for churn prediction varies significantly across different experimental designs.
- A nine-window sufficiency curve indicates diminishing returns in predictive performance within a 45-90 day range.
- Contract-driven factors dominate churn prediction, but behavioral data adds predictive value in high-churn segments.
- Early predictability is robust and not solely due to survivorship bias in the dataset.
Summary
This paper investigates the optimal duration of early behavioral observation necessary for effective subscription churn prediction, using the public KKBox dataset. The authors develop a sufficiency curve to determine how many days of early activity are required to achieve significant predictive performance. They find that while a fixed-horizon design shows a knee in performance at 45-90 days, this sufficiency curve is not universally applicable across different cohort designs. The study highlights that the early warning signals for churn are heavily influenced by contract status, particularly in segments with high churn rates. The authors stress the importance of specifying cohort construction, target definitions, and feature sets when making claims about observation window sufficiency. Their findings contribute to understanding the dynamics of churn prediction and offer insights into the timing of customer interventions.
Methodology
The authors constructed a sufficiency curve based on early behavioral data from the KKBox dataset, employing three different cohort/task designs to stress-test the sufficiency of observation windows. They utilized various models, including GBDT, 1D-CNN, and logistic regression, to analyze the predictive performance across different time windows.
Results
The study revealed that the optimal observation window for churn prediction is not fixed and can shift based on the cohort design and feature set used. The sufficiency curve demonstrated a knee at 45-90 days, with significant predictive improvements noted in the manual-renewal segment of users. The ROC-AUC scores increased from 0.913 at day 7 to 0.963 at day 120, indicating strong early predictability.
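A sufficiency curve of this kind can be sketched as ROC-AUC evaluated per observation window. The example below assumes each window has already produced one churn score per user; the pairwise AUC definition is standard, while the function names, labels, and scores are hypothetical and not from the KKBox experiments.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a churner outscores a non-churner."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def sufficiency_curve(labels, scores_by_window):
    """Map each observation window (in days) to the AUC of its scores."""
    return {days: roc_auc(labels, scores)
            for days, scores in sorted(scores_by_window.items())}

# Hypothetical churn labels and per-window model scores.
labels = [1, 0, 1, 0]
scores_by_window = {
    7: [0.6, 0.5, 0.4, 0.3],
    30: [0.9, 0.2, 0.8, 0.1],
}
curve = sufficiency_curve(labels, scores_by_window)
print(curve)
```

Plotting such a curve against window length is what exposes a knee: the point after which extra days of observation add little AUC.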
Implications
The findings suggest that subscription-based companies should carefully consider the timing of customer interventions based on the specific characteristics of their user base and the design of their predictive models. This research can guide businesses in optimizing their churn prediction strategies and improving customer retention efforts.
GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity
Reinforcement Learning
Large Language Models
Theory
- GRPO, Dr. GRPO, and DAPO are variations of a single operation on the standard deviation of sampled answers.
- The group-standard-deviation identity shows that disagreement among answers directly influences training updates.
- A split group maximizes learning potential, while unanimous groups provide minimal training benefit.
- The paper provides closed-form expressions for group size and difficulty bias, aiding practitioners in model training.
Summary
This paper explores three popular methods for training language models, namely Group Relative Policy Optimization (GRPO), Dr. GRPO, and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), and reveals that they are variations of a single underlying mechanism that adjusts the standard deviation of a prompt's sampled answers. The authors argue that the standard deviation reflects the level of disagreement among the answers, which is crucial for effective training. The paper establishes the 'group-standard-deviation identity,' showing that the training update size is directly proportional to the disagreement among answers. The authors validate their findings using a large real-world difficulty dataset (Big-Math) and controlled training runs, demonstrating that the choice of group size and the handling of disagreement significantly influence learning outcomes. The results indicate that a split group provides the most informative training signals, while unanimous groups contribute little to learning. This work emphasizes the importance of understanding the dynamics of group sampling in reinforcement learning settings, particularly in the context of language model training.
Methodology
The authors derive the group-standard-deviation identity mathematically, analyzing the training updates in reinforcement learning with verifiable rewards. They present closed-form expressions for the updates based on group size and the distribution of correct and incorrect answers. The methodology includes both theoretical derivations and empirical validation using the Big-Math dataset.
Results
The paper establishes that the per-prompt update in GRPO is scaled by the group reward standard deviation, which clarifies how group size affects learning fidelity. It gives closed-form expressions for the group size G needed to reach a given level of fidelity, with recommendations that depend on task difficulty. The findings also highlight the probability of encountering silent groups, which do not contribute to learning.
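The role of the group standard deviation can be illustrated with the commonly used advantage definitions: GRPO divides the centred reward by the group standard deviation, Dr. GRPO omits that division, and DAPO's dynamic sampling discards groups whose standard deviation is zero. The sketch below is not the paper's derivation. It assumes binary rewards, the population standard deviation, and that a "silent" group means a unanimous one; all function names are hypothetical.

```python
import math

def group_stats(rewards):
    """Mean and population standard deviation of one prompt's rewards."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return mean, math.sqrt(var)

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style: centre by the group mean, divide by the group std."""
    mean, std = group_stats(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO-style: centre by the group mean, no std division."""
    mean, _ = group_stats(rewards)
    return [r - mean for r in rewards]

def dapo_keeps_group(rewards):
    """DAPO-style dynamic sampling: drop groups with zero std."""
    _, std = group_stats(rewards)
    return std > 0.0

def silent_group_probability(p, group_size):
    """Chance that all answers agree when each is correct with probability p."""
    return p ** group_size + (1.0 - p) ** group_size

split = [1, 1, 0, 0]      # maximal disagreement: std = sqrt(0.5 * 0.5) = 0.5
unanimous = [1, 1, 1, 1]  # no disagreement: std = 0
print(group_stats(split), grpo_advantages(split), dr_grpo_advantages(split))
print(dapo_keeps_group(unanimous), silent_group_probability(0.5, 8))
```

For binary rewards with a fraction p of correct answers, the group standard deviation is sqrt(p(1 - p)), which peaks at p = 0.5 and vanishes for unanimous groups. This matches the summary's point that split groups are the most informative.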
Implications
This research has significant implications for the training of language models, particularly in optimizing group sampling strategies to enhance learning efficiency. It provides a theoretical framework that can guide practitioners in selecting appropriate group sizes and understanding the impact of answer disagreement on model performance.
GAIA: Geometry-Adaptive Operator Learning for Forward and Inverse Problems
Optimization
Theory
Efficient ML
- GAIA provides a unified model for both forward and inverse problems on arbitrary geometries.
- Introduces new benchmarks for varying-geometry inverse and BVP problems.
- Achieves state-of-the-art results on all evaluated tasks, significantly reducing error rates.
- Maintains stable accuracy across point resolutions, outperforming transformer-based baselines.
Summary
The paper presents GAIA, a novel operator learning model designed to address the limitations of existing geometry-adaptive neural operators, which primarily focus on forward problems. GAIA is capable of handling both forward and inverse problems on arbitrary geometries without the need for retraining or iterative optimization. The model utilizes a dual-pathway tokenization approach that encodes both the domain boundary and the interior field distribution into geometry tokens. These tokens condition integral transform layers through cross-attention, allowing the model to adapt locally to geometric features. GAIA is evaluated on seven benchmarks, including new or extended tasks for inverse problems and boundary value problems (BVPs), such as electrical impedance tomography and optical tomography. The results demonstrate that GAIA achieves state-of-the-art performance, significantly reducing median relative L2 error compared to existing methods while maintaining competitive accuracy on forward problems across varying resolutions.
Methodology
GAIA employs a geometry-adaptive integral autoencoder architecture that encodes domain geometry into tokens. It utilizes multi-head cross-attention to condition integral kernels on these tokens, allowing for spatially adaptive integration. This architecture enables efficient single-pass solutions for both forward and inverse problems without the need for iterative optimization or retraining.
Results
GAIA sets new state-of-the-art results on all inverse and BVP tasks, achieving a 64% reduction in median relative L2 error for airfoil flow reconstruction and a 27% reduction for electrical impedance tomography compared to the next best method. It also outperforms all baselines on various shape categories in the modified mechanical components benchmark, while remaining competitive on forward problems.
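The headline metric, median relative L2 error, can be computed as in the sketch below. It assumes each sample is a flat list of field values at the query points; the function names and numbers are hypothetical and only illustrate the metric, not GAIA itself.

```python
import math
import statistics

def relative_l2_error(pred, target):
    """||pred - target||_2 / ||target||_2 for one sample."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t * t for t in target))
    return num / den

def median_relative_l2(preds, targets):
    """Median of the per-sample relative L2 errors over a test set."""
    return statistics.median(
        relative_l2_error(p, t) for p, t in zip(preds, targets)
    )

# Hypothetical predictions for three test samples with the same target field.
targets = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
preds = [[3.0, 4.0], [0.0, 0.0], [3.0, 4.5]]
print(median_relative_l2(preds, targets))
```

Reporting the median rather than the mean makes the figure less sensitive to a few badly reconstructed samples.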
Implications
The development of GAIA has significant implications for fields requiring efficient and accurate solutions to PDEs, such as fluid dynamics, medical imaging, and structural analysis. Its ability to handle varying geometries without retraining opens up new avenues for real-time applications and multi-query settings.
Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol
Interpretability
- Introduction of Manifestation Units, a structured protocol for organizing neural network component analyses.
- Demonstrated that structured representation improves retrieval performance significantly over unstructured methods.
- Established causal relationships and minimal-optimal core components in CNN and GPT-2 architectures.
- Provided empirical evidence supporting the effectiveness of the proposed schema in mechanistic interpretability.
Summary
This paper addresses the challenges of mechanistic interpretability in neural networks, particularly the difficulty in reusing outputs from component-level analyses such as selectivity tables and circuit diagrams. The authors identify the representation layer as a bottleneck and propose the Manifestation Unit Protocol, which organizes component-level statistics into a structured format (E, S, R, D, G, T) that allows for effective querying and retrieval. The protocol is evaluated across three architectures: generative vision (β-VAE), discriminative vision (CNN), and language (GPT-2). The findings demonstrate that a typed structure significantly enhances retrieval performance compared to unstructured baselines, and that CNN filters retrieved through this schema meet causal sufficiency and necessity criteria. The study emphasizes the importance of structured representation for facilitating mechanistic interpretability and provides a foundation for future research in this area.
Methodology
The authors developed the Manifestation Unit Protocol, a typed tuple protocol that organizes component-level statistics into six fields. They employed a hybrid retrieval system combining exact matching and dense semantic search. The protocol was instantiated across three neural network architectures, and various hypotheses were tested regarding structural accessibility and causal mediation.
Results
The results showed that the structured representation outperformed unstructured baselines in retrieval tasks, with the GPT-2 schema achieving an oracle recall@30 of 0.411. In CNN experiments, retrieved filters significantly influenced predictions, demonstrating causal mediation. The study identified a minimal-optimal core (S+R) and revealed redundancy and interference among other fields.
Implications
The findings suggest that structured representation can enhance the usability of mechanistic interpretability analyses, making them more actionable for downstream applications such as auditing and intervention in neural networks. This work lays the groundwork for future advancements in interpretability methods.
A Stationary-Distribution Theory for Triplet-Based Plateau Search in Random Forest Ensemble-Size Selection
Theory
Optimization
- Introduces a stationary-distribution theory for ensemble-size selection in Random Forests.
- Models the ensemble size as a birth-death Markov chain to analyze its behavior.
- Demonstrates that the central ensemble size fluctuates around a stationary regime.
- Provides equilibrium equations that characterize the stationary center and spread.
Summary
This paper addresses the challenge of selecting the optimal number of trees in Random Forests, a critical hyperparameter that balances prediction accuracy and computational cost. The authors introduce a stationary-distribution theory to analyze the triplet-based plateau search method for ensemble-size selection. This method adapts the number of trees by comparing out-of-bag scores at three points: a lower bound, a central point, and an upper bound, represented as a geometric triplet. The study models the central ensemble size as a birth-death Markov chain and derives its stationary distribution through local balance. The findings indicate that the central ensemble size fluctuates around a stationary regime rather than converging to a fixed value. The paper provides equilibrium equations for both the original and a modified update rule, revealing that the stationary center and spread are influenced by the scale factor and the update rule. This research reinterprets the plateau-based tuning process as a stochastic rather than deterministic approach, offering insights into the dynamics of ensemble-size selection in Random Forests.
Methodology
The authors develop a theoretical framework using a birth-death Markov chain to model the ensemble size in Random Forests. They derive the stationary distribution through local balance and analyze the equilibrium equations for both the original and modified update rules. The study employs a centered folded-normal approximation to characterize the stationary center and spread, and utilizes a local Gaussian approximation for variance estimation.
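The local-balance step can be made concrete for a generic finite birth-death chain: the stationary probabilities satisfy pi[i] * birth[i] = pi[i+1] * death[i], so each one follows from its neighbour. The sketch below uses hypothetical transition probabilities over abstract state indices and does not reproduce the paper's triplet-based update rule.

```python
def birth_death_stationary(birth, death):
    """Stationary distribution of a finite birth-death chain.

    birth[i] is the probability of moving from state i to i + 1 and
    death[i] the probability of moving from state i + 1 back to i.
    """
    if len(birth) != len(death):
        raise ValueError("birth and death must have the same length")
    # Local balance: pi[i] * birth[i] == pi[i + 1] * death[i].
    weights = [1.0]
    for b, d in zip(birth, death):
        weights.append(weights[-1] * b / d)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical three-state chain: upward drift from state 0, downward from state 2.
birth = [0.5, 0.25]
death = [0.25, 0.5]
pi = birth_death_stationary(birth, death)
print(pi)
```

The chain settles into a distribution concentrated around the middle state rather than at a single value, which mirrors the paper's reading of the ensemble size as fluctuating around a stationary regime.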
Results
The analysis shows that the stationary center of the ensemble size scales as $O(\varepsilon^{-2})$ as $\varepsilon \to 0$, while the stationary spread satisfies $\sigma_{B,*} = O(\varepsilon^{-2})$, with variance $O(\varepsilon^{-4})$. Since center and spread scale at the same rate, the relative spread is independent of $\varepsilon$ and is controlled by the scale factor and update rule, highlighting the stochastic nature of the plateau-based tuning process.
Implications
The findings suggest that Random Forest ensemble-size selection can be treated as a stochastic process, which may lead to more robust hyperparameter optimization strategies. This approach can improve the reliability of variable importance measures derived from Random Forests, particularly in high-dimensional settings with correlated features.
Joint discovery of governing partial differential equations from multi-source datasets by competitive optimization
Optimization
Interpretability
Theory
- Introduction of MCO-PDE framework for discovering PDEs from multi-source datasets.
- Utilizes independent neural surrogates and a soft-competitive weighting mechanism.
- Achieves high accuracy in recovering canonical equations with limited observations.
- Handles complex domains and extracts meaningful laws from real-world experiments.
Summary
This paper presents a novel framework, termed MCO-PDE, for the joint discovery of governing partial differential equations (PDEs) from multiple datasets characterized by different initial conditions or boundary configurations. Traditional data-driven approaches often rely on single datasets, which limits their effectiveness in scenarios with sparse observations. The proposed MCO-PDE framework addresses this limitation by training independent neural surrogates for each dataset and employing a soft-competitive weighting mechanism to evaluate dataset credibility. This allows for the aggregation of a consensus global coefficient. The methodology integrates a genetic algorithm for structural search, enabling the simultaneous identification of functional forms and parameters of governing laws. The authors demonstrate that the framework can accurately recover canonical equations using as few as 50 observations per dataset across seven different cases. Furthermore, MCO-PDE is capable of handling complex two- and three-dimensional domains with irregular boundaries and heterogeneous coefficients, successfully extracting meaningful physical laws from real-world wave-tank experiments. This work highlights the potential of automated scientific discovery through the fusion of heterogeneous data sources.
Methodology
The MCO-PDE framework involves training independent neural surrogates for each data source, followed by a soft-competitive weighting mechanism to assess dataset credibility. It integrates a genetic algorithm for structural search to identify both the functional forms and parameters of governing equations.
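One plausible reading of a soft-competitive weighting is a softmax over negative per-dataset losses, so that datasets whose surrogates fit the candidate equation better receive more weight in the consensus coefficient. The sketch below is purely an assumption made for illustration: the softmax form, the `temperature` parameter, and both function names are hypothetical and are not confirmed by the summary.

```python
import math

def soft_competitive_weights(losses, temperature=1.0):
    """Softmax over negative losses: lower loss gives higher weight."""
    scaled = [-loss / temperature for loss in losses]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def consensus_coefficient(coefficients, losses, temperature=1.0):
    """Credibility-weighted average of per-dataset coefficient estimates."""
    weights = soft_competitive_weights(losses, temperature)
    return sum(w * c for w, c in zip(weights, coefficients))

# Hypothetical per-dataset estimates of one PDE coefficient and their losses.
print(consensus_coefficient([1.0, 3.0], [0.2, 0.2]))    # equal credibility
print(consensus_coefficient([1.0, 3.0], [0.0, 100.0]))  # first dataset dominates
```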
Results
The framework successfully recovers canonical equations with high accuracy using as few as 50 observations per dataset across seven cases. It effectively handles two- and three-dimensional domains with irregular boundaries and heterogeneous coefficients, demonstrating its applicability to real-world scenarios.
Implications
The MCO-PDE framework offers a promising approach for automated scientific discovery, enabling researchers to uncover governing laws from diverse datasets. This could significantly enhance the understanding of complex physical systems and improve predictive modeling in various scientific fields.
Generative Model Proposal based Particle Filtering for Data Assimilation
Generative Models
Time Series
Theory
- Introduction of Flow Proposal Particle Filters (FPPF) for improved data assimilation.
- FPPF utilizes a learned conditional generative model to propose particle distributions.
- The method effectively reduces weight variance and delays degeneracy in high-dimensional spaces.
- FPPF and its localized variant (L-FPPF) show superior performance in chaotic dynamical systems.
Summary
This paper introduces Flow Proposal Particle Filters (FPPF), a novel approach to data assimilation that enhances particle filtering methods by integrating generative models. Traditional particle filters struggle with high-dimensional state spaces due to degeneracy, where the particle weights collapse to a few particles. FPPF addresses this by proposing a conditional generative model-based proposal distribution that approximates the optimal proposal, steering particles towards high-likelihood regions informed by observations. This method retains a Bayesian update step, allowing for accurate importance weight computation and reducing weight variance. The authors also extend FPPF to high-dimensional problems through a localized variant (L-FPPF), which maintains computational efficiency and stability. Extensive experiments demonstrate that FPPF outperforms classical filters and other generative methods in various dynamical systems, particularly in non-linear and non-Gaussian scenarios, while remaining stable over long assimilation horizons.
Methodology
The authors propose FPPF, which learns a conditional generative model to create a proposal distribution for particle filtering. This model is informed by observations, allowing for effective steering of particles towards high-likelihood areas. The method incorporates a Bayesian update step for accurate weight computation. The localized variant, L-FPPF, further enhances performance in high-dimensional settings by factorizing the proposal log-density locally.
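The Bayesian update that any proposal-based particle filter retains is the standard importance weight: likelihood times transition density divided by proposal density. The sketch below assumes the three log-densities have already been evaluated per particle and that previous weights are uniform (for example, after resampling). The function names and numbers are hypothetical; the learned flow proposal itself is not modelled.

```python
import math

def normalized_weights(log_likelihood, log_transition, log_proposal):
    """Importance weights w_i proportional to p(y|x_i) p(x_i|x_prev) / q(x_i|x_prev, y)."""
    log_w = [ll + lt - lq
             for ll, lt, lq in zip(log_likelihood, log_transition, log_proposal)]
    top = max(log_w)
    unnorm = [math.exp(v - top) for v in log_w]  # log-sum-exp style shift
    total = sum(unnorm)
    return [u / total for u in unnorm]

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2); equals N for uniform weights, near 1 when degenerate."""
    return 1.0 / sum(w * w for w in weights)

zeros = [0.0, 0.0, 0.0, 0.0]
# Well-matched proposal: all particles explain the observation equally well.
balanced = normalized_weights(zeros, zeros, zeros)
# Poor proposal: only one particle lands in a high-likelihood region.
degenerate = normalized_weights([0.0, -50.0, -50.0, -50.0], zeros, zeros)
print(effective_sample_size(balanced), effective_sample_size(degenerate))
```

A proposal closer to the optimal one keeps the weights nearly uniform, which is the sense in which lower weight variance delays degeneracy.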
Results
FPPF consistently outperformed classical filters and prior generative methods across various chaotic dynamical systems, including Lorenz-63, Lorenz-96, and the Kuramoto-Sivashinsky equation. The method showed significant improvements in state estimation accuracy and probabilistic calibration, particularly in non-Gaussian regimes. L-FPPF maintained these advantages as the dimensionality of the state space increased.
Implications
The proposed FPPF framework has potential applications in fields requiring robust data assimilation, such as weather forecasting, climate modeling, and motion tracking. Its ability to handle high-dimensional, non-linear, and non-Gaussian problems makes it a valuable tool for scientific and engineering applications.
Generative Modeling of Quantum Distribution with Functional Flow Matching
Generative Models
Theory
- Introduction of Quantum Flow Matching (QFM) for generative modeling of quantum distributions.
- Utilization of spin Wigner functions to bypass direct density matrix learning.
- Application of Functional Flow Matching (FFM) for effective learning in function space.
- Demonstration of accurate reconstruction of quantum states and physical properties.
Summary
This paper introduces Quantum Flow Matching (QFM), a novel generative model aimed at learning quantum distributions by utilizing spin Wigner functions and flow matching techniques. The authors highlight the challenges of accurately modeling quantum states due to their complex properties and the limitations of existing generative models when applied directly to quantum distributions. QFM circumvents these issues by converting density matrices into spin Wigner functions, which allows for the effective learning of multi-qubit quantum distributions in function space. The methodology involves using Functional Flow Matching (FFM) to learn the distribution of these spin Wigner functions, enabling the generation of new quantum states that maintain the physical properties of the original distributions. The effectiveness of QFM is demonstrated through evaluations of various physical quantities, such as trace, purity, and entanglement entropy, showcasing its ability to accurately capture the underlying physics of quantum systems.
Methodology
The methodology involves converting quantum states into spin Wigner functions, followed by the application of Functional Flow Matching (FFM) to learn the distribution of these functions in function space. This approach allows for the generation of new spin Wigner functions, which are then used to reconstruct quantum states that reflect the original quantum distributions.
Results
The results indicate that QFM effectively learns the underlying physics of quantum systems, as evidenced by accurate evaluations of physical quantities such as trace, purity, and entanglement entropy of the generated quantum states. The method demonstrates a significant improvement in modeling complex quantum distributions compared to traditional approaches.
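For reference, the three physical quantities used in the evaluation have the following standard definitions for a density matrix. The notation is assumed here rather than taken from the paper: rho is the state of an n-qubit system with dimension d = 2^n, and rho_A is the reduced state of subsystem A after tracing out its complement B.

```latex
\[
\operatorname{Tr}(\rho) = 1, \qquad
\gamma(\rho) = \operatorname{Tr}\!\left(\rho^{2}\right) \in \left[\tfrac{1}{d},\, 1\right], \qquad
S(\rho_A) = -\operatorname{Tr}\!\left(\rho_A \log \rho_A\right),
\quad \rho_A = \operatorname{Tr}_B\!\left(\rho_{AB}\right).
\]
```

A generated state is physically valid only if its trace is 1. Purity equals 1 exactly for pure states and 1/d for the maximally mixed state, so checking these quantities on generated samples tests whether the model preserves the structure of the original distribution.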
Implications
The proposed QFM model has the potential to advance the field of quantum machine learning by providing a robust framework for accurately modeling quantum states. This could lead to improved simulations of quantum systems, enhanced understanding of quantum phenomena, and applications in quantum computing and quantum information science.
Policy Optimization Achieves Data-Dependent Regret Bounds in MDPs with Unknown Transitions
Reinforcement Learning
Optimization
Theory
- Introduces a new algorithm for policy optimization in MDPs with unknown transitions.
- Achieves data-dependent regret bounds, adapting to the complexity of the loss sequence.
- Combines first-order, second-order, and path-length bounds with best-of-both-worlds guarantees.
- Identifies a transition-dependent complexity term that impacts regret bounds.
Summary
This paper addresses the challenge of policy optimization in online episodic tabular Markov Decision Processes (MDPs) with unknown transition kernels. The authors develop a novel algorithm based on optimistic follow-the-regularized-leader that achieves data-dependent regret bounds, which adapt to the complexity of the loss sequence. The key innovation is the introduction of optimistic Q-function estimators and a data-dependent transition bonus that mitigates estimator bias. The analysis reveals a transition-dependent complexity term that quantifies the intrinsic cost of estimating the transition kernel. The proposed algorithm achieves first-order, second-order, and path-length bounds while ensuring gap-dependent polylog(T) regret in stochastic scenarios. This work is significant as it is the first to provide data-dependent guarantees for policy optimization under unknown transitions, combining best-of-both-worlds performance in both adversarial and stochastic regimes. Additionally, the algorithm is applicable in a full-information setting, maintaining its guarantees and recovering the performance of existing occupancy-measure-based methods.
Methodology
The authors develop an algorithm based on optimistic follow-the-regularized-leader, utilizing optimistic Q-function estimators and a data-dependent transition bonus to control bias. The analysis includes a detailed examination of the transition-dependent complexity term, which captures the cost of estimating the transition kernel.
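To illustrate the optimistic follow-the-regularized-leader template in its simplest form, here is a sketch on a single probability simplex with a negative-entropy regularizer, using the previous loss vector as the optimistic hint. This is a generic textbook instance, not the paper's algorithm: it has no MDP, no Q-function estimators, and no transition bonus, and the step size `eta` is an arbitrary choice.

```python
import numpy as np

def oftrl_entropy_step(cum_loss, hint, eta):
    """One optimistic FTRL step on the simplex with a negative-entropy regularizer.

    Minimizes <p, cum_loss + hint> + (1/eta) * sum(p * log p); the closed
    form is a softmax of the negated, scaled losses.
    """
    z = -eta * (cum_loss + hint)
    z = z - z.max()                    # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def run_oftrl(losses, eta=0.5):
    """Play OFTRL over a loss sequence, using the last loss as the hint."""
    n_rounds, n_actions = losses.shape
    cum_loss = np.zeros(n_actions)
    hint = np.zeros(n_actions)
    total = 0.0
    p = np.full(n_actions, 1.0 / n_actions)
    for t in range(n_rounds):
        p = oftrl_entropy_step(cum_loss, hint, eta)
        total += float(p @ losses[t])
        cum_loss = cum_loss + losses[t]
        hint = losses[t]               # optimistic guess for the next round
    return total, p

rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 1.0, size=(200, 4))
losses[:, 2] *= 0.2                    # action 2 has the lowest average loss
total_loss, final_p = run_oftrl(losses)
regret = total_loss - losses.sum(axis=0).min()
```

When consecutive losses are similar, the hint is accurate and the regret shrinks, which is the intuition behind path-length-style bounds.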
Results
The algorithm achieves first-order, second-order, and path-length bounds, along with gap-dependent polylog(T) regret in the stochastic regime. It is the first to provide data-dependent guarantees for policy optimization under unknown transitions. Its guarantees also hold in the full-information setting, where it recovers the performance of existing occupancy-measure-based methods.
Implications
This work has implications for reinforcement learning applications where the transition dynamics are not fully known, enabling more efficient learning strategies that adapt to the specific characteristics of the environment. It opens avenues for further research in developing algorithms that can handle unknown transitions while maintaining optimal performance.
Review Residuals: Update-Conditioned Residual Gating for Transformers
Large Language Models
NLP
Theory
- Introduction of Review Residuals, a gated residual update mechanism for transformers.
- Additive gating form preserves identity and avoids vanishing gradients, ensuring stable training at depth.
- Significant performance improvements at larger model sizes (590M and 1B parameters) compared to standard residuals and Highway gates.
- No advantage observed at smaller scales, highlighting the method's effectiveness at scale.
Summary
This paper introduces Review Residuals, a novel approach to residual connections in transformer architectures that incorporates a learned, input-dependent gating mechanism to evaluate the reliability of proposed updates before committing them. Traditional residual connections add updates with a fixed coefficient of one, lacking a mechanism to assess the quality of these updates. Drawing inspiration from human-factors engineering, the proposed method scales each update by a gate conditioned on both the current state and the proposed update. The study reveals two significant findings: first, an additive form of the gate maintains depth stability, avoiding the vanishing gradient problem seen in convex forms; second, the method demonstrates an emergence-with-scale effect, showing no advantage at smaller model sizes but significantly outperforming traditional residuals and Highway gates at larger scales (590M and 1B parameters). The results indicate that the benefits of Review Residuals grow with model size, suggesting its potential for large-scale applications. The paper also emphasizes the importance of reproducibility, providing a detailed account of experimental protocols and effect sizes.
Methodology
The methodology involves implementing Review Residuals by scaling the proposed update of each sublayer with a learned gate that is conditioned on both the current state and the proposed update. The paper compares the performance of this method against traditional residual connections and Highway networks across various model sizes, employing a multi-seed experimental protocol to ensure robustness of results.
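The difference between a standard residual, an additive gated residual, and a convex (Highway-style) gate can be sketched as follows. The gate parameterization here (a per-dimension sigmoid over the concatenated state and update, with hypothetical weights `W` and `b`) is an assumption for illustration; the summary does not specify the paper's exact form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x, u, W, b):
    """Gate conditioned on both the current state x and the proposed update u."""
    return sigmoid(np.concatenate([x, u], axis=-1) @ W + b)

def standard_residual(x, u):
    return x + u                       # update added with a fixed coefficient of one

def additive_gated_residual(x, u, W, b):
    return x + gate(x, u, W, b) * u    # the identity path on x is left untouched

def convex_gated_residual(x, u, W, b):
    g = gate(x, u, W, b)
    return (1.0 - g) * x + g * u       # Highway-style: x is scaled by (1 - g)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(2, d))
u = rng.normal(size=(2, d))
W = rng.normal(scale=0.1, size=(2 * d, d))
b = np.zeros(d)

y_add = additive_gated_residual(x, u, W, b)
y_cvx = convex_gated_residual(x, u, W, b)
```

In the convex form the state is multiplied by a factor below one at every layer, so the signal carried from early layers shrinks with depth; the additive form keeps the coefficient on `x` at exactly one.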
Results
The results indicate that Review Residuals provide no performance advantage at smaller model sizes (60M parameters), but at 590M and 1B parameters they outperform both parameter-matched Highway gates and standard residuals by statistically significant margins. The advantage grows with model size, suggesting that the method is particularly beneficial for large-scale transformer architectures.
Implications
The findings suggest that incorporating a verification mechanism into residual connections can enhance the reliability and performance of transformer models, especially as they scale. This could have implications for the design of future deep learning architectures, particularly in applications requiring large models, such as natural language processing and other complex tasks.
Gauging, Measuring, and Controlling Critic Complexity in Actor-Critic Reinforcement Learning
Reinforcement Learning
- Introduces critic complexity as a measurable and controllable dimension in actor-critic RL.
- Utilizes spectral effective-rank entropy to quantify critic complexity.
- Demonstrates a systematic relationship between critic complexity, performance, and bias, varying by task and algorithm.
- Implements a spectral-entropy regularizer that effectively reduces critic complexity and improves performance in certain scenarios.
Summary
This paper introduces the concept of critic complexity as a critical evaluation dimension in actor-critic reinforcement learning (RL). The author proposes a method to measure critic complexity using spectral effective-rank entropy, which summarizes the singular-value distributions of critic weight matrices. Through experiments with TD3 and PPO algorithms, the study tracks critic complexity alongside return and Monte Carlo value-estimation bias, revealing that critic complexity is measurable and systematically related to training behavior, albeit in a heterogeneous manner across different algorithms and tasks. The author also evaluates a direct intervention to control critic complexity by incorporating a spectral-entropy penalty into the critic loss function. The results indicate that this regularization technique can effectively reduce critic complexity and improve performance in specific settings, such as TD3 on HalfCheetah-v4, although the benefits do not uniformly transfer across all tasks. The findings suggest that while simpler critics may enhance reliability, the relationship between complexity and performance is nuanced and task-dependent.
Methodology
The methodology involves defining and measuring critic complexity using spectral effective-rank entropy, which is computed from the singular values of critic weight matrices. The study tracks this complexity alongside performance metrics during training of TD3 and PPO algorithms. Additionally, a spectral-entropy penalty is introduced to the critic loss to directly control complexity.
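A common way to turn a singular-value distribution into an entropy is the effective-rank construction: normalize the singular values to sum to one and take their Shannon entropy. The sketch below assumes that definition; the paper's exact normalization and any aggregation across layers may differ.

```python
import numpy as np

def spectral_entropy(weight, eps=1e-12):
    """Shannon entropy of the normalized singular values of a weight matrix."""
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                     # ignore numerically zero singular values
    return float(-(p * np.log(p)).sum())

def effective_rank(weight):
    """exp(entropy): ranges from 1 (rank one) to min(weight.shape) (flat spectrum)."""
    return float(np.exp(spectral_entropy(weight)))

rank_one = np.outer(np.arange(1.0, 5.0), np.ones(4))   # a single nonzero singular value
identity = np.eye(4)                                   # four equal singular values

erank_low = effective_rank(rank_one)
erank_high = effective_rank(identity)
```

Under this reading, a spectral-entropy penalty would add a term such as `lam * spectral_entropy(W)` to the critic loss for each weight matrix `W`, where `lam` is a hypothetical coefficient.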
Results
The analysis shows that critic complexity is consistently measurable throughout training and is linked to performance in structured ways. The spectral-entropy regularizer successfully reduces critic complexity and enhances performance in the TD3/HalfCheetah-v4 setting, although the improvements are not universally applicable across all tasks.
Implications
The findings suggest that managing critic complexity could be a valuable strategy for improving actor-critic RL performance. This approach may lead to more reliable critics and better policy optimization, with potential applications in various RL tasks where critic performance is critical.
Distributionally Robust Linear Regression With Block Lewis Weights
Optimization
Theory
Efficient ML
- Introduces a novel algorithm for group distributionally robust least squares regression.
- Achieves optimal solutions with improved computational efficiency compared to existing methods.
- Utilizes block Lewis weights to connect GDR problems to least squares frameworks.
- Offers algorithms that interpolate between different loss minimization objectives.
Summary
This paper introduces an algorithm for solving the group distributionally robust (GDR) least squares problem, which aims to minimize prediction errors across multiple groups while ensuring fairness in model performance. The authors propose a method that computes a (1+ε)-multiplicative approximation to the optimal solution using $\tilde{O}(\min\{\mathrm{rank}(A), m\}^{1/3} \epsilon^{-2/3})$ linear-system solves, where A is the stacked design matrix and b is the response vector. The algorithm leverages a novel geometric construction known as block Lewis weights, which connects the empirical GDR problem to a least squares framework. This approach improves on the efficiency of existing interior point methods, particularly in moderate accuracy scenarios, and matches state-of-the-art performance for ℓ∞ regression. Additionally, the authors present algorithms that interpolate between minimizing average least squares loss and distributionally robust loss, providing flexibility in model training. The empirical evaluation demonstrates the effectiveness of the proposed methods on both synthetic and real-world datasets, highlighting their practical applicability in scenarios where model fairness is critical.
Methodology
The authors develop an algorithm based on accelerated proximal methods and a geometric construction involving block Lewis weights. This approach reformulates the GDR problem into a least squares problem, allowing for efficient optimization. The algorithm iteratively solves proximal subproblems and employs techniques from convex optimization to ensure convergence and optimality.
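To make the objective concrete, the sketch below contrasts the average least squares loss with the worst-group loss on a toy problem with two groups. It only evaluates the objectives; it does not implement block Lewis weights or the proximal method, and the per-group normalization (mean squared residual) is an assumption.

```python
import numpy as np

def group_losses(groups, x):
    """Mean squared residual of each group; groups is a list of (A_i, b_i)."""
    return np.array([np.mean((A_i @ x - b_i) ** 2) for A_i, b_i in groups])

def worst_group_loss(groups, x):
    """The GDR objective: the largest loss over all groups."""
    return float(group_losses(groups, x).max())

# Toy data: three rows in group 0 want x = 0, one row in group 1 wants x = 1.
groups = [
    (np.ones((3, 1)), np.zeros(3)),
    (np.ones((1, 1)), np.ones(1)),
]
A = np.vstack([A_i for A_i, _ in groups])
b = np.concatenate([b_i for _, b_i in groups])

# Ordinary least squares on the stacked data favors the larger group.
x_avg = np.linalg.lstsq(A, b, rcond=None)[0]

# In this toy case the minimax point equalizes the two group losses.
x_gdr = np.array([0.5])

worst_at_avg = worst_group_loss(groups, x_avg)
worst_at_gdr = worst_group_loss(groups, x_gdr)
```

The least squares solution sits at 0.25 and leaves the small group with a much larger loss, while the minimax point trades some average accuracy for a lower worst-group loss.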
Results
The proposed algorithm computes (1+ε)-multiplicative approximations to the optimal solution with significantly reduced computational complexity. It outperforms traditional interior point methods in moderate accuracy regimes and matches state-of-the-art guarantees for ℓ∞ regression. Empirical results indicate that the algorithm effectively balances prediction accuracy and fairness across different groups.
Implications
The findings suggest that the proposed methods can be applied in various fields where model fairness is essential, such as economics, healthcare, and social sciences. By ensuring equitable model performance across diverse groups, the approach can help mitigate biases in predictive modeling and enhance decision-making processes.
Robustness of neural networks to random noise perturbations of their inputs
Theory
- Introduces a new robustness measure for neural networks against input perturbations.
- Proposes robustness curves to visualize the impact of noise on mean squared error.
- Demonstrates the method's applicability to various machine learning models beyond neural networks.
- Validates the approach with experimental results on real-world datasets.
Summary
This paper investigates the robustness of trained neural networks (NNs) against random noise perturbations in their input values. The authors focus on the relationship between the accuracy of the network, measured by mean squared error (MSE), and its robustness. They propose a new robustness measure that provides an upper bound on the MSE with high probability for a given perturbation of input values. This measure is efficient to compute and treats the NN as a black box, making it applicable to various machine learning algorithms. The authors introduce 'robustness curves' to visualize how increasing input perturbations affect MSE, fitting these curves to the Gompertz function to characterize growth rates and compare robustness across datasets. Experimental results on real-world datasets validate the proposed methods, demonstrating their effectiveness in assessing and visualizing the robustness of NNs under input noise perturbations.
Methodology
The authors adapt the mean squared error (MSE) to account for input perturbations by adding random Gaussian noise. They employ a Monte Carlo method to compute confidence intervals for the derived MSE, allowing for the assessment of robustness. The robustness measure is computed efficiently, treating the NN as a black box. Robustness curves are generated and fitted to the Gompertz function to analyze the relationship between input perturbations and MSE degradation.
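The Monte Carlo step described above can be sketched as follows, treating the model as a black-box `predict` function. The empirical 95th percentile stands in for the paper's high-probability upper bound, whose exact construction is not given in this summary; the linear model, noise levels, and Gompertz parameterization are illustrative assumptions.

```python
import numpy as np

def perturbed_mse(predict, X, y, sigma, n_draws=200, seed=0):
    """Monte Carlo estimate of the MSE when N(0, sigma^2) noise is added to X.

    Returns the mean MSE over draws and the empirical 95th percentile.
    """
    rng = np.random.default_rng(seed)
    mses = np.empty(n_draws)
    for k in range(n_draws):
        noisy = X + rng.normal(scale=sigma, size=X.shape)
        mses[k] = np.mean((predict(noisy) - y) ** 2)
    return float(mses.mean()), float(np.quantile(mses, 0.95))

def gompertz(x, a, b, c):
    """Gompertz curve a * exp(-b * exp(-c * x)), one common parameterization."""
    return a * np.exp(-b * np.exp(-c * x))

# Black-box stand-in: a linear model with zero error on clean inputs.
w = np.array([2.0, -1.0])

def predict(X):
    return X @ w

data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(500, 2))
y = predict(X)

sigmas = np.array([0.0, 0.1, 0.2, 0.4])
curve = np.array([perturbed_mse(predict, X, y, s)[0] for s in sigmas])
```

The pairs `(sigmas, curve)` form a robustness curve; fitting `gompertz` to them, for example with `scipy.optimize.curve_fit`, would give growth-rate parameters that can be compared across datasets.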
Results
The experimental results demonstrate that the proposed robustness measure effectively predicts the upper bounds of MSE under input perturbations. The robustness curves provide clear visualizations of how MSE degrades with increasing noise, allowing for comparisons across different datasets. The analysis reveals distinct growth rates and linear regimes in the robustness curves, enhancing the understanding of NN stability under noise.
Implications
The findings have significant implications for the deployment of neural networks in real-world applications where input noise is prevalent. By providing a method to quantify and visualize robustness, practitioners can better assess the reliability of their models and make informed decisions about model selection and tuning. The approach can also be extended to other machine learning algorithms, broadening its applicability in various domains.
Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O Bryan Mathematics Competition
Large Language Models
NLP
Efficient ML
- Constructed a CoT training corpus from 15 years of mathematics competition problems.
- Demonstrated that LoRA fine-tuning improved the student model's accuracy from 64.67% to 69.43%.
- Identified a practical lower bound of 50-100 words for response length in multi-step problems.
- Provided an error-type analysis indicating that 40% of failures were due to formatting errors.
Summary
This paper explores the process of knowledge distillation from a large reasoning model, DeepSeek-R1, to a more compact student model, Qwen2.5-7B, using historical problems from the John O’Bryan Mathematics Competition. The authors constructed a Chain-of-Thought (CoT) training corpus through a dual-agent framework, which was then utilized to fine-tune the student model using Low-Rank Adaptation (LoRA) on Apple Silicon hardware. The base model achieved an accuracy of 64.67% on competition problems, while the teacher model reached 91.40%. Initial training runs indicated overfitting, prompting further experiments with early stopping and multiple training runs. The fine-tuned student model achieved a mean accuracy of 69.43% ± 0.17%, showing a significant improvement over the base model, and generalizing to 73.1% ± 0.18% on the MATH-500 benchmark. The study also examined the impact of response length on answer quality across six reasoning levels, revealing a consistent decline in accuracy as response length decreased. The findings underscore the effectiveness of CoT distillation in enhancing compact models and highlight the importance of response length in mathematical reasoning tasks.
Methodology
The methodology involved collecting and digitizing competition problems, utilizing a dual-agent framework to generate and verify CoT reasoning traces, and fine-tuning the Qwen2.5-7B model with LoRA. The training process included an initial diagnostic run to determine optimal early stopping points, followed by five independent training runs to ensure result stability.
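For readers unfamiliar with LoRA, the core idea is to freeze the pretrained weight and train only a low-rank correction. The sketch below shows the standard formulation on a single linear layer in NumPy; the rank, scaling, and initialization values are illustrative and are not the settings used in the paper.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T + (alpha / r) * x A^T B^T with W frozen."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))                      # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W0 = rng.normal(size=(32, 64))
layer = LoRALinear(W0, rank=4)
x = rng.normal(size=(5, 64))
y = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly, and only `rank * (d_in + d_out)` parameters are trained instead of `d_in * d_out`.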
Results
The fine-tuned student model achieved a mean accuracy of 69.43% ± 0.17% on the competition dataset, a 4.76 percentage-point improvement over the base model. It also generalized to 73.1% ± 0.18% on the MATH-500 benchmark. The analysis of response length showed a consistent decline in accuracy from 69.43% at R1 to 41.9% at R6.
Implications
The findings suggest that knowledge distillation can effectively enhance the performance of compact models in mathematical reasoning tasks. The study also emphasizes the need to consider response length when designing models for complex reasoning tasks, which could inform future research and applications in educational technology and automated reasoning systems.
Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers
NLP
Large Language Models
Efficient ML
- LOTUS is the first latent-CoT method to bridge the accuracy gap with explicit CoT at the 3B scale.
- The architecture employs looped Transformers to enhance computation depth without increasing parameters.
- LOTUS reduces thought-phase latency by 2.5 to 6.9 times compared to explicit CoT methods.
- The latent space of LOTUS is interpretable, recovering gold reasoning steps and revealing alternative valid steps.
Summary
This paper addresses the limitations of existing latent chain-of-thought (CoT) reasoning methods in large language models, particularly their underperformance compared to explicit CoT methods as model sizes increase beyond 1 billion parameters. The authors propose a novel architecture called LOTUS (Looped Transformers with parallel supervision on latents), which utilizes looped, or recurrent-depth, Transformers to enhance latent reasoning. LOTUS processes multiple latent blocks in parallel across several iterations, applying cross-entropy loss on each latent position's corresponding gold CoT-step token. This approach effectively bridges the accuracy gap between latent and explicit CoT methods at the 3 billion parameter scale while significantly reducing thought-phase latency. The results demonstrate that LOTUS not only matches explicit CoT accuracy on math reasoning tasks but also surpasses it in out-of-domain scenarios, showcasing the interpretability of its latent space. The paper concludes that LOTUS represents a significant advancement in latent reasoning methods, enabling efficient and interpretable multi-step reasoning in large language models.
Methodology
LOTUS employs a looped padded Transformer architecture that processes K latent blocks in parallel for R iterations. It uses cross-entropy loss to align each latent position with the corresponding gold CoT-step token, allowing for efficient multi-step reasoning without the sequential bottleneck of traditional methods.
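The two ingredients named above, a weight-tied block applied R times to K latent positions and a cross-entropy loss at each position against a gold CoT-step token, can be sketched in a toy form. The block below is a simple mixing-plus-tanh update rather than a Transformer layer, and all shapes, weights, and token ids are hypothetical.

```python
import numpy as np

def shared_block(h, w_mix, w_ff):
    """One weight-tied update: mix across latent positions, then a residual nonlinearity."""
    mixed = w_mix @ h                  # (K, K) @ (K, d): positions exchange information
    return h + np.tanh(mixed @ w_ff)

def looped_forward(h0, w_mix, w_ff, n_loops):
    """Apply the same block n_loops times; depth grows but parameters do not."""
    h = h0
    for _ in range(n_loops):
        h = shared_block(h, w_mix, w_ff)
    return h

def cross_entropy(logits, targets):
    """Mean cross-entropy over positions, computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
K, d, vocab, R = 3, 16, 10, 4
h0 = rng.normal(size=(K, d))
w_mix = rng.normal(scale=0.3, size=(K, K))
w_ff = rng.normal(scale=0.3, size=(d, d))
w_out = rng.normal(scale=0.3, size=(d, vocab))
gold_step_tokens = np.array([2, 7, 5])   # one gold token id per latent position

latents = looped_forward(h0, w_mix, w_ff, R)
loss = cross_entropy(latents @ w_out, gold_step_tokens)
```

All K positions are updated together in each loop, so the cost of the thought phase scales with R rather than with the number of reasoning tokens that explicit CoT would decode one at a time.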
Results
LOTUS achieves accuracy close to that of explicit CoT on the GSM8K test set at the 3B scale, significantly outperforming previous latent methods. It matches explicit CoT accuracy on natural-language tasks while reducing thought-phase latency by up to 6.9 times. The architecture demonstrates interpretability, recovering gold reasoning steps and alternative valid intermediate steps.
Implications
The advancements presented in LOTUS could lead to more efficient and interpretable reasoning in large language models, enhancing their applicability in complex reasoning tasks across various domains such as education, automated reasoning, and AI-assisted decision-making.
Quality-Aware Modulation for Diffusion Transformers
Generative Models
Computer Vision
- Introduction of the Quality Representation Module (QRM) for quality-aware modulation in diffusion transformers.
- QRM enhances the denoising process by incorporating latent image quality signals.
- No significant changes to the diffusion backbone or sampling schedule are required.
- Extensive evaluations show improved image quality and prompt fidelity over baseline models.
Summary
This paper addresses the limitations of existing text-to-image diffusion models, particularly diffusion transformers (DiT), which rely solely on timestep and prompt embeddings for modulating the denoising process. The authors introduce the Quality Representation Module (QRM), a lightweight transformer module that learns a quality-aware representation from existing model inputs. The QRM generates modulation updates that refine the denoising parameters, injecting quality-sensitive signals into the model without altering the underlying diffusion architecture or sampling schedule. The paper presents extensive experiments, including ablations on QRM training losses and architectures, demonstrating that the QRM significantly enhances image quality, prompt adherence, and visual fidelity compared to baseline DiT models. The proposed method is evaluated using Stable Diffusion 3.5 (SD3.5) as a representative model, showing consistent improvements in image generation quality.
Methodology
The QRM module receives baseline modulation inputs (timestep and pooled prompt embeddings) along with latent image features. It predicts modulation updates that refine the baseline parameters using a Reward Feedback Learning (ReFL) strategy, allowing the model to learn the impact of modulation on image quality and prompt alignment.
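One way to picture a modulation update is as a correction added to the shift and scale parameters that a diffusion transformer derives from its conditioning. The sketch below assumes the shift-and-scale modulation used in DiT-style blocks and replaces the QRM transformer with a single linear map; the weight names and shapes are hypothetical, and the ReFL training step is not shown.

```python
import numpy as np

def modulate(x, shift, scale):
    """DiT-style modulation of token features."""
    return x * (1.0 + scale) + shift

def baseline_modulation(cond, w_base):
    """Shift and scale predicted from timestep and pooled prompt embeddings only."""
    shift, scale = np.split(cond @ w_base, 2, axis=-1)
    return shift, scale

def quality_aware_modulation(cond, latent_summary, w_base, w_quality):
    """Baseline parameters plus an update that also sees latent image features."""
    shift, scale = baseline_modulation(cond, w_base)
    features = np.concatenate([cond, latent_summary], axis=-1)
    d_shift, d_scale = np.split(features @ w_quality, 2, axis=-1)
    return shift + d_shift, scale + d_scale

rng = np.random.default_rng(0)
n_tokens, d, c, l = 6, 5, 4, 3
x = rng.normal(size=(n_tokens, d))
cond = rng.normal(size=c)              # stand-in for timestep + pooled prompt embedding
latent_summary = rng.normal(size=l)    # stand-in for pooled latent image features
w_base = rng.normal(scale=0.1, size=(c, 2 * d))
w_quality = rng.normal(scale=0.1, size=(c + l, 2 * d))

y_base = modulate(x, *baseline_modulation(cond, w_base))
y_quality = modulate(x, *quality_aware_modulation(cond, latent_summary, w_base, w_quality))
```

Setting `w_quality` to zero recovers the baseline output exactly, which is consistent with the claim that the backbone and sampling schedule are left unchanged.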
Results
The experiments reveal that the QRM consistently improves image quality metrics, including CLIP Score, ImageReward, and Human Preference Score, demonstrating its effectiveness in enhancing the denoising process and ensuring better alignment with prompts.
Implications
The findings suggest that integrating quality-aware modulation can significantly enhance the performance of diffusion models in generating high-fidelity images, potentially impacting various applications in text-to-image generation and other generative tasks.
Human-Machine Collaboration on Generative Meta-Learning: Model and Algorithm
Generative Models
Reinforcement Learning
Time Series
- Introduction of the GMHF framework that combines generative modeling, reinforcement learning, and human feedback.
- Theoretical bounds derived to demonstrate the potential of human feedback in improving generalization under distribution shifts.
- Empirical validation shows significant reduction in deployment loss with increased expert reliability.
- Framework extends beyond ODE-governed systems, applicable to non-dynamical probabilistic models.
Summary
This paper addresses the challenge of generalizing machine learning models to unseen environments, particularly when target domain data is scarce. The authors introduce the Generative Meta-Learning with Human Feedback (GMHF) framework, which integrates expert intuition into data synthesis to bridge the domain gap. The framework employs a Conditional Neural ODE (cNODE) as a generative model, coupled with a Reinforcement Learning (RL) agent that refines generated data based on human feedback. Theoretical analysis reveals that aligning generated data with human beliefs significantly reduces generalization error. Empirical validation on a nonlinear Duffing oscillator demonstrates that GMHF effectively minimizes deployment loss as expert reliability increases, confirming the theoretical predictions. Additionally, the framework's applicability extends beyond ODE-governed systems, showcasing its robustness in various scenarios. Overall, GMHF represents a novel approach to enhancing model performance in partially observed deployment situations through human-AI collaboration.
Methodology
The GMHF framework consists of a generative model (cNODE), a reinforcement learning agent, and a meta-learner. The generative model synthesizes training data, while the RL agent refines this data based on human feedback, iteratively improving dataset quality for the meta-learner to use in predictive modeling.
Results
The empirical results on a nonlinear Duffing oscillator indicate that GMHF significantly reduces the divergence between generated and target distributions as expert feedback reliability increases. The framework also demonstrates effectiveness in a non-dynamical probabilistic model, confirming its versatility.
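For context, the Duffing oscillator is the second-order system x'' + δx' + αx + βx³ = γcos(ωt). The sketch below integrates it with a fixed-step fourth-order Runge-Kutta scheme; the parameter values are arbitrary illustrations, not those used in the paper, and no part of the GMHF pipeline is reproduced.

```python
import numpy as np

def duffing_rhs(t, state, delta, alpha, beta, gamma, omega):
    """First-order form of x'' + delta x' + alpha x + beta x^3 = gamma cos(omega t)."""
    x, v = state
    accel = -delta * v - alpha * x - beta * x**3 + gamma * np.cos(omega * t)
    return np.array([v, accel])

def rk4(rhs, state0, dt, n_steps, **params):
    """Fixed-step fourth-order Runge-Kutta integration starting at t = 0."""
    traj = np.empty((n_steps + 1, len(state0)))
    traj[0] = state0
    t = 0.0
    for k in range(n_steps):
        s = traj[k]
        k1 = rhs(t, s, **params)
        k2 = rhs(t + dt / 2, s + dt / 2 * k1, **params)
        k3 = rhs(t + dt / 2, s + dt / 2 * k2, **params)
        k4 = rhs(t + dt, s + dt * k3, **params)
        traj[k + 1] = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return traj

def energy(state, alpha, beta):
    """Conserved energy of the unforced, undamped oscillator."""
    x, v = state
    return 0.5 * v**2 + 0.5 * alpha * x**2 + 0.25 * beta * x**4

# Unforced, undamped case: energy should stay constant up to integration error.
free = rk4(duffing_rhs, np.array([1.0, 0.0]), dt=0.01, n_steps=1000,
           delta=0.0, alpha=1.0, beta=1.0, gamma=0.0, omega=1.0)

# Damped, forced case: the kind of trajectory a generative model would need to match.
forced = rk4(duffing_rhs, np.array([1.0, 0.0]), dt=0.01, n_steps=1000,
             delta=0.2, alpha=1.0, beta=1.0, gamma=0.3, omega=1.2)
```

Changing a parameter such as `beta` between a source and a target simulation is one simple way to create the kind of distribution shift the framework is meant to address.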
Implications
The GMHF framework has potential applications in various fields requiring robust generalization under distribution shifts, such as healthcare and finance, where human expertise can be leveraged to enhance model performance in real-world scenarios.
Deep Reinforcement Learning for Spacecraft Attitude Control During Atmospheric Re-Entry
Reinforcement Learning
Robotics
- Application of deep reinforcement learning for spacecraft attitude control during re-entry.
- Comparison of RL performance against traditional PID controllers and hybrid control architectures.
- Use of dynamics randomization to improve out-of-distribution generalization.
- Hybrid controllers show superior performance in tracking and robustness compared to traditional methods.
Summary
This paper explores the application of deep reinforcement learning (RL) for the attitude control of spacecraft during atmospheric re-entry, addressing the challenges posed by nonlinear dynamics and uncertainties. The authors compare the performance of state-of-the-art RL algorithms against a traditional proportional-integral-derivative (PID) controller with gain scheduling, establishing the PID controller as a strong baseline. The study reveals that while RL can achieve comparable performance under nominal conditions, it struggles with out-of-distribution generalization. To enhance generalization, the authors implement dynamics randomization during training, introducing variations in task conditions. The results indicate that hybrid controllers, which integrate RL with traditional control methods, outperform standard controllers in tracking the angle of attack and demonstrate greater robustness to variations in mass, inertia tensor, and actuator bandwidth. The paper emphasizes the potential of RL to improve spacecraft attitude control while addressing safety and verification challenges associated with deploying RL in critical applications.
Methodology
The authors employed continuous, off-policy reinforcement learning algorithms, specifically focusing on the MR.Q algorithm. They established a baseline using a PID controller and explored hybrid control architectures that combine RL with traditional methods. The training involved dynamics randomization to enhance robustness and generalization across varying task conditions.
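A gain-scheduled PID baseline and one possible hybrid arrangement can be sketched as follows. The hybrid form shown here, a bounded RL correction added to the PID command, is an assumption; the summary does not state how the paper combines the two. The scheduling variable, gain tables, and limits are hypothetical.

```python
import numpy as np

class GainScheduledPID:
    """PID controller whose gains are interpolated from a lookup table."""

    def __init__(self, points, kp, ki, kd, dt):
        self.points = np.asarray(points, dtype=float)   # scheduling-variable grid
        self.kp = np.asarray(kp, dtype=float)
        self.ki = np.asarray(ki, dtype=float)
        self.kd = np.asarray(kd, dtype=float)
        self.dt = dt
        self.integral = 0.0
        self.prev_error = 0.0

    def gains(self, sched):
        """Linearly interpolate (kp, ki, kd) at the current scheduling value."""
        return tuple(float(np.interp(sched, self.points, g))
                     for g in (self.kp, self.ki, self.kd))

    def step(self, error, sched):
        kp, ki, kd = self.gains(sched)
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return kp * error + ki * self.integral + kd * derivative

def hybrid_command(pid_u, rl_u, rl_authority=0.2, u_max=5.0):
    """PID command plus a scaled RL correction, clipped to the actuator limit."""
    return float(np.clip(pid_u + rl_authority * rl_u, -u_max, u_max))

# Proportional-only table so the arithmetic is easy to follow.
pid = GainScheduledPID(points=[0.0, 1.0], kp=[1.0, 3.0],
                       ki=[0.0, 0.0], kd=[0.0, 0.0], dt=0.1)
u_pid = pid.step(error=1.0, sched=0.5)
u_hybrid = hybrid_command(u_pid, rl_u=1.0)
```

Limiting the RL term's authority keeps the verified PID loop in charge of most of the command, which is one common way to address the safety concerns raised for learned controllers.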
Results
The study found that the best-performing RL-based controllers outperformed the traditional PID controller in terms of tracking accuracy and robustness under various operational conditions. Hybrid controllers demonstrated improved performance metrics, particularly in handling variations in spacecraft dynamics.
Implications
The findings suggest that deep reinforcement learning can significantly enhance the adaptability and robustness of spacecraft attitude control systems, paving the way for safer and more efficient re-entry operations. The integration of RL into hybrid control systems may also facilitate advancements in other safety-critical applications in aerospace and robotics.
FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts
NLP
Large Language Models
Efficient ML
- FRAME allows for a learnable adaptation domain, improving flexibility in PEFT methods.
- The method utilizes a mixture of experts with fractional-Fourier orders, enhancing expressivity and reducing interference.
- FRAME outperforms existing MoE-LoRA and spectral baselines while maintaining a low active parameter count.
- The learned orders provide interpretable specialization across different tasks and layers.
Summary
The paper introduces FRAME (Fractional-Fourier Mixture of Experts), a novel parameter-efficient fine-tuning (PEFT) method that allows for the adaptation domain to be learned rather than fixed. Traditional PEFT methods typically operate in either the spatial or Fourier domain, but FRAME proposes a mixture of experts where each expert has a learnable fractional-Fourier order. This approach enables the model to interpolate between spatial and spectral domains, optimizing the adaptation process for different tasks, layers, and tokens. The authors argue that no single basis is optimal for all scenarios, and by allowing the domain to be a learnable parameter, FRAME can achieve better performance. The method utilizes a chirp-FFT surrogate for efficient computation and maintains a low active parameter budget. Experimental results demonstrate that FRAME outperforms existing methods like MoE-LoRA and spectral adapters across various benchmarks while providing interpretable insights into how learned orders specialize by task and layer.
Methodology
FRAME employs a mixture-of-experts architecture where each expert has a learnable fractional-Fourier order. The routing of tokens to experts is based on their respective orders, allowing for adaptation in the most suitable domain. The method includes a separate optimizer for the fractional orders and uses a chirp-FFT surrogate for efficient computation.
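The idea of an order that interpolates between the spatial and Fourier domains can be illustrated with a fractional power of the unitary DFT matrix. The sketch below builds one such power from the spectral projectors of the DFT, using the fact that its fourth power is the identity. This is a dense, exact construction chosen for clarity; it is not the chirp-FFT surrogate used in the paper, and fractional powers of the DFT are not unique.

```python
import numpy as np

def fractional_dft_matrix(n, order):
    """A fractional power F^order of the unitary n-point DFT matrix.

    order = 0 gives the identity (spatial domain) and order = 1 gives the
    DFT (Fourier domain); orders add under matrix multiplication.
    """
    F = np.fft.fft(np.eye(n), norm="ortho")            # unitary DFT, F^4 = I
    powers = [np.linalg.matrix_power(F, j) for j in range(4)]
    result = np.zeros((n, n), dtype=complex)
    for k in range(4):
        # Projector onto the eigenspace with eigenvalue exp(-i * pi * k / 2).
        projector = sum(np.exp(1j * np.pi * k * j / 2) * powers[j]
                        for j in range(4)) / 4
        result += np.exp(-1j * np.pi * k * order / 2) * projector
    return result

n = 8
F = np.fft.fft(np.eye(n), norm="ortho")
identity_like = fractional_dft_matrix(n, 0.0)
half = fractional_dft_matrix(n, 0.5)
full = fractional_dft_matrix(n, 1.0)
```

Because the matrix depends smoothly on `order`, the order can in principle be treated as a trainable scalar per expert, which matches the summary's description of learnable fractional-Fourier orders.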
Results
In experiments across commonsense reasoning, mathematical tasks, code generation, and knowledge benchmarks using Llama-3.1-8B and Qwen2.5-7B, FRAME demonstrated significant improvements over strong baseline methods like MoE-LoRA and spectral adapters. The results indicate that FRAME effectively reduces parameter costs while enhancing performance, with learned orders showing task-specific specialization.
Implications
FRAME's approach to learning the adaptation domain could lead to more efficient and effective fine-tuning strategies in various applications of large language models, potentially improving performance in multi-task learning scenarios. The insights gained from the learned orders may also contribute to better understanding the underlying mechanisms of model adaptation.
Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment
Theory
Optimization
Interpretability
- Introduces a descent-free and alignment-free method for measuring singular structures in neural networks.
- Develops a detect-then-read pipeline that adapts to different neural network architectures.
- Successfully classifies dead directions into genuine singularities and gauge symmetries.
- Demonstrates the ability to recover architecture-predicted orders in various trained networks.
Summary
This paper presents a novel methodology for measuring singular structures in trained neural networks without relying on descent or alignment. The proposed approach allows for the recovery of the order of dead directions at a single frozen checkpoint using the directional-Fisher rate. The method classifies these directions into genuine singularities and flat gauge symmetries, providing a clear distinction based on the directional-Fisher magnitude. A pluggable detector is introduced, which adapts to various architectures including transformers and convolutional layers. The methodology is validated across different network types, demonstrating its effectiveness in recovering architecture-predicted orders and providing insights into the underlying singular structures. The results indicate that the method can accurately decompose the per-direction learning coefficient and the dead-subspace dimension, while also mapping the findings to the Watanabe triple, which characterizes the learning dynamics of the network.
Methodology
The methodology involves a two-step process: detection and reading. A detector identifies dead directions using K-FAC factors, while a directional-Fisher rate scan at a frozen checkpoint reads the order of these directions. The approach is architecture-agnostic, allowing for application across different types of neural networks.
Results
The results show that the proposed method can accurately recover the order of dead directions and classify them effectively. It also reveals the per-direction learning coefficient and dead-subspace dimension, providing a comprehensive understanding of the network's singular structure.
Implications
This work has significant implications for understanding the learning dynamics of neural networks, potentially leading to improved model training and architecture design. It opens avenues for further research into the relationship between singular structures and network performance.
Personalizing Marketplace Policies with Competing Objectives and Constrained Experiments: Evidence from a Job Marketplace
Optimization
- Developed a framework for personalizing marketplace policies that balances competing objectives.
- Introduced an ensemble-based hybrid ranking model that reduces guardrail risks while optimizing target metrics.
- Addressed the challenges of cross-side externalities and marketplace interference in experimental design.
- Validated the methodology through empirical testing and production deployment.
Summary
This paper addresses the challenge of personalizing policies in two-sided marketplaces, specifically in a job marketplace where employer and job seeker interests often conflict. The authors propose an integrated framework for personalizing free-value thresholds, which govern complimentary services for job listings. The framework optimizes for multiple objectives simultaneously, acknowledging the cross-side externalities that arise when improving outcomes for one user group harms the other. The authors identify two main challenges: the need for multi-objective optimization due to cross-side externalities, and the limits that marketplace interference places on treatment variation. To tackle these challenges, they develop an ensemble-based hybrid ranking model that models the target and guardrail metrics separately, achieving over 10% lower guardrail risk while maintaining target gains. They also introduce a treatment effect extrapolation method to extend experimental estimates to untested policy levels. The framework was deployed in a production environment, demonstrating that meaningful personalization is achievable under constrained experimental conditions.
Methodology
The authors employed an ensemble-based hybrid ranking model to separately optimize target and guardrail metrics, addressing cross-side externalities and marketplace interference. They utilized cluster-level randomization for experimental design and developed a treatment effect extrapolation method to extend findings from limited experimental variations.
Results
The integrated framework resulted in statistically significant and economically sizable improvements in target metrics while reducing guardrail risks by over 10% compared to traditional single-objective approaches. The extrapolation method was validated, confirming its accuracy and compliance with engagement constraints post-launch.
Implications
The findings suggest that personalized policies can be effectively implemented in two-sided marketplaces, enhancing user experiences for both employers and job seekers. This approach can be applied to various marketplace settings where conflicting objectives exist, potentially leading to more sustainable business models.
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
NLP
Large Language Models
Interpretability
- Introduces surrogate fidelity as a framework for evaluating mechanistic interpretability across open and closed models.
- Establishes a hierarchy of surrogate fidelity metrics: prediction, attribution, representation, and cross-level.
- Finds that prediction fidelity often overstates attribution fidelity, indicating a disconnect between model outputs and causal reasoning.
- Identifies an access-validity inversion where stable white-box signals do not predict causal attributions effectively.
Summary
This paper addresses the challenge of mechanistic interpretability (MI) in large language models (LLMs), particularly when comparing open models to closed ones. The authors introduce the concept of surrogate fidelity, which assesses how well insights gained from open models can predict behaviors in closed models. They evaluate surrogate fidelity across three levels: prediction, attribution, and representation, using binary classification tasks. The study reveals a significant discrepancy between prediction fidelity and attribution fidelity, indicating that models may agree on outputs while differing in causal reasoning. The authors propose a hierarchy of surrogate fidelity metrics and demonstrate that high agreement in observable signals does not guarantee accurate causal attributions. This work highlights the need for a critical examination of the assumptions underlying mechanistic insights across different model types.
Methodology
The authors propose a systematic approach to evaluate surrogate fidelity by measuring prediction log-odds, attribution through leave-one-out methods, and representation responses to perturbations. They apply these metrics across eleven models from four families (Llama, Qwen, GPT, and Gemini) on multiple binary classification benchmarks, comparing open-weight and closed-API models.
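The leave-one-out attribution step can be illustrated with a small sketch. This is not the authors' code: `make_toy_model` and its weight tables are hypothetical stand-ins for real open-weight and closed-API model calls, and Pearson correlation is used here only as one plausible way to compare two attribution vectors.

```python
from typing import Callable, Dict, List

LogOdds = Callable[[List[str]], float]

def make_toy_model(weights: Dict[str, float]) -> LogOdds:
    """Hypothetical additive 'model': log-odds is the sum of per-token weights."""
    def log_odds(tokens: List[str]) -> float:
        return sum(weights.get(t, 0.0) for t in tokens)
    return log_odds

def loo_attributions(tokens: List[str], log_odds: LogOdds) -> List[float]:
    """Attribution of token i = log-odds with all tokens minus log-odds without token i."""
    full = log_odds(tokens)
    return [full - log_odds(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def pearson(a: List[float], b: List[float]) -> float:
    """Plain Pearson correlation between two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

tokens = ["not", "great", "boring"]
open_model = make_toy_model({"not": -0.5, "great": 2.0, "boring": -1.0})
closed_model = make_toy_model({"not": 1.5, "great": 0.5, "boring": -0.2})

# Both toy models predict the positive class (log-odds > 0) ...
same_prediction = (open_model(tokens) > 0) == (closed_model(tokens) > 0)
# ... yet their leave-one-out attributions are only weakly correlated.
attribution_agreement = pearson(
    loo_attributions(tokens, open_model),
    loo_attributions(tokens, closed_model),
)
```

In this toy case the two models agree on the label while disagreeing on which tokens drive it, which is the kind of gap between prediction fidelity and attribution fidelity that the summary describes.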
Results
The study finds that while models may share similar predictions and representations, they can diverge significantly in their causal attributions. The authors document an access-validity inversion, where high agreement in observable signals (like attention patterns) does not correlate with accurate causal attributions derived from black-box input ablations.
Implications
The findings suggest that insights gained from open models may not reliably extend to closed models, which has implications for the development of interpretability tools and practices in the field of AI. This work encourages a reevaluation of how mechanistic insights are communicated and applied across different model types.
Gradient Smoothing: Coupling Layer-wise Updates for Improved Optimization
Optimization
- Introduction of Depth-wise Gradient Augmentation as a new optimization paradigm.
- Development of Gradient Smoothing, specifically the Window Smoothing operator, to enhance layer-wise updates.
- Demonstrated improvements in optimization and generalization across various architectures and tasks.
- Empirical and theoretical evidence supporting structured representation evolution across depth.
Summary
This paper introduces a novel optimization paradigm called Depth-wise Gradient Augmentation, which leverages the structured relationships across layers in deep neural networks, particularly those with repeated architectural blocks like transformers. The authors propose Gradient Smoothing, a family of depth-wise smoothing methods, with a specific focus on a simple local Window Smoothing operator. This method augments the updates applied to each layer by averaging the updates from neighboring layers, thereby promoting coordinated representation evolution across depth. The approach is compatible with various base optimizers and incurs minimal computational overhead. The authors evaluate Gradient Smoothing across diverse architectures and training regimes, including language model pretraining, reinforcement learning fine-tuning, diffusion modeling, and image classification with Vision Transformers. The results demonstrate consistent improvements in optimization and generalization performance without altering model architectures or training objectives. Additionally, empirical and theoretical analyses reveal that Gradient Smoothing enhances layer-wise trajectory alignment and representation residual similarity, indicating a more structured evolution of representations throughout the network's depth.
Methodology
The authors propose a depth-wise augmentation operator that transforms block-wise optimizer updates before application. The Window Smoothing operator is a specific instantiation that averages updates from neighboring layers, promoting coordinated learning dynamics across layers. The method is evaluated empirically across different architectures and training scenarios, and theoretical analyses are conducted to understand its impact on representation structures.
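A minimal sketch of a window-style smoothing operator over per-layer updates is shown below. It assumes every layer's update has the same shape (as with repeated transformer blocks), uses uniform weights, and truncates the window at the first and last layers; the paper's actual weighting and boundary handling may differ, and `window_smooth` and `radius` are illustrative names.

```python
from typing import List

def window_smooth(updates: List[List[float]], radius: int = 1) -> List[List[float]]:
    """Replace each layer's update with the mean of updates within `radius` layers."""
    n_layers = len(updates)
    smoothed = []
    for i in range(n_layers):
        lo = max(0, i - radius)             # window is truncated at the first layer
        hi = min(n_layers, i + radius + 1)  # ... and at the last layer
        window = updates[lo:hi]
        dim = len(updates[i])
        smoothed.append([sum(u[j] for u in window) / len(window) for j in range(dim)])
    return smoothed

# Three layers with one-parameter updates produced by some base optimizer.
raw_updates = [[1.0], [2.0], [6.0]]
smoothed_updates = window_smooth(raw_updates, radius=1)
```

Because the operator only transforms the updates an optimizer has already computed, it can sit between any base optimizer and the parameter step, which matches the summary's claim of compatibility with various optimizers.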
Results
Gradient Smoothing consistently outperforms standard optimization methods across various tasks, including language model pretraining, reinforcement learning fine-tuning, and image classification. The method leads to improved optimization and generalization performance, as well as more structured representation evolution, evidenced by increased alignment in layer-wise trajectories and representation similarities.
Implications
The findings suggest that leveraging structural relationships across layers can significantly enhance the training of deep neural networks. This approach could be applied to various architectures and tasks, potentially leading to more efficient training processes and better-performing models in practical applications.
Safe Online Learning via Smooth Safety-Structured Policy Composition
Reinforcement Learning
Robotics
Theory
- AutoSafe integrates safety monitoring and intervention directly into the action generation process.
- The architecture allows for smooth transitions between performance and safety behaviors.
- Empirical results show strong safety enforcement and stable learning dynamics.
- AutoSafe outperforms existing safety filter-based approaches in both safety assurance and task performance.
Summary
The paper addresses the challenge of safe online reinforcement learning (RL), where agents must adhere to safety constraints while optimizing their performance. Traditional methods either enforce strict safety through abrupt action interventions, which disrupt learning, or use soft constraints that allow temporary violations, compromising safety assurance. The authors propose AutoSafe, a novel safety-aware policy architecture that integrates structured safety monitoring and intervention into the action generation process. This integration allows for smooth, risk-dependent transitions between performance-driven and safety-preserving behaviors, facilitating continuous online interaction and learning. The empirical evaluation across various continuous-control benchmarks demonstrates that AutoSafe achieves strong safety enforcement without sacrificing learning smoothness. Additionally, validation on a physical cart-pole system showcases its practical effectiveness in real-world scenarios, highlighting its potential for safe online learning applications.
Methodology
The authors developed AutoSafe by embedding risk monitoring and safe intervention as structural inductive biases within a differentiable policy composition framework. This approach reshapes the policy parameterization to allow for smooth learning dynamics while ensuring safety. The architecture includes a safe policy prior that guides interventions, providing reliable fallback behaviors and enabling precautionary actions before reaching safety boundaries.
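One simple way to picture a smooth, risk-dependent transition between a performance policy and a safe prior is a sigmoid-weighted blend of their actions. The sketch below is an assumption made for illustration, not AutoSafe's actual parameterization: `blend_weight`, `threshold`, and `sharpness` are hypothetical names and values.

```python
import math
from typing import List

def blend_weight(risk: float, threshold: float = 0.5, sharpness: float = 10.0) -> float:
    """Smooth weight in (0, 1) that grows as the risk estimate passes the threshold."""
    return 1.0 / (1.0 + math.exp(-sharpness * (risk - threshold)))

def compose_action(perf_action: List[float], safe_action: List[float], risk: float) -> List[float]:
    """Convex combination: mostly the performance action at low risk, mostly the safe action at high risk."""
    w = blend_weight(risk)
    return [(1.0 - w) * p + w * s for p, s in zip(perf_action, safe_action)]

low_risk = compose_action([1.0], [0.0], risk=0.0)   # stays close to the performance action
mid_risk = compose_action([1.0], [0.0], risk=0.5)   # halfway between the two
high_risk = compose_action([1.0], [0.0], risk=1.0)  # stays close to the safe action
```

Since the blend is differentiable in both the risk estimate and the actions, gradients can flow through the composed action, unlike a hard switch that overrides the policy at a safety boundary.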
Results
AutoSafe consistently exhibited stable learning dynamics across multiple simulated benchmarks, achieving strong safety assurance comparable to safety filter-based methods. It matched or outperformed state-of-the-art safe learning methods in task performance. The practical effectiveness of AutoSafe was further validated through a real-world cart-pole training experiment.
Implications
The proposed AutoSafe architecture has significant implications for the deployment of reinforcement learning in safety-critical applications, such as robotics and autonomous systems, where maintaining safety while optimizing performance is crucial. Its ability to ensure smooth learning dynamics while enforcing safety constraints could enhance the reliability and robustness of RL agents in real-world environments.
ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries
NLP
Large Language Models
Efficient ML
- Introduces a classifier-gated routing architecture for LLMs in regulated industries.
- Ensures compliance by routing PII-containing queries to local endpoints before inference.
- Achieves 39% median latency reduction and 33-52% cost savings based on query complexity.
- The encoder classifier demonstrates 99.2% accuracy with minimal inference overhead.
Summary
The paper presents ComplianceGate, a novel classifier-gated routing architecture designed for deploying large language models (LLMs) in regulated industries. It addresses the dual challenges of compliance enforcement and cost efficiency by ensuring that personally identifiable information (PII) does not leave jurisdictional boundaries during processing. The proposed system employs a trained encoder classifier that evaluates incoming queries for complexity and data sensitivity before routing them to appropriate model tiers. This pre-inference classification allows simple queries to be processed by smaller, faster models, while sensitive queries are directed to local endpoints, thus preventing data residency violations. The architecture is evaluated on 600 queries, demonstrating significant improvements in latency, cost savings, and generation throughput compared to traditional single-model deployments. The encoder classifier achieves high accuracy and low inference overhead, establishing a practical approach to compliance-by-design in LLM deployment.
Methodology
The architecture consists of an encoder classifier that processes user queries to classify them based on complexity and sensitivity. The classifier is pre-trained on a large dataset and fine-tuned on domain-specific examples. It outputs a probability distribution that informs the routing layer, which selects the appropriate model and geographic location for processing. The entire routing decision is made before any LLM computation begins, ensuring compliance and efficiency.
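The routing decision described above can be sketched as a small function that consumes classifier probabilities and returns a model tier and location before any LLM call is made. The tier names, region labels, and thresholds below are hypothetical; the summary does not specify them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str   # hypothetical tier name
    region: str  # hypothetical deployment location

def route_query(p_sensitive: float, p_complex: float,
                pii_threshold: float = 0.5,
                complexity_threshold: float = 0.5) -> Route:
    """Pick a model tier and location from classifier outputs only."""
    # Sensitivity is checked first so PII-bearing queries never leave the jurisdiction.
    if p_sensitive >= pii_threshold:
        return Route(model="local-model", region="in-jurisdiction")
    # Non-sensitive queries are split by complexity between a large and a small model.
    if p_complex >= complexity_threshold:
        return Route(model="large-model", region="cloud")
    return Route(model="small-model", region="cloud")

pii_route = route_query(p_sensitive=0.9, p_complex=0.8)
simple_route = route_query(p_sensitive=0.1, p_complex=0.2)
hard_route = route_query(p_sensitive=0.1, p_complex=0.7)
```

Checking sensitivity before complexity is the design choice that makes the compliance guarantee independent of the cost optimization: a complex query containing PII still goes to the local endpoint.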
Results
The evaluation on 600 queries showed a 39% reduction in median latency, cost savings ranging from 33% to 52% depending on the query distribution, and improved generation throughput of 122-200 tokens/second compared to 50-64 tokens/second for baseline models. The encoder classifier achieved 99.2% accuracy with a 7ms inference overhead.
Implications
ComplianceGate has significant implications for deploying LLMs in sectors like finance and healthcare, where regulatory compliance is critical. The architecture enables organizations to leverage advanced AI capabilities while adhering to data residency laws, potentially transforming how sensitive data is processed and managed in these industries.
Estimating Supply Incrementality in Two-sided Marketplaces: A Causal Machine Learning Approach
Theory
- Introduces a causal machine learning approach to estimate supply incrementality in two-sided marketplaces.
- Combines double/debiased machine learning with a hierarchical Bayesian framework to isolate the impact of supply on bookings.
- Utilizes geospatial measures of product segment similarity to improve model accuracy and reduce variance in treatment effect estimates.
- Demonstrates strong out-of-sample performance and provides plausible estimates of marketplace returns to additional supply.
Summary
This paper addresses the challenge of estimating supply incrementality in two-sided marketplaces, particularly focusing on the Airbnb platform. The authors propose a causal machine learning methodology that combines double/debiased machine learning with a hierarchical Bayesian framework to analyze the impact of additional supply on marketplace outcomes such as total bookings. The study emphasizes the importance of understanding how additional supply affects different product segments, as traditional observational data often confounds causal relationships due to endogeneity and substitution effects. By leveraging geospatial measures of product segment similarity, the authors construct informative features that enhance model performance. The results indicate that their approach yields plausible estimates of marketplace returns to additional supply and demonstrates strong out-of-sample performance, suggesting that the methodology can be applied to other two-sided marketplaces facing similar challenges.
Methodology
The authors employ a two-stage double machine learning framework to mitigate endogeneity issues. In the first stage, they model historical variations in supply and bookings using rich signals. In the second stage, they use residuals from the first stage to isolate the causal impact of supply on bookings. The model incorporates geospatial similarity measures to enhance predictive power and employs a hierarchical Bayesian framework to update beliefs based on new data.
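The residual-on-residual idea behind the two stages can be sketched with plain linear regressions. This is a simplified illustration, not the paper's pipeline: it uses a single confounder, linear first-stage models, and no cross-fitting or hierarchical Bayesian layer, and the function names and data are hypothetical.

```python
from typing import List, Sequence, Tuple

def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def residuals(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Part of y that the first-stage model on x does not explain."""
    slope, intercept = fit_line(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]

def partialling_out_effect(confounder: Sequence[float],
                           supply: Sequence[float],
                           bookings: Sequence[float]) -> float:
    """Stage 1: residualize supply and bookings on the confounder.
    Stage 2: regress booking residuals on supply residuals (no intercept needed)."""
    r_supply = residuals(confounder, supply)
    r_bookings = residuals(confounder, bookings)
    return sum(a * b for a, b in zip(r_supply, r_bookings)) / sum(a * a for a in r_supply)

# Synthetic data: demand drives both supply and bookings; the true supply effect is 3.
demand = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
shocks = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0]
supply = [2.0 * d + s for d, s in zip(demand, shocks)]
bookings = [3.0 * t + 5.0 * d for t, d in zip(supply, demand)]

effect = partialling_out_effect(demand, supply, bookings)
naive_slope, _ = fit_line(supply, bookings)  # confounded estimate, biased upward here
```

In practice the first-stage models would be flexible learners fit with cross-fitting, which is what distinguishes double/debiased machine learning from this linear partialling-out.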
Results
The proposed model provides credible estimates of the incremental value of additional supply across different listing segments. The findings suggest that certain segments exhibit higher supply incrementality, indicating that additional listings can significantly increase total bookings rather than cannibalizing existing ones.
Implications
This research has significant implications for marketplace operators seeking to optimize supply strategies. By understanding the causal impact of supply changes, operators can make informed decisions about resource allocation, pricing strategies, and marketing initiatives to enhance overall marketplace performance.
Generative Refinement for Low-Budget Black-Box Optimization
Optimization
Generative Models
Theory
- SPARROW decouples generative modeling from optimization, enhancing efficiency in low-budget settings.
- The algorithm utilizes a fixed sampler as a proposal operator, requiring only knowledge of its corruption process.
- Rank-based guidance over an archive of evaluated candidates improves robustness against unreliable feedback.
- Asymptotic convergence guarantees are provided, demonstrating theoretical soundness.
Summary
This paper addresses the challenges of black-box optimization (BBO) in scenarios where function evaluations are costly, noisy, or unreliable. Traditional methods often require numerous evaluations to effectively optimize, which is impractical under tight budgets. The authors introduce SPARROW, a novel algorithm that decouples the generative prior from the reward signal, allowing it to utilize any sampler with a known corruption process. SPARROW employs rank-based guidance over an archive of evaluated candidates to navigate complex search spaces and handle unreliable feedback. The paper provides asymptotic convergence guarantees and demonstrates SPARROW's effectiveness in optimizing under low evaluation budgets, particularly in geometrically complex landscapes with unreliable rewards.
Methodology
SPARROW employs a fixed, unconditional sampler as a proposal operator and utilizes rank-based guidance over an archive of evaluated candidates. This approach allows it to effectively navigate complex geometries and optimize under low evaluation budgets without relying on extensive feedback or labeled data.
Results
The empirical evaluations demonstrate that SPARROW outperforms existing methods in scenarios characterized by unreliable rewards and complex search landscapes. The algorithm shows strong convergence properties and effectively identifies high-performing solutions within limited evaluation budgets.
Implications
SPARROW has significant implications for fields requiring optimization under constraints, such as materials design, drug discovery, and engineering simulations. Its ability to operate efficiently with limited evaluations can lead to advancements in these areas by enabling more effective exploration of complex solution spaces.
Watermarking for Proprietary Dataset Protection
NLP
Large Language Models
Generative Models
- Watermarking is proposed as a solution to improve membership inference for generative models.
- The study introduces a new randomization-based watermark detection test.
- Watermarking can achieve comparable performance to traditional loss-based methods under specific conditions.
- The authors provide a unified experimental framework for evaluating different membership inference techniques.
Summary
This paper addresses the challenges of membership inference attacks (MIAs) and dataset inference attacks (DIAs) in the context of proprietary datasets used for training generative models. The authors propose a watermarking technique as a proactive measure to enhance the tractability of membership tests for generative models. They argue that traditional loss-based methods struggle in modern language modeling due to the complexity of variable-length outputs and the large parameter sizes of contemporary models. By implementing a watermarking approach, the authors demonstrate that it is possible to mark training datasets in a way that minimally impacts model performance while allowing for effective membership inference. The study builds on prior work in watermarking and membership inference, introducing a randomization-based watermark detection test and a unified experimental framework for evaluating both watermark-based and traditional methods. The results indicate that watermarking can achieve comparable performance to traditional methods under certain conditions, particularly when subset exposure is sufficiently high.
Methodology
The authors use a watermarking technique that marks training datasets with a secret key, making it possible to detect whether specific data was included in a model's training. They employ a paraphraser language model to create watermarked samples and evaluate these with a watermark detection test. The methodology varies parameters such as per-key support fractions and effective epochs, and benchmarks against traditional loss-based methods.
Results
The results show that watermarking can effectively indicate membership in training datasets, achieving performance levels comparable to traditional methods when the conditions of subset exposure are favorable. The randomization-based watermark detection test demonstrated p-value validity, enhancing the reliability of membership inference.
Implications
The findings suggest that watermarking could serve as a viable method for dataset owners to protect proprietary data used in training generative models, potentially influencing legal frameworks surrounding data use and intellectual property rights in AI.
Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts
NLP
Generative Models
Theory
- Deterministic few-step generation fails for text latents due to sharp categorical readouts.
- The failure is governed by decoder sharpness rather than transport accuracy.
- DABI and CCI diagnostics reveal significant differences in performance between text and image decoders.
- Two escape mechanisms (categorical commitment and stochastic re-injection) allow some models to succeed despite sharp readouts.
Summary
This paper investigates the failure of deterministic few-step generation in continuous text latents compared to its success in continuous image latents. The author argues that this failure is due to geometric factors rather than deficiencies in training or scaling. Specifically, the sharpness of the decoder's readout plays a crucial role; a smooth deterministic map cannot effectively resolve discrete choices before a sharp categorical readout occurs. The paper formalizes this issue through the introduction of two diagnostics: DABI (Decoder Amplification of Boundary-aligned Inputs) and CCI (Categorical Commitment Index). The findings reveal that text decoders amplify perturbations near decision boundaries significantly more than image decoders, leading to incoherent text generation. The paper also discusses two mechanisms that allow some models to escape the limitations of deterministic transport: categorical commitment and stochastic re-injection. The author presents a series of theorems that establish a relationship between accuracy, depth, and stiffness in the context of transport laws, highlighting a tradeoff that exists within deterministic continuous models. Overall, the paper provides a theoretical foundation for understanding the discrepancies in performance between text and image generation models.
Methodology
The author employs theoretical analysis and formal proofs to explore the geometric factors affecting the performance of deterministic few-step generation in text and image latents. Key diagnostics (DABI and CCI) are introduced to quantify the sharpness of readouts and the impact of perturbations on token generation. Theorems are presented to establish relationships between various parameters affecting model performance.
Results
The paper demonstrates that the flip rate of tokens in text generation is significantly influenced by the sharpness of the decoder's readout, with DABI values indicating a stark contrast between text and image decoders. The findings show that text decoders can amplify boundary-aligned perturbations, leading to high rates of incoherence, while image decoders maintain stability. Additionally, the paper confirms that stochastic and categorical mechanisms can mitigate the failures observed in deterministic models.
Implications
The insights from this research could inform the design of more robust text generation models, potentially leading to improvements in natural language processing applications. Understanding the geometric limitations of current models may guide future research towards developing hybrid approaches that leverage both deterministic and stochastic elements.
When Context Compensates for Sparse Event History: AlphaEarth for Spatio-Temporal Point-Process Forecasting
Time Series
- AlphaEarth embeddings significantly enhance predictive performance in spatio-temporal point-process models.
- The benefits of incorporating spatial context are most pronounced when local event histories are sparse.
- The study provides controlled evidence on the effectiveness of external spatial context in improving spatial transfer in forecasting.
- Predictive gains from AE embeddings taper off as the amount of historical data increases.
Summary
This paper investigates the role of external spatial context in enhancing the predictive performance of spatio-temporal point-process (STPP) models, particularly when local event histories are sparse. The authors propose the use of AlphaEarth (AE) embeddings, which provide standardized geospatial representations, as a means to augment a log-Gaussian Cox process (LGCP) model. The study focuses on forecasting emergency medical services (EMS) demand across various regions with limited historical data. By comparing an event-only LGCP model with an AE-augmented version, the authors demonstrate that incorporating spatial context significantly improves predictive accuracy, especially in scenarios with minimal event history. The results indicate that AE embeddings yield substantial improvements in predictive performance, particularly when training data is scarce, and that these benefits diminish as more historical data becomes available. This research highlights the importance of contextual information in stabilizing forecasts in spatio-temporal settings.
Methodology
The authors utilize a log-Gaussian Cox process (LGCP) framework to model spatio-temporal events, comparing two configurations: an event-only model and an AE-augmented model. The AE embeddings serve as linear spatial context, and the models are evaluated across eight held-out regions with varying lengths of historical data. The AE embeddings are treated as strictly exogenous, ensuring that only prior information is used for forecasting.
Results
The study finds that the AE-augmented model outperforms the event-only model in predictive performance across all tested history lengths. Specifically, with only 1-2 weeks of training data, the AE model achieves approximately 2-6 times better predictive density compared to the baseline. Even with longer history lengths (20-104 weeks), the AE model maintains a performance advantage of about 10-20%. Additionally, the AE embeddings contribute to smoother and more stable spatial structures in the forecasts.
Implications
The findings suggest that integrating external spatial context can significantly improve the robustness of spatio-temporal forecasts, particularly in domains where historical data is limited. This approach could be beneficial for various applications, including emergency response planning, urban development, and resource allocation in healthcare services.
QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling
NLP
Large Language Models
Reinforcement Learning
Efficient ML
- QuasiMoTTo replaces i.i.d. sampling with correlated samples to reduce redundancy in inference compute and RL.
- Utilizes quasi-Monte Carlo methods for generating more evenly distributed uniform samples.
- Achieves 25-47% fewer samples for equivalent pass@k accuracy compared to i.i.d. sampling.
- Matches i.i.d. performance in policy-gradient RL with 50% fewer training steps.
Summary
The paper introduces QuasiMoTTo, an approach that improves sample efficiency in inference-time sampling and reinforcement learning (RL) by drawing correlated samples instead of independent and identically distributed (i.i.d.) samples. Standard parallel sampling generates multiple i.i.d. samples, which is often redundant because many samples explore the same high-probability regions. QuasiMoTTo addresses this by reparameterizing autoregressive sampling as inverse-CDF sampling and drawing the underlying uniforms with quasi-Monte Carlo (QMC) methods, which distribute them more evenly. The resulting samples are correlated yet preserve the marginal distribution of the language model, so they cover the output space more effectively. The empirical analysis shows that QuasiMoTTo reaches the same pass@k accuracy as i.i.d. sampling with 25-47% fewer samples and matches i.i.d. performance in policy-gradient RL with 50% fewer training steps. The findings suggest that QuasiMoTTo provides a stronger learning signal per batch and reduces computational cost.
Methodology
The methodology involves generating correlated samples using a reparameterization of autoregressive sampling as inverse-CDF sampling, with the underlying uniforms drawn using quasi-Monte Carlo (QMC) methods. This approach allows for parallel generation of samples that are correlated yet maintain the marginal distribution of the language model, thus improving coverage and efficiency.
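As a rough illustration of the idea, the sketch below samples one token position for a batch of parallel generations. It is not the paper's implementation: the function names are hypothetical, and the randomly shifted lattice used here is only one simple randomized QMC construction, chosen because each point stays marginally uniform while the batch as a whole is evenly spread.

```python
import numpy as np

def inverse_cdf_sample(probs, u):
    """Map a uniform u in [0, 1) to a token index via the categorical inverse CDF."""
    cdf = np.cumsum(probs)
    idx = int(np.searchsorted(cdf, u, side="right"))
    return min(idx, len(probs) - 1)  # guard against float round-off at the top end

def stratified_uniforms(n, rng):
    """Return n evenly spaced points in [0, 1) with one shared random shift.

    Assumption for illustration: a shifted 1D lattice stands in for the
    QMC point set. Each point is marginally Uniform[0, 1), but the points
    are jointly correlated and cover the interval evenly.
    """
    shift = rng.random()
    return (np.arange(n) / n + shift) % 1.0

def sample_tokens(probs, n, rng):
    """Draw n correlated token samples for a single decoding step."""
    return [inverse_cdf_sample(probs, u) for u in stratified_uniforms(n, rng)]
```

With next-token probabilities `[0.5, 0.3, 0.2]` and ten parallel samples, the evenly spread uniforms yield token counts that track the probabilities closely, whereas ten i.i.d. draws can easily over-sample the most likely token.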
Results
QuasiMoTTo matches the accuracy of i.i.d. sampling with 25-47% fewer samples across four reasoning benchmarks. In policy-gradient RL, it achieves similar performance with 50% fewer training steps, indicating a significant improvement in sample efficiency and learning signal strength.
Implications
The findings suggest that QuasiMoTTo can be applied to various language model tasks and reinforcement learning scenarios, potentially leading to more efficient model training and inference processes. This approach could reduce computational costs and improve the performance of models in real-world applications.
Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization
NLP
Reinforcement Learning
Optimization
- Identifies the static-reference ceiling in reference-guided policy optimization, highlighting the risks of weak or misaligned references.
- Introduces active reasoning as a new paradigm for training, enhancing the adaptability of reference-guided methods.
- Implements Active-GRPO, which combines active imitation and self-improvement mechanisms to improve training robustness.
- Demonstrates significant performance improvements in molecular optimization tasks compared to existing methods.
Summary
This paper addresses the challenges of improving the robustness and efficiency of training large language models (LLMs) for instruction-based molecular optimization. The authors identify limitations in existing methods, such as answer-only supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), which struggle with multi-step reasoning and sparse feedback, respectively. They propose a novel approach called Active Group Relative Policy Optimization (Active-GRPO), which introduces the concept of active reasoning. This paradigm allows the model to dynamically decide when to imitate reference molecules and when to leverage its own discoveries, thereby continuously upgrading its imitation targets. Active-GRPO employs two mechanisms: active imitate-reinforce, which shifts from imitation to self-improvement as the policy generates superior candidates, and active referencing, which updates the reference with the best policy-generated candidate. The authors demonstrate that Active-GRPO significantly enhances performance across various molecular optimization benchmarks, overcoming the limitations of static references and enabling more effective training.
Methodology
The authors developed Active-GRPO, which integrates active reasoning through two main mechanisms: active imitate-reinforce, allowing the model to switch from imitation learning to self-improvement based on performance, and active referencing, which continuously updates the reference with the best-performing candidates generated by the policy. This approach ensures that the model adapts its learning strategy based on the quality of references and its own capabilities.
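The two mechanisms can be pictured with a small sketch. Everything here is an assumption made for illustration: the `Candidate` type, the `active_step` function, and the simple "best rollout beats the reference" switching rule are hypothetical, and the summary does not state the paper's actual criterion.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    molecule: str   # e.g. a SMILES string
    reward: float   # task score from a verifier (hypothetical scale)

def active_step(reference, rollouts):
    """One hypothetical update of the reference and the training mode.

    Returns (new_reference, mode). The mode is "imitate" while the current
    reference still beats every rollout, and "reinforce" once the policy
    has produced a better candidate, which then becomes the new reference.
    """
    best = max(rollouts, key=lambda c: c.reward)
    if best.reward > reference.reward:
        return best, "reinforce"   # active referencing: upgrade the target
    return reference, "imitate"    # keep learning from the reference
```

Because the reference is replaced whenever the policy surpasses it, a weak initial reference stops limiting training after the first better rollout.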
Results
Active-GRPO achieved an average SR×Sim score of 0.1773, outperforming GRPO (0.0959) and RePO (0.1665) under matched three-seed evaluation. The method also showed statistically significant improvements in key metrics such as LogP, MR, and QED across various molecular optimization benchmarks.
Implications
The findings suggest that Active-GRPO can enhance the efficiency and effectiveness of LLMs in scientific domains, particularly in drug discovery and materials science, where precise molecular optimization is crucial. The active reasoning paradigm could also be applied to other areas requiring adaptive learning and decision-making.
Visualizing High-Dimensional Graph Embeddings via Informed Multi-View Projections
Graph Learning
Optimization
- Optimal 2D projections from high-dimensional graph embeddings yield better readability metrics than traditional 2D layouts.
- The differentiable loss function SigmoidX significantly reduces edge crossings compared to existing methods.
- DataFly provides an interactive platform for exploring high-dimensional graph layouts, enhancing user engagement and understanding.
- The proposed method reveals structural patterns that are often hidden in static 2D visualizations.
Summary
This paper addresses the challenge of visualizing high-dimensional graph embeddings by proposing a method to find informative 2D projections that optimize aesthetic and readability metrics. Traditional 2D layouts often distort the higher-dimensional structure of graphs, making it difficult to interpret complex relationships. The authors develop a computational pipeline that generates multiple candidate projections from high-dimensional embeddings and introduce a novel differentiable surrogate loss function, SigmoidX, to minimize edge crossings effectively. Their approach demonstrates that these optimized projections can outperform standard 2D layouts and existing algorithms designed for similar metrics. Additionally, the authors present DataFly, an interactive system that allows users to explore various viewpoints of high-dimensional graph layouts, facilitating the discovery of structural patterns that may be obscured in conventional visualizations. A usability study indicates that DataFly enhances the understanding of graph structures for both expert and non-expert users.
Methodology
The authors developed a computational pipeline to explore high-dimensional graph layouts, producing a family of candidate projections. They employed a gradient-based optimization scheme to find near-optimal projections based on metrics like stress, edge crossings, and angular resolution. The differentiable loss function SigmoidX was introduced to minimize edge crossings effectively. DataFly, an interactive visualization tool, was created to allow users to navigate through different projections seamlessly.
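To give a sense of how a sigmoid can turn edge-crossing counts into a smooth quantity, here is a minimal sketch. It is not the SigmoidX formula, which the summary does not specify: it is a generic construction built from the standard orientation test for segment intersection, and all function names and the `sharpness` parameter are hypothetical.

```python
import numpy as np

def _cross(o, a, b):
    """Signed area of triangle (o, a, b): positive if b lies left of the ray o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _sigmoid(x):
    # Overflow-free form of 1 / (1 + exp(-x))
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def soft_crossing(p1, p2, q1, q2, sharpness=50.0):
    """Smooth score in (0, 1) that approaches 1 when segments p1p2 and q1q2 cross."""
    # q1 and q2 lie on opposite sides of line p1p2 iff this product is negative
    side_q = _cross(p1, p2, q1) * _cross(p1, p2, q2)
    # p1 and p2 lie on opposite sides of line q1q2 iff this product is negative
    side_p = _cross(q1, q2, p1) * _cross(q1, q2, p2)
    return _sigmoid(-sharpness * side_q) * _sigmoid(-sharpness * side_p)

def soft_crossing_loss(pos, edges, sharpness=50.0):
    """Sum of soft crossing scores over all pairs of edges that share no node."""
    total = 0.0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            a, b = edges[i]
            c, d = edges[j]
            if len({a, b, c, d}) < 4:
                continue  # edges with a common endpoint are skipped
            total += soft_crossing(pos[a], pos[b], pos[c], pos[d], sharpness)
    return total
```

The function is written in NumPy for readability; to optimize a projection by gradient descent, the same arithmetic would need to run in an autodiff framework.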
Results
The experiments showed that the optimized projections consistently outperformed standard 2D layouts and even surpassed state-of-the-art algorithms designed for optimizing readability metrics. The usability study of DataFly indicated that users could identify structural patterns more effectively compared to traditional visualization methods.
Implications
The findings suggest that high-dimensional graph embeddings can be effectively visualized in a way that enhances user comprehension of complex relational data. This approach can be applied in various fields such as social network analysis, bioinformatics, and any domain where understanding complex relationships is crucial.
OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models
Generative Models
Optimization
Efficient ML
- OTCache provides a training-free approach to accelerate diffusion models through optimal transport-inspired schedule modeling.
- The framework overcomes limitations of existing caching methods by addressing additive independence assumptions and modeling schedule evolution.
- Experiments show substantial acceleration in sampling times while maintaining high fidelity in generated outputs.
Summary
The paper introduces OTCache, a framework that speeds up diffusion sampling by predicting caching schedules with an optimal transport-inspired model. Traditional graph-based caching methods often rely on an additive independence assumption, which fails to hold in regimes with a low number of function evaluations (NFE), leading to inefficiencies. OTCache addresses this by modeling caching schedules as a smooth evolution in policy space, using a three-stage process: first, it establishes a high-fidelity reference schedule using a graph-based method under conservative budgets; second, it conducts a lightweight anchor search under extreme low-budget conditions via Optuna optimization; and third, it predicts schedules for target budgets through quantile interpolation between the reference and anchor policies. The experimental results show speedups of 4.5×, 4.7×, and 3.66× on the FLUX.1, Qwen-Image, and HunyuanVideo models, respectively, while also improving generation fidelity compared to existing caching methods.
Methodology
OTCache employs a three-stage framework: (1) it generates a high-fidelity reference schedule using a graph-based caching method; (2) it performs a lightweight anchor search for low-budget settings using Optuna optimization; and (3) it predicts schedules for target budgets through quantile interpolation between the reference and anchor schedules using continuous warping representations.
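The third stage can be sketched as follows, under stated assumptions. A schedule is represented here as the sorted list of timesteps at which the full network is evaluated (all other steps reuse the cache), and the target schedule is read off a blend of the two schedules' quantile functions. The representation, the linear blend weight, and the function name `interpolate_schedule` are all hypothetical; the paper's continuous warping representation may differ.

```python
import numpy as np

def interpolate_schedule(reference, anchor, budget, num_steps):
    """Blend two schedules by interpolating their quantile functions.

    reference: compute steps of the high-budget schedule (more entries)
    anchor:    compute steps of the low-budget schedule (fewer entries)
    budget:    number of compute steps wanted, between the two lengths
    Assumes len(reference) != len(anchor).
    """
    ref = np.asarray(reference, dtype=float)
    anc = np.asarray(anchor, dtype=float)
    # Blend weight: 0 at the reference budget, 1 at the anchor budget
    t = (len(ref) - budget) / (len(ref) - len(anc))
    levels = np.linspace(0.0, 1.0, budget)
    blended = (1.0 - t) * np.quantile(ref, levels) + t * np.quantile(anc, levels)
    steps = np.clip(np.rint(blended).astype(int), 0, num_steps - 1)
    # Rounding can merge neighbours, so the result may be shorter than budget
    return sorted(set(steps.tolist()))
```

In one dimension, interpolating quantile functions is the displacement interpolation from optimal transport, which is why this toy version reproduces the reference and the anchor exactly at the two budget endpoints.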
Results
OTCache achieved accelerations of 4.5× on FLUX.1, 4.7× on Qwen-Image, and 3.66× on HunyuanVideo, while also improving generation fidelity, evidenced by an LPIPS score of 0.171 on Qwen-Image.
Implications
The findings suggest that OTCache can significantly reduce computational overhead in diffusion models, making them more applicable in resource-constrained environments and enhancing their deployment in real-time applications.
LeNEPA: No-Augmentation Next-Latent Prediction for Time-Series Representation Learning
Time Series
- LeNEPA is a no-augmentation architecture for time-series representation learning.
- It utilizes a next-latent-token prediction objective with a causal transformer backbone.
- The method shows improved performance stability across different datasets without requiring specific augmentation strategies.
- LeNEPA achieves faster representation acquisition compared to traditional methods.
Summary
The paper introduces LeNEPA, a novel self-supervised learning (SSL) architecture designed for time-series representation learning without relying on data augmentation. The authors address the challenges of existing SSL methods that depend heavily on augmentation strategies, which can hinder the portability of learned representations across different datasets. LeNEPA employs a next-latent-token prediction objective using a causal transformer backbone, replacing traditional stabilization methods with SIGReg-based isotropy regularization. The authors conduct experiments comparing LeNEPA with an ECG-tuned JEPA recipe across two datasets: PTB-XL and a synthetic dataset called Diag. The results indicate that LeNEPA maintains performance across varied datasets without the need for augmentation, demonstrating faster early representation acquisition and competitive accuracy compared to existing methods. This work highlights the potential of no-augmentation approaches in enhancing the efficiency and adaptability of time-series SSL.
Methodology
LeNEPA employs a Latent Euclidean Next-Embedding Prediction Architecture that predicts the next latent token in a sequence of time-series data. It uses a causal transformer backbone and incorporates SIGReg for isotropy regularization, avoiding the use of stop-gradient techniques. The model is evaluated using a fixed-recipe stress test across different datasets, maintaining the same configuration for both LeNEPA and a baseline JEPA method.
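A minimal sketch of a next-latent prediction objective is shown below. It assumes the encoder and causal transformer already exist and have produced `latents` and `predictions` arrays of shape `(time, dim)`; both names are hypothetical. The regularizer is a simple covariance-to-identity penalty used as a stand-in for illustration and is not SIGReg itself.

```python
import numpy as np

def next_latent_loss(latents, predictions):
    """Mean squared error between the prediction at step t and the latent at t+1."""
    target = latents[1:]       # latents at steps 1..T-1
    pred = predictions[:-1]    # predictions made at steps 0..T-2
    return float(np.mean((pred - target) ** 2))

def isotropy_penalty(latents):
    """Stand-in regularizer: squared distance of the latent covariance from identity.

    It is large when latents collapse to a point or to a few directions,
    which is the failure mode an isotropy regularizer is meant to prevent.
    """
    centered = latents - latents.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(latents) - 1)
    return float(np.mean((cov - np.eye(cov.shape[1])) ** 2))

def total_loss(latents, predictions, reg_weight=0.1):
    """Prediction loss plus weighted isotropy penalty (weight is an assumption)."""
    return next_latent_loss(latents, predictions) + reg_weight * isotropy_penalty(latents)
```

Without a stop-gradient or an augmentation-based target, the prediction loss alone could be minimized by mapping every input to the same latent, so some penalty on collapsed latents is what keeps the objective meaningful.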
Results
LeNEPA demonstrated superior performance in maintaining frozen-probe gains across both PTB-XL and Diag datasets, outperforming the ECG-tuned JEPA method when reused unchanged. Learning curves indicated that LeNEPA reached 80% of its final AUROC/AUPRC gain after 2-5k updates, compared to 5-10k updates for JEPA. Additionally, a CauKer-pretrained variant of LeNEPA achieved 77.65% mean UCR-128 Random-Forest accuracy, closely matching leading benchmarks.
Implications
The findings suggest that no-augmentation methods like LeNEPA can significantly reduce the engineering effort required for time-series SSL, making it easier to adapt models to new datasets. This could lead to more efficient and portable pretraining recipes in various applications, including industrial monitoring, finance, and healthcare.
Revocable Learned State via Process Sidecars
NLP
Large Language Models
Theory
- Introduces process sidecars for effective memory revocation in language models.
- Proves that naive task arithmetic is first-order incomplete when safety training alters memory directions.
- Demonstrates that process sidecars achieve second-order accuracy in memory edits.
- Empirical results show significant improvements in refusal closure across multiple trials.
Summary
This paper addresses the challenge of revoking learned states in language models after they have undergone multiple training phases, including public skill adaptation, private memory incorporation, and safety training. The author introduces a novel approach called 'process sidecars,' which utilizes a two-coefficient edit family to effectively manage the complexities of memory revocation. The methodology involves estimating the transported direction of memory updates during the safety training phase, which is crucial for accurately reverting to a counterfactual safety-only oracle. The paper presents theoretical proofs demonstrating that the proposed sidecar method achieves second-order accuracy in correcting first-order errors that arise when safety training alters memory directions. Empirical evaluations across various model configurations show that the process sidecars significantly improve refusal closure compared to naive task arithmetic methods, consistently outperforming them in multiple trials. The findings suggest that traditional unlearning methods may not be suitable in this context, as they can lead to increased safety loss. Overall, this work contributes to the understanding of memory management in language models and offers a robust solution for ensuring safety and privacy.
Methodology
The methodology involves the introduction of process sidecars, which are a two-coefficient edit family designed to estimate the transported direction of memory updates during safety training. The author employs theoretical proofs to establish the accuracy of the sidecar method and conducts extensive empirical evaluations across different model configurations to assess performance improvements in refusal closure.
Results
The results indicate that the process sidecars improve held-out refusal closure in 60 out of 60 trials across various models, outperforming naive task arithmetic and the γ = λ process-JVP subfamily. A confirmatory replication on unseen seeds further validates these findings, achieving 70 out of 70 successful trials. The method demonstrates a significant reduction in safety loss compared to traditional unlearning approaches.
Implications
The implications of this work are significant for the development of safer and more reliable language models. The proposed method can enhance privacy by effectively managing learned states, making it applicable in scenarios where sensitive information must be revoked. This research could influence future designs of language models and their training protocols, particularly in contexts requiring stringent safety measures.
Scaling Up Thermodynamic AI Models
Efficient ML
Theory
Optimization
- Development of a scalable backpropagation-based algorithm for training deep convolutional networks on Ising machines.
- Achieved high classification accuracies of 94.9% on CIFAR-10 and 76.0% on CIFAR-100 using thermodynamic inference.
- Established a mathematical theory linking inference cost to accuracy and controlling autocorrelation times.
- Demonstrated that over 99.99% of FLOPs can be off-loaded to thermodynamic inference in larger models.
Summary
This paper addresses the challenge of training large thermodynamic AI models, specifically those based on the Ising model, for low-power AI inference and edge computing. The authors propose a scalable backpropagation-based algorithm for training deep convolutional networks that can be implemented on Ising machine hardware. They establish a theoretical framework that connects the time-averaged behavior of high-temperature Gibbs-sampled Ising systems to feed-forward neural inference, allowing for the training of deep models. The authors validate their approach through experiments on image classification tasks, achieving high accuracies on benchmarks such as CIFAR-10 and CIFAR-100. They also develop a mathematical theory that relates inference cost to accuracy, providing insights into controlling autocorrelation times and deriving optimal inference schedules. The findings suggest that thermodynamic computing can significantly reduce energy consumption in AI inference without compromising performance, paving the way for future advancements in hardware development for thermodynamic AI models.
Methodology
The authors developed a series of algorithms to train deep neural networks as thermodynamic blocks, ensuring that each block adheres to the correspondence theorem. They utilized pure backpropagation with regularization terms to scale with problem size and established convergence theory to control inference costs and mixing times.
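The link between time-averaged sampling and a feed-forward layer can be seen in the simplest possible case, sketched below. This toy assumes output spins that do not couple to each other and are driven only by clamped inputs, so each spin's time-averaged value converges to `tanh(beta * field)`. It does not reproduce the paper's correspondence theorem or its treatment of coupled blocks, and `gibbs_time_average` is a hypothetical name.

```python
import numpy as np

def gibbs_time_average(W, b, x, beta=1.0, sweeps=20000, seed=0):
    """Time-averaged spins of an uncoupled Ising layer with clamped inputs x.

    Each output spin s in {-1, +1} feels the local field (W @ x + b).
    Resampling every spin from its conditional distribution and averaging
    over sweeps approximates tanh(beta * field).
    """
    rng = np.random.default_rng(seed)
    field = W @ x + b
    # Conditional probability that a spin points up given its local field
    p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * field))
    total = np.zeros_like(field)
    for _ in range(sweeps):
        spins = np.where(rng.random(field.shape) < p_up, 1.0, -1.0)
        total += spins
    return total / sweeps
```

The number of sweeps plays the role of inference cost: fewer sweeps give a noisier estimate of the activation, which is the kind of cost-accuracy tradeoff the paper's theory is described as quantifying.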
Results
The proposed methods achieved accuracies of 98.1% on MNIST, 93.5% on FashionMNIST, 94.9% on CIFAR-10, and 76.0% on CIFAR-100. The authors also demonstrated a tradeoff between inference-time cost and accuracy, and used their theoretical framework to derive optimal inference schedules.
Implications
The findings suggest that thermodynamic AI models can significantly enhance energy efficiency in AI inference tasks, making them suitable for edge computing applications. The work also lays the groundwork for future hardware development in thermodynamic computing, potentially leading to more sustainable AI technologies.
Probabilistic Inversion with Flow Matching
Generative Models
Optimization
Theory
- Flow Matching is adapted for probabilistic inversion in geophysics, enhancing the analysis of seismic data.
- Probabilistic inversion allows for uncertainty quantification without the need for initial guesses or regularization.
- The method is evaluated through case studies, demonstrating its applicability to both simple and complex models.
- Flow Matching bridges the gap between traditional probabilistic methods and modern deep learning techniques.
Summary
This paper explores the application of Flow Matching, a technique from generative AI, to probabilistic inversion in geophysics, particularly seismic full-waveform inversion (FWI). The authors adapt the mathematical framework of Flow Matching to probabilistic inversion, which traditionally faces challenges due to the ill-posed nature of inverse problems. They present two case studies: a simple 2D velocity model to demonstrate the method's general features and the OpenFWI dataset to showcase its effectiveness in handling complex seismic velocity models. The paper highlights the advantages of probabilistic inversion over deterministic methods, such as the absence of a need for initial guesses or regularization, and the ability to express uncertainty in model parameters. The authors also discuss how Flow Matching can efficiently transform probability distributions, making it a promising approach for real-world applications in geophysical data analysis.
Methodology
The authors employ Flow Matching, which trains a Continuous Normalizing Flow (CNF) to transport a known initial distribution to a target distribution. They adapt this framework for probabilistic inversion, allowing for the efficient sampling of possible solutions from the distribution of seismic velocity models. The method is evaluated through case studies involving both a simple 2D velocity model and a more complex dataset.
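The mechanics can be sketched on a toy problem. The code below shows the regression target used in the common straight-path variant of Flow Matching and an Euler sampler that integrates a velocity field from t = 0 to t = 1. In place of a trained network it uses an analytic velocity field for a hypothetical affine target `x1 = mean + scale * x0`; the seismic setting, the conditioning on observed data, and all function names are outside what the summary specifies.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1, and the velocity to regress onto."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def affine_velocity(mean, scale):
    """Exact velocity field for the toy coupling x1 = mean + scale * x0 (scale > 0)."""
    def velocity(x, t):
        # Invert x_t = ((1 - t) + t * scale) * x0 + t * mean to recover x0
        x0 = (x - t * mean) / ((1.0 - t) + t * scale)
        return mean + (scale - 1.0) * x0
    return velocity

def euler_sample(velocity, x0, steps=100):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x
```

Drawing many `x0` values from the initial distribution and pushing each through `euler_sample` yields samples from the target distribution, which is how a trained flow would produce an ensemble of candidate velocity models for uncertainty estimates.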
Results
The application of Flow Matching to the simple 2D velocity model successfully illustrated the method's ability to recover model parameters from observed data. The case study on the OpenFWI dataset demonstrated the method's robustness and effectiveness in probabilistic inversion, showcasing its potential to handle complex seismic velocity models.
Implications
The findings suggest that Flow Matching can significantly improve the efficiency and accuracy of probabilistic inversion in geophysical applications. This could lead to better modeling of subsurface structures and enhanced decision-making in resource exploration and environmental monitoring.
LLM-Guided ODE Discovery and Parameter Inference from Small-Cohort Aggregate Data
Large Language Models
Time Series
Interpretability
- AgentODE is an end-to-end framework for ODE discovery and parameter inference using population-level summary statistics.
- The framework utilizes a large language model to propose ODE structures and refine parameter distributions iteratively.
- AgentODE demonstrates superior performance in structure discovery compared to traditional methods that rely on individual-level data.
- The approach is particularly valuable for modeling rare diseases where data scarcity and privacy constraints are significant challenges.
Summary
This paper introduces AgentODE, an innovative framework that leverages large language models (LLMs) to discover ordinary differential equation (ODE) structures and infer parameter distributions from population-level summary statistics, particularly in the context of rare diseases where individual-level data is scarce. Traditional methods for ODE modeling require extensive expert knowledge and access to dense data, which is often unavailable in clinical settings dealing with rare diseases. AgentODE addresses this gap by employing an LLM to propose candidate ODE structures and a tool-augmented inference agent that iteratively refines parameter distributions through a diagnosis-update loop. The framework operates solely on summary statistics, thus preserving privacy and accommodating data heterogeneity. The authors evaluate AgentODE on synthetic benchmark problems and two clinical datasets, including recessive dystrophic epidermolysis bullosa (RDEB), demonstrating its ability to recover functionally consistent ODE structures. The results indicate that reasoning from summary statistics enhances mechanistic structure discovery, outperforming traditional methods that rely on individual-level data, which may yield implausible models despite better predictive performance. Overall, AgentODE represents a significant advancement in mechanistic modeling for rare diseases, enabling researchers to derive insights from limited data while adhering to privacy constraints.
Methodology
AgentODE employs a two-loop feedback mechanism where an LLM proposes candidate ODE structures and a parameter inference agent refines parameter distributions through iterative diagnosis and updates based on discrepancies between synthetic and empirical summary statistics. The process begins with structure proposals, followed by simulations to generate synthetic data for comparison with real data, allowing for continuous improvement of parameter distributions.
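The inner diagnosis-update loop can be illustrated with a toy example. The sketch fixes the ODE structure to exponential decay, `dx/dt = -k x` with `x(0) = 1`, so no LLM proposal step is involved, and it replaces the tool-augmented agent with a plain pattern search over the mean of the rate parameter. The function names, the choice of summary statistic (population mean at each time point), and the update rule are all assumptions made for illustration.

```python
import numpy as np

def simulate_summary(mu, sigma, times, noise):
    """Population mean of x(t) = exp(-k t), with k = mu + sigma * noise per subject."""
    k = np.clip(mu + sigma * noise, 1e-6, None)  # keep decay rates positive
    return np.exp(-np.outer(k, times)).mean(axis=0)

def refine_mu(target, times, mu=1.0, sigma=0.1, step=0.5, iters=40, seed=0):
    """Adjust mu until simulated summary statistics match the observed ones."""
    # Reusing the same noise for every candidate makes comparisons deterministic
    noise = np.random.default_rng(seed).standard_normal(500)

    def discrepancy(m):
        return float(np.sum((simulate_summary(m, sigma, times, noise) - target) ** 2))

    for _ in range(iters):
        candidates = [m for m in (mu - step, mu, mu + step) if m > 0]
        best = min(candidates, key=discrepancy)   # diagnose: which candidate fits best
        if best == mu:
            step *= 0.5                           # no improvement: search more finely
        mu = best                                 # update the parameter distribution
    return mu
```

Only the aggregate curve `target` enters the loop, so no individual trajectory is ever needed, which mirrors the privacy argument made for working from summary statistics.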
Results
AgentODE was evaluated on three synthetic benchmark problems and two clinical datasets, including RDEB, achieving performance comparable to methods using full trajectory access while relying solely on population-level summary statistics. The framework successfully recovered functionally consistent ODE structures and demonstrated that reasoning from summary statistics leads to more accurate mechanistic insights compared to methods using individual-level data.
Implications
AgentODE opens new avenues for mechanistic modeling in clinical research, particularly for rare diseases, by enabling the analysis of data that is otherwise difficult to utilize due to privacy concerns and scarcity. This framework could facilitate better understanding of disease mechanisms and improve treatment strategies based on limited data.
TDGT: A Tabular Data Generation Toolkit supporting adaptive GPU-accelerated Bayesian mixture models, diffusion-based models, and latent-space generative modeling
Generative Models
- TDGT provides an integrated web-based toolkit for synthetic tabular data generation.
- The Adaptive Bayesian Mixture Synthesizer (ABMS) autonomously optimizes mixture components, reducing manual configuration.
- VAE-ABMS combines latent space learning with adaptive synthesis for high-fidelity data generation.
- The toolkit includes GPU acceleration for efficient processing in large-scale scenarios.
Summary
The paper presents TDGT, a web-based toolkit designed for synthetic tabular data generation and fidelity assessment, addressing the challenges of existing tools that require programming expertise and manual hyperparameter tuning. TDGT introduces the Adaptive Bayesian Mixture Synthesizer (ABMS), which autonomously determines the optimal number of mixture components through iterative cluster quality optimization, thus eliminating the need for manual configuration. Additionally, the toolkit features VAE-ABMS, a hybrid architecture that combines Variational Autoencoder-based latent space learning with adaptive Bayesian mixture synthesis, enabling the generation of high-fidelity complex tabular distributions. For large-scale applications, TDGT offers a GPU-accelerated version of ABMS using CUDA for efficient clustering and Gaussian mixture fitting. The fidelity of the generated synthetic data is evaluated using eleven statistical metrics, including distributional divergence and structural correlation, alongside privacy risk indicators like k-anonymity scoring. The toolkit supports real-time streaming and interactive visualizations, making it accessible for domain researchers. Evaluations across healthcare, socioeconomic modeling, and cybersecurity datasets demonstrate TDGT's ability to consistently generate high-quality synthetic data while maintaining statistical coherence across various feature types and scales.
Methodology
TDGT employs the Adaptive Bayesian Mixture Synthesizer (ABMS) for optimal mixture component determination and integrates a Variational Autoencoder (VAE) with ABMS for latent space learning. The toolkit is GPU-accelerated using CUDA for enhanced performance in clustering and Gaussian mixture fitting.
Results
TDGT was evaluated on datasets from healthcare, socioeconomic modeling, and cybersecurity, showing consistent generation fidelity and statistical coherence across diverse feature types and data scales. The toolkit's performance was validated through eleven statistical fidelity metrics and privacy risk assessments.
Implications
TDGT has significant implications for privacy-preserving data sharing in regulated domains such as healthcare and finance, enabling collaborative research and model development without compromising individual privacy. It democratizes access to high-quality datasets, facilitating advancements in AI-driven systems.
TiRex-2: Generalizing TiRex to Multivariate Data and Streaming
Time Series
- TiRex-2 is the first time series foundation model to effectively integrate both past and future covariates while ensuring strict causality.
- The model operates at constant computational cost per time step, making it suitable for real-time streaming applications.
- A novel synthetic coupling pipeline allows for scalable multivariate pretraining from univariate data, enhancing model generalization.
- TiRex-2 achieves state-of-the-art performance on GIFT-Eval and fev-bench benchmarks.
Summary
The paper introduces TiRex-2, a recurrent xLSTM-based time series foundation model that extends the capabilities of the univariate TiRex to handle multivariate forecasting with both past and future covariates. Unlike existing Transformer-based models that struggle with computational efficiency and require full-history recomputation, TiRex-2 employs a memory-centric design that allows for constant per-patch cost during streaming. The architecture integrates a bidirectional time mixer and an asymmetric grouped-attention variate mixer, ensuring strict causality while incorporating future-known covariates. The authors also propose a synthetic coupling pipeline for scalable multivariate pretraining, generating diverse training samples from univariate datasets. Empirical evaluations demonstrate that TiRex-2 achieves state-of-the-art zero-shot performance on benchmark datasets while maintaining stability and efficiency in streaming contexts.
Methodology
TiRex-2 utilizes a recurrent architecture based on xLSTM, combining a bidirectional time mixer with an asymmetric grouped-attention variate mixer. This design allows for the integration of future covariates while preserving causality. The synthetic coupling pipeline generates multivariate training instances from univariate datasets, facilitating effective pretraining.
Results
TiRex-2 demonstrated state-of-the-art zero-shot performance on GIFT-Eval and fev-bench, remained stable during streaming to arbitrary context lengths, and maintained constant inference costs per patch. The model's architecture allows it to efficiently handle multivariate forecasting tasks.
Implications
The advancements presented in TiRex-2 have significant implications for real-time forecasting in various domains, such as finance, healthcare, and industrial monitoring, where accurate and timely predictions are critical. The model's ability to efficiently process multivariate data in a streaming context opens new avenues for applications in dynamic environments.
Physics-informed Conditional Normalizing Flows for Angles-only Cislunar Orbit Determination
Generative Models
- Introduction of a generative modeling approach for orbit determination in cislunar space.
- Utilization of normalizing flows for conditional density estimation based on angles-only measurements.
- Incorporation of a physics-informed loss term to enhance the accuracy of state estimates.
- Demonstration of improved performance over traditional orbit determination methods.
Summary
This paper presents a novel approach to orbit determination in the cislunar environment using physics-informed conditional normalizing flows. The authors formulate the problem as conditional density estimation, aiming to infer the probability distribution of the initial state from angles-only measurements over short observation arcs. A normalizing flow model is trained on perturbed topocentric observations from Near Rectilinear Halo Orbits (NRHO), allowing for a flexible and potentially multimodal posterior representation. The learned density is sampled to generate statistically consistent and physics-informed state hypotheses, which are then refined through nonlinear least-squares minimization. This approach provides a competitive warm start for classical orbit determination algorithms, addressing the challenges posed by the absence of a dedicated Global Navigation Satellite System (GNSS) in cislunar space and the complexities of communication and dynamical behavior in this region. The paper discusses the architecture of the model, the implementation of a physics-informed loss term, and compares the performance of the proposed method with traditional orbit determination techniques.
Methodology
The authors employ a normalizing flow model trained on perturbed topocentric observations to estimate the probability distribution of the initial state from angles-only measurements. The model incorporates a physics-informed loss term to regularize the training process, ensuring that the generated state hypotheses are consistent with the underlying physical laws governing orbital dynamics. The estimated states are further refined using nonlinear least-squares minimization.
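The combination of a conditional density and a physics penalty described above can be sketched in a few lines. This is a minimal illustration, not the authors' architecture: it assumes a single conditional affine layer with a standard normal base, and the names `W_mu`, `W_log_sigma`, `residual_fn`, `noise`, and `lam` are all hypothetical. The `residual_fn` stands in for whatever dynamical constraint the paper penalizes, which the summary does not specify.

```python
import numpy as np

def affine_flow_params(cond, W_mu, W_log_sigma):
    """Shift and log-scale of the flow, both predicted from the conditioning measurements."""
    return cond @ W_mu, cond @ W_log_sigma

def conditional_log_prob(x, cond, W_mu, W_log_sigma):
    """log p(x | cond) via the change-of-variables formula for one affine layer."""
    mu, log_sigma = affine_flow_params(cond, W_mu, W_log_sigma)
    z = (x - mu) * np.exp(-log_sigma)                 # map state x to base variable z
    base = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)
    return base - np.sum(log_sigma, axis=-1)          # add log |det dz/dx|

def physics_informed_loss(x, cond, W_mu, W_log_sigma, residual_fn, noise, lam=0.1):
    """Negative log-likelihood plus a weighted physics penalty on generated samples."""
    nll = -np.mean(conditional_log_prob(x, cond, W_mu, W_log_sigma))
    mu, log_sigma = affine_flow_params(cond, W_mu, W_log_sigma)
    samples = mu + np.exp(log_sigma) * noise          # inverse map z -> x
    penalty = np.mean(residual_fn(samples) ** 2)      # hypothetical physics residual
    return nll + lam * penalty
```

In this sketch `x` has shape `(n, d)` for `n` initial states of dimension `d`, `cond` has shape `(n, k)` for the encoded angles-only measurements, and `noise` is drawn from a standard normal with the same shape as `x`. A practical model would stack several invertible layers and use neural networks in place of the linear maps.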
Results
The proposed method demonstrates a significant improvement in the accuracy of orbit determination compared to traditional techniques. The inclusion of the physics-informed loss term enhances the model's ability to generate reliable state estimates, even in the challenging cislunar environment characterized by limited tracking measurements and communication delays.
Implications
This work has potential implications for future space missions operating in the cislunar environment, particularly in enhancing the reliability and accuracy of orbit determination for newly launched or uncooperative spacecraft. The methodology could be applied to various scenarios in astrodynamics, contributing to advancements in space exploration and navigation.
Expected Gain-based Escalation in Vertical Federated Learning
Federated Learning
Efficient ML
Theory
- Introduces a two-round inference protocol for selective escalation in VFL.
- Develops an analytical routing rule based on expected gain without requiring a separate routing model.
- Empirically shows improved communication-accuracy trade-off over existing methods.
- Utilizes held-out calibration data for reliable score estimation.
Summary
This paper addresses the challenge of optimizing communication and computational costs in vertical federated learning (VFL) during collaborative inference. The authors propose a two-round inference protocol where the first round generates a low-cost prediction using local client posteriors, and a second round is invoked only when it is expected to improve the prediction accuracy. The decision to escalate is based on an expected-gain score, which estimates whether the predicted improvement in correctness justifies the additional communication cost. This score is derived from a calibrated pooled posterior and classwise reliability estimates obtained from held-out calibration data, allowing for an interpretable routing mechanism without the need for a separately trained routing network. The proposed method is empirically validated across various multi-view classification benchmarks, demonstrating a favorable trade-off between communication and accuracy compared to existing baselines.
Methodology
In the first round of the protocol, each client sends its local posterior to a server, which computes a pooled prediction. The second round is triggered by an expected-gain score that weighs the potential improvement in prediction accuracy against the communication cost. This score combines a calibrated pooled posterior with classwise reliability statistics from the VFL model, allowing for a straightforward and interpretable routing decision.
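One way to picture an analytical routing rule of this kind is the sketch below. The exact score used in the paper is not given in this summary, so the formula here is an assumption: it compares the expected correctness of staying with the first-round prediction (the largest pooled probability) against the expected correctness after escalation (class probabilities weighted by a per-class reliability). The names `expected_gain`, `should_escalate`, `second_round_reliability`, and `comm_cost` are hypothetical.

```python
import numpy as np

def expected_gain(pooled_posterior, second_round_reliability):
    """Assumed score: expected correctness after escalating minus expected correctness now."""
    p = np.asarray(pooled_posterior, dtype=float)          # calibrated first-round posterior
    r = np.asarray(second_round_reliability, dtype=float)  # per-class accuracy of round two,
                                                           # estimated on held-out calibration data
    stay = p.max(axis=-1)                                   # chance the first-round argmax is right
    escalate = (p * r).sum(axis=-1)                         # chance round two is right
    return escalate - stay

def should_escalate(pooled_posterior, second_round_reliability, comm_cost):
    """Trigger the second round only when the expected gain exceeds the communication cost."""
    return expected_gain(pooled_posterior, second_round_reliability) > comm_cost
```

Under these assumptions an uncertain first-round posterior such as `[0.40, 0.35, 0.25]` is escalated when the second round is reliable, while a confident one such as `[0.97, 0.02, 0.01]` is not, because there is little correctness left to gain. No separate routing model is trained; the rule only needs the calibrated posterior and the reliability table.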
Results
The proposed routing mechanism significantly outperforms baseline methods based on confidence, learned gain, and deferral strategies across multiple multi-view classification tasks. The experiments demonstrate that the expected gain-based escalation leads to better accuracy while minimizing unnecessary communication.
Implications
This work has implications for improving the efficiency of federated learning systems, particularly in scenarios where communication costs are a concern. The proposed method can be applied to various multi-agent systems requiring collaborative inference, enhancing their predictive performance while managing resource constraints.
TRIE: An Evaluation Framework for Stochastic PDE Surrogates
Generative Models
Theory
Efficient ML
- Introduction of TRIE as a novel evaluation framework for stochastic PDE surrogates.
- Demonstration that traditional deterministic models fail to capture long-term statistical structures.
- Generative models outperform other methods in capturing invariant measures and providing reliable uncertainty estimates.
- Latent generative models with automatic dimension discovery reduce inference time significantly.
Summary
The paper introduces TRIE, an evaluation framework designed for stochastic partial differential equation (SPDE) surrogates, addressing the challenges posed by uncertainty in scientific systems. Traditional deterministic neural surrogates often fail to capture the statistical measures and uncertainties inherent in these systems. TRIE evaluates models based on three criteria: trustworthiness in predictive uncertainty, the ability to reproduce invariant measures, and efficiency in probabilistic generation. The authors demonstrate TRIE on two chaotic SPDEs: the stochastic Kuramoto–Sivashinsky equation and the stochastic Kolmogorov flow, across 11 parameter values. The evaluation reveals that standard pointwise-trained neural surrogates may yield plausible short-term predictions but struggle with long-term statistical accuracy. Approximate uncertainty methods, such as Monte Carlo dropout, often exhibit miscalibration and overconfidence. In contrast, generative models consistently perform well, accurately capturing invariant measure statistics and achieving the lowest continuous ranked probability score (CRPS). The study also highlights the benefits of latent generative models with automatic dimension discovery, which significantly reduce inference time while maintaining statistical fidelity. The authors provide code and data to facilitate reproducible evaluations of stochastic PDE forecasting models.
Methodology
The TRIE framework evaluates stochastic PDE surrogates based on three criteria: trustworthiness (using continuous ranked probability score and spatial uncertainty diagnostics), invariance (assessing long-time invariant measures), and efficiency (measuring inference time). The authors apply TRIE to two specific SPDEs and compare various surrogate models, including deterministic, approximate probabilistic, and generative models.
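For readers unfamiliar with the continuous ranked probability score, the following sketch computes the standard ensemble estimator, CRPS = E|X - y| - 0.5 E|X - X'|, pointwise over a field. It is a generic illustration and not taken from the TRIE codebase; the function name `ensemble_crps` and the array layout (ensemble members along the first axis) are assumptions.

```python
import numpy as np

def ensemble_crps(ensemble, observation):
    """Ensemble CRPS estimator, evaluated independently at every grid point.

    ensemble:    array of shape (m, ...) holding m sampled forecasts
    observation: array of shape (...) holding the reference solution
    """
    ens = np.asarray(ensemble, dtype=float)
    obs = np.asarray(observation, dtype=float)
    # Mean absolute error between each member and the observation.
    term_obs = np.abs(ens - obs).mean(axis=0)
    # Mean absolute difference over all ordered pairs of members (the spread term).
    term_spread = np.abs(ens[:, None] - ens[None, :]).mean(axis=(0, 1))
    return term_obs - 0.5 * term_spread
```

Lower values are better. A deterministic forecast is the special case of a single-member ensemble, where the spread term vanishes and the score reduces to the absolute error, which is why CRPS allows deterministic and probabilistic surrogates to be compared on one scale. The pairwise term costs O(m^2) memory per grid point, so large ensembles may need a sorted-sample formulation instead.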
Results
The evaluation shows that pointwise-trained neural surrogates often fail to reproduce invariant measures, while approximate uncertainty methods can be miscalibrated. Generative models consistently achieve the best performance, accurately capturing invariant measure statistics and achieving the lowest CRPS across all tested scenarios. Latent generative models also reduce inference time by a factor of approximately 12.
Implications
The findings suggest that TRIE can serve as a robust framework for evaluating and improving stochastic PDE surrogates, which has implications for various scientific fields where uncertainty plays a critical role, such as climate modeling, finance, and biological systems. The ability to efficiently generate reliable probabilistic forecasts can enhance decision-making processes in these domains.
Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images
Interpretability
Computer Vision
Multimodal
- Superposition in neural networks can corrupt the geometry of latent spaces, impacting interpretability.
- Sparse Autoencoders (SAEs) effectively disentangle superposed concepts, restoring geometric fidelity.
- The authors adapt scRNA-seq analysis methods to image data, enhancing biological hypothesis evaluation.
- GW-map framework aligns image representations with scRNA-seq data, reconstructing neuronal pathology pathways.
Summary
This paper addresses the challenge of superposition in neural networks, particularly in the context of biological image analysis. Superposition occurs when distinct biological concepts are compressed into lower-dimensional representations, leading to interpretability issues and corrupting the geometry of latent spaces. The authors propose using Sparse Autoencoders (SAEs) to disentangle these superposed concepts, thereby recovering geometric fidelity in the representations. They validate their approach using over 100,000 multiplexed images of patient-derived neurons, demonstrating that SAEs can effectively restore the intrinsic geometry of the data. Furthermore, the authors adapt single-cell RNA sequencing (scRNA-seq) analysis methodologies to the image domain, leveraging the purified representations. They introduce a novel framework called GW-map, which utilizes Gromov-Wasserstein optimal transport to align image representations with authentic scRNA-seq data, enabling the reconstruction of hierarchical neuronal pathology pathways without relying on spatial transcriptomics. This work establishes a scalable foundation for spatial biology and enhances the interpretability of AI models in biological contexts.
Methodology
The authors employ Sparse Autoencoders (SAEs) to disentangle superposed biological concepts in representations of high-dimensional images. Through theoretical and empirical analyses, they show that superposition contaminates the metric structure of the representation space. They then develop the GW-map framework, which aligns the purified SAE representations with scRNA-seq data using Gromov-Wasserstein optimal transport.
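As a rough illustration of the disentangling step, the sketch below shows the forward pass and training objective of a common sparse autoencoder formulation: a ReLU encoder into an overcomplete code, a linear decoder, and an L1 sparsity penalty. The summary does not describe the authors' SAE variant, so the architecture, the function names, and the `l1_coeff` value are assumptions.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode embeddings into sparse non-negative codes, then reconstruct them."""
    codes = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU keeps few features active
    recon = codes @ W_dec + b_dec                         # linear decoder back to input space
    return codes, recon

def sae_loss(x, codes, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(codes), axis=-1))
    return mse + l1_coeff * sparsity
```

Here `x` would be a batch of image embeddings of shape `(n, d)`, with `W_enc` of shape `(d, h)` and `W_dec` of shape `(h, d)` for some `h` larger than `d`. The intuition is that each concept gets its own code dimension instead of sharing directions with others, and it is these codes, rather than the raw embeddings, that a downstream alignment step would consume.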
Results
The study showed that SAEs successfully recover the geometric fidelity of representations, allowing for the application of scRNA-seq analytical methods to image data. The GW-map framework enabled the reconstruction of hierarchical neuronal pathways, demonstrating the effectiveness of the approach in aligning image and transcriptomic data.
Implications
This research has significant implications for improving interpretability in AI models applied to biological data, facilitating better understanding of complex biological systems. The methodologies developed can enhance the analysis of neuronal pathology and potentially lead to new insights in spatial biology.