AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 62
Update frequency: 8h
Days of history: 7
UOTIP: Unbalanced Optimal Transport Map for Unpaired Inverse Problems
Optimization
Generative Models
Computer Vision
- Introduces UOTIP, the first model for unpaired inverse problems based on Unbalanced Optimal Transport.
- Demonstrates robustness to multi-level observation noise and adaptability to class imbalance.
- Proves the existence and uniqueness of the transport map for ill-posed inverse problems.
- Achieves state-of-the-art performance on unpaired image inverse problem benchmarks.
Summary
This paper addresses the challenge of unpaired image inverse problems, where only independent sets of noisy measurements and clean target signals are available for training. The authors introduce a novel solver called Unbalanced Optimal Transport Map for Inverse Problems (UOTIP), which formulates the reconstruction task as learning an Unbalanced Optimal Transport (UOT) map from the noisy measurement distribution to the clean signal distribution. By relaxing the exact marginal constraint, UOTIP enhances robustness to multi-level observation noise, adapts to class imbalance, and generalizes across diverse noise types. The authors theoretically prove that incorporating a quadratic cost term ensures the existence and uniqueness of the transport map, even for ill-posed inverse problems. Experimental results show that UOTIP achieves state-of-the-art performance on benchmarks for both linear and nonlinear inverse problems, demonstrating its effectiveness in real-world scenarios where strict alignment between noisy observations and clean signals is often unattainable.
Methodology
The methodology involves formulating the inverse problem as learning a UOT map from the noisy measurement distribution to the clean signal distribution. A likelihood-based cost function is incorporated, and the exact marginal constraint is relaxed to enhance model robustness and adaptability.
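To make "relaxing the exact marginal constraint" concrete, here is a minimal sketch of discrete entropic unbalanced OT solved with scaling iterations. This is a stand-in technique, not the paper's learned transport map or its likelihood-based cost: the function name `unbalanced_sinkhorn`, the toy cost matrix, and the parameters `eps` (entropic regularization) and `rho` (strength of the KL penalty on the marginals) are all assumptions made for illustration.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.5, rho=1.0, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals (scaling iterations)."""
    K = np.exp(-C / eps)                  # Gibbs kernel of the cost matrix
    u, v = np.ones_like(a), np.ones_like(b)
    power = rho / (rho + eps)             # power < 1 softens the marginal constraints
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]    # transport plan

# Toy problem: balanced source weights, imbalanced target weights.
a = np.array([0.5, 0.5])
b = np.array([0.9, 0.1])
C = np.array([[0.0, 1.0], [1.0, 0.0]])

def row_marginal_error(rho):
    plan = unbalanced_sinkhorn(a, b, C, rho=rho)
    return float(np.abs(plan.sum(axis=1) - a).max())

err_tight = row_marginal_error(1e3)   # large rho: marginals nearly enforced
err_loose = row_marginal_error(0.1)   # small rho: marginals may drift
```

A large `rho` approaches the balanced problem, while a small `rho` lets the plan ignore mass it would be expensive to move, which is the mechanism the summary credits for robustness to noise and class imbalance.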
Results
UOTIP outperforms existing OT-based direct transport methods on various benchmarks for unpaired inverse problems, demonstrating superior performance in both linear and nonlinear contexts.
Implications
The findings suggest that UOTIP can be effectively applied in real-world scenarios where data alignment is challenging, such as in medical imaging, seismic imaging, and other fields requiring robust signal reconstruction from noisy measurements.
Robust Recommendation from Noisy Implicit Feedback: A GMM-Weighted Bayes-label Transition Matrix Framework
Theory
Optimization
- Introduction of the RGBT framework, which combines a GMM with a BLTM for robust recommendations.
- Theoretical guarantees of full sample utilization and low-variance estimation.
- Demonstrated effectiveness of RGBT in utilizing noisy samples compared to traditional methods.
- Superior calibration capability of transition matrix over state-of-the-art approaches.
Summary
This paper addresses the challenge of learning from noisy implicit feedback in recommender systems, which often leads to biased estimates and low data utilization. The authors propose a novel Robust GMM-weighted Bayes-label Transition Matrix framework (RGBT) that integrates a Gaussian Mixture Model (GMM) for instance-specific reliability scoring with Bayes-label transition matrix (BLTM) modeling. This approach allows for effective utilization of noisy samples while correcting biases in transition matrix estimation. Theoretical analysis confirms that RGBT maintains full sample utilization, provides consistent estimation, and significantly reduces estimation variance. Extensive experiments on real-world and synthetic datasets demonstrate that RGBT outperforms existing denoising methods by effectively leveraging noisy samples and achieving superior calibration of the transition matrix.
Methodology
The RGBT framework employs a Gaussian Mixture Model to derive instance-specific reliability scores, which are used to calibrate the Bayes-label transition matrix. This method allows for the effective use of all available data, including noisy samples, while ensuring accurate bias correction in the transition matrix estimation.
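The reliability-scoring stage can be illustrated with a common pattern: fit a two-component 1-D GMM to per-sample losses and use the posterior of the low-loss component as a soft weight. This is a sketch under assumptions, not the paper's procedure: using the training loss as the GMM input, the helper name `gmm_reliability`, and the toy loss values are all hypothetical, and the step that feeds the weights into the transition matrix is not reproduced.

```python
import numpy as np

def gmm_reliability(losses, n_iter=50):
    """Fit a two-component 1-D GMM by EM; return P(low-loss component | loss)."""
    x = np.asarray(losses, dtype=float)
    mu = np.array([x.min(), x.max()])      # initialise the means at the extremes
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return resp[:, np.argmin(mu)]          # weight = posterior of the low-mean component

# Toy losses: five samples that look clean, three that look noisy.
losses = np.array([0.10, 0.12, 0.08, 0.11, 0.09, 2.0, 2.2, 1.9])
weights = gmm_reliability(losses)
```

Because every sample keeps a nonzero weight rather than being dropped, soft scores of this kind are compatible with the "full sample utilization" property the summary describes.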
Results
The experiments show that RGBT significantly outperforms mainstream reliable sample-based denoising methods in utilizing noisy samples. It also achieves a higher calibration capability of the transition matrix compared to existing state-of-the-art transition matrix-based denoising approaches, confirming its effectiveness and robustness.
Implications
The findings suggest that RGBT can enhance the performance of recommender systems in real-world scenarios where implicit feedback is often noisy. This framework could be applied to various domains requiring robust recommendation strategies, potentially improving user satisfaction and engagement.
Training Language Agents to Learn from Experience
NLP
Large Language Models
Reinforcement Learning
- Introduction of the In-context Training (ICT) task for evaluating cross-task self-improvement in language agents.
- Development of a reinforcement learning-based training pipeline for reflectors to learn from experience autonomously.
- Demonstration of significant performance improvements in language agents across unseen tasks using the proposed framework.
- Generalization of learned reflections to substantially different environments beyond the training benchmarks.
Summary
This paper addresses the challenge of enabling language agents to learn from experience in interactive environments, specifically focusing on cross-task self-improvement. The authors introduce the In-context Training (ICT) task, which evaluates the ability of language agents to distill experiences into reusable lessons for future tasks. The framework involves a dual-LLM design where an actor model interacts with environments and a reflector model generates improved prompts based on the actor's performance. The proposed RL-based training pipeline allows the reflector to learn from experience without human-provided examples. Experiments conducted on ALFWorld and MiniHack demonstrate that trained reflectors significantly outperform untrained baselines across various held-out task families, showcasing the potential for generalization beyond the training benchmarks. Additionally, the authors introduce MetaGym, a Python library for constructing meta-environments, facilitating further research on self-improving language agents.
Methodology
The authors propose a dual-LLM architecture consisting of an actor model that interacts with environments and a reflector model that generates improved prompts based on the actor's performance. The reflector is trained using a reinforcement learning pipeline to analyze experiences and produce system prompts that facilitate cross-task learning without relying on human examples.
Results
The experiments show that the trained reflectors outperform untrained baselines on most held-out task families in both ALFWorld and MiniHack. The reflectors also demonstrate the ability to generalize and improve performance on previously unseen tasks, with some cases exhibiting generalization to different environments.
Implications
The findings suggest that language agents can be trained to learn from experience effectively, enabling them to adapt to new tasks more efficiently. This has potential applications in various interactive environments, enhancing the capabilities of language models in decision-making and problem-solving scenarios.
Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning
Optimization
Theory
- Introduction of a robust SCQM that accommodates various noise distributions.
- Development of a gradient descent algorithm with orthogonality-preserving updates.
- Theoretical analysis showing improved robustness with ℓp loss functions.
- Extensive experiments confirming superior performance over traditional methods.
Summary
This paper introduces a robust subspace-constrained quadratic model (SCQM) aimed at learning low-dimensional structures from high-dimensional data. The proposed model builds on the subspace-constrained quadratic matrix factorization (SQMF) framework and is designed to handle a wide range of noise distributions, including generalized Gaussian and radial Laplace models. This flexibility enhances the model's robustness against both heavy-tailed and light-tailed noise, which is often encountered in real-world data. To tackle the nonconvex optimization problem that arises, the authors develop a gradient-based algorithm with a backtracking line-search strategy, ensuring stable and efficient convergence. The paper also includes a sensitivity analysis of the ℓp and ℓ2 loss functions, highlighting their varying behaviors under different noise conditions. Extensive numerical experiments validate the theoretical findings, demonstrating that the proposed SCQM consistently outperforms existing methods in terms of robustness and reconstruction accuracy.
Methodology
The authors extend the classical Frobenius-norm formulation to matrix factorization using alternative norms such as the entrywise ℓ1,1 norm and the mixed ℓ2,1 norm. They derive explicit gradients for all variables and develop a Riemannian gradient descent algorithm on the Stiefel manifold to solve the nonconvex optimization problem, ensuring orthogonality is preserved during updates.
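An orthogonality-preserving update on the Stiefel manifold can be sketched as a tangent-space projection followed by a QR retraction. The sketch below is illustrative only: it uses a plain linear-subspace Frobenius loss rather than the paper's quadratic model or robust norms, and `stiefel_step`, the toy data `X`, the step size, and the iteration count are assumptions.

```python
import numpy as np

def stiefel_step(U, euclid_grad, step):
    """One Riemannian gradient step on the Stiefel manifold with QR retraction."""
    # Project the Euclidean gradient onto the tangent space at U
    sym = (U.T @ euclid_grad + euclid_grad.T @ U) / 2
    riem_grad = euclid_grad - U @ sym
    # Retract back onto the manifold via the Q factor
    Q, R = np.linalg.qr(U - step * riem_grad)
    # Fix column signs so that diag(R) is positive
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 40))                   # toy data, columns are samples
U, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # orthonormal starting basis

def loss(U):
    return np.linalg.norm(X - U @ (U.T @ X)) ** 2

before = loss(U)
for _ in range(100):
    # Gradient of -||U^T X||_F^2, which equals the loss up to a constant when U^T U = I
    grad = -2 * X @ (X.T @ U)
    U = stiefel_step(U, grad, step=1e-3)
after = loss(U)
```

Swapping the Frobenius loss for an ℓ1,1 or ℓ2,1 objective would change only how `grad` is computed; the projection and retraction steps stay the same.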
Results
The proposed SCQM demonstrates significant improvements in robustness and reconstruction accuracy when compared to existing methods, particularly in the presence of outliers. The theoretical analysis supports the effectiveness of the ℓp loss functions in enhancing model performance under diverse noise conditions.
Implications
The findings suggest that the proposed model can be effectively applied in various fields where low-dimensional structure learning is critical, such as image analysis, sensor data processing, and robust representation learning, particularly in scenarios with complex noise patterns.
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
Optimization
Federated Learning
Efficient ML
- LOSCAR-SGD combines local training, sparse model averaging, communication-computation overlap, and worker-specific local-step counts.
- The delay-corrected merge rule preserves local progress during communication delays.
- Theoretical guarantees are provided for convergence in smooth non-convex settings.
- Empirical results show significant reductions in training time with the proposed method.
Summary
The paper introduces LOSCAR-SGD, a novel Local Stochastic Gradient Descent (SGD) method designed to address the communication bottleneck in distributed learning, particularly in large-scale and federated learning environments. The authors identify three primary strategies to mitigate communication costs: local training, communication compression, and communication-computation overlap. LOSCAR-SGD uniquely integrates these strategies, allowing workers to perform local optimization while communication is ongoing. A significant contribution of this work is the development of a delay-corrected merge rule, which ensures that progress made during the overlap phase is preserved rather than discarded. The authors provide theoretical convergence guarantees for smooth non-convex objectives, demonstrating how factors like sparsity, overlap, and worker heterogeneity influence convergence rates. Empirical results validate that the proposed method reduces training time and that the delay-corrected merge rule outperforms traditional naive overwriting methods.
Methodology
The authors propose a Local SGD framework that allows for infrequent sparse model averaging and communication-computation overlap. They introduce a delay-corrected merge rule that maintains local progress during communication delays. Theoretical analysis is conducted to derive convergence guarantees, and experiments are performed to compare the performance of LOSCAR-SGD against traditional methods.
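The summary does not spell out the merge rule, so the following is only one plausible reading of "maintains local progress during communication delays": when the averaged model arrives, add back the displacement the worker accumulated since it sent its snapshot. The function names, the three-argument signature, and the toy vectors are hypothetical, and sparse averaging is ignored.

```python
import numpy as np

def naive_merge(x_avg, x_local, x_snapshot):
    # Overwrite: local steps taken during communication are thrown away
    return x_avg.copy()

def delay_corrected_merge(x_avg, x_local, x_snapshot):
    # Keep the displacement accumulated since the snapshot was sent
    return x_avg + (x_local - x_snapshot)

x_snapshot = np.array([1.0, 1.0])   # model sent for averaging
x_local = np.array([0.8, 0.7])      # model after extra local steps during the overlap
x_avg = np.array([0.9, 1.1])        # averaged model that arrives late

merged = delay_corrected_merge(x_avg, x_local, x_snapshot)
overwritten = naive_merge(x_avg, x_local, x_snapshot)
```

Under this reading, the two rules coincide when no local steps are taken during communication, since `x_local` then equals `x_snapshot`.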
Results
The theoretical analysis confirms convergence guarantees for smooth non-convex objectives, with bounds illustrating the impact of sparsity, overlap, and worker heterogeneity on convergence rates. Experimental results indicate that LOSCAR-SGD significantly reduces training time compared to traditional methods, and the delay-corrected merge rule is shown to be more effective than naive overwriting.
Implications
LOSCAR-SGD has potential applications in large-scale distributed learning and federated learning scenarios, where communication costs are a critical bottleneck. The integration of communication-computation overlap and delay-corrected synchronization can lead to more efficient training processes in various machine learning tasks.
Robust Personalized Recommendation under Hidden Confounding in MNAR
Theory
Optimization
- Introduces a novel framework (PUID) for personalized estimation of hidden confounding strength in recommender systems.
- Develops an entropy-based sensitivity estimator to quantify the influence of unobserved confounders.
- Proposes a benchmark-guided variant (BPUID) that enhances robustness and predictive accuracy.
- Demonstrates significant performance improvements over global methods in extensive experiments on real-world datasets.
Summary
This paper addresses the challenge of selection bias in recommender systems caused by hidden confounding, particularly in scenarios where data is Missing Not At Random (MNAR). Traditional methods like inverse propensity weighting and doubly robust estimators fail when unobserved confounders influence user interactions. The authors propose a novel framework called Personalized Unobserved-Confounding-aware Interaction Deconfounder (PUID), which estimates user-item specific sensitivity bounds to quantify hidden confounding strength. This approach relaxes the homogeneity assumption of existing global sensitivity bounds, allowing for personalized estimation of confounding effects. Additionally, the authors introduce a benchmark-guided variant (BPUID) that incorporates pre-trained models to enhance robustness and predictive accuracy. The methodology employs an entropy-based sensitivity estimator to infer confounding from observational data, and extensive experiments on three real-world datasets demonstrate that PUID significantly outperforms traditional global methods, validating its effectiveness in mitigating selection bias without requiring costly randomized controlled trial data.
Methodology
The methodology involves the development of the Personalized Unobserved-Confounding-aware Interaction Deconfounder (PUID), which estimates user-item-specific sensitivity bounds using an entropy-based approach. The framework incorporates adversarial optimization strategies and a benchmark-guided extension (BPUID) that utilizes pre-trained models to stabilize predictions and improve robustness against hidden confounding.
Results
The experiments conducted on three real-world datasets show that the proposed PUID and BPUID frameworks significantly outperform existing global methods in terms of recommendation accuracy and robustness against hidden confounding, validating the effectiveness of personalized sensitivity bounds.
Implications
The findings suggest that personalized sensitivity estimation can enhance the performance of recommender systems in real-world applications, particularly in scenarios where data is prone to selection bias. This approach could be applied in various domains such as e-commerce, social media, and content recommendation, improving user experience by providing more accurate and relevant recommendations.
A New Framework to Analyse the Distributional Robustness of Deep Neural Networks
Theory
Interpretability
Computer Vision
- Introduces a framework for analyzing distributional robustness in deep neural networks.
- Uses Bernoulli distributions to model interactions between layer weights and activations.
- Demonstrates the ability to distinguish between memorization and generalization in neural networks.
- Shows that distribution shifts negatively impact the separation metrics used for robustness diagnostics.
Summary
This paper addresses the critical issue of distributional robustness in deep neural networks (DNNs), which is essential for their deployment in real-world applications. The authors propose a novel framework that quantifies and analyzes the interactions between layer weights and activations using Bernoulli distributions. This framework serves as a diagnostic tool to assess the robustness of neural networks by examining the separation between classes as a proxy for robustness. The authors validate their framework through experiments on CIFAR-10 and ImageNet datasets, demonstrating that their metrics can effectively differentiate between networks that have memorized training data and those that have not. They also explore the behavior of these metrics under various distribution shifts, revealing that such shifts diminish the separation observed in their diagnostics. The findings suggest that the proposed framework offers valuable insights into the representation structure and robustness of neural networks, paving the way for improved understanding and enhancement of DNNs in the face of distributional changes.
Methodology
The authors study the interactions between weights and activations in a neural network by focusing on pre-activation values. They construct an interaction matrix that captures the contributions of each weight-activation pair, allowing for the analysis of class-specific activation paths through the network. This approach enables the quantification of robustness based on the separation of classes.
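A minimal sketch of the interaction matrix for a single dense layer, assuming it is the elementwise product of each weight with the incoming activation it multiplies (bias omitted). The name `interaction_matrix` and the toy values are illustrative, and the Bernoulli modelling built on top of it is not reproduced here.

```python
import numpy as np

def interaction_matrix(W, a):
    """M[j, i] = W[j, i] * a[i]: contribution of input unit i to pre-activation j."""
    return W * a[None, :]

W = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, 1.0]])   # 2 output units, 3 input units
a = np.array([2.0, 1.0, 4.0])      # activations from the previous layer

M = interaction_matrix(W, a)
pre_activation = M.sum(axis=1)     # row sums recover the usual W @ a
```

Keeping the individual terms instead of only their sum is what lets class-specific paths through the network be compared.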
Results
The proposed framework successfully distinguishes between networks that have memorized their training data and those that have not. Experiments reveal that distribution shifts lead to reduced separation in the diagnostic metrics, indicating a decrease in robustness. The framework provides a model-level diagnostic tool that enhances understanding of neural network behavior under distributional changes.
Implications
This framework can be used to improve the robustness of deep neural networks in practical applications, particularly in scenarios where distribution shifts are common. It offers a new perspective on model evaluation and can guide the development of more resilient neural network architectures.
Q-SYNTH: Hybrid Quantum-Classical Adversarial Augmentation for Imbalanced Fraud Detection
Generative Models
- Q-SYNTH is a hybrid quantum-classical framework for fraud detection.
- It synthesizes minority-class fraud samples to address class imbalance.
- The framework shows improved statistical fidelity and competitive downstream performance.
- Q-SYNTH offers a favorable trade-off between distributional fidelity and detection performance.
Summary
The paper addresses the challenge of credit card fraud detection, which is hindered by extreme class imbalance, where fraudulent transactions are rare. This imbalance often leads supervised learning models to favor the legitimate class, resulting in high accuracy but poor recall and F1-scores for the fraud class. To tackle this issue, the authors propose Q-SYNTH, a hybrid quantum-classical generative adversarial framework. In this framework, a parameterized quantum circuit acts as the generator, while a classical neural network serves as the discriminator. Q-SYNTH is specifically designed for synthesizing minority-class fraud samples in tabular data. The authors evaluate the generated samples based on statistical fidelity to real fraud samples and the performance of downstream fraud detection. They utilize distributional similarity measures, including Kolmogorov–Smirnov statistics and Wasserstein distances, as well as AUC-ROC for detectability assessments. The results indicate that Q-SYNTH reduces the marginal distribution mismatch compared to a classical GAN baseline while maintaining competitive performance in fraud detection. Although SMOTE shows the best feature-wise similarity and classical GANs achieve higher downstream performance in some cases, Q-SYNTH strikes a balance between distributional fidelity and performance, demonstrating the potential of hybrid quantum augmentation in imbalanced fraud detection.
Methodology
The methodology involves a hybrid adversarial framework where a parameterized quantum circuit generates synthetic fraud samples, and a classical neural network discriminates between real and synthetic samples. The evaluation includes statistical fidelity measures and downstream performance metrics across both quantum and classical classifiers.
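The per-feature fidelity measures can be sketched directly in NumPy. This assumes each feature is compared marginally, one column at a time; the helper names, the equal-sample-size restriction in `wasserstein_1d`, and the toy arrays are choices made for this sketch, not details taken from the paper.

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.abs(cdf_x - cdf_y).max())

def wasserstein_1d(x, y):
    """1-D Wasserstein-1 distance for equal-size samples: mean gap between sorted values."""
    x, y = np.sort(x), np.sort(y)
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.abs(x - y).mean())

real = np.array([0.0, 1.0, 2.0, 3.0])   # one feature of the real fraud samples
synthetic = real + 0.5                   # the same feature, shifted

ks = ks_statistic(real, synthetic)
w1 = wasserstein_1d(real, synthetic)
```

Lower values on both measures indicate that a synthetic feature's marginal distribution sits closer to the real one.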
Results
Q-SYNTH successfully reduces the marginal distribution mismatch compared to classical GANs while maintaining competitive performance in fraud detection tasks. It demonstrates a favorable balance between statistical fidelity and downstream performance, making it a viable option for addressing imbalanced datasets in fraud detection.
Implications
The findings suggest that hybrid quantum-classical models can enhance data augmentation techniques for imbalanced datasets, particularly in sensitive domains like fraud detection. This approach could lead to more robust detection systems capable of adapting to evolving fraud patterns.
Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduces analogy transduction for synthesizing goal-reaching behaviors across varying contexts.
- Proposes a novel task-endogenous analogy representation that captures essential changes for optimal execution.
- Develops the Compositional Transduction with latent Analogies (CTA) approach for offline GCRL.
- Demonstrates significant performance improvements over existing methods in empirical evaluations.
Summary
This paper addresses the challenge of compositional generalization in offline goal-conditioned reinforcement learning (GCRL), where agents must learn to reach unseen goals from limited data. The authors critique existing methods that rely on trajectory stitching, which fails to effectively generalize across varying contexts. They propose a novel framework termed analogy transduction, which synthesizes new plans by leveraging task-endogenous analogies in different contexts. The authors introduce a new representation of these analogies that captures necessary changes for optimal task execution while remaining invariant to contextual variations. This representation is grounded in a theoretical framework that treats task-irrelevant contexts as noise. The proposed Compositional Transduction with latent Analogies (CTA) approach enhances generalization capabilities by enabling the agent to compose unseen analogy-context combinations. Empirical results demonstrate that CTA significantly outperforms prior methods on OGBench manipulation environments, achieving an average performance improvement of approximately 42%.
Methodology
The authors formalize analogy transduction as a method for synthesizing goal-reaching behaviors by composing task-endogenous analogies with task-exogenous contexts. They introduce a new analogy representation based on the difference between optimal temporal distance fields, which helps in ignoring irrelevant context differences. The CTA approach is designed to leverage this representation to support both in-distribution and out-of-combination generalization.
Results
The empirical evaluation on OGBench manipulation environments shows that the CTA approach outperforms the strongest baseline methods by about 42%, demonstrating its effectiveness in achieving compositional generalization in offline GCRL.
Implications
The findings suggest that leveraging analogy transduction can significantly enhance the capabilities of goal-conditioned agents in diverse and changing environments, potentially leading to more robust and adaptable robotic systems. This approach may also influence future research in reinforcement learning and compositional generalization.
Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction
Optimization
Theory
- Introduces a projection-based algorithm for COCO with improved regret and CCV bounds.
- Achieves O(log T) regret and O(log T) CCV for strongly convex losses.
- Maintains O(√T) regret while improving CCV to O(√T) for convex losses.
- Utilizes a novel movement bound related to self-contracted curves.
Summary
This paper addresses the problem of Constrained Online Convex Optimization (COCO) with adversarially chosen constraints, where the learner selects actions before observing the loss and constraint functions. The authors propose a projection-based algorithm that achieves improved performance metrics: O(log T) regret and O(log T) cumulative constraint violation (CCV) for strongly convex losses, and O(√T) regret with O(√T) CCV for convex losses. The key innovation lies in leveraging a geometric result related to self-contracted curves, which helps in bounding the movement cost associated with successive projections onto shrinking feasible sets. This advancement represents a significant improvement over previous algorithms, which had worse CCV bounds. The methodology involves a class of first-order algorithms that utilize gradient steps followed by projections onto feasible sets, with the authors establishing a new movement bound that connects these projections to self-contraction properties. The results indicate that the proposed algorithm not only meets but exceeds the performance guarantees of existing methods, providing a more efficient approach to COCO problems.
Methodology
The authors develop a first-order algorithm class called NP-OGD, which performs a gradient step on the revealed cost function and projects the result onto the intersection of feasible sets. They establish a new movement bound for the algorithm by connecting the projection operations to properties of self-contracted curves, allowing them to control the cumulative constraint violation effectively.
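The gradient-then-project pattern can be sketched for the simplest case where every revealed feasible set is an axis-aligned box, so that the intersection is again a box and the projection is a clip. The box assumption, the name `projected_ogd`, the fixed step size `eta`, and the toy losses are all illustrative; the paper's NP-OGD handles general convex constraint sets and is not reproduced here.

```python
import numpy as np

def projected_ogd(grads, boxes, x0, eta):
    """Gradient step, then projection onto the intersection of all boxes revealed so far."""
    x = np.asarray(x0, dtype=float)
    lo = np.full_like(x, -np.inf)
    hi = np.full_like(x, np.inf)
    iterates = []
    for grad_fn, (box_lo, box_hi) in zip(grads, boxes):
        iterates.append(x.copy())       # the action is played before the round is revealed
        lo = np.maximum(lo, box_lo)     # intersect the new box with the running one
        hi = np.minimum(hi, box_hi)
        x = np.clip(x - eta * grad_fn(x), lo, hi)
    return iterates, x

target = np.array([3.0, 3.0])
grad = lambda x: 2 * (x - target)       # gradient of ||x - target||^2
boxes = [(-2.0, 2.0), (-1.0, 1.5), (-0.5, 1.0)]   # shrinking feasible sets

iterates, final = projected_ogd([grad] * 3, boxes, x0=[0.0, 0.0], eta=0.25)
```

Each projection here moves the iterate onto a smaller set than the last, which is the kind of successive movement the self-contraction argument is used to bound.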
Results
The proposed algorithm achieves O(log T) regret and O(log T) CCV for strongly convex losses, and O(√T) regret with O(√T) CCV for convex losses. This represents a substantial improvement in CCV bounds compared to existing methods, which had previously achieved O(√T log T) CCV for strongly convex losses.
Implications
The findings suggest that the proposed algorithm can be effectively applied in scenarios requiring constrained optimization under uncertainty, such as online learning, resource allocation, and decision-making processes in adversarial environments. The improved guarantees may lead to more robust and efficient algorithms in practical applications.
Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates
Optimization
- Ada2MS combines the advantages of AdamW and Momentum SGD to improve optimization performance.
- The algorithm utilizes exponential interpolation between elementwise and global second-moment estimates.
- Ada2MS maintains stability while gradually introducing SGD-like characteristics during training.
- Experimental results show that Ada2MS performs competitively in visual tasks.
Summary
This paper introduces Ada2MS, a novel optimization algorithm designed to combine the strengths of two prominent optimization methods: AdamW and Momentum SGD. While AdamW is known for its stability and robustness across various training scenarios, it sometimes lacks generalization performance compared to Momentum SGD, which can achieve better results but requires careful tuning and is sensitive to gradient-scale variations. Ada2MS addresses this trade-off by employing a continuous exponential interpolation between elementwise second-moment estimates (similar to AdamW) and global second-moment estimates (similar to Momentum SGD). This hybrid approach allows for a smooth transition in optimization behavior, enhancing the algorithm's adaptability throughout the training process. The experimental results demonstrate that Ada2MS achieves competitive performance on visual tasks, indicating its potential as a versatile optimization tool in deep learning.
Methodology
The Ada2MS algorithm is developed through a continuous exponential interpolation mechanism that blends the elementwise second-moment estimates of AdamW with the global second-moment estimates of Momentum SGD. This design allows the algorithm to adapt its behavior dynamically during training, balancing stability and generalization performance.
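The summary does not give the update formula, so the sketch below is a guess at what "exponential interpolation" could look like: a geometric (log-space) blend between the per-coordinate second moment and a single global scalar. Taking the global estimate to be the mean of the elementwise one, the mixing coefficient `alpha`, and both function names are assumptions.

```python
import numpy as np

def mixed_second_moment(v_elem, alpha):
    """Geometric (log-space) interpolation between elementwise and global second moments."""
    v_global = v_elem.mean()                       # one scalar shared by all coordinates
    return v_elem ** (1.0 - alpha) * v_global ** alpha

def update(param, m, v_elem, alpha, lr=1e-3, eps=1e-8):
    # Adam-style step that divides the momentum by the blended denominator
    return param - lr * m / (np.sqrt(mixed_second_moment(v_elem, alpha)) + eps)

v = np.array([1.0, 4.0, 16.0])
adam_like = mixed_second_moment(v, alpha=0.0)      # purely elementwise scaling
sgd_like = mixed_second_moment(v, alpha=1.0)       # one shared scale for every coordinate
halfway = mixed_second_moment(v, alpha=0.5)
```

With `alpha` at 0 the step is scaled per coordinate as in AdamW, and with `alpha` at 1 every coordinate shares one scale, so the step direction follows the momentum as in Momentum SGD.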
Results
Ada2MS was evaluated on various visual tasks and demonstrated competitive results compared to existing optimization algorithms under a unified optimizer-comparison protocol, showcasing its effectiveness in improving model convergence and performance.
Implications
The introduction of Ada2MS could lead to more efficient training processes in machine learning, particularly in scenarios where both stability and generalization are critical. Its hybrid nature may also inspire further research into optimization algorithms that leverage the strengths of existing methods.
Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
Theory
- Introduces a framework for piecewise-stationary low-rank linear contextual bandits.
- Establishes an identification boundary for recovering moving subspaces under scalar feedback.
- Develops the SPSC algorithm that interleaves probing and exploitation to adapt to subspace changes.
- Demonstrates significant performance improvements over existing methods in empirical evaluations.
Summary
This paper addresses the challenges of low-rank bandits in dynamic environments where the underlying reward structure is not stationary. Traditional approaches either assume a fixed low-rank structure or adapt to non-stationarity at the cost of increased regret. The authors propose a novel framework for piecewise-stationary low-rank linear contextual bandits with scalar feedback, where the reward structure is characterized by a rank-r factor that remains constant within segments but can shift at unknown change points. The paper introduces the Single-Play Subspace-Calibrated Optimism (SPSC) algorithm, which interleaves probing with exploitation to effectively learn and adapt to the moving subspace. The authors establish a tight identification boundary for recovering the moving subspace through quadratic functionals of rewards, identifying necessary and sufficient conditions for successful recovery. Empirical evaluations demonstrate that SPSC significantly outperforms existing non-stationary and low-rank baselines across various datasets, confirming the theoretical findings and showcasing its practical applicability in real-world scenarios.
Methodology
The authors develop the SPSC algorithm, which combines isotropic probing with windowed projected ridge-UCB exploitation. The algorithm uses quadratic measurement identities to recover the low-rank structure while dynamically adapting to changes in the subspace. A CUSUM-style adaptive variant is also introduced to detect segment boundaries online.
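The CUSUM-style boundary detection mentioned above can be illustrated with a generic two-sided CUSUM test on reward prediction residuals. This is a minimal sketch, not the paper's adaptive variant: the function name, the `drift` and `threshold` parameters, and the synthetic residuals are all assumptions made for illustration.

```python
def cusum_change_points(residuals, drift=0.5, threshold=5.0):
    """Return the time steps at which a two-sided CUSUM statistic fires.

    residuals: reward prediction errors of the current model.
    drift, threshold: hypothetical tuning knobs, not values from the paper.
    """
    pos, neg = 0.0, 0.0
    alarms = []
    for t, r in enumerate(residuals):
        # Accumulate evidence for an upward or a downward mean shift.
        pos = max(0.0, pos + r - drift)
        neg = max(0.0, neg - r - drift)
        if pos > threshold or neg > threshold:
            alarms.append(t)
            pos, neg = 0.0, 0.0  # restart the statistic for the next segment
    return alarms


# Synthetic residuals: zero error for 20 steps, then a persistent shift.
residuals = [0.0] * 20 + [2.0] * 10
alarms = cusum_change_points(residuals)
```

In a full bandit loop the estimator would be reset after the first alarm, so the residuals would shrink again; here they stay large, which is why the detector can fire more than once.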
Results
The SPSC algorithm achieves a dynamic regret rate of Õ(r√T) + Õ(T^(2/3)) + O(W V_in), significantly improving upon the ambient regret rate of Õ(d√T) found in traditional methods. Empirical results across eleven benchmarks show that SPSC outperforms various non-stationary and low-rank baselines, particularly when the difference between ambient dimension and intrinsic rank is large.
Implications
The findings have significant implications for applications in personalized recommendation systems, clinical dosing, and ad targeting, where understanding and adapting to changing user preferences or treatment responses is crucial. The ability to effectively identify and exploit low-dimensional structures in dynamic environments can enhance decision-making processes in these fields.
AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals
NLP
Large Language Models
Reinforcement Learning
- AVSD utilizes multiple types of privileged information to enhance self-distillation.
- The method separates stable consensus signals from view-specific residual signals.
- AVSD outperforms traditional single-view self-distillation methods on various benchmarks.
- The approach addresses the limitations of relying on a single privileged view for training.
Summary
The paper introduces Adaptive-View Self-Distillation (AVSD), a novel self-distillation method that enhances the training of language models by leveraging multiple types of privileged information. Traditional self-distillation methods often rely on a single view of privileged information, which can lead to performance limitations due to the inherent asymmetry between the teacher and student models. AVSD addresses this by separating the learning signals into a consensus signal, which is stable across different views, and a view-specific residual signal that adjusts the update magnitude based on alignment with the consensus. This approach allows for more robust token-level supervision and improves the model's ability to generalize across tasks. The authors validate AVSD through experiments on math competition benchmarks and code-generation tasks, demonstrating significant performance improvements over existing single-view self-distillation methods.
Methodology
The AVSD framework employs an on-policy self-distillation approach that integrates multiple privileged information views. It identifies a consensus signal shared across these views, which serves as a reliable update direction, while also incorporating view-specific signals that can enhance the learning process when they align with the consensus. This dual signal approach allows for a more nuanced and effective training process.
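One way to picture the split into a consensus signal and an alignment-dependent magnitude is the toy sketch below. It is not the AVSD update rule: the function `combine_views`, the sign-agreement measure, and the `alpha` scale are hypothetical choices used only to make the idea concrete.

```python
def combine_views(view_signals, alpha=0.5):
    """Combine token-level signals from several privileged views.

    view_signals: one list of per-token signals per view (same length each).
    alpha: hypothetical strength of the agreement-based boost.
    """
    n_views = len(view_signals)
    n_tokens = len(view_signals[0])
    combined = []
    for t in range(n_tokens):
        values = [view[t] for view in view_signals]
        # Consensus: the direction shared across views (here, the mean).
        consensus = sum(values) / n_views
        # Fraction of views whose signal has the same sign as the consensus.
        agree = sum(1 for v in values if v * consensus > 0) / n_views
        # Larger update where views align with the consensus.
        combined.append(consensus * (1.0 + alpha * agree))
    return combined


signals = [
    [1.0, 0.5, -1.0],   # view 1
    [1.0, -0.5, -1.0],  # view 2
    [1.0, 0.6, 2.0],    # view 3
]
update = combine_views(signals)
```

In this toy input the first token gets the full boost because all views agree, while the third token gets no update because the views cancel out.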
Results
Experiments on math competition benchmarks (AIME24, AIME25, HMMT25) show that AVSD achieves average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Additionally, on code-generation benchmarks (Codeforces, LiveCodeBench v6), AVSD outperforms the single-view self-distillation baseline by 2.4% on average.
Implications
The findings suggest that leveraging multiple types of privileged information can significantly enhance the training of language models, particularly in complex reasoning tasks. This approach could be applied to various domains requiring high-level reasoning and problem-solving capabilities, potentially improving the performance of AI systems in real-world applications.
APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents
Large Language Models
Reinforcement Learning
Optimization
- APEX introduces a strategy map to maintain an explicit exploration space for LLM agents.
- The framework effectively addresses exploration collapse by balancing exploration and exploitation.
- APEX outperforms existing self-evolving agent frameworks across multiple benchmarks.
- The mechanisms of Fork Discovery and Policy Selection are crucial for enhancing exploratory behavior.
Summary
The paper introduces APEX (Autonomous Policy Exploration), a novel framework designed to enhance the exploration capabilities of self-evolving large language model (LLM) agents. Traditional LLM agents struggle with exploration collapse, where they become fixated on familiar high-reward strategies, hindering their ability to discover new, potentially better strategies. APEX addresses this issue by maintaining an explicit strategy space through a directed acyclic graph (DAG) known as a strategy map. This map consists of milestones and prerequisite dependencies, allowing the agent to systematically explore unexplored directions. The framework incorporates two key mechanisms: Fork Discovery, which identifies and expands the strategy map with unexplored directions based on past experiences, and Policy Selection, which balances exploration and exploitation during planning. The effectiveness of APEX is validated through evaluations on nine Jericho text-adventure games and the WebArena benchmark, demonstrating superior performance compared to existing methods. The results indicate that APEX not only prevents exploration collapse but also facilitates sustained exploration and discovery of qualitatively different strategies.
Methodology
APEX employs a directed acyclic graph (DAG) to represent a strategy map, where nodes are milestones and edges denote prerequisite dependencies. It utilizes Fork Discovery to identify unexplored strategies from past episodes and expands the strategy map, while Policy Selection scores these milestones to ensure the agent explores under-utilized strategies during planning.
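The strategy map described above can be sketched as a small DAG of milestones with a scoring rule that favors rarely tried branches. This is an illustrative sketch only: the class `StrategyMap`, its methods, and the UCB-style bonus are assumptions, and the manual `add_milestone` call stands in for the LLM-driven Fork Discovery step.

```python
import math


class StrategyMap:
    """Milestones as nodes, prerequisite dependencies as edges."""

    def __init__(self):
        self.prereqs = {}  # milestone -> set of prerequisite milestones
        self.visits = {}   # milestone -> number of attempts
        self.value = {}    # milestone -> running mean reward

    def add_milestone(self, name, prereqs=()):
        # Only existing milestones may be prerequisites, so no cycle can form.
        missing = [p for p in prereqs if p not in self.prereqs]
        if missing:
            raise ValueError(f"unknown prerequisites: {missing}")
        self.prereqs[name] = set(prereqs)
        self.visits[name] = 0
        self.value[name] = 0.0

    def record(self, name, reward):
        self.visits[name] += 1
        self.value[name] += (reward - self.value[name]) / self.visits[name]

    def frontier(self, achieved):
        # Milestones not yet achieved whose prerequisites are all satisfied.
        done = set(achieved)
        return [m for m, pre in self.prereqs.items()
                if m not in done and pre <= done]

    def select(self, achieved, c=1.0):
        total = sum(self.visits.values())

        def score(m):
            # Hypothetical UCB-style bonus for under-explored milestones.
            bonus = c * math.sqrt(math.log(total + 2) / (self.visits[m] + 1))
            return self.value[m] + bonus

        return max(self.frontier(achieved), key=score)


smap = StrategyMap()
smap.add_milestone("open door")
smap.add_milestone("find key", ["open door"])
smap.add_milestone("climb window", ["open door"])  # a newly discovered fork
for _ in range(5):
    smap.record("find key", 1.0)
choice = smap.select(achieved={"open door"}, c=2.0)
```

With a large enough exploration weight `c`, the untried fork is chosen even though the familiar milestone has a higher observed reward, which is the behavior the summary describes as avoiding exploration collapse.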
Results
APEX consistently outperformed all baseline models across nine Jericho text-adventure games and the WebArena benchmark. The framework demonstrated significant improvements in tasks requiring the discovery of qualitatively different strategies, effectively mitigating exploration collapse and enhancing overall agent performance.
Implications
The development of APEX has significant implications for the design of self-evolving agents in complex environments, enabling them to maintain a broader exploration strategy and continuously improve their performance over time. This could be applied in various domains such as interactive gaming, autonomous systems, and decision-making tasks.
Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization
Optimization
- Introduces Kernel Discovery, an LLM-driven framework for high-dimensional BO.
- Employs a two-stage approach for kernel generation and validation.
- Proposes LOO-CRPS as a robust selection criterion to avoid overfitting.
- Achieves superior performance on high-dimensional BO benchmarks compared to existing methods.
Summary
This paper addresses the challenges of designing effective Gaussian Process (GP) kernels for high-dimensional Bayesian optimization (BO), which often requires extensive manual engineering. The authors introduce a novel framework called Kernel Discovery, which leverages Large Language Models (LLMs) in an evolutionary algorithm to explore a broader kernel space beyond traditional composition rules. The framework operates in two stages: first, an LLM proposes new mathematical forms for kernels, and then a second LLM converts these forms into executable code. To evaluate the kernels, the authors propose a leave-one-out continuous ranked probability score (LOO-CRPS) as a selection criterion that mitigates overfitting. The method is tested on five high-dimensional BO benchmarks, achieving an average rank of 1.2 out of 17, outperforming existing competitive baselines. The analysis of the discovered kernels reveals that unexpected kernel structures, such as compositions of geometric warping, can lead to significant improvements in performance, providing insights into effective kernel design in high-dimensional settings.
Methodology
The Kernel Discovery framework utilizes a two-stage process where an LLM generates novel mathematical kernel forms, which are then converted into executable code by a second LLM. The kernels are validated through agnostic execution and positive semi-definiteness checks. The selection of the most promising kernels is guided by the LOO-CRPS metric, which is less susceptible to overfitting than traditional metrics.
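To make the LOO-CRPS criterion concrete, the sketch below scores a kernel by the closed-form CRPS of the Gaussian leave-one-out predictions of a zero-mean GP. It assumes the standard leave-one-out identities for GP regression and a fixed noise level; the helper names (`crps_gaussian`, `invert`, `loo_crps`, `rbf`) and the toy data are illustrative and not taken from the paper.

```python
import math


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at outcome y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def loo_crps(kernel, xs, ys, noise=1e-2):
    """Mean CRPS of leave-one-out GP predictions (lower is better)."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    K_inv = invert(K)
    alpha = [sum(K_inv[i][j] * ys[j] for j in range(n)) for i in range(n)]
    total = 0.0
    for i in range(n):
        var = 1.0 / K_inv[i][i]        # leave-one-out predictive variance
        mu = ys[i] - alpha[i] * var    # leave-one-out predictive mean
        total += crps_gaussian(mu, math.sqrt(var), ys[i])
    return total / n


def rbf(lengthscale):
    return lambda a, b: math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]
score_smooth = loo_crps(rbf(1.0), xs, ys)
score_rough = loo_crps(rbf(0.01), xs, ys)
```

A kernel whose length scale matches the data should receive the lower score here. Because each point is scored by a model that never saw it, the criterion penalizes kernels that merely interpolate the training set.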
Results
The proposed method achieved an average rank of 1.2 out of 17 across five high-dimensional BO benchmarks, significantly outperforming competitive baselines. The analysis of the discovered kernels indicated that certain unexpected structures could lead to performance improvements, offering new insights into kernel effectiveness in high-dimensional optimization.
Implications
The findings suggest that automating kernel design using LLMs can reduce the reliance on manual engineering in high-dimensional BO, potentially leading to more efficient optimization strategies across various applications, including hyperparameter tuning, neural architecture search, and other complex machine learning tasks.
Efficient Learning of Deep State Space Models via Importance Smoothing
Time Series
Generative Models
Efficient ML
- Introduction of parallel variational Monte Carlo (PVMC) for training DSSMs.
- PVMC combines the strengths of variational auto-encoding and DSMC methods.
- Achieves state-of-the-art results on baseline experiments.
- Demonstrates a 10× speed-up over existing DSMC methods.
Summary
This paper addresses the challenges of training deep state space models (DSSMs) efficiently, particularly in the context of time series data observed through noisy measurements. The authors identify two primary training paradigms for DSSMs: variational auto-encoding methods that optimize a variational lower bound and differentiable sequential Monte Carlo (DSMC) methods that utilize particle filtering. While the former allows for parallel training, it struggles with supervised tasks and provides loose variational bounds. Conversely, DSMC methods support supervised learning but suffer from sequential execution, leading to inefficiencies on modern hardware. To bridge these gaps, the authors propose a novel training method called parallel variational Monte Carlo (PVMC), which enables efficient parallel execution while maintaining the ability to construct an importance-weighted approximation of the marginal posterior over latent states. PVMC achieves tighter variational bounds than standard variational auto-encoder methods and is capable of training DSSMs for both generative and discriminative tasks. The proposed method demonstrates a significant speed-up in training, achieving results that are 10 times faster than the fastest competing DSMC approach and 100 times faster than unbiased DSMC methods.
Methodology
The authors developed PVMC, an end-to-end differentiable particle smoother that utilizes an importance-weighted approximation to the smoothing distribution of latent states. This method avoids the sequential proposal mechanisms of traditional particle filtering, allowing for efficient parallel execution. They also derived a new evidence lower bound (ELBO) for training generative DSSMs, which accounts for all possible trajectories through proposed particles, enhancing the accuracy of the model.
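The importance-weighted quantities that this kind of smoother relies on can be sketched with generic self-normalized importance sampling. This is not the PVMC estimator or its ELBO: the function names and the toy particles are assumptions, and each log weight is assumed to be log p(x_k, y) - log q(x_k) for a proposed latent state x_k.

```python
import math


def log_mean_exp(log_weights):
    """Numerically stable log of the average importance weight."""
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights) / len(log_weights))


def normalized_weights(log_weights):
    """Self-normalized importance weights that sum to one."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    s = sum(w)
    return [x / s for x in w]


def smoothed_mean(particles, log_weights):
    """Importance-weighted estimate of the posterior mean of a latent state."""
    return sum(w * x for w, x in zip(normalized_weights(log_weights), particles))


# Toy example: three proposed particles with unnormalized log weights.
particles = [0.0, 1.0, 2.0]
log_weights = [-1.0, -2.0, -0.5]
bound = log_mean_exp(log_weights)            # importance-weighted estimate of log p(y)
posterior_mean = smoothed_mean(particles, log_weights)
```

By Jensen's inequality the log of the average weight is never below the average log weight, which is the sense in which importance-weighted bounds are tighter than a single-sample variational bound. All particles can be weighted independently, so nothing in this computation is sequential.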
Results
The experimental results indicate that PVMC outperforms existing methods in terms of speed and accuracy. It achieves a 10× speed-up compared to the fastest DSMC approach and a 100× speed-up over unbiased DSMC methods, while also yielding state-of-the-art performance on various benchmark tasks.
Implications
The proposed PVMC method has significant implications for the efficient training of deep state space models in various applications, including time series analysis, financial modeling, and any domain where latent state dynamics are critical. Its ability to handle both generative and discriminative tasks makes it a versatile tool for researchers and practitioners in machine learning.
CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation
Generative Models
Optimization
Computer Vision
- CAdam reduces Gaussian counts by 85%–97% compared to standard densification methods.
- The framework addresses the Densification Dilemma by leveraging statistical signal verification.
- CAdam employs a novel approach that combines momentum-based verification and context-aware selection.
- The method maintains comparable perceptual quality while improving memory efficiency.
Summary
This paper introduces CAdam, a novel framework for 3D Gaussian densification in the context of generative distillation. The authors identify a critical issue in existing optimization-based generative 3D Gaussian Splatting (3DGS) methods, termed the 'Densification Dilemma,' which arises from the stochastic nature of generative guidance. Traditional methods rely on gradient magnitude accumulation, leading to inefficient representations filled with redundant Gaussian primitives. CAdam addresses this by reinterpreting densification as a statistical signal verification problem, utilizing the first moment of gradients to distinguish between coherent geometric signals and stochastic noise. The framework incorporates three key principles: Momentum-based Signal Verification, Context-Adaptive Selection, and Selective Structural Refinement. These principles work together to enhance the efficiency of the densification process, allowing for substantial reductions in Gaussian counts while maintaining perceptual quality. Extensive experiments demonstrate that CAdam achieves a reduction of 85%–97% in Gaussian primitives compared to standard methods, showcasing its effectiveness across various generative objectives.
Methodology
CAdam employs a framework that reinterprets densification as a statistical signal verification process. It utilizes the first moment of gradients to differentiate between coherent geometric signals and stochastic noise. The methodology includes three main components: Momentum-based Signal Verification, which accumulates gradient vectors; Context-Adaptive Selection, which evaluates candidates based on quantile ranking and Signal-to-Noise Ratio (SNR); and Selective Structural Refinement, which restricts updates to verified candidates to prevent redundancy.
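The difference between accumulating gradient magnitudes and tracking the first moment of gradient vectors can be shown with a small sketch. It is an illustration under assumptions, not CAdam itself: the class `GradientMoments`, the SNR definition, the `verified_candidates` rule, and all constants are hypothetical.

```python
import math


class GradientMoments:
    """Exponential moving averages of a Gaussian's positional gradient."""

    def __init__(self, dim=3, beta=0.9):
        self.beta = beta
        self.mean = [0.0] * dim  # first moment: EMA of the gradient vector
        self.sq = 0.0            # EMA of the squared gradient norm
        self.magnitude_sum = 0.0  # what magnitude accumulation would track

    def update(self, grad):
        b = self.beta
        self.mean = [b * m + (1 - b) * g for m, g in zip(self.mean, grad)]
        norm_sq = sum(g * g for g in grad)
        self.sq = b * self.sq + (1 - b) * norm_sq
        self.magnitude_sum += math.sqrt(norm_sq)

    def snr(self, eps=1e-12):
        # Coherent part (squared mean) versus fluctuating part (variance).
        signal = sum(m * m for m in self.mean)
        noise = max(self.sq - signal, 0.0)
        return math.sqrt(signal / (noise + eps))


def verified_candidates(stats, quantile=0.5, min_snr=1.0):
    """Indices passing both a quantile cut on |mean| and an SNR threshold."""
    norms = [math.sqrt(sum(m * m for m in s.mean)) for s in stats]
    cutoff = sorted(norms)[int(quantile * (len(norms) - 1))]
    return [i for i, s in enumerate(stats)
            if norms[i] >= cutoff and s.snr() >= min_snr]


coherent, noisy = GradientMoments(), GradientMoments()
for step in range(50):
    coherent.update([1.0, 0.0, 0.0])             # consistent direction
    sign = 1.0 if step % 2 == 0 else -1.0
    noisy.update([sign, 0.0, 0.0])               # direction flips every step
selected = verified_candidates([coherent, noisy])
```

Both primitives accumulate the same gradient magnitude, so a magnitude-based rule would treat them alike, whereas the first moment cancels out for the flipping gradient and only the coherent primitive is kept.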
Results
The experiments conducted demonstrate that CAdam significantly reduces the number of Gaussian primitives required for effective representation in generative tasks, achieving reductions of 85%–97% while preserving overall perceptual quality. This indicates a marked improvement in memory efficiency and structural representation in generative distillation processes.
Implications
The findings suggest that CAdam can enhance the efficiency of generative models in 3D synthesis, making it a valuable tool for applications in computer graphics, virtual reality, and other fields requiring high-quality 3D asset generation. The approach could lead to more efficient use of computational resources in generative tasks.
Dynamic Shapley Computation
Theory
Efficient ML
Interpretability
- D-Shap transforms Shapley computation into a reusable and incremental process.
- The framework allows for efficient updates in dynamic settings, addressing both task and player changes.
- Self-valuation enables the construction of the initial Shapley matrix directly from training data.
- D-Shap achieves substantial reductions in computational costs, making it practical for real-world applications.
Summary
The paper introduces D-Shap, a novel framework for dynamic Shapley value computation, addressing the high computational cost associated with traditional Shapley-based data valuation methods in dynamic machine learning environments. Existing approaches treat Shapley computation as a one-time process, leading to inefficiencies when tasks and training players evolve. D-Shap represents Shapley values as a player-by-task matrix, allowing for efficient updates through structured matrix maintenance. By leveraging locality in task dependencies and similar valuations across tasks, D-Shap enables task-incremental and player-incremental updates without the need for global recomputation. The framework also introduces self-valuation to construct the initial matrix from training data, enhancing scalability. Experimental results demonstrate that D-Shap performs updates in milliseconds and significantly reduces the computational cost of player updates, while maintaining valuation quality comparable to full recomputation.
Methodology
D-Shap represents Shapley values as a player-by-task matrix, enabling efficient updates through structured matrix maintenance. It utilizes locality properties to focus on affected subsets during updates, allowing for task-incremental updates via interpolation and player-incremental updates confined to local matrix blocks.
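A task-incremental update by interpolation might look like the sketch below, which fills in the column for a new task from the columns of its nearest existing tasks. This is an assumption-laden illustration rather than D-Shap's procedure: the task feature vectors, the inverse-distance weights, and the function name are hypothetical.

```python
def add_task_by_interpolation(shapley, task_features, new_features, k=2):
    """Estimate per-player values for a new task from similar known tasks.

    shapley: dict mapping task name -> list of per-player Shapley values.
    task_features: dict mapping task name -> feature vector (hypothetical).
    new_features: feature vector of the new task.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Locality: only the k most similar tasks contribute.
    nearest = sorted(task_features,
                     key=lambda t: dist(task_features[t], new_features))[:k]
    weights = [1.0 / (dist(task_features[t], new_features) + 1e-9) for t in nearest]
    total = sum(weights)
    n_players = len(next(iter(shapley.values())))
    return [sum(w * shapley[t][p] for w, t in zip(weights, nearest)) / total
            for p in range(n_players)]


shapley = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [5.0, 5.0]}
task_features = {"A": [0.0], "B": [1.0], "C": [10.0]}
new_column = add_task_by_interpolation(shapley, task_features, [0.5])
```

Shapley values are linear in the utility function, so if the new task's utility is close to a weighted combination of known utilities, the same weighted combination of their columns is a reasonable estimate and no global recomputation is needed.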
Results
Experiments show that D-Shap can perform task updates in milliseconds and reduce player update costs by up to three orders of magnitude compared to traditional methods, while achieving valuation quality that is competitive with full recomputation.
Implications
The D-Shap framework has significant implications for scalable data valuation in dynamic machine learning systems, making Shapley-based methods more practical for applications such as data pricing, dataset curation, and model debugging.
The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
Optimization
Theory
Large Language Models
- GLU structure leads to improved spectral conditioning in the NTK regime.
- Faster convergence of training error is observed with GLU compared to non-GLU models.
- The generalization gap remains similar between GLU and non-GLU models.
- The primary advantage of GLU is in accelerating optimization rather than reducing generalization error.
Summary
This paper investigates the advantages of Gated Linear Units (GLU) over non-gated structures in neural networks, particularly in the context of large language models. The authors analyze two-layer networks in the Neural Tangent Kernel (NTK) regime, revealing that the GLU structure reshapes the NTK spectrum, resulting in a smaller condition number and a more compact eigenvalue distribution. This improved conditioning leads to faster convergence during training, as evidenced by a characteristic loss-crossing phenomenon between GLU and non-GLU models. Despite the acceleration in optimization, the authors find that GLU does not significantly reduce the generalization gap across various models, suggesting that its primary benefit lies in enhancing optimization rather than improving generalization. The paper contributes a theoretical framework for understanding the optimization dynamics of GLU models and provides empirical evidence supporting these claims.
Methodology
The authors employ a theoretical analysis of GLU variants within the NTK framework, focusing on the spectral properties of the NTK matrix. They analyze the optimization dynamics and training behaviors, including the loss-crossing phenomenon, and conduct empirical comparisons of training loss and generalization gap across models.
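Why a smaller condition number speeds up training can be seen on a toy quadratic, where gradient descent shrinks the error along each eigen-direction at a rate set by the corresponding eigenvalue, much as linearized NTK dynamics do. The sketch below is a generic illustration with made-up spectra, not the paper's GLU analysis.

```python
def steps_to_converge(eigenvalues, tol=1e-6, max_steps=100000):
    """Gradient descent on f(w) = 0.5 * sum(lam_i * w_i^2), starting at w = 1.

    Returns the number of steps until the loss drops below tol.
    """
    # Step size that is optimal for a quadratic with this spectrum.
    lr = 2.0 / (max(eigenvalues) + min(eigenvalues))
    w = [1.0] * len(eigenvalues)
    for step in range(1, max_steps + 1):
        w = [wi * (1.0 - lr * lam) for wi, lam in zip(w, eigenvalues)]
        loss = 0.5 * sum(lam * wi * wi for lam, wi in zip(eigenvalues, w))
        if loss < tol:
            return step
    return max_steps


well_conditioned = [1.0, 2.0, 4.0]   # condition number 4
ill_conditioned = [0.01, 1.0, 4.0]   # condition number 400
fast = steps_to_converge(well_conditioned)
slow = steps_to_converge(ill_conditioned)
```

The largest eigenvalue is the same in both cases, yet the spread-out spectrum needs far more steps because the smallest eigenvalue limits how quickly the slowest direction decays.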
Results
The analysis shows that GLU leads to a better-conditioned NTK spectrum, which facilitates faster convergence during training. Empirical results indicate that while GLU accelerates optimization, it does not significantly change the generalization gap; any improvement in overall test error therefore comes from the faster decrease in training error.
Implications
The findings suggest that incorporating GLU structures in neural network architectures can enhance training efficiency, which is particularly beneficial for large language models and other complex architectures. Understanding the optimization dynamics of GLU can inform future model designs and training strategies.
HORST: Composing Optimizer Geometries for Sparse Transformer Training
Optimization
Computer Vision
NLP
- Introduces HORST, a new optimizer that effectively combines stability and sparsity in transformer training.
- Reveals a geometric dichotomy between steepest descent and mirror descent, impacting optimizer performance.
- Demonstrates that the entropy mirror map can override the implicit bias of steepest descent optimizers.
- Shows significant performance improvements over AdamW in both vision and language tasks, especially at higher sparsity levels.
Summary
The paper addresses the challenge of sparsifying transformer models, which are critical in deep learning but often suffer from high computational costs due to their large parameter sizes. Traditional optimizers like AdamW struggle to promote both sparsity and training stability, as they exhibit an implicit L∞ bias that conflicts with the L1 bias required for sparsity. The authors propose HORST (Hyperbolic Operator for Robust Sparse Training), a novel modular optimizer that combines the stability of adaptive methods with the L1 sparsity bias through a hyperbolic mirror map. By treating optimization steps as non-commutative operators, they analyze and combine these geometries to create a more effective training regime for sparse transformers. Experimental results show that HORST significantly outperforms AdamW across various sparsity levels, particularly excelling in high-sparsity scenarios where traditional methods falter.
Methodology
The authors develop HORST by composing optimization steps from steepest descent and mirror descent geometries. They formalize the optimization process as functional operators and leverage the hyperbolic entropy mirror map to induce an L1 bias, promoting sparsity while retaining the stability of adaptive methods. The optimizer is validated through experiments on various vision and language tasks, comparing its performance against the AdamW baseline.
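A single mirror-descent step under a hyperbolic entropy mirror map can be sketched as follows. The sketch assumes the commonly used form in which the dual variable is asinh(w / beta) and the inverse map is beta * sinh(theta); whether HORST uses exactly this parameterization, and how it composes the step with adaptive updates, is not stated in the summary, so the function name, `lr`, and `beta` are illustrative.

```python
import math


def hypentropy_step(w, grad, lr=0.1, beta=1e-3):
    """One mirror-descent step with a hyperbolic entropy mirror map.

    Map to the dual space with asinh(w / beta), take a gradient step there,
    and map back with beta * sinh(theta). beta is a hypothetical scale.
    """
    return [beta * math.sinh(math.asinh(wi / beta) - lr * gi)
            for wi, gi in zip(w, grad)]


w = [1.0, 0.001]          # one large weight, one weight near the scale beta
grad = [1.0, 1.0]         # identical gradients on both coordinates
new_w = hypentropy_step(w, grad)
moved = [old - new for old, new in zip(w, new_w)]
```

For equal gradients the large weight moves much further than the small one, because the effective step scales roughly with the weight's magnitude. Small weights therefore tend to stay small, which is the kind of geometry associated with an L1-like sparsity bias.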
Results
HORST consistently outperforms AdamW across all tested sparsity levels, with particularly large gains observed in high-sparsity settings. The experiments demonstrate that HORST effectively maintains training stability while promoting sparsity, addressing a critical limitation of existing optimizers for transformer models.
Implications
The development of HORST has significant implications for the deployment of transformer models in resource-constrained environments, enabling more efficient training and inference through model sparsification. This work could lead to advancements in various applications where transformer architectures are utilized, such as natural language processing and computer vision.
CIG: Exploration via Conditional Information Gain
Reinforcement Learning
- CIG provides a new intrinsic reward that effectively combines lifelong and episodic exploration signals.
- The method is scalable to high-dimensional state spaces, overcoming limitations of previous approaches.
- CIG is evaluated across diverse tasks and consistently outperforms or matches existing exploration strategies.
- The approach is implemented without additional hyperparameters, simplifying integration into existing frameworks.
Summary
The paper introduces Conditional Information Gain (CIG), a novel intrinsic reward mechanism for exploration in reinforcement learning (RL). Traditional intrinsic rewards either focus on lifelong experience or episodic context but fail to effectively combine both. CIG addresses this gap by deriving a tractable surrogate for trajectory-level information gain, which conditions on both the replay buffer and the rollout prefix. This is achieved through a log-determinant objective over an ensemble disagreement kernel, allowing for causal per-step rewards that scale to high-dimensional state spaces. The authors evaluate CIG in a model-based RL setting across twelve tasks, including both discrete (MiniGrid) and continuous control (OGBench) environments, demonstrating its robustness against stochastic distractors and its superior performance compared to existing exploration methods.
Methodology
The authors derive the CIG reward as a log-determinant of an ensemble disagreement kernel, which allows for the calculation of causal per-step rewards. This method conditions on both the replay buffer and the rollout prefix, addressing the limitations of previous intrinsic reward mechanisms. The approach is implemented in a model-based RL framework, where short imagined rollouts are used for exploration.
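The way a log-determinant objective yields causal per-step rewards can be sketched with an incremental Cholesky factorization: each new pivot is the part of the kernel not explained by earlier states, and the logs of the pivots sum to the log-determinant. The feature construction, the kernel, the `noise` scale, and all function names below are assumptions for illustration, not the paper's definitions.

```python
import math


def disagreement_features(ensemble_preds):
    """Deviation of each ensemble member's prediction from the ensemble mean."""
    m = len(ensemble_preds)
    dim = len(ensemble_preds[0])
    mean = [sum(p[d] for p in ensemble_preds) / m for d in range(dim)]
    return [[p[d] - mean[d] for d in range(dim)] for p in ensemble_preds]


def kernel(fa, fb):
    """Hypothetical disagreement kernel: average inner product over members."""
    return sum(a * b for ra, rb in zip(fa, fb) for a, b in zip(ra, rb)) / len(fa)


def per_step_info_gain(features, noise=1.0):
    """Rewards r_t with sum(r_t) = log det(I + K / noise), computed causally."""
    chol = []     # rows of the lower-triangular Cholesky factor
    rewards = []
    for t, ft in enumerate(features):
        row = []
        for j in range(t):
            k_tj = kernel(ft, features[j]) / noise
            s = sum(row[i] * chol[j][i] for i in range(j))
            row.append((k_tj - s) / chol[j][j])
        # Pivot: what remains of this state's term after conditioning on the past.
        pivot = 1.0 + kernel(ft, ft) / noise - sum(v * v for v in row)
        row.append(math.sqrt(pivot))
        chol.append(row)
        rewards.append(math.log(pivot))
    return rewards


state_a = disagreement_features([[1.0], [-1.0]])  # members disagree
state_b = disagreement_features([[0.0], [0.0]])   # members agree
rewards = per_step_info_gain([state_a, state_a, state_b])
```

Revisiting the same uncertain state earns less than the first visit, and a state on which the ensemble agrees earns nothing. Placing replay-buffer states at the start of the sequence is one way to condition the rollout rewards on past experience as well as on the rollout prefix.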
Results
CIG was tested on twelve tasks, including both discrete and continuous control scenarios. The results indicate that CIG outperforms or matches all baseline exploration methods while demonstrating robustness to stochastic distractors. The aggregate normalized interquartile mean (IQM) for CIG exceeded that of the next-best baseline by a significant margin.
Implications
The introduction of CIG has the potential to enhance exploration strategies in reinforcement learning, particularly in complex environments where traditional methods struggle. Its ability to effectively combine different contextual signals could lead to more efficient learning and better performance in real-world applications.
Axiomatizing Neural Networks via Pursuit of Subspaces
Theory
Interpretability
- Introduces the Pursuit of Subspaces (PoS) framework as an axiomatic approach to understanding neural networks.
- Establishes four geometric axioms that explain how DNNs learn compact representations.
- Provides a unified interpretation of architectural mechanisms and their roles in representation and generalization.
- Connects existing neural architectures to a geometric foundation, facilitating the design of explainable models.
Summary
This paper addresses the opaque nature of deep neural networks (DNNs) by proposing the Pursuit of Subspaces (PoS) framework, which aims to axiomatize neural network behavior through geometric postulates. The authors argue that despite the empirical success of DNNs, there is a significant gap in understanding their internal mechanisms. The PoS framework introduces four geometric axioms that describe how DNNs learn compact data representations, providing a unified perspective on representation, computation, and generalization. The paper discusses how these axioms lead to insights into representation structure, architectural mechanisms, and generalization behavior. By modeling representations as unions of low-dimensional smooth submanifolds, the authors derive explanations for various neural network phenomena, including generalization and hallucination control. The framework also connects existing neural architectures to a geometric foundation, enabling the design of novel architectures that are inherently explainable. Overall, the PoS framework represents a significant step towards bridging the theoretical and empirical aspects of deep learning.
Methodology
The authors develop a geometric theory of deep learning, modeling representations as unions of low-dimensional smooth submanifolds. They utilize differential geometry to extend Sparse Representation theory and analyze the implications of their axioms on neural network behavior, including the roles of nonlinear activations, residual connections, and attention mechanisms.
Results
The PoS framework provides rigorous explanations for existing neural architectures and reveals that orthogonality and disentanglement are emergent properties necessary for stable projections onto learned manifolds. The framework also shows that learned transformations can generate families of manifolds through symmetry, leading to a new understanding of representation organization in DNNs.
Implications
The findings suggest that the PoS framework could enhance the interpretability and reliability of neural networks, making them more transparent and easier to understand. This could have significant implications for deploying DNNs in safety-critical applications and advancing the theoretical foundations of artificial intelligence.
Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
Theory
- Weight decay is a critical parameter influencing the transition between memorization, generalization, and collapse in transformers.
- Two online diagnostics are introduced to track training dynamics effectively and at lower computational costs.
- The study identifies a critical weight decay threshold (λ_c = 0.0158) and an empirical power-law exponent for time-to-grok.
- The findings are consistent across various model architectures, suggesting broader applicability beyond transformers.
Summary
This paper investigates the phenomenon of grokking in transformers trained on modular arithmetic, where models transition sharply between memorization, generalization, and collapse regimes. The author demonstrates that weight decay serves as a critical control parameter for these transitions and introduces two inexpensive online diagnostics (mean pairwise attention-head cosine similarity and entropy standard deviation) to monitor training dynamics based on attention activations. The study spans eleven experimental conditions and three model scales, revealing that weight decay influences the transition from memorization (λ < λ_c) to developmental grokking (λ ≥ λ_c) and ultimately to collapse (λ = 10). The critical threshold for weight decay is estimated at λ_c = 0.0158 with a power-law exponent ν = 0.757 for time-to-grok. The findings are validated across different architectures, showing that the weight-decay-controlled transition is consistent even in non-attention models. The paper emphasizes the importance of understanding these dynamics for future research in deep learning and provides a framework for diagnosing training behavior in real-time.
Methodology
The author conducted a series of experiments involving a dense weight-decay sweep and a sparse three-size scale probe to quantify the weight-decay critical threshold. Two online diagnostics were defined to measure attention-head coordination during training. The study also included cross-architecture tests with different model types to validate findings.
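The two diagnostics can be approximated in a few lines once attention maps are available. The sketch below is one plausible reading of the names: it assumes each head's attention map is flattened into a vector for the cosine similarity, and that the entropy standard deviation is taken across per-head attention entropies. The exact definitions in the paper may differ, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def mean_pairwise_cosine(head_patterns):
    """Average cosine similarity over all pairs of flattened head patterns."""
    n = len(head_patterns)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(head_patterns[i], head_patterns[j]) for i, j in pairs) / len(pairs)


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_std(head_distributions):
    """Standard deviation of attention entropy across heads."""
    ents = [entropy(d) for d in head_distributions]
    mean = sum(ents) / len(ents)
    return math.sqrt(sum((e - mean) ** 2 for e in ents) / len(ents))


identical_heads = [[0.5, 0.5], [0.5, 0.5]]
distinct_heads = [[1.0, 0.0], [0.5, 0.5]]
sim_identical = mean_pairwise_cosine(identical_heads)
sim_distinct = mean_pairwise_cosine(distinct_heads)
spread_identical = entropy_std(identical_heads)
spread_distinct = entropy_std(distinct_heads)
```

Both quantities only need the attention activations that a forward pass already produces, which is what makes this style of diagnostic cheap enough to log during training.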
Results
The results indicate a clear separation of training regimes based on weight decay, with distinct behaviors observed in memorization, developmental grokking, and collapse phases. The critical threshold for weight decay was established at λ_c = 0.0158, and the power-law fit for time-to-grok yielded an empirical exponent of ν = 0.757. The diagnostics effectively tracked training dynamics, revealing insights into attention-head coordination and differentiation phases.
Implications
The findings have significant implications for understanding the training dynamics of transformers and other neural architectures, potentially guiding the design of more efficient training protocols. The introduced diagnostics can be utilized in real-time to monitor and adjust training strategies, enhancing model performance and generalization.
AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
Reinforcement Learning
Large Language Models
Optimization
- AGPO introduces adaptive clipping and temperature sampling to improve training stability and efficiency.
- The method utilizes group-level statistics to control update magnitude and exploration dynamically.
- AGPO outperforms traditional PPO and GRPO methods on multiple benchmarks, demonstrating its effectiveness.
- The approach is critic-free, simplifying the training process while maintaining performance.
AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
Summary
The paper introduces Adaptive Group Policy Optimization (AGPO), a novel approach to reinforcement learning that enhances the training of large language models (LLMs) by addressing the limitations of traditional Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). AGPO replaces fixed clipping and decoding temperature settings with adaptive mechanisms driven by group-level statistics, which allows for more stable and efficient training. The method employs two controllers: one for adaptive clipping based on reward dispersion and policy entropy, and another for bidirectional adaptive temperature sampling that adjusts decoding temperature according to uncertainty. The authors demonstrate AGPO's effectiveness on nine math and STEM benchmarks in both English and Chinese, showing significant performance improvements over PPO and GRPO while maintaining the same token budget. The results indicate that AGPO not only enhances the performance of the Qwen2.5-14B model but also transfers gains to other models like Llama-3-8B and Gemma-2-9B. The implementation of AGPO is publicly available, promoting further research and application in the field.
Methodology
AGPO employs a critic-free refinement of GRPO, utilizing a shared statistical state derived from group-level statistics to control adaptive clipping and temperature sampling. The method involves sampling multiple rollouts for a single prompt, calculating rewards, and normalizing advantages to guide policy updates. The adaptive clipping mechanism adjusts the clipping radius based on various statistical measures, while the temperature sampling adjusts decoding temperature according to uncertainty relative to a baseline.
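The rollout, normalize, and clip steps described above can be sketched in NumPy as follows. The group-normalized advantage and the clipped surrogate are the standard GRPO/PPO forms. The `adaptive_clip_radius` rule, its parameters, and the choice to widen the radius with reward dispersion are hypothetical stand-ins, since the summary does not give AGPO's actual controller, which also uses policy entropy.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of rollouts (critic-free)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def adaptive_clip_radius(rewards, base=0.2, gain=0.5, lo=0.05, hi=0.4):
    """Hypothetical rule: scale the clip radius with reward dispersion."""
    dispersion = float(np.std(rewards))
    return float(np.clip(base * (1.0 + gain * dispersion), lo, hi))

def clipped_surrogate(ratio, adv, radius):
    """PPO-style clipped objective, averaged over the group."""
    ratio = np.asarray(ratio, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - radius, 1.0 + radius) * adv
    return float(np.minimum(unclipped, clipped).mean())
```

With a fixed `radius` this reduces to the usual clipped objective, so the adaptive part is isolated in one small function that can be swapped for any other group-statistic rule.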
Results
AGPO achieved 67.3% on the GSM8K benchmark and 40.5% on the MATH benchmark, outperforming both PPO and GRPO under the same token generation budget. The improvements were consistent across nine benchmarks and were also observed in other models like Llama-3-8B and Gemma-2-9B, confirming the robustness of the approach.
Implications
The findings suggest that AGPO can significantly enhance the training of large language models, making it a valuable tool for researchers and practitioners in reinforcement learning and natural language processing. Its adaptive mechanisms could lead to more efficient training protocols and better model performance in various applications.
Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine
Generative Models
Theory
Efficient ML
- Introduces Score-induced Latent Diffusion (SiLD) as a two-stage learning framework for diffusion models.
- Proves convergence guarantees and establishes that sample complexity depends on intrinsic dimension, not ambient dimension.
- Demonstrates empirical success on various datasets, outperforming VAE-based latent diffusion models.
- Establishes a novel training strategy that integrates manifold learning and density estimation under a single objective.
Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine
Summary
This paper addresses the theoretical foundations of training diffusion models, particularly in the context of the manifold hypothesis, which posits that high-dimensional data resides on low-dimensional manifolds. The authors introduce a novel two-stage framework called Score-induced Latent Diffusion (SiLD), which leverages a collapse-and-refine mechanism driven by the geometry of the score function. In the first stage, the model learns the manifold structure at low noise levels, while in the second stage, it refines the intrinsic density on the learned manifold. The authors provide convergence guarantees for both stages, demonstrating that the sample complexity is dependent on the intrinsic dimension rather than the ambient dimension. Empirical results on datasets such as Stacked MNIST, CelebA, and molecular generation benchmarks show that SiLD matches or outperforms existing VAE-based latent diffusion models in terms of generation quality and reconstruction accuracy, validating the theoretical predictions of the framework.
Methodology
The authors propose a two-stage training approach where the first stage focuses on learning the geometry of the data manifold using low-noise score matching, and the second stage refines the density estimation on the learned manifold. This is achieved through a denoising score matching objective, avoiding the need for auxiliary losses or heuristic regularization typically used in VAE-based models.
Results
The theoretical analysis provides convergence rates for both stages of the SiLD framework, with empirical validation showing that SiLD achieves high-quality generative performance on benchmark datasets, matching or exceeding the performance of traditional VAE-based latent diffusion models. The results confirm that the score matching objective is sufficient for effective manifold learning and density estimation.
Implications
The findings suggest that diffusion models can be trained more efficiently by leveraging the manifold structure of data, potentially leading to advancements in generative modeling applications across various domains, including image synthesis and molecular generation.
Latent Process Generator Matching
Generative Models
Theory
Optimization
- Introduces latent process generator matching, extending generator matching theory to time-dependent latent processes.
- Allows for learning generators of stochastic processes that match one-time marginal distributions on the image space.
- Generalizes existing methods by accommodating a wider variety of latent spaces, including continuous and manifold-valued processes.
- Provides sufficient conditions for valid loss functions, recovering results from previous works as corollaries.
Latent Process Generator Matching
Summary
The paper introduces a novel framework called latent process generator matching, which extends existing generator matching theories to accommodate time-dependent latent processes. Traditional flow-matching and diffusion models often rely on auxiliary stochastic dynamics during training, which complicates the generation process as these auxiliary states may be intractable or irrelevant at generation time. The authors propose a method where the observed generative state is treated as a deterministic image of a tractable Markov process. This allows for the learning of a generator on the image space that matches the one-time marginal distributions of the projected process. The framework generalizes previous work by enabling the use of a broader class of latent processes, including continuous, discrete, or manifold-valued processes. The paper also discusses sufficient conditions for valid loss functions in this context and presents a potential application in protein structure generation, where the model distinguishes between chain-level rigid-body motion and internal flexibility.
Methodology
The authors develop a framework that utilizes a time-inhomogeneous Feller process on an arbitrary state space and a mapping to learn a linear parametrization of the generator of a Feller process. This approach allows for the training of a neural network against conditional generators, leading to the recovery of the correct marginal generator through a gradient equality.
Results
The proposed framework successfully generalizes previous discrete latent process results and demonstrates that the learned process at time t=1 can sample from the desired data distribution. The sufficient conditions for valid loss functions are established, confirming the applicability of the framework to various generative modeling scenarios.
Implications
This work has significant implications for generative modeling, particularly in fields requiring complex latent structures, such as biology (e.g., protein structure generation) and other domains where understanding the dynamics of latent processes is crucial. The framework could lead to the development of more efficient and flexible generative models.
Variance Reduction for Expectations with Diffusion Teachers
Generative Models
Optimization
Efficient ML
- Introduction of CARV, a framework for variance reduction in diffusion teacher gradients.
- Hierarchical Monte Carlo estimator that amortizes expensive computations over cheaper resamples.
- Significant variance reduction achieved through timestep importance sampling and stratified sampling.
- Demonstrated 2-3x effective compute multipliers in text-to-3D and attribution tasks.
Variance Reduction for Expectations with Diffusion Teachers
Summary
This paper addresses the challenge of high computational costs associated with Monte Carlo (MC) expectations in downstream applications utilizing pretrained diffusion models as teachers. The authors introduce CARV, a compute-aware variance-accounting framework that proposes a hierarchical MC estimator to reduce variance while maintaining computational efficiency. By amortizing expensive upstream computations over cheaper diffusion noise resamples, the framework employs techniques such as timestep importance sampling and stratified-inverse-CDF sampling. The experiments conducted in text-to-3D distillation and data attribution demonstrate that CARV achieves effective compute multipliers of 2-3 times, primarily through amortized reuse, while also significantly reducing gradient variance in single-step distillation. However, it is noted that the reduction in variance does not translate to improved downstream performance in terms of Fréchet Inception Distance (FID). The paper emphasizes the need for a principled approach to variance reduction in teacher-guided optimization and presents a comprehensive evaluation of the proposed methods.
Methodology
The authors developed a hierarchical Monte Carlo estimator that caches expensive computations and resamples cheaper diffusion noise. They implemented timestep importance sampling based on explicit teacher weights and combined it with stratified-inverse-CDF sampling to optimize variance reduction. The CARV framework was evaluated across various applications, including text-to-3D distillation and data attribution.
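A minimal sketch of the two ingredients described above, under stated assumptions: `stratified_timesteps` draws timestep indices in proportion to given weights by pushing one uniform draw per equal-width stratum through the inverse CDF, and `amortized_estimate` computes an expensive upstream quantity once per outer sample and reuses it across several cheap inner resamples. The function names and the `expensive` / `cheap` callables are hypothetical placeholders, not CARV's actual interface.

```python
import numpy as np

def stratified_timesteps(weights, n, rng):
    """Sample n timestep indices with probability proportional to weights."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    cdf = np.cumsum(p)
    # One uniform draw inside each of n equal-width strata of [0, 1).
    u = (np.arange(n) + rng.random(n)) / n
    idx = np.searchsorted(cdf, u, side="right")
    return np.minimum(idx, len(p) - 1)   # guard against round-off in cdf[-1]

def amortized_estimate(expensive, cheap, weights, n_outer, n_inner, rng):
    """Hierarchical Monte Carlo: cache the costly part, resample the cheap part."""
    total = 0.0
    for _ in range(n_outer):
        cached = expensive(rng)                        # computed once per outer sample
        ts = stratified_timesteps(weights, n_inner, rng)
        total += np.mean([cheap(cached, t, rng) for t in ts])
    return total / n_outer
```

Because each stratum contributes exactly one draw, the sampled timesteps cover the weight distribution more evenly than independent draws would, which is the usual motivation for stratification.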
Results
The CARV framework achieved effective compute multipliers of 2-3 times in text-to-3D distillation and attribution tasks, primarily due to amortized reuse of computations. In single-step distillation, the variance of gradients was reduced by an order of magnitude, although this did not lead to improvements in downstream FID metrics.
Implications
The findings suggest that variance reduction techniques can significantly enhance the efficiency of using diffusion models as teachers in various applications. The CARV framework provides a structured approach to optimizing computational resources while maintaining the integrity of the learning objectives, which could be beneficial for practitioners in the field of generative modeling and related areas.
A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation
Theory
- Introduces a unified formulation for representation learning that includes both task and constraint components.
- Emphasizes the mutual benefits of integrating causal representation learning with traditional representation learning.
- Demonstrates through experiments that the effectiveness of causal constraints varies significantly with different task formulations.
- Clarifies the relationship between CRL and traditional representation learning, promoting better communication and collaboration between the two fields.
A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation
Summary
This paper addresses the divergence between causal representation learning (CRL) and traditional representation learning, which have evolved along different paths, one being more application-driven and the other more theory-driven. The authors propose a unified formulation that characterizes representation learning through two components: a task component, which defines the information the representation must preserve, and a constraint component, which imposes structure on the latent space. This formulation aims to bridge the gap between the two fields, allowing for mutual benefits. The authors argue that CRL can provide theoretical insights into when structured constraints are beneficial, while traditional representation learning can inform practical task design. Through empirical studies on the CausalVerse benchmark, they demonstrate that the effectiveness of causal constraints is highly dependent on the specific task they are paired with, highlighting the importance of task-constraint interactions in representation learning.
Methodology
The authors propose a unified framework for representation learning that consists of a task component and a constraint component. They conduct empirical studies using the CausalVerse benchmark to analyze how different task formulations (reconstruction-based, contrastive, and masked prediction) interact with structured constraints in CRL. The methodology involves comparing the performance of various task-constraint combinations to assess their effectiveness.
Results
The experimental results indicate that the effectiveness of causal constraints is significantly influenced by the specific task they are associated with. This finding underscores the importance of considering both task and constraint components in the design of CRL methods, suggesting that the practical performance of these methods is not solely determined by constraints but also by the nature of the tasks.
Implications
The proposed unified formulation can enhance communication between the causal and traditional representation learning communities, fostering collaborative advancements. The insights gained from this study may lead to improved CRL methods that are more effective in real-world applications by better aligning task objectives with structured constraints.
Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers
NLP
Large Language Models
Efficient ML
- Introduces a systematic analysis of triangular matrix inversion methods for Delta-Rule Linear Transformers.
- Highlights the importance of numerical stability in maintaining model accuracy during matrix inversion.
- Demonstrates significant performance improvements with up to 4.3× speed-up on NPUs compared to existing methods.
- Focuses on leveraging hardware efficiency through matrix product-rich algorithms.
Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers
Summary
This paper addresses the performance bottleneck associated with triangular matrix inversion in Delta-Rule Linear Transformers, which are integral to efficient long-context architectures in machine learning. The authors conduct a systematic analysis of both direct and iterative triangular inversion algorithms, emphasizing their numerical stability, computational complexity, and hardware efficiency. The study reveals that the inversion operation is sensitive to numerical errors, which can severely impact model accuracy. By evaluating various triangular inversion methods, the authors demonstrate that their proposed algorithms can leverage modern hardware capabilities effectively. Experimental results indicate that their approach achieves up to 4.3× speed-up compared to existing state-of-the-art implementations, while maintaining full end-to-end model accuracy. This work not only enhances the performance of linear attention mechanisms but also contributes to the broader understanding of numerical stability in machine learning applications.
Methodology
The authors analyze both direct and iterative algorithms for triangular matrix inversion, focusing on their numerical stability and computational efficiency. They conduct performance benchmarks on NPUs and evaluate the algorithms under low-precision floating-point representations to assess their practical applicability.
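To make the idea of a matrix-product-rich inversion concrete, here is one textbook scheme, not necessarily any of the algorithms evaluated in the paper. It assumes the matrix is unit lower triangular, L = I + N with N strictly lower triangular, so N is nilpotent and the Neumann series for the inverse terminates. The series can then be evaluated with roughly log2(n) squarings using (I + N)^(-1) = (I - N)(I + N^2)(I + N^4)... The function name is hypothetical.

```python
import numpy as np

def inv_unit_lower(L):
    """Invert a unit lower-triangular matrix using only matrix products.

    Assumes L has ones on the diagonal and zeros above it.
    """
    n = L.shape[0]
    I = np.eye(n, dtype=L.dtype)
    N = L - I              # strictly lower triangular, so N^n = 0
    inv = I - N            # covers the series terms of degree 0 and 1
    P = N @ N              # current power N^k, starting at k = 2
    k = 2
    while k < n:           # after each step, terms of degree < 2k are covered
        inv = inv @ (I + P)
        P = P @ P
        k *= 2
    return inv
```

Compared with forward substitution, this trades extra floating-point operations for large dense products that map well onto matrix units. Repeated squaring can amplify round-off when entries of N are large, which is one reason numerical stability matters in low precision.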
Results
The proposed algorithms achieve up to 4.3× speed-up in triangular matrix inversion compared to state-of-the-art implementations, significantly improving performance at the layer level while preserving end-to-end model accuracy.
Implications
The findings suggest that optimizing triangular matrix inversion can lead to more efficient implementations of linear attention mechanisms in large language models, potentially enhancing their scalability and performance in real-world applications.
Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
NLP
Large Language Models
Efficient ML
- WEASEL is the first data selection approach designed for offline web agent training, focusing on out-of-domain generalization and training efficiency.
- The method employs a greedy algorithm to optimize trajectory selection based on importance and diversity.
- Target-centered AXTree pruning is introduced to enhance training efficiency by removing irrelevant content.
- The approach includes generating style-consistent reasoning traces to improve performance in reasoning-native models.
Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
Summary
The paper introduces WEASEL, a novel trajectory selection method aimed at enhancing the out-of-domain generalization capabilities of web agents trained using large language models (LLMs). Traditional web agents often struggle to adapt to new domains due to their training on specific trajectories, which leads to performance drops in unseen environments. WEASEL addresses this issue by selecting a fixed-budget subset of trajectory steps that balances unary importance (goal relevance) with pairwise diversity (coverage of different states and interaction patterns). The authors employ a greedy algorithm for efficient subset selection and introduce target-centered AXTree pruning to retain only the most relevant content around the action target, thereby improving training efficiency. Additionally, to mitigate style mismatch in reasoning-native models, the authors propose generating style-consistent rationales instead of relying on expert traces. The effectiveness of WEASEL is demonstrated through extensive evaluations on datasets like AgentTrek and NNetNav, showing significant improvements in out-of-domain performance and training speed, achieving up to 12.5 times faster training compared to standard fine-tuning methods.
Methodology
WEASEL formulates a fixed-budget subset selection problem that balances unary importance and pairwise diversity. A greedy algorithm is used for efficient selection from large trajectory collections. Target-centered AXTree pruning is applied to retain only relevant content, and style-consistent reasoning traces are generated to address style mismatches in training data.
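The fixed-budget selection step can be sketched as a greedy loop. This is a minimal illustration under assumptions, not WEASEL's actual objective: it scores a subset as the sum of unary importance minus a `trade_off` multiple of summed pairwise similarity, and at each step adds the item with the largest marginal gain. The function name, the `trade_off` parameter, and the symmetric similarity matrix are all hypothetical.

```python
import numpy as np

def greedy_select(importance, similarity, budget, trade_off=1.0):
    """Greedily pick up to `budget` items: high importance, low mutual similarity.

    importance: shape (n,); similarity: symmetric array of shape (n, n).
    """
    importance = np.asarray(importance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    selected = []
    penalty = np.zeros(len(importance))   # summed similarity to already-chosen items
    candidates = set(range(len(importance)))
    while candidates and len(selected) < budget:
        best = max(candidates, key=lambda i: importance[i] - trade_off * penalty[i])
        selected.append(best)
        candidates.remove(best)
        penalty += similarity[best]       # update marginal gains for the rest
    return selected
```

For example, with two near-duplicate high-importance steps and one distinct lower-importance step, a budget of two keeps one of the duplicates and the distinct step rather than both duplicates.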
Results
WEASEL significantly improves out-of-domain performance, achieving a +4.8 gain in zero-shot transfer from AgentTrek to WebArena-Lite. It also provides training speedups of approximately 9.7 to 12.5 times compared to traditional fine-tuning methods, demonstrating both enhanced generalization and efficiency.
Implications
The findings suggest that WEASEL can be effectively utilized in training web agents for diverse and dynamic environments, potentially leading to more robust and adaptable AI systems capable of handling real-world tasks with varying interaction patterns.
Behavior-Consistent Deep Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduction of behavior-consistent reinforcement learning (BRL) as a new framework.
- Establishment of a theoretical link between policy divergence and Q-function disagreement.
- Identification of challenges in high-entropy maximum-entropy RL.
- Development of Q-value Expectile Disagreement (QED) for improved behavioral consistency.
Behavior-Consistent Deep Reinforcement Learning
Summary
This paper addresses the issue of high variance in reinforcement learning (RL) training runs, which leads to inconsistent policy performance and behavior. The authors introduce the concept of behavior-consistent RL, aiming to develop policies that are both high-performing and distributionally similar across different training runs. They leverage maximum-entropy RL to control behavioral divergence by anchoring training runs to a common prior. The authors prove that for Boltzmann policies, adjusting the temperature based on Q-function disagreement can bound the divergence between policies. They propose a novel method called Q-value Expectile Disagreement (QED), which employs a state-dependent temperature schedule to enhance behavioral consistency. Empirical evaluations across 18 continuous-control tasks demonstrate that QED significantly reduces policy divergence while maintaining competitive performance, thereby improving stability and replicability in RL applications.
Methodology
The authors formalize behavior-consistent RL and utilize maximum-entropy RL principles to control policy divergence. They introduce QED, which adapts the temperature in RL based on Q-function disagreement, allowing for early consistency and later convergence during training.
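The link between Q-function disagreement and temperature can be illustrated with a small discrete-action sketch. Everything specific here is an assumption: the paper targets continuous control and defines disagreement through expectiles, whereas this sketch uses the standard deviation across a hypothetical ensemble of Q estimates and a linear temperature rule with made-up `base` and `gain` parameters.

```python
import numpy as np

def boltzmann(q, temperature):
    """Boltzmann (softmax) policy over action values at one state."""
    z = (q - q.max()) / temperature      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def disagreement_temperature(q_ensemble, base=0.1, gain=1.0):
    """Hypothetical rule: higher ensemble spread gives a higher temperature.

    q_ensemble: array of shape (members, actions) for a single state.
    """
    spread = q_ensemble.std(axis=0).mean()
    return base + gain * float(spread)
```

When the ensemble members agree, the temperature stays near `base` and the policy is sharply peaked. When they disagree, the policy moves toward the uniform prior, which mirrors the early-consistency, later-convergence behavior described above.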
Results
QED reduces pairwise policy divergence across independent runs by up to two orders of magnitude while maintaining competitive performance. It also results in approximately 50% reduction in return variance, leading to more stable and consistent policy behaviors.
Implications
The findings suggest that controlling behavioral consistency is crucial for reliable RL applications in real-world scenarios, such as robotics and AI systems, where consistent behavior is as important as performance. This work provides a new perspective on stability and replicability in RL research and applications.
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Multimodal
NLP
Time Series
- Chronicle is the first model to jointly pretrain on text and time series from scratch.
- It utilizes a shared transformer architecture for both modalities, enhancing cross-domain representation learning.
- Chronicle achieves competitive performance against state-of-the-art unimodal foundation models.
- The model sets new benchmarks for frozen-embedding time series classification and multimodal forecasting.
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Summary
Chronicle is a novel multimodal foundation model designed to jointly understand natural language and time series data. Unlike existing models that adapt pretrained language models post hoc, Chronicle is trained from scratch on both modalities within a unified architecture, allowing for mutual shaping of representations. The model employs a compact 324M-parameter decoder-only transformer, where both text and time series share the same transformer blocks and attention mechanisms. The pretraining process primarily utilizes unimodal batches, with a brief alignment stage for cross-modal integration. Chronicle is the first model to be evaluated against dedicated foundation models in both language and time series domains, demonstrating its capabilities across various benchmarks. The results show that Chronicle matches or surpasses existing models in language understanding tasks and sets new performance standards for time series classification and multimodal forecasting, all while maintaining a single backbone architecture.
Methodology
Chronicle employs a decoder-only transformer architecture with 324 million parameters, trained from scratch on both natural language and time series data. The training process involves using unimodal batches predominantly, followed by a short alignment stage to interleave text and time series for cross-modal integration. This approach allows the model to learn shared representations without the need for large pretrained models.
Results
Chronicle matches the performance of Gemma-3-270M-PT on 19 natural language understanding tasks, sets a new benchmark for frozen-embedding time series classification on 24 UCR/UEA datasets, and outperforms all supervised fusion baselines on the Time-MMD multimodal forecasting task.
Implications
The development of Chronicle suggests that joint training of language and time series data can lead to improved model performance across various tasks. This approach may have significant implications for applications in forecasting, classification, and embedding extraction, particularly in domains where textual context is critical for understanding time series data.
Nonparametric Learning and Earning with One-Point Feedback under Nonstationarity
Optimization
Theory
- Proposes a nonparametric learning framework for dynamic pricing under nonstationarity.
- Utilizes one-point feedback for revenue-based gradient approximations.
- Incorporates a restarting mechanism to adapt to changing market conditions.
- Introduces a meta-learning layer to handle unknown nonstationarity levels.
Nonparametric Learning and Earning with One-Point Feedback under Nonstationarity
Summary
This paper addresses the challenges of dynamic pricing in nonstationary environments where firms can only observe revenue from a single posted price each period. The authors propose a nonparametric learning framework that utilizes revenue-based gradient approximations derived from one-point feedback. To adapt to changing market conditions, they introduce a restarting mechanism that refreshes the learning process, allowing the model to discount outdated information. Additionally, a meta-learning layer is incorporated to hedge against unknown degrees of nonstationarity by aggregating multiple restarting schedules. The authors provide performance guarantees for their approach, demonstrating that cumulative revenue loss relative to a fully informed benchmark is influenced by the time horizon and market variation. Simulation experiments with both synthetic and real-world data validate the effectiveness of the proposed methods, showcasing their potential for improving revenue management strategies in dynamic pricing scenarios.
Methodology
The authors develop a hierarchical framework that integrates a forgetting principle with a one-point feedback gradient estimation method. This approach allows for the construction of nearly unbiased gradient estimators from single revenue observations, while the restarting mechanism periodically resets the learning process to better track changes in the nonstationary environment. A meta-learning layer aggregates multiple restarting strategies to adaptively hedge against variations in market conditions.
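A one-dimensional sketch of one-point feedback with restarts, under stated assumptions: the firm posts a randomly perturbed price, observes only that period's revenue, and forms the classical one-point gradient estimate `revenue * u / delta`. The restart here simply resets the step-size schedule every `restart_every` periods. The function name, all parameter values, and the restart rule are hypothetical, and the meta-learning layer over several restart schedules is omitted.

```python
import numpy as np

def one_point_pricing(revenue, horizon, restart_every, lo, hi,
                      delta=0.2, step0=0.05, seed=0):
    """Gradient ascent on revenue using one revenue observation per period.

    revenue: callable (price, t) -> observed revenue in period t.
    Assumes hi - lo > 2 * delta so posted prices stay inside [lo, hi].
    """
    rng = np.random.default_rng(seed)
    p = 0.5 * (lo + hi)                      # base price before perturbation
    posted = np.empty(horizon)
    for t in range(horizon):
        s = t % restart_every                # periods since the last restart
        step = step0 / np.sqrt(s + 1)        # schedule is reset at each restart
        u = rng.choice([-1.0, 1.0])          # random perturbation direction
        posted[t] = p + delta * u
        r = revenue(posted[t], t)            # the only feedback observed
        grad_est = r * u / delta             # one-point gradient estimate
        p = float(np.clip(p + step * grad_est, lo + delta, hi - delta))
    return posted
```

Shorter restart windows discard stale information faster at the cost of re-learning more often, which is the trade-off the aggregation over restart schedules is meant to hedge.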
Results
The proposed framework shows that cumulative revenue loss can be minimized relative to a fully informed benchmark, with performance guarantees depending on the time horizon and the magnitude of market variation. Simulation results indicate that the methods are effective in adapting to nonstationary environments and improving revenue outcomes.
Implications
The findings suggest that firms can enhance their dynamic pricing strategies by implementing the proposed learning framework, particularly in environments characterized by rapid changes in customer preferences and market conditions. This approach could lead to more effective revenue management practices across various industries.
Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers
NLP
Large Language Models
Efficient ML
- Introduces a plug-and-play framework for spiking operators in Transformers.
- Decomposes nonlinear computations into spike-friendly primitives.
- Supports common Transformer nonlinearities without fine-tuning.
- Demonstrates minimal accuracy loss (<1%) across various tasks.
Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers
Summary
This paper addresses the limitations of current ANN-to-SNN conversion methods for Transformer architectures, particularly the lack of efficient implementations for nonlinear operations essential for spiking neural networks (SNNs). The authors propose a plug-and-play framework that provides spike-friendly approximations for key nonlinear operators such as division, exponentiation, and ℓ2 norms, which are critical for operations like Softmax and normalization. By utilizing population computation with leaky integrate-and-fire (LIF) neuron groups and lightweight bit-shift scaling, the proposed method allows for the integration of these nonlinearities into existing ANN-to-SNN pipelines without requiring fine-tuning. The authors demonstrate that their approach incurs less than a 1% accuracy drop across various tasks when replacing targeted nonlinear operators in large language models (LLMs). This work significantly enhances the compatibility of spiking Transformers with neuromorphic hardware, enabling more efficient and practical implementations of large-scale language models in energy-constrained environments.
Methodology
The authors developed a framework that approximates nonlinear operations in Transformers using spike-based computations. They identified three key primitives (division, exponentiation, and ℓ2 norms) and implemented them using LIF neuron groups and bit-shift scaling to avoid floating-point arithmetic. The modular operator blocks created can be integrated into existing ANN-to-SNN conversion pipelines seamlessly.
Results
The proposed method was empirically validated on multiple large language models, showing that the selective replacement of nonlinear operators led to less than a 1% drop in accuracy across all evaluated tasks. The framework was successfully integrated into two widely used ANN-to-SNN conversion frameworks, demonstrating broad applicability.
Implications
This work has significant implications for the deployment of large language models on neuromorphic hardware, potentially leading to more energy-efficient AI systems. The ability to implement spiking Transformers without extensive retraining opens new avenues for research and application in energy-constrained environments.
Divide and Contrast: Learning Robust Temporal Features without Augmentation
Time Series
- Di-COT eliminates the need for data augmentation and multiple encoder passes, reducing computational overhead.
- The method contrasts overlapping sub-blocks within time-series instances, ensuring meaningful representation learning.
- Di-COT reformulates temporal contrastive learning as a cross-entropy classification task for dense supervision.
- The framework achieves state-of-the-art performance on multiple benchmarks while maintaining low training times.
Divide and Contrast: Learning Robust Temporal Features without Augmentation
Summary
The paper presents Divide and Contrast (Di-COT), a novel self-supervised framework for learning time-series representations that eliminates the need for data augmentation and multiple encoder passes. Di-COT addresses the limitations of existing methods, which often incur high computational costs and rely on assumptions about temporal dynamics that may not hold across diverse datasets. By stochastically partitioning each time-series window into overlapping sub-blocks, Di-COT contrasts these substructures within a single instance, treating adjacent sub-blocks as positive pairs while designating others as negatives. This approach mitigates false positives during temporal transitions and enhances the scalability of the contrastive objective, making loss computation independent of sequence length. The authors demonstrate that Di-COT achieves state-of-the-art performance on various classification, clustering, kNN, and cross-dataset transfer tasks across six large-scale datasets and the UCR and UEA benchmarks, while significantly reducing training time.
Methodology
Di-COT employs a stochastic partitioning strategy to divide time-series windows into overlapping sub-blocks. It contrasts these sub-blocks within individual instances, treating temporally adjacent blocks as positive pairs and others as negatives. The loss computation is designed to be independent of sequence length, enhancing scalability and efficiency.
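The partition-and-contrast idea described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`partition`, `adjacent_ce_loss`), the fixed block length and stride, the mean-pooling stand-in for the encoder, and the choice of the right-hand neighbour as the only positive are all assumptions.

```python
import numpy as np

def partition(window, block_len, stride, rng):
    """Split a (T, C) window into overlapping sub-blocks, starting at a random offset."""
    offset = int(rng.integers(0, stride))
    starts = range(offset, window.shape[0] - block_len + 1, stride)
    return np.stack([window[s:s + block_len] for s in starts])

def adjacent_ce_loss(emb, temperature=0.1):
    """Cross-entropy in which block i must pick block i+1 among all other blocks."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    logits = emb @ emb.T / temperature
    np.fill_diagonal(logits, -np.inf)      # a block is never its own candidate
    logits = logits[:-1]                   # the last block has no right neighbour
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.arange(1, emb.shape[0])   # index of the adjacent (positive) block
    return float(-log_prob[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
window = rng.standard_normal((128, 3))     # hypothetical window: 128 steps, 3 channels
blocks = partition(window, block_len=32, stride=16, rng=rng)  # stride < block_len gives overlap
emb = blocks.mean(axis=1)                  # stand-in encoder: mean over time
loss = adjacent_ce_loss(emb)
```

Because all sub-blocks come from one window, a single encoder pass over `blocks` is enough to build every positive and negative pair in this sketch.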
Results
Di-COT demonstrated superior performance in classification, clustering, kNN, and cross-dataset transfer tasks, achieving state-of-the-art results on six large-scale datasets and the UCR and UEA benchmarks, while significantly reducing training time compared to existing methods.
Implications
The findings suggest that Di-COT can be effectively applied to various time-series analysis tasks, potentially improving the efficiency and effectiveness of self-supervised learning in domains where labeled data is scarce or expensive to obtain.
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
Reinforcement Learning
Large Language Models
Efficient ML
- Introduces a structural decomposition of MXFP4 quantization error into three components: scale bias, deadzone truncation, and grid noise.
- Demonstrates that each error component corresponds to specific RL failure modes affecting training outcomes.
- Proposes targeted corrections for each failure mode, improving the accuracy of RL post-training.
- Empirical results show significant recovery of accuracy in large language models post-quantization.
Summary
This paper addresses the significant accuracy degradation caused by quantization errors in MXFP4 arithmetic during reinforcement learning (RL) post-training of large language models (LLMs). The authors present a novel three-way decomposition of quantization error into scale bias, deadzone truncation, and grid noise, each affecting RL training in distinct ways. Scale bias accumulates through the backward pass, impacting gradient accuracy; deadzone truncation reduces rollout quality by zeroing small values; and grid noise raises policy entropy. The authors propose targeted corrections for each failure mode: Macro-block scaling for scale bias, Outlier Fallback for deadzone recovery, and Adaptive Quantization Noise for controlling policy entropy. Empirical validation on Qwen2.5-3B and Qwen3-30B-A3B-Base models shows that these corrections can recover BF16 accuracy to within 0.7 and 3.0 percentage points, respectively, demonstrating the effectiveness of their approach in mitigating quantization errors in RL.
Methodology
The authors conducted a theoretical and empirical analysis to decompose the quantization error into three additive components. They then developed specific corrections aimed at mitigating the impact of each error component on RL training. The effectiveness of these corrections was validated through experiments on two different model architectures, comparing the accuracy of quantized models against their BF16 counterparts.
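To make the deadzone effect concrete, the sketch below fake-quantizes a vector to an MXFP4-style format and counts how many non-zero inputs are rounded to zero. It assumes the usual Microscaling layout (E2M1 elements, 32-element blocks, one shared power-of-two scale per block); the function name `mxfp4_quantize`, the simple nearest-value rounding, and the deadzone measurement are illustrative choices and do not reproduce the paper's three-way decomposition or its corrections.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x, block=32):
    """Fake-quantize a 1-D array using one power-of-two scale per block."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)             # avoid log2(0) for all-zero blocks
    # Shared exponent: floor(log2(amax)) minus the largest E2M1 exponent (2).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(np.abs(x) / scale, 0.0, 6.0)     # saturate at the largest grid value
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)  # nearest grid point
    return (np.sign(x) * FP4_GRID[idx] * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)                         # hypothetical weight vector
q = mxfp4_quantize(w)
# Fraction of non-zero inputs that the quantizer maps to exactly zero.
deadzone_fraction = float(np.mean((q == 0) & (w != 0)))
```

In this sketch any element whose scaled magnitude falls below 0.25 (half the smallest non-zero grid value) is zeroed, so blocks dominated by one large value lose their small entries entirely.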
Results
The targeted corrections successfully recovered BF16 accuracy to within 0.7 percentage points for the Qwen2.5-3B model and 3.0 percentage points for the Qwen3-30B-A3B-Base model, demonstrating the practical effectiveness of their decomposition and correction strategies.
Implications
This work has significant implications for the deployment of large language models in resource-constrained environments, where quantization is essential for efficiency. By understanding and addressing the specific components of quantization error, researchers and practitioners can enhance the performance of RL-trained models, making them more viable for real-world applications.
OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization
Federated Learning
Theory
Optimization
- Introduces a unified framework that integrates centralized and federated learning.
- Utilizes intermediate supervision and regularization to address optimization challenges.
- Provides theoretical guarantees for convergence and gradient alignment.
- Demonstrates significant performance improvements in both CL and FL settings.
Summary
The paper presents OmniISR, a unified framework designed to integrate centralized learning (CL) and federated learning (FL) in the context of edge intelligence applications, such as autonomous driving. The authors highlight the challenges posed by differing legal frameworks that either allow or restrict data aggregation, necessitating a flexible approach to model training. OmniISR addresses the incompatibility of optimization regimes between CL and FL by introducing intermediate supervision and regularization (ISR) signals at multiple hidden layers of neural networks. The framework uses mutual information (MI) to align internal covariate shifts in CL with client-drifting representations in FL, and negative entropy (NE) as a regularizer to mitigate overconfidence in predictions. Theoretical contributions include a non-asymptotic convergence bound, a federated drift bound, and guarantees for gradient alignment between CL and FL updates. Empirical results demonstrate that OmniISR significantly enhances model performance across various architectures and datasets, effectively bridging the performance gap between CL and FL.
Methodology
The OmniISR framework employs intermediate supervision and regularization strategies, using mutual information for alignment and negative entropy for regularization. The framework is designed to be architecture-agnostic, so it can be applied across various neural network architectures without architecture-specific adjustments. The theoretical analysis derives convergence bounds and quantifies drift, while extensive experiments validate the framework's effectiveness across multiple datasets and learning algorithms.
Results
OmniISR consistently outperformed existing methods in both centralized and federated learning paradigms, achieving a 22.60% reduction in the performance gap between CL and FL. The framework yielded 37 out of 48 paired metric wins across different federated learning algorithms, demonstrating its robustness and effectiveness.
Implications
The OmniISR framework has significant implications for the deployment of machine learning models in environments with varying data governance regulations. It enables organizations to leverage both centralized and federated learning approaches, ensuring compliance with legal requirements while maintaining high model performance. This flexibility is particularly relevant for applications in autonomous driving and other edge intelligence scenarios.
Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models
Theory
Efficient ML
- MLE-NIT can improve advanced fibrosis detection in MASLD without requiring additional biomarkers.
- The s-DNN model outperformed both TabPFN and GPT-4o in external validation cohorts.
- The study highlights the importance of local calibration and threshold selection for clinical utility.
- AST and FIB-4 were identified as the most significant variables influencing model performance.
Summary
This study investigates the use of machine-learning-enhanced non-invasive testing (MLE-NIT) to improve the detection of advanced fibrosis in metabolic dysfunction-associated steatotic liver disease (MASLD). The authors compare the performance of a shallow-deep neural network (s-DNN), TabPFN, and a fine-tuned large language model (GPT-4o) against the traditional FIB-4 score, which is a widely used non-invasive test based on routine clinical variables. Using three biopsy-confirmed cohorts from China, Malaysia, and India (n=784), the study aims to determine if MLE-NIT can enhance fibrosis detection without requiring additional clinical data. The models utilized five variables: age, FIB-4, aspartate aminotransferase (AST), alanine aminotransferase (ALT), and platelet count (PLT). The results showed that the s-DNN outperformed the other models, achieving external ROC-AUCs of 0.77 and 0.67 in Malaysia and India, respectively. The study concludes that compact, domain-specific non-linear models can effectively enhance FIB-4-based fibrosis assessment, although further validation is needed for clinical application.
Methodology
The study utilized three biopsy-confirmed MASLD cohorts, splitting the Chinese cohort into training and validation sets. The models were trained using five clinical variables, and performance was evaluated using ROC-AUC metrics on external cohorts from Malaysia and India. Calibration analysis and decision-curve analysis were also conducted to assess clinical utility.
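For reference, the FIB-4 index that serves as both baseline and input feature is computed from the same routine variables. The sketch below uses the standard FIB-4 formula with the customary units (age in years, AST and ALT in U/L, platelets in 10^9/L); the function names and the example patient values are hypothetical, and the study's actual preprocessing is not specified in this summary.

```python
import math

def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """Standard FIB-4 index: (age * AST) / (platelets * sqrt(ALT))."""
    return (age_years * ast_u_per_l) / (platelets_1e9_per_l * math.sqrt(alt_u_per_l))

def feature_vector(age_years, ast, alt, plt):
    """Five inputs named in the summary: age, FIB-4, AST, ALT, platelet count."""
    return [age_years, fib4(age_years, ast, alt, plt), ast, alt, plt]

# Made-up patient, for illustration only.
score = fib4(age_years=50, ast_u_per_l=40, alt_u_per_l=36, platelets_1e9_per_l=200)
features = feature_vector(50, 40, 36, 200)
```

Since FIB-4 is a deterministic function of the other four inputs, any gain from a learned model on these features comes from modelling non-linear interactions rather than from new information.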
Results
The FIB-4 score achieved ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. The s-DNN achieved the highest ROC-AUCs of 0.77 and 0.67, outperforming TabPFN and GPT-4o. Calibration analysis indicated favorable Brier scores for the s-DNN, and permutation importance analysis identified AST and FIB-4 as key variables.
Implications
The findings suggest that MLE-NITs, particularly the s-DNN, can enhance the assessment of advanced fibrosis in MASLD, potentially improving patient stratification and clinical decision-making without increasing data requirements. This approach may facilitate broader implementation of non-invasive testing in clinical settings.
DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU
NLP
Large Language Models
Efficient ML
- DASH provides a differentiable search framework for hybrid attention architecture design, moving beyond manual and selector-style methods.
- The framework allows for architecture-only optimization, significantly reducing search time and token usage.
- DASH consistently outperforms existing hybrid attention design baselines and achieves better performance than Jet-Nemotron models.
- The method demonstrates that high-quality hybrid architectures can be obtained quickly, paving the way for routine design applications.
Summary
The paper introduces DASH, a novel framework for fast differentiable architecture search aimed at hybrid attention designs in large language models (LLMs). Traditional methods for hybrid architecture design often rely on manual rules or proxy signals, which can be inefficient and cumbersome. DASH addresses this by transforming the architecture search into a differentiable optimization problem, allowing for continuous operator allocation across layers. This approach significantly reduces the search cost, requiring only 12.3 million tokens and approximately 20 minutes on a single RTX Pro 6000 GPU, compared to the 200 billion tokens used by previous methods like Jet-Nemotron. The authors demonstrate that DASH not only outperforms existing selector-style baselines but also achieves superior performance on the RULER benchmark compared to Jet-Nemotron models, while remaining competitive on other benchmarks. The findings suggest that efficient hybrid attention architectures can be developed rapidly, making DASH a promising tool for future architecture design in LLMs.
Methodology
DASH formulates the architecture search as a differentiable problem, optimizing layer-wise operator allocations while keeping model and operator weights frozen. It involves preparing reusable teacher-aligned candidates and performing soft routing during the search process to optimize architecture parameters efficiently.
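The soft-routing step can be pictured as a per-layer softmax over candidate operators whose weights are the only trainable quantities. The sketch below shows just the forward pass and the final discretization; the two toy operators, the residual connection, and all names (`OPS`, `soft_routed_forward`, `discretize`) are assumptions made for illustration, since the summary does not list DASH's actual candidate set or training loop.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(x):
    """Toy dense self-attention without learned projections."""
    return softmax(x @ x.T / np.sqrt(x.shape[1])) @ x

def window_attention(x, w=2):
    """Toy local attention: each token sees neighbours within distance w."""
    idx = np.arange(x.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= w
    scores = np.where(mask, x @ x.T / np.sqrt(x.shape[1]), -np.inf)
    return softmax(scores) @ x

OPS = [full_attention, window_attention]   # frozen candidate operators

def soft_routed_forward(x, alpha):
    """Mix the candidate outputs of each layer with softmax(alpha[layer])."""
    for layer_logits in alpha:
        weights = softmax(layer_logits)
        x = x + sum(wk * op(x) for wk, op in zip(weights, OPS))
    return x

def discretize(alpha):
    """After the search, keep the highest-weighted operator in each layer."""
    return [OPS[k].__name__ for k in alpha.argmax(axis=1)]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))               # 8 tokens, width 4
alpha = np.array([[2.0, 0.0], [0.0, 3.0]])         # hypothetical architecture logits
out = soft_routed_forward(tokens, alpha)
choice = discretize(alpha)
```

In a real search the entries of `alpha` would be updated by gradient descent while the operator weights stay frozen, which is what keeps the search cheap.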
Results
DASH achieves superior performance on the RULER benchmark compared to a comprehensive suite of existing selector-style hybrid attention designs. Each search run requires only 12.3 million tokens and completes in about 20 minutes, demonstrating a significant efficiency improvement over previous methods like Jet-Nemotron.
Implications
The development of DASH suggests that rapid and efficient design of hybrid attention architectures is feasible, which could lead to more accessible and routine applications in LLM development. This could enhance the efficiency of LLMs in practical applications, particularly in scenarios requiring long-context inference.
Towards Understanding Self-Pretraining for Sequence Classification
Theory
Optimization
- Self-pretraining (SPT) significantly enhances Transformer model performance in sequence classification tasks.
- The ability of label supervision to learn useful Attention patterns from random initialization is a central challenge.
- Learning proximity interactions is identified as a key source of the improvements from SPT.
- SPT gains persist across different model depths, data sources, and pretraining durations.
Summary
This paper investigates the concept of self-pretraining (SPT) for Transformer models in sequence classification, building on the findings of Amos et al. (2024) that demonstrated significant accuracy improvements through a masked token prediction objective without external data. The authors replicate and systematically ablate these findings to understand how SPT enhances optimization and addresses the limitations of standard supervised training in Transformers. Their analysis reveals that the main bottleneck is not merely model depth or generalization but the ability of label supervision to learn effective Attention patterns from random initialization. The authors identify learning proximity interactions as a critical factor contributing to SPT's success, transforming positional encodings into proximity-biased Attention scores. Additionally, they present a simplified theoretical framework showing that label supervision may overlook certain Attention-score directions that SPT can effectively detect through masked reconstruction. The study concludes that SPT can lead to better generalization and performance across various settings, emphasizing its potential as a valuable training paradigm.
Methodology
The authors replicate the results of Amos et al. (2024) and conduct systematic ablations to explore the effects of pretraining duration, model depth, data source, and optimization strategies. They analyze Attention patterns using a minimal synthetic example to understand the mechanisms behind SPT's effectiveness.
Results
The replication of Amos et al.'s results confirms the effectiveness of SPT, showing consistent performance improvements across various tasks and settings. The ablation studies reveal that specific Attention parameters, particularly WQ and WK, are crucial for the gains observed with SPT, indicating that these parameters can effectively learn from self-pretraining.
Implications
The findings suggest that self-pretraining could be a valuable approach for improving Transformer models in various sequence classification tasks, potentially reducing the need for extensive external datasets. This could lead to more efficient training processes and better generalization in real-world applications.
CRAFT: Conflict-Resolved Aggregation for Federated Training
Federated Learning
- CRAFT reformulates federated aggregation as a constrained least-squares problem to ensure conflict-free updates.
- The method employs a momentum-like reference direction to preserve useful temporal information during aggregation.
- Layer-wise adaptation allows for conflict resolution at varying granularities, making it suitable for deep neural networks.
- Extensive experiments demonstrate improved mean accuracy and reduced accuracy disparity across clients.
Summary
CRAFT (Conflict-Resolved Aggregation for Federated Training) addresses the challenges of aggregating conflicting client updates in federated learning (FL) under heterogeneous data distributions. Traditional methods like naive averaging can lead to global updates that improve overall performance but may degrade the accuracy for specific clients. CRAFT reformulates the aggregation process as a geometric correction problem, aiming to find a global update that aligns positively with all client updates while minimizing conflicts. The authors derive a closed-form solution for this constrained optimization problem, which avoids the computational burden of iterative solvers. Additionally, CRAFT employs a layer-wise adaptation strategy to resolve conflicts at different feature granularities. The theoretical analysis demonstrates that CRAFT promotes a common-descent structure and mitigates conflicts effectively. Experimental results on heterogeneous datasets show that CRAFT not only enhances the mean accuracy of the global model but also reduces performance disparities among clients compared to state-of-the-art methods. This framework can be integrated into existing personalized FL pipelines, offering substantial performance improvements.
Methodology
CRAFT formulates the aggregation of client updates as a reference-anchored constrained least-squares problem, ensuring positive alignment between the global update and each client update. The method derives a closed-form solution based on Moore–Penrose correction, avoiding iterative solvers. It also incorporates a layer-wise adaptation strategy to address conflicts at different model layers.
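One way to picture a pseudoinverse-based correction of this kind is the minimum-norm adjustment below. It is a simplified stand-in, not CRAFT's actual closed form: it takes a reference direction, raises every non-positive client alignment to a small margin, and applies the smallest change to the reference that achieves those targets. The function name `resolve_conflicts`, the `margin` parameter, and the use of the plain average as reference are assumptions.

```python
import numpy as np

def resolve_conflicts(client_updates, reference, margin=1e-3):
    """Smallest change to `reference` that gives every client alignment >= margin.

    Solves min ||g - reference||^2 subject to G @ g = target, where target lifts
    any alignment below `margin` up to `margin`. The constraint is met exactly
    when the rows of G (the client updates) are linearly independent.
    """
    G = np.asarray(client_updates, dtype=float)
    d = np.asarray(reference, dtype=float)
    align = G @ d                                   # current inner products
    target = np.maximum(align, margin)              # keep good alignments, fix conflicts
    return d + np.linalg.pinv(G) @ (target - align)

# Two hypothetical client updates that disagree along the first coordinate.
G = np.array([[1.0, 0.0, 0.0],
              [-2.0, 0.5, 0.0]])
reference = G.mean(axis=0)                          # plain averaging conflicts with client 0
g = resolve_conflicts(G, reference)
alignments = G @ g
```

A layer-wise variant would call the same function once per layer on the corresponding slices of the client updates.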
Results
The experimental results indicate that CRAFT significantly improves the mean accuracy of the global model while simultaneously reducing the standard deviation of client accuracy, thereby enhancing fairness across clients. Compared to various state-of-the-art aggregation methods, CRAFT demonstrates superior performance in both accuracy and disparity reduction.
Implications
CRAFT's approach to conflict resolution in federated learning can lead to more equitable model performance across diverse client data distributions. Its integration into existing FL frameworks could facilitate broader adoption of federated learning in real-world applications where data privacy and fairness are critical.
Mitigating Label Bias with Interpretable Rubric Embeddings
Interpretability
- Rubric embeddings provide a framework to mitigate label bias in machine learning models.
- Traditional black-box embeddings can encode sensitive attributes and replicate biases from historical evaluations.
- The proposed method shows empirical success in reducing group disparities while improving cohort quality.
- Rubric embeddings are constructed from expert-defined criteria, ensuring alignment with the desired outcomes.
Summary
This paper addresses the issue of label bias in statistical decision algorithms, particularly in contexts where ground-truth labels are difficult to obtain, such as hiring and university admissions. The authors propose a novel approach using rubric embeddings, which are derived from expert-defined criteria that align with the underlying constructs of interest. By replacing traditional black-box embeddings with these interpretable representations, the authors aim to mitigate biases that may arise from historical human evaluations. The paper provides both theoretical and empirical evidence supporting the effectiveness of rubric embeddings in reducing label bias. The authors evaluate their method on a dataset of applications to a master's program, demonstrating that models trained on rubric embeddings not only reduce group disparities but also enhance overall cohort quality. This work suggests that using interpretable, domain-grounded representations can be a practical solution for addressing label bias in various decision-making contexts.
Methodology
The authors developed rubric embeddings by creating hundreds of rubric items that encode relevant features such as past coursework, grades, work experience, and evaluations. They utilized large language models (LLMs) to score applications along these rubric dimensions, resulting in semantically grounded representations. The effectiveness of this method was evaluated through both theoretical analysis and empirical testing on a novel dataset of master's program applications.
Results
The results indicated that models trained on rubric embeddings significantly reduced group disparities compared to those using traditional black-box embeddings. Additionally, the quality of the applicant cohorts improved, suggesting that the rubric-based approach effectively addresses the limitations of previous methods for mitigating label bias.
Implications
This research has significant implications for algorithmic fairness, particularly in high-stakes decision-making contexts like hiring and admissions. By providing a method to create interpretable and semantically meaningful representations, the findings can help organizations design more equitable algorithms that minimize bias and enhance diversity.
Gaussian Sheaf Neural Networks
Graph Learning
Theory
- Introduction of Gaussian Sheaf Neural Networks (GSNNs) for learning with Gaussian-distributed node features.
- Development of a new Laplacian operator that generalizes the sheaf Laplacian for Gaussian distributions.
- GSNNs demonstrate superior performance compared to traditional GNNs on both synthetic and real-world datasets.
- The framework effectively preserves the geometric and algebraic structure of Gaussian parameters during message passing.
Summary
This paper introduces Gaussian Sheaf Neural Networks (GSNNs), a novel framework designed to enhance Graph Neural Networks (GNNs) by incorporating probabilistic node features represented as Gaussian distributions. Traditional GNNs typically rely on vector-valued node features, which can overlook the underlying geometric and algebraic structures inherent in Gaussian parameters (mean and covariance). The authors leverage the theory of cellular sheaves to construct a new Laplacian operator that generalizes the sheaf Laplacian, preserving its essential properties while accommodating Gaussian distributions. The paper details the construction of the sheaf of means and covariances, characterizes the new Laplacian, and demonstrates the expressivity of GSNNs through various restriction map classes. Empirical evaluations on synthetic and real-world datasets reveal that GSNNs often outperform baseline models and exhibit robustness against oversmoothing as model depth increases, highlighting their practical relevance in graph-based learning tasks.
Methodology
The authors build upon the theory of cellular sheaves to define a Gaussian sheaf, replacing vector-space stalks with spaces of Gaussian distributions characterized by means and covariances. They derive a new Laplacian operator suitable for this setting and conduct experiments to validate the performance of GSNNs against traditional GNNs on various datasets.
Results
The experimental results indicate that GSNNs frequently outperform baseline models in both synthetic and real-world scenarios. Additionally, GSNNs are shown to be robust against oversmoothing, maintaining performance even as model depth increases.
Implications
The introduction of GSNNs could significantly enhance applications that require uncertainty-aware predictions in relational data, such as traffic modeling, sensor networks, and knowledge graphs. By effectively incorporating the structure of Gaussian distributions, GSNNs may lead to more accurate and reliable models in various domains.
Beyond Numerical Features: CNN-Driven Algorithm Selection via Contour Plots for Continuous Black-Box Optimization
Optimization
- Introduces a probing-based AAS formulation using contour maps for continuous BBO.
- Demonstrates the effectiveness of CNNs in predicting solver performance from visual representations.
- Shows significant performance improvements over traditional single best solver approaches.
- Competes well with feature-based methods like ELA and Deep-ELA.
Summary
This paper presents a novel approach to automated algorithm selection (AAS) in continuous black-box optimization (BBO) by utilizing contour-map visualizations of optimization landscapes instead of traditional numerical descriptors. The authors propose a convolutional neural network (CNN) regressor that processes these contour maps to predict the performance of various solvers, allowing for the selection of the most suitable algorithm for each problem instance. The study demonstrates that this representation-driven method significantly outperforms the single best solver (SBS) and competes effectively with existing feature-based methods like Exploratory Landscape Analysis (ELA) and Deep-ELA. The results indicate that CNNs can effectively leverage the spatial structure of optimization landscapes, providing a promising alternative to handcrafted features for algorithm selection in continuous optimization tasks.
Methodology
The authors developed a CNN model that takes grayscale contour maps generated from objective evaluations of optimization landscapes as input. Two variants of the model were tested: one that stacks multiple instance views as channels and another that encodes each view separately before aggregation. The model was trained to predict the performance of different solvers based on these contour maps.
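The input representation can be approximated with a few lines of array code. The sketch below evaluates a 2-D objective on a regular grid and quantizes the values into grey-level bands, which resembles a filled contour map; the grid-based probing, the resolution, the number of levels, and the names `contour_image` and `sphere` are assumptions, and the paper's actual rendering pipeline may differ.

```python
import numpy as np

def contour_image(f, lower, upper, resolution=64, n_levels=10):
    """Render a 2-D objective as a banded grayscale image (uint8)."""
    xs = np.linspace(lower, upper, resolution)
    X, Y = np.meshgrid(xs, xs)
    Z = f(np.stack([X, Y], axis=-1))                   # objective value at each grid point
    Z = (Z - Z.min()) / (Z.max() - Z.min() + 1e-12)    # normalize to [0, 1)
    bands = np.minimum(np.floor(Z * n_levels), n_levels - 1)
    return (bands * (255 // (n_levels - 1))).astype(np.uint8)

def sphere(p):
    """Simple test objective: sum of squares."""
    return (p ** 2).sum(axis=-1)

img = contour_image(sphere, -5.0, 5.0)
```

An array like `img` (or several of them stacked as channels, one per view) is the kind of input a CNN regressor could consume in place of handcrafted landscape features.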
Results
The proposed CNN-driven algorithm selectors significantly outperformed the single best solver on the BBOB 2009 single-objective protocol and were competitive with ELA and Deep-ELA baselines. A follow-up bi-objective evaluation confirmed the robustness of the image-based approach, indicating its effectiveness across different optimization scenarios.
Implications
The findings suggest that visual representations of optimization landscapes can enhance algorithm selection processes, potentially leading to more efficient optimization strategies in various scientific and engineering applications. This approach may reduce reliance on handcrafted features, simplifying the AAS pipeline.
Winfree Oscillatory Neural Network
Computer Vision
Theory
Efficient ML
- WONN is the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K.
- The architecture achieves 80.1% accuracy on Maze-hard using only 1% of the parameters of prior state-of-the-art models.
- WONN combines geometric inductive biases with flexible synchronization dynamics for improved representation learning.
- The learned representations exhibit bimodal phase organization, enhancing the model's ability to capture complex structures.
Summary
The paper introduces the Winfree Oscillatory Neural Network (WONN), a novel neural architecture that leverages generalized Winfree dynamics to enhance representation learning through oscillatory interactions. Unlike traditional neural networks, WONN evolves representations on a toroidal phase space, where neurons act as phase oscillators. This architecture combines phase-based inductive biases with flexible interaction mechanisms, which can be either fixed trigonometric mappings or learnable neural networks. The authors evaluate WONN on various tasks, including image recognition on CIFAR and ImageNet, as well as complex reasoning tasks like Maze-hard and Sudoku. WONN demonstrates competitive or superior performance compared to existing models while maintaining strong parameter efficiency, achieving notable results such as 80.1% accuracy on Maze-hard using only 1% of the parameters of state-of-the-art models. The findings suggest that structured oscillatory dynamics can serve as a scalable and efficient alternative to conventional neural architectures, with implications for both visual perception and logical reasoning.
Methodology
WONN is developed by connecting classical Winfree dynamics with learnable neural architectures. It utilizes a population of coupled phase oscillators governed by the Winfree model, allowing for flexible synchronization behaviors and interaction structures. The architecture incorporates structured interactions and hierarchical dynamics to support both local coordination and global information flow.
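For orientation, the classical Winfree model that the architecture builds on can be simulated in a few lines. The sketch below uses the textbook mean-field form with pulse function 1 + cos(θ) and response function -sin(θ), integrated with an explicit Euler step; it does not include WONN's learnable interactions or hierarchy, and the function names, step size, and coupling value are illustrative assumptions.

```python
import numpy as np

def winfree_step(theta, omega, coupling, dt=0.01):
    """One Euler step of the classical mean-field Winfree model."""
    pulse = (1.0 + np.cos(theta)).mean()                 # average pulse emitted by the population
    dtheta = omega + coupling * pulse * (-np.sin(theta)) # each oscillator's phase response
    return np.mod(theta + dt * dtheta, 2 * np.pi)        # phases live on a torus

def order_parameter(theta):
    """Magnitude of the mean phase vector: 1 means full phase alignment."""
    return float(np.abs(np.exp(1j * theta).mean()))

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=64)               # hypothetical population of 64 oscillators
omega = rng.normal(1.0, 0.1, size=64)                    # natural frequencies
for _ in range(1000):
    theta = winfree_step(theta, omega, coupling=0.5)
r = order_parameter(theta)
```

Replacing the fixed trigonometric pulse and response functions with small learnable networks is the kind of generalization the summary attributes to WONN.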
Results
WONN achieves competitive performance on image recognition tasks, including CIFAR and ImageNet, and excels in complex reasoning tasks, such as achieving perfect accuracy on Sudoku. The architecture demonstrates strong parameter efficiency, outperforming existing models while using significantly fewer parameters.
Implications
The results suggest that oscillatory dynamics can provide a new computational principle for neural networks, potentially leading to advancements in both visual perception and logical reasoning tasks. This could inspire further research into synchronization-based architectures in machine learning.
OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI
Generative Models
- Introduction of OpenSeisML, a large-scale dataset for seismic inversion.
- Automated data curation pipeline enhances reproducibility and efficiency.
- Dataset includes real seismic and well-log data, addressing the scarcity of high-quality datasets.
- Supports training of generative models for uncertainty quantification in subsurface properties.
Summary
The paper introduces OpenSeisML, a large-scale collection of real seismic datasets aimed at enhancing generative AI workflows for seismic inversion. The authors highlight the limitations of existing datasets, which are often proprietary or synthetic, thus hindering the development of machine learning methods in this domain. OpenSeisML addresses this gap by providing curated seismic volumes and well-log data sourced from the UK National Data Repository. The dataset includes comprehensive petrophysical measurements and is designed to facilitate the training of generative models that can produce statistically consistent subsurface realizations. The authors present an automated data curation pipeline that ensures reproducibility and efficiency in data preparation, enabling the synthesis of multiple realizations for uncertainty quantification in seismic inversion. This initiative aims to support the development of standardized benchmarks and reproducible workflows in machine learning applications for geoscience.
Methodology
The authors developed an automated data curation pipeline that collects and preprocesses seismic data from the UK National Data Repository. The pipeline includes steps for downloading seismic volumes and well data, aligning them, and converting time-domain seismic data to depth using checkshot data. This structured approach ensures that the datasets are ready for training generative models without the need for additional preprocessing.
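The time-to-depth step can be illustrated with simple linear interpolation of a checkshot table. This is a minimal sketch under stated assumptions, not the OpenSeisML pipeline: it treats a single trace, assumes the checkshot gives monotonically increasing two-way-time and depth pairs that cover the trace, and the function name `time_to_depth` and all numbers are made up.

```python
import numpy as np

def time_to_depth(trace, dt_s, cs_twt_s, cs_depth_m, dz_m):
    """Resample a time-domain trace onto a regular depth grid via a checkshot table."""
    t = np.arange(trace.size) * dt_s                       # two-way time of each sample
    depth_of_sample = np.interp(t, cs_twt_s, cs_depth_m)   # depth reached at each sample time
    z = np.arange(depth_of_sample[0], depth_of_sample[-1] + 1e-9, dz_m)
    return z, np.interp(z, depth_of_sample, trace)         # amplitudes on the regular depth grid

trace = np.array([0.0, 1.0, 2.0, 3.0, 4.0])                # toy amplitudes sampled every 0.5 s
z, depth_trace = time_to_depth(
    trace, dt_s=0.5,
    cs_twt_s=np.array([0.0, 1.0, 2.0]),                    # checkshot two-way times (s)
    cs_depth_m=np.array([0.0, 1000.0, 2500.0]),            # corresponding depths (m)
    dz_m=500.0,
)
```

After this conversion, seismic samples and well-log measurements share a depth axis, which is what makes the two data types directly comparable.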
Results
The OpenSeisML dataset provides a comprehensive collection of real seismic and well-log data, enabling the training of generative models that capture the statistical distribution of subsurface properties. The automated curation process ensures that the data is structured and consistent, facilitating its use in machine learning applications.
Implications
The availability of OpenSeisML can significantly advance the field of seismic inversion by providing researchers and practitioners with access to high-quality, realistic datasets. This can lead to improved machine learning models that are capable of better generalization across diverse geological scenarios, ultimately enhancing the efficiency and accuracy of subsurface exploration and characterization.
Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning
Reinforcement Learning
Theory
Efficient ML
- Introduces a theoretical model for OOD generalization in RL agents using POMDPs.
- Extends state abstraction frameworks to POMDPs and proposes a novel successor-weighted model reduction.
- Derives a performance loss bound that highlights the relationship between abstract state space size and OOD generalization.
- Demonstrates that smaller abstract state spaces improve test performance and facilitate generalization to complex tasks.
Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning
Summary
This paper presents a theoretical model for achieving Out-of-Distribution (OOD) generalization in Reinforcement Learning (RL) agents, focusing on Partially Observable Markov Decision Processes (POMDPs). The authors extend existing state abstraction frameworks to POMDPs and introduce a successor-weighted model reduction technique that allows for the compression of state spaces into smaller abstract representations. They derive a performance loss bound that decomposes the agent's performance into approximation and estimation errors, demonstrating that reducing the size of the abstract state space enhances OOD generalization. The findings suggest that constraining agents to operate over a limited set of abstract states is crucial for generalizing to more complex tasks, motivating further research into scalable RL architectures that can adapt across varying task complexities.
Methodology
The authors extend existing proof techniques for performance loss bounds from finite to countably infinite state spaces in POMDPs. They define a successor-weighted model reduction that compresses state spaces and derive a bound on the performance of RL agents under different start-state distributions, allowing for the analysis of OOD generalization conditions.
Results
The main result is a bound on the performance of RL agents when tested on start-state distributions that differ from their training distributions. This bound reveals that reducing the number of abstract states can lead to improved OOD generalization, even when training and test tasks are probabilistically distinct.
Implications
The findings have significant implications for the design of RL systems, suggesting that smaller, more focused abstract state spaces can enhance generalization capabilities. This could lead to more efficient and adaptable AI systems that perform well across a range of tasks with varying complexities, potentially improving human-AI interaction and understanding of cognitive processes.
GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation
Graph Learning
Time Series
Interpretability
- Introduces the first dual-scale application of Differential Attention v2 for medication recommendation.
- Demonstrates improvements in recommendation quality and safety performance over existing methods.
- Provides a transparent analysis of how knowledge constraints affect the safety-performance balance.
- Shows that higher DDI rates in recommendations can reflect more comprehensive solutions for complex cases.
GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation
Summary
GraphDiffMed is a novel framework designed to enhance medication recommendation from electronic health records (EHRs) by integrating dual-scale Differential Attention v2 with pharmacological graph priors. The challenge of recommending safe and effective medication combinations is compounded by the complexity of patient data, which is often long, noisy, and heterogeneous. Existing methods typically excel in either temporal modeling or pharmacological knowledge integration but struggle to achieve both effectively. GraphDiffMed addresses this gap by applying differential attention mechanisms at both intra-visit and inter-visit levels, allowing for the filtering of irrelevant signals while incorporating pharmacological constraints during the learning process. The framework was evaluated using the MIMIC-III dataset, demonstrating significant improvements in recommendation quality and safety performance compared to strong baseline models. Notably, the best-performing configuration utilized only demographic auxiliary features, suggesting a streamlined approach to medication recommendation. The findings indicate that combining noise-aware attention with pharmacological constraints can lead to more reliable and clinically relevant recommendations, ultimately aiding clinicians in making informed decisions in complex polypharmacy scenarios.
Methodology
GraphDiffMed employs dual-scale Differential Attention v2 to process EHR data, applying attention mechanisms at both intra-visit and inter-visit levels to mitigate noise. The framework integrates pharmacological constraints derived from drug-drug interaction (DDI) graphs during the training process, enhancing the model's ability to make clinically sound recommendations.
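The safety side of the evaluation is usually reported as a DDI rate. The sketch below computes it in the way commonly used in the medication recommendation literature: the share of recommended drug pairs that appear in a DDI graph. The summary does not state the paper's exact definition, so treat this as an assumption; the function name `ddi_rate` and the placeholder drug identifiers are hypothetical.

```python
from itertools import combinations

def ddi_rate(recommended_sets, ddi_pairs):
    """Fraction of co-recommended drug pairs that are known interactions."""
    interacting = {frozenset(pair) for pair in ddi_pairs}
    total_pairs = 0
    interacting_pairs = 0
    for meds in recommended_sets:
        # Every unordered pair inside one recommendation counts once.
        for a, b in combinations(sorted(set(meds)), 2):
            total_pairs += 1
            if frozenset((a, b)) in interacting:
                interacting_pairs += 1
    return interacting_pairs / total_pairs if total_pairs else 0.0

# Placeholder identifiers, not real pharmacological data.
ddi_pairs = [("drug_a", "drug_b"), ("drug_c", "drug_d")]
recommendations = [["drug_a", "drug_b", "drug_e"], ["drug_e", "drug_f"]]
rate = ddi_rate(recommendations, ddi_pairs)
print(rate)
```

Here one of the four co-recommended pairs is an interaction, so the rate is 0.25.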
Results
Experiments on the MIMIC-III dataset showed that GraphDiffMed consistently outperformed strong baseline models in recommendation quality and safety. The best-performing configuration used only demographic auxiliary features, indicating that the model maintains high performance with simpler inputs.
Implications
The findings from this research could significantly impact clinical decision support systems by providing more reliable medication recommendations, thereby improving patient safety and treatment efficacy in polypharmacy situations. The open-source nature of the code allows for further exploration and adaptation in various healthcare settings.
TriForces: Augmenting Atomistic GNNs for Transferable Representations
Graph Learning
- TriForces introduces a three-stream architecture for atomistic GNNs, enhancing representation transferability.
- The framework utilizes self-supervised learning to improve the organization and quality of learned representations.
- Significant performance improvements were observed on multiple benchmarks without the need for Density Functional Theory (DFT) labels.
- The model enables efficient similarity retrieval in compositional, structural, or joint embedding spaces.
TriForces: Augmenting Atomistic GNNs for Transferable Representations
Summary
The paper introduces TriForces, a novel framework designed to enhance the transferability of atomistic Graph Neural Networks (GNNs) by separating composition and structure information. Traditional Machine Learning Interatomic Potentials (MLIPs) often struggle with transferability across different chemical domains due to their reliance on task-specific datasets and the entanglement of composition and geometry in their representations. TriForces addresses these issues by employing a three-stream architecture that distinctly encodes composition, structure, and their interactions. This separation allows for the application of self-supervised learning (SSL) techniques, which improve the organization of representations and facilitate efficient retrieval of similar structures. The framework demonstrates significant improvements in performance on benchmarks like MatBench and QM9, achieving a 57% reduction in energy Mean Absolute Error (MAE) on the OMat24 dataset with limited data. The authors provide pretrained models and code to support further research and application in the field.
Methodology
TriForces employs a three-stream architecture that separates composition, structure, and interaction information. It integrates self-supervised learning techniques, including denoising and masking, to enhance the stability and organization of representations. The framework is validated across various GNN architectures and benchmarks, demonstrating its effectiveness in improving transfer performance and data efficiency.
Results
TriForces achieved a 57% reduction in energy MAE on the OMat24 dataset with only 20K samples, alongside improvements in force MAE across different sample sizes. The framework outperformed baseline models on MatBench and QM9, showcasing enhanced transferability and representation quality.
Implications
The TriForces framework has the potential to significantly advance the field of materials science by enabling more efficient and effective modeling of atomistic systems. Its ability to facilitate transfer across different chemical domains could streamline the development of MLIPs and enhance exploratory analysis in materials discovery.
FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs
Federated Learning
- FedCoE balances global generalization and local personalization in federated learning.
- The framework utilizes a dual-level mixture-of-experts architecture to handle heterogeneous data.
- A shared gating network synchronizes expert selection across clients, addressing gating inconsistency.
- An adaptive mechanism allows new clients to quickly access global experts, improving cold-start performance.
FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs
Summary
The paper presents FedCoE, a novel framework for Federated Learning (FL) that addresses the challenges of generalization and personalization in non-IID data environments. Traditional FL methods often struggle with parameter divergence and overfitting to local data, particularly in cold-start scenarios where new clients lack sufficient local training. FedCoE employs a dual-level mixture-of-experts (MoE) architecture, maintaining multiple independent global expert models on the server. A shared gating network dynamically models client-expert correlations, mitigating expert drift and ensuring consistent expert selection across clients. Additionally, an adaptive mechanism allows new clients to leverage the global expert pool immediately, enhancing their performance without extensive local training. Experimental results show that FedCoE achieves an average global accuracy of 78.00% and personalized accuracy of 89.32%, outperforming baseline methods significantly, especially in cold-start situations where it achieves 77.27% accuracy without local fine-tuning.
Methodology
FedCoE introduces a federated coordinated dual-level mixture-of-experts framework that includes a correlation-based aggregation mechanism to update global experts with semantically aligned gradients, a shared server-side gating network for consistent expert selection, and an adaptive expert assembly strategy for new clients to access personalized experts immediately.
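To make the gating idea concrete, the sketch below shows a generic top-k mixture-of-experts forward pass in which a single gating matrix scores a pool of experts. It is a simplified stand-in, not FedCoE itself: the linear experts, the `top_k` choice, and all names are assumptions, and the correlation-based aggregation and dual-level structure are not modelled.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_prediction(x, gate_w, expert_ws, top_k=2):
    """Combine the top-k experts chosen by a shared gating network."""
    gate_probs = softmax(x @ gate_w)            # one score per expert
    top = np.argsort(gate_probs)[-top_k:]       # indices of the k best experts
    weights = gate_probs[top] / gate_probs[top].sum()  # renormalise over the k
    outputs = np.stack([x @ expert_ws[i] for i in top])
    return weights @ outputs, top, weights

rng = np.random.default_rng(0)
d, n_experts, n_classes = 5, 4, 3
x = rng.normal(size=d)                          # one client feature vector
gate_w = rng.normal(size=(d, n_experts))        # shared gate (hypothetical)
expert_ws = [rng.normal(size=(d, n_classes)) for _ in range(n_experts)]

y, chosen, weights = gated_prediction(x, gate_w, expert_ws, top_k=2)
print(chosen, weights, y)
```

Because the gate is shared, a new client with no local training can be routed to existing experts immediately, which is the intuition behind the cold-start result.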
Results
FedCoE achieved an average global accuracy of 78.00% and personalized accuracy of 89.32%, surpassing baseline methods by 8.82% and 29.19%, respectively. In cold-start scenarios, it reached 77.27% accuracy without local fine-tuning, outperforming baselines by over 12.54%.
Implications
The FedCoE framework has significant implications for enhancing federated learning systems, particularly in applications where data privacy is crucial, and diverse client distributions are present. It can improve model performance in various domains, including healthcare, finance, and IoT, where data is often fragmented and sensitive.
Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning
Federated Learning
Optimization
Efficient ML
- Introduction of a Potential Federation Loss (PFL) that balances predictive utility and fairness in client selection.
- Development of a proactive client selection framework that identifies optimal client subsets before training.
- Utilization of mutual information to assess data suitability and fairness concerns.
- Demonstration of improved model performance and efficiency over traditional reactive methods.
Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning
Summary
This paper addresses the challenges of Federated Learning (FL), particularly the issues arising from non-IID data across clients, which can hinder model convergence and accuracy. Traditional FL methods often reactively filter or down-weight client contributions, leading to inefficiencies and potential privacy risks. The authors propose a proactive client selection framework that identifies an optimal subset of clients before training, ensuring that the combined data meets both utility and fairness criteria. This is achieved through a novel Potential Federation Loss (PFL) objective that utilizes mutual information to evaluate candidate federations based on predictive utility and fairness risks. The framework employs simulated annealing to solve the optimal subset search problem while maintaining strong privacy guarantees. Experimental results demonstrate that this proactive approach yields faster, fairer, and more accurate models compared to conventional methods, even outperforming state-of-the-art adaptive aggregation strategies.
Methodology
The authors propose a proactive client selection framework that formulates the selection as an optimal subset search problem over the Potential Federation Loss (PFL) objective. This is solved using simulated annealing, leveraging mutual information derived from differentially private contingency tables to evaluate the relevance of client data features.
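The mutual information term can be computed directly from a contingency table of counts, as in the sketch below. How the paper privatizes the tables and combines the resulting scores into the PFL objective is not given in this summary, so only the plain estimator is shown; the function name and the clipping of negative counts (which noisy tables can contain) are assumptions.

```python
import numpy as np

def mutual_information(table):
    """Mutual information (in nats) of two discrete variables,
    estimated from a contingency table of counts."""
    # Noisy (e.g. privatized) counts can be negative; clip them to zero.
    counts = np.clip(np.asarray(table, dtype=float), 0.0, None)
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)   # row marginal
    py = joint.sum(axis=0, keepdims=True)   # column marginal
    mask = joint > 0                         # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

independent = [[10, 10], [10, 10]]   # feature tells nothing about the label
dependent = [[50, 0], [0, 50]]       # feature determines the label
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

A search procedure such as simulated annealing would then score each candidate client subset with quantities of this kind.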
Results
Experimental evaluations on four benchmarks indicate that the proposed proactive client selection method results in models that are faster, fairer, and more accurate than those trained using uniform sampling or existing adaptive aggregation strategies.
Implications
This work has significant implications for improving the efficiency and fairness of Federated Learning systems, particularly in applications where data privacy is critical, such as healthcare and finance. The proactive approach can help mitigate biases and enhance model performance across diverse client data distributions.
Spectral Souping: A Unified Framework for Online Preference Alignment
NLP
Large Language Models
Reinforcement Learning
- Introduction of Spectral Souping for online preference alignment in LLMs.
- Discovery of a universal spectral representation that aids in model merging.
- Two-phase methodology: offline training of specialized policies and online adaptation.
- Significant performance improvements over existing methods.
Spectral Souping: A Unified Framework for Online Preference Alignment
Summary
This paper introduces Spectral Souping, a novel framework aimed at improving online preference alignment for Large Language Models (LLMs) by addressing the limitations of existing methods that rely on aggregated human feedback. The authors identify a universal spectral representation within LLMs that facilitates model merging, allowing for efficient adaptation to individual user preferences without the need for extensive retraining. The proposed methodology consists of two phases: an offline phase where specialized policies are trained to capture distinct preference dimensions, followed by an online phase where these policies are dynamically combined at inference time. This approach not only enhances scalability and computational efficiency but also achieves significant performance improvements over state-of-the-art techniques. The theoretical foundation of the framework provides provable sub-optimality bounds, demonstrating that the spectral souping method can perform nearly as well as fully fine-tuned models tailored to specific user preferences.
Methodology
The methodology consists of a two-phase process: first, an offline training phase where a basis of specialized policies is learned, each corresponding to a distinct preference dimension. Second, an online adaptation phase where these policies are dynamically combined at inference time to generate responses tailored to individual user preferences, thus avoiding costly per-user fine-tuning.
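The online phase can be pictured as a weighted blend of the specialized policies. The sketch below mixes per-policy next-token logits with user preference weights; this is a generic stand-in for inference-time combination, not the paper's spectral construction, and the function name, toy logits, and weights are all assumptions.

```python
import numpy as np

def combine_policies(policy_logits, preference_weights):
    """Blend next-token logits from several policies and return
    the resulting probability distribution over the vocabulary."""
    logits = np.asarray(policy_logits, dtype=float)   # (n_policies, vocab)
    w = np.asarray(preference_weights, dtype=float)
    w = w / w.sum()                                   # normalise the weights
    mixed = w @ logits                                # weighted sum of logits
    mixed = mixed - mixed.max()                       # numerical stability
    probs = np.exp(mixed)
    return probs / probs.sum()

# Two hypothetical policies over a three-token vocabulary.
policy_logits = [[2.0, 0.5, -1.0],
                 [-1.0, 0.5, 2.0]]
probs_user_a = combine_policies(policy_logits, [0.75, 0.25])
probs_user_b = combine_policies(policy_logits, [0.0, 1.0])
print(probs_user_a, probs_user_b)
```

Changing only the weight vector changes the behaviour per user, which is why no per-user fine-tuning is needed in this style of approach.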
Results
Experiments conducted on online preference alignment benchmarks show that the Spectral Souping framework achieves substantial performance gains compared to existing state-of-the-art approaches, demonstrating its effectiveness in adapting LLMs to diverse user preferences.
Implications
The implications of this work extend to enhancing user experience in applications involving LLMs by providing more personalized interactions. It also opens avenues for further research in efficient model adaptation techniques and the theoretical understanding of preference alignment in reinforcement learning contexts.
A Deployment Audit of Release-Side Risk in Conformal Triage under Prevalence Shift
Theory
- Introduces a leakage-aware deployment audit for evaluating release-side risk in conformal triage.
- Demonstrates that traditional metrics can obscure the safety of release decisions under prevalence shift.
- Identifies the necessity of separating correction, calibration, and evaluation to ensure safety in deployment.
- Shows that lower review rates can lead to the unsafe release of patients who should not be cleared.
A Deployment Audit of Release-Side Risk in Conformal Triage under Prevalence Shift
Summary
This paper addresses the critical issue of release-side risk in conformal triage systems, particularly under conditions of prevalence shift, where the distribution of target events changes. The authors introduce a leakage-aware deployment audit that evaluates how many patients who truly experience the target event are released without human review. The audit separates subjects into three roles: prevalence correction, conformal calibration, and held-out release-safety evaluation. This allows for a direct assessment of release actions, revealing that traditional metrics like marginal coverage can be misleading. The study applies this audit to a retrospective cohort of non-small cell lung cancer (NSCLC) patients, demonstrating that a lower review rate can lead to unsafe releases of event-positive patients. The findings emphasize the importance of auditing the action level rather than relying solely on aggregate metrics, highlighting the need for sufficient event labels for safe low-review release decisions.
Methodology
The authors propose a three-part audit framework that separates the roles of prevalence correction, conformal calibration, and release-safety evaluation. This framework allows for a detailed analysis of release actions, focusing on the actual deployment decisions made by the system. The methodology includes deriving action-level safety metrics and a fail-safe condition to assess the adequacy of event labels for safe releases.
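An action-level audit of this kind can be sketched by counting outcomes among released patients on a held-out split. The metric names below are hypothetical and the paper's exact definitions and fail-safe condition are not given in this summary; the example only shows why a low review rate and release safety must be checked separately.

```python
def release_audit(released, event):
    """Summarise release decisions against true outcomes.
    released[i] is 1 if patient i was cleared without review,
    event[i] is 1 if patient i truly had the target event."""
    n = len(released)
    n_released = sum(released)
    n_events = sum(event)
    released_positive = sum(1 for r, e in zip(released, event) if r and e)
    return {
        "review_rate": 1 - n_released / n,
        "released_event_positive": released_positive,
        "event_rate_among_released": released_positive / n_released if n_released else 0.0,
        "fraction_of_events_released": released_positive / n_events if n_events else 0.0,
    }

# Toy held-out evaluation split (illustrative values only).
released = [1, 1, 0, 0, 1, 0, 1, 1]
event = [0, 1, 1, 0, 0, 1, 0, 0]
audit = release_audit(released, event)
print(audit)
```

In this toy split a third of the event-positive patients are released, a risk that a marginal coverage figure alone would not expose.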
Results
The application of the audit to a retrospective NSCLC pilot revealed that the pooled conformal calibration could lower review rates by releasing more patients, some of whom were event-positive. The findings indicated that the pilot lacked sufficient event labels to ensure safe low-review releases, highlighting the inadequacy of relying on marginal coverage as a safety certificate.
Implications
The results of this study have significant implications for the deployment of AI systems in clinical settings, particularly in ensuring patient safety when making automated release decisions. The proposed audit framework can help refine conformal triage systems and improve their reliability under varying prevalence conditions.
A Machine Learning Framework for Weighted Least Squares GNSS Positioning based on Activation Functions
Optimization
- Proposes a machine learning framework to enhance GNSS positioning accuracy.
- Utilizes activation functions to transform machine learning predicted scores into weights for WLS.
- Demonstrates significant reductions in positioning errors in urban environments.
- Shows strong geographical transferability of the proposed algorithm.
A Machine Learning Framework for Weighted Least Squares GNSS Positioning based on Activation Functions
Summary
This paper presents a novel machine learning framework aimed at improving the accuracy of Global Navigation Satellite Systems (GNSS) positioning, particularly in challenging urban environments where signal degradation is prevalent. The authors address the issues of signal obstruction, non-line-of-sight (NLOS) reception, and multipath effects that lead to significant errors in GNSS pseudorange measurements. The proposed framework integrates activation functions into the weighted least squares (WLS) algorithm to enhance positioning accuracy by effectively weighting the contributions of various GNSS signals based on their quality. The framework employs ensemble learning algorithms trained on multiple signal quality indicators to identify and score poor-quality signals. The activation functions then transform these scores into appropriate weights for the WLS positioning process. The performance of the proposed method is evaluated using real-world datasets from urban areas in Hong Kong and Tokyo, demonstrating that the sigmoid activation function consistently yields the best improvements across different machine learning algorithms and GNSS configurations. The results indicate substantial reductions in positioning errors for both single- and multi-constellation scenarios, along with strong geographical transferability, as the algorithm maintains performance when applied to data from other similarly urbanized regions.
Methodology
The methodology involves training ensemble learning algorithms on various signal quality indicators to classify and score GNSS signal quality. Activation functions are then applied to these scores to derive weights for the weighted least squares (WLS) positioning algorithm, allowing for improved handling of degraded signals in GNSS data.
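The score-to-weight step can be sketched as follows: a sigmoid maps each signal's quality score to a weight in (0, 1), and the weights enter a standard weighted least squares solve. The toy geometry matrix, the scores, and the 30 m bias are assumptions for illustration, not values from the paper.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-np.asarray(s, dtype=float)))

def weighted_least_squares(H, y, weights):
    """Solve min_x sum_i w_i * (y_i - H_i x)^2 via the normal equations."""
    W = np.diag(weights)
    return np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

# Toy linearised problem: three position states plus a clock term.
H = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [-1.0, 0.0, 0.0, 1.0],
              [0.0, -1.0, 0.0, 1.0],
              [0.6, 0.8, 0.0, 1.0]])
x_true = np.array([1.0, -2.0, 0.5, 3.0])
y = H @ x_true
y[5] += 30.0                      # simulated NLOS bias on the last signal

# Hypothetical model scores: high means good quality, low means degraded.
scores = np.array([4.0, 4.0, 4.0, 4.0, 4.0, -20.0])
x_weighted = weighted_least_squares(H, y, sigmoid(scores))
x_unweighted = weighted_least_squares(H, y, np.ones(len(y)))

err_weighted = np.linalg.norm(x_weighted - x_true)
err_unweighted = np.linalg.norm(x_unweighted - x_true)
print(err_weighted, err_unweighted)
```

Down-weighting the degraded signal removes almost all of the bias it would otherwise inject into the solution.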
Results
The proposed framework resulted in substantial reductions in positioning errors when tested with real-world datasets from urban areas. The use of sigmoid activation functions provided the most significant improvements in accuracy across different machine learning models and GNSS configurations. The algorithm also demonstrated strong performance when applied to datasets from other urban regions, indicating its robustness and transferability.
Implications
The findings suggest that integrating machine learning techniques with traditional GNSS positioning methods can significantly enhance accuracy, particularly in urban environments. This has potential applications in various fields, including transportation, autonomous vehicles, and location-based services, where precise positioning is critical.
Training distribution determines the ceiling of drug-blind cancer sensitivity prediction
Theory
- Drug-blind sensitivity prediction has plateaued due to metric artifacts rather than limitations in drug representations.
- Global Pearson r conflates between-drug potency and within-drug cell sensitivity rankings, masking true performance ceilings.
- Per-drug Pearson r reveals that no drug representation improves upon cell-only features.
- Mechanism-of-action (MoA) as a training distribution constraint significantly enhances predictive performance.
Training distribution determines the ceiling of drug-blind cancer sensitivity prediction
Summary
This paper addresses the stagnation in drug-blind cancer sensitivity prediction, which has plateaued despite advancements in drug representation techniques. The author argues that this plateau is a result of a metric artifact rather than limitations in drug representations. The study reveals that the standard benchmark, global Pearson correlation coefficient (r), conflates between-drug potency differences with within-drug cell sensitivity rankings. By employing per-drug Pearson r, which isolates the ranking of cell responses within each drug, the author demonstrates that no drug encoding surpasses the performance of cell-only features across multiple datasets. A controlled experiment contrasting the use of mechanism-of-action (MoA) as a drug feature versus a training-distribution constraint shows that aligning training distribution with MoA significantly enhances predictive performance for targeted kinase inhibitors. The findings suggest that the stagnation in predictive performance can be addressed through mechanism-stratified training and response matching, which recover key sources of predictive gain in drug-blind sensitivity prediction.
Methodology
The study employs a comparative analysis of global and per-drug Pearson correlation coefficients to evaluate drug sensitivity predictions. It conducts controlled experiments to assess the impact of mechanism-of-action (MoA) as a drug feature versus a training-distribution constraint. The analysis includes ridge regression and a Transformer encoder across multiple independent datasets.
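The difference between the two metrics is easy to reproduce on toy data. In the sketch below, predictions capture the potency gap between two drugs but rank cells in the wrong order within each drug. The numbers and function names are illustrative assumptions, not data from the paper.

```python
import numpy as np

def pearson_r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def per_drug_pearson(y_true, y_pred, drug_ids):
    """Average of Pearson r computed separately within each drug."""
    rs = []
    for d in np.unique(drug_ids):
        m = drug_ids == d
        rs.append(pearson_r(y_true[m], y_pred[m]))
    return float(np.mean(rs))

# Two hypothetical drugs with very different average potency.
drug_ids = np.array(["A", "A", "A", "B", "B", "B"])
y_true = np.array([1.0, 2.0, 3.0, 11.0, 12.0, 13.0])
# Right drug-level offset, reversed cell ranking inside each drug.
y_pred = np.array([3.0, 2.0, 1.0, 13.0, 12.0, 11.0])

global_r = pearson_r(y_true, y_pred)
per_drug_r = per_drug_pearson(y_true, y_pred, drug_ids)
print(global_r, per_drug_r)
```

The global r stays close to 0.95 while the per-drug r is -1, which is the conflation the summary describes.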
Results
The results indicate that the standard global Pearson r metric is misleading, as it conflates different signals, leading to an overestimation of model performance. The per-drug Pearson r shows that no drug representation improves upon the baseline established by cell-only features. Mechanism-stratified training significantly enhances predictive accuracy for targeted kinase inhibitors, demonstrating the importance of training distribution alignment.
Implications
The findings have significant implications for precision oncology, suggesting that improving drug sensitivity predictions requires a focus on training distribution rather than solely on enhancing drug representations. This could lead to more effective strategies for predicting patient responses to cancer therapies.
Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions
Computer Vision
Theory
Optimization
- Hard-label delivery methods can improve learning outcomes when annotations are sparse.
- Multipass and SLS methods match soft-label training when full annotator distributions are available.
- The preservation of the example-to-distribution match is crucial for effective learning.
- SLS and soft-label cross-entropy optimize the same expected objective, allowing for clearer comparisons.
Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions
Summary
This paper investigates the impact of using hard-label delivery methods in the context of supervised learning with multiple annotations per example, particularly when annotators disagree. The authors propose two primary hard-label methods: multipass, which cycles through observed votes while maintaining a fixed dataset size, and stochastic label sampling (SLS), which samples one label per example at the beginning of each training epoch. The study is conducted on the CIFAR-10H dataset, revealing that hard-label delivery outperforms soft-label training when only a limited number of annotations per example are available, especially when the sparse empirical target diverges from the full annotator distribution. When complete annotator distributions are accessible, both hard-label methods perform comparably to soft-label training. The paper also highlights that SLS and soft-label cross-entropy optimize the same expected objective, providing insights into the optimization process. The findings suggest that multipass is a robust default method when raw vote counts are available, while SLS serves as a lightweight alternative that remains competitive under limited annotation conditions.
Methodology
The authors compare hard-label delivery methods (multipass and SLS) against soft-label training while keeping the annotator target fixed. They conduct experiments on the CIFAR-10H dataset and utilize controls to assess the impact of traversal order and the pairing of examples with their annotator distributions.
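Both delivery methods can be sketched from raw vote counts. SLS draws one label per example from its empirical annotator distribution at the start of each epoch. The multipass version here assigns the vote at position `epoch % n_votes`, which is one plausible reading of "cycling through observed votes" and should be treated as an assumption, as should all names.

```python
import numpy as np

def sls_labels(vote_counts, rng):
    """Stochastic label sampling: one label per example, drawn from
    that example's empirical annotator distribution."""
    counts = np.asarray(vote_counts, dtype=float)
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])

def multipass_labels(vote_counts, epoch):
    """Cycle through each example's observed votes, one vote per epoch,
    so the dataset size stays fixed."""
    labels = []
    for counts in vote_counts:
        votes = np.repeat(np.arange(len(counts)), counts)  # expand counts to votes
        labels.append(votes[epoch % len(votes)])
    return np.array(labels)

# Two examples, three classes: unanimous votes and a 1-vs-2 split.
vote_counts = [[3, 0, 0], [1, 2, 0]]
rng = np.random.default_rng(0)

sls_epoch_labels = sls_labels(vote_counts, rng)
multipass_sequence = [multipass_labels(vote_counts, e).tolist() for e in range(3)]
print(sls_epoch_labels, multipass_sequence)
```

Averaged over epochs, the SLS labels follow the annotator distribution, which is why its expected cross-entropy matches the soft-label objective.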
Results
The experiments demonstrate that hard-label delivery is generally superior to soft-label training in scenarios with limited annotations, particularly when the empirical target is far from the full distribution. In cases where full annotator distributions are available, both hard-label methods perform on par with soft-label training. The controls indicate that the example-to-distribution match significantly influences the results.
Implications
The findings suggest that hard-label delivery methods could be effectively utilized in scenarios with limited annotations, potentially improving model performance in various applications involving human annotator distributions. This could have implications for tasks in computer vision and other domains where multiple annotations are common.
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment
Generative Models
- Introduction of REPA-P, a teacher-free framework for aligning intermediate representations with physical states.
- Demonstration of improved convergence and reduced physics residuals across multiple PDE tasks.
- Validation of the hypothesis that aligning latent features with physical quantities enhances model robustness.
- Architecture-agnostic approach applicable to both U-Net and Diffusion Transformer backbones.
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment
Summary
This paper addresses the challenge of shortcut learning in physics-informed diffusion models, which typically enforce partial differential equation (PDE) constraints only on final outputs, leaving intermediate representations unconstrained. The authors propose a novel framework called REPA-P, which aligns intermediate features with physical states by using lightweight projection heads to decode hidden activations into physical quantities. This method applies PDE residual losses during training, effectively guiding the model to internalize physical laws rather than relying on spurious correlations. The framework is architecture-agnostic and can be applied to various backbone models, including U-Net and Diffusion Transformer. The authors validate REPA-P across four PDE tasks: Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow. The results demonstrate that REPA-P accelerates convergence by up to 2×, reduces physics residuals by up to 66.4%, and improves out-of-distribution robustness by up to 49.3%. The findings indicate that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses, suggesting a significant advancement in the robustness and generalizability of scientific generative models.
Methodology
The authors developed REPA-P by inserting lightweight 1×1 projection heads into intermediate layers of diffusion models. These heads decode latent features into physical state variables, and PDE residuals are computed on these decoded states to provide supervision during training. This approach encourages the model to internalize physical laws and enhances the learning dynamics by enforcing physical decodability constraints.
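The mechanism can be sketched in a few lines: a per-pixel linear projection decodes a feature map into a scalar field, and a finite-difference PDE residual on that field becomes the alignment loss. The Laplace equation, the five-point stencil, and all names below are assumptions chosen for illustration; the paper's PDEs and discretization are not described in this summary.

```python
import numpy as np

def project_1x1(features, weight, bias):
    """Per-pixel linear head. features: (C, H, W), weight: (C_out, C)."""
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

def laplace_residual(u, h=1.0):
    """Five-point finite-difference Laplacian on interior grid points."""
    return (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
            - 4.0 * u[1:-1, 1:-1]) / h**2

def physics_alignment_loss(features, weight, bias):
    """Mean squared PDE residual of the decoded field (Laplace equation)."""
    u = project_1x1(features, weight, bias)[0]
    return float(np.mean(laplace_residual(u) ** 2))

# Toy "hidden activations": two channels holding x^2 and y^2 on an 8x8 grid.
ys, xs = np.mgrid[0:8, 0:8].astype(float)
features = np.stack([xs**2, ys**2])
bias = np.zeros(1)

# Decoding to x^2 - y^2 gives a harmonic field, so the residual vanishes.
loss_aligned = physics_alignment_loss(features, np.array([[1.0, -1.0]]), bias)
# Decoding to x^2 + y^2 violates the Laplace equation.
loss_misaligned = physics_alignment_loss(features, np.array([[1.0, 1.0]]), bias)
print(loss_aligned, loss_misaligned)
```

In training, gradients of such a loss would flow into both the head and the backbone features, which is how intermediate layers receive physics supervision.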
Results
REPA-P was tested on four PDE tasks, achieving up to 2× faster convergence, a reduction of physics residuals by up to 66.4%, and an improvement in out-of-distribution robustness by up to 49.3%. The results indicate that the model effectively internalizes physical laws, leading to better generalization and performance compared to standard baselines.
Implications
The findings suggest that integrating physical principles into the training of generative models can significantly enhance their robustness and generalization capabilities, making them more reliable for scientific applications. This approach may pave the way for more effective AI tools in scientific research and engineering, where understanding underlying physical mechanisms is crucial.
Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach
Multimodal
- CoMET framework allows for multimodal classification without fine-tuning.
- PCA is sufficient for effective dimensionality reduction in embeddings.
- PALPooling improves representation quality without backpropagation.
- Achieves state-of-the-art results across various multimodal benchmarks.
Summary
This paper introduces CoMET (Composing Modality Encoders with Tabular foundation models), a novel framework for multimodal classification that operates without the need for fine-tuning. The authors propose a method where each modality is processed through a frozen pre-trained backbone, and the resulting embeddings are compressed using PCA before being concatenated and fed into a Tabular Foundation Model (TFM) for prediction. The study reveals that PCA alone can effectively serve as an adaptor, yielding robust performance across various modalities. To address issues with CLS tokens misaligning with downstream tasks, the authors introduce PALPooling, a lightweight adaptive pooling technique that enhances representation quality without requiring backpropagation. The results demonstrate that CoMET achieves state-of-the-art performance on diverse multimodal benchmarks, enabling efficient classification even in hierarchical tasks with extensive class spaces, all while managing large datasets without fine-tuning. This work challenges the conventional reliance on complex training pipelines for multimodal learning, suggesting that a modular approach can be both simple and powerful.
Methodology
The methodology involves processing each modality through a frozen pre-trained backbone to generate embeddings, which are then reduced in dimensionality using PCA. These embeddings are concatenated and input into a Tabular Foundation Model for classification. PALPooling is introduced as an adaptive pooling method that enhances the quality of representations without requiring fine-tuning.
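The pipeline described above (frozen embeddings, per-modality PCA, concatenation, tabular predictor) can be sketched as follows. This is a hedged illustration only: the embeddings are synthetic, the component count is arbitrary, and a nearest-centroid classifier stands in for the Tabular Foundation Model, which is not reproduced here. PALPooling is also omitted because the summary does not specify it.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA via SVD; returns the mean and the top principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def compose(embeddings, adaptors):
    """Project each modality with its own PCA adaptor, then concatenate."""
    parts = [(X - mean) @ comps.T for X, (mean, comps) in zip(embeddings, adaptors)]
    return np.concatenate(parts, axis=1)

def fit_centroids(X, y):
    """Stand-in for the tabular model (assumption): one centroid per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict(X, classes, centroids):
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]

# Synthetic "frozen backbone" outputs for two modalities (assumption).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
image_emb = rng.normal(size=(200, 64)) + 5.0 * labels[:, None]
text_emb = rng.normal(size=(200, 32)) - 5.0 * labels[:, None]

train, test = slice(0, 150), slice(150, 200)
# PCA is fit on training rows only and reused unchanged at test time.
adaptors = [fit_pca(E[train], 8) for E in (image_emb, text_emb)]
X_train = compose([image_emb[train], text_emb[train]], adaptors)
X_test = compose([image_emb[test], text_emb[test]], adaptors)

classes, centroids = fit_centroids(X_train, labels[train])
accuracy = float((predict(X_test, classes, centroids) == labels[test]).mean())
```

Nothing in this sketch uses gradients: the only fitted quantities are the PCA bases and the class centroids, which mirrors the no-fine-tuning spirit of the approach.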
Results
The proposed CoMET framework achieves state-of-the-art performance on multimodal classification tasks, demonstrating strong robustness and efficiency. It successfully handles datasets with over 500,000 samples and 2,000 classes without any fine-tuning, showcasing the effectiveness of the PCA and PALPooling techniques.
Implications
The findings suggest that complex end-to-end training pipelines may not be necessary for effective multimodal classification. This opens up new avenues for applying modular approaches in various domains, potentially reducing the time and resources required for model training and deployment.
TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health
Time Series
Reinforcement Learning
Large Language Models
- TimeSRL introduces a two-stage framework for time-series behavioral modeling that enhances generalizability.
- The model uses semantic abstractions to improve reasoning over longitudinal behavioral data.
- TimeSRL achieves state-of-the-art performance in mental health prediction, outperforming traditional ML and LLM methods.
- The approach demonstrates robustness against distribution shifts across different datasets.
Summary
The paper introduces TimeSRL, a novel two-stage framework designed to enhance the generalizability of time-series behavioral modeling, particularly in the context of mental health prediction. Traditional machine learning models often struggle with distribution shifts across datasets, leading to performance decay. TimeSRL addresses this issue by employing a semantic bottleneck that abstracts raw numerical data into natural language descriptions before making predictions. This two-stage process consists of: (1) semantic abstraction, where numerical signals are converted into high-level behavioral descriptions, and (2) semantic inference, where outcomes are predicted based solely on these abstractions. The authors optimize this framework using Group Relative Policy Optimization (GRPO) combined with Reinforcement Learning from Verifiable Rewards (RLVR), allowing the model to learn effective abstractions without requiring gold-standard annotations. The case study on mental health prediction demonstrates that TimeSRL significantly outperforms traditional ML and LLM baselines, achieving state-of-the-art results in cross-cohort generalization, thus highlighting the potential of semantic abstractions in behavioral modeling.
Methodology
The methodology involves a two-stage process: first, raw numerical behavioral data is abstracted into natural language descriptions (semantic abstraction), and then predictions are made based on these abstractions (semantic inference). The optimization is performed using Group Relative Policy Optimization (GRPO) and Reinforcement Learning from Verifiable Rewards (RLVR), allowing the model to learn without gold intermediate annotations.
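The group-relative part of GRPO can be illustrated with a small pure-Python sketch: several candidate outputs are sampled for the same input, each is scored with a checkable reward, and rewards are standardized within the group to form advantages. The reward used here (negative absolute error against a known score) and the example numbers are assumptions for illustration; the summary does not state the exact reward TimeSRL uses.

```python
from statistics import mean, pstdev

def verifiable_reward(predicted, target):
    """Hypothetical verifiable reward: closer predictions score higher."""
    return -abs(predicted - target)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled predictions for one participant whose true score is 11 (made up).
target = 11.0
sampled_predictions = [7.0, 10.0, 12.0, 15.0]

rewards = [verifiable_reward(p, target) for p in sampled_predictions]
advantages = group_relative_advantages(rewards)
```

Because the advantages depend only on how each sample compares with its siblings, no gold annotation of the intermediate natural-language description is needed; only the final prediction has to be checkable.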
Results
TimeSRL reduced mean absolute error (MAE) by 3.1–10.1% for anxiety and 3.2–9.6% for depression compared to strong non-LLM ML and LLM baselines. It also demonstrated significant improvements in cross-benchmark transfer, rivaling within-domain performance without the need for target-domain fine-tuning.
Implications
The findings suggest that leveraging semantic abstractions can lead to more robust and generalizable models for behavioral prediction, particularly in mental health applications. This approach could be extended to other domains where longitudinal data is collected, enhancing the reliability of predictions across diverse populations.
Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases
Optimization
Theory
Efficient ML
- The small-vs-large gap exists across various tasks and architectures, indicating that fewer samples can lead to faster learning.
- Sampling biases from smaller datasets enhance optimization by modulating layer-wise updates, facilitating quicker convergence.
- Empirical evidence shows that even random labels can yield speedups similar to those with real labels, underscoring the role of sampling bias.
- Adjustments to initialization and learning rates can significantly reduce the small-vs-large gap, highlighting the importance of parameter-wise interventions.
Summary
This paper explores the phenomenon known as the 'small-vs-large gap', where training on smaller datasets with repeated samples can lead to faster convergence and reduced computational costs compared to larger datasets. The authors argue that this speedup is due to sampling biases that enhance layer-wise growth during training, particularly in reasoning tasks. Through both theoretical analysis and empirical evidence, the study confirms that the small-vs-large gap exists across various tasks, architectures, and optimizers, challenging the conventional belief that more data always leads to better performance. The authors provide insights into how sampling biases can modulate the optimization process, making models more robust to hyperparameter choices. The findings suggest that leveraging smaller datasets with increased repetitions can be a proactive strategy for optimization rather than merely a fallback in data-scarce situations.
Methodology
The authors conducted a series of theoretical analyses and empirical experiments across different algorithmic tasks, architectures, and optimizers. They examined the effects of dataset size and repetition on convergence rates and computational efficiency, utilizing both mini-batch and full-batch updates. Theoretical insights were formalized to explain the observed phenomena, and various parameter-wise interventions were tested to validate their hypotheses.
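The comparison of dataset size against repetition under a fixed update budget can be made concrete with a small index sampler. This is only a sketch of the experimental setup as described in the summary: the sizes, batch size, and step count are made-up values, and `index_batches` is a hypothetical helper.

```python
from collections import Counter
import random

def index_batches(dataset_size, batch_size, num_steps, seed=0):
    """Yield index batches for a fixed number of steps, reshuffling each epoch."""
    rng = random.Random(seed)
    order = []
    for _ in range(num_steps):
        batch = []
        while len(batch) < batch_size:
            if not order:  # start a new epoch over the same examples
                order = list(range(dataset_size))
                rng.shuffle(order)
            batch.append(order.pop())
        yield batch

# Same compute budget for both runs: 100 steps of 16 examples = 1,600 draws.
budget = dict(batch_size=16, num_steps=100)
small_counts = Counter(i for b in index_batches(64, **budget) for i in b)
large_counts = Counter(i for b in index_batches(1600, **budget) for i in b)
```

With these settings the small run revisits each of its 64 examples 25 times, while the large run sees each of its 1,600 examples once, so any difference in convergence comes from repetition rather than from extra updates.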
Results
The study found that training on smaller datasets with repeated samples consistently resulted in faster convergence and lower computational costs across multiple settings. The authors demonstrated that the small-vs-large gap is not adequately explained by existing theories, and they provided empirical evidence that supports their claims about the influence of sampling biases on optimization processes.
Implications
The findings suggest that practitioners in machine learning can benefit from strategically using smaller datasets with increased repetitions, particularly in scenarios where data is limited. This approach may lead to more efficient training processes and improved performance in reasoning tasks, challenging the traditional view that larger datasets are always preferable.
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
Large Language Models
NLP
Theory
- Most Transformer modifications do not transfer effectively at larger scales (1-3B parameters).
- Only two out of 20 modifications showed significant improvements at 1.2B, and one of those two failed to train stably at 3B.
- Downstream evaluation metrics are more reliable than pretraining perplexity for assessing model performance.
- The gap between validation loss and downstream task accuracy has increased for attention-output modifications.
Summary
This paper revisits the findings of Narang et al. (2021), which evaluated over 40 modifications to the Transformer architecture at a smaller scale (T5-base). The authors conduct a new study focusing on 20 modifications proposed after 2021, testing them at larger scales (1.2B and 3B parameters) under controlled conditions. The study emphasizes the importance of downstream evaluation metrics over pretraining perplexity, which has been shown to be an unreliable predictor of task performance. The results reveal that, similar to the original findings, most modifications do not transfer effectively, with only two modifications showing significant improvements at the 1.2B scale, and one of those two failing to train stably at the 3B scale. The authors also highlight the increased gap between loss and downstream performance, particularly for attention-output modifications. The paper concludes with recommendations for future architecture comparisons, emphasizing the need for noise-floor reporting, downstream evaluation, and cross-scale stability testing.
Methodology
The authors established a controlled benchmark by testing 20 post-2021 Transformer modifications at 1.2B and 3B parameters. They maintained strict iso-data, iso-compute, and iso-recipe controls, using a multi-seed baseline to establish a noise floor. The evaluation was conducted using the CLIMB-12 downstream benchmarks instead of pretraining loss.
Results
The study found that of the 20 modifications tested, only two showed significant improvements at 1.2B after Bonferroni correction, while one of these failed to train stably at 3B. The loss-downstream gap was significantly larger for attention-output modifications, indicating that a low validation loss does not guarantee good downstream performance.
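A noise floor combined with a Bonferroni correction can be sketched as follows. The summary does not say which statistical test the authors used, so the two-sided z-test against the multi-seed baseline is an assumption, and all scores and modification names below are invented for illustration.

```python
import math
from statistics import mean, stdev

def noise_floor(baseline_scores):
    """Mean and sample standard deviation of the baseline across seeds."""
    return mean(baseline_scores), stdev(baseline_scores)

def two_sided_p_value(score, mu, sigma):
    """p-value of a z-test treating seed-to-seed variation as Gaussian."""
    z = (score - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

def significant_after_bonferroni(scores, baseline_scores, alpha=0.05):
    """Flag modifications whose p-value beats alpha divided by the test count."""
    mu, sigma = noise_floor(baseline_scores)
    threshold = alpha / len(scores)
    return {name: two_sided_p_value(s, mu, sigma) < threshold
            for name, s in scores.items()}

# Hypothetical downstream accuracies (percent): five baseline seeds, three mods.
baseline = [50.0, 50.4, 49.6, 50.2, 49.8]
modifications = {"mod_a": 50.3, "mod_b": 51.5, "mod_c": 49.9}

flags = significant_after_bonferroni(modifications, baseline)
```

With 20 modifications the per-test threshold would be 0.05 / 20 = 0.0025, which is why a gain that looks meaningful in isolation can disappear once seed variance and multiple comparisons are accounted for.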
Implications
The findings suggest that many proposed modifications to Transformer architectures may not be effective at larger scales, highlighting the need for rigorous evaluation protocols. This has implications for researchers developing new architectures, as it emphasizes the importance of downstream task performance over pretraining metrics.
Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data
Theory
Efficient ML
Optimization
- Introduction of FLASH-MAX, a shallow neural network that predicts electromagnetic fields from sparse data.
- Each hidden neuron in FLASH-MAX represents an exact solution to Maxwell's equations, ensuring physical validity by construction.
- Achieves sub-1% relative validation error from about 1,000 observations in seconds, with zero PDE residual.
- Demonstrates that embedding governing structures into the model improves the trade-off between accuracy and optimization speed.
Summary
This paper introduces FLASH-MAX, a novel shallow neural network architecture designed for the rapid reconstruction of homogeneous electromagnetic fields from sparse data points. The architecture is unique in that each hidden neuron corresponds to an exact solution of Maxwell's equations, ensuring that the network inherently satisfies these governing equations throughout the training process. The authors demonstrate that FLASH-MAX can be trained end-to-end from approximately 1,000 sparse observations, achieving a relative validation error of less than 1% in a matter of seconds, while maintaining a zero PDE residual. This approach contrasts with traditional methods that enforce physical validity through loss functions, suggesting that embedding governing structures directly into the model can significantly enhance both accuracy and optimization speed. The paper also presents a universality theorem, confirming that this exact model class remains effective across arbitrary domains, and provides empirical evidence of its performance even with as few as 100 observations. Overall, FLASH-MAX represents a significant advancement in scientific machine learning, particularly in applications requiring the reconstruction of electromagnetic fields from limited data.
Methodology
The methodology involves constructing a shallow neural network where hidden neurons are exact solutions to Maxwell's equations. The architecture incorporates weight-sharing constraints that enforce the governing equations directly, allowing for efficient training and accurate predictions from sparse observations. The authors also provide a theoretical foundation through a universality theorem, ensuring the model's applicability across various domains.
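To illustrate what "exact by construction" can mean, the sketch below builds a field as a sum of vacuum plane waves, each of which satisfies Maxwell's equations for any parameter values. Plane waves are one familiar family of exact solutions; whether FLASH-MAX uses this family is not stated in the summary, so the parametrization, the natural units with c = 1, and all names here are assumptions.

```python
import numpy as np

C = 1.0  # speed of light in natural units (assumption)

def make_neuron(k, a, phase=0.0):
    """One plane-wave unit with wave vector k and raw amplitude a."""
    k = np.asarray(k, dtype=float)
    a = np.asarray(a, dtype=float)
    e0 = a - (a @ k) / (k @ k) * k   # drop the part along k, so div E = 0
    omega = C * np.linalg.norm(k)    # vacuum dispersion relation
    b0 = np.cross(k, e0) / omega     # magnetic amplitude tied to e0
    return k, e0, b0, omega, phase

def fields(neurons, x, t):
    """Electric and magnetic fields at point x and time t (sum over units)."""
    E, B = np.zeros(3), np.zeros(3)
    for k, e0, b0, omega, phase in neurons:
        c = np.cos(k @ x - omega * t + phase)
        E += e0 * c
        B += b0 * c
    return E, B

def curl(f, x, h=1e-4):
    """Central-difference curl of a vector field f at point x."""
    J = np.zeros((3, 3))  # J[i, j] = d f_i / d x_j
    for j in range(3):
        d = np.zeros(3)
        d[j] = h
        J[:, j] = (f(x + d) - f(x - d)) / (2 * h)
    return np.array([J[2, 1] - J[1, 2], J[0, 2] - J[2, 0], J[1, 0] - J[0, 1]])

def faraday_residual(neurons, x, t, h=1e-4):
    """Norm of curl E + dB/dt, which should vanish up to discretization error."""
    curl_E = curl(lambda p: fields(neurons, p, t)[0], x, h)
    dB_dt = (fields(neurons, x, t + h)[1] - fields(neurons, x, t - h)[1]) / (2 * h)
    return float(np.linalg.norm(curl_E + dB_dt))

neurons = [make_neuron([1.0, 2.0, 0.5], [0.3, -0.1, 0.8]),
           make_neuron([-0.7, 0.4, 1.5], [1.0, 0.2, -0.5], phase=0.9)]
residual = faraday_residual(neurons, np.array([0.2, -0.4, 0.7]), t=0.3)
```

Because the constraints are baked into how each unit is parametrized, fitting k, a, and phase to sparse observations can never leave the solution space, so no PDE residual term is needed in the loss.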
Results
FLASH-MAX achieves a relative validation error of less than 1% with around 1,000 sparse observations, and maintains a zero PDE residual. The model demonstrates robustness, achieving single-digit errors even with only 100 observations sampled from 3D space. The training process is significantly faster compared to traditional residual-based methods.
Implications
The findings suggest that integrating governing physical structures into machine learning models can lead to more efficient and accurate solutions in scientific applications, such as antenna measurements, electromagnetic diagnostics, and subsurface exploration. This approach may also influence future research in scientific machine learning by promoting the development of models that inherently satisfy physical laws.