AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Geodesics of Dynamic Graphs for Regime Change Detection
Graph Learning
Time Series
- Introduces a framework for detecting regime changes in dynamic graphs based on geodesics.
- Defines regimes as periods of coherent dynamics and regime changes as drifts in these dynamics.
- Utilizes graph regression methods to measure deviations from estimated geodesics.
- Outperforms existing change point detection methods in experiments on synthetic and real-world data.
Summary
This paper addresses the limitations of traditional change point detection methods in dynamic networks, which typically assume abrupt transitions between stationary states. The authors propose a framework that defines regimes as periods of coherent dynamics in temporal graphs, characterized by trajectories along geodesics in a defined graph space. Regime changes are then detected as significant drifts in dynamics, either toward new trajectories or with changes in pace. The methodology uses graph regression techniques to measure the cumulative distance of observed graph sequences from estimated geodesics, and this measure can be integrated with change point detection algorithms. In experiments on dynamic networks with varying trajectories and speeds, the method outperforms existing change point detection models. On mobility data from the Covid-19 pandemic, the model aligns change points more closely with external events than baseline methods do.
Methodology
The authors formulate the change point detection problem by identifying geodesics within observed graph sequences. They propose a regression-based framework that uses a Residual Sum of Squares (RSS) cost function to quantify the alignment of graph subsequences with geodesics connecting their endpoints. The methodology includes two implementations based on different graph metrics and strategies for sampling graphs from continuous geodesics.
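As a rough illustration of the RSS idea, the sketch below assumes graphs are dense adjacency matrices compared under the Frobenius (Euclidean) metric, in which the geodesic between two graphs is the straight line of interpolated matrices. The function names and the equal time spacing between snapshots are assumptions made for this example. It is not the authors' implementation, and the paper's two graph metrics may behave differently.

```python
def frobenius_sq(a, b):
    """Squared Frobenius distance between two equally sized matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def interpolate(g0, g1, s):
    """Point at fraction s in [0, 1] on the straight line from g0 to g1."""
    return [[(1 - s) * x + s * y for x, y in zip(r0, r1)]
            for r0, r1 in zip(g0, g1)]


def geodesic_rss(graphs):
    """Residual sum of squares of a graph subsequence against the
    (assumed Euclidean) geodesic connecting its first and last graph.
    Snapshots are assumed to be equally spaced in time."""
    n = len(graphs)
    if n < 2:
        raise ValueError("need at least two graphs")
    start, end = graphs[0], graphs[-1]
    return sum(frobenius_sq(g, interpolate(start, end, t / (n - 1)))
               for t, g in enumerate(graphs))


# A sequence that moves along the geodesic at constant pace has zero cost.
on_path = [[[0.0, 0.0], [0.0, 0.0]],
           [[0.0, 0.5], [0.5, 0.0]],
           [[0.0, 1.0], [1.0, 0.0]]]
# Reaching the endpoint early (a change in pace) is penalised.
off_pace = [on_path[0], on_path[2], on_path[2]]
print(geodesic_rss(on_path), geodesic_rss(off_pace))
```

A low cost means the subsequence is well explained by a single geodesic, so a change point detector could use this value as its segment cost.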
Results
The empirical study shows that the proposed method consistently outperforms established change point detection techniques on synthetic data and effectively identifies meaningful shifts in real-world mobility data during the Covid-19 pandemic, correlating with major lockdown events.
Implications
This work provides a powerful tool for analyzing dynamic networks in various applications, such as social networks and communication systems, where understanding gradual changes in relationships is crucial. It enhances the ability to detect regime changes in complex temporal graph data, potentially informing decision-making processes in real-world scenarios.
Accelerating Multi-Objective Bayesian Optimisation via Predictive-Gradient Catalysts
Optimization
- Introduction of a catalytic framework for MOBO that leverages GP predictive gradients.
- Development of two catalyst instantiations: MGDA-based and predefined-weight approaches.
- Integration of catalytic signals with standard Pareto-compliant acquisition functions.
- Demonstrated significant acceleration in convergence on the DTLZ benchmark suite.
Summary
This paper introduces a novel acceleration mechanism for multi-objective Bayesian optimisation (MOBO) that utilizes Gaussian process predictive gradients as auxiliary signals to enhance the performance of existing acquisition functions. The proposed framework does not replace these functions but augments them with local stationarity information derived from surrogate gradients, facilitating faster convergence towards the global Pareto set, especially under limited evaluation budgets. Two catalyst instantiations are explored: an adaptive catalyst based on the Multiple-Gradient Descent Algorithm (MGDA) and a predefined-weight variant for focused exploration. The authors conducted extensive experiments on the DTLZ benchmark suite, demonstrating that the predictive gradient catalysis significantly accelerates convergence compared to traditional acquisition functions, particularly in stationary problem settings. This work addresses the gap in the use of gradient information in multi-objective settings, providing a more efficient approach to exploring the Pareto front in complex optimization problems.
Methodology
The authors propose a framework that augments existing acquisition functions with predictive gradients from Gaussian process models. They investigate two specific catalyst strategies: an adaptive MGDA-based approach and a predefined-weight variant. The integration of these catalysts with standard acquisition functions is achieved through augmented Tchebycheff scalarisation, allowing for efficient exploration of the Pareto set.
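For readers unfamiliar with the scalarisation mentioned above, here is a minimal sketch of the standard augmented Tchebycheff form for minimisation problems. It shows only the scalarisation, not the catalyst itself. The function name, the `rho` default, and the example values are assumptions made for illustration.

```python
def augmented_tchebycheff(objectives, weights, ideal, rho=0.05):
    """Augmented Tchebycheff scalarisation (minimisation).

    Combines the worst weighted gap to the ideal point with a small
    weighted-sum term scaled by rho. Assumes each objective value is at
    least as large as the corresponding ideal value.
    """
    terms = [w * (f - z) for f, w, z in zip(objectives, weights, ideal)]
    return max(terms) + rho * sum(terms)


# Two objectives with equal weights and an ideal point at the origin.
value = augmented_tchebycheff([1.0, 3.0], [0.5, 0.5], [0.0, 0.0], rho=0.1)
print(value)
```

The `max` term drives the search toward the Pareto front along the direction set by the weights, while the small sum term helps to rule out weakly Pareto-optimal points.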
Results
The experimental results indicate that the predictive gradient catalysis approach consistently outperforms traditional acquisition functions (such as EHVI, AugTch, tMPoI, SAF) in terms of convergence speed, particularly in scenarios where the surrogate models are accurate and the problems are stationary. The findings suggest that the proposed methods can significantly enhance the efficiency of multi-objective optimization tasks.
Implications
This research has potential implications for various fields requiring multi-objective optimization, such as engineering design, machine learning, and simulation-based optimization. By improving the efficiency of MOBO, the proposed methods can lead to faster and more effective decision-making in complex scenarios where evaluations are costly.
Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
Reinforcement Learning
Large Language Models
Optimization
- CERO optimizes rollout allocation across epochs rather than within a single batch.
- The method uses Bayesian estimates to assess the value of additional rollouts for each prompt.
- CERO demonstrates a significant improvement in sample efficiency compared to traditional methods.
- Theoretical guarantees are provided, establishing a regret bound against offline benchmarks.
Summary
This paper addresses the inefficiencies in reinforcement learning (RL) post-training for large language models (LLMs) by proposing a novel method called Cross-Epoch Adaptive Rollout Optimization (CERO). Traditional approaches allocate a fixed number of rollouts per prompt, which fails to account for the varying training signals provided by different prompts. CERO formulates the problem as an online resource allocation challenge with prompt-level diminishing returns, allowing for adaptive rollout allocation under a fixed global budget. The method employs a Beta posterior to estimate each prompt's success probability and uses this to derive a utility function that captures diminishing returns. The authors develop a Fenchel-dual reformulation of the allocation problem and implement an online primal-dual algorithm to optimize rollout distribution across epochs. Theoretical analysis shows that CERO achieves an O(√K) regret bound against offline benchmarks. Empirical results demonstrate that CERO outperforms existing methods, such as GRPO, across various mathematical reasoning tasks, highlighting its effectiveness in improving sample efficiency in RL post-training.
Methodology
CERO employs a Bayesian approach to maintain a Beta posterior over each prompt's success probability. It constructs a concave utility function to represent diminishing returns on cumulative allocations. The algorithm utilizes a Fenchel-dual reformulation to derive an online primal-dual optimization strategy, updating prompt-level and budget-level variables through projected online gradient descent.
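The Beta posterior bookkeeping can be sketched as follows. The uniform Beta(1, 1) prior, the class name, and the use of the expected Bernoulli variance as a rough proxy for how informative a prompt still is are all assumptions made for this example. The paper's actual utility function and primal-dual updates are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta posterior over a prompt's success probability.

    Starts from a uniform Beta(1, 1) prior (an assumption).
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> None:
        """Conjugate update after observing rollout outcomes."""
        self.alpha += successes
        self.beta += failures

    @property
    def mean(self) -> float:
        """Posterior mean of the success probability."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def expected_variance(self) -> float:
        """E[p(1 - p)] under the posterior: highest for prompts whose
        outcome is still uncertain, near zero for prompts that almost
        always succeed or almost always fail."""
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total * (total + 1.0))


posterior = BetaPosterior()
posterior.update(successes=3, failures=1)
print(posterior.mean, posterior.expected_variance)
```

Prompts that are solved almost always or almost never contribute little learning signal, which is one intuition for spending fewer rollouts on them.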
Results
CERO consistently outperformed the GRPO method in experiments focused on mathematical reasoning problems, demonstrating enhanced sample efficiency. The theoretical analysis confirmed an O(√K) regret bound, indicating that CERO effectively optimizes rollout allocation under a fixed budget.
Implications
The findings suggest that adaptive rollout budgeting can significantly enhance the efficiency of RL post-training processes for LLMs. This approach could be applied to various domains requiring resource allocation in machine learning, potentially leading to improved performance in tasks that involve complex decision-making.
Generative Molecular Morphing for Flexible-Size Design via Unbalanced Optimal Transport
Generative Models
Graph Learning
- Morph is a flexible-size generative model that adapts the number of atoms during molecular generation.
- The model integrates existing structural priors, enhancing property steering in molecular design.
- Morph matches the performance of fixed-size models while offering superior sampling flexibility.
- The methodology utilizes unbalanced optimal transport for training geometric graphs of varying sizes.
Summary
This paper presents Morph, a novel flexible-size generative model designed for 3D molecular design that addresses the limitations of existing diffusion and flow-based models, which typically fix the number of atoms in molecular structures. The authors argue that many molecular properties are inherently linked to molecular size, making it crucial to capture the joint distribution of molecular properties and atom counts. Morph allows for dynamic adjustment of molecular size during the generation process, enabling the integration of structural priors and significantly enhancing the model's ability to steer toward high-reward samples. The methodology builds on flexible-size sequence generation techniques and employs unbalanced optimal transport for training. Empirical results demonstrate that Morph matches the performance of state-of-the-art fixed-size models while providing greater sampling flexibility and improved out-of-distribution generation capabilities.
Methodology
The authors introduce Morph, a discrete-continuous flow-based model that performs insertion and deletion operations to dynamically adjust the number of nodes in a geometric graph. The model employs a matching algorithm based on unbalanced optimal transport to align geometric graphs of different sizes, facilitating the training process. The generative dynamics are governed by a combination of continuous flow and discrete jumps, allowing for the transformation of prior graphs into valid molecular structures.
Results
Morph successfully matches the performance of existing fixed-size generative models in unconditional generation tasks. Additionally, it demonstrates enhanced property steerability and excels in generating out-of-distribution samples, showcasing its robustness and flexibility in molecular design.
Implications
The development of Morph has significant implications for the field of molecular design, enabling more efficient exploration of chemical space and potentially reducing the need for costly physical experiments. Its ability to generate flexible-sized molecular structures could accelerate the discovery of new compounds with desirable properties.
Tight list replicability bounds via a novel sphere covering theorem
Theory
- Introduces a novel sphere covering theorem that sharpens existing bounds on list replicability.
- Establishes that the list size for VC classes is at least d for accuracy parameters less than 1/2.
- Demonstrates that for large-margin half-spaces, the list replicability number can be exactly d or minimized to ⌈d/2⌉ + 1 depending on the margin.
- Provides a framework for understanding list replicability through topological properties of distribution spaces.
Summary
This paper addresses the concept of list replicability in learning theory, which formalizes reproducibility in randomized learning algorithms. The authors introduce a novel sphere covering theorem derived from the Borsuk-Ulam theorem, which provides sharper bounds on the relationship between list size and accuracy for various hypothesis classes. The main contribution is proving that if the d-sphere is covered by open sets contained in open hemispheres, then at least d + 1 of these sets must intersect. This result leads to improved lower bounds on list replicability, showing that for VC classes, the list size is at least d for accuracy parameters less than 1/2. For large-margin half-spaces, the authors demonstrate that the list replicability number is exactly d for margins less than 1/√2 and can be minimized to ⌈d/2⌉ + 1 for very large margins. The paper also explores the implications of these results for linear classifiers, establishing that their optimal list replicability is d, aligning with previous upper bounds.
Methodology
The authors utilize topological methods, specifically a novel sphere covering theorem, to derive bounds on list replicability. They analyze the intersection properties of open sets covering the d-sphere and apply these results to various hypothesis classes, including VC classes and large-margin half-spaces.
Results
The paper establishes that the list size for any list-replicable learner for a VC class is at least d for accuracy parameters less than 1/2. For large-margin half-spaces, it shows that the list replicability number is exactly d for margins less than 1/√2 and can be minimized to ⌈d/2⌉ + 1 for very large margins. Additionally, it confirms that linear classifiers have an optimal list replicability of d.
Implications
These findings enhance the understanding of reproducibility in machine learning, particularly in randomized algorithms. The results can inform the design of learning algorithms that require consistent outputs across multiple runs, with applications in various fields including differentially private learning and scientific reproducibility.
Scaling Laws for Behavioral Foundation Models over User Event Sequences
Theory
Optimization
Efficient ML
- The optimal size for the event embedder is approximately 2% of the total model parameters.
- Behavioral models initially require a data-heavy approach, transitioning towards the Chinchilla heuristic as compute increases.
- The choice of evaluation metrics is integral to determining the scaling laws and optimal training configurations.
- Negative sampling strategies evolve from being compute-focused to memory-focused at higher compute budgets.
Summary
This paper investigates the scaling laws applicable to behavioral foundation models, which are increasingly utilized for analyzing sequences of user actions in various domains such as recommendations and commerce. The study focuses on a two-part architecture comprising a feature-based event embedder and a decoder-only transformer. Through approximately 600 experimental runs across a range of training FLOPs (from 10^15 to 10^19), the authors explore the optimal configurations for model parameters, batch sizes, and negative sampling strategies. The findings reveal that a small embedder (approximately 2% of total parameters) is optimal across all tested compute budgets due to its efficiency in handling repeated items. The research also highlights that the compute-optimal data-to-parameter ratio shifts from being data-heavy at lower compute levels to aligning with the Chinchilla heuristic at higher levels. Furthermore, the evaluation metrics significantly influence the scaling laws, indicating that the choice of metrics can alter the compute-optimal training strategies. The paper emphasizes the importance of understanding these scaling laws to improve the training of behavioral foundation models, which differ fundamentally from traditional text-based models.
Methodology
The authors conducted a series of experiments using a two-part behavioral model architecture, consisting of a feature-based event embedder and a decoder-only transformer. They varied key parameters such as the embedder size, batch size, model/data allocation, and negative sampling counts across a wide range of compute budgets. The experiments were designed to assess the impact of these parameters on model performance using real interaction data.
Results
The study found that the compute-optimal event embedder size is consistently around 2% of the total parameters across different compute budgets. The data-to-parameter ratio decreased from approximately 340 at 10^15 FLOPs to about 36 at 10^19 FLOPs, aligning with established heuristics for text models. Additionally, the results indicated that the optimal number of negatives for sampling is metric-dependent and shifts towards memory constraints at higher compute levels.
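To make the reported ratios concrete, the sketch below converts a compute budget and a data-to-parameter ratio into a model size and a dataset size. It assumes the common transformer approximation of about 6 FLOPs per parameter per training token, and it reads the "data-to-parameter ratio" as training tokens (events) per parameter. Both are assumptions, and the paper's own FLOP accounting (for example for the embedder or negative sampling) may differ.

```python
import math


def compute_optimal_split(flops, tokens_per_param, flops_per_param_token=6.0):
    """Split a compute budget into (parameters, tokens).

    Solves flops = flops_per_param_token * N * D with D = tokens_per_param * N.
    The factor 6 is the usual dense-transformer approximation (assumed).
    """
    params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Ratios quoted in the summary: about 340 at 1e15 FLOPs, about 36 at 1e19 FLOPs.
for budget, ratio in [(1e15, 340.0), (1e19, 36.0)]:
    n_params, n_tokens = compute_optimal_split(budget, ratio)
    print(f"{budget:.0e} FLOPs: ~{n_params:.2e} params, ~{n_tokens:.2e} tokens")
```

Under these assumptions the shrinking ratio means that, as compute grows, a larger share of the budget goes to parameters rather than to additional data.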
Implications
The findings suggest that practitioners can leverage these scaling laws to optimize the training of behavioral foundation models, leading to more efficient and effective systems in applications such as recommendation engines, fraud detection, and user behavior analysis. Understanding the interplay between model architecture, compute resources, and evaluation metrics can significantly enhance model performance.
Steering Vectors are an Adversarial Attack Surface
NLP
Large Language Models
Theory
- Identification of contrastive steering datasets as a novel attack surface.
- Demonstration of a stealthy data poisoning attack that aligns steering vectors with anti-refusal directions.
- Validation of the attack across multiple model families and attributes, achieving a significant increase in attack success rates.
- Proposition of a defense mechanism that mitigates the attack's effectiveness while preserving benign model behavior.
Summary
This paper investigates the vulnerabilities associated with activation steering in Large Language Models (LLMs), a technique that allows users to control model behavior without fine-tuning. The authors reveal that the practice of sharing steering datasets and precomputed vectors introduces a significant attack surface for stealthy data poisoning. By subtly altering 4-6% of tokens in a contrastive dataset, an attacker can align the resulting steering vector with an anti-refusal direction, effectively jailbreaking the model while maintaining its intended behavior on benign prompts. The study demonstrates that this attack can be executed without detection, even when accompanied by cryptographic certificates that verify the integrity of the dataset. The authors validate their findings on two open-weight model families and eight model-attribute combinations, achieving a notable absolute attack success rate (ASR) of 20-55%, significantly higher than clean references. They also propose a defense mechanism that can recover approximately 82% of the ASR gap without compromising benign behavior.
Methodology
The authors employed a GCG-style optimization approach constrained to embedding-space neighbors, utilizing fluency penalties and safe-vocabulary filtering to create poisoned contrastive pairs. They tested the effectiveness of their attack on two open-weight model families and various model-attribute combinations, measuring the absolute attack success rate (ASR) and the impact on benign prompts.
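For context on what is being poisoned, a contrastive steering vector is commonly computed as the difference between mean activations on positive and negative examples. The sketch below shows that computation on toy activations, plus a cosine-similarity check against a reference direction. The function names, the toy numbers, and the idea of comparing against a known direction are illustrative assumptions. This is neither the paper's attack nor its proposed defense.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equally long activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Difference-of-means steering vector from contrastive activations."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-dimensional activations (hypothetical values).
positive = [[1.0, 0.0], [3.0, 0.0]]
negative = [[0.0, 0.0], [0.0, 2.0]]
vector = steering_vector(positive, negative)

# Hypothetical reference direction to compare the shared vector against.
reference_direction = [1.0, 0.0]
print(vector, cosine(vector, reference_direction))
```

Because the vector is an average over the dataset, small coordinated changes to many examples can shift its direction without any single example looking suspicious.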
Results
The study found that the poisoned vectors achieved an ASR of 20-55%, which is an increase of 19-51 percentage points over clean vectors. The attack was effective while maintaining the intended steering effect on benign prompts, with the ℓ2 norm of the vectors remaining largely unchanged across most combinations tested. Additionally, the proposed defense mechanism was able to recover about 82% of the ASR gap without negatively affecting benign behavior.
Implications
The findings highlight a critical vulnerability in the ecosystem of shared steering datasets for LLMs, suggesting that malicious actors could exploit this weakness to compromise model safety. The proposed defense mechanism offers a potential pathway to enhance the robustness of LLMs against such stealthy attacks, emphasizing the need for improved security measures in the deployment of activation steering techniques.
Federated Foundation Models over Vehicular Networks
Federated Learning
Multimodal
Robotics
- Introduction of M3T FedFMs as a novel approach for vehicular networks.
- Demonstration of M3T FedFMs' potential through a case study on the Waymo Open Dataset.
- Identification of unique challenges in deploying M3T FedFMs in vehicular environments.
- Release of implementation code to facilitate reproducibility and further research.
Summary
This paper explores the integration of multi-modal multi-task federated foundation models (M3T FedFMs) into vehicular networks, aiming to combine the capabilities of M3T foundation models with the privacy-preserving features of federated learning (FL). The authors introduce the training and fine-tuning principles of M3T FedFMs and discuss various use cases in vehicular networks, highlighting their potential to enhance vehicular intelligence. The paper identifies challenges related to the decentralized nature of data in vehicular environments and proposes future research directions to address these issues. A case study using the Waymo Open Dataset demonstrates the effectiveness of M3T FedFMs in real-world scenarios, and the authors provide their implementation to encourage further research in this area.
Methodology
The authors present a modular architecture for M3T FedFMs and conduct a case study using real-world vehicular data. They analyze the training and fine-tuning principles of M3T FedFMs and discuss the implications of decentralized data in vehicular networks.
Results
The case study demonstrates the promise of M3T FedFMs for improving vehicular intelligence, showcasing their ability to handle multi-modal data and perform multiple tasks effectively while preserving privacy through federated learning.
Implications
The integration of M3T FedFMs into vehicular networks could lead to advancements in autonomous driving, traffic management, and enhanced driver assistance systems, while maintaining user privacy and reducing communication overhead.
Data-efficient flood depth prediction through domain-aware coreset selection and tabular foundation models
Efficient ML
- Introduces a two-stage coreset selection process that stratifies data by storm return period and spatial structure.
- Achieves competitive flood depth prediction accuracy with only 0.7% of the training data typically required.
- Demonstrates the ability to predict flood depths in held-out watersheds without task-specific retraining.
- Shows that the model can extrapolate effectively to out-of-distribution storms while maintaining accuracy.
Summary
This paper presents a novel approach for flood depth prediction that emphasizes data efficiency and transferability across watersheds. The authors introduce a domain-aware coreset construction pipeline that stratifies storm events by return period and affected watersheds, allowing for the selection of a representative subset of training data. By utilizing a tabular foundation model (TFM) conditioned on this coreset, the model achieves competitive performance with significantly less training data compared to traditional supervised models. The methodology enables in-context learning, allowing the model to make predictions for new watersheds without the need for retraining. The results demonstrate that the proposed method can accurately predict flood depths across multiple watersheds, even for out-of-distribution storm events, thus providing a scalable solution for real-time flood forecasting.
Methodology
The authors propose a two-stage coreset construction pipeline that first stratifies storm events based on return period and affected watersheds. This is followed by a spatial selection process using a target-aware facility-location strategy to sample hexagons from the dataset. A tabular foundation model is then conditioned on this coreset during inference, allowing for predictions without the need for per-watershed training.
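The spatial selection step can be pictured with a plain greedy facility-location routine, sketched below. It picks items so that every item in the pool is close to at least one selected item. The similarity function, the one-dimensional toy coordinates, and the function name are assumptions made for this example. The paper's "target-aware" variant and its hexagon features are not reproduced here.

```python
def greedy_facility_location(similarity, k):
    """Greedily select up to k indices maximising
    sum_i max_{j in selected} similarity[i][j].

    Assumes a square matrix of non-negative similarities.
    """
    n = len(similarity)
    selected = []
    best = [0.0] * n  # best similarity of each item to the current selection
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain in coverage if candidate j were added.
            gain = sum(max(similarity[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best = [max(best[i], similarity[i][best_j]) for i in range(n)]
    return selected


# Toy 1-D locations forming two well-separated clusters (hypothetical).
points = [0.0, 0.1, 0.2, 10.0, 10.1]
sim = [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]
chosen = greedy_facility_location(sim, k=2)
print(chosen)
```

With two picks the routine takes one representative from each cluster, which is the behaviour that makes facility location attractive for building a small but spatially representative coreset.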
Results
The model achieved a mean R² of 0.663 across nine Houston-area watersheds, about 98.5% of the performance of a fully supervised reference model. It successfully transferred to held-out watersheds and outperformed a coreset-trained supervised baseline on real storm events, demonstrating its robustness and efficiency.
Implications
This research has significant implications for emergency management and infrastructure resilience, as it provides a scalable and efficient method for flood depth prediction. The ability to make accurate predictions across different watersheds without extensive retraining can enhance real-time decision-making during flood events.
Sparsely gated tiny linear experts
NLP
Large Language Models
Efficient ML
- Introduction of sparsely gated linear neurons (sgatlin) for transformer models.
- Significant improvements in compute efficiency and interpretability by reducing experts to single neurons.
- Demonstrated competitive performance in language modeling perplexity across compute budgets.
- Interpretability study reveals semantically structured clusters in sgatlin feedforward circuits.
Summary
This paper introduces a novel approach to enhancing the efficiency and interpretability of transformer models by utilizing sparsely gated linear neurons (sgatlin). The author argues that while mixture of experts (MoE) models have become increasingly sparse, the individual experts remain large and dense, which limits compute efficiency. By reducing each expert to a single neuron and employing a gating mechanism to select a small subset of these neurons, the sgatlin architecture achieves significant improvements in computational efficiency without the nonlinearities typically associated with expert models. The paper demonstrates that replacing traditional transformer feedforward layers with sgatlin layers leads to improved perplexity in language models across various compute budgets. Additionally, the linearity and sparsity of sgatlin layers facilitate model interpretability, allowing for the identification of semantically structured clusters within the feedforward circuits without the need for additional training. The findings suggest a promising direction for developing more efficient and interpretable transformer architectures.
Methodology
The sgatlin architecture replaces traditional feedforward layers in transformers with a layer that consists of a gating network and a large pool of linear neurons. The gating network selects a sparse subset of neurons based on computed scores, allowing for efficient linear combinations of their outputs. The architecture is evaluated through isoflop comparisons using the SlimPajama 627B dataset, focusing on language modeling perplexity.
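One way to picture such a layer is the toy forward pass below: a gate scores every neuron, the top-k neurons are kept, and their linear outputs are mixed. The softmax over the selected scores, the separate input and output vector per neuron, and all names and numbers are assumptions made for illustration. The summary does not specify these details, so this should not be read as the author's architecture.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sparse_linear_layer(x, gate_w, in_w, out_w, k):
    """Toy sparsely gated layer of single linear neurons.

    gate_w[i], in_w[i]: per-neuron vectors in input space.
    out_w[i]: per-neuron vector in output space.
    Only the k highest-scoring neurons contribute to the output.
    """
    scores = [dot(g, x) for g in gate_w]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected neurons (an assumed choice).
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    out = [0.0] * len(out_w[0])
    for i, e in zip(top, exps):
        activation = (e / total) * dot(in_w[i], x)  # no nonlinearity on the neuron
        for d in range(len(out)):
            out[d] += activation * out_w[i][d]
    return out, top


x = [1.0, 0.0]
gate_w = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
in_w = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_w = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
output, active = sparse_linear_layer(x, gate_w, in_w, out_w, k=1)
print(output, active)
```

Because each selected neuron is linear, its contribution to the output can be read off directly from its input and output vectors, which is the property the interpretability claims rely on.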
Results
The sgatlin layers show improved perplexity in language models compared to traditional feedforward layers across various compute budgets. The architecture allows for less than 0.1% of parameters to be activated per token, enhancing compute efficiency. The interpretability study indicates that the feedforward circuits form semantically structured clusters and are causally linked to factual recall.
Implications
The findings suggest that sgatlin could lead to the development of more efficient transformer architectures that maintain high performance while being easier to interpret. This could have significant implications for applications in natural language processing and other areas where model transparency is crucial.
Causal Modeling of Selection in Evolution
Theory
Graph Learning
- Distinction between static and evolutionary selection is crucial for accurate causal discovery.
- Existing models fail to correctly represent data influenced by evolutionary selection.
- A new model is proposed to specifically address evolutionary selection mechanisms.
- The methodology allows for identification of selection models across multiple generations.
Summary
This paper addresses the critical role of selection in causal discovery, distinguishing between two forms of selection: static and evolutionary. Static selection is a one-time filtering process, while evolutionary selection involves repeated rounds of differential fitness affecting the data over generations. The authors argue that existing methods conflate these two forms, leading to inaccuracies in causal inference, particularly in evolutionary contexts. They introduce a new model specifically designed to characterize evolutionary selection and develop a comprehensive procedure for identifying such models from data across multiple environments or generations. Experimental results demonstrate the effectiveness of their approach in uncovering the mechanisms underlying evolutionary processes from observed data.
Methodology
The authors developed a new causal model that differentiates between static and evolutionary selection. They created a sound and complete procedure for identifying these models from data, which can be applied across various environments or generations. The methodology includes experimental validation to assess the model's performance in revealing underlying evolutionary mechanisms.
Results
The experimental results confirmed that the proposed model effectively identifies the relevant mechanisms of evolutionary selection, outperforming traditional static selection models in terms of accuracy and reliability in causal inference.
Implications
The findings have significant implications for fields such as evolutionary biology, social sciences, and any domain where understanding the dynamics of selection is crucial. The new model can enhance causal discovery processes and improve the design of interventions based on evolutionary insights.
Accelerating Reproducible Research in Synthetic EHR Generation
Generative Models
- Introduces a unified benchmarking framework for synthetic EHR generation.
- Reimplements and standardizes existing generative models for better comparison.
- Develops a privacy-utility evaluation suite applicable to various architectures.
- Addresses reproducibility challenges in the field of synthetic EHR generation.
Summary
This paper addresses the challenges of reproducibility in the generation of synthetic Electronic Health Records (EHR) by introducing a comprehensive benchmarking framework. The authors highlight the issues caused by disparate codebases, incompatible data loaders, and inconsistent evaluation protocols that hinder effective comparison of existing generative models. They propose a unified pipeline that encompasses data ingestion, standardized model training, and architecture-agnostic evaluation, specifically targeting the generation of longitudinal ICD diagnosis codes. The framework is built on the PyHealth library and includes reimplementations of established models such as MedGAN, CorGAN, PromptEHR, and HALO, along with a new lightweight GPT-2 baseline. A key contribution is the development of a privacy-utility evaluation suite applicable to both GAN and transformer-based generators, with results presented as bootstrapped confidence intervals. The authors also analyze the long-tailed performance issues of existing models and suggest that their framework can be extended beyond diagnosis codes, ultimately aiming to lower the barriers for researchers and promote community-driven reproducibility in synthetic EHR research.
Methodology
The authors developed a lightweight, end-to-end benchmarking framework that integrates data ingestion, model training, and evaluation. They reimplemented existing models under a unified architecture and created a privacy-utility evaluation suite that standardizes metrics across different generative approaches.
Results
The framework successfully allows for rigorous evaluation of synthetic EHR generation models, reporting bootstrapped confidence intervals for various metrics. The analysis revealed significant long-tailed performance issues in existing models, emphasizing the need for improved methodologies in synthetic data generation.
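The bootstrapped confidence intervals mentioned above can be illustrated with a minimal percentile-bootstrap sketch. The function name `bootstrap_ci` and the scores are hypothetical placeholders, not part of the framework described in the paper.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up per-run metric values for one generator (illustration only).
scores = [0.61, 0.58, 0.66, 0.70, 0.55, 0.63, 0.59, 0.68, 0.64, 0.60]
low, high = bootstrap_ci(scores)
point = statistics.mean(scores)
print(f"mean={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Reporting the interval rather than the point estimate alone makes it easier to tell whether two generators actually differ.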
Implications
This work has the potential to enhance reproducibility in synthetic EHR research, facilitating better comparisons between models and promoting advancements in the generation of high-fidelity synthetic patient data while maintaining privacy. It could lead to more robust applications in clinical research and healthcare analytics.
PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
Large Language Models
Optimization
Theory
- Introduction of the PC layer for polynomial weight preconditioning.
- Theoretical proof linking bounded weight spectrum to convergence in deep linear networks.
- Empirical validation showing improved training efficiency and accuracy in LLMs.
- No additional inference cost after training with the PC layer.
Summary
This paper introduces a novel preconditioning (PC) layer designed to enhance the training of large language models (LLMs) by ensuring stable weight conditioning. The PC layer utilizes a polynomial preconditioner to reshape the singular-value spectrum of weight matrices, which promotes better signal propagation and optimization during training. The authors demonstrate that the preconditioned weights can be seamlessly integrated back into the original architecture without incurring any inference overhead. The theoretical foundation of the PC layer is established by proving that bounding the singular values of each layer leads to geometric convergence of gradient descent to global minima in specific deep linear networks. Empirical evaluations on Llama-1B and Llama-271M models show that the PC layer significantly improves token efficiency and zero-shot downstream accuracy when compared to standard transformer architectures, using both AdamW and Muon optimizers. The findings suggest that controlling the weight spectrum is crucial for effective LLM training, providing a new avenue for optimization in neural network design.
Methodology
The authors propose a PC layer that applies polynomial preconditioning to weight matrices, reshaping their singular-value spectrum. This method avoids computationally expensive matrix decompositions by directly modifying the singular values through a polynomial function. The PC layer is integrated into the neural network architecture, allowing for automatic differentiation during training.
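How a polynomial can reshape singular values without a decomposition can be sketched as follows. If W = U diag(s) Vᵀ, then p(W Wᵀ) W = U diag(s · p(s²)) Vᵀ, so only matrix products are needed. The function name and the coefficients are assumptions for illustration (the coefficients are the classical Newton-Schulz step, which pulls singular values toward 1); the summary does not state which polynomial the paper uses.

```python
import numpy as np

def polynomial_precondition(W, coeffs):
    """Return p(W W^T) W, where p(t) = sum_k coeffs[k] * t**k.

    If W = U diag(s) V^T, the result is U diag(s * p(s**2)) V^T, so the
    singular values are reshaped using matrix products only (no SVD).
    """
    gram = W @ W.T
    power = np.eye(W.shape[0])      # (W W^T)^0
    acc = np.zeros_like(gram)
    for c in coeffs:
        acc += c * power
        power = power @ gram
    return acc @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6)) / np.sqrt(6)
coeffs = [1.5, -0.5]                # hypothetical p(t) = 1.5 - 0.5 t

W_pc = polynomial_precondition(W, coeffs)

# The SVD is used here only to check the claim, not to compute W_pc.
s = np.linalg.svd(W, compute_uv=False)
s_pc = np.linalg.svd(W_pc, compute_uv=False)
predicted = np.sort(np.abs(s * (1.5 - 0.5 * s**2)))[::-1]
print(np.allclose(s_pc, predicted))
```

Because the preconditioned matrix is just another dense matrix of the same shape, it can replace the original weight after training, which is consistent with the claim of no added inference cost.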
Results
The PC layer was tested on Llama-1B and Llama-271M models, resulting in a 2× reduction in training tokens needed to achieve the same loss with AdamW and a 1.13× improvement with Muon. Additionally, the PC layer enhanced zero-shot downstream accuracy and improved the conditioning of weight spectra.
Implications
The findings suggest that polynomial weight preconditioning can be a valuable technique for improving the training of large language models, potentially leading to more efficient training processes and better performance in various NLP tasks. This approach may also inspire further research into weight conditioning methods in deep learning.
The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models
NLP
Large Language Models
Theory
- Introduces a stereological framework for understanding benchmark coverage in LLMs.
- Identifies a significant structural blind spot in LLM evaluations that exceeds statistical noise.
- Develops a submodular greedy algorithm for efficient benchmark selection.
- Empirical results show that the effective dimensionality (d_eff) ranges from 2.86 to 4.80 across leaderboards.
Summary
This paper presents a stereological theory to analyze benchmark coverage for large language models (LLMs). It establishes that the visible Hausdorff distance between two convex capability profiles that yield the same scores is constrained by a formula involving the effective dimensionality (d_eff). Empirical studies on three leaderboards reveal that d_eff ranges from 2.86 to 4.80, indicating a structural blind spot that is larger than the observed score gaps. The author introduces a submodular greedy algorithm that identifies a stable core of benchmarks, demonstrating that a small subset can achieve substantial coverage. Additionally, the paper resolves Gardner’s Problem 1.5, providing a minimax rate for support functions in general dimensions. The findings underscore the limitations of current LLM evaluations and propose methods to enhance benchmark selection and coverage.
Methodology
The paper employs a theoretical framework based on stereology to analyze benchmark coverage, utilizing mathematical theorems to derive bounds on indistinguishability and effective dimensionality. Empirical analysis is conducted on multiple leaderboards to validate theoretical claims, and a submodular greedy algorithm is implemented to optimize benchmark selection.
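The greedy selection step can be sketched with a plain set-coverage objective. This is a stand-in: the paper's actual coverage function is not given in the summary, and the benchmark names and "capability cells" below are hypothetical.

```python
def greedy_select(coverage, budget):
    """Greedy maximization of a set-coverage objective.

    `coverage` maps each benchmark name to the set of capability cells it probes.
    Set coverage is monotone submodular, so under a cardinality budget the
    greedy solution is within a factor (1 - 1/e) of the optimum.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(
            (b for b in coverage if b not in chosen),
            key=lambda b: len(coverage[b] - covered),
            default=None,
        )
        # Stop early when no remaining benchmark adds new coverage.
        if best is None or not coverage[best] - covered:
            break
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical benchmarks and the cells they cover.
coverage = {
    "bench_a": {1, 2, 3, 4},
    "bench_b": {3, 4, 5},
    "bench_c": {5, 6},
    "bench_d": {1, 2},
}
core, covered = greedy_select(coverage, budget=3)
print(core, sorted(covered))
```

In this toy case two benchmarks already cover every cell, so the third slot of the budget goes unused.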
Results
The study finds that the effective dimensionality across various leaderboards is between 2.86 and 4.80, with a structural blind spot that is significantly larger than the observed score gaps. The submodular greedy algorithm successfully identifies a core set of benchmarks that can provide 90% coverage, with high retention across temporal quarters. The resolution of Gardner’s Problem 1.5 establishes a minimax rate for support functions in general dimensions.
Implications
The findings suggest that current LLM evaluations may be misleading due to inherent limitations in benchmark coverage. The proposed methods for benchmark selection could lead to more reliable assessments of model capabilities, ultimately improving the development and evaluation of LLMs.
TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection
Efficient ML
Optimization
Theory
- TorchKM integrates model selection with training, reducing computational overhead.
- The library supports a variety of kernel methods beyond SVM, including logistic regression and quantile regression.
- It utilizes GPU acceleration to enhance performance and efficiency in kernel machine learning tasks.
- The API is designed to be user-friendly, resembling the widely-used scikit-learn interface.
Summary
TorchKM is an open-source library designed for kernel machine learning, including support vector machines (SVM), kernel logistic regression, and kernel quantile regression, with a focus on GPU acceleration. The library adopts a scikit-learn-style API, making it user-friendly for those familiar with existing libraries. A significant innovation of TorchKM is its integrated model selection pipeline, which allows for simultaneous training and tuning of hyperparameters, thereby reducing computational costs associated with traditional separate training and tuning processes. The library leverages GPU-friendly linear algebra to optimize performance, featuring an exact cross-validation formula and a spectral algorithm that enhance the efficiency of the training and tuning pipeline. Benchmarks indicate that TorchKM achieves competitive predictive performance while providing substantial speedups compared to standard implementations like scikit-learn and ThunderSVM. The library is easily accessible via PyPI and includes comprehensive documentation.
Methodology
TorchKM employs algorithm-hardware co-design principles to optimize its computational algorithms for GPU execution. It features an integrated approach for model training and hyperparameter tuning, utilizing an exact cross-validation formula and a spectral algorithm to streamline the process. The library is built to handle various kernel methods and provides a consistent interface for users.
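The idea of pairing an exact cross-validation formula with a spectral computation can be illustrated on kernel ridge regression, where both are textbook results. This is an analogy only: the summary does not give TorchKM's formulas for SVM or quantile regression, and none of the names below are TorchKM's API.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def loo_errors(K, y, lambdas):
    """Exact leave-one-out MSE of kernel ridge regression for each lambda.

    One eigendecomposition K = Q diag(w) Q^T is shared across all lambdas.
    The hat matrix is H = Q diag(w / (w + lam)) Q^T and the LOO residual is
    (y_i - yhat_i) / (1 - H_ii), so no model is ever refit.
    """
    w, Q = np.linalg.eigh(K)
    Qty = Q.T @ y
    out = []
    for lam in lambdas:
        shrink = w / (w + lam)
        y_hat = Q @ (shrink * Qty)
        h_diag = np.einsum("ij,j,ij->i", Q, shrink, Q)
        out.append(np.mean(((y - y_hat) / (1.0 - h_diag)) ** 2))
    return np.array(out)

# Toy data (made up) and a small grid of regularization strengths.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=12)
lambdas = [0.01, 0.1, 1.0]
K = rbf_kernel(X)
cv = loo_errors(K, y, lambdas)
best_lambda = lambdas[int(np.argmin(cv))]
print(best_lambda, cv)
```

The expensive step (the eigendecomposition) is paid once, and each extra hyperparameter value costs only matrix-vector work, which is the kind of saving an integrated train-and-tune pipeline aims for.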
Results
The benchmarks conducted show that TorchKM not only matches the predictive performance of existing libraries but also offers substantial speed improvements, particularly in scenarios involving extensive hyperparameter tuning and cross-validation. This efficiency is achieved through the library's unique design that minimizes redundant computations.
Implications
TorchKM has the potential to revitalize the use of kernel machines in machine learning by making them more accessible and efficient, particularly in applications requiring high predictive accuracy and computational efficiency. Its GPU acceleration capabilities can significantly benefit large-scale data analysis and real-time applications.
A Sliced-Wasserstein Framework on Correlation Matrices for EEG Decoding
Time Series
- Introduction of a Sliced Wasserstein framework for EEG decoding using correlation matrices.
- Development of Pullback Euclidean Metric Sliced Wasserstein (PEMSW) for non-Euclidean spaces.
- Instantiation of Correlation Sliced-Wasserstein discrepancies using OLM and LSM.
- Demonstrated improved generalization in EEG decoding tasks with minimal computational overhead.
Summary
This paper introduces a novel framework for EEG decoding using Sliced Wasserstein (SW) discrepancies on correlation matrices, addressing the limitations of traditional covariance descriptors that are sensitive to scaling. The authors propose the Pullback Euclidean Metric Sliced Wasserstein (PEMSW) framework, which allows for the computation of SW distances on manifolds equipped with Pullback Euclidean Metrics. They specifically develop two Correlation Sliced-Wasserstein (CorSW) discrepancies based on the Off-Log Metric (OLM) and Log-Scaled Metric (LSM) for full-rank correlation matrices. The framework is applied to a domain generalization (DG) task in EEG decoding, demonstrating its effectiveness in improving generalization under distribution shifts. Experiments conducted on three EEG datasets show that the proposed method achieves better performance with low training overhead and no additional inference costs, highlighting its practical applicability in neuroscience and healthcare.
Methodology
The authors propose a general framework for Sliced Wasserstein discrepancies on manifolds with Pullback Euclidean Metrics. They derive two specific Correlation Sliced-Wasserstein discrepancies for full-rank correlation matrices using two correlation geometries (OLM and LSM). The framework is then applied to EEG decoding, focusing on domain generalization to handle distribution shifts effectively.
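A pullback construction of this kind can be sketched as "embed, then run ordinary sliced Wasserstein". The sketch below assumes the Off-Log map is the off-diagonal part of the matrix logarithm; that definition, the function names, and the synthetic data are assumptions, and the paper's CorSW may differ in details such as weighting.

```python
import numpy as np

def off_log(C):
    """Off-diagonal part of the matrix logarithm of an SPD correlation matrix."""
    w, Q = np.linalg.eigh(C)
    L = (Q * np.log(w)) @ Q.T
    return L - np.diag(np.diag(L))

def embed(mats):
    """Flatten the strict upper triangle of off_log(C) for each matrix."""
    iu = np.triu_indices(mats[0].shape[0], k=1)
    return np.stack([off_log(C)[iu] for C in mats])

def sliced_wasserstein(A, B, n_proj=200, seed=0):
    """Monte Carlo squared sliced 2-Wasserstein distance between equal-size point sets."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(A.shape[1], n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)
    pa = np.sort(A @ theta, axis=0)   # 1-D optimal transport reduces to sorting
    pb = np.sort(B @ theta, axis=0)
    return float(np.mean((pa - pb) ** 2))

rng = np.random.default_rng(1)

def random_corr(shift):
    """Synthetic 4-channel correlation matrix; `shift` couples channels 0 and 1."""
    X = rng.normal(size=(50, 4))
    X[:, 1] += shift * X[:, 0]
    return np.corrcoef(X, rowvar=False)

domain_a = embed([random_corr(0.0) for _ in range(20)])
domain_b = embed([random_corr(2.0) for _ in range(20)])
same = sliced_wasserstein(domain_a, domain_a)
diff = sliced_wasserstein(domain_a, domain_b)
print(same, diff)
```

Used as a training penalty between source domains, a discrepancy like `diff` adds cost only during training, which matches the summary's note that inference is unaffected.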
Results
The experiments on three EEG datasets indicate that the proposed CorSW discrepancies significantly enhance generalization performance compared to traditional methods. The approach maintains low training overhead and incurs no additional costs during inference, making it a practical solution for EEG decoding.
Implications
This work has significant implications for EEG analysis and decoding in neuroscience and healthcare, providing a robust method for handling noisy and non-stationary EEG signals. The framework can potentially be extended to other domains where correlation matrices are relevant, such as finance and brain-computer interfaces.
Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning
Large Language Models
NLP
Optimization
- Introduction of a perceptive LLM routing paradigm that learns user preferences interactively.
- Development of MetaRouter, a meta-learning framework for preference-aware LLM routing.
- Demonstrated superior performance of MetaRouter over strong baselines across multiple datasets.
- Showed efficiency in learning user preferences and adaptability to different LLMs.
Summary
This paper addresses the challenge of efficiently routing queries to large language models (LLMs) based on user-specific cost-performance preferences. The authors propose a novel perceptive LLM routing paradigm that learns users' implicit preferences through minimal interaction. The proposed MetaRouter framework employs meta-learning to treat distinct user preferences as separate tasks within a contextual bandit framework. This allows the system to rapidly adapt to varying user needs by inferring a latent preference representation from pairwise comparisons of LLM responses. The results demonstrate that MetaRouter outperforms existing routing methods on both in-distribution and out-of-distribution tasks, showcasing its efficiency in learning user preferences, robustness to changes in LLMs, and scalability for multi-model routing. The study highlights the potential for personalized LLM routing that dynamically balances performance and cost based on individual user requirements.
Methodology
The authors formulated the routing decision as a contextual bandit problem, treating distinct user preferences as different tasks. MetaRouter was trained through a meta-learning approach, where users provided pairwise comparisons of responses from different LLMs. This feedback was used to infer a latent preference representation, which informed the routing policy to select the optimal model for each query.
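Inferring a preference from pairwise comparisons can be illustrated with a much simpler stand-in than MetaRouter: a Bradley-Terry model with a single scalar cost sensitivity `lam`, fitted by grid search. The utility form, the sharpness `beta`, and all numbers are hypothetical; the paper learns a latent representation via meta-learning, which this sketch does not attempt.

```python
import math

def log_likelihood(lam, comparisons, beta=20.0):
    """Bradley-Terry log-likelihood under utility u = quality - lam * cost."""
    total = 0.0
    for (qa, ca), (qb, cb), a_preferred in comparisons:
        margin = beta * ((qa - lam * ca) - (qb - lam * cb))
        if not a_preferred:
            margin = -margin
        total += -math.log1p(math.exp(-margin))   # log sigmoid(margin)
    return total

def infer_cost_sensitivity(comparisons, grid):
    """Pick the grid value of lam that best explains the observed comparisons."""
    return max(grid, key=lambda lam: log_likelihood(lam, comparisons))

def route(models, lam):
    """Choose the model with the highest estimated utility for this user."""
    return max(models, key=lambda name: models[name][0] - lam * models[name][1])

# Hypothetical (quality, cost) per model.
models = {"small": (0.60, 0.1), "medium": (0.75, 0.4), "large": (0.90, 1.0)}
# Each entry: (option A, option B, True if the user preferred A).
comparisons = [
    (models["small"], models["large"], True),
    (models["medium"], models["large"], True),
    (models["small"], models["medium"], False),
]
grid = [i * 0.05 for i in range(41)]
lam_hat = infer_cost_sensitivity(comparisons, grid)
print(lam_hat, route(models, lam_hat))
```

Three comparisons are enough here to place the user between "always cheapest" and "always best", which is the kind of minimal-interaction adaptation the paper targets.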
Results
MetaRouter outperformed existing routing methods on both in-distribution and out-of-distribution tasks. The framework demonstrated high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability for multi-model routing, indicating its effectiveness in personalized LLM applications.
Implications
The findings suggest that personalized LLM routing can significantly enhance user experience by adapting to individual cost-performance preferences. This approach could lead to more efficient deployment of LLMs in various applications, reducing costs while maintaining high performance. It opens avenues for further research into user-centric AI systems that dynamically adjust to user needs.
A Held-Out Transition-Pair Falsifier for Long-Horizon Non-Abelian State Tracking
Theory
- Introduces a held-out transition-pair falsifier to evaluate sequence models in non-Abelian state tracking.
- Demonstrates that a projected recurrent state model can achieve perfect accuracy in long-horizon predictions.
- Baseline models fail to perform under the same evaluation conditions, indicating the effectiveness of the proposed falsifier.
- Mechanism diagnostics reveal the significance of projection temperature in model performance.
Summary
This paper addresses the limitations of sequence models in state tracking, particularly when the relevant signal is an ordered latent state evolving through non-commutative transformations. The author introduces a held-out transition-pair falsifier for finite non-Abelian group tracking, which prevents specific ordered generator pairs from being included in training sequences while requiring them in evaluation sequences. This approach effectively blocks memorization of local transition templates. The study utilizes a controlled S3 × S3 benchmark, demonstrating that a projected recurrent state model trained on length-8 sequences can achieve error-free predictions across evaluation horizons up to 1,048,576 tokens. In contrast, baseline models, including GRUs and structured state-space models, fail under the same conditions. The paper also provides mechanism diagnostics that reveal the importance of hard projection for maintaining accuracy and low homomorphism error. The findings suggest that explicit projected non-commutative state composition serves as a beneficial inductive bias for long-horizon hidden-state tracking, although the results are scoped to the specific falsifier protocol rather than a general model ranking.
Methodology
The methodology involves defining a held-out transition-pair falsifier protocol that excludes specific ordered generator pairs from training sequences while requiring them in evaluation sequences. A projected recurrent state model is employed, which maintains a non-commutative recurrent hidden state and produces symbolic group elements through temperature-controlled projection. Mechanism diagnostics are conducted to assess model performance across various projection temperatures.
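The data side of such a protocol can be sketched as follows: sequences over generators of S3 × S3, with training sequences never containing certain ordered adjacent pairs and evaluation sequences required to contain them. The generator labels, the choice of held-out pairs, and the rejection sampler are hypothetical illustrations, not the paper's exact setup.

```python
import random

def compose(p, q):
    """Permutation composition: apply p first, then q."""
    return tuple(q[i] for i in p)

# Generators of S3 as permutations of (0, 1, 2): a swap and a 3-cycle.
IDENT, SWAP, CYCLE = (0, 1, 2), (1, 0, 2), (1, 2, 0)
# Each token acts on one factor of S3 x S3 and leaves the other fixed.
GENERATORS = {
    "a": (SWAP, IDENT), "b": (CYCLE, IDENT),
    "c": (IDENT, SWAP), "d": (IDENT, CYCLE),
}
HELD_OUT = {("a", "b"), ("d", "c")}   # hypothetical ordered pairs withheld from training

def final_state(tokens):
    """Group element reached after applying the tokens in order."""
    left, right = IDENT, IDENT
    for t in tokens:
        g_left, g_right = GENERATORS[t]
        left, right = compose(left, g_left), compose(right, g_right)
    return left, right

def pairs(tokens):
    """Set of ordered adjacent token pairs in a sequence."""
    return set(zip(tokens, tokens[1:]))

def sample(length, rng, train):
    """Rejection-sample: training avoids HELD_OUT pairs, evaluation must contain one."""
    while True:
        seq = [rng.choice("abcd") for _ in range(length)]
        has_held_out = bool(pairs(seq) & HELD_OUT)
        if has_held_out != train:
            return seq

rng = random.Random(0)
train_seq = sample(8, rng, train=True)
eval_seq = sample(64, rng, train=False)
print(train_seq, final_state(train_seq))
```

Because `("a", "b")` and `("b", "a")` lead to different states, a model cannot handle the held-out pair by reusing what it memorized for the reversed order.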
Results
The projected recurrent state model achieved perfect final-state predictions (250/250) across evaluation horizons up to 1,048,576 tokens in the S3 × S3 benchmark. In contrast, baseline models, including GRUs and structured state-space models, performed poorly under the same protocol. Mechanism diagnostics indicated that hard projection correlated with low homomorphism error and low state-consistency drift, while softened projection led to reduced accuracy.
Implications
The findings suggest that the proposed falsifier and the use of projected non-commutative state composition could enhance the design of sequence models for tasks requiring long-horizon state tracking. This has potential applications in various domains where maintaining accurate internal states over extended sequences is critical, such as robotics and complex system control.
Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions
Time Series
- Introduction of EpiCF-Bench, a benchmark for counterfactual prediction in epidemic time series.
- Utilization of a calibrated agent-based model to generate realistic epidemic trajectories.
- Support for both single-policy and multi-policy intervention settings.
- Evaluation of various causal inference methods, revealing performance disparities.
Summary
This paper addresses the challenges in causal inference for epidemic time series, particularly the lack of realistic benchmarks with observable counterfactual outcomes. The authors introduce EpiCF-Bench, a large-scale benchmark designed for counterfactual prediction in epidemic time series under dynamic interventions. This benchmark is built on a calibrated agent-based model (ABM) that incorporates real-world demographic, mobility, epidemiological, and policy data, generating factual and counterfactual trajectories across over 150 U.S. counties. EpiCF-Bench supports both static and time-varying treatments, allowing for comprehensive evaluation of causal inference methods across various intervention scenarios. The authors evaluate several widely used and state-of-the-art causal inference methods using this benchmark, revealing significant performance differences and highlighting the complexities of realistic time-series causal reasoning. The paper contributes to bridging the gap between realism and evaluability in causal inference, providing a robust framework for future research.
Methodology
The authors developed EpiCF-Bench using a calibrated agent-based model (ABM) that simulates epidemic dynamics based on real-world data. This model generates both factual and counterfactual trajectories, allowing for the evaluation of causal inference methods under different intervention scenarios. The benchmark includes tasks for both single-policy and multi-policy interventions, facilitating comprehensive assessments of causal inference techniques.
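Why a simulator makes counterfactuals observable can be shown with a toy example. The sketch below uses a deterministic discrete-time SIR model as a stand-in for the paper's calibrated agent-based model; every parameter value and the policy schedule are made up.

```python
def simulate(policy, beta=0.3, gamma=0.1, effect=0.5, n=100_000, i0=10):
    """Discrete-time SIR; policy[t] = 1 scales transmission by (1 - effect) on day t."""
    s, i = n - i0, i0
    new_cases = []
    for active in policy:
        rate = beta * (1 - effect * active)
        infections = rate * s * i / n
        recoveries = gamma * i
        s, i = s - infections, i + infections - recoveries
        new_cases.append(infections)
    return new_cases

days = 120
# Factual world: one intervention active on days 30-59 (time-varying treatment).
factual_policy = [1 if 30 <= t < 60 else 0 for t in range(days)]
# Counterfactual world: same starting state, no intervention at all.
counterfactual_policy = [0] * days

factual = simulate(factual_policy)
counterfactual = simulate(counterfactual_policy)
# Ground-truth daily effect, available only because both worlds are simulated.
effect_by_day = [c - f for c, f in zip(counterfactual, factual)]
print(round(sum(factual[:60])), round(sum(counterfactual[:60])))
```

A causal inference method would see only `factual` and the policy, and would be scored against `counterfactual`, which real surveillance data can never provide.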
Results
The evaluation of causal inference methods using EpiCF-Bench demonstrated substantial performance differences among the methods tested. The results highlighted the challenges associated with realistic time-series causal reasoning, emphasizing the need for robust benchmarks in this domain.
Implications
The development of EpiCF-Bench has significant implications for public health decision-making, as it provides a reliable framework for evaluating the effects of interventions in epidemic contexts. The benchmark can aid researchers and policymakers in understanding the causal impacts of various strategies, ultimately improving epidemic response efforts.
End-to-End Subgraph Detection with GraphDETR
Graph Learning
- GraphDETR reformulates subgraph detection as a set prediction problem, enabling joint predictions of multiple subgraphs.
- The model employs a GNN for graph encoding and a transformer decoder for predicting subgraph occurrences.
- GraphDETR supports both exact and approximate matching, extending beyond traditional combinatorial methods.
- Empirical results demonstrate strong performance in detecting molecular functional groups and other graph patterns.
Summary
The paper introduces GraphDETR, a novel deep learning framework for subgraph detection that reformulates the problem as a set prediction task, similar to the DETR model in object detection. Traditional combinatorial methods for subgraph isomorphism are limited by the NP-completeness of the problem and are practical only for small patterns or moderately sized graphs. GraphDETR addresses these limitations by using a graph neural network (GNN) to encode the target graph and a transformer decoder with a fixed set of learnable query vectors to predict all occurrences of query patterns in a single forward pass. The model is trained end-to-end with bipartite matching, which allows for both exact and approximate matching of subgraphs. The framework is evaluated on various patterns, including molecular structures and functional groups, demonstrating its capability to detect diverse patterns in graphs with up to 1000 nodes. GraphDETR achieves strong performance in molecular functional group detection on the ChEMBL dataset, with an average precision of 91.2.
Methodology
GraphDETR encodes the target graph using a graph neural network (GNN) and employs a transformer decoder with a fixed set of learnable query vectors. The model is trained end-to-end using bipartite matching to align predictions with ground-truth subgraphs, allowing for simultaneous detection of multiple subgraphs without duplicates.
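The bipartite matching step can be sketched on node sets. The cost (1 minus Jaccard overlap) and the brute-force search are assumptions chosen for brevity; the summary does not state GraphDETR's matching cost, and a real implementation would typically use a Hungarian-style solver instead of enumerating permutations.

```python
from itertools import permutations

def jaccard(a, b):
    """Overlap between two node sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 1.0

def match(predictions, targets):
    """Assign each target a distinct prediction slot, minimizing total (1 - Jaccard).

    Brute force over permutations; suitable only for a handful of slots.
    """
    best_cost, best = float("inf"), None
    for slots in permutations(range(len(predictions)), len(targets)):
        cost = sum(1 - jaccard(predictions[s], t) for s, t in zip(slots, targets))
        if cost < best_cost:
            best_cost, best = cost, slots
    # Map prediction slot -> index of the target it is matched to.
    return dict(zip(best, range(len(targets)))), best_cost

# Hypothetical output of four query slots (node-id sets) and two ground-truth subgraphs.
predictions = [{0, 1, 2}, {7, 8}, {4, 5, 6}, set()]
targets = [{4, 5, 6}, {0, 1, 3}]
assignment, cost = match(predictions, targets)
unmatched = [s for s in range(len(predictions)) if s not in assignment]
print(assignment, cost, unmatched)
```

Slots left in `unmatched` would be trained toward a "no subgraph" output, which is how a one-to-one matching discourages duplicate detections.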
Results
GraphDETR successfully detects diverse patterns, including molecular structures and fuzzy patterns of up to 50 nodes, in target graphs with up to 1000 nodes. In molecular functional group detection on the ChEMBL dataset, it achieves an average precision of 91.2, demonstrating its effectiveness and robustness.
Implications
The introduction of GraphDETR has significant implications for various scientific domains, particularly in molecular analysis and network science. Its ability to detect subgraphs efficiently and accurately can enhance tasks such as reaction prediction, retrosynthesis planning, and molecular property modeling.
Drifting Models for Surrogate Flow Modeling
Generative Models
Efficient ML
Optimization
- Conditional drifting models can generate flow fields accurately in a single step.
- The approach utilizes a learned VAE for latent-space drifting and incorporates label-aware masking.
- The model runs two orders of magnitude faster than traditional iterative diffusion methods.
- A spatial-conditioning variant shows promise for generalizing to new geometries.
Summary
This paper addresses the limitations of Computational Fluid Dynamics (CFD) in optimizing indoor environments due to its high computational cost. The authors propose a novel generative surrogate modeling approach using a drifting framework specifically adapted for fluid mechanics. The key innovation is a conditional architecture that performs drifting in a learned Variational Autoencoder (VAE) latent space, allowing for efficient one-step generation of flow fields while maintaining accuracy and consistency. The model incorporates label-aware masking to ensure generated samples align with boundary conditions, achieving results comparable to iterative diffusion models while running two orders of magnitude faster. Additionally, a spatial-conditioning variant is introduced to enhance generalization to unseen geometries. The findings suggest that conditional drifting can serve as a viable alternative to traditional diffusion-based methods, enabling real-time CFD surrogates critical for applications in indoor air quality management.
Methodology
The authors adapted a drifting model framework to fluid mechanics, utilizing a conditional architecture that operates in a learned VAE latent space. The model employs label-aware masking to enforce boundary conditions during generation. Training involves evolving pushforward distributions to match data distributions without iterative sampling, allowing for rapid inference.
Results
The conditional drifting model demonstrated high accuracy and flow consistency comparable to iterative diffusion models while achieving significantly faster inference times. The spatial-conditioning variant indicated potential for effective generalization across different room geometries.
Implications
This work has significant implications for real-time indoor airflow prediction and optimization, particularly in enhancing indoor air quality management systems. The efficiency of the proposed method could facilitate rapid design exploration and control in various applications involving fluid dynamics.
Less is MoE: Trimming Experts in Domain-Specialist Language Models
NLP
Large Language Models
Efficient ML
- Fisher importance outperforms traditional metrics for identifying critical dimensions in MoE models.
- Fisher-MoE enables fine-grained compression at the intermediate dimension level, preserving model capabilities.
- At a 50% compression ratio, Fisher-MoE reduces memory usage by approximately 45% and improves inference throughput by 21%.
- The study reveals that model capabilities are concentrated in a small subset of intermediate dimensions rather than being localized at the expert level.
Summary
This paper addresses the challenges of deploying Mixture-of-Experts (MoE) models due to their large parameter sizes. Previous compression methods that remove or merge experts often fail when evaluated on general-purpose benchmarks, primarily because they operate at the expert level rather than focusing on the distribution of capabilities across intermediate dimensions. The authors propose a novel approach called Fisher-MoE, which utilizes Fisher importance to identify and remove less critical intermediate dimensions within the feedforward neural network (FFN) layers of MoE models. This method allows for a more fine-grained compression that preserves model performance while significantly reducing memory usage and improving inference speed. The study also shows how concentrated capabilities are: removing only a small number of intermediate dimensions can collapse performance on tasks that require mathematical reasoning while factual knowledge performance is maintained. The findings suggest that focusing on intermediate dimensions rather than entire experts can lead to more effective model compression strategies.
Methodology
The authors employ Fisher importance as a metric to rank intermediate dimensions within MoE models. They conduct controlled experiments to compare Fisher importance with other heuristics (activation frequency, router scores, and weight magnitudes) and demonstrate its effectiveness in identifying critical dimensions. The proposed Fisher-MoE method compresses the model by removing low-Fisher-score dimensions, allowing for a more efficient model structure without discarding entire experts.
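Scoring intermediate dimensions can be sketched with an empirical-Fisher estimate (mean squared per-sample gradient) on a tiny two-layer FFN. The network, the squared-error loss, and the function names are assumptions for illustration; the paper's exact Fisher importance definition and its MoE setup are not given in the summary.

```python
import numpy as np

def fisher_scores(W1, W2, X, T):
    """Empirical-Fisher score per intermediate dimension of y = W2 relu(W1 x)."""
    scores = np.zeros(W1.shape[0])
    for x, t in zip(X, T):
        pre = W1 @ x
        h = np.maximum(pre, 0.0)
        err = W2 @ h - t                       # d(loss)/dy for 0.5 * ||y - t||^2
        g_W2 = np.outer(err, h)                # gradient w.r.t. W2, shape (out, hidden)
        g_pre = (W2.T @ err) * (pre > 0)       # backprop through the ReLU
        g_W1 = np.outer(g_pre, x)              # gradient w.r.t. W1, shape (hidden, in)
        # Dimension j owns row j of W1 and column j of W2.
        scores += (g_W1**2).sum(axis=1) + (g_W2**2).sum(axis=0)
    return scores / len(X)

def prune(W1, W2, scores, keep):
    """Keep the `keep` highest-scoring intermediate dimensions."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return W1[idx], W2[:, idx]

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W1[0] = -1.0                                   # unit 0 never fires on positive inputs
W2 = rng.normal(size=(3, 8))
X = np.abs(rng.normal(size=(32, 4)))
T = rng.normal(size=(32, 3))

scores = fisher_scores(W1, W2, X, T)
W1_small, W2_small = prune(W1, W2, scores, keep=7)
print(scores.round(3))
```

Pruning slices the same rows and columns inside each expert, so the expert count stays fixed while the intermediate width shrinks.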
Results
The results show that Fisher-MoE maintains model performance on general-purpose benchmarks while achieving significant reductions in memory footprint and improvements in inference speed. Specifically, at a 50% compression ratio, the method preserves factual knowledge performance and improves throughput by 21%. Separately, removing as few as 12 of 1.35 million intermediate dimensions can collapse performance on mathematical reasoning tasks, which illustrates how concentrated that capability is.
Implications
The findings suggest that focusing on intermediate dimensions for compression can lead to more efficient deployment of large language models, making them more practical for real-world applications. This approach could be beneficial in scenarios where computational resources are limited, such as mobile devices or edge computing environments.
Performance Variation in Deep Reinforcement Learning
Reinforcement Learning
- Deep RL algorithms often suffer from significant performance variation across independent runs.
- Conventional uncertainty measures may underreport performance variability.
- The proposed min-max IPR-90 statistic provides a more robust and interpretable measure of performance variation.
- Normalization techniques can effectively reduce performance variation in certain algorithms.
Summary
This paper addresses the significant issue of performance variation in deep reinforcement learning (RL) algorithms, which often exhibit low robustness across independent runs. The authors highlight the limitations of conventional methods for estimating uncertainty and performance variation, which tend to underreport variability. To tackle this, they propose a percentile-based statistic, min-max IPR-90, and a visualization method called run-wise percentile highlighting. These tools provide a clearer interpretation of performance variation. The paper includes three case studies demonstrating the effectiveness of the proposed methods. The first study shows that normalization techniques like LayerNorm reduce performance variation in the Proximal Policy Optimization (PPO) algorithm, while the Soft Actor-Critic (SAC) remains largely unchanged. The second study compares various algorithms, revealing that TD-MPC has the least variation and highest data efficiency. Lastly, a comparison of DQN and Rainbow on Atari environments indicates similar performance variation levels. Overall, the proposed methods offer a more reliable means of evaluating performance variation in deep RL, which is crucial for both research and practical applications.
Methodology
The authors propose a percentile-based statistic, min-max IPR-90, which quantifies performance variation by normalizing the interpercentile range from the 5th to the 95th percentiles. They also introduce run-wise percentile highlighting for visualizing performance variation. The effectiveness of these methods is demonstrated through three case studies involving various deep RL algorithms.
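One plausible reading of the statistic is sketched below: scale per-run returns to [0, 1] with min-max bounds, then take the gap between the 95th and 5th percentiles. This reading, the choice of bounds, and the sample returns are assumptions; the paper's precise definition is not reproduced in the summary.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def min_max_ipr90(returns, low, high):
    """P95 - P5 of per-run returns after min-max scaling with bounds low/high."""
    scaled = [(r - low) / (high - low) for r in returns]
    return percentile(scaled, 95) - percentile(scaled, 5)

# Made-up final returns from ten independent runs of two algorithms.
stable = [900, 910, 905, 895, 902, 908, 899, 903, 897, 906]
unstable = [900, 150, 880, 910, 400, 905, 120, 890, 650, 915]
bounds = (0, 1000)   # hypothetical score range for the environment

print(min_max_ipr90(stable, *bounds), min_max_ipr90(unstable, *bounds))
```

Unlike a standard error of the mean, a spread measure of this kind does not shrink as more runs are added, so it describes run-to-run variation rather than confidence in the average.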
Results
The case studies reveal that normalization techniques can significantly reduce performance variation in PPO, while SAC remains largely unaffected. TD-MPC is shown to have the least performance variation and is the most data-efficient among the algorithms compared. Additionally, DQN and Rainbow exhibit similar levels of performance variation across multiple Atari environments.
Implications
The findings suggest that addressing performance variation is crucial for improving the reliability of deep RL algorithms in both research and real-world applications. The proposed methods can aid researchers and practitioners in better evaluating and comparing the robustness of different RL approaches.
Intercomparison of Machine Learning Algorithms for Remote Sensing-based In-season Crop Mapping
Optimization
Time Series
Theory
- In-season crop mapping is essential for timely responses to climate-related agricultural threats.
- Support Vector Machines outperformed other algorithms in mapping accuracy, achieving a mean F1 score of 0.74 for almonds.
- The study utilized a comprehensive evaluation approach, considering interannual variability in crop distribution.
- Future research could expand the methodology to include all crop types and improve yield forecasting.
Summary
This study addresses the critical need for in-season crop type mapping to enhance food security amidst climate-related threats. The USDA's Cropland Data Layer provides crop type labels only after harvest, which limits timely responses to crop threats. The authors combine Harmonized Landsat-Sentinel surface reflectance imagery with crop rotation history to accurately map corn in Iowa and almonds in California at a 30m resolution by early June in unseen years. They evaluate thousands of model configurations across ten machine learning algorithms using year-wise cross-validation and various metrics. The hyperparameter optimization revealed that Support Vector Machines (SVM) achieved the highest performance, with a mean F1 score of 0.74 for almonds and 0.59 for corn across five unseen validation years. The study highlights interannual variability as a significant source of uncertainty but suggests that ensemble approaches or additional data could enhance performance. Future work aims to extend these methods to multiclass crop mapping, CONUS-wide applications, and in-season crop yield forecasting.
Methodology
The authors combined remote sensing imagery with crop rotation history to create crop maps. They conducted a hyperparameter search across ten machine learning algorithms, using year-wise cross-validation to assess model performance and to quantify the uncertainty that arises from year-to-year differences in phenology and crop distribution.
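Year-wise cross-validation can be sketched as holding out every sample from one year at a time, so that each fold is scored on a year the model has never seen. The helper names below are hypothetical, and the F1 function is the standard binary definition rather than the study's evaluation code.

```python
def year_wise_folds(years):
    """Yield (held_out_year, train_indices, test_indices), one fold per distinct year."""
    for held_out in sorted(set(years)):
        train = [i for i, y in enumerate(years) if y != held_out]
        test = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train, test


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Averaging the per-fold F1 scores gives a mean over unseen years, and the spread across folds is one simple way to expose interannual variability.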
Results
The study found that Support Vector Machines provided the best overall performance, with a mean F1 score of 0.74 for almond mapping and 0.59 for corn mapping by early June across five unseen validation years. Interannual variability was identified as a major source of uncertainty, indicating potential for improvement through ensemble methods or additional data.
Implications
The findings suggest that machine learning can significantly enhance the accuracy of in-season crop mapping, which is crucial for food security and agricultural management in the face of climate change. The methodologies developed could be applied to broader geographic areas and various crop types, aiding in real-time agricultural decision-making.
Network Recovery from Cascade Data: A Debiased Jacobian-Based Machine Learning Approach
Graph Learning
Theory
Time Series
- CascadeNet does not require specifying a diffusion model, reducing the risk of misspecification.
- The framework utilizes a flexible estimator for the one-step transition function.
- Neyman-orthogonal debiasing ensures unbiased estimates of the network Jacobian.
- CascadeNet achieves high accuracy in network recovery in both simulated and real-world scenarios.
Summary
The paper addresses the challenge of recovering hidden influence networks from dynamic cascades, such as product adoption and disease spread. Existing methods often rely on specific diffusion models, which can lead to significant performance degradation if the model is misspecified. The author proposes CascadeNet, a novel Jacobian-based machine learning framework that does not require a predefined diffusion mechanism. The core idea is to characterize the influence structure using the Jacobian of the one-step transition function. CascadeNet constructs a flexible estimator for this transition function and employs Neyman-orthogonal debiasing to ensure that the resulting Jacobian is consistent and allows for formal inference on the network structure. The methodology is validated through simulations and a real-world application to COVID-19 transmission in Spain, demonstrating superior network recovery accuracy compared to existing methods.
Methodology
CascadeNet employs a Jacobian-based approach to network recovery by estimating the one-step transition function without assuming a specific diffusion model. It uses a flexible machine learning estimator and applies Neyman-orthogonal debiasing to correct for bias introduced by regularization, ensuring that the Jacobian entries are consistent and allow for statistical inference.
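To make the Jacobian idea concrete, the sketch below differentiates a one-step transition function numerically and reads influence edges off the off-diagonal entries. It is only an illustration: the toy transition function is hypothetical, central finite differences stand in for a learned estimator, and the Neyman-orthogonal debiasing step is omitted entirely.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian: entry [i][j] approximates d f_i / d x_j at x."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += eps
        down[j] -= eps
        f_up, f_down = f(up), f(down)
        for i in range(n_out):
            jac[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return jac


def toy_transition(x):
    """Hypothetical 3-node dynamics: node 0 influences node 1, node 1 influences node 2."""
    return [0.5 * x[0], 0.5 * x[1] + 0.3 * x[0], 0.5 * x[2] + 0.2 * x[1]]


def influence_edges(jac, threshold=0.05):
    """Return (source, target) pairs whose off-diagonal Jacobian entry exceeds the threshold."""
    return [(j, i) for i, row in enumerate(jac)
            for j, value in enumerate(row) if i != j and abs(value) > threshold]
```

In this reading, a nonzero entry `[i][j]` means that the state of node `j` at one step affects node `i` at the next, which is what makes the Jacobian a model-agnostic description of the network. The fixed threshold is a simplification; the summary describes formal statistical inference on the entries instead.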
Results
In simulations, CascadeNet outperformed existing methods, achieving the highest accuracy in network recovery across various data-generating processes. In a real-world application to COVID-19 transmission across Spain's provinces, the networks recovered by CascadeNet showed significant correlation with the true inter-province mobility network, while baseline methods did not.
Implications
The findings suggest that CascadeNet can be effectively used in various fields such as marketing, epidemiology, and social network analysis, where understanding the underlying influence structure is crucial for decision-making. The ability to recover networks without model assumptions enhances the applicability of the method across different contexts.
SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling
Reinforcement Learning
Optimization
Large Language Models
- SCALE generalizes to unseen cluster sizes without retraining.
- Introduces Structured Representation Regularization (SRR) to stabilize feature statistics.
- Achieves competitive performance in response time across different cluster sizes.
- Formalizes agentic workflow scheduling as an MDP, capturing task dependencies and resource heterogeneity.
Summary
The paper introduces SCALE, a novel deep reinforcement learning (DRL) scheduler designed for agentic workflow scheduling in heterogeneous clusters. Traditional DRL schedulers are limited by their fixed cluster size, necessitating retraining when the number of servers changes. SCALE addresses this limitation by employing a cross-attention pointer network that allows for scalability across different cluster sizes without the need for fine-tuning. The authors identify that simply using a permutation-invariant architecture does not ensure performance across varying scales due to distribution shifts in attention features. To mitigate this issue, they propose Structured Representation Regularization (SRR), which stabilizes feature statistics through a decorrelation loss and a KL penalty towards a standard normal distribution. The effectiveness of SCALE is demonstrated through experiments where it is trained on 16 nodes and tested on larger configurations (32 and 48 nodes), achieving an 8.9% reduction in average response time compared to a similar architecture without SRR. This work formalizes agentic workflow scheduling as a Markov Decision Process (MDP) and presents a scalable solution that retains performance across different cluster sizes.
Methodology
SCALE utilizes a cross-attention pointer network architecture where task features query against server features, allowing it to handle any number of servers. The model incorporates Structured Representation Regularization (SRR) to maintain stable feature statistics across varying cluster sizes. It is trained on a heterogeneous simulated cluster with 16 nodes and evaluated on larger clusters (32 and 48 nodes) without fine-tuning.
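The reason a cross-attention pointer can handle any number of servers is that it scores each server separately and normalizes over however many there are. The sketch below shows only that scoring step with raw vectors; in the actual model the queries and keys would come from learned projections, and the function name is hypothetical.

```python
import math


def pointer_distribution(task_query, server_keys):
    """Softmax over scaled dot products between one task query and each server key."""
    scale = math.sqrt(len(task_query))
    scores = [sum(q * k for q, k in zip(task_query, key)) / scale
              for key in server_keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The same function returns a valid distribution for 16, 32, or 48 servers without any change in parameters. Size independence of the architecture is not sufficient on its own, though: as the summary notes, the statistics of the attention features shift with cluster size, which is the gap SRR is meant to close.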
Results
The experiments show that SCALE, when trained on 16 nodes, maintains competitive response times when deployed on larger clusters (32 and 48 nodes), achieving an 8.9% reduction in average response time compared to a similar architecture lacking SRR. This confirms the effectiveness of SRR in closing the scale-generalization gap.
Implications
The findings suggest that SCALE can be effectively applied in dynamic environments where cluster sizes frequently change, such as cloud computing and large-scale data processing. The ability to generalize across different scales without retraining could lead to more efficient resource utilization and reduced latency in task scheduling.
Self-evolving LLM agents with in-distribution Optimization
Large Language Models
Reinforcement Learning
Robotics
- Q-Evolve unifies process-reward labeling and policy learning in a reinforcement learning paradigm.
- The framework stabilizes learning in sparse-reward environments using a hybrid off-policy dataset.
- Q-Evolve enables dense supervision without the need for manual annotations or environment backtracking.
- The method shows improved sample efficiency and robustness across various interactive environments.
Summary
This paper introduces Q-Evolve, a novel framework designed for training Large Language Model (LLM) agents to enhance their decision-making capabilities in complex environments. The primary challenge addressed is the issue of credit assignment in reinforcement learning, particularly in scenarios where agents receive delayed rewards only at the end of episodes. Q-Evolve integrates automatic process-reward labeling with policy learning through an in-distribution reinforcement learning approach. The framework utilizes a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories to stabilize Bellman backups in sparse-reward settings. By estimating step-wise process rewards through advantage estimation, Q-Evolve provides dense supervision without requiring environment backtracking or human annotation. This self-evolving mechanism allows the agent to iteratively improve its policy while minimizing distribution shifts. The authors evaluate Q-Evolve on multiple environments, including AlfWorld, WebShop, and ScienceWorld, demonstrating significant improvements in sample efficiency, robustness, and overall task performance compared to existing baselines. The findings suggest that stable self-evolution of agents is achievable through the co-evolution of process-level supervision and policy within a shared in-distribution learning framework.
Methodology
Q-Evolve employs a self-evolving framework that combines automatic process-reward labeling with policy learning. It utilizes a hybrid off-policy dataset consisting of expert demonstrations and agent-generated trajectories to derive step-wise process rewards through advantage estimation. The framework stabilizes learning using a weighted Implicit Q-Learning objective and performs behavior-proximal policy optimization to facilitate iterative self-improvement.
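For readers unfamiliar with Implicit Q-Learning, the sketch below shows the standard expectile loss used to fit a value function, together with a step-wise advantage computed as Q minus V. This is the textbook IQL form, not Q-Evolve's weighted objective, whose weighting scheme the summary does not specify; treating the advantage directly as the process reward is likewise an assumption.

```python
def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared error: weight tau where Q exceeds V, and 1 - tau otherwise."""
    total = 0.0
    for q, v in zip(q_values, v_values):
        diff = q - v
        weight = tau if diff > 0 else 1.0 - tau
        total += weight * diff * diff
    return total / len(q_values)


def stepwise_advantages(q_values, v_values):
    """Advantage of each taken action, A = Q(s, a) - V(s), usable as a dense per-step signal."""
    return [q - v for q, v in zip(q_values, v_values)]
```

With `tau` above 0.5, underestimating the value is penalized more than overestimating it, which pushes V toward an upper expectile of Q using only actions that appear in the dataset.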
Results
Q-Evolve was evaluated on AlfWorld, WebShop, and ScienceWorld, outperforming strong baselines in terms of sample efficiency, robustness, and overall task performance. The results indicate that the proposed method effectively addresses the challenges of credit assignment and distribution shift in LLM agents.
Implications
The findings of this research have significant implications for the development of autonomous agents in dynamic environments, particularly in applications requiring long-horizon decision-making. The ability to achieve stable self-evolution in LLM agents could enhance their deployment in real-world scenarios, such as robotics, gaming, and interactive systems.
Temporal Preference Concepts and their Functions in a Large Language Model
Large Language Models
Interpretability
- Identification of a subgraph for temporal preference in LLMs using mechanistic interpretability techniques.
- LLMs discount future outcomes less steeply than humans, and their temporal preferences may be unstable across contexts.
- Successful interventions can steer temporal preferences, emphasizing the need for explicit control mechanisms.
- Convergence of multiple localization methods provides strong evidence for the model's internal structure.
Summary
This paper investigates how Large Language Models (LLMs) represent and manage temporal preferences, particularly in decision-making scenarios that involve trade-offs between short-term gains and long-term consequences. The authors focus on the Qwen3-4B-Instruct-2507 model and employ mechanistic interpretability techniques to identify a subgraph responsible for temporal preference within the model's architecture. They utilize various methods, including gradient-based attribution and activation patching, to localize this subgraph in the mid-to-upper layers of the model. The findings reveal that LLMs discount future outcomes less steeply than humans, indicating a potential instability in their temporal preferences across different contexts. The authors also demonstrate that steering vectors can effectively shift these preferences, suggesting that explicit control over temporal decision-making in LLMs is necessary rather than relying on implicit training. This work highlights the importance of understanding the internal mechanisms of LLMs to ensure reliable decision-making in high-stakes applications.
Methodology
The authors employed a combination of mechanistic interpretability techniques, including causal and attributional localization, activation patching, and linear probing, to isolate and analyze the temporal preference subgraph in the Qwen3-4B-Instruct-2507 model. They conducted behavioral analyses and steering interventions to assess the stability and control of temporal preferences.
Results
The study localized a temporal preference subgraph within the LLM, revealing that the geometry of the time horizon is encoded in specific layers. Behavioral analysis indicated that LLMs discount future outcomes less steeply than humans, and steering interventions were able to shift temporal preferences effectively.
Implications
Understanding and controlling temporal preferences in LLMs is crucial for their deployment in high-stakes decision-making scenarios, such as autonomous systems. This research provides a foundation for developing more reliable and interpretable AI systems that can manage complex intertemporal trade-offs.
Generative Modeling of Discrete Latent Structures via Dynamic Policy Gradients
Reinforcement Learning
Generative Models
Graph Learning
- GReinSS introduces a novel policy learning framework for inferring discrete latent states.
- The method utilizes dynamically rescaled rewards to optimize the likelihood of observed data.
- GReinSS outperforms traditional methods and existing generative models in reconstructing latent states.
- The framework is validated on both simulated datasets and real RNA sequencing data.
Summary
The paper addresses the challenge of inferring mechanistic latent states from indirect observations, a common requirement in various scientific domains. Traditional methods like expectation maximization struggle with large combinatorial spaces, while deep learning approaches often yield artificial latent states. The authors propose GReinSS, a novel framework that utilizes dynamically rescaled rewards within a reinforcement learning paradigm to optimize the distribution of latent states. This method aims to maximize the likelihood of observed data by accurately reconstructing latent sets and graphs. The authors demonstrate that GReinSS outperforms existing generative modeling techniques and policy learning baselines in both simulated environments and real-world applications, such as RNA isoform reconstruction from short-read sequencing data. The results indicate that GReinSS provides a principled and effective approach for generative modeling and inference of complex latent structures.
Methodology
GReinSS employs a reinforcement learning approach where policy gradients are used to optimize the distribution of latent states. The framework dynamically adjusts rewards to ensure that the policy parameters are updated towards maximizing the likelihood of observed data, addressing the limitations of classical methods in combinatorial latent state spaces.
Results
The experimental results show that GReinSS reliably reconstructs ground-truth latent states with higher accuracy than standard policy gradients, GFlowNets, and generalized expectation maximization techniques. In real-world applications, GReinSS successfully reconstructs isoforms from RNA sequencing data that align more closely with those identified by long-read sequencing methods compared to traditional algorithms.
Implications
GReinSS has significant implications for fields requiring accurate inference of latent structures from indirect observations, such as genomics, chemical pathways, and network analysis. Its ability to handle combinatorial complexities makes it a valuable tool for researchers in these domains.
CausalPOI: Spatio-Temporal Graph-Based Causal Modeling for Cold-Start POI Check-in Forecasting
Graph Learning
Time Series
Causal Inference
- Introduces cold-start POI check-in forecasting as a novel research problem.
- Develops CausalPOI, a framework that models causal relationships and functional interactions between POIs.
- Utilizes Spatio-Temporal Functional Interaction Graphs for enhanced semantic and spatial relationship modeling.
- Demonstrates superior performance compared to existing forecasting methods on real-world datasets.
Summary
The paper addresses the challenge of forecasting check-in patterns for newly introduced Points of Interest (POIs) in urban environments, a problem termed cold-start POI check-in forecasting. Traditional forecasting methods often rely on proximity-based graphs and correlation-driven modeling, which fail to capture the functional dependencies and causal effects of urban interventions. To overcome these limitations, the authors propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI utilizes a Spatio-Temporal Functional Interaction Graph to model the semantic and spatial relationships between POIs and constructs treatment and control graphs to simulate factual and counterfactual scenarios. The framework is validated through extensive experiments on real-world SafeGraph datasets, demonstrating significant improvements over state-of-the-art baselines in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation. The findings suggest that CausalPOI provides a more interpretable and actionable foundation for urban intervention analysis, essential for data-driven urban planning and commercial decision-making.
Methodology
CausalPOI employs a spatio-temporal graph-based approach, leveraging a Spatio-Temporal Functional Interaction Graph to capture the relationships between POIs. It constructs treatment and control graphs to simulate both factual and counterfactual scenarios, enabling the model to estimate causal effects and interactions effectively.
Results
The experiments conducted on SafeGraph datasets show that CausalPOI significantly outperforms existing state-of-the-art methods in terms of accuracy in forecasting check-in patterns, modeling semantic interactions, and estimating causal effects, validating the framework's effectiveness.
Implications
The findings from this research have significant implications for urban planning and commercial decision-making, providing a robust tool for predicting the impact of new POIs on urban mobility patterns and enabling better resource allocation and strategic planning.
PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance
NLP
Large Language Models
Time Series
- PandaAI integrates LLM reasoning with financial rigor to address challenges in quantitative finance.
- The model employs a closed-loop system for continuous adaptation to non-stationary market conditions.
- Constrained MCTS alpha mining is used to ensure the financial viability of generated factors.
- PandaAI demonstrates significant improvements over existing time-series models in financial decision-making.
Summary
PandaAI is introduced as a neuro-symbolic agent designed to enhance decision-making in quantitative finance by integrating the reasoning capabilities of Large Language Models (LLMs) with financial rigor. The paper addresses the challenges posed by low Signal-to-Noise Ratio (SNR) and non-stationarity in financial data, which complicate the application of deep learning in sequential decision-making. The authors propose a closed-loop system that incorporates market regime modeling and constrained alpha generation, allowing for explicit risk awareness and adaptation to changing market conditions. The methodology includes fine-tuning a domain-specific LLM and employing a modular architecture that facilitates continuous adaptation. Extensive experiments on CSI 300 stock data demonstrate that PandaAI outperforms state-of-the-art time-series models, achieving an 18.2% higher Rank Information Coefficient (IC) and a 25.7% lower maximum drawdown. The proposed framework offers a novel approach to deploying LLMs in high-stakes financial environments, emphasizing the importance of integrating financial constraints into the model generation process.
Methodology
The authors developed a neuro-symbolic agent that combines a fine-tuned LLM with a modular architecture. They implemented a constrained Monte Carlo Tree Search (MCTS) for alpha mining, integrating financial constraints throughout the LLM generation process. Market regime modeling is achieved through latent variable modeling, allowing the system to adapt dynamically to market changes.
Results
PandaAI achieved an 18.2% increase in Rank IC and a 25.7% reduction in maximum drawdown compared to state-of-the-art time-series models when tested on CSI 300 stock data, demonstrating its effectiveness in navigating complex financial environments.
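The two reported metrics have standard definitions that are easy to state in code. Rank IC is the Spearman rank correlation between predicted scores and realized returns, and maximum drawdown is the largest peak-to-trough decline of a portfolio value series, expressed as a fraction of the peak. The sketch below follows those textbook definitions; it is not PandaAI's evaluation code, and the function names are illustrative.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def rank_ic(predicted, realized):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    a, b = ranks(predicted), ranks(realized)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_a * var_b) ** 0.5


def max_drawdown(equity):
    """Largest fractional drop from a running peak; assumes strictly positive values."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst
```

A higher Rank IC means the model orders stocks more like their realized returns do, while a lower maximum drawdown means the worst cumulative loss along the way was smaller.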
Implications
The findings suggest that integrating neuro-symbolic approaches with LLMs can significantly enhance decision-making in quantitative finance. The framework could be applied to various financial applications, improving risk management and investment strategies in dynamic market conditions.
GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios
Reinforcement Learning
Generative Models
Robotics
- GenPO++ provides a solution to the challenge of evaluating action probabilities in generative policies for on-policy RL.
- The framework utilizes history states for exact inversion, avoiding the need for dummy actions and preserving the original action dimension.
- It achieves Jacobian-free likelihood-ratio computation, enhancing computational efficiency and training stability.
- GenPO++ outperforms existing methods in large-scale simulated control and real-world robotic manipulation tasks.
Summary
The paper introduces GenPO++, a novel framework for generative policy optimization in reinforcement learning (RL) that addresses the challenges of likelihood-based on-policy learning. Generative policies, particularly flow-based ones, are capable of producing multimodal action distributions, which are beneficial for complex continuous-control tasks. However, existing methods struggle with accurately evaluating the probability of executed actions, leading to biased updates or increased computational costs. GenPO++ overcomes these limitations by utilizing history states as auxiliary memory in a high-order reversible ODE solver, allowing for exact inversion without altering the action dimension. This results in a Jacobian-free likelihood-ratio computation, maintaining the expressiveness of generative flow policies while avoiding the pitfalls of dummy-action augmentation. The framework is evaluated across various tasks, demonstrating competitive or superior performance compared to state-of-the-art on-policy RL methods, alongside improvements in training stability and computational efficiency.
Methodology
The authors propose a high-order reversible generative policy optimization framework that leverages history states from a reversible ODE solver to compute exact likelihood ratios without the need for dummy actions. This design allows for efficient computation of the log-determinant of the policy map, a quantity that is independent of the neural network's parameters.
Results
GenPO++ was evaluated on large-scale simulated control tasks, fine-tuning scenarios, and real-world robotic manipulation tasks, achieving competitive or superior performance compared to Gaussian PPO, diffusion policy fine-tuning, FPO, and GenPO. The framework also demonstrated improved training stability and reduced computational overhead.
Implications
The advancements presented in GenPO++ could lead to more efficient and stable applications of generative policies in reinforcement learning, particularly in complex environments where multimodal action distributions are necessary. This could enhance the performance of robotic systems and other continuous-control applications.
Learn to Match: Two-Sided Matching with Temporally Extended Feedback
Reinforcement Learning
Theory
Optimization
- Introduces a framework for two-sided matching with temporally extended feedback.
- Models matching as a partially observable Markov game to capture evolving preferences.
- Presents LEARN2MATCH, a benchmark for evaluating multi-agent reinforcement learning in dynamic matching markets.
- Demonstrates that independent PPO outperforms bandit-style methods in social welfare and regret.
Summary
This paper addresses the limitations of traditional two-sided matching models that rely on immediate feedback about fixed preferences. The authors introduce a novel framework that incorporates temporally extended feedback, modeling two-sided matching as a partially observable Markov game. This framework accounts for evolving latent profiles, costly pre-match screening, and noisy post-match observations. The authors present LEARN2MATCH, a multi-agent reinforcement-learning benchmark designed for dynamic matching markets, allowing agents to make decentralized decisions regarding interviews, matches, and relationship continuations or dissolutions. The evaluation of the framework reveals that the independent Proximal Policy Optimization (PPO) algorithm outperforms a bandit-style baseline (CA-ETC) in terms of cumulative social welfare and regret, although it incurs a higher information-friction loss. This highlights the potential of multi-agent reinforcement learning (MARL) in dynamic matching scenarios while indicating the need for improved coordination in exploration strategies. LEARN2MATCH serves as a benchmark for future research, encouraging the development of algorithms that effectively navigate the complexities of matching markets with temporally extended feedback.
Methodology
The authors develop a framework that models two-sided matching as a partially observable Markov game. They instantiate this framework in LEARN2MATCH, which supports decentralized decision-making and evaluates policies based on regret, social welfare, and information-friction loss. The performance of the independent PPO algorithm is compared against a bandit-style baseline (CA-ETC).
Results
The independent PPO algorithm achieves higher cumulative social welfare and lower cumulative regret compared to the CA-ETC baseline under temporally extended feedback. However, it incurs a higher information-friction loss, indicating that while MARL shows promise, it lacks the coordinated exploration structure of traditional matching-bandit methods.
Implications
The findings suggest that MARL can significantly enhance decision-making in dynamic matching markets, paving the way for the development of more sophisticated algorithms that can adapt to evolving preferences over time. LEARN2MATCH also serves as a call to action for researchers to bridge the gap between MARL and matching theory.
Design a Reliable LLM-Integrated Interface for Mortality Forecasting
NLP
Large Language Models
Time Series
- The integration of LLMs can make mortality forecasting more accessible to non-technical users.
- A three-phase methodology was employed to ensure accuracy, usability, and transparency in the forecasting process.
- The system maintains statistical rigor while allowing for natural language interaction, bridging the gap between complex models and user accessibility.
- The prototype demonstrates that LLMs can effectively translate user requests into structured forecasting tasks.
Summary
This paper presents a novel approach to mortality forecasting by integrating a large language model (LLM) into a user-friendly interface designed for non-expert users. The proposed system aims to simplify the complex technical processes involved in mortality forecasting while ensuring statistical accuracy and transparency. The methodology consists of three phases: first, establishing a baseline forecasting pipeline using the CoMoMo package to reproduce established results; second, extending the pipeline to generate multi-step forecasts through rolling-origin evaluation and mean squared error (MSE) metrics; and third, developing a prototype interface that allows users to input natural language requests, which the LLM translates into structured configurations for the forecasting pipeline. The findings indicate that the LLM can enhance accessibility to mortality forecasting without sacrificing reproducibility or actuarial validity, thus supporting better decision-making in actuarial and policy contexts.
Methodology
The methodology consists of three phases: 1) implementing a baseline mortality forecasting pipeline using the CoMoMo package, 2) extending this pipeline to produce multi-step forecasts assessed with rolling-origin evaluation and mean squared error (MSE), and 3) developing a prototype interface that uses a local LLM to turn natural language inputs into structured forecasting requests.
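Rolling-origin evaluation, as used in the second phase, can be sketched as follows: the forecast origin moves forward one step at a time, the model is refit on all data before the origin, and the squared error is averaged separately for each forecast horizon. The drift forecaster below is a deliberately simple stand-in, not one of the CoMoMo mortality models, and all names are hypothetical.

```python
def rolling_origin_mse(series, forecaster, min_train, horizon):
    """Mean squared error per horizon step, averaged over expanding training windows."""
    squared_error = [0.0] * horizon
    n_origins = 0
    for origin in range(min_train, len(series) - horizon + 1):
        predictions = forecaster(series[:origin], horizon)
        actual = series[origin:origin + horizon]
        for h in range(horizon):
            squared_error[h] += (predictions[h] - actual[h]) ** 2
        n_origins += 1
    if n_origins == 0:
        raise ValueError("series too short for the requested min_train and horizon")
    return [e / n_origins for e in squared_error]


def drift_forecaster(train, horizon):
    """Stand-in model: extend the last value by the average historical step."""
    step = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + step * (h + 1) for h in range(horizon)]
```

Reporting the error per horizon rather than as one pooled number shows how quickly accuracy degrades as the forecast reaches further ahead, which matters for multi-step mortality projections.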
Results
The results indicate that the LLM-integrated interface successfully translates natural language requests into valid forecasting configurations, enhancing user accessibility while preserving the accuracy and transparency of the forecasting process. The system demonstrated effective performance in generating reproducible analytical outputs.
Implications
The implications of this research are significant for actuarial practice, as it provides a tool that supports evidence-based decision-making in public and private sectors. By making complex forecasting models more accessible, it can facilitate quicker and more informed policy decisions regarding retirement security, healthcare planning, and resource allocation.
A Machine Learning-Based Framework for Discovering Huntington's Disease Stages: Integrating Graph Representation Learning and clustering to Uncover Progression Dynamics in Longitudinal Enroll-HD Dataset
Graph Learning
Time Series
Multimodal
- Developed an unsupervised machine learning framework for HD stage discovery.
- Utilized graph representation learning to capture temporal relationships in longitudinal data.
- Achieved robust clustering performance with significant clinical distinctions between identified stages.
- Provided an objective, data-driven foundation for staging HD, reducing reliance on expert assessments.
Summary
This paper presents a novel unsupervised machine learning framework aimed at identifying stages of Huntington's Disease (HD) by utilizing graph representation learning and clustering techniques. Traditional clinical staging methods often rely on predefined thresholds and expert assessments, which can lead to inconsistencies and overlook intra-stage variability. To overcome these limitations, the authors developed a framework that encodes longitudinal clinical data into compact latent representations, capturing the temporal relationships among patients. The framework employs K-means++ clustering to determine the number of distinct disease stages, iteratively assessing cluster robustness through stability analysis. The study analyzed data from 302 individuals in the Enroll-HD cohort, encompassing 1,477 visits and 44 clinical variables. The results indicated robust clustering performance, revealing two primary disease stages and four meaningful stages with significant clinical distinctions. This data-driven approach provides a more objective understanding of HD progression and could be extended to integrate multimodal biomarkers for further insight into the disease's course.
Methodology
The authors implemented an unsupervised machine learning framework that integrates graph-based representation learning to encode longitudinal clinical data. They applied K-means++ clustering to identify distinct disease stages and conducted stability analysis to assess cluster robustness, iteratively adjusting the number of clusters.
Results
The framework achieved a Silhouette score of 0.67, a Davies–Bouldin index of 0.56, and a Calinski–Harabasz score of 453, indicating robust clustering performance. The analysis revealed two primary disease stages and four meaningful stages with statistically significant distinctions, aligning with clinical measurement boundaries and showing minimal overlap compared to traditional staging methods.
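To make the clustering step and the Silhouette score concrete, here is a small standard-library sketch of K-means with k-means++ seeding followed by a mean silhouette computation. It is illustrative only: the 2-D toy points stand in for the learned patient embeddings, and `kmeans_pp` and `silhouette_score` are hypothetical helper names, not the authors' code.

```python
import math
import random

def kmeans_pp(points, k, iters=20, seed=0):
    """K-means with k-means++ seeding; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Sample the next center with probability proportional to the squared
        # distance from each point to its nearest existing center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette_score(points, labels):
    """Mean silhouette coefficient (assumes at least two clusters)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = mean_dist[own]                                       # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)     # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated toy blobs (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
labels = kmeans_pp(points, k=2)
print(labels, round(silhouette_score(points, labels), 3))
```

Silhouette values near 1 indicate tight, well-separated clusters. A stability analysis like the one described would repeat this over several values of `k` and resampled data.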
Implications
This framework offers a more objective and data-driven approach to understanding Huntington's Disease progression, which could enhance patient care and treatment strategies. It also opens avenues for integrating objective multimodal biomarkers, potentially leading to more personalized healthcare solutions.
Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning
NLP
Large Language Models
Efficient ML
- Introduces SETA, a framework that separates task-specific and shared knowledge in continual learning.
- Utilizes adaptive sparse subspace decomposition to create distinct expert modules.
- Implements a Split-on-Share mechanism to dynamically assign parameters as shared or unique experts.
- Demonstrates superior performance in retaining early-task knowledge and improving backward transfer.
Summary
This paper addresses the challenges of continual learning in Large Language Models (LLMs), particularly the plasticity-stability dilemma that leads to catastrophic forgetting of previously learned knowledge when adapting to new tasks. The authors propose a novel framework called Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), which introduces a mechanism for adaptive sparse subspace decomposition into task-specific expert modules. Unlike traditional methods that treat parameters uniformly, SETA separates knowledge into unique experts for task-specific patterns and shared experts for common features. This separation is maintained through adaptive elastic anchoring and routing-aware regularization, which protect shared knowledge at both the weight and routing levels. The framework allows for a unified gating network to dynamically retrieve the appropriate expert combination during inference without relying on external task identifiers. Extensive experiments demonstrate that SETA achieves competitive or superior performance compared to state-of-the-art continual learning methods, showing strong retention of early-task knowledge and improved backward transfer across various domain-specific benchmarks.
Methodology
The methodology involves a Mixture-of-Experts paradigm where the model is decomposed into distinct expert modules based on sparse parameter patterns. The Split-on-Share mechanism dynamically classifies parameters into shared and unique experts, with protective measures against semantic drift. The routing-aware regularization shapes the gating decisions to ensure effective retrieval of relevant expert modules during inference.
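As a rough picture of how a gating network can pick an expert combination without a task identifier, the following generic top-k Mixture-of-Experts routing sketch may help. It is not SETA itself: the linear gate, the toy experts, and the name `moe_forward` are assumptions made for illustration, and the Split-on-Share mechanism and regularizers are not modeled.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x to the top_k experts chosen by a linear gating network."""
    # One gating score per expert, computed from the input alone.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the selected experts
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, chosen

# Hypothetical experts: simple element-wise maps standing in for expert modules.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: list(v),
]
gate_weights = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
print(moe_forward([1.0, 0.0], gate_weights, experts, top_k=1))
```

Because the routing decision depends only on `x`, no external task label is needed at inference, which is the property the summary attributes to the unified gate.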
Results
The experiments conducted across diverse benchmarks indicate that SETA outperforms existing continual learning baselines, particularly in terms of retaining knowledge from earlier tasks and enhancing backward transfer capabilities. The results highlight the effectiveness of the proposed framework in addressing the plasticity-stability trade-off inherent in continual learning.
Implications
The findings suggest that SETA can significantly improve the deployment of LLMs in continual learning scenarios, making them more adaptable to new tasks while preserving previously acquired knowledge. This has potential applications in various NLP tasks, including text classification, reasoning, and dialogue systems.
MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
Reinforcement Learning
Large Language Models
Optimization
- Identification of three failure modes in GRPO: low-variance amplification, mean-centering blindness, and zero-variance collapse.
- Introduction of multi-temperature sampling to improve reward diversity in small-batch scenarios.
- Development of dual-anchor advantages to enhance learning signals in homogeneous reward groups.
- Application of prospect-theoretic shaping to control update magnitudes and emphasize constraint violations.
Summary
This paper addresses the challenges of reinforcement learning (RL) in multi-constraint instruction following, particularly focusing on the instability of standard group-relative policy optimization (GRPO) under discrete, low-dispersion rewards. The authors identify three critical pathologies associated with z-score group normalization: low-variance amplification, mean-centering blindness, and zero-variance collapse. To mitigate these issues, they propose MDP-GRPO, which incorporates several innovative strategies: multi-temperature sampling to enhance reward dispersion, dual-anchor advantages to restore gradients in homogeneous groups, prospect-theoretic shaping to bound updates and penalize violations, and asymmetric KL regularization. The proposed method is evaluated on various benchmarks, including FollowBench and IFEval, demonstrating significant improvements in constraint satisfaction and stable convergence, particularly with small group sizes, while maintaining general capabilities on additional tasks. The findings suggest that MDP-GRPO effectively stabilizes learning in environments with strict compliance requirements, making it a valuable contribution to the field of RL.
Methodology
The authors propose MDP-GRPO, which includes multi-temperature sampling for reward diversity, dual-anchor advantages to address learning signal issues, and prospect-theoretic shaping for bounded updates. Asymmetric KL regularization is also employed to enhance stability during training.
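Two of the pathologies attributed to z-score group normalization can be reproduced in a few lines. The sketch below shows only the standard normalization, not MDP-GRPO's fixes. The population standard deviation and the `eps` value are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev

def zscore_advantages(rewards, eps=1e-6):
    """Group-relative advantages via z-score normalization within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Zero-variance collapse: identical rewards give no learning signal,
# whether every sample satisfied the constraints or every sample failed.
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])

# Low-variance amplification: a 0.01 reward gap is scaled up to roughly the
# same advantage as a full 1.0 gap.
tiny_gap = zscore_advantages([0.50, 0.50, 0.50, 0.51])
big_gap = zscore_advantages([0.0, 0.0, 0.0, 1.0])

print(all_pass, all_fail)
print(tiny_gap[-1], big_gap[-1])
```

With discrete, low-dispersion rewards and small groups, both situations occur often, which is the setting the paper targets.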
Results
MDP-GRPO outperforms standard GRPO, achieving up to a 5.0% improvement in strict constraint satisfaction on the Llama-3.2-3B model. The method also enables stable convergence with small group sizes while preserving performance on general knowledge tasks.
Implications
The findings suggest that MDP-GRPO can be effectively applied in scenarios requiring strict adherence to multiple constraints, such as legal and technical document generation, enhancing the reliability of RL systems in real-world applications.
Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction
Generative Models
- GILC is a training-free framework that utilizes pretrained diffusion networks for value function estimation.
- The method introduces logits correction guidance to stabilize gradient computation in discrete spaces.
- A formal connection to policy gradients is established, allowing for handling non-differentiable objectives.
- GILC demonstrates superior performance and efficiency in constrained generation tasks across scientific domains.
Summary
This paper introduces Gradient-Informed Logit Correction (GILC), a novel plug-and-play framework designed to enhance controllable generation in discrete diffusion models without the need for retraining. The authors address the challenges posed by high computational overhead and gradient instability in high-dimensional discrete spaces. GILC repurposes a pretrained denoising network as a variational proxy to efficiently estimate guidance signals. The framework employs a Jacobian-free mechanism to correct clean prediction logits, ensuring stable and effective guidance. GILC is compatible with both differentiable and non-differentiable reward functions, making it versatile for various applications. The authors conduct extensive experiments in domains such as DNA sequence design, protein engineering, and molecular generation, demonstrating that GILC achieves state-of-the-art performance while avoiding the computational costs associated with traditional fine-tuning methods. The results indicate that GILC not only outperforms existing training-free guidance methods but also competes favorably with fine-tuning approaches, showcasing its potential for advancing discrete diffusion guidance.
Methodology
The GILC framework employs a variational method where a pretrained denoising network serves as a proxy for estimating the value function. It combines the Gumbel-Softmax trick with a Straight-Through estimator to maintain gradient flow in discrete spaces. The Jacobian-free update mechanism directly corrects clean prediction logits, facilitating stable guidance. The framework is designed to work with off-the-shelf objective functions without requiring additional training.
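The Gumbel-Softmax relaxation mentioned above can be sketched without any deep learning framework. The snippet draws a soft (relaxed) sample and the matching hard one-hot vector. The function name and parameters are illustrative assumptions, and the straight-through gradient is described only in comments because it needs an autodiff library.

```python
import math
import random

def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) sample and its hard one-hot counterpart."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = soft.index(max(soft))
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    # In an autodiff framework, a straight-through estimator would use `hard`
    # in the forward pass and the gradient of `soft` in the backward pass.
    return soft, hard

rng = random.Random(0)
soft, hard = gumbel_softmax_sample([2.0, 0.5, -1.0], tau=0.5, rng=rng)
print(soft, hard)
```

Lower `tau` pushes `soft` toward one-hot at the cost of higher-variance gradients. The argmax of the noisy logits is itself an exact categorical sample (the Gumbel-max trick).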
Results
GILC significantly outperforms existing training-free discrete diffusion guidance methods in terms of sample quality and computational efficiency. It achieves state-of-the-art results in controlled discrete generation tasks, matching the performance of fine-tuning-based approaches while eliminating the need for additional training.
Implications
The GILC framework offers a flexible, efficient, training-free method for controllable generation with discrete diffusion models, with potential applications across scientific and industrial settings, including biological sequence synthesis and molecular design.
PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention
NLP
Large Language Models
Audio & Speech
- PJ-RoPE unifies multiple relative-position representations into a single framework.
- The framework separates scalar and rotary implementations for better adaptability.
- Light-cone coordinates are introduced to manage stability in high-order jets.
- Empirical evaluations across various tasks validate the framework's effectiveness.
Summary
The paper introduces PJ-RoPE, a novel framework that integrates the Fourier phase from RoPE, finite jets from Jordan-RoPE, and affine recency from ALiBi into a unified, learnable relative-position space for attention mechanisms in Transformers. This framework allows for the exploration of how different tasks select regions within this space. PJ-RoPE is characterized by its Fourier–Jet–Affine formulation, which includes an optional Poincaré-type completion to enhance the positional representation. The author presents a mathematical foundation where the same primitives can represent various attention mechanisms, such as RoPE and Jordan-RoPE, and highlights the importance of separating scalar PJ-bias kernels from PJ-rotary feature transforms. The framework also addresses stability issues associated with high-order jets, proposing a light-cone approach to manage the growth of transformed features. The experiments conducted demonstrate the effectiveness of PJ-RoPE in recovering designed sectors, adapting to task requirements, and revealing trade-offs between stability and resolution in long-context scenarios.
Methodology
The methodology involves formulating PJ-RoPE as a Fourier–Jet–Affine relative-position space, conducting experiments to evaluate sector containment and selection, and employing adaptive diagnostics to measure task-level performance. The framework is tested through controlled kernels, synthetic sequence tasks, and real-world applications like language modeling and music token streams.
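Two of the ingredients named above, the RoPE rotary phase and the ALiBi linear recency bias, can be combined in a toy attention logit as follows. This is a sketch of those two well-known mechanisms only. The jet terms and the PJ-RoPE parameterization are not modeled, and `theta`, `slope`, and the function names are illustrative assumptions.

```python
import math

def rope_rotate(vec2, position, theta=0.1):
    """Rotate a 2-D feature pair by an angle proportional to its position."""
    x, y = vec2
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def attention_logit(q, k, q_pos, k_pos, theta=0.1, slope=0.05):
    """RoPE-style rotary score plus an ALiBi-style linear recency penalty."""
    qr = rope_rotate(q, q_pos, theta)
    kr = rope_rotate(k, k_pos, theta)
    content = qr[0] * kr[0] + qr[1] * kr[1]
    return content - slope * abs(q_pos - k_pos)

q, k = (0.3, -1.2), (0.7, 0.4)
# Both calls use the same offset (2), so the logits agree up to rounding.
print(attention_logit(q, k, 5, 3), attention_logit(q, k, 105, 103))
```

The rotary part acts on the features (a "rotary feature transform") while the recency part is a scalar added to the logit (a "bias kernel"), which mirrors the scalar versus rotary separation the summary highlights.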
Results
The results indicate that PJ-RoPE effectively recovers designed sectors and adapts to various tasks, demonstrating strong performance in language runs and music-token streams. The framework also highlights a stability-resolution trade-off, where the use of light-cone coordinates improves stability at the cost of phase resolution.
Implications
The implications of this research suggest that PJ-RoPE can enhance the performance of Transformers in long-context scenarios, providing a more flexible and robust approach to relative attention mechanisms. This could lead to improvements in various applications, including natural language processing and music generation.
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
NLP
Large Language Models
- ELMES+ automates the generation of evaluation rubrics tailored for educational scenarios.
- The framework combines a multi-agent evaluation engine with a self-evolving rubric synthesis module.
- Edu-330 benchmark includes 330 scenarios across 11 subjects and reveals multidimensional educational capabilities of LLMs.
- Top-performing models excel in creativity but may lack in specific pedagogical tasks like Socratic questioning.
Summary
The paper presents ELMES+, an innovative framework designed to automate the construction and application of fine-grained evaluation rubrics for large language models (LLMs) in educational contexts, particularly addressing long-tail scenarios. Traditional benchmarks for evaluating LLMs often focus on general correctness and lack the granularity needed for educational effectiveness. ELMES+ integrates a multi-agent evaluation engine with a self-evolving rubric synthesis module (SCENEGEN), allowing for the dynamic generation and refinement of evaluation criteria based on expert-defined pedagogical dimensions. The authors constructed a comprehensive benchmark, Edu-330, which encompasses 330 scenarios across various subjects and grade levels, featuring over 1,000 second-level indicators. Experimental results indicate that educational capabilities of LLMs are multidimensional, revealing that top-tier models excel in creativity and values integration but may struggle with Socratic scaffolding. The education-specialized model InnoSpark achieved the highest human-evaluated scores. The study also highlights the biases present in LLM judges and demonstrates that expert-scored few-shot anchoring can significantly enhance human-LLM alignment, while reasoning constraints and decoding strategies vary in effectiveness across models. ELMES+ thus offers a scalable solution for pedagogically grounded LLM evaluation.
Methodology
The methodology involves the development of ELMES+, which includes a declarative multi-agent evaluation engine for role-based interactions and a self-evolving rubric synthesis module (SCENEGEN) that co-optimizes evaluation criteria and test data based on expert-defined pedagogical dimensions. The framework allows for iterative refinement and scaling of evaluation rubrics across diverse educational scenarios.
Results
The experiments conducted using the Edu-330 benchmark and expert-authored scenarios demonstrated that educational capabilities of LLMs are complex and multifaceted. The InnoSpark model achieved the highest average score in human evaluations, while LLM judges showed lower scoring variance but exhibited biases. The introduction of expert-scored few-shot anchoring improved alignment with human evaluations by approximately 30%.
Implications
The findings suggest that ELMES+ can significantly enhance the evaluation of LLMs in educational contexts, providing a scalable and systematic approach to assess pedagogical effectiveness. This framework could lead to improved educational technologies and personalized learning experiences by ensuring that LLMs are evaluated on their teaching capabilities rather than just factual knowledge.
Learning Explicit Behavioral Models with Adaptive Questions and World-Model Probes
Reinforcement Learning
Interpretability
- Introduces the Explicit Symbolic Behavioral Model (ESBM) to enhance understanding of agent behavior.
- Combines task performance with grounded question answering and mechanism prediction.
- Utilizes adaptive questions and world-model probes to refine the behavioral model after each training rollout.
- Demonstrates that high-scoring policies can be learned while maintaining explicit understanding of mechanisms.
Summary
The paper addresses the limitations of interactive agents that are trained solely based on task performance, which can lead to brittle behaviors due to a lack of understanding of the underlying mechanisms that drive their actions. The authors propose an Explicit Symbolic Behavioral Model (ESBM) that integrates task performance with evidence-based question answering and mechanism prediction. The ESBM uses typed predicates, weighted rules, bounded options, and a mechanism memory to represent behavior and predict outcomes under action interventions. After each training rollout, the model employs adaptive questions and active world-model probes to convert failures and errors into constraints for refining the ESBM. This approach allows for the identification of high-scoring policies while ensuring that the model can provide explicit answers and executable predictions about its mechanisms. The authors demonstrate the effectiveness of the ESBM in Atari-style environments, showing that adaptive questions can serve as both a training pressure and a benchmark for mechanistic policy learning.
Methodology
The authors developed the ESBM framework, which incorporates symbolic representations of behavior through predicates, rules, and mechanism memory. They employed adaptive questioning and active probing techniques to generate training signals based on the agent's performance and understanding errors. The model was evaluated in Atari-style environments to assess its ability to learn effective policies while maintaining interpretability.
Results
The ESBM successfully learned high-scoring policies in the tested Atari environments, while also providing explicit answers to questions about its behavior and making accurate predictions about the consequences of actions. The integration of adaptive questions and world-model probes proved effective in refining the model and enhancing its understanding of the underlying mechanisms.
Implications
This research has significant implications for the development of more interpretable and adaptable AI agents. By emphasizing the importance of understanding the mechanisms behind agent behavior, the ESBM framework can lead to the creation of agents that are better equipped to handle dynamic environments and complex tasks, ultimately improving their robustness and reliability.
Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway
Theory
Optimization
- Discrete Gradient Descent with large step sizes leads to pathway re-balancing rather than symmetry breaking.
- Single-path solutions correspond to sharp minima, while balanced solutions across pathways are flatter.
- The relationship between the number of pathways, depth, and sharpness of minima is theoretically derived.
- Training dynamics exhibit two phases: initial symmetry breaking followed by a re-balancing phase due to oscillations.
Summary
This paper investigates the training dynamics of Deep Linear Networks (DLNs) with multiple pathways, focusing on the effects of discrete Gradient Descent (GD) with large step sizes. The authors challenge the prevailing notion derived from Gradient Flow (GF) analyses, which suggest that multi-pathway DLNs undergo 'winner-takes-all' specialization leading to symmetry breaking. Instead, they demonstrate that large-step GD induces a re-balancing phase where signals are redistributed across pathways, countering the sharp minima associated with single-path solutions. The authors provide theoretical insights into how the sharpness of minima is affected by the number of pathways and the depth of the network, establishing that balancing signals across pathways reduces sharpness significantly. Their findings reveal that while initial training may exhibit symmetry breaking, oscillations at the Edge of Stability lead to a stabilization of representations across pathways, favoring shared representations over single-pathway dominance. This work emphasizes the importance of discrete optimization dynamics in shaping the final structure of neural networks and suggests a need to revisit GF-based predictions in light of these findings.
Methodology
The authors analyze the geometry of the loss landscape in deep linear networks with multiple pathways. They derive theoretical results regarding the sharpness of minima based on the number of pathways and network depth, and they explore the dynamics of training under large-step GD, particularly focusing on the Edge of Stability.
Results
The study proves that balancing signals across multiple pathways reduces sharpness by a factor dependent on the number of pathways and depth. It identifies two training phases: an initial phase of symmetry breaking followed by a re-balancing phase driven by oscillations. Additionally, an upper bound on the learning rate is established to ensure stability during training.
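A scalar toy model gives some intuition for why balanced solutions are flatter. The assumptions here are mine, not the paper's: each pathway is a product of scalar weights, the network output is the sum over pathways, the target is 1, and weights within a pathway are equal. At an exact minimum of the squared loss the Hessian is the outer product of the output gradient with itself, so the sharpness is the squared gradient norm.

```python
def sharpness_at_minimum(path_weights):
    """Top Hessian eigenvalue of 0.5 * (f(w) - y)**2 at a point where f(w) = y.

    f(w) is the sum over pathways of the product of that pathway's weights.
    At an exact minimum the Hessian equals grad f * grad f^T, so its only
    nonzero eigenvalue is the squared norm of grad f.
    """
    total = 0.0
    for weights in path_weights:
        for l in range(len(weights)):
            partial = 1.0  # derivative of the path product w.r.t. weight l
            for j, w in enumerate(weights):
                if j != l:
                    partial *= w
            total += partial ** 2
    return total

def single_path(num_paths, depth):
    """One pathway carries the whole signal; the others are switched off."""
    return [[1.0] * depth] + [[0.0] * depth for _ in range(num_paths - 1)]

def balanced_paths(num_paths, depth):
    """Every pathway carries an equal share 1 / num_paths of the signal."""
    w = (1.0 / num_paths) ** (1.0 / depth)
    return [[w] * depth for _ in range(num_paths)]

for depth in (2, 3, 6):
    s_single = sharpness_at_minimum(single_path(2, depth))
    s_balanced = sharpness_at_minimum(balanced_paths(2, depth))
    # On a quadratic, gradient descent is stable only for step size < 2 / sharpness.
    print(depth, s_single, s_balanced, 2.0 / s_single, 2.0 / s_balanced)
```

In this toy the balanced-to-single sharpness ratio works out to K^((2 - D) / D) for K pathways of depth D, so the flattening only appears for depth above 2 and grows with both K and D. The paper's exact factor and learning-rate bound are not given in this summary.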
Implications
The findings suggest that the implicit biases of optimization algorithms like GD can significantly influence the training outcomes of neural networks, particularly in multi-pathway architectures. This has implications for designing more effective training strategies and understanding the dynamics of deep learning models.
On the Geometry of On-Policy Distillation
NLP
Large Language Models
Reinforcement Learning
- OPD occupies a relaxed off-principal regime in parameter space, showing unique update dynamics compared to SFT and RLVR.
- The phenomenon of subspace locking indicates that OPD updates converge into a stable low-dimensional channel early in training.
- Objective composition plays a critical role in maintaining the locked trajectory of OPD updates.
- Control experiments demonstrate that certain perturbations do not affect OPD's rank dynamics, while others do.
Summary
This paper investigates the training dynamics of On-Policy Distillation (OPD) in large language models, comparing it with Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). The authors characterize OPD's parameter-space trajectory, revealing that it occupies a 'relaxed off-principal regime' where updates are less constrained than RLVR but more selective than SFT. They identify a phenomenon termed 'subspace locking,' where OPD updates rapidly converge into a low-dimensional channel during training. This channel is functionally sufficient for OPD: constraining training to this subspace preserves OPD performance, whereas the same constraint degrades SFT. The study also examines the factors influencing this trajectory, finding that token sparsification and off-policy rollouts leave the rank dynamics unchanged, whereas the objective composition significantly alters the update trajectory. Overall, the findings suggest that OPD follows a unique update geometry distinct from both SFT and RLVR, providing insights for future geometry-aware OPD algorithms.
Methodology
The authors employed a suite of parameter-space diagnostics to analyze OPD's update trajectory, including effective dimension tracking, update scale, and spectral shape diagnostics. They conducted control experiments to assess the impact of various factors on OPD's update dynamics.
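One common way to track the effective dimension of a weight update is the participation ratio of its singular values. The summary does not say which definition the authors use, so the following is an assumed stand-in, with `participation_ratio` a hypothetical helper name.

```python
def participation_ratio(delta):
    """Effective dimension (sum s_i^2)^2 / sum s_i^4 over singular values s_i.

    Computed without an SVD: sum s_i^2 is the squared Frobenius norm of delta,
    and sum s_i^4 is the squared Frobenius norm of delta @ delta^T.
    """
    rows = len(delta)
    gram = [[sum(a * b for a, b in zip(delta[i], delta[j])) for j in range(rows)]
            for i in range(rows)]
    trace = sum(gram[i][i] for i in range(rows))
    gram_sq = sum(g * g for row in gram for g in row)
    return trace ** 2 / gram_sq if gram_sq > 0 else 0.0

# A rank-1 update (outer product) has effective dimension 1;
# an identity-like update spreads evenly over all directions.
rank_one = [[3.0, 4.0, 5.0], [6.0, 8.0, 10.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(participation_ratio(rank_one), participation_ratio(identity))
```

Applied to the difference between a checkpoint and the initial weights, a value that drops early and then stays flat would be consistent with the "subspace locking" picture described above.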
Results
The analysis revealed that OPD updates are less dense and more selective than SFT, while being less constrained than RLVR. The study identified a persistent low-dimensional update channel for OPD, which is functionally sufficient for maintaining performance. Additionally, the results indicated that while certain perturbations do not disrupt OPD's rank dynamics, changes in objective composition can significantly alter the update trajectory.
Implications
The findings suggest that understanding the geometric properties of OPD can lead to improved training strategies for large language models, potentially enhancing their reasoning capabilities. This could inform the design of future algorithms that leverage OPD's unique characteristics for better performance in complex tasks.
Structure-Preserving Correction Learning for Sparse Bayesian Inference in Brain Source Imaging
Theory
Optimization
Interpretability
- Introduces a structure-preserving framework for learning hyperparameter updates in M/EEG source imaging.
- Unfolds classical Type-II Bayesian methods into a trainable neural architecture, enhancing interpretability.
- Implements progressively expressive correction mechanisms to improve reconstruction performance.
- Demonstrates significant improvements in convergence and accuracy over traditional methods.
Summary
This paper addresses the challenges of sparse Bayesian inference in M/EEG brain imaging, particularly the limitations of classical Type-II Bayesian methods that rely on fixed iterative update rules for estimating source and noise hyperparameters. The authors propose a novel framework that learns the update mechanism while preserving the Bayesian structure by unfolding a classical solver into a trainable neural architecture. This architecture mirrors the original iterative updates and is initialized to replicate the classical solver before training. The framework incorporates progressively expressive correction-learning mechanisms, including learnable biases, adaptive multi-layer perceptrons (MLPs), and attention-based refinements. The approach aims to enhance empirical reconstruction performance without sacrificing the interpretability of the original Bayesian inference. Experimental results demonstrate that the learned correction variants significantly improve reconstruction accuracy and convergence behavior compared to the baseline unfolded solver, while maintaining algorithmic transparency.
Methodology
The authors develop a neural architecture that unfolds a classical convex-bounding update solver for joint Type-II learning of source and noise hyperparameters. This architecture is initialized to replicate the classical solver and is enhanced through a series of learned corrections, including bias-only adaptations, residual MLPs, and attention-augmented variants. The training process optimizes these corrections while preserving the underlying Bayesian structure.
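The idea of unfolding a solver and initializing it to reproduce the classical iterations can be shown with a toy example. The update rule below is a generic fixed-point step (the Babylonian square-root iteration) used purely as a stand-in. It is not the Type-II convex-bounding update, and all names are hypothetical.

```python
def classical_update(gamma, data_power):
    """Toy fixed-point step (Babylonian square-root iteration), a stand-in only."""
    return 0.5 * (gamma + data_power / gamma)

def run_classical(gamma0, data_power, num_iters):
    """Apply the fixed classical rule num_iters times."""
    gamma = gamma0
    for _ in range(num_iters):
        gamma = classical_update(gamma, data_power)
    return gamma

class UnfoldedSolver:
    """num_layers unrolled iterations, each with a bias-only learned correction."""

    def __init__(self, num_layers):
        # Zero initialization makes the untrained network identical to the
        # classical solver; training would adjust these values.
        self.biases = [0.0] * num_layers

    def forward(self, gamma0, data_power):
        gamma = gamma0
        for b in self.biases:
            gamma = classical_update(gamma, data_power) + b
        return gamma

solver = UnfoldedSolver(num_layers=5)
print(solver.forward(1.0, 9.0), run_classical(1.0, 9.0, 5))
```

Richer variants would replace the scalar bias with a small residual MLP or attention module applied to the classical step's output while keeping the same layer-per-iteration structure.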
Results
The experimental results indicate that the proposed correction-learning variants outperform the baseline unfolded solver in terms of reconstruction accuracy and convergence behavior. The residual-based refinements particularly demonstrate substantial improvements while retaining the interpretability of the original Bayesian algorithm.
Implications
This work has significant implications for improving the accuracy and efficiency of brain source imaging techniques, potentially leading to better diagnostic tools in neuroscience and clinical applications. The approach also highlights the potential of integrating deep learning with classical probabilistic models in other domains.
CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations
Generative Models
Optimization
Efficient ML
- Introduces a continual meta-learning framework for personalized cardiac simulations.
- Addresses the challenges of computational cost and data shifts in clinical settings.
- Utilizes a Bayesian Gaussian Mixture Model for effective data management.
- Demonstrates superior performance in simulation accuracy and efficiency compared to existing methods.
Summary
The paper introduces CoMetaPNS, a novel framework designed to enhance the personalization of neural surrogates for cardiac electrophysiology simulations. Traditional methods struggle with the computational costs and time required for personalizing models due to the need for extensive subject-specific data and the challenges posed by data shifts in clinical settings. CoMetaPNS addresses these issues by employing a continual meta-learning approach that allows for the integration of new data without the need for retraining on previous datasets, thus mitigating catastrophic forgetting. The framework utilizes a continual Bayesian Gaussian Mixture Model to manage incoming data and identify its source, facilitating effective meta-learning. Empirical evaluations on synthetic cardiac data demonstrate that CoMetaPNS outperforms existing methods in terms of simulation accuracy, computational efficiency, and resilience to forgetting, making it a promising solution for clinical applications.
Methodology
The methodology involves a continual meta-learning framework that integrates a continual Bayesian Gaussian Mixture Model over a memory buffer. This allows the model to infer identifiers and relationships of incoming data over time, facilitating effective personalization of neural surrogates using limited subject-specific data. The approach leverages few-shot generative modeling and amortized inference to learn the process of personalizing surrogates.
Results
Empirical results indicate that CoMetaPNS significantly improves simulation forecasting accuracy and computational scalability while effectively preventing catastrophic forgetting. The framework demonstrates its ability to adapt to new subjects and data dynamics without the need for full retraining, outperforming existing baseline models.
Implications
The implications of this research are substantial for clinical applications in cardiac electrophysiology, where personalized simulations can enhance treatment planning and risk stratification. The ability to efficiently adapt to new data and subjects could lead to more timely and accurate patient-specific interventions.
Maximising the Set-Piece Return: Optimising Football Corner Tactics with Graph Reinforcement Learning
Reinforcement Learning
Graph Learning
Optimization
- Introduces a Graph Reinforcement Learning framework for optimizing football corner tactics.
- Formulates corner kick optimization as a Markov Decision Process to enable novel tactical discoveries.
- Demonstrates significant performance improvements over traditional optimization methods.
- Utilizes a predictive Graph Neural Network to inform the Expected First Contact Shot probability.
Summary
This paper presents a novel approach to optimizing football corner kick tactics using Graph Reinforcement Learning (Graph RL). Unlike traditional methods that focus on historical data and imitation of past actions, this work aims to discover new, generalizable strategies by formulating the corner kick optimization as a Markov Decision Process (MDP). The authors propose a central policy that adjusts player positions and velocities to maximize the Expected First Contact Shot (xFCS) probability. The methodology integrates Graph Neural Networks (GNN) with deep reinforcement learning techniques, specifically Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO). The empirical evaluation, conducted on over 3,000 Premier League corner scenarios, demonstrates that the proposed method significantly outperforms baseline optimization techniques such as Random Search and Simulated Annealing, even under matched computational budgets. The results indicate that Graph RL can effectively transition set-piece analysis from historical evaluation to reward-driven tactical innovation.
Methodology
The authors formulate the corner kick optimization problem as a Markov Decision Process (MDP) and employ a Graph Reinforcement Learning approach that combines Graph Neural Network embeddings with deep reinforcement learning algorithms (SAC and PPO) to learn a general policy for adjusting player positions and velocities during corner kicks.
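To make the MDP framing concrete, here is a minimal sketch under stated assumptions: the state is a tuple of attacker positions and velocities, an action is a per-player adjustment, and the reward is a placeholder score. The function `toy_xfcs` is a made-up distance heuristic standing in for the paper's predictive GNN, and `Player`, `step`, and `random_search` are hypothetical names. Only the Random Search baseline mentioned in the summary is shown, not the SAC or PPO policies.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Player:
    x: float   # pitch position in metres (hypothetical coordinate frame)
    y: float
    vx: float  # velocity in metres per second
    vy: float


def toy_xfcs(attackers, target=(0.0, 0.0)):
    """Placeholder reward in (0, 1]: higher when attackers are near `target`.
    The paper uses a predictive GNN here; this function is NOT that model."""
    mean_dist = sum(
        ((p.x - target[0]) ** 2 + (p.y - target[1]) ** 2) ** 0.5
        for p in attackers
    ) / len(attackers)
    return 1.0 / (1.0 + mean_dist)


def step(attackers, action):
    """MDP transition: apply one (dx, dy, dvx, dvy) adjustment per player
    and return the next state together with its reward."""
    next_state = tuple(
        Player(p.x + dx, p.y + dy, p.vx + dvx, p.vy + dvy)
        for p, (dx, dy, dvx, dvy) in zip(attackers, action)
    )
    return next_state, toy_xfcs(next_state)


def random_search(attackers, budget, max_shift=1.0, seed=0):
    """Random Search baseline: sample `budget` actions, keep the best reward."""
    rng = random.Random(seed)
    best_state, best_reward = attackers, toy_xfcs(attackers)
    for _ in range(budget):
        action = [
            tuple(rng.uniform(-max_shift, max_shift) for _ in range(4))
            for _ in attackers
        ]
        candidate, reward = step(attackers, action)
        if reward > best_reward:
            best_state, best_reward = candidate, reward
    return best_state, best_reward


start = (Player(6.0, 3.0, 0.0, 0.0), Player(8.0, -2.0, 0.0, 0.0))
_, best = random_search(start, budget=200)
print(f"initial reward {toy_xfcs(start):.3f}, best found {best:.3f}")
```

A learned policy would replace the random action sampler with a network that maps the graph-encoded state to an adjustment, trained against the same reward signal.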
Results
The proposed method was evaluated on a dataset of over 3,000 Premier League corners, where it yields significantly higher tactical rewards than traditional optimization techniques, with xFCS probabilities increasing by up to 190% and 105% in different scenarios. The method maintained competitive performance even when traditional methods were given disproportionately larger search budgets.
Implications
This research suggests that Graph Reinforcement Learning can revolutionize tactical analysis in football by enabling coaches to explore innovative strategies that have not been previously observed, thereby enhancing the effectiveness of set-piece plays and potentially increasing goal-scoring opportunities.
HoT-SSM: Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care
Graph Learning
Time Series
Interpretability
- Introduces a temporal knowledge-infused hypergraph framework for modeling EHR data.
- Develops a dynamic hypergraph state space model to capture higher-order relationships and long-range temporal information.
- Demonstrates significant performance improvements over existing models on clinical prediction tasks.
- Establishes theoretical guarantees for the robustness of the learned representations.
Summary
The paper presents HoT-SSM, a novel framework designed to enhance the modeling of electronic health records (EHRs) through higher-order temporal knowledge graphs. Traditional medical knowledge graph (MKG) approaches struggle to capture complex interactions among clinical concepts and often overlook long-range temporal dependencies, which are crucial for accurate clinical predictions. HoT-SSM addresses these limitations by constructing hypergraphs that group semantically related clinical concepts into hyperedges, thus preserving the context of clinical visits. Additionally, it employs a dynamic hypergraph-based state space model to effectively capture the evolution of patient states over time while maintaining long-range information. The framework was empirically validated using the MIMIC-III and MIMIC-IV datasets, demonstrating significant performance improvements in clinical prediction tasks compared to existing state-of-the-art models. The results indicate that HoT-SSM's ability to jointly model higher-order interactions and temporal dynamics leads to more interpretable and accurate predictions in healthcare applications.
Methodology
HoT-SSM constructs hypergraphs from clinical visit data by grouping related clinical concepts into hyperedges, which preserves the visit-level context. It uses a dynamic state space model to learn the evolving latent state of each patient, capturing both higher-order relationships and long-range temporal dynamics. The framework was tested on the MIMIC-III and MIMIC-IV datasets to evaluate its performance on clinical prediction tasks.
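The visit-as-hyperedge idea can be sketched with a plain incidence matrix, followed by a very simple linear recurrence over visits. This is an illustration under assumptions, not the HoT-SSM architecture: the concept names, the helper names `build_incidence` and `rollout_state`, and the fixed `decay` factor are invented for the example, and the recurrence is a toy diagonal state update rather than the paper's dynamic hypergraph state space model.

```python
def build_incidence(visits):
    """Return (concepts, matrix) where matrix[i][j] == 1 if concept i
    appears in visit j. Each visit is treated as one hyperedge."""
    concepts = sorted({code for visit in visits for code in visit})
    index = {code: i for i, code in enumerate(concepts)}
    matrix = [[0] * len(visits) for _ in concepts]
    for j, visit in enumerate(visits):
        for code in visit:
            matrix[index[code]][j] = 1
    return concepts, matrix


def rollout_state(matrix, decay=0.5):
    """Per-concept linear recurrence h_t = decay * h_{t-1} + u_t, where
    u_t is the column of the incidence matrix for visit t."""
    state = [0.0] * len(matrix)
    for j in range(len(matrix[0])):
        state = [decay * h + row[j] for h, row in zip(state, matrix)]
    return state


# Hypothetical patient history: three visits, each a set of clinical concepts.
visits = [
    {"diabetes", "metformin"},
    {"diabetes", "hypertension", "lisinopril"},
    {"hypertension", "lisinopril"},
]
concepts, H = build_incidence(visits)
final_state = rollout_state(H)
for concept, row, h in zip(concepts, H, final_state):
    print(f"{concept:13s} {row} state={h}")
```

Because a hyperedge can contain any number of concepts, the second visit keeps "diabetes", "hypertension", and "lisinopril" together as one unit instead of splitting them into pairwise edges. The decaying state shows how earlier visits can still contribute to the current patient representation.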
Results
The empirical evaluation on the MIMIC-III and MIMIC-IV datasets showed that HoT-SSM significantly outperformed existing state-of-the-art models on various clinical prediction tasks, demonstrating its effectiveness in capturing complex clinical interactions and temporal dependencies.
Implications
HoT-SSM has the potential to improve clinical decision-making processes by providing more accurate and interpretable predictions in healthcare. Its ability to model higher-order interactions and long-range temporal dynamics can enhance the understanding of patient health trajectories and inform treatment strategies.
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
Computer Vision
Reinforcement Learning
Robotics
- MacArena benchmarks 421 tasks across 50 applications on macOS, filling a gap in existing evaluation tools.
- The benchmark includes a mix of ported tasks, existing tasks, and new macOS-specific tasks to enhance complexity and coverage.
- Evaluation results show that current CUAs struggle more with macOS-native tasks, suggesting that macOS presents unique challenges.
- All tasks in MacArena are human-verified, ensuring high quality and reproducibility for future research.
Summary
The paper introduces MacArena, a comprehensive benchmark designed to evaluate computer-use agents (CUAs) on macOS, addressing the limitations of existing benchmarks like macOSWorld. MacArena consists of 421 manually verified tasks across 50 applications, combining tasks from OSWorld, macOSWorld, and 49 new macOS-native tasks. The authors argue that macOS presents unique GUI challenges that are not adequately captured by Linux-based benchmarks. Their evaluation reveals that existing models perform significantly worse on macOS-native tasks compared to ported tasks, indicating that macOS is a more challenging environment for CUAs. The paper emphasizes the need for tailored benchmarks to better assess and improve the capabilities of CUAs in diverse operating systems.
Methodology
MacArena was developed by curating tasks from existing benchmarks (OSWorld and macOSWorld) and creating new tasks specifically for macOS. The tasks were manually verified for clarity and execution feasibility. The benchmark operates within Apple's Virtualization framework on Apple Silicon, allowing for accurate performance evaluation of CUAs in a native environment.
Results
The evaluation of CUAs using MacArena revealed a consistent performance drop across all models when compared to their performance on Linux benchmarks. Notably, a leading model showed a 26% decrease in performance on macOS-native tasks, indicating that these tasks are significantly more challenging than their ported counterparts.
Implications
The introduction of MacArena has significant implications for the development of CUAs, as it provides a robust framework for evaluating and improving their performance on macOS. This can lead to advancements in the design of more capable digital assistants and enhance user experience across macOS applications.