AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
62 papers today · 8-hour update frequency · 7 days of history
Deep Learning-Based Metamodeling of Nonlinear Stochastic Dynamic Systems under Parametric and Predictive Uncertainty
Time Series
Theory
Optimization
- Introduces three metamodeling frameworks for nonlinear dynamic systems that account for both loading and parameter uncertainties.
- Utilizes advanced deep learning architectures (MLP, MPNN, AE) combined with LSTM for effective feature extraction and time-series prediction.
- Demonstrates low prediction errors across different structural models, validating the effectiveness of the proposed methods.
- Establishes a correlation between predictive variance and actual error, enhancing model reliability and confidence in predictions.
Summary
This paper addresses the challenges of modeling high-dimensional, nonlinear dynamic structural systems subjected to natural hazards, particularly focusing on uncertainties in external loads and structural parameters. The authors propose three innovative metamodeling frameworks that integrate feature extraction modules—using multi-layer perceptron (MLP), message-passing neural network (MPNN), or autoencoder (AE)—with a long short-term memory (LSTM) network. These frameworks are designed to quantify both epistemic and aleatoric uncertainties while predicting the complete time-history response across all degrees of freedom in large-scale nonlinear structural systems. The proposed architectures (MLP-LSTM, MPNN-LSTM, AE-LSTM) were validated through two case studies: a multi-degree-of-freedom Bouc–Wen system and a 37-story fiber-discretized nonlinear steel moment-resisting frame. The results demonstrated that the MLP-LSTM achieved the highest accuracy for the simpler Bouc–Wen system, while the MPNN-LSTM and AE-LSTM outperformed on the more complex steel-frame model. The study confirms a strong correlation between predictive variance and actual error, indicating the frameworks' potential for active-learning strategies and reliable model confidence assessment in structural response predictions.
Methodology
The study develops three metamodeling frameworks (MLP-LSTM, MPNN-LSTM, AE-LSTM) that integrate a feature extraction module with an LSTM network. The feature extraction modules distill information from structural configurations and stochastic excitations into compact feature vectors. The LSTM network is trained using Monte Carlo dropout and a negative log-likelihood loss function to learn temporal dynamics and quantify uncertainties. Wavelet-based approximations are employed to enhance training efficiency.
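The uncertainty-quantification recipe described above (Monte Carlo dropout plus a negative log-likelihood loss) can be sketched in a few lines. This is a minimal numpy illustration with a toy masked linear readout standing in for the LSTM; all names and values are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of y under a heteroscedastic Gaussian head
    (constant terms dropped), the loss used to train the network."""
    return float(np.mean(0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var))))

def mc_dropout_predict(forward, x, n_samples=100, keep_prob=0.9):
    """Monte Carlo dropout: keep dropout masks active at inference and
    summarize the stochastic passes as a predictive mean and variance."""
    preds = np.stack([forward(x, rng.random(x.shape) < keep_prob)
                      for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)

# toy stand-in for the metamodel: masked linear readout of the features
w = rng.normal(size=8)
forward = lambda x, mask: float((x * mask) @ w)
mu, var = mc_dropout_predict(forward, rng.normal(size=8))
```

The predictive variance `var` is the quantity the paper correlates with actual error to assess model confidence.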
Results
The proposed metamodeling frameworks were validated on two case studies, achieving low prediction errors. The MLP-LSTM framework provided the most accurate results for the simpler Bouc–Wen system, while the MPNN-LSTM and AE-LSTM frameworks excelled in the more complex steel-frame model. A consistent correlation between predictive variance and actual error was observed, supporting the frameworks' utility in active-learning strategies.
Implications
The findings suggest that the proposed metamodeling frameworks can significantly improve the reliability and efficiency of structural response predictions in engineering applications, particularly in the context of natural hazards. This work opens avenues for further research in uncertainty quantification and active learning in structural engineering and related fields.
Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors
Optimization
Efficient ML
Theory
- Introduces a learned prior framework that eliminates the need for hyperparameter tuning in yield analysis.
- Achieves state-of-the-art accuracy with mean relative errors as low as 0.11%.
- Reduces total validation costs more than tenfold compared to traditional methods.
- Demonstrates effective cross-corner knowledge transfer through an attention mechanism.
Summary
This paper addresses the challenges of Yield Multi-Corner Analysis (YMCA) in integrated circuit design, where circuits must be validated across multiple Process-Voltage-Temperature (PVT) corners, leading to significant computational costs. Existing methods struggle with a trade-off between automation and the ability to model complex, nonlinear behaviors, often requiring extensive hyperparameter tuning. The authors propose a novel approach that utilizes learned priors from a foundation model pre-trained on millions of regression tasks, effectively breaking the 'Tuning Barrier' by enabling in-context learning without the need for hyperparameter optimization. This method leverages an attention mechanism to transfer knowledge across corners, enhancing efficiency and accuracy. The proposed framework includes an automated feature selection process that reduces dimensionality from 1152D to 48D, achieving state-of-the-art accuracy with mean relative errors as low as 0.11% while reducing validation costs by over 10 times. The results demonstrate significant improvements in yield prediction accuracy and robustness across various corners, making the method suitable for industrial applications.
Methodology
The authors replace engineered priors with learned priors from a foundation model, specifically using TabPFN, which performs in-context Bayesian inference without hyperparameter tuning. The framework includes an automated feature selection process that compresses circuit data dimensions and employs a global surrogate model to jointly predict yield across multiple corners. Active learning is utilized to focus simulations on critical decision boundaries.
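The active-learning step can be illustrated with a generic acquisition rule: simulate next the candidate whose predicted pass probability sits closest to the decision boundary, scaled by the surrogate's uncertainty. This pure-Python sketch uses a hypothetical logistic surrogate as a stand-in for the TabPFN prior; nothing here is the paper's actual implementation.

```python
import math

def acquire_next(candidates, predict):
    """Pick the candidate nearest the pass/fail boundary (mu == 0.5),
    measured in units of the surrogate's predictive sigma."""
    scored = [(abs(mu - 0.5) / (sigma + 1e-9), i)
              for i, (mu, sigma) in enumerate(predict(candidates))]
    return min(scored)[1]

# hypothetical surrogate: mean from a logistic of the first feature,
# constant predictive sigma (a stand-in for the learned prior)
predict = lambda X: [(1 / (1 + math.exp(-x[0])), 0.1) for x in X]
idx = acquire_next([[3.0], [0.1], [-2.0]], predict)  # point nearest the boundary
```

Here the middle candidate wins because its predicted yield is closest to 0.5, so a simulation there is most informative about the boundary.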
Results
The proposed method achieves a mean relative error as low as 0.11% in yield predictions, significantly improving accuracy and reducing computational costs by over 10 times compared to existing methods. Ablation studies indicate over 70% error reduction on challenging corners due to the learned prior.
Implications
This approach has the potential to revolutionize yield analysis in integrated circuit design, making advanced AI methods more accessible for industrial applications by eliminating the need for extensive tuning and improving efficiency in multi-corner analysis.
A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization
Time Series
- Introduces a multi-label classification framework for predicting TF binding sites.
- Utilizes Temporal Convolutional Networks (TCNs) for improved performance over traditional methods.
- Demonstrates the ability to capture correlations among multiple TFs and their cooperative mechanisms.
- Reveals biologically meaningful motifs and novel TF interactions.
Summary
This paper addresses the challenge of transcription factor (TF) binding site prediction by framing it as a multi-label classification problem. Traditional methods have predominantly focused on single TFs and binary classification, neglecting the intricate interactions among multiple TFs. The authors propose a deep learning approach utilizing Temporal Convolutional Networks (TCNs) to predict binding profiles for multiple TFs simultaneously. TCNs offer advantages over recurrent neural networks (RNNs) and attention-based models: they capture long-range dependencies, compute in parallel, and require less training data. The study leverages DNA sequences from public repositories to train the models and applies explainability techniques to uncover biologically relevant motifs and interactions among TFs. The results indicate that the multi-label approach not only enhances predictive performance but also reveals novel insights into TF cooperation and regulatory mechanisms, potentially guiding future biological investigations.
Methodology
The authors employed Temporal Convolutional Networks (TCNs) to model the multi-label classification of TF binding sites. They created datasets from raw ChIP-seq data and benchmarked their algorithms against existing datasets. The study also incorporated explainability methods to analyze the learned features and interactions among TFs.
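The core building block of a TCN is the causal dilated convolution, which lets the receptive field grow exponentially with depth while never looking at downstream bases. A minimal pure-Python sketch of one such layer (the multi-label part would simply be an independent sigmoid output per TF; values here are illustrative):

```python
def causal_dilated_conv(x, w, dilation):
    """1-D causal dilated convolution: output[t] depends only on
    x[t], x[t-d], x[t-2d], ... so no future positions leak backward."""
    def tap(t, j):
        i = t - j * dilation
        return x[i] if i >= 0 else 0.0   # implicit left zero-padding
    return [sum(w[j] * tap(t, j) for j in range(len(w)))
            for t in range(len(x))]

# kernel [1, 1] with dilation 2 computes y[t] = x[t] + x[t-2]
y = causal_dilated_conv([0., 1., 2., 3., 4., 5.], [1.0, 1.0], dilation=2)
```

Stacking such layers with dilations 1, 2, 4, ... is what gives TCNs their long-range, parallelizable receptive field.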
Results
The TCN-based models achieved reliable predictions for multiple TF binding profiles, outperforming traditional binary classification approaches. The analysis revealed biologically significant motifs and co-binding patterns consistent with known TF interactions, while also suggesting new relationships among TFs.
Implications
The findings have significant implications for understanding the cooperative nature of transcriptional regulation in eukaryotic organisms. The proposed framework can refine experimental approaches in molecular biology and enhance the prediction of TF interactions, potentially leading to new discoveries in gene regulation.
A Learning-Based Superposition Operator for Non-Renewal Arrival Processes in Queueing Networks
Theory
Efficient ML
Optimization
- Introduces a scalable, data-driven superposition operator for non-renewal arrival processes.
- Utilizes deep learning to accurately reconstruct statistical descriptors of merged arrival streams.
- Demonstrates significant performance improvements over classical renewal-based methods.
- Enables decomposition-based analysis of queueing networks with merging flows.
Summary
This paper addresses the challenge of superposing non-renewal arrival processes in queueing networks, a task that is analytically intractable with traditional methods. The author proposes a novel data-driven superposition operator that utilizes deep learning to map low-order moments and autocorrelation descriptors of multiple arrival streams to those of their merged process. The operator is trained on synthetically generated Markovian Arrival Processes (MAPs), allowing it to learn a compact representation that accurately reconstructs the first five moments and short-range dependence structure of the aggregate stream. The results from extensive computational experiments show that this approach significantly outperforms classical renewal-based approximations, demonstrating uniformly low prediction errors across various variability and correlation regimes. Furthermore, when integrated with learning-based modules for departure-process and steady-state analysis, the proposed operator facilitates decomposition-based evaluation of feed-forward queueing networks with merging flows. This framework offers a scalable alternative to traditional analytical methods while retaining crucial higher-order variability and dependence information necessary for accurate distributional performance analysis.
Methodology
The methodology involves training a deep learning model on a diverse set of synthetically generated Markovian Arrival Processes (MAPs). The model learns to approximate the first five moments and short-range dependence of the superposed arrival processes based on the low-order moments and autocorrelation coefficients of the input streams. The training data consists of input-output pairs derived from the exact superposition of MAPs, allowing the model to generalize beyond MAP representations during inference.
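For contrast with the learned operator, one classical renewal-style baseline can be written down directly: the merged rate is the sum of the input rates, and the merged squared coefficient of variation (SCV) a rate-weighted average. This sketch is only an illustration of the low-dimensional summaries such baselines use (and of the cross-stream dependence they discard); it is not the paper's operator.

```python
def renewal_superposition(rates, scvs):
    """Renewal-style approximation of a superposed stream: sum the rates,
    rate-weight the SCVs. Autocorrelation structure is thrown away --
    exactly the information the learned operator retains."""
    lam = sum(rates)
    scv = sum(r / lam * c for r, c in zip(rates, scvs))
    return lam, scv

# merging a low-variability and a bursty stream
lam, scv = renewal_superposition([2.0, 6.0], [1.0, 3.0])
```

The learned operator instead maps the first five moments and lag autocorrelations of each input stream to those of the aggregate.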
Results
The proposed superposition operator achieved uniformly low prediction errors across various scenarios, outperforming classical methods that rely on low-dimensional variability summaries. The integration of this operator with learning-based modules for departure and steady-state analysis demonstrated its effectiveness in supporting accurate distributional performance analysis in queueing networks.
Implications
The findings suggest that the learning-based superposition operator can significantly enhance the analysis of queueing networks, particularly in complex systems with multiple independent traffic streams. This approach could lead to more reliable performance predictions in various applications, including manufacturing systems, communication networks, and service systems.
RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction
Large Language Models
Reinforcement Learning
Generative Models
- RetroReasoner incorporates a stepwise reasoning process that aligns with chemists' strategies for retrosynthesis.
- The model is trained using a novel framework, SyntheticRetro, which generates structured reasoning text.
- RetroReasoner employs reinforcement learning with round-trip accuracy as a reward to enhance prediction feasibility.
- Experimental results indicate significant performance improvements over existing retrosynthesis prediction models.
Summary
The paper introduces RetroReasoner, a novel large language model (LLM) designed for retrosynthesis prediction, which is the process of predicting reactants for a given product molecule in organic synthesis. Traditional methods rely heavily on chemists' expertise and are time-consuming, often lacking strategic reasoning in bond disconnection. RetroReasoner addresses these limitations by incorporating a structured reasoning process that mimics chemists' strategic thinking. The model is trained using a two-stage approach: supervised fine-tuning (SFT) and reinforcement learning (RL). The SFT phase utilizes a framework called SyntheticRetro to generate structured disconnection rationales alongside reactant predictions. In the RL phase, a round-trip accuracy reward is implemented, where the predicted reactants are validated against a forward synthesis model to ensure they can regenerate the original product. Experimental results demonstrate that RetroReasoner outperforms existing models, providing a wider range of feasible reactant proposals, particularly for complex reaction instances. The findings highlight the effectiveness of strategic reasoning in retrosynthesis prediction, showcasing the potential of LLMs in chemistry.
Methodology
RetroReasoner is developed through a two-stage training process: first, supervised fine-tuning (SFT) using SyntheticRetro to generate structured reasoning data, followed by reinforcement learning (RL) that optimizes the model based on round-trip accuracy rewards. This approach allows the model to learn strategic bond disconnections and validate reactant predictions effectively.
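The round-trip reward at the heart of the RL phase is conceptually simple: a prediction scores 1 only if a forward-synthesis model maps the proposed reactants back to the original product. A minimal sketch, with a toy string-based stand-in for a real forward reaction predictor (the real reward operates on molecules, not strings):

```python
def round_trip_reward(product, predicted_reactants, forward_model):
    """Binary round-trip reward: 1.0 if the forward model regenerates
    the original product from the predicted reactants, else 0.0."""
    return 1.0 if forward_model(predicted_reactants) == product else 0.0

# hypothetical forward model: sorted concatenation stands in for a
# learned forward-synthesis predictor
forward_model = lambda reactants: "+".join(sorted(reactants))
r = round_trip_reward("A+B", ["B", "A"], forward_model)
```

This reward needs no labeled disconnection rationale, which is what makes it usable for reinforcement learning on top of the supervised fine-tuning stage.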
Results
The experimental evaluation shows that RetroReasoner outperforms prior baselines in terms of accuracy and the diversity of reactant proposals. It demonstrates superior performance on both in-distribution and challenging datasets, particularly those involving rare reaction types and complex molecular structures.
Implications
The advancements presented in RetroReasoner could significantly enhance the efficiency and accuracy of retrosynthesis prediction in organic chemistry, potentially reducing the time and expertise required for chemical synthesis. This model could also pave the way for further applications of LLMs in other areas of chemistry and related fields.
A Stable Neural Statistical Dependence Estimator for Autoencoder Feature Analysis
Theory
Efficient ML
Generative Models
- Introduces a stable neural dependence estimator for analyzing autoencoders.
- Avoids input concatenation and re-pairing, improving computational efficiency.
- Demonstrates that Gaussian noise assumptions enable meaningful statistical dependence measurements.
- Proposes a scalar objective based on NMF for enhanced stability.
Summary
This paper addresses the challenges of applying statistical dependence measures, particularly mutual information, to analyze autoencoders, which are often deterministic and noise-free. The authors propose a stable neural dependence estimator based on an orthonormal density-ratio decomposition that avoids the pitfalls of existing methods like MINE, which can be computationally expensive and unstable. By adopting a variational Gaussian framework, the authors demonstrate that statistical dependence among inputs, latent variables, and reconstructions can be effectively measured. They introduce a new scalar objective inspired by Nonnegative Matrix Factorization (NMF) to enhance stability and efficiency. The empirical results indicate that by assuming Gaussian noise, meaningful dependence measurements can be achieved, facilitating quantitative feature analysis and learning. The paper concludes that while the approach requires some compromise regarding noise assumptions, it effectively supports feature learning and analysis in autoencoder settings.
Methodology
The authors utilize an orthonormal density-ratio decomposition to estimate statistical dependence without the need for input concatenation or re-pairing of samples. They introduce a new scalar cost function inspired by Nonnegative Matrix Factorization (NMF) to improve the stability and efficiency of the estimator. The approach is validated through empirical experiments that leverage Gaussian noise assumptions to construct auxiliary variables for dependence measurement.
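The role of the Gaussian noise assumption can be illustrated with a toy dependence score: adding noise makes the joint density of a deterministic map well defined, after which dependence between input and output can be measured. This numpy sketch uses a squared-correlation proxy, not the paper's orthonormal density-ratio estimator; it only demonstrates the auxiliary-variable idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_dependence(x, y, sigma=0.1):
    """Perturb both signals with Gaussian noise (the auxiliary-variable
    trick), then score dependence with a squared-correlation proxy."""
    xn = x + rng.normal(scale=sigma, size=x.shape)
    yn = y + rng.normal(scale=sigma, size=y.shape)
    return float(np.corrcoef(xn, yn)[0, 1] ** 2)

x = rng.normal(size=2000)
dep_good = noisy_dependence(x, 2 * x)                  # output tracks input
dep_bad = noisy_dependence(x, rng.normal(size=2000))   # unrelated output
```

A reconstruction that preserves the input yields a high score; an unrelated output yields a score near zero, which is the behavior expected of good autoencoder features.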
Results
The main experimental findings reveal that by introducing an auxiliary variable constructed from the original data with added Gaussian noise, the authors can effectively measure statistical dependence between the inputs and outputs of the autoencoder. The results indicate that good features should maintain dependence when the input is replaced, demonstrating the effectiveness of the proposed method for quantitative analysis.
Implications
The proposed estimator can significantly enhance the analysis of autoencoder features, making it easier to understand the relationships between inputs, latent representations, and outputs. This has potential applications in various domains where autoencoders are used, such as data compression, anomaly detection, and feature extraction in machine learning.
Duration Aware Scheduling for ASR Serving Under Workload Drift
Audio & Speech
Optimization
Efficient ML
- Duration-aware scheduling significantly improves end-to-end latency in ASR systems.
- Shortest Job First (SJF) can reduce median latency by up to 73%, but may cause increased tail latency.
- Highest Response Ratio Next (HRRN) balances latency reduction and tail latency control effectively.
- Both scheduling algorithms incur less than 0.1 ms overhead per request.
Summary
This paper addresses the inefficiencies of first-come-first-served (FCFS) scheduling in Automatic Speech Recognition (ASR) systems, particularly under variable workloads. The authors demonstrate that audio duration serves as a reliable predictor of job processing time in ASR models like Whisper. They propose a duration-aware scheduling approach by integrating two classical algorithms, Shortest Job First (SJF) and Highest Response Ratio Next (HRRN), into the vLLM engine. The study evaluates these algorithms under realistic workloads, revealing that SJF can reduce median end-to-end latency by up to 73% at high loads, although it may increase tail latency due to starvation of longer requests. HRRN mitigates this issue by balancing latency improvements with tail latency degradation, achieving up to 28% reduction in median latency while limiting tail latency increase to 24%. The proposed methods show consistent performance improvements across different workloads without incurring significant overhead, highlighting their potential for enhancing ASR responsiveness in practical applications.
Methodology
The authors integrated SJF and HRRN scheduling algorithms into the vLLM engine, leveraging the correlation between audio duration and job processing time to estimate job lengths. They conducted evaluations using the LibriSpeech dataset and a synthetic workload to assess performance under varying conditions.
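The two policies differ in one line of selection logic, which also explains the starvation trade-off. A minimal pure-Python sketch with audio duration as the job-length estimate (request values are illustrative, not from the evaluation):

```python
def pick_sjf(queue):
    """Shortest Job First: serve the request with the smallest estimated
    service time (here, proportional to audio duration)."""
    return min(queue, key=lambda r: r["duration"])

def pick_hrrn(queue, now):
    """Highest Response Ratio Next: ratio = (wait + service) / service,
    so long-waiting requests eventually outrank short newcomers."""
    ratio = lambda r: (now - r["arrival"] + r["duration"]) / r["duration"]
    return max(queue, key=ratio)

queue = [{"id": "a", "arrival": 0.0, "duration": 10.0},   # long, waiting
         {"id": "b", "arrival": 99.0, "duration": 2.0}]   # short, just arrived
sjf_choice = pick_sjf(queue)["id"]            # "b": shortest audio wins
hrrn_choice = pick_hrrn(queue, now=100.0)["id"]  # "a": long wait wins
```

SJF always favors the short request, which is where the tail-latency starvation comes from; HRRN's response ratio lets the long-waiting request overtake it.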
Results
The study found that SJF reduced median end-to-end latency by up to 73% and median time to first token by up to 93% at high workloads, while HRRN provided a more balanced approach with a 28% reduction in median latency and a maximum 24% increase in tail latency. Both methods maintained throughput and had minimal scheduling overhead.
Implications
The findings suggest that adopting duration-aware scheduling can significantly enhance the responsiveness of ASR systems, making them more efficient in real-time applications such as voice assistants and real-time captioning. This approach could lead to improved user satisfaction and system performance in various interactive scenarios.
Retrieval-Enhanced Real Estate Appraisal
Efficient ML
Interpretability
- Introduces a new comparable selection framework based on retrieval-enhanced machine learning (REML).
- Demonstrates that learning to select comparables yields higher-quality comparables compared to traditional methods.
- Achieves similar performance with up to 22 times fewer parameters than state-of-the-art models.
- Enhances model explainability and confidence for decision-makers by simplifying the examination of retrieved properties.
Summary
This paper presents a novel approach to real estate appraisal by enhancing the Sales Comparison Approach (SCA) through a retrieval-enhanced machine learning (REML) framework. The authors argue that the selection of comparable properties, which is crucial for accurate appraisals, can be significantly improved by learning a selection policy rather than relying on traditional heuristics. Their method combines a hybrid vector-geographical retrieval module with an estimation module, allowing for adaptability across various datasets. The study demonstrates that using carefully selected comparables enables the development of models that require fewer comparables and parameters while maintaining performance comparable to state-of-the-art models. Evaluations were conducted on five datasets from the United States, Brazil, and France, showcasing the effectiveness of the proposed method in improving model explainability and reducing complexity.
Methodology
The authors developed a hybrid vector-geographical retrieval module that learns to select comparables based on geographic and other features. This module is optimized jointly with an estimation module, allowing for improved adaptability and performance across different datasets.
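A hybrid vector-geographical retrieval score can be sketched as a weighted blend of feature-space similarity and geographic proximity. In the paper this selection policy is learned jointly with the estimation module; in the sketch below the blend weight `alpha` is fixed and all property values are invented for illustration.

```python
import math

def comparable_scores(subject, candidates, alpha=0.5):
    """Hybrid retrieval: blend negative feature distance with negative
    geographic distance; higher score = better comparable."""
    scores = []
    for cand in candidates:
        feat = -math.dist(subject["x"], cand["x"])      # attribute similarity
        geo = -math.dist(subject["loc"], cand["loc"])   # spatial proximity
        scores.append(alpha * feat + (1 - alpha) * geo)
    return scores

subject = {"x": [3.0, 2.0], "loc": [0.0, 0.0]}
candidates = [{"x": [3.0, 2.0], "loc": [0.1, 0.0]},   # similar and nearby
              {"x": [9.0, 9.0], "loc": [5.0, 5.0]}]   # dissimilar and far
scores = comparable_scores(subject, candidates)
best = scores.index(max(scores))
```

Because the retrieved comparables are explicit, a decision-maker can inspect exactly which properties drove the appraisal, which is the explainability benefit the paper emphasizes.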
Results
The proposed method showed that learning-based selection of comparables resulted in models that required significantly fewer comparables and parameters while achieving performance levels comparable to existing state-of-the-art models. The evaluations indicated enhanced explainability and usability for real estate appraisal tasks.
Implications
The findings suggest that integrating learned selection policies in real estate appraisal can lead to more efficient and interpretable models, potentially transforming practices in real estate valuation and decision-making processes for financial institutions and the general public.
CAETC: Causal Autoencoding and Treatment Conditioning for Counterfactual Estimation over Time
Time Series
Theory
Optimization
- CAETC addresses time-dependent confounding bias in counterfactual estimation.
- The method is model-agnostic and can be applied to various sequence architectures.
- An entropy maximization adversarial game is introduced to ensure balanced representations.
- CAETC shows significant improvements over existing counterfactual estimation methods.
Summary
The paper introduces CAETC, a novel method for counterfactual estimation over time, addressing the challenges posed by time-dependent confounding bias in observational data. The authors highlight the importance of accurate counterfactual estimation in fields like personalized medicine, where randomized controlled trials are often impractical. CAETC employs an autoencoding architecture to create a partially invertible and treatment-invariant representation, allowing for effective treatment-specific conditioning during outcome prediction. This method is model-agnostic, meaning it can be integrated with various sequence models, including LSTMs and temporal convolution networks. The authors propose an entropy maximization adversarial game to ensure balanced representation across treatment regimes, which theoretically bounds the outcome estimation error. Extensive experiments on synthetic, semi-synthetic, and real-world datasets demonstrate that CAETC significantly outperforms existing counterfactual estimation methods, showcasing its potential for improving individualized decision-making in healthcare and beyond.
Methodology
CAETC utilizes a causal autoencoding framework to learn a treatment-invariant representation while applying treatment-specific conditioning for outcome prediction. The method incorporates an entropy maximization adversarial game to balance representations across treatment regimes, ensuring robustness against confounding biases.
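The treatment-specific conditioning step can be reduced to its essence: one outcome head per treatment applied to a shared, ideally treatment-invariant, representation of the history. A deliberately minimal sketch (linear heads and invented numbers stand in for the sequence model and adversarial balancing):

```python
def predict_counterfactuals(rep, heads):
    """Apply one outcome head per treatment to a shared representation;
    the difference between heads is an individual treatment effect."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return {a: dot(rep, w) for a, w in heads.items()}

rep = [0.5, -1.0, 2.0]   # stands in for the (balanced) encoder output
heads = {"treated": [1.0, 0.0, 1.0], "control": [0.0, 1.0, 0.0]}
cf = predict_counterfactuals(rep, heads)
effect = cf["treated"] - cf["control"]   # individual treatment effect
```

The paper's contribution is in making `rep` balanced across treatment regimes (via the entropy-maximization adversarial game) while keeping it informative enough to reconstruct the history.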
Results
The empirical validation of CAETC on various datasets indicates substantial improvements in counterfactual estimation accuracy compared to existing baseline methods, demonstrating its effectiveness in handling time-dependent confounding.
Implications
The findings suggest that CAETC can enhance personalized treatment planning in healthcare by providing more accurate counterfactual estimations, which can lead to better decision-making processes. The method's flexibility also allows for broader applications in other fields requiring causal inference over time.
Learning Pore-scale Multiphase Flow from 4D Velocimetry
Graph Learning
Multimodal
Time Series
- Introduces a multimodal learning framework for pore-scale multiphase flow prediction.
- Combines graph network simulation with 3D U-Net architecture for enhanced accuracy.
- Achieves significant reduction in computational time for predictions, enabling real-time applications.
- Captures complex flow dynamics and interface evolution effectively, including transient phenomena.
Summary
This paper presents a novel multimodal learning framework designed to infer multiphase pore-scale flow dynamics from time-resolved four-dimensional (4D) micro-velocimetry measurements. The authors address the challenges of characterizing and predicting pore-scale dynamics in porous media, which are crucial for subsurface energy and environmental technologies such as geological CO2 storage and underground hydrogen storage. The proposed model integrates a graph network simulator for Lagrangian tracer-particle motion with a 3D U-Net for voxelized interface evolution. This architecture allows for the incorporation of imaged pore geometry as a boundary constraint, enabling iterative updates of flow velocity and multiphase interface predictions at each time step. The model is trained autoregressively on experimental sequences under capillary-dominated conditions, effectively capturing transient flow perturbations and abrupt interface rearrangements, known as Haines jumps. The framework significantly reduces the computational time for predictions from hours or days to mere seconds, facilitating rapid, experimentally informed predictions. This advancement opens avenues for 'digital experiments' that can replicate pore-scale physics observed in multiphase flow experiments, thereby providing an efficient tool for exploring various injection conditions and pore-geometry effects relevant to subsurface storage applications.
Methodology
The methodology involves a multimodal architecture that integrates a graph neural network simulator for Lagrangian tracer-particle motion and a 3D U-Net for voxelized interface evolution. The model uses time-resolved 4D micro-velocimetry data as input, incorporating pore geometry as a boundary constraint and updating flow predictions iteratively at each time step. The model is trained autoregressively on experimental data under capillary-dominated conditions.
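The autoregressive training and inference loop has a simple skeleton: each predicted state is fed back as the next input. The sketch below uses a placeholder decay update where the real model would apply the graph-network and U-Net modules under the pore-geometry constraint.

```python
def rollout(state, step, n_steps):
    """Autoregressive rollout: feed each prediction back as the next
    input, mirroring how the model is trained on experimental sequences."""
    states = [state]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

# placeholder 'model': exponential decay stands in for the learned
# graph-network / 3D U-Net update applied at each time step
step = lambda s: 0.5 * s
traj = rollout(8.0, step, n_steps=3)
```

Because inference is just this forward loop, a full prediction takes seconds rather than the hours or days of a direct numerical simulation.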
Results
The model demonstrates high accuracy in predicting particle trajectories and multiphase interface dynamics, achieving R² values of 0.99994 and 0.99992 on test sets. It effectively reconstructs spatiotemporal flow structures and captures the dynamics of Haines jumps, outperforming traditional methods that rely solely on surface history.
Implications
The framework has significant implications for improving the efficiency and safety of subsurface energy technologies, such as geological carbon sequestration and hydrogen storage. It enables rapid exploration of various operational conditions and pore geometries, potentially leading to optimized designs and enhanced understanding of multiphase flow phenomena.
Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers
Federated Learning
- Introduces Fed-k∗-HC, a federated clustering framework that automatically determines the optimal number of clusters.
- Addresses the issue of imbalanced cluster distributions in federated learning scenarios.
- Utilizes a hierarchical merging process to explore clusters of varying sizes and shapes.
- Demonstrates improved clustering performance through extensive experiments on diverse datasets.
Summary
This paper presents a novel framework for Federated Clustering (FC) called Fed-k∗-HC, which addresses the challenges of imbalanced cluster distributions and the unknown number of clusters in federated learning environments. Traditional FC methods often assume uniform cluster sizes and a predetermined number of clusters, which is rarely the case in real-world scenarios. The proposed method allows clients to generate micro-subclusters that capture the local data distribution more accurately. These micro-subclusters are then uploaded to a central server, where a hierarchical merging process occurs to determine the optimal number of clusters (k∗) based on the density and relationships among the prototypes. This approach not only mitigates the 'uniform effect' but also enhances the robustness of clustering in the presence of privacy constraints inherent in federated learning. The authors demonstrate the effectiveness of Fed-k∗-HC through extensive experiments on various datasets, showing its capability to accurately identify the proper number of clusters while maintaining privacy.
Methodology
The methodology involves generating micro-subclusters on client devices, which represent local data distributions. These micro-subclusters are uploaded to a central server, where a hierarchical merging process is applied to identify the optimal number of clusters based on the density of the prototypes. The merging process is designed to self-terminate based on the relationships among the prototypes, allowing for the exploration of clusters with varying sizes and shapes.
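The self-terminating merge can be sketched with a greedy distance-threshold rule: repeatedly fuse the closest pair of uploaded prototypes and stop once no pair is close enough, so the surviving count is the estimated k∗. The paper's criterion is density-based; the plain Euclidean threshold below is a simplified stand-in, and the prototype coordinates are invented.

```python
import math

def merge_prototypes(protos, threshold):
    """Greedy hierarchical merging of client micro-subcluster prototypes.
    Fuse the closest pair until the nearest pair exceeds `threshold`;
    the loop self-terminates at an estimated number of clusters."""
    protos = [list(p) for p in protos]
    while len(protos) > 1:
        d, i, j = min((math.dist(protos[i], protos[j]), i, j)
                      for i in range(len(protos))
                      for j in range(i + 1, len(protos)))
        if d > threshold:
            break
        merged = [(a + b) / 2 for a, b in zip(protos[i], protos[j])]
        protos = [p for k, p in enumerate(protos) if k not in (i, j)] + [merged]
    return protos

# two tight groups of micro-subclusters across clients -> k* = 2
k_star = len(merge_prototypes([[0.0, 0.0], [0.2, 0.0],
                               [5.0, 5.0], [5.1, 5.0]], threshold=1.0))
```

Only the prototypes, not raw client data, reach the server, which is how the scheme respects the privacy constraints of federated learning.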
Results
The experimental results indicate that Fed-k∗-HC outperforms existing federated clustering methods in accurately determining the number of clusters and handling imbalanced data distributions. The framework effectively captures the complexities of real-world data while adhering to privacy-preserving constraints.
Implications
The findings suggest that Fed-k∗-HC can be applied in various domains where federated learning is relevant, such as healthcare, finance, and industrial applications, enabling better data analysis while preserving user privacy. The automatic determination of cluster numbers can enhance the usability of federated clustering in practical scenarios.
Bridging Discrete Marks and Continuous Dynamics: Dual-Path Cross-Interaction for Marked Temporal Point Processes
Time Series
- NEXTPP integrates discrete event marks and continuous dynamics through a dual-channel architecture.
- The model employs self-attention for discrete encoding and Neural ODE for continuous evolution.
- A cross-attention mechanism allows for bidirectional interaction between discrete and continuous representations.
- Extensive evaluations show superior performance compared to existing models on real-world datasets.
Read more
Bridging Discrete Marks and Continuous Dynamics: Dual-Path Cross-Interaction for Marked Temporal Point Processes
Summary
The paper addresses the challenges of predicting irregularly spaced event sequences with discrete marks, which involve complex dependencies in continuous-time data streams. Existing models focus either on discrete event dependencies or on continuous dynamics, failing to capture the interplay between the two. To bridge this gap, the authors propose NEXTPP, a dual-channel framework that integrates discrete and continuous representations through Event-granular Neural Evolution with Cross-Interaction for Marked Temporal Point Processes. NEXTPP employs a self-attention mechanism to encode discrete event marks and utilizes a Neural Ordinary Differential Equation (Neural ODE) to evolve a latent continuous-time state. A cross-attention module fuses these streams, enabling bidirectional interaction that informs both future timing and event mark generation. The model drives the conditional intensity function of a neural Hawkes process and uses an iterative thinning sampler for event generation. The authors demonstrate that NEXTPP consistently outperforms state-of-the-art models across five real-world datasets, showcasing its effectiveness in capturing complex temporal dependencies while maintaining interpretability.
Methodology
The methodology involves a dual-path architecture where one channel uses self-attention to encode discrete event marks, while the other employs a Neural ODE to evolve a continuous-time state. A cross-attention module fuses these two streams, allowing for mutual influence between discrete and continuous representations. The model's output drives the conditional intensity function of a neural Hawkes process, and an iterative thinning sampler is used to generate future events.
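The final step, iterative thinning, can be illustrated with a toy intensity. The exponentially decaying excitation below is a stand-in for NEXTPP's learned conditional intensity; only the thinning loop itself reflects the sampler described above:

```python
import math
import random

def sample_next_event(intensity, t0, lam_max, rng, horizon=100.0):
    """Ogata-style thinning: propose arrival times from a homogeneous
    Poisson process with rate `lam_max` (an upper bound on the
    intensity) and accept each proposal with probability
    intensity(t) / lam_max."""
    t = t0
    while t < horizon:
        t += rng.expovariate(lam_max)
        if rng.random() * lam_max <= intensity(t):
            return t
    return None  # no event before the horizon

# Toy intensity: baseline plus exponentially decaying excitation
# from an event at t = 0 (illustrative, not the learned model).
mu, alpha, beta = 0.5, 1.0, 2.0
intensity = lambda t: mu + alpha * math.exp(-beta * t)
t_next = sample_next_event(intensity, 0.0, lam_max=mu + alpha,
                           rng=random.Random(0))
```

The sampler is exact as long as `lam_max` truly upper-bounds the intensity, which is why thinning pairs naturally with bounded neural Hawkes intensities.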
Results
The results indicate that NEXTPP consistently outperforms state-of-the-art models in terms of prediction accuracy and interpretability across five real-world datasets, demonstrating its capability to effectively model the complex interactions between discrete and continuous event dynamics.
Implications
The proposed framework has potential applications in various domains such as social network analysis, healthcare monitoring, e-commerce user behavior prediction, and seismic activity forecasting, where understanding the interplay between discrete events and their timing is crucial.
Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
Reinforcement Learning
Robotics
Optimization
- Introduction of cross-domain Bellman consistency to measure model transferability.
- Development of the QAvatar framework for effective knowledge transfer between domains with distinct state and action spaces.
- Establishment of convergence properties for the QAvatar algorithm.
- Demonstration of QAvatar's superior performance on various reinforcement learning benchmarks.
Read more
Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
Summary
This paper addresses the challenges of cross-domain reinforcement learning (CDRL), which aims to enhance data efficiency by transferring knowledge from a source domain to a target domain with potentially distinct state and action spaces. The authors identify two primary challenges: the difficulty of direct transfer due to differing state-action representations and the uncertainty regarding the transferability of source-domain models. To tackle these issues, the paper introduces the concept of cross-domain Bellman consistency to measure the transferability of source models. The proposed framework, QAvatar, integrates Q functions from both domains using a hyperparameter-free adaptive weight function. This design allows for effective knowledge transfer while ensuring convergence. Experimental results demonstrate that QAvatar outperforms existing CDRL benchmarks across various tasks, including locomotion and robot arm manipulation, showcasing its potential for improving sample efficiency in reinforcement learning.
Methodology
The authors propose a novel CDRL framework called QAvatar, which combines Q functions from both source and target domains. The framework utilizes cross-domain Bellman consistency to quantify transferability and employs a weighted combination of Q functions to update the target-domain policy. A tabular prototype of QAvatar is first developed, followed by a practical implementation that incorporates normalizing flow-based mapping for state-action correspondence learning.
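The core weighting idea can be sketched for a single state–action pair. The `1 / (1 + residual)` schedule below is a hypothetical stand-in; QAvatar's actual hyperparameter-free adaptive weight function is derived in the paper:

```python
def combined_q(q_src, q_tgt, bellman_residual):
    """Blend a mapped source-domain Q estimate with the target-domain Q.
    The weight on the source decays as its cross-domain Bellman
    residual grows, so an inconsistent source model is trusted less."""
    w = 1.0 / (1.0 + bellman_residual)
    return w * q_src + (1.0 - w) * q_tgt
```

A perfectly consistent source model (zero residual) is trusted fully, while a source model with a large residual contributes almost nothing, so the update degrades gracefully to ordinary target-domain learning.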
Results
QAvatar achieves favorable transferability and improved sample efficiency across various reinforcement learning tasks, outperforming existing CDRL benchmark algorithms. The experiments validate the effectiveness of the proposed framework in facilitating knowledge transfer between domains with distinct state and action spaces.
Implications
The findings suggest that QAvatar can significantly enhance the efficiency of reinforcement learning in scenarios where data collection is costly or limited, such as robotics and simulation environments. This framework could be applied to various domains requiring knowledge transfer, potentially leading to advancements in robotic control and other applications where cross-domain learning is beneficial.
Probing Length Generalization in Mamba via Image Reconstruction
Computer Vision
NLP
Efficient ML
- Mamba's performance degrades on sequences longer than those encountered during training.
- The study uses image reconstruction tasks to probe Mamba's length generalization capabilities.
- A length-adaptive variant of Mamba is introduced, improving performance on varying sequence lengths.
- The research highlights the importance of understanding internal processing mechanisms in sequence models.
Read more
Probing Length Generalization in Mamba via Image Reconstruction
Summary
This paper investigates the length generalization capabilities of the Mamba sequence model, particularly in the context of image reconstruction tasks. Mamba, known for its low computational complexity and competitive performance compared to transformers, exhibits performance degradation when processing sequences longer than those seen during training. The authors conduct a controlled study using the Omniglot dataset, where Mamba reconstructs images from sequences of image patches of varying lengths. Through this analysis, they reveal that Mamba adapts its processing strategies based on the training sequence lengths, leading to poor generalization for longer sequences. To address this issue, the authors propose a length-adaptive variant of Mamba, which demonstrates improved performance across different training sequence lengths. The findings provide insights into the internal dynamics of Mamba and suggest architectural modifications to enhance length generalization.
Methodology
The authors employed a controlled vision task where Mamba reconstructs images from sequences of image patches. They analyzed the model's behavior at different stages of sequence processing and compared its performance with a transformer baseline. The study involved generating interpretable visualizations to understand Mamba's internal dynamics during the reconstruction process.
Results
The analysis revealed that Mamba qualitatively adapts its processing strategies based on training sequence lengths, resulting in limited generalization to longer sequences. The introduction of a length-adaptive variant led to significant improvements in reconstruction performance across different sequence lengths.
Implications
The findings suggest that enhancing length generalization in sequence models like Mamba could lead to broader applications in fields such as computer vision and natural language processing, where varying input lengths are common. The proposed architectural modifications may inform future designs of sequence models to improve their robustness and adaptability.
Chem4DLLM: 4D Multimodal LLMs for Chemical Dynamics Understanding
NLP
Large Language Models
Multimodal
- Introduction of Chemical Dynamics Understanding (ChemDU) to model dynamic chemical phenomena.
- Development of Chem4DBench, the first dataset linking 4D molecular trajectories with natural language explanations.
- Proposal of Chem4DLLM, a model that combines graph encoding with large language models for enhanced molecular understanding.
- Focus on generating coherent narratives that describe chemical events, improving interpretability of dynamic processes.
Read more
Chem4DLLM: 4D Multimodal LLMs for Chemical Dynamics Understanding
Summary
The paper introduces a novel approach to understanding chemical dynamics through the development of Chemical Dynamics Understanding (ChemDU), which translates 4D molecular trajectories into natural-language explanations. Traditional methods have relied on static molecular representations, limiting their ability to model dynamic phenomena such as bond breaking and conformational changes. To address this, the authors present Chem4DBench, a benchmark dataset that pairs 4D molecular trajectories with expert-authored explanations, focusing on gas-phase and catalytic reactions. The proposed Chem4DLLM model integrates an equivariant graph encoder with a pretrained large language model (LLM) to effectively capture molecular geometry and rotational dynamics. This approach aims to generate coherent narratives that describe key events in molecular trajectories, thereby enhancing the understanding of dynamic chemical processes. The authors argue that their contributions will stimulate further research in dynamic chemical understanding and multimodal scientific reasoning.
Methodology
The authors constructed Chem4DBench to benchmark 4D understanding capabilities of LLMs, focusing on gas-phase and heterogeneous catalytic reactions. They proposed Chem4DLLM, which integrates an equivariant graph encoder with a pretrained LLM to process 4D molecular trajectories and generate structured textual descriptions.
Results
The paper demonstrates that Chem4DLLM can effectively reason over molecular trajectories and generate coherent narratives about dynamic chemical events, outperforming existing static models in capturing the qualitative aspects of chemical processes.
Implications
The findings suggest that Chem4DLLM can facilitate better understanding of complex chemical dynamics, potentially aiding in scientific discovery and the development of new materials and drugs. This work opens avenues for further research in multimodal reasoning and dynamic modeling in chemistry.
MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Large Language Models
Efficient ML
- MXNorm reduces the computational overhead of normalization by reusing block scales from MXFP8 quantization.
- The method achieves a 32x reduction in the size of reductions needed for normalization.
- Validation on Llama 3 models shows minimal loss of training accuracy compared to RMSNorm.
- MXNorm provides kernel speedups of up to 2.4x over RMSNorm, enhancing efficiency in large language models.
Read more
MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Summary
The paper introduces MXNorm, a novel normalization method designed to improve the efficiency of tensor normalization in deep learning models, particularly in the context of low-precision matrix multiplications. MXNorm serves as a drop-in replacement for RMSNorm, leveraging block scales from the MXFP8 quantization process to estimate the root mean square (RMS) of activations. This approach significantly reduces the computational overhead associated with normalization, achieving a 32x decrease in the size of reductions needed. The authors validate MXNorm through experiments on Llama 3 models with varying parameter sizes (125M, 1B, and 8B), demonstrating that it maintains training accuracy comparable to traditional RMSNorm while providing practical kernel speedups. The findings highlight the potential of MXNorm to enhance the efficiency of large language models (LLMs) during pre-training, addressing the growing bottleneck in elementwise operations and reductions as matrix multiplication performance improves.
Methodology
The authors propose MXNorm as a fusion of RMSNorm and MX quantization, where the RMS is approximated using block scales calculated during MX quantization. This method allows for a single pass of statistics gathering over the tensor, streamlining the normalization process and reducing computational overhead.
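A minimal sketch of the idea, assuming amax-based block scales and the MXFP8 block size of 32; the `calib` constant is a placeholder for whatever amax-to-RMS calibration MXNorm actually applies:

```python
import math

BLOCK = 32  # MXFP8 shares one scale per 32-element block

def block_scales(x):
    """One amax-based scale per block, as already computed during MX
    quantisation (so normalisation gets these for free)."""
    return [max(abs(v) for v in x[i:i + BLOCK])
            for i in range(0, len(x), BLOCK)]

def rms(x):
    """Exact RMS: a reduction over all n elements."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_from_scales(scales, calib=1.0):
    """Approximate RMS from the block scales alone: a reduction over
    n/32 values instead of n."""
    return calib * math.sqrt(sum(s * s for s in scales) / len(scales))
```

For a 1024-element tensor the normalisation reduction shrinks from 1024 values to 32 block scales, which is the 32x reduction cited above.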
Results
Experiments conducted on Llama 3 models of various sizes (125M, 1B, and 8B parameters) indicate that MXNorm maintains training accuracy similar to RMSNorm while achieving kernel speedups of up to 2.4x. Specifically, MXNorm results in a 1.3% speedup in 8B transformer layers using MXFP8 and a 2.6% speedup in NVFP4.
Implications
The introduction of MXNorm has significant implications for the efficiency of training large language models, particularly in scenarios where low-precision computations are utilized. By addressing the normalization bottleneck, MXNorm could facilitate faster training times and lower resource consumption in deep learning applications.
A Spectral Revisit of the Distributional Bellman Operator under the Cramér Metric
Reinforcement Learning
Theory
- The distributional Bellman operator is contractive under the Cramér metric, ensuring stability in policy evaluation.
- Existing analyses lack insight into the structural dynamics of the Bellman update on distributions.
- The authors introduce a two-level analytical framework to analyze Bellman dynamics at the CDF level and construct regularized Hilbert spaces.
- The framework preserves the intrinsic Cramér geometry while enabling operator-theoretic analysis.
Read more
A Spectral Revisit of the Distributional Bellman Operator under the Cramér Metric
Summary
This paper addresses the structural understanding of the distributional Bellman operator in distributional reinforcement learning (DRL) under the Cramér metric. The authors highlight that while the distributional Bellman operator is known to be contractive under the Cramér metric, existing analyses primarily focus on metric properties without exploring the underlying dynamics of the Bellman update on cumulative distribution functions (CDFs). The authors propose a two-level analytical framework that first examines the Bellman dynamics directly at the CDF level, establishing contraction and fixed-point properties. They then introduce a family of regularized Hilbert spaces that provide a faithful representation of the CDF-level geometry, allowing for the application of operator-theoretic and spectral analysis tools. This approach clarifies the operator structure of distributional Bellman updates and reveals the intrinsic relationship between the Cramér geometry and the dynamics of the Bellman operator, ultimately contributing to a deeper understanding of distributional reinforcement learning.
Methodology
The authors analyze distributional Bellman dynamics directly in the CDF domain, establishing contraction and fixed-point properties. They then construct a family of regularized Hilbert spaces that provide an exact conjugate realization of the CDF-level geometry, facilitating a stable analytical framework for operator-theoretic analysis.
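For reference, the Cramér distance between return distributions and the contraction property the CDF-level analysis starts from (a standard result, stated here in CDF notation; the supremal form takes the supremum over state–action pairs):

```latex
\ell_2(\nu_1, \nu_2) = \left( \int_{\mathbb{R}} \bigl( F_{\nu_1}(x) - F_{\nu_2}(x) \bigr)^2 \, dx \right)^{1/2},
\qquad
\overline{\ell}_2\bigl( \mathcal{T}^{\pi} \eta_1,\, \mathcal{T}^{\pi} \eta_2 \bigr) \le \gamma^{1/2}\, \overline{\ell}_2(\eta_1, \eta_2)
```

The $\gamma^{1/2}$ factor (rather than $\gamma$) is what makes the Cramér geometry distinctive and motivates studying the Bellman operator's action directly on CDFs.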
Results
The paper establishes that the Bellman update acts affinely on CDFs and linearly on their differences, leading to a uniform bound on this linear action. The construction of regularized Hilbert spaces allows for a faithful representation of the CDF-level geometry, preserving the contraction properties of the Bellman operator.
Implications
This work provides a clearer understanding of the operator structure underlying distributional Bellman updates, which can inform the development of more robust and theoretically grounded distributional reinforcement learning methods. It opens avenues for further functional and operator-theoretic analyses in DRL.
H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code
Large Language Models
Generative Models
- H2LooP Spark Preview adapts a 7-billion parameter model for low-level embedded systems programming.
- A large-scale training corpus was created from repository-datasheet pairs, enabling effective domain adaptation.
- The model achieved a 70.4% reduction in in-domain perplexity and surpassed larger models in generative benchmarks.
- Extensive hyperparameter tuning established optimal configurations for continual pretraining.
Read more
H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code
Summary
The paper presents H2LooP Spark Preview, a continual pretraining (CPT) pipeline designed to adapt the OLMo-3-7B language model for low-level embedded systems programming. Recognizing the limitations of large language models (LLMs) in specialized domains, the authors construct a training corpus from 818 repository-datasheet pairs, totaling 76.4 GB of data across 117 manufacturers. The methodology employs BF16 LoRA with Rank-Stabilized scaling on NVIDIA H100 GPUs, resulting in a curated corpus of approximately 23.5 billion tokens. Through extensive hyperparameter exploration involving over 1,400 runs, the study identifies optimal configurations for domain adaptation. The model achieves significant reductions in perplexity and outperforms larger models like Claude Opus 4.6 and Qwen3-Coder-30B in generative code completion tasks across 13 embedded domains. The authors also release a production checkpoint for community research, emphasizing the potential of targeted continual pretraining for enhancing LLM capabilities in specialized technical tasks.
Methodology
The authors utilized a continual pretraining approach with a large-scale corpus constructed from repository-datasheet pairs. They employed BF16 LoRA with Rank-Stabilized scaling on NVIDIA H100 GPUs and conducted systematic hyperparameter exploration across multiple projects, including Bayesian optimization and grid searches to identify optimal training configurations.
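The rank-stabilised scaling mentioned above changes only the LoRA scale factor, from α/r to α/√r. A pure-Python sketch of the adapted forward pass, with tiny illustrative matrices:

```python
import math

def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / sqrt(r)) * B A x: LoRA with rank-stabilised
    scaling (alpha / sqrt(r) rather than classic LoRA's alpha / r).
    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r."""
    def matvec(M, v):
        return [sum(m * u for m, u in zip(row, v)) for row in M]
    r = len(A)
    scale = alpha / math.sqrt(r)
    base = matvec(W, x)                     # frozen pretrained path
    update = matvec(B, matvec(A, x))        # low-rank adapter path
    return [b + scale * u for b, u in zip(base, update)]
```

With B initialised to zero (the usual LoRA init) the adapter is a no-op at the start of training; the α/√r factor keeps update magnitudes stable as the rank r grows, which classic α/r scaling does not.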
Results
The H2LooP Spark Preview model achieved a 70.4% reduction in in-domain perplexity and a 66.1% reduction on held-out repositories. It outperformed larger models in generative code completion tasks, achieving the highest token accuracy in 8 out of 13 embedded categories.
Implications
The findings suggest that continual pretraining can significantly enhance the performance of LLMs in specialized domains like embedded systems, potentially leading to more effective automated code generation and understanding in hardware-software co-design contexts.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
Generative Models
Optimization
Robotics
- PhysMoDPO integrates a Whole-Body Controller into the training of diffusion models for humanoid motion generation.
- The framework uses physics-based and task-specific rewards to ensure generated motions are both realistic and condition-faithful.
- Extensive experiments show consistent improvements in physical realism and task metrics in simulation.
- PhysMoDPO enables zero-shot motion transfer to real robots, demonstrating its practical applicability.
Read more
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
Summary
The paper introduces PhysMoDPO, a novel framework for generating physically plausible humanoid motion using a Direct Preference Optimization approach. Building on recent advancements in text-conditioned human motion generation via diffusion models, the authors address the challenge of ensuring that generated motions remain compliant with physics when executed by a Whole-Body Controller (WBC). Traditional methods often rely on hand-crafted heuristics that can lead to significant deviations from the original motion. In contrast, PhysMoDPO integrates WBC into the training pipeline, optimizing the diffusion model to produce outputs that are both physically realistic and faithful to the provided text instructions. The framework employs physics-based and task-specific rewards to guide the optimization process. Extensive experiments demonstrate that PhysMoDPO consistently enhances physical realism and task-related performance in simulated environments, and it successfully achieves zero-shot motion transfer to a real-world G1 humanoid robot without additional refinement. This work highlights the potential of combining generative models with physics-guided optimization for robotics applications.
Methodology
The authors propose a Direct Preference Optimization framework that incorporates a Whole-Body Controller (WBC) into the training pipeline of diffusion models. They generate multiple candidate motions based on text conditions, execute them through the WBC to obtain simulated motions, and compute physics-based rewards for trackability and contact realism, alongside task-specific rewards for condition fidelity. This optimization process ensures that the generated motions are both physically feasible and aligned with the input conditions.
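The preference-optimisation step reduces to the standard DPO loss, with winner and loser chosen by the WBC rollout's physics and task rewards. A scalar sketch, where sequence log-probabilities stand in for the diffusion model's likelihoods:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    the 'winner' is the candidate motion whose WBC rollout scored
    higher under the physics and task rewards, the 'loser' the one
    that scored lower.  ref_* are reference-model log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

The loss falls as the fine-tuned model shifts probability mass toward the physically plausible candidate relative to the reference model, which is exactly the pressure that keeps generated motions WBC-trackable.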
Results
The experiments conducted show that PhysMoDPO significantly improves physical realism and task-related metrics in simulated environments. Additionally, the framework demonstrates effective zero-shot transfer capabilities to the Unitree G1 humanoid robot, indicating that the generated motions can be deployed in real-world scenarios without further refinement.
Implications
The findings suggest that integrating physics-guided optimization into generative models can enhance the realism and applicability of humanoid motion generation in robotics. This approach could lead to more reliable and scalable solutions for motion tracking and policy training in various robotic applications.
Beyond Barren Plateaus: A Scalable Quantum Convolutional Architecture for High-Fidelity Image Classification
Computer Vision
Theory
Efficient ML
- Introduction of a novel QCNN architecture that mitigates barren plateaus.
- Achieved a classification accuracy of 98.7% on the MNIST dataset.
- Demonstrated a significant reduction in required trainable parameters compared to classical CNNs.
- Utilized localized cost functions and tensor-network initialization to enhance trainability.
Read more
Beyond Barren Plateaus: A Scalable Quantum Convolutional Architecture for High-Fidelity Image Classification
Summary
This paper addresses the challenges faced by Quantum Convolutional Neural Networks (QCNNs) in practical implementations, particularly barren plateaus, in which gradients vanish exponentially with system size, leading to poor classification accuracy. The author proposes a novel QCNN architecture that incorporates localized cost functions and a tensor-network-based initialization strategy to effectively mitigate these barren plateaus. The proposed architecture is evaluated on the MNIST dataset, achieving a classification accuracy of 98.7%, a significant improvement over the baseline QCNN accuracy of 52.32%. Additionally, the architecture demonstrates a parameter-efficiency advantage, requiring O(log N) fewer trainable parameters than classical CNNs to achieve over 95% convergence. This work bridges the gap between theoretical quantum advantages and practical applications, providing a scalable framework for quantum computer vision tasks.
Methodology
The proposed QCNN architecture employs localized cost functions instead of global observables and utilizes a Tensor Network Initialization (TNI) protocol to initialize quantum parameters effectively. This approach aims to avoid the barren plateau phenomenon by ensuring that the optimization landscape remains trainable.
Results
The optimized QCNN achieved a classification accuracy of 98.7% on the MNIST dataset, significantly outperforming the baseline QCNN accuracy of 52.32%. The architecture also showed a parameter-efficiency advantage, requiring O(log N) fewer trainable parameters than classical CNNs to achieve over 95% convergence.
Implications
This research has significant implications for the field of quantum machine learning, particularly in enhancing the practical applicability of QCNNs for image classification tasks. It paves the way for more efficient quantum algorithms that can leverage quantum computing's potential in real-world applications.
NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation
NLP
Large Language Models
Efficient ML
- NeuroLoRA introduces a context-aware neuromodulation mechanism to enhance expert selection in multi-task adaptation.
- The framework retains the computational efficiency of frozen random projections while allowing for dynamic adjustments based on input context.
- A Contrastive Orthogonality Loss is proposed to improve task decoupling and mitigate catastrophic forgetting in continual learning.
- Extensive experiments show that NeuroLoRA consistently outperforms existing methods like FlyLoRA in various adaptation scenarios.
Read more
NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation
Summary
The paper introduces NeuroLoRA, a novel framework for parameter-efficient fine-tuning of large language models (LLMs) that addresses the limitations of existing methods like FlyLoRA. While FlyLoRA employs a static routing mechanism based on input magnitude, NeuroLoRA incorporates a context-aware neuromodulation gate that dynamically adjusts the projection space based on the input context. This approach enhances expert selection by allowing for context-sensitive activation of model parameters. Additionally, the authors propose a Contrastive Orthogonality Loss to ensure separation between expert subspaces, which aids in task decoupling and supports continual learning. The experimental results demonstrate that NeuroLoRA outperforms FlyLoRA and other strong baselines across various benchmarks, including MMLU, GSM8K, and ScienceQA, in single-task adaptation, multi-task merging, and continual learning scenarios, while maintaining parameter efficiency.
Methodology
NeuroLoRA employs a Mixture-of-Experts architecture combined with a context-aware neuromodulation gate that dynamically rescales the projection space based on input context. It also introduces a Contrastive Orthogonality Loss to enforce separation between expert subspaces, enhancing both task decoupling and continual learning capabilities.
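The orthogonality term can be sketched as a pairwise penalty. Representing each expert's subspace by a single flattened vector is a simplification, and the version here omits NeuroLoRA's contrastive negatives and any temperature:

```python
import math

def orthogonality_loss(experts):
    """Sum of squared cosine similarities over all expert pairs: zero
    when expert subspaces are mutually orthogonal, growing as they
    overlap.  `experts` is a list of flattened parameter vectors."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    return sum(cos(experts[i], experts[j]) ** 2
               for i in range(len(experts))
               for j in range(i + 1, len(experts)))
```

Driving this penalty toward zero keeps expert subspaces disjoint, which is what lets a new task be learned without overwriting the directions earlier tasks occupy.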
Results
The experiments conducted on MMLU, GSM8K, and ScienceQA benchmarks indicate that NeuroLoRA outperforms FlyLoRA and other strong baselines in single-task adaptation, multi-task model merging, and sequential continual learning, while maintaining comparable parameter efficiency.
Implications
NeuroLoRA's context-aware neuromodulation can lead to more effective multi-task learning and continual learning in LLMs, potentially improving their adaptability and performance across diverse applications in natural language processing.
LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
Large Language Models
Efficient ML
- Introduces a novel expert replacing paradigm for MoE models.
- Achieves significant memory efficiency without sacrificing performance.
- Demonstrates superior performance compared to existing compression methods.
- Utilizes adaptive expert selection and hierarchical expert construction.
Read more
LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
Summary
This paper introduces LightMoE, a novel framework aimed at reducing the redundancy in Mixture-of-Experts (MoE) models through a technique called expert replacing. Traditional MoE architectures, while efficient, face challenges related to high memory demands due to the need to load multiple expert modules. Existing compression methods, such as pruning and merging, often lead to irreversible knowledge loss or high training overhead. The authors propose expert replacing as a solution, which involves substituting less critical experts with parameter-efficient modules and recovering their capabilities with minimal training costs. LightMoE enhances this approach by implementing adaptive expert selection, hierarchical expert construction, and an annealed recovery strategy. Experimental results demonstrate that LightMoE achieves performance comparable to LoRA fine-tuning at a 30% compression ratio and outperforms existing methods at a 50% compression rate, showing an average performance improvement of 5.6% across five diverse tasks. This indicates that LightMoE effectively balances memory efficiency, training efficiency, and model performance.
Methodology
The LightMoE framework consists of three main stages: (1) adaptive thresholding for selecting less important experts, (2) hierarchical construction of shared bases with task-specific low-rank adaptation parameters, and (3) an annealed recovery strategy that gradually integrates original experts into the compressed structure during fine-tuning.
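Stage (1) can be sketched as a ranking step. Scoring experts by a given importance value and targeting a fixed fraction is a simplification of the adaptive thresholding described above:

```python
def select_replaceable(importance, compress_ratio):
    """Rank experts by importance and pick the least-important fraction
    for replacement by parameter-efficient modules.  `importance` maps
    expert id to an importance score (higher = keep)."""
    n_replace = int(len(importance) * compress_ratio)
    ranked = sorted(importance, key=importance.get)  # ascending importance
    return set(ranked[:n_replace])
```

At a 50% compression ratio over eight experts, the four lowest-scoring experts are swapped out, while the high-importance experts stay intact, which is why the approach avoids the irreversible knowledge loss of pruning.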
Results
LightMoE matches the performance of LoRA fine-tuning at a 30% compression ratio and outperforms existing methods by an average of 5.6% at a 50% compression rate across five diverse tasks, demonstrating its effectiveness in maintaining model performance while reducing memory usage.
Implications
The findings suggest that LightMoE can facilitate the deployment of large language models in resource-constrained environments, making them more accessible for real-world applications. The expert replacing approach could also inspire further research into efficient model compression techniques.
Adaptive Conditional Forest Sampling for Spectral Risk Optimisation under Decision-Dependent Uncertainty
Optimization
- Introduces a four-phase framework for spectral risk optimization under decision-dependent uncertainty.
- Utilizes Generalised Random Forests for adaptive conditional sampling to address distribution shifts.
- Implements a two-stage oracle reranking mechanism to enhance solution quality.
- Demonstrates superior performance in reducing variance and improving reliability compared to existing methods.
Read more
Adaptive Conditional Forest Sampling for Spectral Risk Optimisation under Decision-Dependent Uncertainty
Summary
This paper addresses the challenge of minimizing spectral risk, defined as a combination of expected cost and Conditional Value-at-Risk (CVaR), in scenarios where the uncertainty distribution is dependent on decision variables. The author proposes a novel framework called Adaptive Conditional Forest Sampling (ACFS), which consists of four phases: (1) using Generalised Random Forests (GRF) for approximating decision-conditional distributions, (2) employing a CEM-guided global exploration strategy, (3) implementing rank-weighted focused augmentation, and (4) conducting a two-stage oracle reranking process to refine the results. The framework is evaluated on two distinct data-generating processes—a decision-dependent Student-t copula and a Gaussian copula with log-normal marginals—across various configurations. The results demonstrate that ACFS consistently achieves lower median oracle spectral risk compared to competitors, particularly excelling in reducing cross-replication dispersion, indicating improved reliability in optimization outcomes. The paper also includes ablation and sensitivity analyses to validate the robustness of the proposed method.
Methodology
The methodology involves a four-phase simulation-optimization framework that integrates Generalised Random Forests for conditional distribution approximation, a CEM-guided exploration strategy, rank-weighted focused augmentation, and a two-stage reranking process to refine the optimization results. The use of antithetic variates and control variates is also employed to minimize variance in the estimation process.
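The objective being minimised can be written down directly. An empirical sketch, assuming a simple convex combination of expected cost and CVaR with weight λ (the exact combination and the variance-reduction machinery are the paper's):

```python
def cvar(costs, alpha):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of costs."""
    k = max(1, int(round(alpha * len(costs))))
    tail = sorted(costs, reverse=True)[:k]
    return sum(tail) / len(tail)

def spectral_risk(costs, lam, alpha):
    """lam * mean + (1 - lam) * CVaR_alpha over sampled costs."""
    return lam * sum(costs) / len(costs) + (1.0 - lam) * cvar(costs, alpha)
```

Because the costs are sampled from a decision-conditional distribution, every candidate decision needs its own sample set, which is what the GRF-based conditional sampling in phase (1) provides.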
Results
ACFS outperforms existing methods, achieving the lowest median oracle spectral risk in all configurations tested on the Gaussian copula benchmark, with median gaps over GP-BO ranging from 6.0% to 20.0%. On the Student-t copula benchmark, ACFS and GP-BO are statistically indistinguishable in median objective, but ACFS significantly reduces cross-replication dispersion by approximately 1.8 to 1.9 times, indicating enhanced reliability.
Implications
The proposed ACFS framework has significant implications for decision-making under uncertainty in various fields, including finance, supply chain management, and resource allocation, where controlling exposure to extreme risks is critical. The methodology can be adapted for other optimization problems involving decision-dependent uncertainties.
Differentiable Thermodynamic Phase-Equilibria for Machine Learning
Optimization
Theory
- Introduction of DISCOMAX, a differentiable algorithm for phase-equilibrium calculations.
- Ensures thermodynamic consistency during training and inference.
- Outperforms existing surrogate-based methods for binary liquid-liquid equilibrium data.
- Provides a general framework for learning from different types of equilibrium data.
Read more
Differentiable Thermodynamic Phase-Equilibria for Machine Learning
Summary
The paper addresses the challenge of accurately predicting phase equilibria in chemical engineering using a novel differentiable algorithm called DISCOMAX. This method integrates thermodynamic principles into machine learning frameworks, specifically for modeling liquid-liquid equilibria, which are traditionally difficult to handle due to their reliance on extremum principles. DISCOMAX ensures thermodynamic consistency during both training and inference by employing a discrete enumeration of feasible states and a masked softmax aggregation. The authors demonstrate that this approach outperforms existing surrogate-based methods when evaluated on binary liquid-liquid equilibrium data. The paper highlights the potential of DISCOMAX as a general framework for learning from various types of equilibrium data, thus advancing the integration of machine learning with thermodynamic modeling.
Methodology
The authors developed DISCOMAX, which utilizes a discrete enumeration of feasible states and a masked softmax aggregation to ensure thermodynamic consistency. The method incorporates a straight-through gradient estimator to facilitate end-to-end learning of neural excess Gibbs energy models, allowing for the differentiation of equilibrium states in a bilevel optimization framework.
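The masked softmax aggregation can be pictured as a temperature-controlled soft selection of the minimum-Gibbs-energy candidate among enumerated feasible states. A hedged numpy sketch of that forward pass (candidate enumeration and the temperature `tau` are illustrative; in the paper a straight-through estimator additionally lets the forward pass take the hard argmax while gradients flow through the soft weights):

```python
import numpy as np

def masked_softmax_select(energies, feasible, tau=0.1):
    """Soft weights over enumerated candidate states: infeasible states are
    masked out, and lower Gibbs energy receives higher weight as tau -> 0."""
    logits = np.where(feasible, -np.asarray(energies, float) / tau, -np.inf)
    w = np.exp(logits - logits.max())  # exp(-inf) underflows to exactly 0
    return w / w.sum()
```

As `tau` shrinks the weights concentrate on the feasible state of lowest energy, which is how the extremum principle becomes differentiable.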
Results
The evaluation of DISCOMAX on binary liquid-liquid equilibrium data showed that it significantly outperformed existing surrogate-based methods, demonstrating its effectiveness in accurately predicting phase equilibria while maintaining thermodynamic consistency.
Implications
The findings suggest that DISCOMAX can be applied in various engineering and scientific fields where accurate phase equilibrium predictions are crucial, such as in the design of chemical processes, pharmaceutical formulations, and material science. This approach could lead to more efficient and reliable modeling of complex thermodynamic systems.
GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration
Theory
- Introduction of GeoChemAD, a comprehensive benchmark dataset for unsupervised geochemical anomaly detection.
- Benchmarking of various unsupervised anomaly detection methods, establishing the first unified performance comparison.
- Development of GeoChemFormer, a transformer-based framework that enhances anomaly detection through self-supervised learning.
- Demonstration of superior performance of GeoChemFormer across diverse geochemical datasets.
Read more
GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration
Summary
The paper presents GeoChemAD, an open-source benchmark dataset for unsupervised geochemical anomaly detection (GAD) in mineral exploration, addressing significant limitations in existing studies, such as reliance on proprietary datasets and single-region evaluations. The dataset is compiled from government-led geological surveys and includes eight subsets that represent diverse spatial scales, sampling sources, and target elements. The authors benchmark various unsupervised anomaly detection methods, including statistical models, generative models, and transformer-based approaches, establishing strong baselines for comparison. A novel framework, GeoChemFormer, is introduced, which utilizes self-supervised pretraining to learn target-element-aware geochemical representations. Experimental results demonstrate that GeoChemFormer outperforms existing methods in both anomaly detection accuracy and generalization capability across all subsets, providing a robust foundation for future research in GAD. The dataset and code are publicly available, promoting reproducibility and further exploration in the field.
Methodology
The authors compiled the GeoChemAD dataset from government geological surveys, covering multiple regions and sampling conditions. They reproduced and benchmarked various unsupervised anomaly detection methods, including statistical models, autoencoders, variational autoencoders, and transformer-based models. The GeoChemFormer framework was developed to leverage self-supervised pretraining for learning geochemical representations tailored to specific target elements.
Results
GeoChemFormer consistently achieved superior performance in anomaly detection accuracy and generalization across all eight subsets of the GeoChemAD dataset, outperforming existing unsupervised methods. The comprehensive evaluations provided insights into the strengths and limitations of different approaches under varied real-world scenarios.
Implications
The introduction of GeoChemAD and the GeoChemFormer framework has significant implications for mineral exploration, enabling more effective identification of geochemical anomalies and supporting better decision-making in resource exploration. The open-source nature of the dataset and methods fosters collaboration and innovation in the field.
Exact Federated Continual Unlearning for Ridge Heads on Frozen Foundation Models
Federated Learning
Theory
Efficient ML
- Introduces exact federated continual unlearning for ridge heads on frozen foundation models.
- Develops a communication protocol that allows for efficient handling of add/delete requests.
- Proves deterministic exactness and invariance properties of the proposed methods.
- Demonstrates experimental validation matching centralized retraining with minimal error.
Read more
Exact Federated Continual Unlearning for Ridge Heads on Frozen Foundation Models
Summary
This paper addresses the challenge of exact federated continual unlearning in the context of frozen foundation models with ridge regression heads. The authors highlight the necessity for models to comply with the 'right to be forgotten' by effectively removing the influence of specific training samples or users. Existing federated unlearning methods often rely on approximate techniques, which can be costly and imprecise. The authors propose a novel approach that leverages the analytic structure of ridge regression, allowing for exact updates based on two sufficient statistics: the feature Gram matrix and the feature-label moment. They develop a communication protocol that supports a continuous stream of add and delete requests without the need for retraining, ensuring that the server's model remains identical to one obtained through centralized retraining. The paper presents two server-side implementations for maintaining these statistics and demonstrates their effectiveness through experiments on multiple benchmarks, achieving high precision and efficiency compared to existing methods.
Methodology
The authors formalize the problem of federated continual unlearning for ridge heads and derive a protocol where clients send fixed-size sufficient-statistic messages for each add/delete event. The server updates global statistics and maintains an exact model equivalent to centralized retraining. Two server-side implementations are proposed: a numerically robust exact solver and an incremental inverse tracker using Sherman–Morrison–Woodbury updates.
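The exactness argument rests on the fact that a ridge head depends on the data only through G = λI + XᵀX and b = Xᵀy, so add and delete events are exact rank-k updates of (G, b). A minimal sketch of the exact-solver variant (the incremental Sherman–Morrison–Woodbury inverse tracker is omitted):

```python
import numpy as np

class RidgeHead:
    """Exact ridge head maintained from sufficient statistics.

    G = lam * I + sum_i x_i x_i^T,  b = sum_i x_i y_i.
    Deleting a batch subtracts its contribution from (G, b), so the
    resulting model equals retraining from scratch without those rows.
    """
    def __init__(self, dim, lam=1.0):
        self.G = lam * np.eye(dim)
        self.b = np.zeros(dim)

    def add(self, X, y):
        self.G += X.T @ X
        self.b += X.T @ y

    def delete(self, X, y):
        self.G -= X.T @ X
        self.b -= X.T @ y

    def weights(self):
        return np.linalg.solve(self.G, self.b)
```

Because each client only ever needs to ship the fixed-size pair (XᵀX, Xᵀy) per event, communication cost is independent of the number of samples added or deleted.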
Results
Experiments on benchmarks such as CIFAR-10 and Sentiment140 show that both proposed variants achieve a relative Frobenius error of 10^-9 compared to centralized ridge retraining. The methods efficiently handle a stream of deletion requests at a significantly lower computational cost than existing federated unlearning baselines.
Implications
The findings provide a robust framework for implementing federated learning systems that comply with data privacy regulations, particularly in sensitive domains like healthcare and finance. The exact unlearning capability enhances user trust and system compliance with legal requirements.
Comparison of Outlier Detection Algorithms on String Data
Theory
- Introduces a modified local outlier factor algorithm for string data using Levenshtein distance.
- Presents a new outlier detection algorithm based on hierarchical left regular expression learning.
- Demonstrates the effectiveness of both algorithms in identifying outliers in string datasets.
- Highlights the conditions under which each algorithm performs optimally.
Read more
Comparison of Outlier Detection Algorithms on String Data
Summary
This thesis addresses the under-researched area of outlier detection in string data, contrasting it with the extensive literature on numerical data outlier detection. The author proposes and evaluates two distinct algorithms tailored for string data. The first is a modified version of the local outlier factor (LOF) algorithm, which utilizes the Levenshtein distance to assess dataset density, incorporating a weighted Levenshtein measure that accounts for hierarchical character classes to enhance algorithm tuning for specific datasets. The second algorithm is a novel approach based on hierarchical left regular expression learning, which generates a regular expression to represent expected data patterns. Through experimental evaluations on various datasets, the study demonstrates that both algorithms effectively identify outliers in string data. The regular expression-based method excels when the expected data exhibits a clear structure that significantly diverges from that of the outliers, while the LOF variant is more effective when the edit distances between outliers and expected data are distinctly different. This work contributes to the field by providing new methodologies for outlier detection in string data, which is crucial for applications such as data cleaning and anomaly detection in system logs.
Methodology
The study employs two main methodologies: a variant of the local outlier factor algorithm adapted for string data using the Levenshtein distance, and a new algorithm based on hierarchical left regular expression learning. The performance of both algorithms is evaluated through experiments on various datasets with and without outliers, analyzing their ability to detect anomalies based on different structural characteristics of the data.
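To make the distance side concrete, here is a minimal sketch of Levenshtein distance plus a simplified density score (mean edit distance to the k nearest neighbours) standing in for the full LOF computation; the weighted Levenshtein with hierarchical character classes is omitted:

```python
import numpy as np

def levenshtein(a, b):
    """Standard edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def knn_outlier_scores(strings, k=2):
    """Mean edit distance to the k nearest neighbours (higher = more
    outlying). A simplified stand-in for the thesis's LOF variant."""
    n = len(strings)
    d = np.array([[levenshtein(s, t) for t in strings] for s in strings])
    return np.array([np.sort(np.delete(d[i], i))[:k].mean() for i in range(n)])
```

A string far (in edit distance) from every dense cluster of expected values receives a large score, which is exactly the regime where the thesis finds the LOF variant effective.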
Results
The experimental results indicate that both algorithms can successfully identify outliers in string data. The regular expression-based algorithm outperforms the LOF variant when the expected data has a distinct structure, while the LOF variant is more effective when the edit distances between outliers and expected data are sufficiently different.
Implications
The findings suggest that robust outlier detection algorithms for string data can significantly enhance data cleaning processes and anomaly detection in various applications, including system log analysis and automated data quality assurance.
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
Time Series
Interpretability
- Introduction of Knowledge-Guided TSED (K-TSED) for event detection using natural language descriptions.
- Development of the Event Logic Tree (ELT) framework to represent temporal-logic structures of events.
- Creation of a neuro-symbolic VLM agent system (SELA) for zero-shot event detection.
- Validation through a benchmark demonstrating superior performance compared to existing methods.
Read more
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
Summary
This paper addresses the challenge of Time Series Event Detection (TSED), which is critical in various high-stakes domains but complicated by the need for semantic understanding of events that often lack sufficient labeled data. The authors propose a novel approach termed Knowledge-Guided TSED (K-TSED), which allows models to detect events based on natural language descriptions rather than relying solely on extensive training data. Central to this approach is the Event Logic Tree (ELT), a knowledge representation framework that captures the temporal-logic structures of events, facilitating the mapping of linguistic descriptions to time series data. The authors introduce a neuro-symbolic VLM agent framework, SELA, which consists of Logic Analyst agents that construct ELT schemas from event descriptions and Signal Inspector agents that identify relevant time series intervals guided by these schemas. The effectiveness of the proposed method is validated through a newly created benchmark based on real-world time series data, demonstrating significant improvements over traditional supervised methods and existing zero-shot reasoning frameworks. The results indicate that the ELT framework is crucial in reducing the hallucination issues commonly faced by VLMs, thus enhancing both accuracy and explainability in event detection.
Methodology
The authors propose a two-part neuro-symbolic VLM agent system, SELA, which includes Logic Analyst agents for constructing ELT schemas from natural language descriptions and Signal Inspector agents for locating and refining time series intervals based on these schemas. The ELT framework serves as an intermediary representation that guides the agents' reasoning and provides structured explanations.
Results
The proposed method outperformed traditional supervised fine-tuning approaches and existing zero-shot reasoning frameworks, achieving near-human-level performance in low-resource settings. An ablation study confirmed the critical role of the ELT in reducing hallucination problems in VLMs.
Implications
This work has significant implications for fields requiring reliable event detection in time series data, such as healthcare and energy management, by providing a more explainable and data-efficient approach to understanding complex events.
FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control
Reinforcement Learning
Robotics
- FastDSAC effectively scales Maximum Entropy RL for high-dimensional humanoid control.
- Dimension-wise Entropy Modulation (DEM) improves exploration efficiency by redistributing the exploration budget.
- A continuous distributional critic enhances value fidelity and reduces overestimation errors.
- FastDSAC achieves performance gains of 180% and 400% over deterministic baselines on challenging tasks.
Read more
FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control
Summary
The paper presents FastDSAC, a novel framework designed to enhance Maximum Entropy Reinforcement Learning (RL) for high-dimensional humanoid control tasks. Traditional approaches have struggled with the 'curse of dimensionality', leading to exploration inefficiencies and training instabilities. FastDSAC addresses these challenges by introducing two key innovations: Dimension-wise Entropy Modulation (DEM) and a continuous distributional critic. DEM allows for dynamic redistribution of the exploration budget across action dimensions, effectively reducing noise from irrelevant dimensions. The continuous distributional critic, parameterized by a Gaussian distribution, mitigates value overestimation issues that arise in high-dimensional settings. Empirical evaluations on various benchmarks, including Humanoid-Bench and MuJoCo Playground, demonstrate that FastDSAC outperforms existing state-of-the-art methods, achieving significant performance gains on complex tasks such as Basketball and Balance Hard. The findings suggest that well-designed stochastic policies can lead to superior outcomes in high-dimensional robotics, challenging the dominance of deterministic methods in this field.
Methodology
FastDSAC integrates Dimension-wise Entropy Modulation (DEM) to dynamically allocate exploration resources across action dimensions, minimizing noise from irrelevant dimensions. Additionally, it employs a continuous distributional critic to provide a robust estimation of the return distribution, addressing value overestimation issues common in high-dimensional action spaces.
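One plausible reading of dimension-wise entropy modulation is splitting a total entropy budget across action dimensions in proportion to per-dimension importance, then backing out Gaussian policy standard deviations from the per-dimension entropy targets. The importance scores here are hypothetical inputs, not the paper's actual signal:

```python
import numpy as np

def modulate_entropy(importance, total_entropy):
    """Split a total entropy budget across action dimensions.

    importance: nonnegative per-dimension scores (hypothetical).
    Returns Gaussian std devs whose differential entropies
    0.5 * log(2*pi*e*sigma^2) sum to total_entropy, so unimportant
    dimensions become near-deterministic.
    """
    w = np.asarray(importance, dtype=float)
    w = w / w.sum()
    h = w * total_entropy  # per-dimension entropy targets
    return np.sqrt(np.exp(2.0 * h) / (2.0 * np.pi * np.e))
```

The effect matches the summary's description: exploration noise is concentrated in the dimensions that matter, instead of being spread uniformly over all of them.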
Results
FastDSAC consistently matches or surpasses state-of-the-art baselines across multiple high-dimensional control tasks, achieving notable performance improvements of approximately 180% on the Basketball task and 400% on the Balance Hard task compared to the FastTD3 baseline.
Implications
The advancements presented in FastDSAC could lead to more efficient training and improved performance in robotic control applications, particularly in environments with high-dimensional action spaces. This framework may also inspire further research into stochastic policy methods in reinforcement learning.
High-resolution weather-guided surrogate modeling for data-efficient cross-location building energy prediction
Optimization
Time Series
Efficient ML
- Introduces a high-resolution weather-informed surrogate modeling approach for building energy prediction.
- Achieves effective cross-location generalization with minimal simulation effort.
- Maintains high predictive accuracy when trained on a single location and applied to others within the same climate zone.
- Utilizes weekly weather data to capture short-term energy demand patterns, improving model transferability.
Read more
High-resolution weather-guided surrogate modeling for data-efficient cross-location building energy prediction
Summary
This paper addresses the challenges of building energy prediction across different locations by introducing a high-resolution weather-informed surrogate modeling approach. Traditional physics-based simulation tools, while accurate, are computationally expensive and slow, making them impractical for iterative design optimization. Existing surrogate models are often location-specific and require extensive simulations from multiple sites to generalize effectively. The proposed method captures recurring short-term weather-driven energy demand patterns common to various regions, allowing for a generalized surrogate model that maintains high predictive accuracy even when trained on a single location. The study demonstrates that the model performs well within the same climate zone and exhibits minimal performance degradation across different climate zones. By utilizing weekly weather data instead of annual aggregates, the approach enhances the model's ability to learn fine-grained weather-energy relationships, thereby improving scalability and reusability in building design optimization workflows.
Methodology
The study employs a high-resolution weather-informed surrogate modeling approach that captures short-term weather-energy demand patterns. It analyzes the impact of intra- and inter-climate-zone representation in training data and evaluates three time-series learning strategies: Temporal Convolutional Networks (TCNs), Transformer-based encoders, and convolutional autoencoders to determine the best techniques for scalable and transferable surrogate modeling.
Results
Experimental results indicate that the proposed surrogate model retains high predictive accuracy for buildings in the same climate zone when trained on a single location. The model shows only minimal degradation in performance when applied across different climate zones, demonstrating its potential for climate-informed generalization.
Implications
The findings suggest that the proposed approach can significantly enhance the efficiency and scalability of building design optimization workflows, promoting more sustainable building practices by reducing reliance on computationally intensive simulations.
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Large Language Models
Efficient ML
Optimization
- Introduction of hindsight-optimal reasoning length (HORL) for determining optimal exit points in CoT reasoning.
- Development of TERMINATOR, an inference-time early-exit algorithm that significantly reduces CoT lengths.
- Creation of a novel dataset for training the early-exit strategy based on the first logical arrival of final answers.
- Demonstration of substantial reductions in reasoning lengths (14%–55%) across multiple datasets.
Read more
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Summary
The paper introduces TERMINATOR, an innovative early-exit strategy for Large Reasoning Models (LRMs) that aims to reduce unnecessary computation during Chain-of-Thought (CoT) reasoning. LRMs excel in complex reasoning tasks by generating intermediate tokens, but they often engage in excessive reasoning, leading to inefficiencies. The authors propose the concept of hindsight-optimal reasoning length (HORL), which identifies the earliest point in the reasoning process where the final answer can be confidently produced. By leveraging this concept, they create a dataset of optimal reasoning lengths and develop a binary probe classifier that predicts when to exit the reasoning process. TERMINATOR demonstrates significant reductions in CoT lengths, achieving an average decrease of 14%–55% across four challenging datasets (MATH-500, AIME 2025, HumanEval, and GPQA) while outperforming existing state-of-the-art methods. This work highlights the potential for optimizing reasoning processes in LRMs, making them more efficient without sacrificing performance.
Methodology
The authors designed TERMINATOR by first defining the concept of hindsight-optimal reasoning length (HORL), which identifies the earliest point in the reasoning process where the final answer can be produced. They constructed a dataset of optimal reasoning lengths by analyzing the first logical arrival of answers in CoT outputs. A binary probe classifier was then trained to predict whether to exit the reasoning process at each token, allowing for early termination of reasoning when confidence in the answer is high.
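The exit rule itself reduces to scanning per-token probe probabilities for the first crossing of a confidence threshold. A sketch assuming a linear probe with hypothetical parameters `w`, `b` (the paper's probe is trained on HORL-derived labels):

```python
import numpy as np

def first_exit_token(hidden_states, w, b, threshold=0.9):
    """Scan token hidden states in order; return the index of the first
    token whose probe probability exceeds the threshold, or -1 to keep
    generating to the end of the chain of thought."""
    logits = hidden_states @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid of linear probe
    hits = np.nonzero(probs > threshold)[0]
    return int(hits[0]) if hits.size else -1
```

The threshold trades off length savings against the risk of exiting before the answer has logically arrived.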
Results
TERMINATOR achieved an average reduction in CoT lengths of 14%–55% across four datasets (MATH-500, AIME 2025, HumanEval, and GPQA), while also outperforming current state-of-the-art methods in terms of efficiency and maintaining accuracy.
Implications
The findings suggest that optimizing reasoning lengths in LRMs can lead to significant computational savings, making these models more efficient for practical applications. This could enhance the deployment of LRMs in real-time systems where computational resources are limited.
Statistical and structural identifiability in representation learning
Theory
Generative Models
Computer Vision
- Introduces statistical and structural identifiability as distinct concepts in representation learning.
- Proposes model-agnostic definitions of near-identifiability allowing for error tolerance.
- Demonstrates that ICA can resolve linear ambiguities in representation learning models.
- Achieves state-of-the-art disentanglement using a vanilla autoencoder combined with ICA.
Read more
Statistical and structural identifiability in representation learning
Summary
This paper addresses the stability of internal representations in representation learning models by formalizing it into two distinct concepts: statistical identifiability and structural identifiability. The authors introduce model-agnostic definitions of statistical and structural near-identifiability, allowing for some error tolerance (ϵ). They prove that representations from models with nonlinear decoders can achieve statistical ϵ-near-identifiability, extending existing theories beyond last-layer representations to include intermediate representations in various models such as masked autoencoders and supervised learners. The paper also demonstrates that independent component analysis (ICA) can resolve remaining ambiguities, and empirically validates the near-identifiability claims. The proposed method for disentanglement, which combines ICA post-processing with latent representations, achieves state-of-the-art results on synthetic benchmarks and improves generalization in biological applications, specifically in cell microscopy.
Methodology
The authors formalize statistical and structural identifiability, proving new results on ϵ-near-identifiability for various models. They conduct synthetic experiments to validate their theoretical claims and apply ICA for disentanglement in latent spaces.
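The "identifiable up to a linear map" notion can be checked numerically: two representations of the same latents that differ by an invertible linear transformation are recoverable from each other by least squares, and this linear residue is the ambiguity ICA is then used to resolve. A toy sketch with synthetic latents:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 3))   # shared ground-truth latents
A = rng.normal(size=(3, 3))     # unknown invertible linear mixing
r1 = z                          # representation learned by model 1
r2 = z @ A.T                    # model 2: same content, linearly mixed

# Statistical identifiability up to a linear map: the transformation
# taking r1 to r2 is recoverable by ordinary least squares.
L, *_ = np.linalg.lstsq(r1, r2, rcond=None)
```

In this toy setting `L` recovers the mixing exactly; on real latents the fit is only approximate, which is what the ϵ in ϵ-near-identifiability quantifies.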
Results
The paper shows that intermediate representations in models with statistically identifiable outputs are also statistically ϵ-near identifiable. It validates that ICA effectively resolves linear indeterminacies and demonstrates that the proposed disentanglement method achieves competitive results on benchmark datasets.
Implications
The findings suggest that representation learning models can be more reliably interpreted and utilized in various applications, particularly in fields requiring disentanglement of complex data, such as biology and computer vision.
Fingerprinting Concepts in Data Streams with Supervised and Unsupervised Meta-Information
Time Series
- FiCSUM framework combines supervised and unsupervised meta-information for concept representation.
- Dynamic weighting strategy enhances the adaptability of the framework across different datasets.
- FiCSUM significantly outperforms existing methods in classification accuracy and concept drift detection.
- Concept fingerprints allow for effective reuse of classifiers for recurring concepts.
Read more
Fingerprinting Concepts in Data Streams with Supervised and Unsupervised Meta-Information
Summary
This paper addresses the challenge of concept drift in data streams, where the distribution of data changes over time, impacting the performance of classification algorithms. The authors propose a novel framework called FiCSUM (Fingerprinting with Combined Supervised and Unsupervised Meta-Information) to create concept representations that uniquely identify concepts in a data stream. The framework utilizes a vector of diverse meta-information features to construct a fingerprint for each concept, allowing for better detection of concept drift. The authors highlight that existing methods often rely on a limited number of meta-information features, which can lead to failures in distinguishing between concepts. FiCSUM combines both supervised and unsupervised meta-information features, employing a dynamic weighting strategy to adjust the influence of each feature based on the dataset. The evaluation of FiCSUM across 11 real-world and synthetic datasets demonstrates its superior performance in classification accuracy and its ability to model underlying concept drift compared to state-of-the-art methods. The findings suggest that FiCSUM can effectively adapt to changes in data streams and improve the robustness of classification systems.
Methodology
The authors developed the FiCSUM framework, which constructs a fingerprint vector from a wide range of meta-information features to represent concepts in data streams. A dynamic weighting scheme is employed to learn and adjust the influence of each feature in real-time, enhancing the framework's flexibility and generalizability across various datasets.
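A toy version of the fingerprint-and-match idea, with a deliberately tiny meta-feature vector (the framework uses a much richer feature set and learns the weights online):

```python
import numpy as np

def fingerprint(window_X, window_err):
    """Toy concept fingerprint: unsupervised statistics of the feature
    window plus a supervised error-rate feature."""
    return np.array([window_X.mean(), window_X.std(), np.mean(window_err)])

def match_concept(fp, stored, weights, threshold=1.0):
    """Return the index of the closest stored fingerprint under a
    weighted L2 distance, or -1 if none is within threshold (treated
    as a new concept, i.e. drift)."""
    if not stored:
        return -1
    d = [np.sqrt(((weights * (fp - s)) ** 2).sum()) for s in stored]
    i = int(np.argmin(d))
    return i if d[i] <= threshold else -1
```

Matching a stored fingerprint is what lets the framework reuse the classifier trained on that recurring concept instead of retraining from scratch.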
Results
The evaluation of FiCSUM showed that it achieved significantly better classification accuracy and a higher ability to capture ground truth concepts compared to purely supervised or unsupervised methods. The framework demonstrated effective discrimination between concepts across diverse datasets with varying types of concept drift.
Implications
The FiCSUM framework has potential applications in real-time data analysis, particularly in fields where data streams are prevalent, such as sensor networks, financial markets, and online services. Its ability to adapt to concept drift can enhance the performance of machine learning models in dynamic environments.
Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models
NLP
Large Language Models
Interpretability
- TreeKD enhances LLM performance in MPP by distilling knowledge from tree-based models.
- The method verbalizes decision tree rules into natural language for LLM training.
- Rule-consistency improves prediction robustness by ensembling outputs from diverse rules.
- TreeKD narrows the performance gap between LLMs and specialist models in MPP tasks.
Read more
Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models
Summary
This paper addresses the challenge of Molecular Property Prediction (MPP) in drug discovery, where Large Language Models (LLMs) have shown potential but currently underperform compared to specialist models. The authors propose a novel knowledge distillation method called TreeKD, which transfers knowledge from tree-based specialist models to LLMs. The method involves training decision trees on functional group features and verbalizing their predictive rules into natural language, allowing LLMs to learn from these insights. Additionally, the authors introduce a test-time scaling technique named rule-consistency, which utilizes a Random Forest to ensemble predictions across diverse rules, enhancing robustness. Experiments conducted on 22 ADMET properties from the TDC benchmark demonstrate that TreeKD significantly improves LLM performance, bridging the gap with state-of-the-art specialist models while maintaining the interpretability and interactive capabilities of LLMs.
Methodology
The authors developed TreeKD, a knowledge distillation method that trains specialist decision trees on functional group features and verbalizes their predictive rules. These rules are integrated into the training context of LLMs. The rule-consistency technique involves using a Random Forest to ensemble predictions from multiple decision trees, enhancing the robustness of the LLM's outputs.
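Rule verbalization can be sketched as walking each root-to-leaf path of a fitted tree and rendering the split conditions as an English sentence; the tree structure and feature names below are hypothetical, not from the paper:

```python
# Hypothetical tiny decision tree over functional-group counts.
tree = {"feat": "n_aromatic_rings", "thr": 2,
        "left": {"leaf": "low permeability"},
        "right": {"feat": "n_hbond_donors", "thr": 5,
                  "left": {"leaf": "high permeability"},
                  "right": {"leaf": "low permeability"}}}

def verbalize(node, conds=()):
    """Turn each root-to-leaf path into an English rule an LLM can read."""
    if "leaf" in node:
        return [f"If {' and '.join(conds)}, then predict {node['leaf']}."]
    le = conds + (f"{node['feat']} <= {node['thr']}",)
    gt = conds + (f"{node['feat']} > {node['thr']}",)
    return verbalize(node["left"], le) + verbalize(node["right"], gt)
```

The resulting sentences are what gets placed in the LLM's training context, which is how the tree's knowledge is distilled without exposing the model to raw feature vectors.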
Results
The experiments on 22 ADMET properties showed that TreeKD significantly outperformed baseline LLMs, achieving competitive performance with specialist models. This demonstrates the effectiveness of knowledge transfer from tree-based models to LLMs in the context of molecular property prediction.
Implications
The findings suggest that combining the strengths of specialist models with LLMs can lead to more effective and interpretable tools for molecular property prediction, potentially accelerating drug discovery processes by improving the accuracy of early-stage screening of drug candidates.
Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks
Large Language Models
Reinforcement Learning
NLP
- Direct evaluation of LLMs can misrepresent their capabilities due to task familiarity.
- The proposed TTRA method aligns models effectively without requiring a specific training dataset.
- Post-alignment, base models show performance comparable to fine-tuned models, especially in reasoning tasks.
- Many reported performance gains from RL and SFT may be artifacts of task familiarity rather than genuine reasoning improvements.
Read more
Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks
Summary
This paper addresses the issue of misleading evaluations of Large Language Models (LLMs) on benchmarks, where strong performance may stem from task familiarity rather than genuine capability. The authors propose a novel two-stage test-time reinforcement learning (RL) alignment method, termed Test-time RL Alignment (TTRA), which allows for task alignment without the need for a specific training dataset. The first stage involves a one-shot RL alignment using a single sample to adapt the model to the task format, while the second stage employs majority-voting rewards to align the model with the benchmark distribution. The results indicate that TTRA aligns models comparably to supervised fine-tuning (SFT) methods, but without requiring task-specific training data. The authors demonstrate that direct evaluations often underestimate the capabilities of base models, which perform significantly better post-alignment. Notably, for reasoning tasks, the performance gap between fine-tuned models and their base counterparts diminishes after alignment, suggesting that previously reported gains from RL and SFT methods may largely be artifacts of task familiarity rather than true improvements in reasoning ability.
Methodology
The authors introduce a two-stage test-time reinforcement learning alignment method (TTRA). The first stage involves a one-shot RL alignment using a single labeled example to adapt the model to the task format. The second stage applies majority-voting rewards on the benchmark to further align the model with the benchmark distribution, enabling evaluation without a dedicated training dataset.
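The second-stage reward needs no labels: the majority answer among sampled completions acts as a pseudo-label, and each sample is rewarded by agreement with it. A minimal sketch of that idea (the exact reward shaping in TTRA may differ):

```python
from collections import Counter

def majority_vote_rewards(samples):
    """Reward each sampled answer by whether it matches the batch majority.

    `samples` is a list of model answers for one prompt. No ground-truth
    labels are needed: the majority answer acts as a pseudo-label, which is
    the idea behind TTRA's second stage.
    """
    counts = Counter(samples)
    majority, _ = counts.most_common(1)[0]
    return [1.0 if s == majority else 0.0 for s in samples]

# Example: 5 sampled answers to the same question; "B" is the majority.
rewards = majority_vote_rewards(["B", "B", "A", "B", "C"])
print(rewards)
```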
Results
The TTRA method produces model rankings across benchmarks that agree with those of SFT-based approaches, but without the need for training data. It significantly reduces performance differences between base models and fine-tuned variants, particularly in reasoning tasks. In a clinical decision-making benchmark, TTRA reveals that base models, initially underestimated, can achieve accuracy comparable to human doctors after alignment.
Implications
The findings suggest that evaluation methodologies for LLMs need to account for task familiarity to provide a more accurate assessment of model capabilities. TTRA could be applied in various domains where task-specific training data is scarce, enhancing the evaluation of models in clinical, legal, and other specialized fields.
No More DeLuLu: Physics-Inspired Kernel Networks for Geometrically-Grounded Neural Computation
Theory
Efficient ML
Interpretability
- Introduction of the ⵟ-product kernel operator that combines alignment and proximity.
- Neural Matter Networks (NMNs) utilize the ⵟ-product, eliminating the need for separate activation functions.
- Empirical results show NMNs match linear classifiers on MNIST and outperform GPT-2 in language modeling.
- The framework offers a unified approach to kernel learning and gradient stability.
Read more
No More DeLuLu: Physics-Inspired Kernel Networks for Geometrically-Grounded Neural Computation
Summary
This paper introduces the ⵟ-product, a novel kernel operator that integrates quadratic alignment with inverse-square proximity, proving it to be a Mercer kernel with desirable properties such as analytic behavior, Lipschitz continuity on bounded domains, and self-regularization. The author presents Neural Matter Networks (NMNs) that utilize the ⵟ-product as their sole non-linearity, replacing traditional activation functions and normalization layers. This architectural innovation not only simplifies the network design but also maintains universal approximation capabilities while embedding normalization within the kernel itself. Empirical evaluations demonstrate that NMN-based classifiers achieve performance on par with linear baselines on the MNIST dataset, showcasing bounded prototype evolution and robustness in superposition. In language modeling tasks, Aether-GPT2, utilizing ⵟ-based attention and MLP blocks, outperforms GPT-2 in validation loss while maintaining a similar parameter budget. The framework unifies concepts from kernel learning, gradient stability, and information geometry, positioning NMNs as a principled alternative to conventional neural architectures.
Methodology
The ⵟ-product is defined mathematically as ⵟ(w, x) = (⟨w, x⟩^2) / (∥w - x∥^2 + ε), where it captures both alignment and proximity in a single computation. This formulation allows for a unique non-linearity that is inherently tied to geometric relationships, avoiding the need for traditional activation functions. NMNs are constructed using this kernel in its primal form, which eliminates the need for Gram matrix inversion, thus enhancing stability and efficiency.
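The formula above translates directly into code. A minimal numpy sketch of the ⵟ-product, showing how it rewards inputs that are simultaneously aligned with and close to the weight vector:

```python
import numpy as np

def yat_product(w, x, eps=1e-6):
    """ⵟ-product as defined in the paper:

        ⵟ(w, x) = <w, x>^2 / (||w - x||^2 + eps)

    Large when w and x are both aligned (numerator) and close
    (denominator); zero for orthogonal inputs regardless of distance.
    """
    num = np.dot(w, x) ** 2
    den = np.sum((w - x) ** 2) + eps
    return num / den

w = np.array([1.0, 0.0])
print(yat_product(w, np.array([1.0, 0.0])))  # coincident: very large
print(yat_product(w, np.array([0.0, 1.0])))  # orthogonal: zero
print(yat_product(w, np.array([2.0, 0.0])))  # aligned but farther: finite
```

Because the output is already bounded away from blow-ups by eps and shaped by geometry, no separate activation or normalization layer is applied in NMNs.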
Results
The NMN classifiers demonstrated performance comparable to linear models on the MNIST dataset, while Aether-GPT2 achieved lower validation loss than GPT-2, indicating the effectiveness of the ⵟ-based attention mechanism. The architecture also showed a reduction in memory usage by 15-25% and maintained infinite differentiability, making it suitable for physics-informed applications.
Implications
The introduction of the ⵟ-product and NMNs could lead to more efficient neural network architectures that are easier to interpret and require fewer resources. This approach may have significant implications for various applications in machine learning, particularly in fields that benefit from geometric interpretations, such as computer vision and natural language processing.
Causal Representation Learning with Optimal Compression under Complex Treatments
Theory
Efficient ML
- Introduces a novel generalization bound for multi-treatment causal representation learning.
- Proposes a consistent estimator for the optimal balancing weight α, eliminating heuristic tuning.
- Demonstrates O(1) scalability with the Treatment Aggregation strategy.
- Extends the framework to a generative architecture preserving Wasserstein geodesic structure.
Read more
Causal Representation Learning with Optimal Compression under Complex Treatments
Summary
This paper addresses the challenges of estimating Individual Treatment Effects (ITE) in multi-treatment scenarios, specifically focusing on hyperparameter selection and the curse of dimensionality. The authors derive a novel generalization bound and propose a theoretically grounded estimator for the optimal balancing weight, α, which eliminates the need for heuristic tuning. They investigate three balancing strategies: Pairwise, One-vs-All (OVA), and Treatment Aggregation, finding that while OVA performs well in low-dimensional settings, Treatment Aggregation ensures O(1) scalability. The framework is extended to a Multi-Treatment CausalEGM, a generative architecture that preserves the Wasserstein geodesic structure of the treatment manifold. Experiments on semi-synthetic and image datasets show that the proposed approach significantly outperforms traditional models in terms of estimation accuracy and efficiency, especially in large-scale intervention scenarios.
Methodology
The authors reformulate multi-treatment causal representation learning as a problem of optimal compression. They derive a generalization bound that formalizes the bias-information trade-off and propose a consistent estimator for the balancing weight α. The methodology includes three balancing strategies and extends to a generative model that maintains the structure of the treatment manifold.
Results
The proposed methods significantly improve estimation accuracy and efficiency compared to traditional models, particularly in scenarios with a large number of treatments. The Treatment Aggregation strategy maintains stable performance as the number of treatments grows, demonstrating O(1) complexity in the treatment count.
Implications
The findings have important implications for personalized medicine, policy evaluation, and targeted interventions, as they provide a more efficient and accurate method for estimating individualized causal effects in complex treatment scenarios.
Thermodynamics of Reinforcement Learning Curricula
Reinforcement Learning
Optimization
Theory
- Introduces a geometric framework for curriculum learning in reinforcement learning.
- Optimal curricula are shown to correspond to geodesics in a task manifold defined by reward parameters.
- Presents the 'MEW' algorithm for temperature annealing in maximum-entropy RL.
- Challenges the assumption of a flat task space in traditional RL approaches.
Read more
Thermodynamics of Reinforcement Learning Curricula
Summary
This paper explores the intersection of non-equilibrium thermodynamics and reinforcement learning (RL), specifically focusing on curriculum learning. The authors propose a geometric framework where reward parameters are treated as coordinates on a task manifold. By minimizing the excess thermodynamic work, they demonstrate that optimal curricula correspond to geodesics in this task space. The study introduces an algorithm called 'MEW' (Minimum Excess Work) for deriving a principled schedule for temperature annealing in maximum-entropy RL. The authors argue that traditional linear interpolation of tasks is insufficient, as it assumes a flat task space, and instead, they reveal a nontrivial geometry that characterizes the difficulty of adapting to new tasks. This framework not only provides insights into optimal reward parameter schedules but also unifies various phenomena in RL, such as reward shaping and simulated annealing. The paper emphasizes the importance of understanding the geometric structure of task spaces in RL to enhance learning efficiency and adaptability.
Methodology
The authors employ a statistical mechanics approach to analyze the space of parameterized reward functions in RL. They define task schedules as paths in this space and use a friction tensor to quantify the cost of adapting to new tasks. The analysis leads to a closed-form expression for optimal one-dimensional task schedules.
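The closed-form idea for one-dimensional schedules can be sketched numerically: an optimal protocol traverses equal increments of "thermodynamic length" sqrt(g(λ)) dλ per step, rather than equal increments of λ. The friction function below is a hypothetical stand-in, not the paper's.

```python
import numpy as np

def geodesic_schedule(g, lam_grid, n_steps):
    """Sketch of a 1D minimum-excess-work schedule.

    For a scalar task parameter with friction g(lam), the optimal protocol
    moves at constant thermodynamic-length speed: equal increments of
    sqrt(g(lam)) dlam per step. `g` here is an illustrative friction
    function, not one derived from an actual RL task.
    """
    s = np.sqrt(g(lam_grid))
    # Cumulative thermodynamic length along the grid (trapezoid rule).
    length = np.concatenate([[0.0], np.cumsum(0.5 * (s[1:] + s[:-1])
                                              * np.diff(lam_grid))])
    targets = np.linspace(0.0, length[-1], n_steps)
    # Invert: find the lam reaching each equal-length milestone.
    return np.interp(targets, length, lam_grid)

lam = np.linspace(0.0, 1.0, 1001)
# Friction grows with lam, so the schedule slows down near lam = 1.
sched = geodesic_schedule(lambda l: 1.0 + 50.0 * l**2, lam, 11)
print(np.round(sched, 3))
```

With a flat friction (constant g) this recovers linear interpolation, which is exactly the special case the paper argues is generally suboptimal.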
Results
The study finds that optimal reward parameter schedules minimize the path-dependent excess cost associated with adapting to new tasks, confirming that these schedules follow geodesics in the induced task space. The proposed MEW algorithm effectively provides a principled approach to temperature annealing in maximum-entropy RL.
Implications
The findings have significant implications for improving RL training methodologies, particularly in designing curricula that enhance learning efficiency. The geometric framework can inform the development of more effective RL algorithms and strategies for task interpolation, potentially leading to better performance in complex environments.
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
Reinforcement Learning
Large Language Models
Optimization
- Optimal number of parallel rollouts increases with compute budget and saturates at high levels.
- Scaling trends differ between easy and hard problems, with distinct underlying mechanisms.
- Performance is relatively insensitive to the number of unique problems per batch compared to rollouts per problem.
- Guidelines for compute allocation can help maximize performance in LLM RL training.
Read more
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
Summary
The paper addresses the challenge of scaling reinforcement learning (RL) for large language models (LLMs) by proposing a systematic approach to optimize the allocation of sampling compute during on-policy RL training. The authors investigate how to effectively distribute compute resources across three dimensions: parallel rollouts per problem, number of problems per batch, and sequential iterations. Through extensive experiments on various base models and problem distributions, they establish that the optimal number of rollouts per problem increases with the compute budget and eventually saturates. The findings reveal that while both easy and hard problems show similar scaling trends, they are driven by different mechanisms—solution sharpening for easy problems and coverage expansion for hard problems. The study provides practical guidelines for compute-efficient LLM RL post-training, emphasizing the importance of prioritizing rollouts over problem diversity as compute resources increase. This work contributes to the understanding of RL scaling laws and offers a framework for practitioners to maximize performance based on available compute resources.
Methodology
The authors conducted a series of experiments using three base models (Qwen2.5-7B-Instruct, Qwen3-4B-Instruct, and Llama 3.1-8B-Instruct) across various training configurations and problem distributions. They framed the compute allocation as a constrained optimization problem and analyzed the effects of parallel rollouts, problem batch size, and sequential iterations on performance metrics.
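The constrained-optimization framing can be sketched as a search over splits of a fixed per-iteration sample budget into rollouts per problem times problems per batch. The scoring function below is a toy stand-in with diminishing returns, not the paper's fitted scaling law.

```python
import math

def best_allocation(budget, candidate_rollouts, perf):
    """Enumerate (rollouts, problems) splits of a fixed sample budget.

    budget = rollouts_per_problem * problems_per_batch. `perf` is a
    hypothetical scoring function, standing in for empirically measured
    post-training performance.
    """
    best = None
    for k in candidate_rollouts:
        if budget % k:
            continue  # only exact splits of the budget
        n_problems = budget // k
        score = perf(k, n_problems)
        if best is None or score > best[0]:
            best = (score, k, n_problems)
    return best

# Toy model: strong diminishing returns in rollouts with eventual
# saturation, and only weak dependence on problem diversity -- echoing
# the paper's finding that rollouts matter more than unique problems.
toy_perf = lambda k, m: math.log(1 + k) - 0.02 * k + 0.1 * math.log(1 + m)
score, k, m = best_allocation(1024, [1, 2, 4, 8, 16, 32, 64], toy_perf)
print(k, m)
```

Under this toy model the optimum sits at an interior rollout count: more rollouts help until the saturation term dominates, mirroring the saturation the experiments observe at high budgets.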
Results
The results indicate that the compute-optimal number of rollouts per problem increases with the compute budget and saturates. The study found that increasing rollouts mitigates interference across problems, while the number of problems per batch has a marginal effect on performance. The findings were validated across approximately 120,000 H200-hours of RL experiments.
Implications
The findings have significant implications for practitioners in the field of reinforcement learning, particularly in optimizing resource allocation for LLM training. The established scaling laws can guide the design of more efficient RL algorithms and improve the performance of LLMs in various applications.
Disentangled Latent Dynamics Manifold Fusion for Solving Parameterized PDEs
Theory
Optimization
Time Series
- DLDMF provides a novel approach to generalizing neural surrogate models for parameterized PDEs.
- The framework utilizes a deterministic feed-forward mapping for encoding PDE parameters, avoiding unstable test-time auto-decoding.
- DLDMF integrates spatial, temporal, and parameter information into a cohesive latent representation.
- Extensive experiments validate DLDMF's superior performance in predictive accuracy and extrapolation robustness.
Read more
Disentangled Latent Dynamics Manifold Fusion for Solving Parameterized PDEs
Summary
This paper addresses the challenge of generalizing neural surrogate models for parameterized partial differential equations (PDEs) in scientific machine learning. Traditional models struggle with variations in PDE coefficients, leading to optimization failures and difficulties in temporal extrapolation. The authors propose a novel framework called Disentangled Latent Dynamics Manifold Fusion (DLDMF), which utilizes a space-time-parameter disentanglement strategy to enhance predictive accuracy and robustness. Unlike existing methods that rely on static time representations or computationally intensive auto-decoding, DLDMF encodes PDE parameters into a continuous latent representation through a deterministic feed-forward mapping. This representation initializes a continuous temporal latent state governed by a parameter-conditioned neural ordinary differential equation (Neural ODE). The framework also introduces a dynamic manifold fusion mechanism, integrating spatial coordinates, parameter embeddings, and time-evolving latent states to produce physically consistent spatiotemporal solutions. Extensive experiments demonstrate that DLDMF significantly outperforms state-of-the-art methods in predictive accuracy, parameter generalization, and long-term temporal extrapolation, showcasing its potential for solving complex PDEs in various applications.
Methodology
The authors developed DLDMF, which employs a space-time-parameter disentanglement strategy to encode PDE parameters into a continuous latent representation. This representation is used to initialize a temporal latent state governed by a Neural ODE, allowing for dynamic modeling of the PDE solutions. The framework also includes a manifold fusion mechanism to integrate various latent states and coordinates, ensuring the generation of physically consistent solutions.
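The temporal core of this design can be sketched with a forward-Euler rollout of a parameter-conditioned latent ODE. The vector field below is a hypothetical linear stand-in for the learned network, used only to show the interface.

```python
import numpy as np

def rollout_latent(z0, mu, f, t_grid):
    """Euler-integrate a parameter-conditioned latent ODE dz/dt = f(z, mu).

    z0: initial latent state (in DLDMF, produced by the feed-forward
    encoding of the PDE parameters mu). `f` stands in for the learned,
    parameter-conditioned vector field.
    """
    zs = [np.asarray(z0, dtype=float)]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        z = zs[-1]
        zs.append(z + (t1 - t0) * f(z, mu))
    return np.stack(zs)

# Hypothetical vector field whose damping depends on the parameter mu:
# different PDE coefficients yield different latent trajectories.
f = lambda z, mu: -mu * z
traj = rollout_latent([1.0, 2.0], 0.5, f, np.linspace(0.0, 1.0, 101))
print(traj[-1])
```

In the actual framework the integrated latent state is then fused with spatial coordinates and parameter embeddings by the manifold fusion mechanism to decode the full field.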
Results
The experimental results indicate that DLDMF significantly outperforms existing state-of-the-art methods in terms of predictive accuracy, parameter generalization, and robustness in long-term temporal extrapolation. The framework successfully maintains global geometric coherence, even under unseen parameter configurations.
Implications
The proposed DLDMF framework has significant implications for scientific machine learning, particularly in fields requiring accurate and efficient solutions to complex PDEs. Its ability to generalize across parameter variations and perform reliable long-term predictions could enhance simulations in fluid dynamics, climate modeling, and other scientific applications.
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
NLP
Large Language Models
Reinforcement Learning
- PISmith is a reinforcement learning-based framework for evaluating prompt injection defenses.
- The framework addresses reward sparsity issues in training attack LLMs through adaptive entropy regularization and dynamic advantage weighting.
- Extensive evaluations show that state-of-the-art defenses are vulnerable to adaptive attacks.
- PISmith consistently achieves higher attack success rates compared to seven baseline methods.
Read more
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
Summary
The paper addresses the security vulnerabilities posed by prompt injection attacks on large language models (LLMs), particularly in autonomous agents. The authors introduce PISmith, a novel reinforcement learning (RL)-based framework designed to evaluate and stress-test existing prompt injection defenses. The framework trains an attack LLM to optimize injected prompts in a black-box setting, where the attacker can only query the defended LLM and observe outputs. A significant challenge identified is the extreme reward sparsity encountered when strong defenses block most injected prompts, leading to suboptimal performance when using standard RL techniques like GRPO. To overcome this, PISmith incorporates adaptive entropy regularization and dynamic advantage weighting, which enhance exploration and learning from rare successful attacks. The authors conduct extensive evaluations across 13 benchmarks, revealing that current state-of-the-art defenses remain vulnerable to adaptive attacks. PISmith outperforms seven baseline methods in various attack categories and demonstrates strong performance in agentic settings against both open-source and closed-source LLMs.
Methodology
PISmith employs a reinforcement learning approach to train an attack LLM that generates effective injected prompts. It utilizes adaptive entropy regularization to maintain exploration and dynamic advantage weighting to enhance learning from successful attacks, addressing the challenges of reward sparsity and entropy collapse encountered in traditional methods.
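One way to read "dynamic advantage weighting" is that positive advantages are upweighted more aggressively when the group success rate is low, so rare successful attacks are not drowned out. The rule below is an illustrative assumption, not PISmith's exact formula.

```python
import numpy as np

def weighted_advantages(rewards, boost=4.0):
    """Group-relative advantages with extra weight on rare successes.

    With strong defenses most injected prompts fail (reward 0), so the few
    successes carry almost all the learning signal. This upweights positive
    advantages by a factor growing as the group success rate shrinks; the
    exact weighting in PISmith may differ -- this is a sketch.
    """
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()                      # group-relative baseline
    success_rate = max(r.mean(), 1e-3)
    weight = 1.0 + boost * (1.0 - success_rate)
    adv[adv > 0] *= weight                  # amplify rare successes
    return adv

# 1 success out of 8 rollouts: the lone success dominates the update.
print(weighted_advantages([0, 0, 0, 1, 0, 0, 0, 0]))
```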
Results
The evaluation of PISmith across 13 benchmarks indicates that existing prompt injection defenses are significantly vulnerable to adaptive attacks. PISmith consistently outperforms seven baseline methods in terms of attack success rates and shows robust performance in agentic environments against both open-source and closed-source LLMs.
Implications
The findings suggest that current defenses against prompt injection attacks are inadequate, highlighting the need for more robust security measures in LLM applications. PISmith can serve as a valuable tool for systematically assessing and improving the resilience of prompt injection defenses.
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
NLP
Large Language Models
Theory
- Adversarial prompt-injection attacks can significantly increase the attack success rate of LLMs.
- The scaling of attack success rate transitions from polynomial to exponential with increased inference-time samples.
- A theoretical model based on spin-glass theory provides insights into the behavior of LLMs under adversarial conditions.
- Short prompts lead to power-law scaling, while long prompts result in exponential scaling of attack success rates.
Read more
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Summary
This paper investigates the scaling laws of adversarial attacks, specifically prompt-injection attacks, on large language models (LLMs). The authors observe that the attack success rate (ASR) transitions from a polynomial growth to an exponential growth as the number of inference-time samples increases, particularly under adversarial prompt injection. To explain this phenomenon, they propose a theoretical generative model based on spin-glass theory, where the language model's behavior is likened to a spin-glass system operating in a replica-symmetry-breaking regime. The model captures how short and long injected prompts correspond to weak and strong magnetic fields, respectively, influencing the ASR. The authors derive analytical expressions for the ASR in both regimes and validate their findings empirically using various LLMs, including GPT-4.5 Turbo and Vicuna-7B v1.5. Their results indicate that the scaling behavior of ASR is influenced by the reasoning capabilities of the models, with implications for understanding and mitigating adversarial vulnerabilities in LLMs.
Methodology
The authors developed a generative model inspired by spin-glass theory, treating the language model as a system of spins. They analyzed the effects of prompt injections by comparing two models: a teacher model that defines safe and unsafe clusters and a student model influenced by adversarial prompts. The scaling of attack success rates was derived analytically and validated through empirical experiments on various LLMs.
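The two scaling regimes can be illustrated with a simplified best-of-N model: heavy-tailed per-prompt success probabilities produce slow, power-law-like growth in the number of samples, while a uniformly boosted success probability (a strong injected prompt) saturates exponentially fast. This independent-sample toy is a drastic simplification of the spin-glass analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def asr(per_prompt_p, n_samples):
    """Attack success rate under best-of-N: one success among N suffices."""
    p = np.asarray(per_prompt_p)
    return float(np.mean(1.0 - (1.0 - p) ** n_samples))

# Hypothetical regimes: heavy mass near zero vs. a uniformly boosted rate.
weak = rng.beta(0.05, 5.0, size=20000)   # most prompts nearly impossible
strong = np.full(20000, 0.2)             # adversarially boosted success

for n in (1, 10, 100, 1000):
    print(n, round(asr(weak, n), 3), round(asr(strong, n), 3))
```

In the strong regime the failure probability decays exactly as (1 − p)^N, i.e. exponentially in N, while the heavy-tailed mixture keeps ASR growing slowly long after the strong regime has saturated.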
Results
The study found that the attack success rate for LLMs exhibits polynomial scaling in the absence of adversarial prompts, while the introduction of such prompts leads to exponential scaling, particularly in weaker models. The theoretical model accurately predicted these behaviors, demonstrating the impact of prompt length on the scaling laws.
Implications
The findings highlight the vulnerabilities of safety-aligned LLMs to adversarial attacks, suggesting the need for improved defenses against prompt injection. Understanding these scaling laws can inform the design of more robust language models and enhance safety mechanisms in AI systems.
Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization
Reinforcement Learning
Robotics
Theory
- H-EARS unifies potential-based reward shaping with energy-aware action regularization.
- The framework achieves linear complexity by focusing on dominant energy components.
- Theoretical foundations include convergence guarantees and performance-modeling trade-offs.
- Empirical results show improved convergence speed and energy efficiency across benchmarks.
Read more
Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization
Summary
This paper presents Hybrid Energy-Aware Reward Shaping (H-EARS), a novel framework that integrates potential-based reward shaping with energy-aware action regularization to enhance deep reinforcement learning (DRL) in continuous control tasks. Traditional model-free DRL methods often require extensive exploration and can lead to high variance and low energy efficiency. H-EARS addresses these challenges by constraining action magnitudes while balancing task-specific and energy-based potentials through a theoretically grounded functional decomposition. The framework achieves linear modeling complexity by selectively capturing dominant energy components, making it applicable in real-world scenarios without requiring complete system models. The authors establish a robust theoretical foundation that includes functional independence for optimizing task performance and energy efficiency, energy-based convergence acceleration, and guarantees under function approximation. Empirical validation across various baseline algorithms shows consistent improvements in convergence speed, stability, and energy efficiency, particularly in high-fidelity vehicle simulations under challenging conditions. This work demonstrates that integrating lightweight physics priors can significantly enhance model-free reinforcement learning, facilitating its transition from research to practical industrial applications.
Methodology
The authors developed H-EARS by combining potential-based reward shaping with energy-aware action regularization. They established a theoretical framework that allows for separate optimization of task performance and energy efficiency while ensuring convergence and stability. The methodology includes functional decomposition to balance task-specific and energy potentials and empirical validation through experiments on various baseline algorithms.
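The combination of the two shaping terms can be sketched in one line: the classic potential-based term, which leaves the optimal policy unchanged, plus a quadratic energy penalty on the action. The potential function and weights below are hypothetical stand-ins for H-EARS's decomposition.

```python
import numpy as np

def shaped_reward(r, s, s_next, a, phi, gamma=0.99, lam=0.01):
    """Hybrid shaping: potential term plus energy-aware action penalty.

        r' = r + gamma * phi(s') - phi(s) - lam * ||a||^2

    The potential term preserves policy optimality (the classic PBRS
    result); the quadratic action cost discourages energy-hungry controls.
    `phi` and `lam` are illustrative, not H-EARS's learned components.
    """
    a = np.asarray(a, dtype=float)
    return r + gamma * phi(s_next) - phi(s) - lam * float(a @ a)

# Hypothetical potential: negative distance of the state to a goal at 0.
phi = lambda s: -abs(s)
print(shaped_reward(1.0, 2.0, 1.0, [0.5, 0.5], phi))
```

Keeping the two terms functionally separate is what lets task performance and energy efficiency be tuned independently, as the decomposition result above requires.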
Results
The experiments demonstrated that H-EARS consistently improved convergence speed, stability, and energy efficiency across multiple benchmarks. High-fidelity simulations of vehicle dynamics confirmed the framework's effectiveness in safety-critical scenarios, showcasing its practical applicability in real-world conditions.
Implications
The findings suggest that H-EARS can facilitate the deployment of deep reinforcement learning in industrial applications, particularly in domains requiring high reliability and energy efficiency, such as robotics and vehicle control. This approach may bridge the gap between theoretical research and practical implementation in complex systems.
As Language Models Scale, Low-order Linear Depth Dynamics Emerge
NLP
Large Language Models
Theory
- Low-order linear surrogates can accurately capture the depth dynamics of large language models.
- Agreement between linear surrogates and full models improves with increasing model size.
- The linear surrogate enables more efficient intervention strategies than standard heuristics.
- The study provides a systems-theoretic framework for analyzing transformer dynamics.
Read more
As Language Models Scale, Low-order Linear Depth Dynamics Emerge
Summary
This paper investigates the dynamics of transformer-based language models, particularly focusing on how their depth dynamics can be approximated by low-order linear surrogates as model size increases. The authors demonstrate that a 32-dimensional linear surrogate can effectively reproduce the layerwise sensitivity profiles of the GPT-2-large model across various tasks, including toxicity detection and sentiment analysis. They reveal a scaling principle where the agreement between the linear surrogate and the full model improves with model size, indicating that larger models allow for better linear approximations. Furthermore, the linear surrogate facilitates more efficient multi-layer interventions compared to traditional heuristic methods, suggesting a shift in understanding large language models. The findings imply that as language models scale, their local dynamics become increasingly tractable, providing a foundation for systematic analysis and control of these complex systems.
Methodology
The authors treat the depth of transformers as discrete time and the last-token hidden state as the system state. They identify a low-dimensional linear surrogate that approximates the state propagation through transformer blocks, allowing for the prediction of layerwise sensitivity and the impact of interventions introduced at various layers.
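The surrogate identification can be sketched as: pool hidden states across layers, project onto the top-k principal directions, and solve one least-squares problem for a shared linear transition matrix. Random data stands in for actual GPT-2 activations here.

```python
import numpy as np

def fit_depth_surrogate(H, k):
    """Fit z_{l+1} ≈ A z_l in a k-dimensional subspace.

    H: array (layers, tokens, d) of last-token hidden states per layer.
    A toy stand-in for the paper's 32-dimensional surrogate; the real
    procedure may differ in how the subspace and A are estimated.
    """
    L, n, d = H.shape
    flat = H.reshape(L * n, d)
    flat_c = flat - flat.mean(0)
    # Top-k principal directions of the pooled, centered states.
    _, _, Vt = np.linalg.svd(flat_c, full_matrices=False)
    P = Vt[:k]                      # (k, d) projection
    Z = H @ P.T                     # (L, n, k) reduced states
    X = Z[:-1].reshape(-1, k)       # states at layer l
    Y = Z[1:].reshape(-1, k)        # states at layer l + 1
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W.T, P                   # A = W.T, so z_{l+1} ≈ A @ z_l

rng = np.random.default_rng(0)
H = rng.normal(size=(12, 64, 128))  # fake 12-layer, 128-dim activations
A, P = fit_depth_surrogate(H, k=8)
print(A.shape, P.shape)
```

Once A is in hand, layerwise sensitivity follows from powers of A, and an intervention at layer l can be propagated to the output without running the full model.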
Results
The study finds that low-order linear surrogates can reproduce the layerwise sensitivity profiles of the full model with high fidelity. The agreement between the surrogate and the full model improves monotonically with the size of the GPT-2 family, indicating that larger models yield better linear approximations. Additionally, the linear surrogate supports more effective intervention strategies that require less energy than traditional methods.
Implications
The results suggest a new approach to studying large language models, emphasizing the potential for linear models to provide insights into the dynamics of complex systems. This could lead to more efficient methods for probing, monitoring, and intervening in transformer models, enhancing their controllability and interpretability.
Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness
Optimization
Federated Learning
Theory
- Introduction of Byz-NSGDM, a robust optimization method for distributed systems facing Byzantine attacks.
- The algorithm operates under the (L0, L1)-smoothness condition, which is more general than traditional L-smoothness.
- Proven convergence rate of O(K^{-1/4}) with a bias floor dependent on robustness and gradient heterogeneity.
- Empirical validation shows strong performance against various Byzantine attack strategies.
Read more
Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness
Summary
This paper addresses the challenges of distributed optimization in the presence of Byzantine attacks, particularly under the framework of (L0, L1)-smoothness, which generalizes standard L-smoothness by allowing state-dependent gradient Lipschitz constants. The authors propose a novel algorithm, Byz-NSGDM (Byzantine-Robust Normalized Stochastic Gradient Descent with Momentum), which integrates momentum normalization with Byzantine-robust aggregation techniques, specifically enhanced by Nearest Neighbor Mixing (NNM). The convergence analysis demonstrates that Byz-NSGDM achieves a convergence rate of O(K^{-1/4}) while being resilient to Byzantine biases influenced by the robustness coefficient and gradient heterogeneity. Experimental results validate the algorithm's effectiveness across various tasks, including heterogeneous MNIST classification, synthetic (L0, L1)-smooth optimization, and character-level language modeling with a small GPT model, showcasing its robustness against multiple Byzantine attack strategies. An ablation study further confirms the stability of Byz-NSGDM across a range of momentum and learning rate configurations.
Methodology
The authors develop Byz-NSGDM, which combines normalized stochastic gradient descent with momentum and Byzantine-robust aggregation methods. The algorithm employs Nearest Neighbor Mixing to mitigate the effects of gradient heterogeneity among workers. The theoretical analysis focuses on establishing convergence guarantees under the (L0, L1)-smoothness assumption, leveraging recent advances in optimization techniques.
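A single server-side step can be sketched as: update each worker's momentum buffer, aggregate the buffers with a robust rule, and take a normalized step. Coordinate-wise median stands in for the paper's NNM-enhanced aggregator here, as an illustrative simplification.

```python
import numpy as np

def byz_nsgdm_step(x, momenta, grads, beta=0.9, lr=0.01):
    """One sketch-step of Byzantine-robust normalized SGD with momentum.

    Each worker keeps a momentum buffer; the server aggregates the worker
    momenta with a robust rule (coordinate-wise median here, whereas the
    paper combines a robust aggregator with Nearest Neighbor Mixing) and
    takes a *normalized* step, which is what suits (L0, L1)-smoothness.
    """
    for i, g in enumerate(grads):
        momenta[i] = beta * momenta[i] + (1.0 - beta) * g
    agg = np.median(np.stack(momenta), axis=0)   # robust aggregation
    norm = np.linalg.norm(agg)
    step = agg / norm if norm > 0 else agg       # gradient normalization
    return x - lr * step, momenta

# 5 honest workers and 2 Byzantine workers sending huge opposite gradients.
x = np.array([1.0, 1.0])
momenta = [np.zeros(2) for _ in range(7)]
grads = [x.copy()] * 5 + [np.array([-100.0, -100.0])] * 2
x_new, momenta = byz_nsgdm_step(x, momenta, grads)
print(x_new)  # still moves toward the origin despite the attackers
```

Note that the step length is always exactly lr regardless of gradient scale, which is the normalization property the state-dependent smoothness analysis relies on.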
Results
Byz-NSGDM achieves a convergence rate of O(K^{-1/4}) under Byzantine attacks, with experimental results demonstrating its effectiveness in heterogeneous settings and against various attack strategies. The method outperforms traditional momentum-based baselines without normalization, indicating its robustness and adaptability.
Implications
The findings suggest that Byz-NSGDM can enhance the reliability of distributed optimization in federated learning environments, particularly where data privacy and security are paramount. This work opens avenues for further research into robust optimization techniques that can handle more complex smoothness conditions and adversarial scenarios.
L2GTX: From Local to Global Time Series Explanations
Time Series
- L2GTX is a fully model-agnostic method for generating global explanations in time series classification.
- The method aggregates local explanations from a selective set of instances to create class-wise global insights.
- L2GTX effectively reduces redundancy in explanations by merging local clusters and constructing an instance-cluster importance matrix.
- Experimental results indicate that L2GTX maintains high interpretability and global faithfulness across different datasets.
Read more
L2GTX: From Local to Global Time Series Explanations
Summary
The paper introduces L2GTX, a novel model-agnostic method designed to generate class-wise global explanations for time series classification by aggregating local explanations derived from a selective set of time series instances. The authors identify three main limitations in existing time series explanation methods: the inadequacy of model-agnostic XAI methods for time series data, the underexploration of global explanation synthesis for time series, and the model-specific nature of existing global approaches. L2GTX addresses these challenges by first extracting local explanations using LOMATCE, which identifies parameterized temporal event primitives such as trends and local extrema. These local explanations are clustered and merged to reduce redundancy, and an instance-cluster importance matrix is constructed to evaluate global relevance. The method selects representative instances under a user-defined budget to maximize coverage of influential clusters, ultimately aggregating the events from these instances into concise global explanations. Experimental results on six benchmark datasets demonstrate that L2GTX produces interpretable global explanations while maintaining stable global faithfulness across varying levels of explanation consolidation.
Methodology
L2GTX employs a two-step process: first, it extracts local explanations using LOMATCE to identify parameterized temporal event primitives. Then, it consolidates these local explanations into clusters, constructs an instance-cluster importance matrix, and selects representative instances to maximize coverage of influential clusters. Finally, it aggregates the relevant events into concise global explanations.
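The budgeted selection step can be sketched as a greedy coverage routine over the instance-cluster importance matrix. This is an illustrative reading, not the paper's exact algorithm: the function name, the marginal-gain rule, and the threshold are assumptions.

```python
import numpy as np

def select_representatives(importance, budget, threshold=0.0):
    """Greedily pick instances that together cover the largest total
    importance across clusters (toy version of L2GTX's budgeted
    representative selection)."""
    importance = np.asarray(importance, dtype=float)
    n, k = importance.shape
    covered = np.zeros(k, dtype=bool)
    chosen = []
    for _ in range(min(budget, n)):
        # marginal gain: importance contributed in still-uncovered clusters
        gains = importance[:, ~covered].sum(axis=1)
        gains[chosen] = -np.inf          # never re-pick an instance
        best = int(np.argmax(gains))
        if gains[best] <= threshold:     # nothing useful left to cover
            break
        chosen.append(best)
        covered |= importance[best] > threshold
    return chosen
```

Stopping early once no instance adds coverage keeps the explanation compact even when the budget is generous.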
Results
The experiments conducted on six benchmark time series datasets show that L2GTX produces compact and interpretable global explanations while maintaining stable global faithfulness, as measured by mean local surrogate fidelity (R²), across various levels of explanation consolidation.
Implications
The findings suggest that L2GTX can enhance the interpretability of time series classifiers, making it easier for practitioners to understand model decisions in critical applications such as finance and healthcare. This model-agnostic approach could foster greater trust and accountability in AI systems that rely on time series data.
Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs
Multimodal
Large Language Models
Generative Models
- FEYNMAN effectively decouples knowledge elicitation from visual production, enhancing diagram generation.
- The agent generated over 100,000 well-aligned diagram-caption pairs at a low cost.
- A new benchmark, DIAGRAMMA, was created for evaluating visual reasoning in multi-modal models.
- The use of PENROSE allows for diverse and semantically consistent diagram rendering.
Summary
This paper introduces FEYNMAN, a novel diagramming agent designed to generate high-quality, knowledge-infused diagrams at scale. The authors identify a significant gap in the availability of well-aligned image-text pairs necessary for training multi-modal AI systems. FEYNMAN addresses this by decoupling the processes of knowledge elicitation and visual production. It first enumerates domain-specific knowledge components and formulates a plan to translate these into simple declarative programs. The diagrams are then rendered using the PENROSE system, which optimizes visual semantics while introducing randomness for diversity. The authors successfully synthesized a dataset of over 100,000 diagram-caption pairs and created a benchmark, DIAGRAMMA, for evaluating visual reasoning capabilities in vision-language models. The paper highlights the efficiency of the FEYNMAN pipeline, achieving significant output with minimal cost and time, and discusses the implications for future multi-modal AI systems.
Methodology
FEYNMAN operates by first eliciting domain-specific knowledge from large language models (LLMs) and then translating this knowledge into visual representations through a structured pipeline. The process involves generating declarative programs that are iteratively refined based on feedback, ultimately rendered using the PENROSE diagramming system.
Results
The FEYNMAN agent successfully produced 10,693 knowledge-infused programs, resulting in 106,930 well-aligned diagram-caption pairs. This was achieved while processing roughly 1.55 billion tokens at a cost of under $400, demonstrating the agent's efficiency and scalability.
Implications
The development of FEYNMAN has significant implications for the fields of visual design and multi-modal AI, particularly in enhancing the capabilities of models to understand and generate diagrams. The released dataset and benchmark can facilitate further research in visual reasoning and diagram synthesis, potentially benefiting educational tools, scientific visualization, and automated content generation.
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
Multimodal
- Cornserve is the first distributed serving system specifically designed for Any-to-Any multimodal models.
- It offers a flexible task abstraction for expressing complex computation graphs in Python.
- The system enables model fission, allowing independent scaling of model components.
- Cornserve utilizes a record-and-replay execution model for efficient tensor data forwarding.
Summary
The paper introduces Cornserve, a distributed serving system designed for Any-to-Any multimodal models that can process and generate various combinations of data types, including text, images, videos, and audio. The challenge in serving these models arises from the heterogeneous nature of requests, which can traverse different paths through the model's computation graph, and the varying scaling characteristics of model components. Cornserve addresses these challenges by providing a flexible task abstraction that allows developers to express computation graphs in Python, enabling model fission to disaggregate components for independent scaling, and implementing a distributed runtime that utilizes a record-and-replay execution model for efficient data handling. Built on Kubernetes, Cornserve has been implemented with approximately 23,000 lines of Python code and supports a range of Any-to-Any models. The system demonstrates significant performance improvements, achieving up to 3.81 times higher throughput and 5.79 times lower tail latency compared to existing solutions.
Methodology
Cornserve employs a distributed architecture built on Kubernetes, utilizing a flexible task abstraction for model computation. It implements model fission to allow independent scaling of components and uses a record-and-replay execution model to manage data dependencies and optimize tensor data forwarding between components.
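The record-and-replay idea can be illustrated with a toy tracer: the first request through a task records which components it visits, and later requests replay that plan without re-tracing. This is loosely modeled on the description above; all class and component names here are invented for illustration.

```python
class Recorder:
    """Toy record-and-replay execution: trace the component calls a
    request actually makes, then reuse the recorded plan."""
    def __init__(self):
        self.plan = []                     # recorded (name, fn) sequence
    def call(self, name, fn, x):
        self.plan.append((name, fn))       # record the component invocation
        return fn(x)                       # ...and execute it normally

def replay(plan, x):
    """Re-run a recorded component sequence on a new input."""
    for _, fn in plan:
        x = fn(x)
    return x
```

In the real system the replayed plan routes tensor data between disaggregated model components; here the "components" are just functions.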
Results
Cornserve achieves up to 3.81 times higher throughput and 5.79 times lower tail latency compared to existing serving systems, demonstrating its effectiveness in handling Any-to-Any multimodal models.
Implications
The development of Cornserve has significant implications for the deployment of multimodal AI applications, enabling more efficient and scalable serving of complex models that can process diverse data types. This can enhance the performance of applications in fields such as multimedia content generation, interactive AI systems, and cross-modal understanding.
Hierarchical Reference Sets for Robust Unsupervised Detection of Scattered and Clustered Outliers
Graph Learning
- Introduces a novel outlier detection paradigm using graph structures.
- Distinguishes between scatterliers and clusterliers for better anomaly detection.
- Utilizes hierarchical reference sets for local and global anomaly evaluation.
- Demonstrates effectiveness through extensive experiments and performance analysis.
Summary
This paper addresses the challenges of unsupervised outlier detection in Internet of Things (IoT) data, particularly focusing on two types of outliers: scattered outliers (scatterliers) and clustered outliers (clusterliers). The authors propose a novel detection paradigm that utilizes graph structures to leverage natural neighboring relationships, allowing for a multi-perspective evaluation of anomalies. The method distinguishes between scattered outliers, which are isolated and deviant, and clustered outliers, which form dense micro-clusters that can obscure detection due to their masking effect. The proposed approach incorporates hierarchical reference sets at both local and global scales, enabling effective recognition of scattered outliers while isolating clustered groups. Extensive experiments validate the method's efficacy through comparative performance analysis, ablation studies, and evaluations on downstream clustering tasks, demonstrating its robustness in various scenarios.
Methodology
The authors developed a graph-based approach that employs hierarchical reference sets to evaluate anomalies from multiple perspectives. This method allows for the effective identification of scattered outliers while simultaneously addressing the challenges posed by clustered outliers, which can mask individual anomalies due to their spatial proximity.
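A minimal two-scale scoring sketch conveys the intuition: a local score from each point's nearest neighbors (sensitive to scattered outliers) and a global score from the full sample (sensitive to displaced micro-clusters), combined on a common rank scale. The paper's graph construction is more involved; the specific distances and the 50/50 combination below are assumptions.

```python
import numpy as np

def hierarchical_outlier_scores(X, k=3):
    """Toy local/global reference-set scoring: mean kNN distance (local)
    plus distance to the global centroid (global), fused via ranks."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-distance
    local = np.sort(d, axis=1)[:, :k].mean(axis=1)       # local reference set
    global_ = np.linalg.norm(X - X.mean(axis=0), axis=1)  # global reference set
    def rank01(s):                          # normalize ranks to [0, 1]
        return np.argsort(np.argsort(s)) / max(n - 1, 1)
    return 0.5 * rank01(local) + 0.5 * rank01(global_)
```

Rank fusion keeps the two scales comparable without tuning a distance threshold.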
Results
The proposed method showed superior performance in detecting both scatterliers and clusterliers compared to existing techniques. The experiments included comparative analyses and ablation studies that confirmed the robustness and effectiveness of the approach across various datasets and scenarios.
Implications
This research has significant implications for IoT applications, enhancing the reliability of anomaly detection systems. By effectively distinguishing between different types of outliers, the proposed method can improve the accuracy of real-time monitoring and decision-making processes in dynamic environments.
RXNRECer Enables Fine-grained Enzymatic Function Annotation through Active Learning and Protein Language Models
NLP
Large Language Models
Interpretability
- RXNRECer directly predicts enzyme-catalyzed reactions, bypassing the limitations of EC number reliance.
- The framework integrates protein language modeling and active learning for enhanced prediction accuracy.
- Significant performance improvements were observed over traditional EC-based methods.
- RXNRECer supports scalable annotation and provides interpretable prediction rationales.
Summary
The paper introduces RXNRECer, a novel transformer-based ensemble framework designed to enhance the annotation of enzymatic functions by directly predicting enzyme-catalyzed reactions without relying on the traditional Enzyme Commission (EC) numbers. The authors highlight the limitations of existing methods that use EC numbers, which often lead to ambiguities and inconsistencies due to the complex relationships between proteins, EC numbers, and biochemical reactions. RXNRECer integrates protein language modeling and active learning to effectively capture both high-level sequence semantics and fine-grained transformation patterns. The framework was evaluated against six EC-based baselines, showing significant improvements with a 16.54% increase in F1 score and a 15.43% increase in accuracy. Additionally, RXNRECer demonstrates advantages in scalable proteome-wide reaction annotation, specificity in refining reaction schemas, and systematic annotation of previously uncurated proteins. The incorporation of large language models also allows for interpretable predictions, making RXNRECer a robust solution for enzyme function prediction with potential applications in various fields of enzyme research and industrial applications.
Methodology
RXNRECer employs a transformer-based dynamic ensemble learning framework that combines protein language modeling to capture sequence semantics, an active learning strategy for targeted fine-tuning, and a dynamic ensemble module to enhance robustness in reaction-level predictions. It also includes an interpretability component powered by a general-purpose language model.
Results
The evaluations demonstrated that RXNRECer consistently outperformed traditional multiple sequence alignment (MSA)-based approaches, EC-based tools, and recent protein language model (PLM)-based methods, achieving a 16.54% increase in F1 score and a 15.43% increase in accuracy on curated datasets.
Implications
RXNRECer has the potential to revolutionize enzyme function annotation by providing a more accurate, scalable, and interpretable method for predicting enzymatic reactions. Its applications extend to proteome-wide annotations, substrate-specific function resolution, and the analysis of uncharacterized proteins, thereby facilitating advancements in metabolic engineering and biopharmaceutical development.
Beyond the Class Subspace: Teacher-Guided Training for Reliable Out-of-Distribution Detection in Single-Domain Models
Computer Vision
- Identification of Domain-Sensitivity Collapse (DSC) as a critical failure mode in single-domain OOD detection.
- Introduction of Teacher-Guided Training (TGT) to enhance domain sensitivity during training.
- Demonstration of significant improvements in OOD detection performance without increasing inference costs.
- Validation of TGT across multiple benchmarks, showing consistent reductions in false positive rates.
Summary
This paper addresses the challenge of out-of-distribution (OOD) detection in single-domain models, which often suffer from a phenomenon termed Domain-Sensitivity Collapse (DSC). DSC occurs when supervised training compresses features into a low-rank class subspace, leading to a loss of sensitivity to domain shifts. The authors propose a novel approach called Teacher-Guided Training (TGT), which leverages a frozen multi-domain teacher model (DINOv2) to guide the training of a student model. TGT distills class-suppressed residual structures from the teacher into the student, enhancing the model's ability to detect OOD samples without incurring additional inference overhead. The paper demonstrates that TGT significantly reduces false positive rates for distance-based OOD detection methods across eight single-domain benchmarks while maintaining or slightly improving in-domain classification accuracy.
Methodology
The authors formalize the concept of Domain-Sensitivity Collapse (DSC) and propose Teacher-Guided Training (TGT) as a solution. TGT involves using a frozen multi-domain teacher model to transfer domain-sensitive features to a student model during training. The teacher's class-discriminative directions are projected out, and the student is supervised on the remaining residual features through an auxiliary domain head, which is discarded after training.
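The "project out class-discriminative directions" step can be sketched with class means as a stand-in for those directions; the paper's actual choice may differ. Assuming the class-mean subspace, the residual target is:

```python
import numpy as np

def class_suppressed_residual(teacher_feats, labels):
    """Remove the span of per-class mean directions from teacher
    features; the remaining residual carries the domain-sensitive
    structure distilled into the student (simplified sketch)."""
    F = np.asarray(teacher_feats, dtype=float)
    labels = np.asarray(labels)
    means = np.stack([F[labels == c].mean(axis=0) for c in np.unique(labels)])
    Q, _ = np.linalg.qr(means.T)      # orthonormal basis of class subspace
    proj = Q @ (Q.T @ F.T)            # component inside the class subspace
    return F - proj.T                 # class-suppressed residual
```

By construction the residual is orthogonal to every class-mean direction, so it cannot encode the class label through those directions.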
Results
TGT resulted in substantial reductions in false positive rates at 95% recall (FPR@95) for distance-based OOD detection methods: MDS improved by 11.61 percentage points, ViM by 10.78 percentage points, and kNN by 12.87 percentage points on average with a ResNet-50 backbone. The approach also closed the gap to a teacher-feature oracle while maintaining or slightly improving in-domain classification accuracy.
Implications
The findings suggest that TGT can be effectively applied in practical single-domain OOD detection scenarios, such as medical imaging and industrial inspection, where models are often trained on limited, homogeneous datasets. This could lead to more reliable deployment of machine learning systems in critical applications.
DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification
Computer Vision
- DirPA method effectively mitigates prior shifts in imbalanced few-shot learning scenarios.
- The study evaluates DirPA across eight European countries, demonstrating cross-dataset stability.
- A strong correlation exists between class imbalance and performance improvements with DirPA.
- DirPA enhances hierarchical classification accuracy, ensuring reliable land-cover identification.
Summary
This paper addresses the challenges of class imbalance and high label acquisition costs in agricultural monitoring, particularly in the context of few-shot learning (FSL). The authors extend their previously introduced Dirichlet Prior Augmentation (DirPA) method to mitigate prior shifts during model training, which can lead to poor generalization in real-world agricultural tasks. By applying DirPA across a diverse dataset from eight European countries, the study demonstrates that the method enhances model robustness and improves class-specific performance. The findings indicate a strong correlation between dataset structural class imbalance and performance gains, as well as improved hierarchical accuracy in land-cover identification, even under data scarcity. This work emphasizes the importance of addressing prior shifts in FSL to achieve reliable agricultural monitoring.
Methodology
The authors employed the Dirichlet Prior Augmentation (DirPA) method to proactively address prior shifts during training in few-shot learning. They conducted experiments on a large-scale dataset covering multiple European countries, analyzing the impact of class imbalance on model performance and robustness.
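One illustrative reading of Dirichlet Prior Augmentation: each training step draws a plausible label prior from a Dirichlet and converts it into per-class loss weights, so the model never overfits one fixed class frequency. The inverse-prior weighting below is an assumption, not the paper's exact rule.

```python
import numpy as np

def dirichlet_prior_weights(n_classes, alpha=1.0, rng=None):
    """Sample a simulated class prior and derive per-class loss weights
    (normalized to mean 1). Hypothetical sketch of DirPA-style
    prior-shift augmentation."""
    rng = rng if rng is not None else np.random.default_rng()
    prior = rng.dirichlet(np.full(n_classes, alpha))   # simulated label prior
    w = 1.0 / np.maximum(prior, 1e-8)                  # down-weight frequent classes
    return prior, w * n_classes / w.sum()
```

A small `alpha` yields highly imbalanced priors, stressing the model under extreme long-tailed shifts.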
Results
The results showed that DirPA significantly improved classification accuracy and stability across different geographical regions. The method enhanced individual class performance and demonstrated resilience to extreme long-tailed distributions, leading to better generalization in agricultural monitoring tasks.
Implications
The findings suggest that DirPA can be a valuable tool for improving few-shot learning in agricultural applications, particularly in scenarios with limited labeled data. This could enhance the effectiveness of remote sensing technologies for crop monitoring, yield estimation, and other agricultural tasks, contributing to food security efforts.
Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency
NLP
Large Language Models
Theory
- GER-steer provides a training-free solution for refining activation steering in LLMs.
- The method utilizes the first principal component of tangent semantic directions to enhance steering robustness.
- Extensive evaluations show GER-steer outperforms existing baselines across various tasks and models.
- The framework ensures consistent control without requiring manual tuning or heuristic layer selection.
Summary
This paper introduces Global Evolutionary Refined Steering (GER-steer), a novel framework designed to enhance the control of Large Language Models (LLMs) through activation steering. Traditional methods for deriving steering vectors often suffer from high-dimensional noise and semantic drift, leading to ineffective model alignment. GER-steer addresses these issues by leveraging the geometric stability of the network's representation evolution to refine raw steering vectors. By focusing on the first principal component of the tangent semantic direction across layers, GER-steer effectively decouples robust semantic intent from noise, ensuring consistent steering performance across different model architectures and tasks. The authors validate GER-steer through extensive experiments on multiple models and domains, demonstrating its superiority over existing methods in terms of efficacy and generalization without the need for layer-specific tuning.
Methodology
The authors propose GER-steer, which refines raw steering vectors by extracting a Global Evolutionary Direction from the network's internal representation evolution. This approach mitigates the influence of noise on steering direction estimation by enhancing the component of the raw steering vector along the globally consistent direction, ensuring robust and stable control across different layers of the model.
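The refinement step can be sketched as follows: take per-layer tangent directions (differences between consecutive layers' activations), extract their first principal component as the globally consistent direction, and amplify the raw steering vector's component along it. The exact tangent construction and the boost factor are assumptions.

```python
import numpy as np

def refine_steering_vector(raw_v, layer_states, boost=2.0):
    """Amplify the part of a raw steering vector aligned with the first
    principal component of cross-layer activation differences (sketch)."""
    H = np.asarray(layer_states, dtype=float)     # (n_layers, d) activations
    tangents = np.diff(H, axis=0)                 # cross-layer evolution
    tangents -= tangents.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(tangents, full_matrices=False)
    g = vt[0]                                     # first principal direction
    coef = raw_v @ g
    return raw_v + (boost - 1.0) * coef * g       # boost the consistent part
```

The component orthogonal to the consistent direction (treated here as noise) passes through unchanged.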
Results
The experimental results indicate that GER-steer consistently improves performance across three different models (Qwen-2.5-7B, Llama-3.1-8B-Instruct, and Gemma-2-9B-it) and five distinct domains, including safety alignment and sentiment control. The method exhibits superior domain generalization and transferability, confirming its effectiveness as a universal solution for model alignment.
Implications
The findings suggest that GER-steer can be applied to enhance the alignment of LLMs with human intent in various applications, potentially improving the safety and reliability of AI systems. Its training-free nature and robustness make it a valuable tool for practitioners working with large-scale language models.
Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection
Time Series
- AxonAD introduces a novel approach to anomaly detection by focusing on predictable query dynamics in multi-head attention mechanisms.
- The model effectively captures structural dependency shifts in multivariate time series data, addressing limitations of traditional residual-based detectors.
- A dual scoring mechanism combines reconstruction error with a query mismatch score to enhance sensitivity to anomalies.
- Extensive evaluations show significant improvements in anomaly detection performance over existing methods.
Summary
The paper presents AxonAD, an unsupervised anomaly detection method for multivariate time series data, particularly in the context of autonomous driving telemetry. Traditional anomaly detection techniques often fail to identify anomalies that manifest as shifts in cross-channel dependencies rather than simple amplitude changes. AxonAD addresses this limitation by treating the evolution of multi-head attention query vectors as a predictable process over short time horizons. The model consists of two main components: a reconstruction pathway that utilizes bidirectional self-attention to reconstruct input windows, and a history-only predictor that forecasts future query vectors based on past context. This dual approach allows AxonAD to capture structural dependency shifts while maintaining sensitivity to amplitude-level anomalies. The model is trained using a masked cosine loss against an exponential moving average (EMA) target encoder, and during inference, it combines reconstruction error with a query mismatch score to produce a final anomaly score. Evaluations on proprietary in-vehicle telemetry data and the TSB-AD multivariate suite demonstrate that AxonAD outperforms strong baselines in terms of ranking quality and temporal localization, confirming the effectiveness of query prediction and combined scoring in enhancing anomaly detection capabilities.
Methodology
AxonAD employs a two-pathway architecture: a reconstruction pathway using bidirectional self-attention for input reconstruction, and a history-only predictor that forecasts future query vectors. The model is trained with a masked cosine loss against an EMA target encoder, and during inference, it combines reconstruction error with a tail-aggregated query mismatch score to produce the final anomaly score.
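The dual score can be sketched as a reconstruction error plus a tail-aggregated cosine mismatch between predicted and observed query vectors. The tail fraction and the mixing weight `lam` are illustrative choices, not values from the paper.

```python
import numpy as np

def anomaly_score(x, x_hat, q_pred, q_true, tail_frac=0.25, lam=1.0):
    """Window-level score: mean reconstruction MSE plus the mean of the
    largest per-step query mismatches (sketch of the combined scoring)."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    recon = np.mean((x - x_hat) ** 2, axis=-1)            # per-step MSE
    qp, qt = np.asarray(q_pred, float), np.asarray(q_true, float)
    cos = np.sum(qp * qt, axis=-1) / (
        np.linalg.norm(qp, axis=-1) * np.linalg.norm(qt, axis=-1) + 1e-8)
    mismatch = 1.0 - cos                                  # query "surprise"
    k = max(1, int(tail_frac * mismatch.shape[-1]))
    tail = np.sort(mismatch, axis=-1)[..., -k:].mean(axis=-1)
    return recon.mean(axis=-1) + lam * tail
```

Tail aggregation keeps the score sensitive to a few sharply mispredicted steps that a plain average would wash out.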
Results
AxonAD demonstrated improved ranking quality and temporal localization on both proprietary in-vehicle telemetry data and the TSB-AD multivariate suite, outperforming strong baseline models. Ablation studies confirmed that the query prediction and the combined scoring mechanism were critical for the observed performance gains.
Implications
The findings suggest that AxonAD can be effectively utilized for real-time anomaly detection in complex systems like autonomous vehicles, where understanding the coordination between different telemetry channels is crucial for safety and performance monitoring.
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
Theory
- The failure of ideal noise correction methods is not solely due to T estimation issues.
- Controlled experiments with a perfect transition matrix still show performance collapse.
- A unified analysis links macroscopic, microscopic, and information-theoretic perspectives.
- The study provides insights into the inherent instabilities of noise correction methods.
Summary
This paper investigates the shortcomings of statistically consistent methods for Learning with Noisy Labels (LNL), particularly those relying on the noise transition matrix (T). While these methods theoretically guarantee convergence to the optimal classifier, they often underperform compared to empirical approaches. The authors challenge the prevailing assumption that the failure of noise correction methods is primarily due to difficulties in estimating T. Through controlled experiments using a perfect oracle transition matrix, they demonstrate that even under ideal conditions, noise correction methods like Forward Correction (FC) still experience performance collapse during training. This indicates that the issue lies not in T estimation but in deeper flaws within the correction methods themselves. The authors provide a unified analysis linking macroscopic convergence states, microscopic optimization dynamics, and information-theoretic limits, revealing that the failure of ideal noise correction is rooted in inherent instabilities and unavoidable information loss. The findings offer a comprehensive theoretical framework for understanding these failures and suggest directions for developing more reliable LNL solutions.
Methodology
The authors conducted controlled experiments using an oracle transition matrix to isolate the effects of noise correction methods. They performed a three-part analysis: a macroscopic analysis of convergence states, a microscopic analysis of optimization dynamics, and an information-theoretic analysis of information loss due to noise corruption.
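For reference, Forward Correction with a known (oracle) transition matrix is the standard construction: cross-entropy against T-mixed clean-posterior predictions, so the corrected objective targets the noisy labels directly. A minimal sketch, with T[i, j] = P(noisy label j | clean label i):

```python
import numpy as np

def forward_corrected_loss(logits, noisy_labels, T):
    """Forward Correction: softmax gives the clean-label posterior p;
    q = p @ T is the implied noisy-label distribution; cross-entropy is
    taken against the observed noisy labels."""
    logits = np.asarray(logits, float)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)       # clean-label posterior
    q = p @ np.asarray(T, float)            # predicted noisy-label dist.
    n = len(q)
    return -np.mean(np.log(q[np.arange(n), noisy_labels] + 1e-12))
```

With T equal to the identity this reduces to ordinary cross-entropy; the paper's point is that even the oracle T version of this objective collapses during training.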
Results
The results showed that even with a perfect transition matrix, the Forward Correction method exhibited a rise-and-fall dynamic in performance, ultimately converging to poor results similar to uncorrected training. This suggests that the limitations of noise correction methods are not merely due to T estimation but are rooted in deeper issues related to the correction objectives.
Implications
The findings imply that future research should focus on understanding the fundamental limitations of noise correction methods and developing new strategies that address these issues. This could lead to more robust solutions for learning with noisy labels, which is critical in various applications where label accuracy is compromised.
When Drafts Evolve: Speculative Decoding Meets Online Learning
NLP
Large Language Models
Efficient ML
- Introduction of OnlineSPEC framework that combines speculative decoding with online learning.
- Establishment of a theoretical link between acceleration rates and online learning performance.
- Development of novel algorithms leveraging interactive feedback for draft model refinement.
- Demonstrated up to 24% speedup over existing methods while preserving output quality.
Summary
This paper introduces OnlineSPEC, a novel framework that integrates speculative decoding with online learning to enhance the performance of draft models in large language model (LLM) inference. Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens, which are then verified by a larger target model. However, existing draft models often fail to approximate the target distribution due to limited capacity, leading to shorter acceptance lengths and reduced speedup. The authors propose leveraging the feedback from the verification process to iteratively refine the draft model, creating a continuous 'draft commits–feedback provides–draft adapts' loop. This approach aligns with the principles of online learning, where decisions are made iteratively based on feedback. The framework is grounded in dynamic regret minimization, establishing a formal link between the acceleration rate of the speculative system and online learning performance. The authors develop several algorithms, including optimistic online learning and online ensemble learning, which demonstrate improved acceleration rates and theoretical justifications. Experimental results show that OnlineSPEC achieves up to 24% speedup over existing state-of-the-art methods across seven benchmarks and three foundation models, while maintaining the quality of outputs.
Methodology
The authors propose the OnlineSPEC framework, which formulates the interaction between draft and target models as an online learning problem. They develop algorithms such as Online-LR, Opt-Hydra, and Ens-Eagle, which utilize techniques like online gradient descent, optimistic online learning, and online ensemble learning to adaptively refine draft models based on feedback from the target model.
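The "draft commits, feedback provides, draft adapts" cycle is, at its core, online gradient descent over rounds. A minimal sketch, where each round's gradient function stands in for the verifier's feedback on the committed draft (the real draft loss is far richer):

```python
import numpy as np

def online_gradient_descent(loss_grads, theta0, eta=0.1):
    """One OGD pass over a sequence of per-round loss gradients:
    commit the current draft parameters, receive feedback, adapt."""
    theta = np.asarray(theta0, dtype=float).copy()
    trajectory = []
    for grad_fn in loss_grads:            # one feedback signal per round
        trajectory.append(theta.copy())   # commit current draft
        theta = theta - eta * grad_fn(theta)   # adapt from feedback
    return theta, trajectory
```

Dynamic regret then measures how well this trajectory tracks the per-round optima, which is the quantity the paper ties to the acceleration rate.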
Results
The experiments conducted across seven benchmarks and three foundation models demonstrate that the OnlineSPEC framework consistently outperforms both offline baselines and naive online adaptations, achieving speedups of up to 24% compared to previous state-of-the-art methods.
Implications
The proposed framework has significant implications for improving the efficiency of large language models in real-time applications, enabling faster inference without sacrificing output quality. It opens avenues for further research into adaptive learning techniques in LLMs and other generative models.
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Reinforcement Learning
Large Language Models
Generative Models
- Introduces a finite-horizon MDP framework for DLMs to facilitate RL applications.
- Derives an exact policy gradient that allows for stepwise advantage estimation.
- Implements entropy-guided step selection to optimize compute allocation during training.
- Achieves state-of-the-art results on coding and logical reasoning tasks, surpassing existing RL methods for DLMs.
Summary
This paper addresses the challenges of applying reinforcement learning (RL) to diffusion language models (DLMs), which differ fundamentally from autoregressive language models (ARLMs) due to their iterative denoising process. The authors propose a novel framework that formulates the diffusion-based sequence generation as a finite-horizon Markov decision process (MDP) over denoising steps. They derive an exact, unbiased policy gradient that decomposes over these steps, allowing for the estimation of intermediate advantages without requiring explicit sequence likelihood evaluations. To enhance practical implementation, the authors introduce two key methods: (1) an entropy-guided step selection for policy updates, which prioritizes higher-entropy denoising steps, and (2) a method for estimating stepwise advantages using a one-step denoising reward from the diffusion model, thus avoiding costly multi-step rollouts. The proposed approach, termed Entropy-Guided Stepwise Policy Optimization with Stepwise Advantages (EGSPO-SA), demonstrates state-of-the-art performance on coding and logical reasoning benchmarks, outperforming existing RL post-training methods for DLMs. The findings indicate that the proposed methods effectively leverage the unique structure of DLMs to facilitate scalable RL training while maintaining high output quality.
Methodology
The authors formulate the diffusion sequence generation as a finite-horizon MDP and derive an exact policy gradient. They implement entropy-guided step selection for policy updates and estimate stepwise advantages using a one-step denoising reward, avoiding the need for explicit sequence likelihood evaluations and costly multi-step rollouts.
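The entropy-guided selection itself is simple to sketch: score each denoising step by its mean token-distribution entropy and keep the top-k steps for policy updates. Averaging over positions is an assumption about how per-step entropy is aggregated.

```python
import numpy as np

def entropy_guided_steps(step_token_probs, k):
    """Return the indices (sorted) of the k highest-entropy denoising
    steps; per-step entropy is averaged over token positions."""
    entropies = []
    for probs in step_token_probs:            # (positions, vocab) per step
        p = np.clip(np.asarray(probs, float), 1e-12, 1.0)
        entropies.append(-(p * np.log(p)).sum(axis=-1).mean())
    order = np.argsort(entropies)[::-1]       # highest entropy first
    return sorted(int(i) for i in order[:k])
```

High-entropy steps are where the model is most uncertain, so spending the policy-gradient budget there is the compute-allocation idea described above.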
Results
The proposed EGSPO-SA method achieves state-of-the-art performance on standard coding and logical reasoning benchmarks, demonstrating superior results compared to existing RL post-training approaches for DLMs.
Implications
The findings suggest that the proposed RL framework can significantly enhance the performance of DLMs in various applications, particularly in tasks requiring logical reasoning and coding capabilities. This could lead to advancements in natural language understanding and generation tasks, as well as broader applications in AI-driven systems.
Causal Matrix Completion under Multiple Treatments via Mixed Synthetic Nearest Neighbors
Theory
Efficient ML
- Introduction of Mixed Synthetic Nearest Neighbors (MSNN) for causal matrix completion under multiple treatments.
- MSNN retains the statistical properties of the original SNN method while improving sample efficiency for sparse treatment levels.
- The method leverages shared latent structures across treatments to enhance causal effect estimation.
- Empirical results show MSNN's effectiveness in data-scarce environments, outperforming existing methods.
Summary
This paper addresses the challenge of causal matrix completion under multiple treatments, particularly in scenarios where data is missing not at random (MNAR). The authors introduce a novel method called Mixed Synthetic Nearest Neighbors (MSNN), which enhances the existing Synthetic Nearest Neighbors (SNN) approach by integrating information across different treatment levels. The MSNN algorithm allows for the estimation of causal effects even when treatment levels are data-scarce, by utilizing shared latent structures across treatments. The paper establishes that MSNN retains the statistical properties of SNN, such as finite-sample error bounds and asymptotic normality, while significantly improving sample efficiency. Empirical evaluations demonstrate MSNN's effectiveness on both synthetic and real-world datasets, particularly in estimating effects for treatments with limited data availability, thereby providing a robust solution for causal inference in complex treatment scenarios.
Methodology
The authors propose the MSNN algorithm, which utilizes Mixed Anchor Rows and Mixed Anchor Columns to estimate imputation coefficients from data spanning multiple treatment levels. This method addresses the challenges posed by data scarcity and the need for sufficient anchor sets in the SNN approach. The theoretical framework is built on the assumption of shared latent row factors across treatments, allowing for effective cross-treatment identifiability.
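The imputation step can be sketched in the classical SNN style: express the target row as a linear combination of anchor rows over the anchor columns, then apply the same combination in the missing column. In MSNN the anchor sets would additionally pool observations across treatment levels; the plain least-squares formulation below is a simplified assumption, not the paper's exact estimator.

```python
import numpy as np

def snn_impute(Y, i, j, anchor_rows, anchor_cols):
    """Impute the missing entry Y[i, j]: learn weights expressing row i as a
    linear combination of the anchor rows over the anchor columns (where all
    entries are observed), then apply the same weights in column j."""
    A = Y[np.ix_(anchor_rows, anchor_cols)]      # anchor block, fully observed
    b = Y[i, anchor_cols]                        # target row on anchor columns
    w, *_ = np.linalg.lstsq(A.T, b, rcond=None)  # solve A.T @ w ~ b
    return float(w @ Y[anchor_rows, j])
```

Under a low-rank latent factor model the learned weights transfer exactly from the anchor columns to the missing column, which is what makes the imputation consistent.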
Results
The MSNN algorithm demonstrates exponential improvements in sample efficiency compared to SNN, particularly under conditions of missing data. The expected number of usable data subgroups for MSNN significantly exceeds that of SNN, enhancing the feasibility of causal estimation in sparse treatment scenarios. Empirical evaluations confirm that MSNN reliably estimates causal effects where SNN fails, particularly illustrated through a case study on California's tobacco control policy.
Implications
The findings suggest that MSNN can be a powerful tool for causal inference in various fields, such as economics, public policy, and healthcare, where treatment assignments are complex and data may be missing not at random. By effectively utilizing available information across treatments, MSNN can facilitate better decision-making and intervention evaluations in data-scarce environments.
Deep Distance Measurement Method for Unsupervised Multivariate Time Series Similarity Retrieval
Time Series
- DDMM improves retrieval accuracy by focusing on minute differences in multivariate time series data.
- The method uses a unique weighting system for anchor-positive pairs based on Euclidean distance.
- Empirical studies show significant performance improvements over existing methods in industrial applications.
- Combining DDMM with feature extraction methods can lead to further accuracy enhancements.
Summary
This paper introduces the Deep Distance Measurement Method (DDMM) aimed at enhancing retrieval accuracy in unsupervised multivariate time series (MTS) similarity retrieval, particularly within industrial contexts. The authors highlight the importance of recognizing minute differences between states in MTS data, which is crucial for applications in industrial plants where sensor data is abundant. DDMM employs a novel learning algorithm that assigns weights to pairs of anchor and positive samples based on their Euclidean distance, allowing the model to focus on learning subtle differences across the entire time series rather than just local segments. This approach contrasts with traditional methods that often rely on negative samples or proximity-based sampling, which can limit the retrieval scope and accuracy. The empirical results demonstrate that DDMM significantly outperforms existing state-of-the-art methods on a Pulp-and-paper mill dataset, showcasing its effectiveness in real-world industrial applications. Additionally, the authors found that integrating DDMM with existing feature extraction techniques further enhances accuracy, indicating the potential for hybrid models in improving MTS retrieval tasks.
Methodology
DDMM utilizes an Autoencoder (AE) model trained on difference vectors derived from pairs of anchor and positive samples. Each pair is weighted according to its Euclidean distance, so the model prioritizes learning from pairs that differ only minutely. The reconstruction error of the difference vectors then serves as the distance measure for retrieval tasks.
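A minimal numpy sketch of the weighting and loss ideas: close pairs get large weights so the AE concentrates on minute differences, and reconstruction error of the difference vector plays the role of a distance. The exponential weighting form and function names are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

def pair_weights(anchors, positives, tau=1.0):
    """Weight each anchor-positive pair by exp(-d/tau) of its Euclidean
    distance d, so close pairs (only minute differences) dominate training."""
    d = np.linalg.norm(anchors - positives, axis=1)
    return np.exp(-d / tau)

def weighted_recon_loss(diff_vecs, recon, weights):
    """Weighted mean reconstruction error of the difference vectors; at
    retrieval time the plain reconstruction error is the distance measure."""
    per_pair = np.mean((diff_vecs - recon) ** 2, axis=1)
    return float(np.sum(weights * per_pair) / np.sum(weights))
```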
Results
The results indicate that DDMM significantly outperforms state-of-the-art time series representation learning methods on the Pulp-and-paper mill dataset, demonstrating its effectiveness in accurately retrieving similar MTS data in industrial contexts.
Implications
The findings suggest that DDMM can be a valuable tool for industries relying on MTS data for monitoring and maintenance, enabling better anomaly detection and operational analysis. The method's focus on unsupervised learning also opens avenues for applications in domains where labeled data is limited.
Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights
Theory
- Privacy vulnerability is concentrated in a small fraction of weights.
- Critical weights for utility performance overlap with privacy-vulnerable weights.
- The importance of weights is determined more by their locations than their values.
- The proposed fine-tuning strategy selectively rewinds only privacy-vulnerable weights.
Summary
This paper addresses the challenge of preserving membership privacy in neural networks while maintaining utility performance. The authors identify that privacy vulnerabilities are concentrated in a small fraction of weights, which also significantly impact the model's utility. They propose a novel approach that focuses on the locations of these critical weights rather than their values. By scoring and selectively fine-tuning only the privacy-vulnerable weights, the authors demonstrate that their method can effectively mitigate membership inference attacks (MIAs) while preserving model accuracy. Their findings suggest that traditional methods, which often retrain or update all weights, can lead to unnecessary utility loss. Through extensive experiments, the proposed strategy shows superior resilience against MIAs compared to existing privacy-preserving techniques, marking a significant advancement in the field of privacy in machine learning.
Methodology
The authors introduce a weight-level importance estimation based on Machine Unlearning (MU) to identify privacy-vulnerable weights in neural networks. They then develop a fine-tuning strategy that focuses on these critical weights, allowing for effective privacy preservation without sacrificing model performance.
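The selective-rewind idea can be sketched as follows; the scoring input, the top-fraction threshold, and the function name are assumptions for illustration (the paper derives its weight scores from machine unlearning, which is not reproduced here).

```python
import numpy as np

def rewind_critical_weights(w_final, w_init, scores, frac=0.01):
    """Rewind the top-`frac` highest-scoring (most privacy-vulnerable) weight
    positions to their earlier values, leaving all other weights untouched;
    only the rewound positions would then be selectively fine-tuned."""
    k = max(1, int(frac * w_final.size))
    idx = np.argsort(scores.ravel())[-k:]        # most vulnerable positions
    mask = np.zeros(w_final.size, dtype=bool)
    mask[idx] = True
    out = w_final.ravel().copy()
    out[mask] = w_init.ravel()[mask]
    return out.reshape(w_final.shape), mask.reshape(w_final.shape)
```

Because only a small fraction of positions is touched, the bulk of the weights that carry utility remain exactly as trained.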
Results
The proposed method demonstrates improved resilience against modern membership inference attacks while maintaining utility, outperforming existing methods that retrain models from scratch. The experiments validate the effectiveness of focusing on a small set of critical weights for both privacy and utility.
Implications
This research has significant implications for developing privacy-preserving machine learning models, particularly in sensitive applications where membership privacy is crucial. The findings could lead to more efficient training methods that balance privacy and performance, potentially influencing future research in privacy-preserving techniques.
When LLM Judge Scores Look Good but Best-of-N Decisions Fail
NLP
Large Language Models
Optimization
- Global metrics can misrepresent the effectiveness of LLM judges in best-of-n selection tasks.
- A judge with moderate global correlation may perform poorly in actual selection scenarios.
- Within-prompt ranking is crucial for effective candidate selection, as it differs from global agreement.
- Explicit pairwise judging significantly improves recovery rates in selection tasks.
Summary
This paper investigates the effectiveness of large language models (LLMs) as judges in scoring candidate responses for best-of-n selection tasks. The author argues that relying solely on global metrics, such as correlation with reference labels, can be misleading. In a benchmark of 5,000 prompts, a judge with a moderate global correlation (r = 0.47) only captures 21% of the improvement achievable through optimal selection over random choice. This discrepancy arises because global metrics often reflect baseline prompt-level effects rather than the necessary within-prompt ranking. The study reveals that within-prompt correlation is significantly lower (r = 0.27), and coarse scoring leads to ties in 67% of pairwise comparisons. To address this, the paper proposes a matched-pair best-of-2 audit, which enhances recovery from 21.1% to 61.2% by utilizing explicit pairwise judging. The author emphasizes the need for decision-centric audits that report within-prompt signal, tie rates, and recovery accuracy, rather than relying on global agreement alone.
Methodology
The study employs a large cross-policy benchmark of 5,000 prompts to analyze the performance of LLM judges in best-of-n selection tasks. It compares global correlation metrics with within-prompt ranking signals and conducts a matched-pair best-of-2 audit to assess the effectiveness of explicit pairwise judging. The analysis focuses on recovery rates and top-1 accuracy as key performance indicators.
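The recovery metric discussed above can be sketched directly: compare the true quality of the judge's top-1 picks against random and oracle selection. This is a plausible reading of "fraction of the oracle improvement over random selection", not necessarily the paper's exact definition.

```python
import numpy as np

def best_of_n_recovery(judge_scores, true_scores):
    """Fraction of the oracle best-of-n improvement over random selection
    that the judge's top-1 picks capture. Both inputs are
    (num_prompts, n_candidates) arrays; 1.0 means perfect selection."""
    picks = np.argmax(judge_scores, axis=1)
    judged = true_scores[np.arange(len(true_scores)), picks].mean()
    oracle = true_scores.max(axis=1).mean()
    random = true_scores.mean()
    return (judged - random) / (oracle - random)
```

Because the metric compares candidates within each prompt, a judge can have respectable global correlation yet score poorly here, which is exactly the failure mode the paper documents.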
Results
The findings indicate that a judge with a global correlation of r = 0.47 achieves only 21% recovery in selection tasks, while explicit pairwise judging raises this recovery to 61.2%. The study highlights the importance of within-prompt ranking, revealing that coarse scoring leads to a high rate of ties in candidate comparisons.
Implications
The results suggest that practitioners should be cautious when using global metrics to evaluate LLM judges for selection tasks. The proposed decision-centric audit framework could improve the effectiveness of LLMs in real-world applications, particularly in scenarios requiring precise candidate selection.
On Linear Separability of the MNIST Handwritten Digits Dataset
Theory
- The MNIST dataset remains a critical benchmark for evaluating image classification models.
- Linear separability is a key concept in machine learning, yet its status for MNIST has been unclear.
- The paper distinguishes between pairwise and one-vs-rest linear separability in its analysis.
- The findings may confirm the prevailing belief that MNIST is not linearly separable.
Summary
This paper investigates the linear separability of the MNIST dataset, a widely used benchmark for handwritten digit recognition. Despite its long history, the question of whether the dataset is linearly separable has not been conclusively answered, with conflicting claims in the literature. The author conducts a comprehensive empirical analysis, distinguishing between pairwise and one-vs-rest linear separability for the training, test, and combined sets. The study reviews theoretical approaches to linear separability and applies state-of-the-art methods to evaluate the dataset's separability. The findings aim to clarify the existing ambiguity regarding the linear separability of MNIST, potentially confirming the informal consensus that it is not linearly separable. The paper concludes with a discussion of the implications of these findings for machine learning models that rely on linear separability.
Methodology
The author employs a systematic empirical investigation to assess linear separability using both pairwise and one-vs-rest approaches. The study reviews existing theoretical frameworks and applies state-of-the-art methods to evaluate the MNIST dataset, including computational geometry techniques and algorithms like Support Vector Machines (SVM).
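While the paper applies state-of-the-art methods such as SVMs, the separability question itself can be illustrated with the classical perceptron, which provably reaches zero training errors in finitely many updates exactly when a separating hyperplane exists. The epoch cap below is a practical assumption, and this sketch is not the paper's procedure.

```python
import numpy as np

def is_linearly_separable(X, y, max_epochs=1000):
    """Perceptron check for labels in {-1, +1}: convergence to zero training
    errors certifies separability; hitting the epoch cap is only evidence,
    not proof, of non-separability."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # absorb the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:             # misclassified or on boundary
                w += yi * xi
                errors += 1
        if errors == 0:
            return True
    return False
```

On the classic toy example, OR is linearly separable while XOR is not, which is the asymmetry any such check must reproduce.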
Results
The experiments indicate that the MNIST dataset is not linearly separable, supporting the informal consensus in the literature. The paper provides detailed empirical evidence and analysis of separability across individual digit pairs and the dataset as a whole.
Implications
The findings have significant implications for the design and evaluation of machine learning models, particularly those that rely on linear decision boundaries. Understanding the separability of the MNIST dataset can inform the choice of algorithms and techniques used in digit recognition tasks.