AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 59
Update frequency: 8h
Days of history: 7
Rapid FinFET Modelling Using an Autoencoder
Efficient ML
- Utilizes an autoencoder for efficient FinFET modeling.
- Incorporates drain-to-source voltage (VDS) as an input feature.
- Achieves high accuracy with minimal training data.
- Successfully reconstructs full I-V curves and extracts key device metrics.
Summary
This paper presents a novel machine learning framework that uses an autoencoder (AE) for efficient modeling of FinFET devices. The authors calibrated a BSIM-CMG model to generate a dataset of drain current versus gate voltage (ID-VG) characteristics, which served as the training data for the autoencoder. The AE compresses full I-V curves into a low-dimensional latent space that captures the essential device physics. A significant innovation of this work is the inclusion of the drain-to-source voltage (VDS) as an input feature, which lets the model account for bias-dependent variations. The trained autoencoder reconstructs full I-V curves and extracts critical device metrics such as threshold voltage (VTH), subthreshold slope (SS), and peak transconductance (gm). The results demonstrate that data-driven compact models, built from actual characterization data, can achieve high accuracy with minimal training data, providing a tool for rapid device characterization, modeling, and circuit-level simulation.
Methodology
The methodology involves calibrating a BSIM-CMG model to generate ID-VG datasets, which are then preprocessed and used to train an autoencoder. The autoencoder compresses the I-V curves into a latent space, allowing for the reconstruction of curves and extraction of device parameters. The training data undergoes normalization and logarithmic transformation to enhance model performance.
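The preprocessing step mentioned above (logarithmic transformation followed by normalization) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the min-max scaling choice, the current floor, and the toy exponential ID-VG sweep are hypothetical and are not taken from the paper. The sketch also shows one common way to read a subthreshold slope off a curve.

```python
import math

def preprocess_curve(currents, floor=1e-15):
    """Log-transform drain currents, then min-max scale them to [0, 1]."""
    logs = [math.log10(max(i, floor)) for i in currents]
    lo, hi = min(logs), max(logs)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat curve
    return [(v - lo) / span for v in logs], (lo, hi)

def restore_curve(scaled, bounds):
    """Invert preprocess_curve to recover currents in amperes."""
    lo, hi = bounds
    return [10 ** (v * (hi - lo) + lo) for v in scaled]

def subthreshold_slope_mv_per_dec(v_gate, currents):
    """Smallest dVG / dlog10(ID) between adjacent sweep points, in mV/decade."""
    slopes = []
    for k in range(1, len(v_gate)):
        d_log = math.log10(currents[k]) - math.log10(currents[k - 1])
        if d_log > 0:
            slopes.append(1000.0 * (v_gate[k] - v_gate[k - 1]) / d_log)
    return min(slopes)

# Toy sweep (assumed): a purely exponential curve with an 80 mV/decade slope.
v_gate = [0.05 * k for k in range(9)]
currents = [1e-12 * 10 ** (vg / 0.08) for vg in v_gate]

scaled, bounds = preprocess_curve(currents)
restored = restore_curve(scaled, bounds)
ss = subthreshold_slope_mv_per_dec(v_gate, currents)
```

The scaled values are what an autoencoder would plausibly be trained on; keeping the bounds alongside each curve is what allows a reconstruction to be mapped back to physical currents.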
Results
The autoencoder demonstrated a high level of accuracy in reconstructing full I-V curves and extracting critical metrics such as VTH, SS, and gm. The model's performance indicates that it can effectively generalize from a limited dataset, showcasing its potential for rapid device modeling.
Implications
This approach could significantly reduce the computational costs and time associated with traditional TCAD simulations, making it a valuable tool for semiconductor device characterization and circuit-level applications. It also opens avenues for further research in machine learning applications within semiconductor physics.
CKM-Driven Communication-Aware UAV Intelligent Trajectory Optimization for Urban Inspection
Optimization
Reinforcement Learning
Robotics
- Introduction of a channel knowledge map (CKM) for UAV trajectory planning.
- Utilization of a diffusion model for constructing a time-accumulated CKM.
- Development of a graph attention network soft actor-critic (GATSAC) algorithm for optimizing UAV paths.
- Demonstrated improvement in communication reliability and trajectory efficiency.
Summary
This paper addresses the challenges of reliable communication in urban inspection tasks using unmanned aerial vehicles (UAVs), where spatial channel heterogeneity complicates communication. The authors propose a novel trajectory planning framework that utilizes a channel knowledge map (CKM) to enhance communication-aware path planning for multi-UAV operations. The CKM is constructed using a diffusion model to create a time-accumulated representation of the channel quality, allowing for accurate perception with minimal flight overhead. The proposed method integrates a global-to-local graph attention network (GAT) with a soft actor-critic (SAC) algorithm to optimize UAV trajectories. The GAT addresses the combinatorial node ordering problem, generating an optimal sequence for inspection targets while the SAC ensures smooth flight paths and avoids areas of poor communication quality. Simulation results indicate that this approach significantly improves trajectory efficiency and communication reliability without relying on real-time feedback.
Methodology
The methodology involves constructing a CKM using a diffusion model to represent the spatial distribution of received signal strength (RSS). The trajectory planning is reformulated as a traveling salesman problem (TSP) using a graph attention network (GAT) to optimize the sequence of target visits. A soft actor-critic (SAC) algorithm is then employed for continuous action control to ensure smooth UAV trajectories while avoiding areas with poor communication quality.
Results
Simulation results show that the proposed CKM-driven trajectory optimization method effectively guides UAVs through regions with high-quality communication channels, leading to significant improvements in both trajectory efficiency and communication reliability compared to traditional methods.
Implications
The findings suggest that integrating communication-aware planning into UAV operations can enhance the effectiveness of urban inspections, potentially leading to broader applications in areas such as disaster response, infrastructure monitoring, and smart city management.
The Degeneracy Distillery
Theory
Efficient ML
Interpretability
- Introduction of a three-stage pipeline for detecting and resolving parameter degeneracies.
- Method leverages Fisher information geometry to identify independent parameter combinations.
- Demonstrated significant reductions in simulation budget for posterior estimation.
- Validated on synthetic problems and applied to real-world scientific challenges.
Summary
The paper introduces the 'Degeneracy Distillery', a novel method for detecting and resolving degenerate parameter combinations in physical models and real-world datasets. Degeneracies occur when multiple parameters yield similar data, complicating label prediction and inverse problems. The proposed method operates in three stages: it estimates the Fisher information matrix globally from parameter-data pairs, learns neural coordinates that flatten this matrix, and derives symbolic expressions for these coordinates. This approach allows for the identification of parameter combinations that independently affect the data, leading to a more efficient simulation budget for downstream neural posterior estimation. The authors validate their method on synthetic problems and apply it to various scientific challenges, demonstrating significant reductions in required simulations while maintaining calibrated coverage. The work emphasizes the importance of understanding degeneracies for both statistical inference and practical applications in scientific modeling.
Methodology
The methodology involves estimating the Fisher information matrix globally using neural networks, learning neural coordinates that flatten the Fisher geometry, and deriving symbolic reparametrizations that remove degeneracies. This systematic, data-driven approach allows for the identification of effective parameter combinations without requiring realized data observations.
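A small worked example may help show what a degeneracy looks like in Fisher information terms. The toy model below is an assumption chosen for illustration, not one of the paper's problems: observations y = a * b * x with Gaussian noise, so only the product a * b is constrained by the data. The function names are hypothetical.

```python
def fisher_matrix(a, b, xs, sigma=1.0):
    """Fisher information for y = a*b*x + N(0, sigma^2) noise, parameters (a, b)."""
    weight = sum(x * x for x in xs) / sigma ** 2
    grad = (b, a)  # partial derivatives of the product a*b with respect to a and b
    return [[weight * grad[i] * grad[j] for j in range(2)] for i in range(2)]

def apply(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

a, b = 2.0, 3.0
fisher = fisher_matrix(a, b, xs=[1.0, 2.0, 3.0])

# The determinant vanishes: one direction in (a, b) space leaves the data unchanged.
determinant = fisher[0][0] * fisher[1][1] - fisher[0][1] * fisher[1][0]

# Moving along (a, -b) keeps a*b fixed to first order, so it is a null direction.
null_response = apply(fisher, [a, -b])
```

Reparametrizing to the single coordinate c = a * b removes the null direction, which is the kind of symbolic reparametrization the summary describes, here done by hand for a case where the answer is obvious.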
Results
The results show that the Degeneracy Distillery can reduce the number of simulations needed for posterior estimation by up to a factor of 10 while preserving calibration. The method was validated against synthetic benchmarks and applied to complex scientific problems, demonstrating its effectiveness in revealing the intrinsic structure of parameter spaces.
Implications
The findings have significant implications for improving the efficiency of simulations in scientific modeling, enhancing the understanding of parameter interactions, and enabling the construction of more informed priors in Bayesian inference. This could lead to advancements in fields such as astrophysics, epidemiology, and cosmology.
Learning Subset-Shared Invariances for Domain Generalization with Mixture-of-Experts
Theory
- Identifies limitations of global invariance in domain generalization, which can reduce predictive information.
- Introduces subset-shared invariance, reframing DG as learning structured, subset-dependent invariances.
- Proposes a routing-based Mixture-of-Experts framework to selectively enforce invariance across domain subsets.
- Demonstrates improved out-of-domain generalization and robustness under heterogeneous domain shifts.
Summary
This paper addresses the challenge of domain generalization (DG), which involves training models that can generalize to unseen domains without access to target data during training. The authors critique the common approach of enforcing global invariance across all source domains, arguing that this can restrict the representation space and discard useful predictive factors that are not universally shared. To overcome this limitation, they propose a novel concept called subset-shared invariance, which assumes that predictive structures are stable only within subsets of domains. The authors implement this idea using a mixture-of-experts (MoE) architecture, where each expert specializes in aligning the domains it serves, and a routing mechanism composes subset-invariant components for predictions. This approach allows for a routing-conditioned invariance that is learned alongside the representation. They also introduce training objectives that promote selective alignment, confident and balanced routing, and diverse expert specialization. Experimental results on DomainBed benchmarks show that their method significantly improves out-of-domain generalization and robustness under increasing domain heterogeneity, suggesting that DG should model invariance through partially shared structures across domain subsets rather than enforcing a single global invariance.
Methodology
The authors utilize a mixture-of-experts (MoE) architecture to implement subset-shared invariance. A routing function is employed to create soft partitions over inputs, defining subsets of domains associated with each expert. They introduce a subset-conditioned invariance objective that aligns class-conditional feature distributions across domain pairs selected by routing. Additionally, they incorporate a diversity regularizer and routing objectives to promote effective expert utilization.
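The soft routing described above can be sketched in a few lines. This is only a generic mixture-of-experts combination step under assumed inputs: the router logits, the expert outputs, and the function names are hypothetical, and the paper's invariance, diversity, and routing objectives are not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_and_mix(router_logits, expert_outputs):
    """Weight each expert's output vector by its routing probability and sum."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
    return weights, mixed

# One input routed across three experts (all values assumed for illustration).
weights, mixed = route_and_mix(
    router_logits=[2.0, 0.0, -1.0],
    expert_outputs=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
```

Because the weights form a probability distribution over experts, each input is softly assigned to the experts whose domain subsets it resembles.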
Results
The proposed method shows consistent improvements in out-of-domain generalization on standard DG benchmarks, particularly when faced with heterogeneous domain shifts. The results indicate that the routing-induced invariance allows for better preservation of predictive information, leading to enhanced model performance as the number of source domains increases.
Implications
The findings suggest that domain generalization strategies should focus on modeling invariance through subset-specific structures rather than relying on a single global invariant representation. This approach could lead to more robust models in real-world applications where domain shifts are common.
Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios
Computer Vision
Theory
Optimization
- Introduces MEDIC, a dualistic meta-learning strategy for open set domain generalization.
- Addresses the issue of biased decision boundaries caused by imbalanced sample distributions.
- Implements simultaneous gradient matching for inter-domain and inter-class tasks.
- Provides a theoretical analysis of gradient matching that improves upon previous methods.
Summary
This paper addresses the challenge of open set domain generalization (OSDG), which aims to recognize unseen classes in unseen domains while maintaining classification accuracy for known classes. Traditional methods often struggle with imbalanced data distributions, leading to biased decision boundaries that over-reject out-of-distribution samples. The authors propose a novel meta-learning strategy called dualistic MEta-learning with joint DomaIn-Class matching (MEDIC). This approach simultaneously matches gradients across inter-domain and inter-class task splits, allowing for a more balanced decision boundary. The paper extends previous work by providing a more comprehensive framework and theoretical analysis of step-wise gradient matching. Extensive experiments demonstrate that MEDIC outperforms state-of-the-art methods in open set scenarios while maintaining competitive performance in traditional domain generalization tasks.
Methodology
The authors developed the MEDIC framework, which employs dualistic meta-learning to achieve gradient matching across both domains and classes. This involves constructing inter-class pairs from tasks sampled from different domains and optimizing their gradients to find a balanced decision boundary. The approach integrates a task scheduling strategy for handling hard class pairs and extends previous insights into a generalized framework.
Results
Experimental results indicate that MEDIC significantly outperforms several state-of-the-art methods in open set scenarios. It also maintains competitive accuracy in traditional domain generalization tasks, showing that it is effective in both settings.
Implications
The findings suggest that MEDIC can enhance the robustness and security of models deployed in real-world applications where unseen classes may be encountered. This has significant implications for fields such as medical imaging and other domains where new classes may emerge unpredictably.
Natural Identifiers for Privacy and Data Audits in Large Language Models
Large Language Models
NLP
Theory
- Introduction of natural identifiers (NIDs) as a solution for post-hoc privacy audits in LLMs.
- NIDs allow for the generation of additional random strings from the same distribution, facilitating auditing without retraining.
- The method adapts existing differential privacy auditing frameworks to leverage NIDs effectively.
- Empirical validation shows accurate inference of training membership without false positives.
Summary
This paper addresses the challenges of auditing the privacy of large language models (LLMs), particularly in the context of differential privacy and dataset inference. Traditional auditing methods require the insertion of specially crafted canary data during training, making them impractical for pre-trained models. Moreover, dataset inference typically necessitates access to a private non-member held-out dataset, which is often difficult to obtain. The authors propose a novel solution called natural identifiers (NIDs), which are structured random strings that naturally occur in common LLM training datasets. NIDs can be used to generate unlimited additional random strings from the same distribution, allowing for effective post-hoc audits without retraining models or needing a dedicated held-out dataset. The paper demonstrates that using NIDs facilitates differential privacy auditing and dataset inference for suspect datasets containing NIDs. The evaluation shows that NIDs improve the auditing process, reduce sample complexity, and enable accurate inference of training membership across diverse data subsets, suggesting their utility in real-world applications.
Methodology
The authors leverage natural identifiers (NIDs) found in training datasets to create a framework for post-hoc privacy auditing and dataset inference. They adapt existing differential privacy auditing methods to utilize NIDs, allowing for audits without the need for model retraining or dedicated held-out datasets. The approach is empirically validated using open-source LLMs and their known training data.
Results
The results indicate that the use of NIDs significantly enhances the ability to conduct post-hoc differential privacy audits and dataset inference. The method achieves tighter privacy guarantees and reduces the sample complexity, enabling accurate training membership inference across various data subsets without false positives.
Implications
The findings suggest that NIDs can be a practical and scalable solution for auditing the privacy of LLMs in real-world scenarios, potentially aiding in legal contexts where the privacy of training data is contested. This could lead to improved trust and safety in the deployment of LLMs in sensitive applications.
Memory-Efficient Policy Libraries with Low-Rank Adaptation in Reinforcement Learning
Reinforcement Learning
Robotics
Efficient ML
- LoRA can reduce memory usage by a factor of 20-160 compared to full fine-tuning.
- 90-95% storage savings enable the deployment of multiple specialized policies in memory-constrained environments.
- No significant performance loss when using LoRA for fine-tuning compared to traditional methods.
- The approach addresses catastrophic forgetting by maintaining a library of specialist policies.
Summary
This paper explores the application of Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) technique, in the context of multi-task Reinforcement Learning (RL) for robotics. The authors propose a memory-efficient approach to create a library of specialized policies that can be fine-tuned from a pre-trained baseline model using Proximal Policy Optimization (PPO). The study highlights the challenges of catastrophic forgetting in multi-task RL and demonstrates that LoRA can reduce memory usage by a factor of 20-160 compared to traditional full fine-tuning methods. The results indicate that deploying a library of 10-50 specialized policies can lead to a 90-95% reduction in storage requirements, making it feasible to store the entire library in memory. Importantly, the performance of policies fine-tuned with LoRA does not significantly differ from that of policies obtained through full fine-tuning, suggesting that LoRA is a viable alternative for efficient policy management in robotics.
Methodology
The authors utilized Proximal Policy Optimization (PPO) to fine-tune a baseline model for various tasks using Low-Rank Adaptation (LoRA). The technique involves decomposing weight updates into low-rank matrices while keeping the original model weights frozen, allowing for efficient training and reduced memory usage.
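The low-rank decomposition described above can be made concrete with a short sketch. The layer size, the rank, and the helper names below are assumptions chosen for illustration, not values reported by the authors; the point is only to show why storing the two small factors is much cheaper than storing a full weight update.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameters stored for a full weight update versus a rank-r adapter."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, adapter

def apply_lora(w, b, a, scale=1.0):
    """Return W + scale * (B @ A) for nested-list matrices; W itself stays frozen."""
    rows, cols, rank = len(w), len(w[0]), len(a)
    return [
        [w[i][j] + scale * sum(b[i][r] * a[r][j] for r in range(rank))
         for j in range(cols)]
        for i in range(rows)
    ]

# Hypothetical 512 x 512 policy layer adapted with rank 8.
full, adapter = lora_param_counts(512, 512, 8)
ratio = full / adapter

# Tiny numeric example: a rank-1 update on a 2 x 2 identity weight.
adapted = apply_lora(
    w=[[1.0, 0.0], [0.0, 1.0]],
    b=[[1.0], [2.0]],
    a=[[0.5, 0.0]],
)
```

In a policy library, each specialist would then be stored as its own pair of small factors on top of a single shared frozen base, which is where per-policy savings of this kind come from.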
Results
The experiments showed that using LoRA for fine-tuning reduces memory requirements by a factor of 20-160 compared to full fine-tuning. The performance of the policies remained comparable to those obtained through full fine-tuning, indicating that LoRA is an effective method for creating a library of specialized policies.
Implications
This research has significant implications for the field of robotics, particularly in multi-task learning scenarios where memory efficiency is crucial. The ability to store and switch between multiple specialized policies without significant performance degradation can enhance the operational capabilities of robotic systems in real-world applications.
Communicability-Inspired Positional Encoding (CIPE)
Graph Learning
- CIPE leverages communicability to create a positional encoding that reflects meaningful graph structure.
- The method introduces an Attention-Compatible Geometry that enhances self-attention mechanisms in Transformers.
- Dimensionality alignment is employed to map CIPE representations to a shared embedding space.
- Empirical results show a 35.5% performance improvement over existing positional encodings across multiple benchmarks.
Summary
The paper introduces Communicability-Inspired Positional Encoding (CIPE), a novel approach to positional encoding for Transformers that addresses the challenges of encoding non-Euclidean graph structures. CIPE is designed to create an Attention-Compatible Geometry by leveraging communicability, a measure of node connectivity that aggregates contributions from all paths between nodes. This encoding allows for the inner products of CIPE to reflect meaningful structural relationships, thus enhancing the performance of self-attention mechanisms in graph-based models. The authors also propose a dimensionality alignment technique to ensure that CIPE representations are compatible with standard Transformer architectures while preserving the induced geometry. Empirical evaluations demonstrate that CIPE significantly outperforms existing positional encodings, achieving an average improvement of 35.5% across seven benchmarks for structure-agnostic Transformers and consistently enhancing performance in structure-biased graph Transformers. This positions CIPE as a robust framework for integrating positional information in graph-based learning tasks.
Methodology
The authors developed CIPE by constructing a positional geometry based on communicability, which aggregates path contributions between nodes. They utilized the discrete diffusion equation on graphs to derive CIPE representations, ensuring that the inner products of these representations directly correspond to communicability. Additionally, they implemented a dimensionality alignment process to adapt CIPE to standard Transformer dimensions while maintaining its geometric properties.
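To make the idea of "inner products that correspond to communicability" concrete, here is a minimal sketch. It assumes the classical definition of communicability as the matrix exponential of the adjacency matrix, which may differ from the paper's diffusion-based construction, and it omits the dimensionality alignment step. The helper names and the three-node path graph are hypothetical.

```python
def matmul(p, q):
    """Multiply two nested-list matrices."""
    return [
        [sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
        for i in range(len(p))
    ]

def matrix_exp(m, terms=30):
    """Truncated Taylor series for exp(M); adequate for small, low-norm matrices."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]  # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Path graph 0 - 1 - 2 (assumed toy example).
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
communicability = matrix_exp(adjacency)

# For a symmetric adjacency matrix, exp(A/2) is symmetric and squares to exp(A),
# so using its rows as encodings makes their inner products equal communicability.
half = [[v / 2.0 for v in row] for row in adjacency]
encoding = matrix_exp(half)
transposed = [list(col) for col in zip(*encoding)]
gram = matmul(encoding, transposed)
```

In this toy graph the adjacent pair (0, 1) has higher communicability than the end-to-end pair (0, 2), so dot-product attention over these encodings would favor structurally closer nodes.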
Results
The empirical evaluation revealed that CIPE improved the performance of structure-agnostic Transformers by an average of 35.5% across seven benchmarks, outperforming various existing positional encodings. Furthermore, CIPE provided consistent enhancements in structure-biased graph Transformers, where traditional positional encodings offered only marginal benefits.
Implications
CIPE has significant implications for improving graph-based learning models, particularly in applications requiring effective representation of non-Euclidean data structures. Its ability to integrate meaningful positional information could enhance performance in various domains such as social network analysis, biological network modeling, and other graph-related tasks.
A Survey on Federated Causal Discovery and Inference
Federated Learning
Graph Learning
Theory
- The paper provides a systematic review of Federated Causal Discovery and Inference, bridging a gap in existing literature.
- FCD and FCI are formalized as complementary stages in a unified federated causal reasoning pipeline.
- The authors categorize methodologies based on design decisions, federation topology, and structural scope.
- Key practical challenges include data heterogeneity, privacy, and the need for theoretical guarantees.
Summary
This paper presents a comprehensive survey on Federated Causal Discovery (FCD) and Federated Causal Inference (FCI), addressing the challenges of performing causal analysis on distributed data across institutions while adhering to privacy regulations. The authors systematically review the field by establishing multi-dimensional taxonomies based on three core design decisions: how structures are learned, how data are partitioned, and what structural knowledge is obtained by each party. They categorize FCD methodologies into constraint-based, score-based, continuous-optimization, and hybrid paradigms, and discuss federation topologies (horizontal, vertical, and hybrid) and structural scopes (global vs. local). For FCI, methods are categorized by target estimand and estimation strategy, ranging from classical to modern approaches. The paper emphasizes the interconnection between FCD and FCI as complementary stages in a unified federated causal reasoning pipeline, where FCD provides the necessary structural knowledge for valid effect estimation in FCI. The authors also highlight practical dimensions such as temporal dynamics, data heterogeneity, and privacy concerns, concluding with a discussion of open challenges for future research.
Methodology
The authors employed a systematic review approach, organizing the literature into multi-dimensional taxonomies based on methodological paradigms, federation topologies, and structural scopes. They analyzed existing methods for FCD and FCI, categorizing them by target estimands and estimation strategies.
Results
The survey reveals a lack of comprehensive frameworks that integrate FCD and FCI, highlighting the need for a unified approach. It identifies key challenges such as structural constraints, heterogeneity in causal mechanisms, and identifiability issues that arise in federated settings.
Implications
The findings suggest that a unified framework for FCD and FCI could enhance collaborative causal analysis across institutions while maintaining data privacy. This has significant implications for fields such as healthcare, finance, and social sciences where data sharing is restricted.
Grad Detect: Gradient-Based Hallucination Detection in LLMs
NLP
Large Language Models
Interpretability
- Grad Detect is the first framework for hallucination detection based on layer-wise gradient analysis.
- It outperforms confidence-based methods by 3-8 percentage points in hallucination detection.
- The final five transformer layers concentrate over 97% of the discriminative gradient signal.
- The method simultaneously predicts response correctness and model abstention.
Summary
The paper introduces Grad Detect, a novel gradient-based framework for detecting hallucinations in Large Language Models (LLMs). Hallucinations refer to outputs that are factually incorrect yet presented with confidence, which poses challenges for deploying LLMs in critical applications. Grad Detect leverages layer-wise gradient patterns obtained from a single forward-backward pass during inference to predict hallucinations. The authors argue that the internal gradient structure provides rich information about the correctness of outputs, which is not accessible through traditional output-level signals. The method was evaluated across multiple Q&A benchmarks, demonstrating superior performance compared to existing confidence-based and sampling-based methods. The study also reveals that the last five layers of the model contain over 97% of the discriminative gradient signal, allowing for efficient deployment with minimal performance loss. Furthermore, Grad Detect is capable of predicting both response correctness and model abstention, unifying two previously independent detection tasks. Overall, the framework offers strong predictive performance and interpretable insights into model failures.
Methodology
Grad Detect computes reference gradients by averaging layer-wise gradients over labeled examples, creating a prototype in gradient space. It then characterizes test samples using cosine similarity between their gradients and the reference gradients, resulting in a compact feature matrix processed by a lightweight transformer encoder. This approach requires only a single forward-backward pass and no fine-tuning of the target LLM.
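The prototype-and-similarity step described above can be sketched as follows. This is a simplified illustration under stated assumptions: gradients are represented as short flattened vectors per layer, two reference classes ("correct" and "hallucinated") are used, and all names and numbers are hypothetical. The lightweight transformer encoder that consumes the features is not shown.

```python
import math

def mean_gradient(grads):
    """Average a list of equally sized gradient vectors element-wise."""
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(grads[0]))]

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_references(labeled):
    """Per label, average the layer-wise gradients of its examples."""
    refs = {}
    for label, examples in labeled.items():
        n_layers = len(examples[0])
        refs[label] = [
            mean_gradient([ex[layer] for ex in examples]) for layer in range(n_layers)
        ]
    return refs

def feature_matrix(sample, refs):
    """One row per layer, one column per reference label (sorted by name)."""
    labels = sorted(refs)
    return [
        [cosine(layer_grad, refs[label][layer]) for label in labels]
        for layer, layer_grad in enumerate(sample)
    ]

# Each example is a list of per-layer gradient vectors (toy values, two layers).
labeled = {
    "correct": [[[1.0, 0.0], [0.5, 0.5]], [[0.8, 0.2], [0.7, 0.3]]],
    "hallucinated": [[[-1.0, 0.1], [0.5, -0.5]], [[-0.9, 0.0], [0.4, -0.6]]],
}
refs = build_references(labeled)
features = feature_matrix([[0.9, 0.1], [0.6, 0.4]], refs)
```

Each row of `features` summarizes how closely one layer's gradient aligns with each prototype, which gives the compact layers-by-references matrix that a small downstream classifier could consume.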
Results
Grad Detect was validated on eleven instruction-tuned models across four architectural families, achieving 94-99% accuracy in abstention prediction and outperforming existing methods in hallucination detection. The systematic layer ablation studies confirmed that the last five layers are crucial for capturing the discriminative gradient signal.
Implications
The findings suggest that Grad Detect can enhance the reliability of LLMs in high-stakes applications such as healthcare and legal analysis by providing a robust mechanism for hallucination detection. The framework's ability to offer insights into model behavior may also aid in further improving LLM architectures.
GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series
Time Series
Graph Learning
Theory
- GRACE improves causal edge discovery by refining constraint-based methods with a gated neural model.
- The use of Hard Concrete gates with L0 regularization allows for robust binary decisions in edge selection.
- Empirical results show GRACE outperforms traditional methods in both synthetic and real-world datasets.
- The framework is computationally efficient, running 75 times faster than nonlinear CI tests.
Summary
The paper introduces GRACE, a novel framework designed to enhance causal edge discovery in high-dimensional time series data. Traditional methods face challenges due to the complexity of nonlinear conditional independence (CI) tests and the need for arbitrary thresholds in score-based approaches. GRACE addresses these issues by employing a two-stage process: first, it generates a candidate skeleton using a high-recall constraint-based method, and then it refines this skeleton using a gated neural model that applies Hard Concrete gates with L0 regularization. This approach allows for a clear bimodal separation of edge scores, improving the robustness of binary decisions. The framework is tested on synthetic datasets with various graph topologies and dimensions, demonstrating significant improvements in F1 scores and precision compared to existing methods. Additionally, GRACE is evaluated on a real-world river flow dataset, successfully identifying causal edges while minimizing false positives, showcasing its effectiveness in handling non-stationarity and confounding variables.
Methodology
GRACE operates in two stages: first, it employs a constraint-based method (like CDNOTS or PCMCI) to create a candidate skeleton of causal edges. In the second stage, a gated neural model refines this skeleton by assigning independent Hard Concrete gates to each candidate edge, trained with L0 regularization to ensure a clear bimodal distribution of gate values, thus enhancing the precision of causal edge identification.
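For readers unfamiliar with Hard Concrete gates, the sketch below follows the standard formulation from the L0 regularization literature (a stretched, clipped binary Concrete distribution). The stretch limits, the temperature, and the function names are the commonly used defaults and are assumptions here; GRACE's actual hyperparameters and network are not given in the summary.

```python
import math
import random

GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0  # stretch limits and temperature (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_gate(log_alpha, rng):
    """Draw one stochastic gate value in [0, 1] for an edge with logit log_alpha."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))  # clipping produces exact zeros and ones

def prob_nonzero(log_alpha):
    """Probability that the gate is non-zero; summed over edges, this is the L0 penalty."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

def deterministic_gate(log_alpha):
    """Gate value used at test time, without sampling noise."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))

rng = random.Random(0)
samples = [sample_gate(0.5, rng) for _ in range(1000)]
kept_edge = deterministic_gate(5.0)      # strongly positive logit
pruned_edge = deterministic_gate(-5.0)   # strongly negative logit
```

Because the stretched sigmoid is clipped, gates with confident logits land exactly on 0 or 1, which is the mechanism behind the bimodal separation of edge scores mentioned above.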
Results
The experiments reveal that GRACE significantly improves F1 scores over its base CI method while maintaining high precision. It also outperforms attention-based and score-based alternatives. In real-world applications, GRACE successfully identifies 9 out of 11 causal edges in the Elbe River dataset with only 1 false positive, achieving an F1 score of 0.86 and an AUROC of 0.99.
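As a quick sanity check, the reported F1 score can be recomputed from the edge counts quoted above: 9 of 11 true edges recovered gives 9 true positives and 2 false negatives, alongside 1 false positive. The function name is hypothetical, and the counts are read directly from the summary rather than from the paper's tables.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw edge counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Elbe River counts from the summary: 9 of 11 edges found, 1 false positive.
precision, recall, f1 = precision_recall_f1(tp=9, fp=1, fn=2)
f1_rounded = round(f1, 2)  # 0.86, matching the reported score
```

Precision is 0.90 and recall is about 0.82, so the quoted F1 of 0.86 is consistent with the stated counts.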
Implications
The GRACE framework has potential applications in various fields requiring causal inference from time series data, such as climate science, finance, and gene regulatory networks. Its ability to handle high-dimensional data and non-stationarity makes it a valuable tool for researchers and practitioners in these domains.
Machine Learning Modeling for Real-Time Melt Pool Monitoring in Laser Powder Bed Fusion Additive Manufacturing: A Hybrid Approach
Computer Vision
Efficient ML
- Developed a hybrid machine learning framework for real-time melt pool monitoring in LPBF.
- Achieved high classification performance with a balanced dataset of 1,200 images.
- Hybrid model outperformed purely deep learning models in terms of accuracy and inference time.
- Demonstrated the potential of combining pretrained CNN features with classical ML methods.
Summary
This paper explores the use of artificial intelligence and machine learning (AI/ML) for real-time monitoring of melt pools in laser powder bed fusion (LPBF) additive manufacturing. The authors developed a binary image classification framework to differentiate between normal and abnormal melt pool images, utilizing a balanced dataset of 1,200 images from Nickel superalloy 625. The study benchmarks three transfer learning architectures (ResNet50, EfficientNetB0, and MobileNetV2) against two Random Forest approaches, one utilizing EfficientNetB0 feature embeddings and the other based on raw pixel features. The models were evaluated on various metrics, including accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC), while also considering training time, inference latency, and resource usage. The hybrid approach combining EfficientNetB0 with Random Forest achieved superior performance, with an F1 score of 0.9451, accuracy of 0.9458, and AUC of 0.9904, while maintaining a fast inference time of 1.15 ms per image. This study highlights the effectiveness of hybrid architectures in achieving a balance between accuracy and computational efficiency for real-time monitoring in data-limited environments.
Methodology
The authors implemented a binary image classification framework using a balanced dataset of 1,200 melt pool images. They benchmarked three transfer learning models (ResNet50, EfficientNetB0, MobileNetV2) against two Random Forest classifiers: a hybrid model trained on EfficientNetB0 feature embeddings and a baseline trained on raw pixel features. The dataset was split into training, validation, and test sets, and the models were evaluated based on accuracy, precision, recall, F1 score, AUC, training time, and inference latency.
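As a reference for the threshold-based metrics listed above, the sketch below derives accuracy, precision, recall, and F1 from confusion-matrix counts. The function name and the example counts are assumptions chosen for illustration; they are not figures from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a balanced 200-image test split (not from the paper).
m = binary_metrics(tp=90, fp=10, fn=10, tn=90)
```

AUC is not covered here because it needs the ranked scores of the classifier rather than a single confusion matrix.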
Results
The hybrid EfficientNetB0-plus-Random Forest approach achieved an F1 score of 0.9451, accuracy of 0.9458, and AUC of 0.9904, with an inference time of 1.15 ms per image. This is roughly 74 times faster than EfficientNetB0 alone (85.47 ms) and roughly 156 times faster than ResNet50 (179.51 ms).
Implications
The findings suggest that hybrid machine learning approaches can effectively address the challenges of real-time monitoring in LPBF additive manufacturing, providing a viable solution for quality assurance in industrial applications. This could lead to improved process optimization and reliability in the production of complex metal components.
RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting
Time Series
- RAVEN addresses the limitations of fixed context windows in financial time series forecasting.
- The model dynamically determines the temporal context for each input sample using a hierarchical windowing approach.
- RAVEN achieves state-of-the-art performance in cumulative log-return prediction and fund sales forecasting.
- The introduction of Correlation-Aware Weighting (CAW) enhances the aggregation of expert outputs.
Summary
The paper introduces RAVEN, a novel Mixture-of-Experts (MoE) framework designed to enhance financial time series forecasting by addressing the limitations of fixed context windows in existing models. Financial time series data, characterized by non-stationarity and regime-dependent temporal dependencies, poses unique challenges that traditional models struggle to manage effectively. RAVEN adapts the temporal context for each input sample by constructing a hierarchy of nested contiguous windows, determined by the data itself, rather than relying on a static look-back period. This approach utilizes a Cumulative Importance Thresholding (CIT) mechanism to score patches of data and route them to scale-specialized experts. Additionally, a Global Compressed Representation (GCR) branch runs in parallel to maintain global temporal coherence. The model also incorporates a Correlation-Aware Weighting (CAW) method to align variable-length outputs from different experts. Experimental results demonstrate that RAVEN significantly outperforms state-of-the-art models in cumulative log-return prediction and fund sales forecasting, showcasing its effectiveness in adapting to the dynamic nature of financial data.
Methodology
RAVEN employs a Mixture-of-Experts framework that utilizes a hierarchical structure of nested contiguous windows for adaptive context selection. It incorporates a Cumulative Importance Thresholding (CIT) mechanism for scoring data patches and routes them to specialized experts. A Global Compressed Representation (GCR) branch is included to maintain global coherence, while Correlation-Aware Weighting (CAW) aligns outputs from experts with variable-length inputs.
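One way to picture cumulative thresholding over patch scores is sketched below. It is a guess at the general idea and not RAVEN's actual CIT rule: it assumes non-negative importance scores per patch, windows that all end at the most recent patch, and a hypothetical set of thresholds. For each threshold it picks the shortest look-back window that holds at least that share of the total importance, which yields nested windows by construction.

```python
def nested_windows(scores, thresholds=(0.5, 0.75, 1.0)):
    """Return look-back window lengths (in patches), shortest first.

    scores: non-negative importance per patch, oldest first (assumed input).
    thresholds: hypothetical shares of total importance each window must cover.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have positive total importance")
    lengths = []
    for t in sorted(thresholds):
        cum = 0.0
        length = len(scores)  # fall back to the full history
        # Walk backwards from the most recent patch.
        for n, s in enumerate(reversed(scores), start=1):
            cum += s
            if cum / total >= t:
                length = n
                break
        lengths.append(length)
    return lengths


# Toy scores: the latest patch carries half of the total importance.
windows = nested_windows([1, 1, 2, 1, 5])
```

In this toy case each window length could be handed to a different scale-specialized expert; how RAVEN routes patches in practice is not specified in the summary.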
Results
RAVEN achieved a 9.2% improvement in Pearson correlation on the HS300 dataset and a 20.2% improvement on the S&P500 dataset for cumulative log-return prediction. It also reduced Mean Squared Error (MSE) by 18.2% in fund sales forecasting and outperformed the baselines on 14 of 16 metrics across four PEMS traffic benchmarks.
Implications
The adaptive context modeling approach of RAVEN can be applied to various financial forecasting tasks, potentially improving investment strategies and risk management. Its methodology may also inspire future research in time series analysis across different domains.
KLip-PPO: A per-sample KL perspective on PPO-Clip
Reinforcement Learning
Optimization
Theory
- Establishes a per-sample equivalence between PPO-Clip and PPO-KL surrogates.
- Demonstrates that both surrogates produce indistinguishable training outcomes on benchmark tasks.
- Clarifies the implicit structure of the PPO-Clip algorithm, highlighting its step function behavior.
- Suggests potential extensions and generalizations of the algorithm for broader applications.
Summary
This paper presents KLip-PPO, a novel perspective on the Proximal Policy Optimization (PPO) algorithm, specifically focusing on the relationship between the clipped surrogate and the Kullback-Leibler (KL) penalty. The authors demonstrate that the gradient of the clipped surrogate can be exactly reproduced by a KL surrogate with a per-sample coefficient that varies based on the importance ratio and advantage. This finding clarifies the structural features of the PPO-Clip algorithm, revealing that its implicit per-sample penalty acts as a step function at the trust region boundary. The authors validate their theoretical claims through empirical experiments on five MuJoCo continuous-control benchmarks, showing that both the clipped and KL surrogates yield indistinguishable training curves. The paper also discusses potential extensions of the algorithm, including soft relaxations of the step function and off-policy adaptations, thereby providing a foundation for future research in policy optimization.
Methodology
The authors derive a per-sample gradient identity that connects the clipped surrogate and the KL penalty in PPO. They validate this identity through empirical experiments on five continuous-control benchmarks in the MuJoCo environment, comparing the performance of both surrogates.
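The step-function behavior can be checked on the standard per-sample PPO-Clip surrogate, min(r·A, clip(r, 1-ε, 1+ε)·A). The sketch below covers only the clipped side: the gradient with respect to the ratio r is either the advantage A or zero, switching at the trust-region boundary. The exact form of the paper's per-sample KL coefficient is not given in the summary and is not reproduced here; function names are illustrative.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard per-sample PPO-Clip objective."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def clipped_grad(ratio, advantage, eps=0.2):
    """Analytic d/d(ratio) of the surrogate, valid away from the kinks."""
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0  # already pushed far enough up: gradient switched off
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # already pushed far enough down: gradient switched off
    return advantage


def finite_diff(ratio, advantage, eps=0.2, h=1e-6):
    """Central-difference check of the analytic gradient."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2 * h)
```

Evaluating `clipped_grad` on a grid of ratios for a fixed positive advantage gives a flat line at A that drops to zero past 1 + ε, which is the step shape the summary refers to.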
Results
The experiments reveal that the training curves for both the PPO-Clip and PPO-KL surrogates are indistinguishable across all tested benchmarks, reinforcing the theoretical findings regarding their equivalence.
Implications
The findings suggest that the PPO-Clip algorithm can be understood and potentially improved through the lens of KL regularization, opening avenues for new algorithmic designs and optimizations in reinforcement learning.
MGI: Member vs Generated Inference
Generative Models
Computer Vision
- Introduction of the Member vs Generated Inference (MGI) task.
- Existing methods inadequately distinguish between training members and generated samples.
- Proposed Data Circuit Breaker (DCB) method effectively addresses MGI.
- DCB combines signals from an autoencoder and latent generator for improved accuracy.
Summary
The paper addresses the challenge of distinguishing between training data and generated samples from generative models, formalized as the Member vs Generated Inference (MGI) task. As generative models produce outputs that are increasingly indistinguishable from real data, it becomes crucial to identify whether a given sample was part of the training set or generated by the model. The authors highlight the inadequacies of existing membership inference and attribution methods, which often misclassify generated samples as training members and vice versa due to overlapping likelihood signals. To tackle this issue, they propose a novel method called Data Circuit Breaker (DCB), which operates in three stages: (1) an autoencoder-based filtering step to identify generated samples, (2) a membership inference step for non-generated samples using the latent generator, and (3) a cross-generator attribution step to compare probabilities across different models. The DCB method effectively addresses the MGI challenge across various generative models, including image autoregressive and diffusion models, and demonstrates robustness even in cases of training data memorization.
Methodology
The proposed Data Circuit Breaker (DCB) method consists of three stages: (1) filtering generated samples using an autoencoder, (2) performing membership inference on non-generated samples with the latent generator, and (3) conducting cross-generator attribution to compare log-probabilities across different models.
Results
The DCB method consistently outperformed existing membership inference and attribution methods, effectively distinguishing between training members and generated samples across various generative models, even in challenging scenarios involving memorization of training data.
Implications
The findings have significant implications for the security and reliability of generative models, particularly in applications where distinguishing between real and generated data is critical, such as in content creation, data privacy, and model training protocols.
Managing Task Execution for Unknown Workloads in Batteryless IoT: A Hardware-Agnostic Evaluation
Reinforcement Learning
Optimization
Efficient ML
- Introduction of hardware-agnostic dynamic scheduling strategies for batteryless IoT systems.
- Comparison of model-free Reinforcement Learning and Approximated Prediction methods against traditional approaches.
- Evaluation of methods using real-world solar data and dynamic transmission profiles.
- Identification of operational trade-offs among different scheduling strategies.
Summary
This paper addresses the challenges of managing task execution in batteryless IoT systems that rely on energy harvesting. As IoT applications become more complex, traditional energy-aware schedulers struggle with unpredictable workloads due to their reliance on static execution thresholds and hardware-specific task profiles. The authors propose two novel, hardware-agnostic dynamic scheduling strategies: a model-free Reinforcement Learning (RL) agent and an on-the-fly Approximated Prediction (AP) method. These methods treat applications as a 'black box' and do not require prior energy information. The performance of these strategies is evaluated against an adaptive task rate approach (AsTAR) and optimized static thresholds using a custom-built simulation framework that incorporates real-world solar data and dynamic LoRa transmission profiles. The analysis reveals distinct operational trade-offs among the methods: the AP approach achieves lightweight, near-optimal task throughput; the RL agent offers tunable survival-execution balancing; and AsTAR excels in pacing execution over long energy gaps. The findings suggest that while advanced strategies enhance resilience in severely constrained systems, devices with larger energy buffers can effectively utilize simpler static policies.
Methodology
The authors developed a simulation framework to evaluate three dynamic task execution methods: a model-free Reinforcement Learning agent, an Approximated Prediction method, and an adaptive task rate approach (AsTAR). These methods were compared against optimized static thresholds, focusing on their performance under varying energy conditions and workloads without requiring prior knowledge of task energy profiles.
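The static-threshold baseline can be pictured with a toy energy-buffer simulation. Everything in this sketch is an assumption made for illustration: integer energy units, the brown-out rule (a task started with too little energy drains the buffer), and all names. It is not the paper's simulation framework.

```python
def simulate_static_threshold(harvest, task_cost, threshold, capacity):
    """Start a task whenever stored energy reaches `threshold`.

    The scheduler does not know `task_cost`; if the buffer holds less than
    the task needs, the attempt fails and the stored charge is lost.
    Returns (completed_tasks, failed_attempts).
    """
    energy, completed, failures = 0, 0, 0
    for harvested in harvest:
        energy = min(capacity, energy + harvested)
        if energy >= threshold:
            if energy >= task_cost:
                energy -= task_cost
                completed += 1
            else:
                energy = 0  # brown-out: task aborted, charge wasted
                failures += 1
    return completed, failures


trace = [2] * 8  # hypothetical constant harvest per time step
well_tuned = simulate_static_threshold(trace, task_cost=5, threshold=6, capacity=10)
too_eager = simulate_static_threshold(trace, task_cost=5, threshold=4, capacity=10)
```

With these toy numbers the well-tuned threshold completes three tasks, while the lower threshold fires too early and never completes one. This is the kind of sensitivity to unknown task costs that motivates the dynamic strategies.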
Results
The study found that the Approximated Prediction method provided near-optimal task throughput, while the Reinforcement Learning agent allowed for flexible balancing between survival and execution. The AsTAR method was effective in pacing execution during long energy gaps. The results highlighted the trade-offs between task throughput, recovery time, and execution pacing across the different methods.
Implications
The findings suggest that dynamic scheduling strategies can significantly enhance the reliability of batteryless IoT systems under unpredictable energy conditions. This research could inform the design of future IoT devices, particularly in applications where energy harvesting is critical. Additionally, it emphasizes the potential for simpler static policies in systems with larger energy storage capacities.
FlowPipe: LLM-Enhanced Conditional Generative Flow Networks for Data Preparation Pipeline Construction
Reinforcement Learning
Generative Models
Large Language Models
- FlowPipe automates data preparation pipeline construction, addressing the combinatorial complexity and high evaluation costs.
- It utilizes Conditional Generative Flow Networks for effective credit assignment and decision-making.
- Deep Semantic Modulation enhances the context-awareness of the pipeline construction process.
- The framework significantly outperforms existing methods, improving accuracy and training speed.
Summary
The paper introduces FlowPipe, a novel framework designed to automate the construction of data preparation pipelines, which are crucial for enhancing data quality in machine learning workflows. Traditional methods face challenges due to the combinatorial complexity of operator sequences and high computational costs associated with end-to-end evaluations. FlowPipe addresses these issues by reformulating pipeline synthesis as conditional probabilistic flow generation over a directed acyclic graph. It employs Conditional Generative Flow Networks (C-GFlowNets) optimized through a Trajectory Balance objective, ensuring effective credit assignment from validation rewards to early actions. Additionally, it incorporates Deep Semantic Modulation via Feature-wise Linear Modulation (FiLM) to integrate logical priors from large language models (LLMs), enhancing the context-awareness of the decision-making process. To improve exploration efficiency, FlowPipe integrates failure awareness to prune invalid states early. Experimental results demonstrate that FlowPipe significantly outperforms state-of-the-art baselines, achieving an average accuracy improvement of 11.96% and a 12.5× speedup in training convergence across 74 real-world datasets.
Methodology
FlowPipe employs Conditional Generative Flow Networks (C-GFlowNets) optimized via a Trajectory Balance objective to reformulate pipeline synthesis. It integrates Deep Semantic Modulation through Feature-wise Linear Modulation (FiLM) to incorporate logical priors from LLMs, and it enhances exploration efficiency by incorporating failure awareness to prune invalid pipeline states.
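For readers unfamiliar with the two named building blocks, the sketch below writes out the generic Trajectory Balance loss for a single trajectory and the generic FiLM operation. Both follow their standard textbook forms; how FlowPipe parameterizes the flow, derives rewards from validation scores, or produces the FiLM parameters from LLM priors is not stated in the summary, so the inputs here are plain numbers chosen for illustration.

```python
import math


def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, reward):
    """(log Z + sum log P_F - log R - sum log P_B)^2 for one trajectory."""
    if reward <= 0:
        raise ValueError("reward must be positive to take its log")
    forward = log_z + sum(log_pf_steps)
    backward = math.log(reward) + sum(log_pb_steps)
    return (forward - backward) ** 2


def film(features, gamma, beta):
    """Feature-wise Linear Modulation: per-feature scale and shift."""
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


# Toy trajectory: two forward choices with probability 0.5 each and a
# deterministic backward policy (log-probability 0 per step).
steps_f = [math.log(0.5), math.log(0.5)]
steps_b = [0.0, 0.0]
balanced = trajectory_balance_loss(math.log(2.0), steps_f, steps_b, reward=0.5)
unbalanced = trajectory_balance_loss(math.log(2.0), steps_f, steps_b, reward=1.0)
modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

The loss is zero when the forward flow through a trajectory matches its reward, which is how a terminal reward propagates back to early actions.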
Results
The framework was evaluated on two benchmark suites with 74 real-world datasets, showing an average accuracy improvement of 11.96% over state-of-the-art methods and achieving a 12.5× speedup in training convergence.
Implications
FlowPipe's advancements in automating data preparation pipelines can significantly enhance the efficiency and effectiveness of machine learning workflows, making it easier for practitioners to ensure high data quality without extensive manual intervention.
An LLM-based Two-Stage Transformer Framework for Cross-Domain Bearing Fault Diagnosis with Limited Data
Time Series
- Proposes a two-stage transfer learning framework for bearing fault diagnosis under limited data.
- Introduces explicit knowledge transfer mechanisms for improved performance in dual-shift scenarios.
- Develops a dynamic classification head for seamless adaptation across heterogeneous fault taxonomies.
- Achieves significant accuracy improvements over existing methods with minimal labeled data.
Summary
This paper addresses the challenges of bearing fault diagnosis in industrial environments, particularly when faced with dataset heterogeneity, variations in operating conditions, and limited labeled data. Existing methods typically tackle these issues separately, which limits their effectiveness in real-world applications. The authors propose a novel knowledge-guided two-stage transfer learning framework that utilizes a lightweight GPT-2-style Transformer architecture with causal self-attention for hierarchical feature extraction from vibration signals. This framework establishes explicit pathways for knowledge transfer from multi-source pre-training to target adaptation, effectively addressing the dual-shift challenge through multi-source learning, prototype-based knowledge modulation, and taxonomy-adaptive classification. The experimental results demonstrate that the proposed framework achieves an average accuracy of 92.61% using only 10% of labeled target data, outperforming state-of-the-art methods by 17.24 percentage points, thereby offering a practical solution for predictive maintenance in Industry 4.0 applications.
Methodology
The proposed framework consists of two main stages: the first focuses on multi-source learning to create generalizable representations, while the second involves prototype-based knowledge modulation for adapting to the target domain. The architecture leverages a lightweight Transformer model for feature extraction and employs explicit mechanisms for knowledge transfer, including parameter-level initialization and feature-level modulation.
Results
The framework was validated on four real-world datasets, achieving an average accuracy of 92.61% with only 10% of labeled target data. This performance surpasses state-of-the-art methods by 17.24 percentage points, demonstrating the effectiveness of the proposed approach in handling the dual-shift challenge.
Implications
The findings suggest that the proposed framework can significantly enhance predictive maintenance strategies in various industrial sectors by enabling accurate fault diagnosis even with limited labeled data. This could lead to reduced downtime and maintenance costs, contributing to the efficiency of Industry 4.0 applications.
The Geometry of Sequential Learning: Lie-Bracket Prediction of Transfer Order
Theory
Optimization
Large Language Models
- Introduces a commutator theory of transfer order that connects order-dependent target loss to directional bracket scores.
- Presents a drift-matched Trotter estimator for efficient pairwise planning using gradients and Hessian-vector products.
- Develops Lie-Bracket Tournaments for scalable scheduling of multiple domains, avoiding exhaustive permutation evaluations.
- Validates the approach across various post-training and domain adaptation scenarios, demonstrating robust performance.
Summary
This paper addresses the order-dependent nature of sequential learning in machine learning, particularly in the context of adapting models across multiple source domains. The author introduces a geometric framework based on the Lie-bracket commutator of gradient update fields, which quantifies the local order effects in training sequences. By defining a pairwise score that predicts whether training on source A followed by B (A → B) or vice versa (B → A) is more effective for a target domain, the paper proposes a novel scheduling method called the Lie-Bracket Tournament. This method allows for efficient ranking of multiple source domains without the need to evaluate all possible N! permutations, significantly reducing computational complexity. The empirical results demonstrate high accuracy in predicting optimal training orders across various tasks, including instruction-based fine-tuning and preference optimization, while maintaining the integrity of pretraining-domain evidence. The findings suggest that the proposed geometric approach can effectively guide curriculum learning in complex machine learning scenarios.
Methodology
The methodology involves deriving a pairwise geometric primitive that quantifies the order dependence of training sequences using the Lie-bracket commutator. The paper employs a drift-matched scoring system to evaluate target gradients at a reference point, facilitating efficient computation of pairwise scores. The Lie-Bracket Tournament framework is then utilized to rank multiple source domains based on these scores, significantly reducing the computational burden associated with exhaustive curriculum searches.
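The link between training order and a commutator can be seen on a toy problem. For one gradient step of size η on each of two losses, the gap between the A → B and B → A end points is η²(H_B g_A - H_A g_B) to second order, and exactly that for quadratic losses. The sketch below checks this numerically. The quadratic losses, the target gradient `g_T`, and the sign convention of the score are all assumptions for illustration; the paper's drift-matched estimator is not reproduced.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def grad(h, center, theta):
    """Gradient of 0.5 * (theta - center)^T H (theta - center)."""
    return matvec(h, [t - c for t, c in zip(theta, center)])


def sgd_step(h, center, theta, lr):
    return [t - lr * g for t, g in zip(theta, grad(h, center, theta))]


# Two hypothetical source losses with different curvature and minima.
H_A, c_A = [[2.0, 0.0], [0.0, 1.0]], [1.0, 0.0]
H_B, c_B = [[1.0, 0.5], [0.5, 1.0]], [0.0, 1.0]
theta0, lr = [0.0, 0.0], 0.1

a_then_b = sgd_step(H_B, c_B, sgd_step(H_A, c_A, theta0, lr), lr)
b_then_a = sgd_step(H_A, c_A, sgd_step(H_B, c_B, theta0, lr), lr)
order_gap = [x - y for x, y in zip(a_then_b, b_then_a)]

# Commutator-style prediction from quantities at the reference point only.
g_A, g_B = grad(H_A, c_A, theta0), grad(H_B, c_B, theta0)
bracket = [p - q for p, q in zip(matvec(H_B, g_A), matvec(H_A, g_B))]
predicted_gap = [lr ** 2 * b for b in bracket]

# First-order effect on a target loss with hypothetical gradient g_T:
# a negative score means A -> B ends at a lower target loss than B -> A.
g_T = [1.0, 0.0]
score = sum(g * d for g, d in zip(g_T, predicted_gap))
```

Only gradients and Hessian-vector products at the reference point are needed for the prediction, which is why such a score can be cheaper than running both orders.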
Results
The empirical evaluation shows that the proposed planner achieves 98.1% and 98.9% pairwise accuracy for instruction-SFT and DPO at k=1, maintaining over 72% accuracy at k=20. The method successfully recovers the best training schedules in 87.5% of trials and ranks programming languages for a Python target in the 99th percentile. Additionally, it demonstrates high performance on 56 MMLU subjects, surpassing traditional gradient-norm baselines.
Implications
The findings have significant implications for curriculum learning in machine learning, particularly in optimizing training sequences for models that undergo multiple adaptations. The geometric framework provides a theoretical basis for understanding order effects, while the practical scheduling method can enhance the efficiency and effectiveness of training pipelines in various applications.
Exact Schur-Sylvester Dimensionality Reductions for Non-Smooth Stochastic Complexity and Manifold Sampling
Theory
Efficient ML
Optimization
- Introduces a new formulation for computing NML code-length that bypasses traditional computational bottlenecks.
- Reduces the complexity of projection and volume factor calculations from O(N^3) to O(k^3 + N^2k).
- Generalizes the method to various non-smooth estimators including Sparse SVMs and Elastic Net.
- Demonstrates significant speedup in sampling efficiency on high-dimensional datasets.
Summary
This paper addresses the computational challenges associated with the exact computation of the Normalized Maximum Likelihood (NML) code-length for regular non-smooth estimators, such as Lasso and Sparse Support Vector Machines (SVMs). Traditional methods face significant computational bottlenecks due to the cubic scaling of manifold-constrained projection and volume integration. The authors propose a novel formulation that leverages the block Schur complement and Sylvester’s determinant identity to reduce the computational complexity from O(N^3) to O(k^3 + N^2k) per step. This reduction allows for efficient sampling on high-dimensional datasets while maintaining numerical stability. The paper also includes a rigorous numerical stability analysis and empirical benchmarks demonstrating a speedup exceeding 14,100×, making non-smooth NML estimation feasible for large-scale statistical inference.
Methodology
The authors utilize block Schur complement and Sylvester’s determinant identity to reformulate the computations involved in the Propose-and-Project Metropolis–Hastings (PPMH) sampler. This approach allows for a more efficient evaluation of projection operators and volume factors, decoupling the computational cost from the ambient data dimension.
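Sylvester's determinant identity states that det(I_N + UV) = det(I_k + VU) for an N×k matrix U and a k×N matrix V, so an N×N determinant can be replaced by a k×k one. The sketch below verifies the identity on a small hand-made example with N = 4 and k = 2. The matrices and helper names are illustrative, and the way the paper applies the identity inside the PPMH sampler is not reproduced.

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, sign, result = len(a), 1.0, 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign
        result *= a[col][col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
    return sign * result


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def identity_plus(m):
    return [[m[i][j] + (1.0 if i == j else 0.0) for j in range(len(m))]
            for i in range(len(m))]


U = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]  # N x k
V = [[0.5, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 2.0]]      # k x N

big = det(identity_plus(matmul(U, V)))    # cost grows with N^3
small = det(identity_plus(matmul(V, U)))  # cost grows with k^3
```

When k is much smaller than N, working with the k×k side is where the saving in the cubic term comes from.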
Results
Empirical benchmarks indicate that the proposed method achieves a constant speedup exceeding 14,100× compared to traditional methods, while maintaining double-precision numerical equivalence. This demonstrates the method's effectiveness in handling large-scale statistical inference tasks.
Implications
The findings suggest that the proposed Schur-Sylvester dimensionality reduction technique can significantly enhance the efficiency of statistical inference methods in high-dimensional settings, making it applicable in various fields such as machine learning, statistics, and data science.
Fast and Slow Variational Continual Learning
Optimization
Large Language Models
Theory
- Introduces Continual IVON (CoVON) optimizer for continual learning.
- Incorporates fast and slow adaptation mechanisms into the VCL framework.
- Merges past posteriors to create a slow-moving prior for knowledge retention.
- Demonstrates superior performance over existing VCL optimizers and weight-regularization strategies.
Summary
This paper addresses the challenge of continual learning in deep networks, particularly focusing on the limitations of existing optimizers that do not inherently support continual adaptation. The authors propose a novel approach called Continual IVON (CoVON), which integrates the concept of 'fast and slow adaptation' into the Variational Continual Learning (VCL) framework. By merging past posteriors to create a slow-moving prior, CoVON aims to stabilize knowledge retention while allowing for rapid adaptation to new tasks. The method is implemented using the Improved Variational Online Newton optimizer, which closely resembles the Adam optimizer in terms of form and computational cost. The authors demonstrate that CoVON consistently outperforms existing VCL optimizers and other weight-regularization strategies across various tasks, including domain-incremental learning and fine-tuning of large language models. This work highlights the potential for optimizing continual learning processes by embedding stability and plasticity directly into the optimization framework.
Methodology
The authors utilize the Variational Continual Learning (VCL) framework, where past posteriors are merged to create a slow-moving prior. This prior is then used in conjunction with fast-weight updates to adapt to new tasks. The implementation leverages the Improved Variational Online Newton optimizer, maintaining a structure similar to Adam.
Results
The CoVON optimizer was shown to consistently improve performance in continual learning scenarios compared to existing VCL optimizers and weight-regularization methods. The experiments conducted across various tasks, including domain-incremental learning and continual pre-training of large language models, confirmed its effectiveness.
Implications
The proposed CoVON optimizer has significant implications for the development of more robust continual learning systems, particularly in dynamic environments where data and tasks evolve over time. This approach could enhance the performance of large language models and other deep learning applications that require continual adaptation without catastrophic forgetting.
Offline Reinforcement Learning for Warehouse SLAM Throughput Control
Reinforcement Learning
Optimization
- Introduces an offline RL framework for optimizing SLAM throughput in warehouses.
- Employs a history-informed state representation and a compact action space to enhance learning.
- Utilizes a balanced reward function to address upstream and downstream operational metrics.
- Demonstrates superior performance of the CQL policy in improving system health and reducing throttling.
Summary
This paper presents an offline reinforcement learning (RL) framework aimed at optimizing SLAM (Scan/Label/Apply/Manifest) throughput control in warehouse fulfillment environments. The authors highlight the importance of SLAM throughput in influencing system congestion and operational efficiency. The proposed RL-based control approach dynamically recommends SLAM throughput settings that balance throughput maximization with downstream stability through intelligent throttling adjustments. Key contributions include a history-informed state representation that captures system dynamics, an action space abstraction that reduces complexity, and a balanced reward function that considers both upstream and downstream metrics. The framework is algorithm-agnostic, allowing for the integration of various offline RL methods. The authors evaluate their approach using three state-of-the-art offline RL algorithms trained on historical operational logs from a large-scale warehouse. The performance of the policies is assessed through a multi-method strategy, including model-free and model-based evaluations. The empirical results demonstrate that the CQL policy outperforms other alternatives, significantly improving system health and reducing throttling duration, showcasing the potential of offline RL for safe and scalable warehouse throughput optimization.
Methodology
The authors employ an algorithm-agnostic deep reinforcement learning approach, utilizing a history-informed state representation, a transformed action space for reduced dimensionality, and a custom reward function that balances upstream and downstream metrics. They evaluate their framework using three offline RL algorithms (BCQ, CQL, TD3+BC) and assess policy performance through a combination of model-free and model-based evaluation methods.
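To give a feel for why CQL behaves conservatively on logged data, the sketch below computes the standard discrete-action CQL regularizer for a single state: the log-sum-exp of all Q-values minus the Q-value of the logged action. It assumes a small discrete set of throughput settings and made-up Q-values. The paper's state features, action abstraction, and reward are not reproduced.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values, logged_action):
    """Conservative term for one state: logsumexp_a Q(s, a) - Q(s, a_logged).

    It is small when the logged action already has the highest Q-value and
    large when the network prefers an action the logs never took.
    """
    return logsumexp(q_values) - q_values[logged_action]


# Hypothetical Q-values for three throughput settings in one state.
q = [5.0, 0.0, 0.0]
agrees_with_logs = cql_penalty(q, logged_action=0)
disagrees_with_logs = cql_penalty(q, logged_action=1)
```

Adding this term to the usual temporal-difference loss pushes down Q-values of actions that the historical logs do not support.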
Results
The CQL policy consistently outperformed other evaluated policies, achieving a 22.97% improvement in system health and a 3.18% reduction in average throttling duration. These results indicate the effectiveness of the proposed RL framework in optimizing warehouse throughput control.
Implications
The findings suggest that offline reinforcement learning can be effectively applied to complex warehouse operations, potentially leading to enhanced operational efficiency and reduced congestion in fulfillment environments. This approach may also be adapted for other industrial automation scenarios.
QC-SMOTE: Quality-Controlled SMOTE for Imbalanced Classification
Theory
Efficient ML
Optimization
- QC-SMOTE improves the generation of synthetic samples by assessing their reliability and quality.
- The method adapts its sampling strategy based on the local geometry of the data.
- Experiments show significant performance improvements in AUC-ROC and Macro F1 scores compared to existing methods.
- QC-SMOTE provides a graceful degradation mechanism by reverting to duplication in noisy regions.
Summary
QC-SMOTE introduces a novel framework for addressing class imbalance in classification tasks, where traditional methods like SMOTE often generate low-quality synthetic samples in noisy or overlapping regions. The proposed method estimates the reliability of minority samples using a composite neighbourhood trustworthiness score that incorporates local density, safe-level characteristics, and isolation from the majority class. QC-SMOTE employs an IPQ-guided best-of-K strategy to generate multiple synthetic candidates, evaluating their quality before inclusion in the training set. This adaptive approach adjusts the interpolation range and selection criteria based on the local data geometry, ensuring that low-quality samples are replaced with original minority duplicates when necessary. Experiments conducted on 30 imbalanced datasets demonstrate that QC-SMOTE outperforms existing oversampling methods, achieving the highest average AUC-ROC and Macro F1 scores, particularly under moderate to severe imbalance conditions. The results underscore the importance of quality-aware and geometry-adaptive synthetic sampling in enhancing the robustness of imbalanced classification.
Methodology
QC-SMOTE utilizes a composite neighbourhood trustworthiness score to evaluate the reliability of minority samples. It generates synthetic candidates using an IPQ-guided best-of-K strategy, assessing local purity and majority clearance. The method adapts its generation behavior based on the overlap-imbalance regime, ensuring that low-quality samples are replaced with original duplicates when necessary.
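The generate-then-filter idea with a duplication fallback can be sketched as follows. The quality score used here (distance to the nearest majority point divided by distance to the nearest minority point) is a simple stand-in and is not the paper's IPQ measure. The acceptance threshold, the value of K, and all names are likewise assumptions.

```python
import math
import random


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def best_of_k(sample, neighbour, minority, majority, k=5, min_score=1.0, rng=None):
    """Interpolate K SMOTE-style candidates and keep the best-scoring one.

    Returns (point, accepted). If even the best candidate scores below
    `min_score`, the original minority sample is duplicated instead.
    """
    rng = rng or random.Random(0)
    best, best_score = None, -math.inf
    for _ in range(k):
        lam = rng.random()
        cand = [a + lam * (b - a) for a, b in zip(sample, neighbour)]
        d_maj = min(dist(cand, m) for m in majority)
        d_min = min(dist(cand, m) for m in minority)
        score = d_maj / (d_min + 1e-12)  # stand-in quality score
        if score > best_score:
            best, best_score = cand, score
    if best_score < min_score:
        return list(sample), False
    return best, True


minority = [[0.0, 0.0], [1.0, 0.0]]
clean_pt, clean_ok = best_of_k(minority[0], minority[1], minority, [[5.0, 5.0]])
# Fully overlapping classes: every candidate is rejected, so duplicate.
noisy_pt, noisy_ok = best_of_k(minority[0], minority[1], minority, minority)
```

In the clean case the synthetic point lies on the segment between the two minority samples; in the overlapping case the function falls back to a copy of the original sample.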
Results
In experiments on 30 imbalanced datasets, QC-SMOTE achieved the highest average AUC-ROC and Macro F1 scores among the compared oversampling methods, with notable improvements particularly in moderate and severe imbalance scenarios.
Implications
The QC-SMOTE framework can be applied in various domains where class imbalance is prevalent, such as fraud detection, medical diagnosis, and anomaly detection, enhancing the performance of classifiers in recognizing minority patterns.
UC-Search: Risk-Aware Test-Time Search for Delayed Constrained Time-Series Control
Time Series
Reinforcement Learning
Optimization
- UC-Search is a novel framework for risk-aware decision-making in time-series control under uncertainty.
- The methodology combines a feasibility automaton with uncertainty-guided search mechanisms for improved action selection.
- Empirical results indicate significant performance gains over traditional methods in various test scenarios.
- The paper establishes theoretical foundations for when bounded lookahead can enhance decision-making.
Summary
The paper introduces UC-Search, a model-agnostic test-time wrapper designed for time-series control problems that require delayed decisions under uncertainty and strict feasibility constraints. Unlike traditional forecasting models that focus solely on prediction accuracy, UC-Search integrates a feasibility automaton and a bounded search mechanism to derive risk-adjusted actions based on uncertainty. The methodology includes UC-Beam and a UCT-style UC-MCTS for path scoring and action selection, leveraging different types of uncertainty (epistemic, aleatoric, and propagated) as risk terms. The authors present a myopic-collapse/separation theorem that delineates conditions under which bounded lookahead can enhance decision-making. Empirical results demonstrate that UC-Search outperforms existing methods like CEM, MPPI, and risk-aware random approaches across various test scenarios, including a public delayed-control suite and inventory management tasks. The findings suggest that incorporating uncertainty into decision-making processes can yield significant improvements in performance, particularly in environments characterized by delayed feedback and strict constraints.
Methodology
The methodology involves a test-time decision search framework that integrates a forecast-trace, a hard-feasibility automaton, and an uncertainty-guided first-action search layer. It employs UC-Beam for bounded search and UC-MCTS for diagnostics, utilizing uncertainty estimates to inform path scoring and action selection.
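A bounded, feasibility-filtered beam search that returns only the first action might look like the sketch below. It assumes a deterministic `step` function, a boolean `feasible` check standing in for the hard-feasibility automaton, and a score of forecast reward minus `lam` times an uncertainty term; all of these names and the scoring rule are illustrative assumptions, not the paper's definitions.

```python
def uc_beam_first_action(state, actions, step, feasible, reward, risk,
                         horizon=3, beam=4, lam=1.0):
    """Return the first action of the best risk-adjusted feasible path.

    Each beam entry is (score, state, first_action). Infeasible successors
    are pruned before scoring, so they can never be selected.
    """
    beams = [(0.0, state, None)]
    for t in range(horizon):
        expanded = []
        for score, s, first in beams:
            for a in actions:
                nxt = step(s, a)
                if not feasible(nxt):
                    continue  # hard constraint: drop the path entirely
                new_score = score + reward(nxt, t) - lam * risk(nxt, t)
                expanded.append((new_score, nxt, a if first is None else first))
        if not expanded:
            break  # no feasible continuation within the horizon
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam]
    best = max(beams, key=lambda b: b[0])
    return best[2]
```

With a toy integer state constrained to the range 0 to 3, a reward equal to the state selects action `1` when the risk term is zero, and action `0` once the risk term outweighs the reward.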
Results
UC-Search demonstrated superior performance in a public 9-family, 33-series delayed-control suite, achieving normalized thresholds significantly higher than validation-selected CEM, MPPI, and risk-aware random methods. In a delayed-inventory validation, it also outperformed classic base-stock control methods, indicating its effectiveness in managing delayed decisions under constraints.
Implications
The findings suggest that UC-Search can be effectively applied in various domains requiring time-series decision-making under uncertainty, such as finance, inventory management, and resource allocation. Its model-agnostic nature allows for broad applicability across different forecasting models.
How Modular Is a Frontier Mixture-of-Experts? A Pre-registered Causal Test in Which Apparent Expert Modularity Mostly Dissolves
NLP
Large Language Models
Interpretability
- Only one out of six pre-registered expert families shows robust modularity.
- Apparent modularity is sensitive to the choice of corpus, metric, and statistical bar.
- Ablation-based assessments of modularity require careful control to avoid misleading conclusions.
- The study provides a pre-registered causal testing framework for evaluating expert modularity.
Summary
This paper investigates the modularity of Sparse Mixture-of-Experts (MoE) models, focusing on the Command A+ model, which has 218 billion parameters and 128 experts. The authors hypothesize that experts in MoE models may form functional modules based on capabilities or languages. To test this, they establish a causal framework with pre-registered hypotheses and a routing-mass atlas. They conduct ablation studies to assess whether removing certain expert families selectively impacts their hypothesized capabilities. The findings reveal that only one of the six pre-registered families, the Arabic-language family, demonstrates robust modularity, while the others exhibit fragile modularity that varies with the choice of corpus, metric, and statistical criteria. This study emphasizes the importance of rigorous testing in evaluating expert modularity in MoE models and provides a cautionary note on the interpretability of such models.
Methodology
The authors employed a pre-registered causal testing protocol to evaluate expert modularity in the Command A+ MoE model. They created a routing-mass atlas and defined six families of experts based on their routing behavior. The study involved ablation experiments where specific families were masked, and their effects on performance were measured across four different metrics. The results were analyzed using bootstrap confidence intervals to ensure robustness.
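One way to read the "selective effect" test is as a bootstrap confidence interval on the difference between the performance drop on the hypothesized capability and the drop on a control capability. The sketch below assumes paired per-item drops and a percentile bootstrap; the function names, the pairing, and the decision rule (lower bound above zero) are assumptions for illustration, not the paper's pre-registered criteria.

```python
import random
import statistics

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def is_selective(target_drop, control_drop, **kwargs):
    """True if the ablation hurts the target more than the control,
    with the whole confidence interval of the paired difference above zero."""
    diffs = [t - c for t, c in zip(target_drop, control_drop)]
    lo, hi = bootstrap_ci(diffs, **kwargs)
    return lo > 0, (lo, hi)
```

A family whose interval straddles zero would not count as modular under this rule, which is the kind of outcome the summary reports for five of the six families.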
Results
The results indicated that only the Arabic-language family exhibited clear modularity, while the other families failed to show selective effects under rigorous testing. The apparent modularity of these families was found to be dependent on the measurement approach, with variations across different corpora and metrics. The study confirmed that the method could detect modularity when present, as evidenced by a positive control.
Implications
The findings suggest that claims of modularity in MoE models should be approached with caution, as they may not hold under rigorous testing conditions. This has implications for the interpretability and safety of AI models, as well as for future research on expert specialization in machine learning.
Data Augmentation: A Fourier Analysis Perspective
Theory
Efficient ML
- Partial data augmentation can achieve statistical benefits comparable to full augmentation using a smaller subset of group elements.
- The theoretical framework employs Fourier analysis and representation theory to analyze the effectiveness of partial augmentation.
- Statistical optimality is maintained as long as the sampled subset size is sufficiently large relative to the invariant dimension.
- Exact invariance cannot be achieved without averaging over the entire group, emphasizing the limitations of partial methods.
Summary
This paper investigates the effectiveness of partial data augmentation in machine learning, particularly when full augmentation is computationally infeasible due to large symmetry groups. Data augmentation is a widely used technique that enhances training datasets by applying transformations based on known symmetries. The authors develop a theoretical framework using Fourier analysis and representation theory to explore whether partial augmentation can achieve statistical benefits similar to those of full augmentation. They demonstrate that for a broad class of learning problems, using a randomly sampled subset of group elements for augmentation can yield the same minimax rates as full augmentation, with the approximation error diminishing as the subset size increases. The paper also establishes that enforcing exact invariance requires averaging over the entire group, highlighting the trade-offs between computational efficiency and statistical optimality. This work provides a unified perspective on both full and partial data augmentation, offering insights into the conditions under which partial methods can be effective.
Methodology
The authors utilize Fourier analysis and representation theory to develop a framework for analyzing data augmentation. They focus on projection-based density and regression estimators to derive theoretical results regarding the effectiveness of partial augmentation.
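The difference between full and partial augmentation can be illustrated on a toy symmetry group. The sketch below uses cyclic shifts of a vector and a linear function `f`; both are assumptions chosen for simplicity and are not the estimators analyzed in the paper. Averaging over all shifts gives an exactly shift-invariant value, while averaging over a random subset only approximates it.

```python
import random

def shift(x, k):
    """Cyclic shift of list x by k positions (the toy group action)."""
    k %= len(x)
    return x[k:] + x[:k]

def f(x, w):
    """A non-invariant linear function of x (hypothetical example)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def augmented(x, w, shifts):
    """Average f over the given group elements (shift amounts)."""
    shifts = list(shifts)
    return sum(f(shift(x, k), w) for k in shifts) / len(shifts)

w = [1, 2, 3, 4, 5, 6, 7, 8]
x = [3, 1, 4, 1, 5, 9, 2, 6]

# Full augmentation: average over the entire group of 8 shifts.
full = augmented(x, w, range(8))

# Partial augmentation: average over 3 randomly sampled shifts.
subset = random.Random(0).sample(range(8), 3)
partial = augmented(x, w, subset)
```

Here `full` equals the mean of `w` times the sum of `x` for any shifted copy of `x`, whereas `partial` generally changes when `x` is shifted, consistent with the statement that exact invariance needs the whole group.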
Results
The study shows that partial data augmentation using a randomly sampled subset of group elements can achieve the same minimax-optimal sample complexity as full augmentation, provided the subset size is appropriately chosen. The results indicate that the required size depends only on the invariant dimension and not on the total size of the symmetry group.
Implications
This research has significant implications for machine learning practitioners, particularly in fields where data augmentation is essential but computational resources are limited. It suggests that effective augmentation strategies can be developed without the need for exhaustive group transformations, thus enabling broader application of augmentation techniques in various domains.
Closed-Loop Graph Algorithm Execution with Small Language Models: Step Accuracy and Rollout Reliability
NLP
Large Language Models
Graph Learning
- Introduces a closed-loop framework for evaluating small language models in graph algorithm execution.
- Demonstrates that high next-step prediction accuracy does not ensure reliable overall execution.
- Identifies significant differences in performance between traversal and weighted graph procedures.
- Utilizes a comprehensive evaluation methodology that includes teacher-forced and autonomous rollout assessments.
Summary
This paper investigates the execution of graph algorithms using small language models (SLMs) in a closed-loop framework, where the model predicts actions that influence subsequent states. The study focuses on understanding the reliability of these models in executing structured algorithms, particularly in scenarios where multiple dependent decisions are made. The evaluation framework encompasses various classical graph procedures and synthetic graph families, assessing both local decision quality and global execution behavior through metrics such as step accuracy and rollout reliability. The findings reveal that while SLMs can achieve high accuracy in predicting individual actions, this does not guarantee reliable execution across complete rollouts, especially for weighted algorithms. The research emphasizes the importance of evaluating algorithmic language models through comprehensive closed-loop rollouts rather than isolated predictions, highlighting the potential for significant discrepancies between local competence and global reliability.
Methodology
The study employs a closed-loop formulation where small language models predict actions based on a textual description of the graph and current state. A deterministic executor applies these actions to transition states. The evaluation involves two regimes: teacher-forced step evaluation, where the model is tested on known correct states, and autonomous rollout evaluation, where the model's predictions influence subsequent states. The experiments cover six graph algorithms across three random graph families, with metrics assessing both local and global execution performance.
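The two evaluation regimes can be sketched as a small harness. The `policy`, `oracle`, and `executor` callables and the toy counting task are hypothetical stand-ins; the paper's models, graph algorithms, and state encodings are not reproduced here.

```python
def teacher_forced_accuracy(policy, oracle, executor, init_state, n_steps):
    """Step accuracy when the model always sees the correct state."""
    state, correct = init_state, 0
    for _ in range(n_steps):
        gold = oracle(state)
        correct += policy(state) == gold
        state = executor(state, gold)  # advance along the reference trajectory
    return correct / n_steps

def autonomous_rollout_ok(policy, oracle, executor, init_state, n_steps):
    """True only if every action in a self-driven rollout is correct."""
    state = init_state
    for _ in range(n_steps):
        action = policy(state)
        if action != oracle(state):
            return False  # one wrong action breaks the whole execution
        state = executor(state, action)
    return True

# Toy task: the state is an integer and the correct action is state + 1.
oracle = lambda s: s + 1
executor = lambda s, a: a
flawed_policy = lambda s: s + 2 if s == 5 else s + 1  # wrong on one state only

step_acc = teacher_forced_accuracy(flawed_policy, oracle, executor, 0, 10)
rollout_ok = autonomous_rollout_ok(flawed_policy, oracle, executor, 0, 10)
```

In this toy case the policy is right on nine of ten teacher-forced steps, yet the autonomous rollout fails, which is the gap between local accuracy and global reliability that the study measures.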
Results
The results indicate that traversal policies maintain high step accuracy throughout complete rollouts, while weighted algorithms exhibit significant gaps between step accuracy and overall execution reliability. The study finds that even with high accuracy in predicting individual actions, the models struggle to maintain correctness across entire executions, particularly in weighted scenarios. The analysis of prefix survival and intervention diagnostics reveals that aggregate performance scores can mask critical differences in execution behavior.
Implications
The findings suggest that while small language models can be effective in executing structured algorithms, their reliability in autonomous settings is limited, particularly for complex decision-making tasks. This research underscores the need for improved evaluation methods that account for the cumulative impact of prediction errors, which could inform the development of more robust algorithmic language models.
Sesame: Structure-Aware Molecular Generation via Spatial Density-Map Conditioning
Generative Models
- Introduces a novel density map conditioning architecture for structure-aware molecular generation.
- Supports both de novo generation and fragment-conditioned lead optimization through a unified conditioning mechanism.
- Implements a hybrid discrete-continuous diffusion process for effective molecular generation.
- Utilizes trajectory finetuning to enhance the quality of generated molecules.
Summary
The paper introduces Sesame, a novel diffusion-based molecular generation model designed to enhance drug discovery by understanding small molecule structures and protein-ligand interactions. Sesame employs a unique spatial pairformer module that conditions on partial molecular structures and surrounding protein pockets, represented as continuous spatial density maps. This approach allows for both de novo molecular generation and fragment-conditioned lead optimization, enabling medicinal chemists to refine existing compounds effectively. The model integrates a hybrid diffusion process that manages both discrete atom types and continuous coordinates, alongside a trajectory finetuning scheme that improves generation quality by training on the model's own sampling rollouts. Sesame is trained on extensive datasets of ligands and protein-ligand complexes, demonstrating its capability to generate chemically sensible and pocket-compatible molecules while retaining significant portions of the input fragments.
Methodology
Sesame employs a diffusion-based generative model that utilizes a spatial pairformer module to condition on spatial density maps representing molecular structures and protein pockets. The model integrates a hybrid diffusion process to handle both discrete and continuous data types, and it incorporates a trajectory finetuning mechanism that leverages the model's own sampling outputs for improved training.
Results
The model demonstrates a 94.8% retention rate of the seeding fragment as a substructure in generated molecules during fragment-conditioned generation. This indicates that the conditioning mechanism effectively guides the molecular generation process. The overall quality of generated molecules is enhanced through the trajectory finetuning approach.
Implications
Sesame's ability to generate and optimize molecular structures has significant implications for drug discovery, allowing for more efficient lead optimization and the exploration of chemical space. The integration of human insights with generative chemistry can streamline the drug design process, potentially leading to faster development of new therapeutics.
Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion
Large Language Models
Efficient ML
Theory
- Introduces the Typical-Acceptance Invariance Screen (TAIS) for assessing safety in speculative decoding.
- Demonstrates no detectable safety divergence in outputs from speculative versus target-only decoding at temperature zero.
- Utilizes a large dataset of 64,855 samples to validate findings across multiple model configurations.
- Establishes a clear boundary between inference-time acceleration and other safety considerations in model training and deployment.
Summary
This paper investigates the safety implications of speculative decoding in large language models (LLMs) at temperature zero, where a draft model proposes tokens for a target model to verify. The author introduces the Typical-Acceptance Invariance Screen (TAIS), a behavioral-equivalence framework designed to assess whether draft-side behavior can leak into safety-scored outputs. The study employs a comprehensive dataset comprising 16,783 confirmatory samples and 44,066 matched expansion samples, testing various model configurations, including Llama and Qwen families, under different execution settings. The findings reveal no detectable safety divergence between speculative and target-only decoding outputs, with the largest observed effect size (Cohen's h) being 0.024, well below the trivial-effect threshold. The paper emphasizes the importance of maintaining safety during inference-time acceleration and provides a structured approach to evaluate safety in speculative decoding scenarios.
Methodology
The study employs the Typical-Acceptance Invariance Screen (TAIS) to compare outputs from speculative decoding and target-only decoding. It measures byte-identity, evaluates equivalence using TOST tests at ±3pp, and computes per-task Cohen's h to assess safety divergence. The methodology includes a confirmatory core of 16,783 samples and an expansion of 44,066 samples across various model configurations and execution settings.
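The two statistics named above can be computed as in the following sketch. Cohen's h uses its standard arcsine definition; the equivalence test is written as a Wald z-based TOST for two independent proportions, which is an assumption, since the summary does not say which TOST variant the paper uses. The function names and the example counts are hypothetical.

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Effect size between two proportions (arcsine-transformed difference)."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

def tost_two_proportions(x1, n1, x2, n2, margin=0.03, alpha=0.05):
    """Two one-sided z-tests: is |p1 - p2| inside +/- margin (3pp by default)?"""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        return abs(diff) < margin  # degenerate case: no sampling variance
    norm = NormalDist()
    p_lower = 1 - norm.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = norm.cdf((diff - margin) / se)      # H0: diff >= +margin
    return max(p_lower, p_upper) < alpha
```

Equivalence is declared only when both one-sided tests reject, so a small sample with the same observed gap can fail the screen simply because its interval is too wide.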
Results
The results indicate that speculative decoding at temperature zero does not lead to safety divergence from target-only decoding, with a maximum Cohen's h of 0.024. Out of 27 per-task contrasts, 25 passed the TOST equivalence criteria, and the DPO-adversarial draft produced byte-identical outputs to the canonical draft across 4,006 samples. A separate 70B production-scale probe showed AdvBench refusal rates consistent with expectations, although it was not counted as a TAIS pass due to the lack of a matched target-only arm.
Implications
The findings suggest that speculative decoding can be safely implemented in LLMs without compromising safety-scored outputs, which is crucial for real-world applications where model safety is paramount. The TAIS framework can be utilized in future research to evaluate safety in other machine learning contexts, particularly in inference-time acceleration scenarios.
Emergent Capabilities Arise Randomly from Learning Sparse Attention Patterns
NLP
Large Language Models
Theory
- Emergent capabilities in transformer models arise stochastically and are influenced by model size.
- Learning task-relevant attention patterns is crucial for the emergence of capabilities.
- Context length and pattern sparsity significantly affect the learning difficulty of attention patterns.
- Scaling the number of attention heads improves learning efficiency, while increasing head dimension yields diminishing returns.
Summary
This paper investigates the phenomenon of emergent capabilities in transformer language models, specifically how these capabilities arise stochastically during training. The authors demonstrate that larger models tend to acquire capabilities such as pattern completion and indirect object identification earlier due to their ability to learn task-relevant attention patterns. They conduct experiments using synthetic datasets, including linear maps and cellular automata, to analyze how context length and pattern sparsity affect the learning of attention patterns. The findings indicate that the difficulty of learning these patterns is a key bottleneck for emergent capabilities. Additionally, the paper explores the impact of scaling attention heads and introduces alternative architectures, such as MLP-Mixer, which outperforms transformers in specific tasks. The results provide insights into the mechanisms behind the abrupt emergence of capabilities, suggesting that architectural choices can significantly enhance a model's ability to learn sparse attention patterns, thereby improving downstream performance.
Methodology
The authors trained transformer models on synthetic datasets designed to isolate the learning of attention patterns. They varied context length and pattern sparsity to assess their impact on model performance. Additionally, they compared different architectural configurations, including the number of attention heads and the use of alternative token-mixing mechanisms.
Results
The study found that larger transformer models acquired emergent capabilities earlier and more reliably than smaller models. The emergence of capabilities was linked to the learning of relevant attention patterns, which varied based on context length and sparsity. The MLP-Mixer architecture demonstrated superior performance in learning complex attention patterns compared to traditional transformers on specific tasks.
Implications
These findings suggest that understanding the mechanisms behind emergent capabilities can guide the design of more efficient transformer architectures and training strategies. By focusing on improving the learning of sparse attention patterns, researchers can enhance the performance of language models in various downstream tasks.
Learning to Trigger: Reinforcement Learning at the Large Hadron Collider
Reinforcement Learning
- Introduced a reinforcement learning framework for real-time trigger threshold optimization at the LHC.
- Demonstrated significant improvements in signal efficiency and background rate stability using RL-based methods.
- Developed two new variants of Group-Filtered Policy Optimization tailored for streaming control.
- Achieved successful transfer of the RL agent from simulated to real collision data without fine-tuning.
Summary
This paper addresses the challenge of real-time event filtering (triggering) at the Large Hadron Collider (LHC), where static and hand-tuned trigger menus often become suboptimal due to changing detector conditions and background noise. The authors propose a reinforcement learning (RL) approach to dynamically adjust trigger thresholds, framing the problem as a sequential decision-making task. They adapt Group-Filtered Policy Optimization (GFPO) to this context, introducing two variants (GFPO-F and GFPO-FR) that ensure background rate feasibility during training. The methodology is validated on simulated data and real collision data from the CMS experiment. Results show significant improvements in maintaining background rate tolerance and enhancing signal efficiency, marking the first successful application of RL for trigger control in real LHC data. The findings suggest that RL can effectively optimize complex decision-making processes in high-throughput scientific environments.
Methodology
The authors formulated the problem of adaptive thresholding as a reinforcement learning task, utilizing a sequence-based observation model that captures streaming data. They adapted the Group-Filtered Policy Optimization method to ensure feasibility in background rates while optimizing for signal efficiency. The RL agent was trained on simulated data and then tested on real collision data from the CMS experiment.
Results
The RL agent increased the fraction of in-tolerance time intervals by 48% for the total transverse energy (HT) trigger and 28% for the anomaly-detection (AD) trigger. It also achieved a cumulative gain of up to 2% in signal efficiency during in-tolerance intervals. When applied to real collision data, the agent improved in-tolerance rates by 56% for HT and 28% for AD, demonstrating effective transfer from simulation to real-world application.
Implications
This work has significant implications for optimizing real-time decision-making in high-energy physics experiments and potentially other fields requiring dynamic threshold adjustments under strict constraints. The methods developed could be applied to various scientific and engineering domains where adaptive control is necessary.
Reconstructing GRACE Terrestrial Water Storage with Spatio-Temporal Graph Neural Networks: An Application to South America
Graph Learning
Time Series
- Introduces a spatio-temporal graph neural network for reconstructing TWS anomalies from meteorological data.
- Achieves high correlation with GRACE observations, demonstrating effectiveness in capturing hydrological dynamics.
- Outperforms traditional reconstruction methods in terms of predictor efficiency and accuracy.
- Successfully reproduces major climatic events, validating the model's applicability in climate science.
Summary
This paper addresses the challenge of reconstructing Terrestrial Water Storage (TWS) anomalies from GRACE satellite data, which only provides observations from 2002 onwards. The authors propose a novel approach using a multivariate time series graph neural network (MTGNN) to learn the relationship between daily meteorological data from ERA5 and monthly GRACE observations, extending the reconstruction back to 1940. Unlike traditional methods that treat spatial data independently, the MTGNN captures spatial dependencies through a hybrid adjacency matrix that integrates geodesic proximity and lagged correlations of climatic time series. The model is evaluated over South America, achieving a grid-cell Pearson correlation of 0.69 and a basin-mean correlation of 0.94 against GRACE/GRACE-FO data from 2002 to 2023. The reconstruction effectively reproduces significant climatic events such as the 2015/16 El Niño and 2020/21 La Niña, demonstrating its robustness compared to existing methods. The authors also highlight the efficiency of their approach, requiring significantly fewer predictors than other models while maintaining competitive accuracy. The implementation is made publicly available to support reproducibility and further research.
Methodology
The authors adapted a multivariate time series graph neural network (MTGNN) to model the relationship between daily meteorological data and monthly TWS observations. The model uses a hybrid adjacency matrix to encode spatial dependencies, allowing it to capture both local hydrological interactions and larger-scale climatic teleconnections.
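A hybrid edge weight that mixes geodesic proximity with lagged correlation could be built along the lines below. The mixing rule (a convex combination with weight `alpha`), the exponential distance kernel, and the parameters `scale_km` and `max_lag` are assumptions for illustration; the summary does not specify how the paper combines the two terms.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def max_lagged_corr(x, y, max_lag=3):
    """Largest absolute correlation between x and y delayed by 0..max_lag steps."""
    best = 0.0
    for lag in range(max_lag + 1):
        best = max(best, abs(pearson(x[:len(x) - lag], y[lag:])))
    return best

def hybrid_weight(dist_km, corr, alpha=0.5, scale_km=500.0):
    """Convex mix of a distance kernel and a correlation term (assumed form)."""
    return alpha * math.exp(-dist_km / scale_km) + (1 - alpha) * corr
```

Two distant grid cells with strongly lag-correlated climate series still receive a non-trivial weight under this rule, which is one way a graph could represent teleconnections alongside local neighbours.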
Results
The MTGNN achieved a grid-cell Pearson correlation of 0.69 and a basin-mean correlation of 0.94 when evaluated against GRACE/GRACE-FO data from 2002 to 2023. The model demonstrated near-zero bias and accurately reproduced the spatial patterns associated with significant climatic events like El Niño and La Niña.
Implications
This research has significant implications for climate science, particularly in understanding and managing freshwater resources under climate variability. The ability to reconstruct historical TWS data enhances the analysis of long-term trends and impacts of climate change, providing valuable insights for water resource management and policy-making.
Training Dynamics of Neural Software Defect Predictors under Coupled Data-Quality Issues
Theory
- Investigates the interaction of class imbalance and class overlap in neural software defect prediction.
- Proposes a novel empirical protocol to analyze training dynamics under various data-quality conditions.
- Aims to catalog training-dynamics patterns associated with data-quality issues for better model diagnostics.
- Highlights the limitations of relying solely on endpoint performance metrics for understanding model behavior.
Summary
This paper investigates the training dynamics of neural software defect predictors (SDPs) in the presence of coupled data-quality issues, specifically class imbalance and class overlap. While previous studies have primarily focused on endpoint performance metrics to assess the impact of these issues, this research aims to explore how these data-quality problems manifest in the internal training dynamics of neural networks. The authors conduct a controlled intervention study using class-level datasets from the Unified Bug Dataset, training a fixed multilayer perceptron (MLP) under various conditions: imbalance-only, overlap-only, and joint conditions. By logging training dynamics per epoch and monitoring fidelity through coupling ratios, the study characterizes patterns using effect sizes, trajectories, and sensitivity analyses. The expected outcome is a comprehensive empirical protocol and a taxonomy of training-dynamics patterns that can aid in diagnosing data-quality issues in SDP models, ultimately enhancing the reliability of defect predictions in software engineering.
Methodology
The authors conducted a controlled intervention study using class-level datasets from the Unified Bug Dataset, training a fixed multilayer perceptron under different conditions (imbalance-only, overlap-only, and joint). They logged training dynamics metrics per epoch and monitored fidelity through coupling ratios, analyzing the data using effect sizes, trajectories, and sensitivity analyses.
Results
The study is expected to produce a candidate taxonomy of training-dynamics patterns associated with class imbalance, overlap, and their coupling. It aims to provide insights into how these issues affect the training dynamics of neural networks, offering earlier diagnostic signals for data-quality problems in software defect prediction.
Implications
The findings could significantly enhance the understanding of how data-quality issues affect neural network training dynamics, leading to better data cleaning, model debugging, and maintenance decision-making in software engineering. This could ultimately improve the reliability and effectiveness of software defect prediction models.
A Zeroth-Order Deep Learning Method for Fully Nonlinear Parabolic Partial Differential Equations with Unknown Coefficients
Theory
Optimization
- Introduces a model-free approach to solving fully nonlinear parabolic PDEs with unknown coefficients.
- Utilizes zeroth-order derivative estimators from Monte Carlo simulations for learning solutions and their derivatives.
- Establishes a non-asymptotic error bound and analyzes bias-variance tradeoff for the proposed method.
- Demonstrates competitive performance in numerical experiments across various dimensions.
Summary
This paper addresses the challenge of solving high-dimensional fully nonlinear parabolic partial differential equations (PDEs) with unknown coefficients using a novel zeroth-order deep learning approach. Traditional methods often rely on automatic differentiation, which can lead to instability and errors in high dimensions. The authors propose a 'representing-then-learning' strategy that utilizes zeroth-order derivative (ZOD) estimators derived from perturbed Monte Carlo trajectories. This model-free method allows for the learning of solutions and their derivatives solely based on function evaluations, circumventing the need for explicit knowledge of the underlying PDE operators. The paper provides a statistical learning analysis, establishing a non-asymptotic error bound that decomposes total error into various components, including discretization and approximation errors. The authors also derive the sample complexity of the learned representations in Sobolev space, demonstrating the method's effectiveness through numerical experiments in moderate and high dimensions.
Methodology
The authors employ a 'representing-then-learning' paradigm, where derivatives are represented through zeroth-order derivative estimators derived from Monte Carlo trajectories. This allows for a fully model-free approach to learning solutions and their derivatives based on function evaluations, without requiring explicit knowledge of the PDE operators.
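To show what estimating a derivative from function evaluations alone looks like, here is the standard two-point Gaussian-smoothing gradient estimator. It is a generic zeroth-order technique swapped in for illustration; it is not the paper's estimator, which is built from perturbed Monte Carlo trajectories. The function name and the parameters `sigma`, `n_samples`, and `seed` are assumptions.

```python
import random

def zo_gradient(f, x, sigma=0.1, n_samples=20000, seed=0):
    """Estimate grad f(x) using only evaluations of f.

    For each Gaussian direction z, the symmetric difference
    (f(x + sigma z) - f(x - sigma z)) / (2 sigma) approximates the
    directional derivative along z; weighting by z and averaging
    recovers the gradient in expectation (up to smoothing bias).
    """
    rng = random.Random(seed)
    d = len(x)
    grad = [0.0] * d
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + sigma * zi for xi, zi in zip(x, z)]
        x_minus = [xi - sigma * zi for xi, zi in zip(x, z)]
        slope = (f(x_plus) - f(x_minus)) / (2 * sigma)
        for i in range(d):
            grad[i] += slope * z[i]
    return [g / n_samples for g in grad]
```

The perturbation size `sigma` and the sample count govern a bias-variance trade-off: larger `sigma` smooths more and adds bias for non-quadratic functions, while fewer samples increase the variance of the estimate.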
Results
The proposed method achieves a non-asymptotic error bound that accounts for discretization, approximation, statistical errors, and ZOD bias. Numerical experiments validate the method's competitive performance in solving high-dimensional PDEs, showcasing its effectiveness compared to existing techniques.
Implications
This work has significant implications for scientific machine learning applications, particularly in fields where high-dimensional PDEs are prevalent, such as fluid dynamics, climate modeling, and reinforcement learning. The model-free approach enables the solution of complex systems where traditional methods fail due to unknown coefficients.
Is Variational Monte Carlo Robust? Sharp Moment Thresholds and Heavy-tailed Stochastic Optimization
Theory
Optimization
- VMC's stochastic optimization is heavily influenced by the nodal geometry of the wave function.
- Local energy and gradient estimators in VMC are generally heavy-tailed and lack higher moments.
- The proposed PS-Clip-VMC variant improves robustness and convergence properties of VMC.
- Preliminary results indicate PS-Clip-VMC significantly outperforms traditional VMC methods.
Summary
This paper investigates the robustness of Variational Monte Carlo (VMC), a key algorithm in electronic structure theory, particularly in the context of modern neural network approaches like FermiNet. The authors establish that the stochastic optimization problem inherent in VMC is significantly influenced by the nodal geometry of the wave function. They demonstrate that for many relevant ansatz classes, including Slater-Jastrow wave functions, the local energy and gradient estimators are typically heavy-tailed and do not possess higher moments. This finding indicates that the moment assumptions necessary for the convergence of VMC are often violated. To address these challenges, the authors propose a new variant of VMC called PS-Clip-VMC, which involves clipping both the local energy and gradient random variables. They prove that PS-Clip-VMC converges in expectation and with high probability in the weak moment regime, and preliminary experiments suggest that it outperforms standard VMC methods in terms of robustness when applied to training FermiNet on atomic systems with up to 18 electrons.
Methodology
The authors analyze the moment properties of local energy and gradient estimators in VMC, proving that they are often heavy-tailed for common wave function ansatz classes. They introduce PS-Clip-VMC, which clips the local energy and gradient estimators to enhance convergence properties. The methodology includes theoretical proofs of convergence and empirical validation through experiments on FermiNet.
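To make the effect of clipping on a heavy-tailed estimator concrete, here is a minimal sketch that clips samples to a window around the median before averaging. This is a generic clipped mean, not the PS-Clip-VMC rule from the paper; the function name `clipped_mean`, the window half-width `tau`, and the sample values are assumptions for illustration.

```python
import statistics

def clipped_mean(samples, tau):
    """Average after clipping each sample to [median - tau, median + tau]."""
    center = statistics.median(samples)
    low, high = center - tau, center + tau
    clipped = [min(max(s, low), high) for s in samples]
    return statistics.fmean(clipped)

# Hypothetical local-energy samples with one extreme outlier, of the kind
# that can appear when a walker lands close to a node of the wave function.
energies = [-1.02, -0.98, -1.01, -0.99, -1.00, -250.0]

raw = statistics.fmean(energies)
robust = clipped_mean(energies, tau=0.5)
```

Clipping trades a small bias for a large reduction in variance, which is the usual motivation when higher moments of the estimator do not exist.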
Results
The paper establishes that the moment assumptions required for VMC convergence are frequently not met due to the heavy-tailed nature of the estimators. The introduction of PS-Clip-VMC is shown to converge in expectation and with high probability, addressing the limitations of standard VMC. Experimental results indicate that PS-Clip-VMC is more robust than traditional methods when applied to complex atomic systems.
Implications
The findings suggest that VMC can be made more reliable for practical applications in quantum chemistry and electronic structure calculations. The robustness of PS-Clip-VMC may facilitate more accurate simulations of molecular systems, potentially impacting computational chemistry and materials science.
Tensorion: A Tensor-Aware Generalization of the Muon Optimizer
Optimization
Computer Vision
Efficient ML
- Tensorion extends the Muon optimizer to higher-order tensors, preserving the multilinear structure crucial for optimization.
- The optimizer utilizes a linear minimization oracle over a specially defined tensor norm ball, balancing computational efficiency and optimization effectiveness.
- Experiments show that Tensorion outperforms conventional optimizers like Adam in terms of convergence and stability on tensor-based tasks.
- The proposed method adapts unfolding strategies for tensors, enhancing the practical application of tensor optimization in deep learning.
Summary
The paper introduces Tensorion, a novel tensor-aware optimizer that generalizes the Muon optimizer to handle higher-order tensors, which are prevalent in modern machine learning models. Traditional first-order optimizers like Adam treat parameters as flat vectors, neglecting the inherent multilinear structure of tensors. Tensorion addresses this limitation by employing a linear minimization oracle (LMO) over a tensor norm ball, specifically designed to balance the need for a tight upper bound on the tensor spectral norm while ensuring computational tractability. The optimizer adapts its approach based on the unfolding matrices of tensors, allowing it to recover the Muon optimizer when restricted to order-2 tensors (matrices). The authors validate Tensorion through experiments on various tensor-based computer vision tasks, demonstrating its superior convergence behavior and stability in gradient updates compared to Adam and other tensor-aware optimizers.
Methodology
Tensorion employs a linear minimization oracle (LMO) that operates over a tensor norm ball, specifically designed to ensure both a tight upper bound on the spectral tensor norm and computational tractability. The optimizer adapts its unfolding strategies based on the nuclear matrix norms of the tensors involved, allowing for efficient optimization without requiring additional recomputation at each iteration.
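For the matrix (order-2) case that the summary says reduces to Muon, the linear minimization oracle over the unit spectral-norm ball has a closed form: for a gradient G with SVD U S V^T, the minimizer of the inner product with G is -U V^T. The sketch below approximates U V^T with a cubic Newton-Schulz iteration in plain Python. It covers only this matrix case; the tensor norm ball and unfolding strategy of Tensorion are not reproduced, and all function names are assumptions.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate U V^T for full-rank G = U S V^T (cubic Newton-Schulz)."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    # Scaling by the Frobenius norm puts all singular values in (0, 1].
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x

def lmo_spectral_ball(g):
    """argmin of <G, X> over matrices X with spectral norm at most 1."""
    return [[-v for v in row] for row in orthogonalize(g)]

grad = [[2.0, 1.0], [0.0, 1.0]]
step = lmo_spectral_ball(grad)
inner = sum(g * s for gr, sr in zip(grad, step) for g, s in zip(gr, sr))
gram = matmul(step, transpose(step))
```

Here `inner` should equal the negative nuclear norm of `grad`, and `gram` should be close to the identity, since the step has all singular values equal to one.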
Results
The evaluation of Tensorion on multiple computer vision classification datasets across convolutional neural networks (CNNs) and transformer architectures indicates consistent improvements in convergence rates and stability of gradient updates compared to traditional SGD and Adam-based optimizers.
Implications
Tensorion's development suggests significant potential for optimizing tensor-valued parameters in deep learning models, particularly those utilizing convolutional and transformer architectures. Its ability to maintain the structural integrity of tensors during optimization could lead to more efficient training algorithms and improved performance in various machine learning applications.
Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment
Efficient ML
Time Series
Theory
- Lightweight transformers can match traditional ML methods in accuracy but at a much higher resource cost.
- TinyBERT-4L is the most deployment-friendly transformer model with a balance of size and latency.
- INT8 quantization can significantly reduce model size while maintaining high accuracy.
- An adaptive inference pipeline can optimize performance by routing most predictions through a lightweight model.
Summary
This paper investigates the deployment of lightweight transformer models for on-device fault detection, addressing the challenges posed by resource-constrained environments. The study benchmarks traditional machine learning methods (Random Forest, XGBoost, SVM, Logistic Regression) against lightweight transformer architectures (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) across three public datasets: NASA C-MAPSS, SECOM, and UCI AI4I 2020. The evaluation focuses on classification performance (F1-score, AUC), model size, and CPU inference latency, while also exploring INT8 dynamic quantization and a two-stage adaptive inference pipeline. Results indicate that while lightweight transformers can achieve competitive accuracy on well-separated datasets, they come with significantly larger model sizes and higher latencies compared to traditional methods. The TinyBERT-4L model is identified as the most efficient option for deployment. The study also highlights the limitations of both traditional and transformer methods in handling severely imbalanced datasets, emphasizing the need for improved strategies in fault detection tasks.
Methodology
The study evaluates ten model configurations across three datasets, employing traditional ML methods and lightweight transformers. It utilizes INT8 dynamic quantization for model compression and proposes a two-stage adaptive inference pipeline to enhance efficiency. The datasets are preprocessed, and models are fine-tuned before evaluation on classification metrics, model size, and latency.
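A two-stage adaptive pipeline of this kind can be sketched as a confidence-gated router: a cheap model answers when it is confident, and only the remaining inputs go to a more expensive model. The routing criterion (a fixed confidence threshold) and both toy models below are assumptions for illustration; the paper's actual routing rule is not given in this summary.

```python
def adaptive_predict(x, fast_model, slow_model, threshold=0.9):
    """Return (label, route), escalating only low-confidence inputs."""
    label, confidence = fast_model(x)
    if confidence >= threshold:
        return label, "fast"
    return slow_model(x)[0], "slow"

def fast_model(x):
    # Hypothetical cheap model: thresholds a normalized sensor reading.
    label = int(x >= 0.5)
    confidence = max(x, 1.0 - x)
    return label, confidence

def slow_model(x):
    # Hypothetical expensive model with a slightly different boundary.
    return int(x >= 0.45), 1.0

readings = [0.05, 0.48, 0.97]
routed = [adaptive_predict(x, fast_model, slow_model) for x in readings]
fast_share = sum(route == "fast" for _, route in routed) / len(routed)
```

The threshold controls the trade-off: raising it sends more inputs to the slow model, which increases average latency but may recover accuracy on ambiguous cases.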
Results
The results show that on the NASA C-MAPSS dataset, lightweight transformers achieved an F1-score of 87.8%, but with a model size 100 times larger and latency 9000 times greater than traditional methods. TinyBERT-4L was the most efficient transformer at 55 MB and 18 ms latency. INT8 quantization reduced model size by 25% while preserving 86.9% F1. The adaptive pipeline achieved 87.6% F1 with an average latency of 19.5 ms, effectively routing 97.9% of predictions through a quantized model.
Implications
The findings suggest that while lightweight transformer models can be adapted for on-device fault detection, significant trade-offs in size and latency must be considered. The proposed adaptive inference pipeline offers a promising approach for optimizing performance in resource-constrained environments, potentially enhancing real-time diagnostics in consumer electronics and IoT devices.
A Time-Reparameterized Cumulative Intensity Extrapolation Sampler for Discrete Flow Matching
Generative Models
Efficient ML
NLP
- Introduces TR-CIE sampler to improve sampling efficiency in DFM.
- Utilizes a schedule-based time reparameterization to mitigate stiffness.
- Implements a cumulative-intensity extrapolation rule for better approximation.
- Requires only one function evaluation per step, maintaining efficiency.
Summary
This paper introduces the Time-Reparameterized Cumulative Intensity Extrapolation (TR-CIE) sampler, designed to improve sampling efficiency in discrete flow matching (DFM) when the number of function evaluations (NFE) is limited. DFM leverages continuous-time Markov chain dynamics for generative modeling on discrete state spaces, but existing sampling methods often rely on discretizations like τ-leaping, which can be inefficient. The TR-CIE sampler consists of two main components: a schedule-based time reparameterization that adjusts the time grid according to the noise schedule, thereby reducing stiffness during the terminal sampling phase, and a cumulative-intensity extrapolation update rule that uses cached model outputs to improve the approximation of cumulative intensities on a non-uniform time grid. Theoretical analysis shows that the TR-CIE sampler maintains convergence and bounds local approximation errors while requiring only one function evaluation per step, with no additional model evaluations compared to standard τ-leaping. Experimental results across synthetic tasks, text generation, and text-to-image benchmarks indicate that TR-CIE significantly improves sampling quality when NFE is limited.
Methodology
The TR-CIE sampler employs a two-component approach: first, it rescales the time grid based on a noise schedule to alleviate stiffness in sampling; second, it uses a cumulative-intensity extrapolation rule that leverages previous outputs to enhance the accuracy of cumulative intensity approximations on a non-uniform time grid. The method is theoretically analyzed for convergence and error bounds.
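The idea of reusing a cached output to better approximate the cumulative intensity over a step can be illustrated with a simple linear extrapolation. The sketch below compares a one-point (Euler-style) estimate with an estimate that extrapolates from the current and previous rates. This is one plausible reading for illustration only, not the paper's update rule; the rate 1/(1 - t) is merely a stand-in for an intensity that becomes stiff near the end of sampling, and all names are assumptions.

```python
import math

def rate(t):
    # Stand-in jump intensity that grows without bound as t approaches 1.
    return 1.0 / (1.0 - t)

def euler_intensity(h, lam_now):
    """Cumulative intensity over a step of size h, holding the rate fixed."""
    return h * lam_now

def extrapolated_intensity(h, h_prev, lam_now, lam_prev):
    """Integrate a rate linearly extrapolated from the cached previous value."""
    slope = (lam_now - lam_prev) / h_prev
    return h * lam_now + 0.5 * h * h * slope

def jump_probability(cumulative_intensity):
    """Probability of at least one jump given the cumulative intensity."""
    return 1.0 - math.exp(-cumulative_intensity)

t_prev, t_now, t_next = 0.5, 0.6, 0.7
h, h_prev = t_next - t_now, t_now - t_prev

exact = math.log((1.0 - t_now) / (1.0 - t_next))
euler = euler_intensity(h, rate(t_now))
extrap = extrapolated_intensity(h, h_prev, rate(t_now), rate(t_prev))
```

Both estimates use a single new rate evaluation per step; the extrapolated one only adds the value cached from the previous step, and on this toy rate it lands closer to the exact integral.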
Results
The TR-CIE sampler was tested on synthetic tasks, text generation, and text-to-image benchmarks, showing marked improvements in sampling quality under conditions of limited function evaluations compared to traditional sampling methods.
Implications
The TR-CIE sampler has potential applications in generative modeling tasks that involve discrete state spaces, such as natural language processing and tokenized image generation, where efficient sampling is crucial for performance.
DREG: A Layer-Wise Jacobian Regularization as a General-Purpose Penalty
Computer Vision
NLP
Theory
- DREG outperforms traditional regularizers in accuracy and noise robustness.
- It is particularly effective under the GELU activation function.
- DREG shows significant advantages in data-scarce environments.
- The method requires minimal implementation effort, functioning as a plug-and-play regularizer.
Summary
This paper introduces DREG (Derivative Regularization penalty), a novel layer-wise Jacobian regularization technique aimed at enhancing the performance of neural networks. Through an extensive empirical study involving 960 experiments across various activations, regularizers, datasets, and noise conditions, the authors investigate the effectiveness of DREG. The findings reveal that DREG consistently outperforms traditional regularizers, achieving superior accuracy in both clean and noisy environments, particularly under the GELU activation function prevalent in modern transformer architectures. The results indicate that DREG is especially beneficial in scenarios with limited training data, acting as a geometric inductive bias that compensates for data scarcity. Notably, DREG operates with a fixed hyperparameter, making it a plug-and-play solution for neural networks without requiring extensive tuning or architectural modifications.
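To give a sense of what a layer-wise Jacobian penalty measures, the sketch below computes the squared Frobenius norm of the input-output Jacobian of a single dense layer with a tanh activation. This is a generic Jacobian penalty, not DREG's definition, which this summary does not state; the activation choice and all names are assumptions.

```python
import math

def layer_jacobian_penalty(weights, bias, x):
    """Squared Frobenius norm of d tanh(W x + b) / dx for one dense layer."""
    penalty = 0.0
    for row, b in zip(weights, bias):
        z = sum(w * xi for w, xi in zip(row, x)) + b
        # Derivative of tanh at the pre-activation z.
        dact = 1.0 - math.tanh(z) ** 2
        # Row i of the Jacobian is dact_i * W[i, :].
        penalty += dact ** 2 * sum(w * w for w in row)
    return penalty

weights = [[1.0, 0.0], [0.0, 2.0]]
bias = [0.0, 0.0]

at_origin = layer_jacobian_penalty(weights, bias, [0.0, 0.0])
saturated = layer_jacobian_penalty(weights, bias, [10.0, 10.0])
```

In training, a term like this would be summed over layers and added to the task loss with a weighting coefficient, so that the network is discouraged from being overly sensitive to small input changes.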
Methodology
The authors conducted a factorial experiment involving 960 runs, systematically evaluating DREG against five competing regularizers (Dropout, Spectral Normalization, Weight Decay, IGPen, and no regularization) across four activation functions, eight datasets, and two noise conditions. This approach allowed for a comprehensive analysis of DREG's performance across various scenarios.
Results
DREG achieved the highest overall accuracy and clean-regime performance among all evaluated regularizers, significantly outperforming the unregularized baseline and other methods like Weight Decay and IGPen. It ranked second in noise robustness, only behind Spectral Normalization. The performance advantage of DREG was most pronounced in data-scarce conditions, confirming its role as a valuable regularization strategy.
Implications
DREG presents a promising regularization approach that can be easily integrated into existing neural network architectures, particularly in applications where data is limited or where high accuracy is required in challenging environments. Its effectiveness across various domains suggests potential for widespread adoption in both vision and NLP tasks.
AsyncOPD: How Stale Can On-Policy Distillation Be?
Large Language Models
Reinforcement Learning
Efficient ML
- AsyncOPD addresses the staleness issue in on-policy distillation by decoupling rollout generation from learner updates.
- The study reveals that forward KL is more robust to stale data compared to reverse KL, which is more vulnerable.
- Existing asynchronous reinforcement learning stabilization techniques do not effectively mitigate OPD staleness.
- A multi-sample Monte Carlo estimator is proposed to reduce variance in reverse KL OPD implementations.
Summary
The paper presents AsyncOPD, a novel approach to on-policy distillation (OPD) that addresses the challenges posed by stale data in asynchronous training environments. OPD is crucial for enhancing the performance of large language models (LLMs) through student-teacher training frameworks. However, the traditional synchronous OPD suffers from a bottleneck due to the time-consuming nature of rollout generation. The authors systematically investigate the effects of stale data on OPD, particularly focusing on the implications of using finite teacher-score caches. They find that the direction of the Kullback-Leibler (KL) divergence significantly influences the robustness of the model to stale rollouts, with forward KL being more resilient than reverse KL. The study also explores whether techniques from asynchronous reinforcement learning can stabilize OPD, concluding that simpler OPD-specific methods outperform these approaches. Additionally, the paper introduces a multi-sample Monte Carlo estimator to mitigate variance in the reverse KL case. The proposed AsyncOPD framework demonstrates a significant improvement in training throughput while maintaining comparable accuracy to synchronous methods.
Methodology
The authors conducted a systematic study of staleness in asynchronous OPD by analyzing the effects of KL divergence direction on model performance. They compared various stabilization methods from asynchronous reinforcement learning and proposed a new estimator for reverse KL that leverages multi-sample Monte Carlo techniques. The AsyncOPD framework was implemented and tested against traditional synchronous OPD methods.
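The two KL directions and a multi-sample Monte Carlo estimate of the reverse KL can be sketched for a single next-token distribution as follows. The convention assumed here is forward KL = KL(teacher || student) and reverse KL = KL(student || teacher); the estimator is a plain sample average over tokens drawn from the student, which may differ from the paper's exact estimator, and the probabilities are made up for the example.

```python
import math
import random

def forward_kl(teacher, student):
    """KL(teacher || student), computed exactly over the vocabulary."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher), computed exactly over the vocabulary."""
    return sum(s * math.log(s / t) for t, s in zip(teacher, student) if s > 0)

def reverse_kl_mc(teacher, student, n_samples, rng):
    """Monte Carlo estimate of the reverse KL from student-sampled tokens."""
    tokens = rng.choices(range(len(student)), weights=student, k=n_samples)
    return sum(math.log(student[i] / teacher[i]) for i in tokens) / n_samples

teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

exact = reverse_kl(teacher, student)
estimate = reverse_kl_mc(teacher, student, 20000, random.Random(0))
```

A single-sample estimate of the same quantity is unbiased but noisy; averaging several samples per position lowers the variance at the cost of extra sampling.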
Results
The experiments demonstrated that AsyncOPD increases training throughput by 1.6× to 3.8× compared to strict synchronous training while achieving comparable accuracy. The findings also highlighted the vulnerability of reverse KL to stale rollouts and the effectiveness of the proposed multi-sample Monte Carlo estimator in reducing variance.
Implications
The findings of this study have significant implications for the training of large language models, particularly in scenarios where efficient training is critical. The AsyncOPD framework can be applied to improve the performance of various LLM applications, enabling faster and more effective model updates without sacrificing accuracy.
Closing the Loop: Formally Verified Law as a Reward Signal for Self-Improving Legal AI
NLP
Large Language Models
Reinforcement Learning
- Current legal AI systems lack the capability for autonomous legal reasoning due to their reliance on unverifiable outcomes.
- The proposed architecture integrates LLMs with formal verification to ensure provable correctness in legal reasoning.
- The system provides structural guarantees for legal argumentation, addressing open-textured legal analysis.
- Demonstrated effectiveness through practical examples in various legal contexts.
Summary
This paper addresses the limitations of current legal AI systems, which are not autonomous legal reasoners due to their reliance on probabilistic outcomes rather than provable correctness. The authors propose a novel architecture based on the 'LLM proposes, verifier disposes' paradigm, which integrates large language models (LLMs) with a formal verification process tailored for legal reasoning. This architecture consists of multiple layers: autoformalization into a formal legal calculus, a verification kernel, and explanation generation based on formal proof traces. The proposed system ensures that every necessary stage of legal argumentation is addressed, maintaining valid deductive links between steps. The authors demonstrate the architecture's effectiveness through examples in procedural deadlines, Commerce Clause analysis, and cross-jurisdictional sanctions. By employing a deterministic external verifier, the architecture closes the reinforcement learning loop, providing verifiable outcomes essential for training legal AI systems. The ultimate goal is to achieve legal superintelligence, where the AI can generate legally correct arguments and identify areas of legal contention, while remaining auditable and trustworthy.
Methodology
The authors developed a multi-layer architecture that includes autoformalization into a formal legal calculus, a verification kernel for assessing legal arguments, and an explanation generation mechanism based on formal proof traces. They utilized reinforcement learning from verifier feedback to train the models, ensuring that only verified outputs are produced.
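The 'LLM proposes, verifier disposes' loop can be sketched with a deterministic checker that turns each proposed answer into a binary reward. The toy verifier below simply recomputes a deadline as a service date plus a fixed number of days; it ignores weekends, holidays, and every real procedural rule, and it stands in for the formal legal calculus and verification kernel, which this summary does not specify. All names and dates are hypothetical.

```python
from datetime import date, timedelta

def verify_deadline(service_date, period_days, proposed):
    """Deterministic toy verifier: recompute the deadline and compare."""
    return proposed == service_date + timedelta(days=period_days)

def rewards(candidates, service_date, period_days):
    """Binary reward per candidate: 1.0 if verified, else 0.0."""
    return [1.0 if verify_deadline(service_date, period_days, c) else 0.0
            for c in candidates]

service = date(2024, 3, 1)
# Hypothetical model proposals for a 14-day period.
proposals = [date(2024, 3, 14), date(2024, 3, 15)]
signal = rewards(proposals, service, 14)
```

Because the verifier is deterministic and external to the model, the reward cannot be gamed by persuasive but incorrect text, which is the property that makes it usable as a reinforcement learning signal.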
Results
The architecture successfully demonstrated its capability in handling complex legal analyses, such as procedural deadlines in German law and Commerce Clause interpretations in U.S. law. It provided verifiable outcomes for legal problems, thus enhancing the reliability and accountability of legal AI systems.
Implications
This work has significant implications for the development of legal AI systems, potentially leading to more reliable and accountable legal reasoning tools. It suggests a pathway towards achieving legal superintelligence, which could transform how legal analysis and decision-making are conducted.
Digital Twin-Driven Adaptive Sim-to-Real Alignment via Reinforcement Learning for Vibration-Based Bearing Health Monitoring Under Data Scarcity
Reinforcement Learning
Time Series
Optimization
- Proposes a novel RL-driven approach for sim-to-real feature alignment in bearing health monitoring.
- Addresses the limitations of existing class-agnostic domain adaptation methods by recognizing the heterogeneous nature of fault classes.
- Utilizes a three-stage framework that combines physics-based pretraining, RL for adaptive alignment, and an asymmetry-aware training strategy.
- Achieves a cross-equipment linear probing accuracy of 92.8% without the need for encoder retraining.
Summary
This paper addresses the challenges of vibration-based health monitoring of rotating machinery, particularly focusing on bearing fault diagnosis under data scarcity. The authors highlight the limitations of existing domain adaptation methods that apply a class-agnostic global transformation, which fails to account for the heterogeneous nature of sim-to-real gaps across different fault classes. To overcome these challenges, they propose a three-stage digital twin framework utilizing reinforcement learning (RL) for adaptive feature alignment. The first stage involves pretraining a ResNet-based encoder on physics-induced bearing signals to create a feature manifold. In the second stage, a Proximal Policy Optimization (PPO) agent formulates the feature alignment as a continuous-action Markov decision process, allowing for state-dependent policy learning that adapts to class-specific gaps. The final stage fine-tunes the encoder with an asymmetry-aware strategy, reserving real data for the Normal class while augmenting fault classes with RL-aligned simulated samples. The framework is validated on multiple datasets, demonstrating significant improvements in monitoring capability and fault diagnosis accuracy.
Methodology
The methodology consists of a three-stage digital twin framework. Stage 1 involves pretraining a ResNet-based encoder on simulated bearing signals. Stage 2 formulates feature alignment as a continuous-action Markov decision process, solved using Proximal Policy Optimization to learn a state-dependent policy. Stage 3 fine-tunes the encoder with an asymmetry-aware strategy that reserves real data for the Normal class and augments fault classes with simulated samples aligned through the learned policy.
Results
The proposed framework was validated on the XJTU-SY and CWRU datasets, as well as a self-built slewing bearing testbed. The results showed a significant improvement in fault diagnosis accuracy, achieving a cross-equipment linear probing accuracy of 92.8% without retraining the encoder, indicating strong transferable monitoring capabilities.
Implications
The findings suggest that the proposed RL-driven adaptive alignment approach can enhance the reliability of vibration-based health monitoring systems in industrial applications, particularly in scenarios with limited fault data. This could lead to improved maintenance strategies and reduced operational risks in rotating machinery.
Do Thinking Tokens Help with Safety?
Large Language Models
NLP
Theory
- Thinking tokens do not significantly enhance safety decision-making in reasoning models.
- The outcome of compliance or refusal can be predicted early in the thinking process.
- Existing safety interventions often lead to over-refusal and suppress deliberation signals.
- The thinking process is more akin to prefix completion than to genuine deliberation.
Summary
This paper investigates the effectiveness of 'thinking tokens' in enhancing the safety of large reasoning models (LRMs). The authors challenge the prevailing belief that a deliberative thinking process improves alignment and safety by allowing models to evaluate their responses against safety principles. Through empirical analysis of various frontier open-weight reasoning models, including GPT-OSS and Qwen, the study reveals that the eventual outcomes of refusal or compliance can be predicted with high accuracy based on the first token's hidden representation, even before any visible thinking occurs. The findings indicate that the thinking process resembles prefix completion rather than genuine deliberation, with most outcomes being determined early in the thinking trace. Additionally, existing safety interventions tend to lead to over-refusal behaviors while suppressing deliberation signals. The authors conclude that current reasoning models exhibit less deliberative safety behavior than assumed, highlighting the need for new methods that foster true safety deliberation.
Methodology
The authors conducted a series of experiments on various large reasoning models, analyzing the hidden representations of the first token in the thinking trace to predict compliance or refusal outcomes. They evaluated the impact of thinking on safety decisions and assessed the effectiveness of existing inference-time and training-based safety interventions.
Results
The study found that the refusal/compliance outcome is strongly predictable from the first token's hidden representation with AUROC scores between 0.84 and 0.95 and approximately 88% balanced accuracy. It was also observed that the thinking process does not significantly alter the predicted outcomes after the initial tokens, and existing safety interventions tend to exacerbate over-refusal rates while diminishing deliberation signals.
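For readers unfamiliar with the two reported metrics, the sketch below computes AUROC and balanced accuracy from probe scores and binary labels. The scores and labels are invented for illustration and do not come from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(preds, labels):
    """Mean of the true positive rate and the true negative rate."""
    tpr = sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / labels.count(1)
    tnr = sum(p == 0 and y == 0 for p, y in zip(preds, labels)) / labels.count(0)
    return 0.5 * (tpr + tnr)

# Hypothetical probe outputs: 1 = refusal, 0 = compliance.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
preds = [int(s >= 0.5) for s in scores]

area = auroc(scores, labels)
bal_acc = balanced_accuracy(preds, labels)
```

AUROC is threshold-free, while balanced accuracy depends on the chosen cutoff and corrects for unequal numbers of refusals and compliances.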
Implications
The findings suggest that current approaches to enhancing safety in reasoning models may be misguided. There is a critical need for developing new methodologies that genuinely encourage deliberative thinking processes, which could lead to more reliable safety outcomes in AI systems.
Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping
Large Language Models
Efficient ML
Theory
- H-Res introduces a new method for adapting large Transformers without modifying synaptic weights or increasing sequence length.
- The approach preserves the attention entropy of the model and supports Neural Collapse.
- Empirical results show a 26% improvement in associative retrieval tasks compared to global weight modification methods.
- H-Res avoids the computational overhead of prompt-based methods, making it efficient for structured domains.
Summary
This paper addresses the challenge of adapting large Transformer models, which function as Dense Associative Memories (DAMs), to new tasks without compromising their pre-trained knowledge. The author introduces H-Res (Hierarchical Residual Steering), a novel mechanism that reshapes the energy landscape of Transformers without altering their global equilibrium or increasing sequence length. H-Res formulates adaptation as a control problem on the activation manifold, learning a state-dependent vector field that directs token trajectories into task-specific basins of attraction. The method preserves the attention entropy of the foundation model and facilitates Neural Collapse, and it outperforms traditional global weight modification methods by 26% on associative retrieval tasks. Additionally, H-Res eliminates the computational overhead associated with prompt-based methods, demonstrating scalability to structured domains.
Methodology
The methodology involves introducing a residual control signal into the state evolution of the Transformer, parameterized as a low-rank Multi-Layer Perceptron (MLP). This control signal acts as a learnable vector field on the activation manifold, steering the model's latent states towards task-specific attractors without altering the pre-trained weights. The approach is grounded in energy minimization dynamics, ensuring that the adaptation process begins from the global minimum of the pre-trained energy landscape.
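A residual control signal parameterized as a low-rank MLP can be sketched as a down-projection, a nonlinearity, and an up-projection added back to the hidden state. The tanh nonlinearity, the zero initialization of the up-projection (so that the steered model starts out identical to the pre-trained one), and all names and values below are assumptions for illustration, not details confirmed by the summary.

```python
import math

def steer(h, down, up):
    """Return h + up @ tanh(down @ h), leaving base weights untouched."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in down]
    delta = [sum(w * z for w, z in zip(row, hidden)) for row in up]
    return [x + d for x, d in zip(h, delta)]

# Hidden size 4, bottleneck rank 2.
h = [0.5, -1.0, 0.25, 2.0]
down = [[0.1, 0.0, -0.2, 0.3],
        [0.0, 0.4, 0.1, -0.1]]
up_zero = [[0.0, 0.0] for _ in range(4)]
up_trained = [[0.5, 0.0], [0.0, -0.5], [0.1, 0.1], [0.0, 0.0]]

initial = steer(h, down, up_zero)
steered = steer(h, down, up_trained)
```

Only `down` and `up` would be trained; with rank r and hidden size d this adds 2·d·r parameters per steered layer.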
Results
H-Res outperforms traditional adaptation methods by 26% on associative retrieval tasks, demonstrating its effectiveness in reshaping the energy landscape of Transformers. The method also maintains the model's associative bandwidth and avoids the pitfalls of catastrophic forgetting and buffer congestion associated with existing techniques.
Implications
The findings suggest that H-Res could be applied to enhance the adaptability of large language models and vision transformers in various downstream tasks, potentially leading to more efficient and effective use of pre-trained models in real-world applications.
The Gentle Collapse: Distributional Metrics for Continual Learning
Computer Vision
Theory
- Introduces six new metrics for evaluating catastrophic forgetting that provide a continuous view of forgetting dynamics.
- Demonstrates that these metrics reveal information about forgetting that traditional accuracy metrics cannot capture.
- Shows that using metric scores as loss weights can effectively reduce forgetting in continual learning tasks.
- Establishes that the slope of metrics over short time windows is a reliable indicator for prioritizing replay samples.
Summary
This paper addresses the limitations of using accuracy as a metric for evaluating catastrophic forgetting (CF) in continual learning. Traditional accuracy metrics only indicate whether forgetting has occurred, failing to capture the nuanced dynamics of how knowledge is lost. The authors propose six new softmax-derived metrics that provide a continuous characterization of forgetting, including true-label rank (TLR), predictive confidence, and distributional divergence. These metrics are normalized to a range of [0, 1] and do not require modifications to the training process. Experiments on CIFAR-100 demonstrate that these metrics reveal information about forgetting that accuracy does not: the Confusion Margin and True-Label Rank, for example, can distinguish between different levels of forgetting even when accuracy is at 0%. The authors show that using per-sample metric scores as loss weights can reduce forgetting by 1.3 percentage points compared to uniform experience replay. Additionally, they find that the slope of a metric over a small window serves as a stable sampling criterion, significantly outperforming accuracy-trend sampling. Overall, the paper argues that reliance on accuracy has limited the understanding and mitigation of catastrophic forgetting in continual learning.
Methodology
The authors developed six softmax-derived metrics to evaluate forgetting in continual learning. These metrics include true-label rank metrics, confidence metrics, and a distribution-matching metric, all normalized to [0, 1]. They conducted experiments on CIFAR-100 and TinyImageNet, comparing the performance of their metrics against traditional accuracy measures and assessing their effectiveness in reducing catastrophic forgetting through loss re-weighting and trend sampling.
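The sketch below shows how softmax-derived scores can separate two samples that accuracy treats identically. The normalization of the rank to [0, 1] and the margin score are assumptions chosen for illustration; the paper's exact definitions of True-Label Rank and Confusion Margin may differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def true_label_rank(probs, label):
    """0.0 if the true label is ranked first, 1.0 if it is ranked last."""
    higher = sum(p > probs[label] for p in probs)
    return higher / (len(probs) - 1)

def margin_score(probs, label):
    """True-label probability minus the best other class, mapped to [0, 1]."""
    best_other = max(p for i, p in enumerate(probs) if i != label)
    return 0.5 * (probs[label] - best_other + 1.0)

label = 1
mild = softmax([2.0, 1.5, 0.0, -1.0])    # true class is a close second
severe = softmax([2.0, -1.0, 0.0, 1.5])  # true class is ranked last

def predicted(probs):
    return probs.index(max(probs))

rank_mild = true_label_rank(mild, label)
rank_severe = true_label_rank(severe, label)
```

Both samples are misclassified, so accuracy is 0% for each, yet the rank and margin scores show that the second has drifted much further from the correct answer.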
Results
The proposed metrics provided richer insights into the forgetting process, with the Confusion Margin and True-Label Rank showing significant variation even when accuracy was at 0%. Using the log-TLR metric for trend sampling reduced forgetting by 7.7 percentage points on TinyImageNet compared to the experience replay baseline. The results highlighted the structural limitations of accuracy in capturing the dynamics of forgetting.
Implications
The findings suggest that adopting these new metrics could enhance the evaluation of continual learning systems, leading to better strategies for mitigating catastrophic forgetting. This could have significant implications for deploying machine learning models in dynamic environments where continual learning is essential.
Learning the Koopman Operator using Attention Free Transformers
Time Series
Theory
Optimization
- Introduction of an attention-free latent memory block to improve long-horizon prediction accuracy.
- Dynamic re-encoding mechanism to detect and correct latent drift, enhancing model robustness.
- Evaluation on three benchmark systems shows significant error reduction compared to existing models.
- Koopman+AFT model outperforms GRU and Transformer autoencoders in long-horizon predictions.
Summary
This paper addresses the challenges of learning Koopman operators with autoencoders, particularly long-horizon prediction drift. The authors introduce two components to make Koopman predictors more robust: an attention-free latent memory (AFT) block and a dynamic re-encoding mechanism. The AFT block aggregates a short window of past latent states to produce a corrected latent representation before each Koopman update, operating in linear time and with a minimal increase in parameters. This captures local temporal context and mitigates error divergence. The dynamic re-encoding mechanism uses lightweight online change-point detection triggers to identify latent drift and project predictions back onto the autoencoder manifold, preventing catastrophic drift. The model is evaluated on three benchmark systems (the Duffing oscillator, the Repressilator, and IRMA) and shows reduced error accumulation compared to traditional Koopman autoencoders and multi-head attention models. The Koopman+AFT model achieves lower long-horizon errors at lower inference latency, making it a fast and compact predictor that remains close to the learned manifold over extended prediction horizons.
Methodology
The authors augment a standard Koopman autoencoder with an attention-free latent memory (AFT) block that aggregates past latent states and a dynamic re-encoding mechanism that uses streaming change-point detection to correct latent drift. The AFT operates in linear time and captures local correlations, while dynamic re-encoding projects predictions back onto the learned manifold to prevent drift.
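The interplay between the latent memory and the Koopman update can be sketched as follows. This is a heavily simplified stand-in, not the paper's AFT block: it uses fixed softmax weights over window positions (the hypothetical `pos_bias`) with no gating and no re-encoding trigger, and the helper names are assumptions made for illustration.

```python
import math

def window_weights(pos_bias):
    """Softmax over per-position biases (hypothetical learned values)."""
    m = max(pos_bias)
    exps = [math.exp(b - m) for b in pos_bias]
    total = sum(exps)
    return [e / total for e in exps]

def correct_latent(window, pos_bias):
    """Weighted average of the latent states in the window (oldest first).

    The cost is linear in the window length.
    """
    w = window_weights(pos_bias)
    dim = len(window[0])
    return [sum(w[i] * window[i][j] for i in range(len(window)))
            for j in range(dim)]

def matvec(K, z):
    """Apply the linear Koopman matrix K to latent vector z."""
    return [sum(k * x for k, x in zip(row, z)) for row in K]

def rollout(K, history, pos_bias, steps):
    """Predict `steps` latents, correcting with the window before each update."""
    window = [list(z) for z in history]
    size = len(window)
    preds = []
    for _ in range(steps):
        z_corr = correct_latent(window[-size:], pos_bias)
        z_next = matvec(K, z_corr)
        preds.append(z_next)
        window.append(z_next)
    return preds

K = [[0.0, -1.0], [1.0, 0.0]]          # toy operator: 90-degree rotation
history = [[0.0, 0.0], [1.0, 0.0]]     # two past latent states
print(rollout(K, history, pos_bias=[-1000.0, 0.0], steps=2))
```

With all weight on the most recent state, the sketch reduces to a plain Koopman rollout; spreading weight across the window is what lets past context influence each update.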
Results
The proposed model consistently reduces mean squared error (MSE) and long-horizon mean cumulative absolute error (MCAE) across the three benchmark systems. The AFT block outperforms matched-capacity multi-head attention models, and when combined with dynamic re-encoding, it yields the most robust performance on systems with switching and feedback dynamics. The Koopman+AFT model shows marked improvements in long-horizon predictions, achieving lower errors over horizons up to 1000 steps.
Implications
The findings suggest that the proposed methods can significantly enhance the stability and accuracy of predictions in dynamic systems, making them applicable in fields such as control systems, robotics, and biological modeling where long-term predictions are crucial.
Deciphering Fingerprints of 3D Molecular Surfaces for Accurate Epitope Prediction
Graph Learning
- SurfBind is a surface-centric framework that enhances epitope prediction by focusing on molecular surface representations.
- The model integrates geometric and physicochemical cues using a Transformer architecture with hierarchical prediction.
- Experiments show that SurfBind outperforms existing methods, achieving state-of-the-art results on benchmark datasets.
- The framework demonstrates strong generalization capabilities across diverse antibody contexts and conformational states.
Summary
This paper introduces SurfBind, a novel framework for epitope prediction that focuses on the geometric and physicochemical characteristics of molecular surfaces. Traditional methods often rely on sequence or backbone structures, which fail to capture the complex and discontinuous nature of epitopes. SurfBind addresses this limitation by utilizing a Transformer-based architecture that integrates surface representations with binder-aware cross-attention and a hierarchical prediction approach. The framework partitions molecular surfaces into irregular patches, allowing for efficient modeling of long-range dependencies and interaction-aware features. The authors demonstrate SurfBind's effectiveness through extensive experiments on benchmark datasets like SAbDab and DB5.5, showing that it achieves state-of-the-art performance and generalizes well to unseen antibodies and conformational states. This work emphasizes the importance of surface-driven modeling in understanding protein-protein interactions and improving epitope prediction accuracy.
Methodology
SurfBind employs a Transformer-based architecture that models molecular surfaces as irregular patches, facilitating the integration of geometric and physicochemical information. It utilizes binder-aware cross-attention to enhance interaction modeling between antibodies and antigens, and implements a hierarchical coarse-to-fine prediction strategy to capture both global and local features.
Results
The evaluation of SurfBind on epitope prediction benchmarks reveals significant improvements in accuracy and robustness compared to existing methods. The model shows strong generalization to unseen epitopes and diverse antibody contexts, highlighting its effectiveness in capturing the complexities of protein-protein interactions.
Implications
The findings suggest that SurfBind can be instrumental in advancing antibody engineering, immunotherapy, and vaccine design by providing more accurate epitope predictions. The approach may also be applicable to other areas of molecular biology where understanding surface interactions is critical.
Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning
Large Language Models
Reinforcement Learning
Optimization
- Introduction of the Holistic Data Scheduler (HDS) for LLM pre-training.
- HDS employs a multi-objective reward function that combines data quality, inter-domain influence, and model weight norms.
- Utilizes the Soft Actor-Critic (SAC) algorithm for reinforcement learning in a continuous control space.
- Achieved 44% fewer training iterations and a 7.2% improvement on MMLU 0-shot task compared to existing methods.
Summary
This paper introduces the Holistic Data Scheduler (HDS), a novel framework for online data mixing in the pre-training of Large Language Models (LLMs). Recognizing the limitations of existing methods that rely on a singular optimization perspective, HDS formulates the data scheduling challenge as a reinforcement learning problem, utilizing the Soft Actor-Critic (SAC) algorithm for enhanced stability and sample efficiency. The core innovation of HDS is its multi-objective reward function, which integrates three critical perspectives: data quality, inter-domain influence, and model weight norms. Through systematic experiments on various LLM sizes, HDS demonstrated significant improvements in training efficiency and model performance. Specifically, on The Pile benchmark, HDS achieved a final validation perplexity with 44% fewer training iterations compared to the next best method and a 7.2% improvement on the MMLU 0-shot task, alongside consistent gains across other benchmarks. This work highlights the importance of a holistic approach to data composition in LLM training, aiming to reduce both financial costs and environmental impact associated with pre-training.
Methodology
The HDS framework formulates the data mixing challenge as a reinforcement learning task, modeled as a Markov Decision Process (MDP). It employs the Soft Actor-Critic (SAC) algorithm to explore the high-dimensional policy space. The state vector captures the model's performance, learning velocity, and stability, while the agent's action corresponds to the sampling probabilities for data domains in the training batch.
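As an illustration of how a continuous action can become per-domain sampling probabilities, here is a minimal sketch. The softmax mapping, the domain names, and the helper functions are all assumptions for this example; the summary does not say how HDS converts SAC actions into a mixture.

```python
import math
import random

def action_to_probs(action):
    """Map an unconstrained action vector to sampling probabilities
    with a softmax (an assumed mapping, for illustration only)."""
    m = max(action)
    exps = [math.exp(a - m) for a in action]
    total = sum(exps)
    return [e / total for e in exps]

def sample_batch_domains(probs, batch_size, rng):
    """Draw the source domain index of each example in a training batch."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)

domains = ["web", "code", "papers"]   # hypothetical data domains
action = [0.2, 1.5, -0.3]             # hypothetical agent output
probs = action_to_probs(action)

rng = random.Random(0)
batch = sample_batch_domains(probs, 8, rng)
print([round(p, 3) for p in probs])
print([domains[i] for i in batch])
```

Because the probabilities always sum to one, the agent can shift weight between domains at every step without any extra normalization logic in the data loader.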
Results
HDS reached its final validation perplexity on The Pile benchmark with 44% fewer training iterations than the next best method. It also improved MMLU 0-shot accuracy by 7.2% and showed consistent gains across other benchmarks, indicating better training efficiency and model capability.
Implications
The findings suggest that a holistic approach to data scheduling can significantly improve the efficiency and effectiveness of LLM pre-training. This could lead to reduced costs and environmental impacts associated with large-scale model training, making it more sustainable and accessible.
Cyclic Denoising Reveals Ultrastable Memories in Diffusion Models
Generative Models
- Cyclic denoising is a new extraction attack method for image diffusion models.
- The technique reveals ultrastable attractors that correspond to memorized training images.
- Cyclic denoising requires no prior knowledge of training data and operates based on the model's own dynamics.
- The method demonstrates a yielding-like transition in dynamics based on noise amplitude.
Summary
This paper introduces a novel technique called cyclic denoising, which involves repeated forward and reverse diffusion at controlled noise amplitudes to extract information from image diffusion models. The method is inspired by the behavior of disordered solids under cyclic mechanical perturbations, where the system evolves into increasingly stable configurations. The authors demonstrate that cyclic denoising can uncover regions of the learned distribution that are typically inaccessible through standard sampling methods. The technique reveals attractors with varying stability, particularly highlighting 'ultrastable' attractors that can be regenerated even from significant corruption. These attractors often correspond to memorized training images, such as stock photos and watermarks. Unlike previous extraction attacks that rely on external signals or prior knowledge of training data, cyclic denoising operates solely based on the model's sampling dynamics, requiring only sampler-level control. The authors validate their approach on both latent and pixel-space diffusion models, observing a transition in dynamics based on noise amplitude, which influences the stability and structure of the recovered images. The findings position cyclic denoising as a significant tool for auditing model memorization, with broader implications for privacy and copyright compliance.
Methodology
The authors implement cyclic denoising by applying forward noising to an image or latent representation at controlled amplitudes, followed by reverse denoising. This process is repeated, allowing the model to explore its generative landscape and identify stable configurations or attractors that persist over multiple cycles.
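The noise-then-denoise loop described above can be sketched with a one-dimensional toy. Everything here is an assumption made for illustration: `toy_denoise` snaps to the nearest of a few stored values instead of running a diffusion model's reverse process, and the forward step adds plain Gaussian noise rather than following a real noise schedule.

```python
import random

MEMORIES = [-2.0, 0.5, 3.0]  # hypothetical attractors of the toy model

def toy_denoise(x):
    """Stand-in for reverse diffusion: return the nearest stored value."""
    return min(MEMORIES, key=lambda m: abs(m - x))

def cyclic_denoise(x0, amplitude, cycles, rng):
    """Repeat forward noising at a fixed amplitude, then denoising."""
    x = x0
    trajectory = [x]
    for _ in range(cycles):
        noisy = x + amplitude * rng.gauss(0.0, 1.0)  # forward noising
        x = toy_denoise(noisy)                       # reverse denoising
        trajectory.append(x)
    return trajectory

rng = random.Random(0)
print(cyclic_denoise(0.4, amplitude=0.0, cycles=5, rng=rng))  # stays put
print(cyclic_denoise(0.4, amplitude=2.0, cycles=5, rng=rng))  # may hop
```

At zero amplitude the state never leaves its starting basin, while larger amplitudes let the trajectory move between attractors, which loosely mirrors the amplitude dependence the summary reports.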
Results
The study finds that cyclic denoising successfully identifies deep attractors corresponding to memorized images, demonstrating consistent behavior across different diffusion models. The results indicate a spectrum of stability in the attractors, with low amplitudes leading to trivial states and higher amplitudes enabling exploration of deeper, more stable regions.
Implications
The findings suggest that cyclic denoising can serve as a practical tool for auditing generative models, addressing concerns related to privacy and copyright by revealing how models may retain and reproduce sensitive training data. This has significant implications for the ethical deployment of generative AI technologies.
Reliable Conformal Prediction for Ordinal Classification Using the Ranked Probability Score
Theory
- Introduction of RPS as a nonconformity measure for conformal prediction in ordinal classification.
- The method ensures contiguous prediction sets that respect the ordinal structure of labels.
- Theoretical guarantees for marginal coverage and reduced miscoverage severity.
- Model-agnostic approach that can be integrated with various probabilistic predictors.
Summary
This paper addresses the challenges of uncertainty quantification in ordinal classification (OC), particularly in high-stakes domains like medicine and finance. The authors introduce a novel conformal prediction (CP) method that utilizes the ranked probability score (RPS) as a nonconformity measure. RPS is a proper scoring rule that effectively captures the ordinal nature of classification tasks. The proposed method is model-agnostic and generates contiguous prediction sets that reflect the ordinal structure of the label space. The authors demonstrate that RPS-based CP not only satisfies marginal coverage guarantees but also minimizes the severity of miscoverage, which is crucial in applications where misclassification can have significant consequences. The method is evaluated across various ordinal datasets, showing that it balances prediction set width and miscoverage effectively compared to existing CP methods.
Methodology
The authors propose a conformal prediction framework that employs the ranked probability score (RPS) as a nonconformity measure. This approach is model-agnostic and allows for the construction of contiguous prediction sets based on cumulative predictive distributions. The method is validated through theoretical analysis and empirical evaluation on ordinal image and tabular datasets.
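A minimal split-conformal sketch with the RPS as the nonconformity score is shown below. It assumes the unnormalized RPS (some definitions divide by the number of classes minus one) and a standard split-conformal quantile; the paper's exact calibration procedure is not given in this summary, and the function names are hypothetical.

```python
import math

def rps(probs, label):
    """Unnormalized ranked probability score for a 0-based ordinal label:
    squared distance between predicted and observed cumulative distributions."""
    cum = 0.0
    score = 0.0
    for k in range(len(probs) - 1):  # the last cumulative term is always 0
        cum += probs[k]
        indicator = 1.0 if label <= k else 0.0
        score += (cum - indicator) ** 2
    return score

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def prediction_set(probs, threshold):
    """All labels whose RPS does not exceed the calibrated threshold."""
    return [y for y in range(len(probs)) if rps(probs, y) <= threshold]

probs = [0.1, 0.6, 0.2, 0.1]  # hypothetical forecast over 4 ordered classes
print([round(rps(probs, y), 2) for y in range(4)])
print(prediction_set(probs, 0.6))
```

For a fixed forecast, the difference between the RPS of adjacent labels is 2F - 1 with F the cumulative probability, which is non-decreasing in the label. The score is therefore convex over labels, and thresholding it yields a contiguous set of labels.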
Results
The RPS-based conformal prediction method produced contiguous prediction sets that maintained a favorable balance between set width and the magnitude of ordinal miscoverage. The empirical results indicated that the proposed method outperformed existing conformal prediction techniques in terms of both coverage and efficiency.
Implications
The findings suggest that RPS-based conformal prediction can significantly enhance uncertainty quantification in ordinal classification tasks, making it particularly valuable in high-stakes applications such as medical diagnosis and financial risk assessment. This method could lead to more reliable decision-making processes in these critical areas.
Adaptive Joint Compression and Synchronisation in Federated Split Learning for IoT Rainfall Prediction
Federated Learning
Time Series
Efficient ML
- Introduction of a joint optimization approach for activation compression and synchronization intervals in FSL.
- Validation of the framework through extensive simulations and real-world Raspberry Pi deployments.
- Demonstrated significant reductions in communication overhead while maintaining predictive performance.
- Adaptive scheduling mechanism adjusts communication parameters based on runtime latency signals.
Summary
This paper presents a novel framework for Federated Split Learning (FSL) aimed at improving IoT rainfall prediction by addressing the communication bottlenecks caused by repeated activation and gradient exchanges. Previous approaches have optimized activation compression or synchronization frequency independently, but this work introduces a joint optimization strategy that adapts both parameters based on latency signals. The proposed system is evaluated using hourly ERA5 data from 11 weather stations, employing a 17-scenario simulation matrix and a real-world deployment on Raspberry Pi devices. The results demonstrate that the adaptive mechanism can effectively balance communication efficiency and predictive performance, achieving significant reductions in activation upload payload and synchronization traffic without compromising model accuracy. The findings highlight the potential for improved resource management in bandwidth-constrained IoT environments, paving the way for more efficient collaborative training methods.
Methodology
The authors developed an FSL framework that employs a latency-driven scheduler to jointly regulate activation compression and synchronization intervals. The system was tested through simulations across various latency profiles and a real-world deployment on Raspberry Pi devices, allowing for the evaluation of communication efficiency and predictive performance.
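One way to picture a latency-driven scheduler that sets both knobs at once is the rule-based sketch below. The thresholds, bit widths, and intervals are hypothetical values picked for illustration; the summary does not describe the paper's actual policy.

```python
def schedule(latency_ms, low=50.0, high=200.0):
    """Pick (activation bit width, synchronization interval in rounds)
    from an observed latency. All thresholds and levels are hypothetical."""
    if latency_ms < low:
        return 32, 1   # fast link: full precision, sync every round
    if latency_ms < high:
        return 8, 2    # moderate latency: compress and sync less often
    return 4, 4        # slow link: aggressive compression, sparse sync

def payload_reduction(bits, baseline_bits=32):
    """Fractional reduction in activation payload versus the baseline width."""
    return 1.0 - bits / baseline_bits

for latency in (10.0, 100.0, 500.0):
    bits, interval = schedule(latency)
    print(latency, bits, interval, payload_reduction(bits))
```

The point of the joint rule is that a single latency signal moves both parameters together, instead of tuning compression and synchronization frequency in isolation.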
Results
The adaptive scheduling mechanism resulted in an 87% reduction in activation upload payload and a 54% reduction in synchronization traffic compared to the float32 baseline. The predictive performance, measured by AUPRC, remained stable across configurations, indicating that aggressive quantization and sparser aggregation did not significantly degrade model accuracy.
Implications
The findings suggest that the proposed framework can enhance the efficiency of federated learning in IoT applications, particularly in scenarios with limited bandwidth. This could lead to broader adoption of distributed machine learning techniques in environmental monitoring and smart city applications, ensuring privacy and reducing communication costs.
ASAP: Agent-System Co-Design for Wall-Clock-Centered Auto HPO Research for ML Experiments
Optimization
- ASAP integrates multiple inductive-biased optimizers to enhance sample efficiency in HPO.
- The approach focuses on minimizing real-world wall-clock time rather than just iteration count.
- Innovative techniques such as KV-cache reuse and speculation parallelism are employed to optimize performance.
- Extensive experiments validate the superiority of ASAP over traditional HPO methods.
Summary
The paper introduces ASAP, a novel approach to Hyperparameter Optimization (HPO) that addresses key limitations of existing methods by integrating a diverse pool of inductive-biased optimizers under a single LLM agent. Traditional HPO tools often struggle with sample efficiency and performance when faced with diverse problems due to their reliance on specific surrogate models that impose inductive biases. Recent LLM-based methods have shown promise but still suffer from limitations, primarily focusing on iteration count rather than real-world wall-clock time. ASAP overcomes these challenges through an agent-system co-design that emphasizes wall-clock efficiency. It employs a prefix-stable prompt to maximize KV-cache reuse, utilizes speculation parallelism to hide latencies, and incorporates a Self-Tuner to adaptively optimize the speculation threshold. Extensive experiments demonstrate that ASAP consistently outperforms existing baselines across various HPO tasks, highlighting the effectiveness of tool integration and the importance of wall-clock-centered design in HPO research.
Methodology
ASAP employs an agent-system co-design that integrates a diverse pool of HPO optimizers, utilizing an LLM to select among their proposals. It features a KV-cache-aware prefix-stable prompt for efficient caching, speculation parallelism to reduce latency, and a Self-Tuner for adaptive optimization based on execution logs.
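Why a prefix-stable prompt helps KV-cache reuse can be shown with a small sketch. It treats each prompt segment as one cacheable unit and compares an append-only layout with one that puts the newest trial first; both layouts and all names are assumptions for illustration, not ASAP's actual prompt format.

```python
def common_prefix_len(a, b):
    """Number of leading segments shared by two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_stable_prompt(system, trials):
    """Static instructions first, then trials in append-only order."""
    return [system] + [f"trial {i}: {t}" for i, t in enumerate(trials)]

def build_unstable_prompt(system, trials):
    """Newest trial first: earlier segments shift on every call."""
    numbered = [f"trial {i}: {t}" for i, t in enumerate(trials)]
    return [system] + numbered[::-1]

system = "You tune hyperparameters."      # hypothetical instructions
before, after = ["lr=0.1"], ["lr=0.1", "lr=0.01"]

stable = common_prefix_len(build_stable_prompt(system, before),
                           build_stable_prompt(system, after))
unstable = common_prefix_len(build_unstable_prompt(system, before),
                             build_unstable_prompt(system, after))
print(stable, unstable)
```

In the append-only layout the whole previous prompt is a prefix of the next one, so a cache keyed on prefixes can skip recomputing it; the reordered layout shares only the instruction segment.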
Results
The experiments conducted on various modern HPO tasks show that ASAP consistently outperforms baseline methods, demonstrating significant improvements in wall-clock efficiency and optimization quality.
Implications
ASAP's approach could lead to more efficient HPO practices in machine learning, enabling researchers and practitioners to achieve better model performance with reduced computational resources and time. This could facilitate broader applications of machine learning across diverse domains.
EPTS: Elastic Post-Training Sparsity for Efficient Large Language Model Compression
NLP
Large Language Models
Efficient ML
- EPTS provides a unified framework for Multi-Sparsity optimization, eliminating the need for separate optimization sessions.
- The MS-HiLoRA mechanism allows for effective knowledge inheritance across different sparsity levels.
- The MSFM enhances model adaptability to varying sparsity configurations.
- EPTS demonstrates competitive performance against existing methods while improving deployment efficiency.
Summary
The paper introduces Elastic Post-Training Sparsity (EPTS), a novel framework designed to enhance the efficiency of Large Language Model (LLM) compression. Traditional Post-Training Sparsity (PTS) methods require separate optimization sessions for each sparsity level, which is time-consuming and limits flexibility in deployment across various hardware configurations. EPTS addresses these limitations by providing a unified Multi-Sparsity framework that allows for a single elastic model to maintain robust performance across different sparsity configurations through a one-shot optimization process. The authors propose two key mechanisms: the Multi-Sparsity Hierarchy LoRA (MS-HiLoRA), which facilitates knowledge transfer from low to high sparsity groups, and the Multi-Sparsity Feature Mixer (MSFM), which enhances adaptability to pruning perturbations by dynamically fusing feature representations. Experimental results on LLaMA and OPT families demonstrate that EPTS achieves competitive performance compared to state-of-the-art methods like SparseGPT and Wanda, while significantly improving efficiency by enabling multi-scenario deployment from a single optimization.
Methodology
The methodology involves the development of the EPTS framework, which integrates the MS-HiLoRA and MSFM mechanisms. MS-HiLoRA enables knowledge transfer between sparsity levels, while MSFM dynamically combines feature representations to improve adaptability. The framework is optimized in a single session, allowing for the generation of a model that performs well across multiple sparsity configurations.
Results
Extensive experiments conducted on LLaMA and OPT families show that EPTS achieves performance comparable to leading methods such as SparseGPT and Wanda. Additionally, EPTS provides significant efficiency gains by allowing for flexible deployment across various hardware scenarios without the need for multiple optimization processes.
Implications
The implications of this work are significant for the deployment of large language models on resource-constrained devices, as EPTS allows for efficient model compression and adaptability to different operational environments. This can facilitate broader applications of LLMs in real-world scenarios where computational resources are limited.
A Fair Evaluation of Graph Foundation Models for Node Property Prediction
Graph Learning
- The study reevaluates nine recent GFMs for node property prediction in a standardized manner.
- Only the latest GFMs based on the Prior-data Fitted Networks paradigm outperform well-tuned GNNs.
- The paper emphasizes the need for a unified evaluation framework in the GFM community.
- Higher computational costs are associated with the GFMs that outperform GNNs.
Summary
This paper addresses the growing interest in Graph Foundation Models (GFMs) for node property prediction, a crucial task in Graph Machine Learning with applications in various industries. Despite the emergence of numerous GFMs, the lack of a unified evaluation framework has hindered reliable comparisons among these models and with traditional Graph Neural Networks (GNNs). The authors conduct a comprehensive empirical study to reevaluate nine recent GFMs against well-tuned GNN baselines. Their findings reveal that only the latest GFMs, particularly those based on the Prior-data Fitted Networks paradigm, demonstrate superior predictive performance compared to GNNs, albeit with increased inference costs. This study highlights the necessity for standardized evaluation metrics in the field to facilitate fair comparisons and advance the development of effective models for node property prediction.
Methodology
The authors conducted a large empirical study comparing nine recent GFMs with strong GNN baselines. They established a more reliable and realistic evaluation setting to assess the predictive performance of these models, addressing issues such as dataset selection and baseline strength that have plagued previous evaluations.
Results
The study found that most GFMs could not outperform well-tuned GNNs, with only a few recent models based on the Prior-data Fitted Networks paradigm achieving better predictive performance. However, these models incurred significantly higher inference costs.
Implications
The results suggest that while GFMs have potential, their practical application may be limited by computational costs. The need for standardized evaluation metrics could lead to more reliable comparisons and advancements in model development for node property prediction tasks across various domains.
Learning with a Single Rollout via Monte Carlo Pass@k Critic
Reinforcement Learning
Large Language Models
NLP
- Introduces SR-PPO for efficient token-level credit assignment in RL for language models.
- Utilizes a single rollout to mitigate the computational cost and improve credit assignment accuracy.
- Employs a Pass@k metric to provide a more selective learning signal compared to traditional Pass@1.
- Demonstrates stable learning dynamics and improved success rates on reasoning benchmarks.
Summary
This paper introduces Single-Rollout Proximal Policy Optimization (SR-PPO), a novel reinforcement learning (RL) approach designed to estimate token-level advantages for language models while minimizing the computational burden of multiple rollouts. Traditional methods struggle with credit assignment due to sparse rewards and heterogeneous trajectories, particularly in complex reasoning tasks. SR-PPO addresses these challenges by employing a calibrated token-level credit critic that predicts the Pass@k success probability based on a single rollout per prompt. This method allows for a more selective learning signal, prioritizing difficult prefixes that are less likely to succeed. The authors demonstrate that as k increases, Pass@k approaches a reachability indicator, which can be computed efficiently, thus providing a robust alternative to traditional credit assignment methods. Initial experiments show that SR-PPO achieves stable learning dynamics and significant improvements in Pass@128 success rates on mathematical reasoning benchmarks, indicating its effectiveness in enhancing the training of language models in complex reasoning tasks.
Methodology
The methodology involves training a calibrated token-level credit critic using Monte Carlo outcomes from a single rollout per prompt. The critic predicts the Pass@k success probability, which is then used to assign credit to tokens based on their contribution to the likelihood of achieving a successful outcome. This approach replaces traditional advantage estimation methods that rely on multiple rollouts and bootstrapping.
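The link between Pass@k and a reachability indicator follows from a short calculation, sketched below. It assumes the k samples are independent with a common per-sample success probability p, which is a simplification introduced here and not a statement about how the paper's critic is trained.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples succeeds,
    given per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k

# A prefix with a small but nonzero chance of success versus an unreachable one.
for p in (0.0, 0.01, 0.3):
    print(p, [round(pass_at_k(p, k), 4) for k in (1, 8, 128, 4096)])
```

For any p greater than zero the value climbs toward 1 as k grows, while p equal to zero stays at 0. In the limit, Pass@k only records whether success is reachable at all from the current prefix.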
Results
The results indicate that SR-PPO leads to stable learning dynamics and consistent gains in Pass@128 success rates on benchmarks such as HMMT26 and AIME24. The method demonstrates improved sample efficiency compared to traditional group-based optimization methods, allowing for effective learning from a single trajectory.
Implications
The implications of this research suggest that SR-PPO can significantly enhance the training of language models in complex reasoning tasks, particularly in scenarios where generating multiple rollouts is expensive or impractical. This approach could lead to more efficient RL algorithms in various applications, including mathematical reasoning and code generation.
When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs
NLP
Large Language Models
Generative Models
- Top-1 concentration fails as a reliable stability warning for DLM fine-tuning.
- Max gradient norm serves as a more effective parameter-side signal for training stability.
- The study provides a family-calibrated triage protocol with significant predictive precision.
- Calibration of monitoring thresholds should be specific to each DLM family.
Summary
This paper investigates the efficacy of using top-1 argmax concentration as a stability warning for fine-tuning discrete diffusion language models (DLMs) with low-rank adaptation (LoRA). The authors conducted experiments across 816 configurations from three DLM families, discovering that while the top-1 warning triggered in all configurations, it failed to predict any actual collapses, resulting in zero precision. The authors attribute this failure to pre-equilibrium saturation, where top-1 concentration is already high before optimization. To address this, they evaluated the max LoRA gradient norm as a more reliable signal for monitoring training stability. Their findings indicate that a threshold based on max gradient norm can effectively identify stable configurations, achieving a precision of 0.68 and an F1 score of 0.79 on held-out data, outperforming the top-1 baseline. The paper concludes with a proposed workflow for DLM-LoRA training that emphasizes the use of max-gradient monitoring over top-1 concentration, suggesting that calibration should be model family-specific rather than universal.
Methodology
The authors conducted a series of experiments across 816 DLM PEFT configurations from three model families, measuring top-1 concentration and max gradient norm. They analyzed the correlation between these metrics and actual training stability, employing statistical tests to validate their findings and establish thresholds for monitoring.
Results
The study found that top-1 concentration indicated a collapse warning in all configurations (816/816) but recorded zero actual collapses (0/816), resulting in a precision of 0. The max gradient norm, however, was able to predict stable configurations with a precision of 0.68 and an F1 score of 0.79 on a held-out dataset, significantly outperforming the top-1 baseline.
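The reported precision figures can be checked with a few lines of arithmetic. The first call uses the counts quoted above; the second back-solves the recall implied by the rounded precision and F1 values, which is an inference made here under the assumption that both figures describe the same classifier and class, not a number reported in the summary.

```python
def precision(true_pos, false_pos):
    """Fraction of raised warnings that were correct (0.0 if none raised)."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0

def recall_from_f1(prec, f1):
    """Solve F1 = 2PR / (P + R) for the recall R."""
    return f1 * prec / (2.0 * prec - f1)

# Top-1 warning: fired on all 816 configurations, none of which collapsed.
print(precision(true_pos=0, false_pos=816))

# Max-gradient monitor: recall implied by the rounded precision and F1.
print(round(recall_from_f1(0.68, 0.79), 3))
```

A warning that fires everywhere while nothing collapses has only false positives, which is why its precision is exactly zero regardless of the threshold details.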
Implications
The results suggest that practitioners should reconsider the reliance on top-1 concentration as a diagnostic tool in DLM fine-tuning and adopt max gradient norm monitoring instead. This shift could lead to more effective training strategies and improved model stability, particularly in low-cost training scenarios.
EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games
Reinforcement Learning
- EMAgnet introduces an adaptive regularization target using an exponential moving average of policy parameters.
- The method improves upon traditional uniform regularization by focusing on viable strategies and discarding dominated ones.
- Evaluation shows EMAgnet achieves lower exploitability and better performance in various game scenarios.
- The approach is applicable to deep reinforcement learning, extending previous tabular methods to neural network policies.
Summary
This paper introduces EMAgnet, a novel regularization technique for policy gradient methods in self-play scenarios, particularly in two-player zero-sum imperfect-information games. Traditional methods, such as Proximal Policy Optimization (PPO), utilize a uniform distribution as a regularization target, which can lead to inefficient learning by equally promoting all actions, including strictly dominated strategies. EMAgnet addresses this limitation by employing an exponential moving average (EMA) of the last-iterate policy's parameters as a dynamic regularization target. This adaptive approach allows the regularization to evolve alongside the agent's strategy, focusing on viable actions while discarding dominated ones. The authors evaluate EMAgnet against standard benchmarks and modified environments with exploration challenges, demonstrating that it consistently achieves lower exploitability and improved performance compared to PPO with uniform regularization. The findings suggest that EMAgnet enhances the efficiency of learning in complex games by better aligning the regularization process with the agent's strategic development.
Methodology
The authors extend the concept of a moving magnet from tabular settings to deep reinforcement learning by implementing EMA regularization within the PPO framework. They maintain an exponential moving average of the policy parameters, which adapts as the policy improves, thus dynamically adjusting the regularization target based on the agent's learning progress.
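The EMA-as-magnet idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy single-state softmax policy, the function names, the KL direction, and the `decay` and `coef` values are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def ema_update(magnet_params, policy_params, decay=0.99):
    """Move the magnet a small step toward the current policy parameters."""
    return decay * magnet_params + (1.0 - decay) * policy_params

def regularized_loss(policy_loss, policy_params, magnet_params, coef=0.1):
    """Add a KL penalty that pulls the policy toward the EMA magnet.

    The KL direction and the coefficient are assumptions, not taken
    from the paper.
    """
    pi = softmax(policy_params)
    magnet = softmax(magnet_params)
    return policy_loss + coef * kl_divergence(pi, magnet)

# Toy usage: parameters are the logits of a single-state softmax policy.
theta = np.array([2.0, 0.5, -3.0])   # current policy logits (hypothetical)
magnet = np.zeros(3)                 # magnet starts at the uniform policy
for _ in range(200):
    magnet = ema_update(magnet, theta, decay=0.95)
```

In this toy setting the magnet starts at the uniform policy and drifts toward the current policy, so the penalty on a near-zero-probability (for example, dominated) action shrinks over time instead of staying fixed as it would with a uniform target.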
Results
EMAgnet outperforms PPO with uniform regularization in most tested environments, particularly in scenarios with strictly dominated strategies. The results indicate that EMAgnet leads to lower exploitability and consistent performance gains across various benchmarks, showcasing its effectiveness in complex game settings.
Implications
The findings suggest that EMAgnet could be a valuable tool for enhancing self-play training in AI systems, particularly in complex strategic environments. Its adaptive nature may lead to more efficient learning and better performance in a wide range of applications, including competitive gaming and decision-making systems.
The Inference-Compute Frontier and a Latency-Efficient Architecture for Limit Order Book Prediction
Time Series
Efficient ML
- Identification of a power-law relationship between predictive loss and structural forward work in LOB prediction.
- Demonstration that latency behaves differently from compute, necessitating separate considerations in model design.
- Introduction of FastBiNLOB, an architecture that achieves lower latency while maintaining high predictive accuracy.
- Empirical validation of the inference-compute frontier across various model families.
Summary
This paper investigates the existence of a scaling-law-style inference-compute frontier in the context of limit order book (LOB) prediction. Utilizing the FI-2010 dataset, the author examines various model architectures, including decision trees and neural networks, to determine the relationship between predictive loss and structural forward work. The findings reveal a power-law relationship that effectively summarizes the empirical frontier, achieving an R² value of 0.941 when extrapolating to high-compute neural architectures. However, the study also highlights that latency does not correlate as strongly with compute, prompting the development of FastBiNLOB, a new architecture designed to optimize both predictive accuracy and latency. FastBiNLOB demonstrates superior performance in terms of macro-F1 scores and reduced latency compared to existing state-of-the-art models. The paper contributes to the understanding of model complexity in LOB prediction and emphasizes the need to consider latency as a distinct design target alongside predictive power.
Methodology
The study employs a comparative analysis of various model architectures on the FI-2010 dataset, focusing on both predictive loss and latency. It uses power-law fitting to characterize the inference-compute frontier and conducts experiments to evaluate the performance of FastBiNLOB against existing models in terms of macro-F1 scores and latency.
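As a rough illustration of the power-law fitting step, the sketch below fits loss ≈ a · work^(−b) by linear regression in log-log space and reports R² in that space. The functional form without an irreducible-loss term, the choice to compute R² on the fitted points, and the synthetic data are assumptions; the paper's exact procedure (including how R² is measured under extrapolation) may differ.

```python
import numpy as np

def fit_power_law(work, loss):
    """Fit loss ~= a * work**(-b) via least squares on log-transformed data.

    Returns (a, b, r2), where r2 is the coefficient of determination
    computed in log-log space.
    """
    log_w = np.log(np.asarray(work, dtype=float))
    log_l = np.log(np.asarray(loss, dtype=float))
    slope, intercept = np.polyfit(log_w, log_l, 1)
    pred = slope * log_w + intercept
    ss_res = np.sum((log_l - pred) ** 2)
    ss_tot = np.sum((log_l - log_l.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(np.exp(intercept)), float(-slope), float(r2)

# Synthetic example (not data from the paper): an exact power law.
work = np.array([1e3, 1e4, 1e5, 1e6])   # hypothetical forward-work values
loss = 2.0 * work ** -0.1               # hypothetical predictive loss
a, b, r2 = fit_power_law(work, loss)
```

To mimic an extrapolation check, one could fit on the low-compute models only and compute R² on held-out high-compute models using the same `a` and `b`.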
Results
The empirical analysis confirms the existence of a scaling-law-style inference-compute frontier in LOB prediction, with an R² value of 0.941 for the power-law fit. FastBiNLOB outperforms existing architectures, achieving lower latency and higher macro-F1 scores on the y10 and y100 targets, with median decreases in inference time of approximately 23% and 60%, respectively.
Implications
The findings suggest that model complexity in LOB prediction should be carefully balanced with latency considerations, especially in high-frequency trading environments. The introduction of FastBiNLOB provides a promising avenue for developing more efficient predictive models that can operate effectively in real-time scenarios.
FactorLibrary: From Polynomials to Circuits via Recursive Subgoals
Reinforcement Learning
Theory
Optimization
- Introduces FactorLibrary to manage combinatorial search space in arithmetic circuit optimization.
- Formulates the problem as a reinforcement learning task with both bottom-up and top-down approaches.
- Demonstrates that the top-down PPO+MCTS agent achieves a 91.8% success rate for complexity up to 8.
- Shows that learned policies generalize well to unseen targets, outperforming random baselines.
Summary
This paper addresses the challenge of finding minimal arithmetic circuits for polynomials over finite fields, a problem central to algebraic complexity theory. The authors propose a novel framework called FactorLibrary, which utilizes reusable subgoals to manage the combinatorial complexity of the search space. The problem is formulated as a reinforcement learning task approached from both bottom-up and top-down perspectives. The bottom-up agent employs Gumbel-PPO-MCTS, while two top-down agents utilize PPO+MCTS and Soft Actor-Critic (SAC). The FactorLibrary serves as a repository of factorizable subexpressions, enhancing the agents' ability to learn and generalize across training episodes. The results indicate that the top-down PPO+MCTS agent achieves a success rate of 91.8% in finding optimal circuits for polynomials of complexity up to 8, significantly outperforming the bottom-up approach and a random baseline. This work not only advances the understanding of arithmetic circuit optimization but also provides a testbed for symbolic search heuristics applicable in broader contexts such as quantum circuit synthesis and automated proof generation.
Methodology
The authors implemented a reinforcement learning framework with three agents: a bottom-up agent using Gumbel-PPO-MCTS and two top-down agents employing PPO+MCTS and SAC. The FactorLibrary was utilized to store reusable subgoals, facilitating the learning process by breaking down complex goals into manageable components.
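To make the idea of a library of reusable subgoals concrete, here is a small sketch of one way such a store could work: polynomials over a prime field are reduced to a canonical key, and a subexpression that has already been built is looked up instead of costing a new gate. The field size `P`, the `SubgoalLibrary` class, and the dict-based polynomial representation are hypothetical and not taken from the paper.

```python
P = 5  # hypothetical prime field size

def canon(poly):
    """Canonical hashable key: sorted (monomial, coeff) pairs, zeros dropped."""
    return tuple(sorted((m, c % P) for m, c in poly.items() if c % P))

def add(a, b):
    """Add two polynomials over F_P (monomial exponent tuple -> coefficient)."""
    out = dict(a)
    for m, c in b.items():
        out[m] = (out.get(m, 0) + c) % P
    return out

def mul(a, b):
    """Multiply two polynomials over F_P."""
    out = {}
    for ma, ca in a.items():
        for mb, cb in b.items():
            m = tuple(x + y for x, y in zip(ma, mb))
            out[m] = (out.get(m, 0) + ca * cb) % P
    return out

class SubgoalLibrary:
    """Stores each distinct polynomial once, together with the gate computing it."""

    def __init__(self):
        self.known = {}   # canonical polynomial -> gate index
        self.gates = []   # list of (op, left_index, right_index)

    def register(self, poly, op, left=None, right=None):
        """Return the gate index for poly, adding a gate only if it is new."""
        key = canon(poly)
        if key not in self.known:
            self.known[key] = len(self.gates)
            self.gates.append((op, left, right))
        return self.known[key]

# Build (x + y)^2 over F_5, then request x + y again via a different route.
x = {(1, 0): 1}
y = {(0, 1): 1}
lib = SubgoalLibrary()
ix = lib.register(x, "input")
iy = lib.register(y, "input")
s = add(x, y)
i_sum = lib.register(s, "add", ix, iy)
sq = mul(s, s)
i_sq = lib.register(sq, "mul", i_sum, i_sum)
i_again = lib.register(add(y, x), "add", iy, ix)  # reuses the stored subgoal
```

Keying on a canonical form is what lets `y + x` hit the entry created for `x + y`; a learned agent would additionally need a policy for choosing which subgoals to attempt, which this sketch does not cover.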
Results
The bottom-up agent achieved a near-perfect 99.2% success rate on small circuits, but its success rate dropped to 7.8% at higher complexities due to combinatorial explosion. In contrast, the top-down PPO+MCTS agent maintained a 91.8% success rate across complexities C2 to C10, while the SAC agent reached a success rate of 92.8% at a lower computational cost. Both top-down agents significantly outperformed the 40% success rate of a uniform random baseline.
Implications
The findings suggest that reinforcement learning can effectively tackle complex problems in algebraic complexity theory, with potential applications in symbolic reasoning, quantum circuit synthesis, and automated proof generation. The FactorLibrary approach may also inspire new methods in other areas of machine learning that require efficient search strategies.