AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
FedUP: One-Shot Federated Unlearning via Centroid-Guided Plug-in Filters
Federated Learning
- FedUP provides a one-shot federated unlearning solution that reduces latency from minutes to seconds.
- The framework uses lightweight pluggable filters to maintain model performance while unlearning specific data.
- It supports reversibility, allowing for easy restoration of previously forgotten knowledge.
- Extensive experiments validate the effectiveness of FedUP across diverse tasks.
Summary
The paper introduces FedUP, a novel framework for federated unlearning (FU) that addresses the challenges of non-target knowledge loss and high request latency in decentralized systems. Current FU methods struggle to balance the need for rapid unlearning with the preservation of model performance. FedUP employs lightweight pluggable filters that act as a 'knowledge funnel,' allowing for the removal of specific data while maintaining the integrity of the original model. By freezing the model parameters and training the filters using differentially private class centroid samples, FedUP significantly reduces unlearning latency from minutes to seconds, bypassing the need for extensive client-server communication and complex retraining processes. The framework also supports reversibility, enabling the restoration of forgotten knowledge simply by removing the filters. Experimental results across various image and text tasks demonstrate that FedUP effectively minimizes non-target knowledge loss while achieving high unlearning precision and efficiency.
Methodology
FedUP utilizes a framework of lightweight pluggable filters that are trained on differentially private class centroid samples. The original model parameters are kept frozen, allowing for quick unlearning without extensive retraining. This approach minimizes non-target knowledge loss and reduces communication overhead between clients and the server.
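The idea of a frozen model with a removable plug-in filter can be illustrated with a toy example. This is only a sketch under loud assumptions: `frozen_model`, `make_forget_filter`, and `predict` are hypothetical names, the weights are made up, and the filter here is a hand-written logit mask rather than a filter trained on differentially private class centroids as the paper describes.

```python
from typing import Callable, List, Optional

Logits = List[float]


def frozen_model(features: List[float]) -> Logits:
    """Stand-in for the frozen original model (hypothetical fixed weights)."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 classes, 2 features
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def make_forget_filter(forgotten_class: int) -> Callable[[Logits], Logits]:
    """Build a plug-in that suppresses one class (illustrative mask, not trained)."""
    def apply(logits: Logits) -> Logits:
        return [float("-inf") if i == forgotten_class else v
                for i, v in enumerate(logits)]
    return apply


def predict(features: List[float],
            plug_in: Optional[Callable[[Logits], Logits]] = None) -> int:
    """Run the frozen model, optionally passing its logits through a plug-in."""
    logits = frozen_model(features)
    if plug_in is not None:
        logits = plug_in(logits)
    return max(range(len(logits)), key=logits.__getitem__)


x = [2.0, 0.0]
before = predict(x)                         # no filter attached
after = predict(x, make_forget_filter(0))   # class 0 is "forgotten"
restored = predict(x)                       # filter removed again
```

Because the base model is never modified, detaching the plug-in returns the original prediction, which mirrors the reversibility property described above.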
Results
The experiments conducted on various image and text datasets show that FedUP achieves superior unlearning precision and efficiency compared to existing methods, significantly lowering the latency associated with federated unlearning while preserving model performance.
Implications
FedUP's approach to federated unlearning has significant implications for compliance with privacy regulations such as GDPR, particularly in applications where the right to be forgotten is critical. Its efficient and reversible unlearning process can enhance user trust in machine learning systems that handle sensitive data.
3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy
Computer Vision
Multimodal
- 3D Masked Autoencoders outperform 2D models in cellular representation learning.
- Channel cross-attention and frequency-domain regularization enhance volumetric representation quality.
- Integration of protein sequence information via a pretrained model improves downstream task performance.
- MAE-3D achieves state-of-the-art results in protein localization and interaction tasks.
Summary
This paper investigates how well 3D Masked Autoencoders (MAE-3D) learn volumetric and multimodal cellular representations from fluorescence microscopy data, in contrast to 2D approaches. The authors systematically compare MAE-3D with 2D max-projection and slice-based models on volumetric microscopy datasets and show that MAE-3D consistently outperforms its 2D counterparts on downstream tasks such as protein-protein interaction and protein localization prediction. The study highlights the importance of preserving the full 3D structure of cells for more informative representations. In addition, integrating a pretrained protein language model (ESM2) through cross-modal supervision improves representation quality, showing that multimodal alignment benefits volumetric models. The findings underscore the advantages of native 3D modeling and of incorporating biological knowledge into representation learning for single-cell microscopy.
Methodology
The authors developed a 3D Masked Autoencoder (MAE-3D) framework that processes volumetric z-stack images. They compared this with a 2D variant (MAE-2D) that uses maximum intensity projections. The model incorporates channel cross-attention and frequency-domain regularization to leverage 3D spatial context. Additionally, they aligned visual representations with a pretrained protein language model (ESM2) to enhance the representation learning process.
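To see why a maximum intensity projection can discard information that a volumetric model keeps, consider the toy z-stacks below. This is an illustrative sketch only: the function name and the tiny 2x2x2 volumes are assumptions, not data or code from the paper.

```python
from typing import List

Volume = List[List[List[int]]]  # indexed as volume[z][row][col]


def max_intensity_projection(volume: Volume) -> List[List[int]]:
    """Collapse a z-stack into a single 2D image by taking the max over z."""
    depth = len(volume)
    rows = len(volume[0])
    cols = len(volume[0][0])
    return [
        [max(volume[z][r][c] for z in range(depth)) for c in range(cols)]
        for r in range(rows)
    ]


# Two different volumes: the bright spots sit on opposite z-slices.
vol_a: Volume = [[[9, 0], [0, 0]], [[0, 0], [0, 5]]]
vol_b: Volume = [[[0, 0], [0, 5]], [[9, 0], [0, 0]]]

proj_a = max_intensity_projection(vol_a)
proj_b = max_intensity_projection(vol_b)
```

The two volumes differ in where the signal sits along z, yet their projections are identical, so a 2D model fed only the projection cannot tell them apart.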
Results
MAE-3D achieved a ROC–AUC of 0.865 on a protein–protein interaction task, surpassing previous methods by up to 0.025. For protein localization, the best 3D model reached a micro AUC of 0.952 and an F1 micro score of 0.742, improving upon previous approaches by 0.003 and 0.010, respectively. These results demonstrate the effectiveness of 3D modeling and multimodal integration in enhancing representation learning.
Implications
The findings suggest that adopting 3D modeling techniques in microscopy can significantly improve the understanding of cellular structures and functions. This approach may lead to advancements in biological research and applications in areas such as drug discovery, disease modeling, and personalized medicine.
Structure-Aware Graph Multi-Task Learning for Dynamic Sparse OD Demand Prediction
Graph Learning
Time Series
Optimization
- SAGMTL effectively addresses the challenges of dynamic sparsity and long-tailed distributions in OD demand prediction.
- The framework decomposes OD prediction into structural state modeling and flow intensity estimation.
- A node-edge collaborative representation module captures essential regional and temporal dynamics.
- SAGMTL outperforms classical spatiotemporal forecasting models and advanced OD prediction methods.
Summary
The paper addresses the challenge of Origin-Destination (OD) demand prediction in intelligent transportation systems, particularly in the context of dynamically sparse and long-tailed OD flow data. Traditional methods often treat OD prediction as a single flow regression task, which limits their effectiveness in modeling low-frequency and intermittent OD interactions. To overcome these limitations, the authors propose SAGMTL, a Structure-Aware Graph Multi-Task Learning framework that decomposes OD prediction into two main components: structural state modeling and flow intensity estimation. SAGMTL utilizes a node-edge collaborative representation module to capture regional semantics, temporal dynamics, and spatial priors, producing structure-aware representations for dynamic OD interactions. The framework jointly models stable demand patterns and short-term fluctuations through a multi-task decoding module, while a constraint-driven optimization module enhances sparsity awareness and structural consistency. Experimental results on real-world datasets from Beijing, Chengdu, and Nanjing demonstrate that SAGMTL outperforms existing state-of-the-art methods, highlighting the importance of explicitly modeling regional activity, connection states, and flow intensity for robust OD demand prediction.
Methodology
SAGMTL consists of three main components: a representation learning module that encodes static and dynamic features into node and edge representations, a multi-task decoding module that predicts OD connectivity states and flow intensities, and a constraint-driven optimization module that incorporates sparsity control and consistency constraints to improve prediction robustness.
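The split into a connectivity state and a flow intensity resembles a hurdle-style predictor, which can be sketched as follows. This is a minimal illustration, not the authors' decoder: `expected_od_flow`, the 0.5 threshold, and the gating rule are all assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def expected_od_flow(p_active: Matrix,
                     intensity: Matrix,
                     threshold: float = 0.5) -> Matrix:
    """Gate predicted intensities by the predicted connectivity state.

    OD pairs whose activity probability is below `threshold` are predicted
    as exactly zero, which keeps the output matrix sparse (hypothetical rule).
    """
    flows: Matrix = []
    for p_row, i_row in zip(p_active, intensity):
        flows.append([i if p >= threshold else 0.0
                      for p, i in zip(p_row, i_row)])
    return flows


# Toy 2x2 OD matrices: activity probabilities and conditional intensities.
p = [[0.9, 0.1], [0.4, 0.8]]
lam = [[12.0, 3.0], [5.0, 7.5]]
flows = expected_od_flow(p, lam)
```

Separating the two heads means the intensity regressor is not forced to explain the many structurally empty OD pairs, which is the motivation the summary gives for the decomposition.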
Results
On real-world datasets from Beijing, Chengdu, and Nanjing, SAGMTL consistently outperforms both classical spatiotemporal forecasting models and recent OD prediction methods, demonstrating its effectiveness on sparse and long-tailed data distributions.
Implications
The findings suggest that incorporating structural and semantic information into OD demand prediction models can significantly enhance their robustness and accuracy, with potential applications in traffic scheduling, urban planning, and intelligent transportation systems.
FlowPipe: LLM-Enhanced Conditional Generative Flow Networks for Data Preparation Pipeline Construction
Reinforcement Learning
Generative Models
Large Language Models
- FlowPipe reformulates data preparation pipeline construction as conditional probabilistic flow generation.
- It utilizes Conditional Generative Flow Networks optimized for effective credit assignment.
- Deep Semantic Modulation is introduced to enhance decision-making by integrating LLM-derived logical priors.
- The framework improves exploration efficiency by incorporating failure awareness to prune invalid states.
Summary
The paper introduces FlowPipe, a novel framework designed to automate the construction of data preparation pipelines, which are essential for enhancing data quality in machine learning workflows. Traditional methods for pipeline construction face significant challenges due to the combinatorial complexity of operator sequences and the high computational cost of end-to-end evaluations. FlowPipe addresses these issues by reformulating pipeline synthesis as conditional probabilistic flow generation over a directed acyclic graph. It employs Conditional Generative Flow Networks (C-GFlowNets) optimized through a Trajectory Balance objective, ensuring effective credit assignment from validation rewards to early actions. Additionally, the framework incorporates Deep Semantic Modulation via Feature-wise Linear Modulation (FiLM) to integrate logical priors from large language models (LLMs), enhancing the decision-making process by adapting to dataset context. To improve exploration efficiency, FlowPipe integrates failure awareness into the flow objective, allowing for early pruning of semantically invalid states. Experimental results demonstrate that FlowPipe significantly outperforms state-of-the-art baselines, achieving an average accuracy improvement of 11.96% and a 12.5× speedup in training convergence across two benchmark suites with 74 real-world datasets.
Methodology
FlowPipe employs Conditional Generative Flow Networks (C-GFlowNets) optimized via a Trajectory Balance objective for credit assignment. It introduces Deep Semantic Modulation using Feature-wise Linear Modulation (FiLM) to adapt decision logic to dataset context and incorporates failure awareness to enhance exploration efficiency.
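For readers unfamiliar with the Trajectory Balance objective, the sketch below shows its standard form from the GFlowNet literature for a single trajectory. It is not FlowPipe's implementation: the function name and the toy numbers are assumptions, and the paper's conditional, failure-aware variant may differ in detail.

```python
import math
from typing import Sequence


def trajectory_balance_loss(log_z: float,
                            log_pf: Sequence[float],
                            log_pb: Sequence[float],
                            reward: float) -> float:
    """Squared mismatch between forward flow and reward-weighted backward flow.

    log_z  : learned log partition function (would depend on the dataset
             context in a conditional setting)
    log_pf : log-probabilities of the forward actions along the trajectory
    log_pb : log-probabilities of the corresponding backward transitions
    reward : positive terminal reward, e.g. a validation score
    """
    forward = log_z + sum(log_pf)
    backward = math.log(reward) + sum(log_pb)
    return (forward - backward) ** 2


# Toy trajectory with two pipeline-building steps. Each state is assumed to
# have a single parent, so the backward probabilities are 1 (log 1 = 0).
balanced = trajectory_balance_loss(
    log_z=math.log(2.0),
    log_pf=[math.log(0.5), math.log(0.5)],
    log_pb=[0.0, 0.0],
    reward=0.5,
)
unbalanced = trajectory_balance_loss(
    log_z=math.log(2.0),
    log_pf=[math.log(0.5), math.log(0.5)],
    log_pb=[0.0, 0.0],
    reward=0.9,
)
```

Because the loss couples the terminal reward to every action on the trajectory in one term, the reward signal reaches early operator choices directly rather than through step-by-step bootstrapping.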
Results
FlowPipe outperforms state-of-the-art methods, with an average accuracy improvement of 11.96% and a 12.5× speedup in training convergence across two benchmark suites comprising 74 real-world datasets.
Implications
The proposed framework has the potential to streamline the data preparation process in machine learning, making it more efficient and less reliant on human expertise. This could lead to improved data quality and faster deployment of machine learning models in various applications.
Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
NLP
Large Language Models
Reinforcement Learning
- RLVR outperforms SFT in teaching efficient backtracking for reasoning tasks.
- SFT fails to learn backtracking strategies due to lack of exposure to dead ends.
- RLVR-trained models achieve exponential improvements in inference-time compute.
- Distilling reasoning traces from RLVR can enhance base model performance.
Summary
This paper investigates the advantages of reinforcement learning with verifiable rewards (RLVR) over supervised fine-tuning (SFT) in enhancing reasoning capabilities of large language models (LLMs). The authors model chain-of-thought (CoT) reasoning as a pathfinding problem on graphs and demonstrate that SFT, when trained solely on optimal paths, fails to teach models effective backtracking strategies. In contrast, RLVR allows models to learn backtracking through exploration of unsuccessful paths, leading to improved reasoning performance. The paper provides theoretical proofs showing that RLVR-trained models can achieve exponential improvements in inference-time compute efficiency compared to SFT models. Additionally, the authors propose that reasoning traces from RLVR can be distilled to enhance base models' backtracking capabilities, further optimizing their performance in reasoning tasks.
Methodology
The authors model CoT reasoning as a pathfinding problem on graphs, comparing RLVR and SFT through a theoretical framework. They analyze the learning dynamics of both methods, focusing on backtracking capabilities and the efficiency of inference-time compute. The study includes proofs demonstrating the differences in learning outcomes between the two approaches.
Results
The study proves that SFT does not enable backtracking strategies, while RLVR models can learn to backtrack effectively using outcome rewards. The RLVR model achieves a time complexity of Θ(WK) for reaching targets, compared to Θ(WLK) for SFT models. Additionally, distilling RLVR reasoning traces into a base model recovers the efficient backtracking capability.
Implications
The findings suggest that RLVR can significantly enhance the reasoning abilities of LLMs, making them more efficient in complex decision-making tasks. This has potential applications in various domains requiring advanced reasoning, such as automated problem-solving, code generation, and logical reasoning tasks.
AsyncOPD: How Stale Can On-Policy Distillation Be?
NLP
Large Language Models
Reinforcement Learning
- Asynchronous OPD can improve training efficiency but introduces challenges related to stale data.
- The direction of KL divergence (forward vs. reverse) affects the robustness of OPD to stale rollouts.
- Simpler OPD-specific methods outperform complex asynchronous RL techniques in mitigating staleness.
- Finite teacher-score caches create a bias-variance tradeoff that can be addressed with multi-sample Monte Carlo methods.
Summary
The paper investigates the challenges of on-policy distillation (OPD) in the context of asynchronous training, particularly focusing on the effects of stale data on model performance. OPD, which trains a student model using its own rollouts guided by teacher feedback, faces a bottleneck similar to reinforcement learning (RL) due to the time-consuming nature of rollout generation. The authors present a systematic study of staleness in asynchronous OPD, revealing that the choice of KL divergence direction significantly impacts the robustness to stale rollouts. They find that teacher-weighted forward KL is more resilient compared to student-weighted reverse KL, which is vulnerable to stale data. The paper also explores whether techniques from asynchronous RL can stabilize OPD learning, concluding that simpler OPD-specific methods, such as recomputing the reverse-KL signal at learner time, are more effective. Additionally, the authors analyze the bias-variance tradeoff introduced by finite teacher-score caches and propose a multi-sample Monte Carlo approach to mitigate variance while maintaining correctability. The paper culminates in the introduction of AsyncOPD, an open-source asynchronous OPD training pipeline that demonstrates significant improvements in training throughput without sacrificing accuracy.
Methodology
The authors conducted a systematic analysis of staleness in asynchronous OPD, comparing the effects of forward and reverse KL divergence on model performance. They evaluated various stabilization techniques from asynchronous RL and proposed a simpler method for recomputing the reverse-KL signal. Additionally, they explored the impact of finite teacher-score caches on estimator design and introduced a multi-sample Monte Carlo approach to reduce variance.
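The difference between the two KL directions comes down to which distribution weights the log-ratio. The sketch below computes both for a single token position; the toy distributions and the helper name `kl` are assumptions for illustration and do not reproduce the paper's estimators.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

# Forward KL: the expectation is taken under the teacher.
forward_kl = kl(teacher, student)

# Reverse KL: the expectation is taken under the student, so in practice it
# is estimated from the student's own rollouts.
reverse_kl = kl(student, teacher)
```

Since the reverse direction is estimated from student samples, rollouts produced by an older copy of the student no longer match the distribution the expectation assumes, which is one way to read the sensitivity to staleness reported above.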
Results
Experiments demonstrated that AsyncOPD achieves a training throughput increase of 1.6× to 3.8× compared to strict synchronous training, while maintaining comparable accuracy levels. The study also highlighted the vulnerabilities of reverse KL under staleness and the effectiveness of the proposed methods in addressing these issues.
Implications
The findings suggest that asynchronous training can be effectively applied to OPD, leading to more efficient training pipelines for large language models. The insights on staleness and the proposed methodologies could inform future research and applications in model distillation and reinforcement learning.
DREG: A Layer-Wise Jacobian Regularization as a General-Purpose Penalty
Theory
Optimization
Computer Vision
- DREG outperforms traditional regularizers in terms of accuracy and robustness, particularly under data scarcity.
- It is particularly effective with the GELU activation function, common in modern transformer architectures.
- DREG requires only a single hyperparameter and no per-dataset tuning, making it a practical drop-in solution.
- The method focuses regularization on layers with the highest activation derivatives, enhancing stability without global constraints.
Summary
This paper introduces DREG (Derivative Regularization), a novel layer-wise Jacobian regularization technique aimed at improving the performance of neural networks. The authors conducted a comprehensive empirical study involving 960 experiments across various activations, regularizers, datasets, and noise conditions to evaluate DREG's effectiveness. The findings reveal that DREG consistently outperforms traditional regularizers like Weight Decay and Dropout, particularly in scenarios with limited training data. It achieves the highest accuracy in clean conditions and ranks second in noise robustness, demonstrating its potential as a plug-and-play solution for modern deep learning architectures, especially those utilizing the GELU activation function. The study emphasizes DREG's ability to concentrate regularization where it is most needed, thus addressing the instability often encountered in deeper networks without sacrificing expressiveness.
Methodology
The authors employed a fully crossed factorial experimental design, comparing DREG against five alternatives (Dropout, Spectral Normalization, Weight Decay, IGPen, and an unregularized baseline) across eight datasets and various noise conditions. This design allows a systematic evaluation of DREG's layer-wise Jacobian penalty under different settings.
Results
DREG achieved the highest overall accuracy and clean-regime performance among all evaluated regularizers, significantly outperforming the unregularized baseline and other methods like Weight Decay and IGPen. It ranked second in noise robustness, only behind Spectral Normalization. The results indicated that DREG's advantages were most pronounced in data-scarce environments, highlighting its role as a geometric inductive bias.
Implications
DREG presents a promising regularization strategy for neural networks, particularly in applications where data is limited or where stability in deeper architectures is crucial. Its design allows for easy integration into existing models, making it a valuable tool for practitioners in various domains of deep learning.
Reasoning as Attractor Dynamics: Latent Memory Retrieval via Gibbs-Weighted Energy Minimization
NLP
Large Language Models
Theory
- LLMs can be viewed as Dense Associative Memories that store reasoning patterns as latent attractors.
- Correct reasoning chains correspond to deep attractor basins, while hallucinations correspond to sharp local minima.
- The proposed Gibbs-Weighted Basin Selection mechanism improves reasoning performance by sampling and weighting trajectories based on their stability.
- Empirical results show a 5.38% performance improvement on GSM8K, demonstrating the effectiveness of the proposed method.
Summary
This paper presents a novel perspective on the functioning of Large Language Models (LLMs), proposing that they operate as high-dimensional Dense Associative Memories. The author investigates the energy landscape of mathematical reasoning, suggesting that effective reasoning corresponds to deep attractor basins in the model's output distribution, while hallucinations arise from sharp, unstable local minima. To leverage this geometry, the paper introduces a retrieval mechanism based on Gibbs-weighted energy minimization, which samples multiple reasoning paths and weights them by their inverse energy. This approach effectively relaxes the system into a robust solution. The empirical results demonstrate a significant performance improvement of 5.38% on the GSM8K benchmark for Microsoft Phi-3.5, indicating that reasoning should be modeled as a dynamic settling process rather than a greedy next-token prediction. The work bridges the gap between modern Hopfield Networks and LLM reasoning, showcasing the potential of physics-inspired retrieval mechanisms in enhancing model performance.
Methodology
The methodology involves treating reasoning paths as particles in an energy landscape, defining their energy based on spectral entropy. The Gibbs Retrieval Operator is introduced to re-weight trajectories, allowing the system to relax into the global minimum by sampling multiple paths and applying a Gibbs distribution based on their stability.
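Gibbs re-weighting of sampled reasoning paths can be sketched as an energy-weighted vote. This is a minimal illustration under assumptions: the function name, the temperature parameter, and the energy values are hypothetical, and the energies are simply given here rather than derived from spectral entropy as in the paper.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def gibbs_weighted_answer(samples: List[Tuple[str, float]],
                          temperature: float = 1.0) -> str:
    """Return the answer with the largest total Gibbs weight exp(-E / T).

    `samples` holds (final_answer, energy) pairs, one per sampled path.
    """
    min_energy = min(e for _, e in samples)  # shift for numerical stability
    totals: Dict[str, float] = defaultdict(float)
    for answer, energy in samples:
        totals[answer] += math.exp(-(energy - min_energy) / temperature)
    return max(totals, key=totals.get)


# Four sampled paths: two agree on "42" with low-to-moderate energy.
samples = [("42", 0.3), ("42", 0.5), ("41", 0.2), ("17", 2.0)]
best = gibbs_weighted_answer(samples)
coldest = gibbs_weighted_answer(samples, temperature=0.01)
```

At a moderate temperature the vote favors answers supported by several low-energy paths, while a very low temperature collapses to the single lowest-energy path.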
Results
The proposed method led to a 5.38% increase in performance on the GSM8K benchmark for Microsoft Phi-3.5, achieving an accuracy of 90.1%. This improvement highlights the effectiveness of modeling inference as a dynamic settling process into attractor basins.
Implications
The findings suggest that LLMs can benefit from a more nuanced understanding of reasoning as a memory retrieval process, potentially leading to advancements in model design and performance in reasoning tasks. This approach could be applied to enhance various applications of LLMs in areas requiring complex reasoning.
One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tübingen, with a Parameter-Free Compression Baseline
Theory
- Introduces a 'same-hands' evaluation protocol for bivariate causal direction methods.
- Presents a parameter-free baseline method using sorted-conditional compression.
- Finds that existing method rankings change significantly under standardized evaluation.
- Documents mechanisms that inflate reported accuracy figures in the literature.
Summary
This paper presents a rigorous re-evaluation of bivariate causal direction methods using a standardized protocol on the Tübingen cause-effect pairs. The author argues that previous comparisons of methods are flawed due to varying protocols, including different subsets of pairs, weightings, and model selections. To address this, the author implements a 'same-hands' protocol where all methods are evaluated under identical conditions without tuning. A novel parameter-free baseline method based on sorted-conditional compression is introduced, which serves as a clean reference point. The results indicate that the rankings of existing methods change significantly under this common evaluation framework, with the baseline achieving a weighted accuracy of 74.7%. This score is statistically comparable to other established methods, challenging the inflated accuracy figures often reported in the literature. The paper also discusses the mechanisms behind these inflated figures and presents a model-free confounding detector. Overall, the findings emphasize the need for standardized evaluation in causal inference research.
Methodology
The methodology involves a strict evaluation protocol where all methods are run by the author on the same dataset of 102 Tübingen cause-effect pairs. Each method is forced to make a decision on every pair without tuning or abstention. The baseline method uses sorted-conditional compression to analyze the data, providing a zero-parameter reference point.
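A forced-decision weighted accuracy, the kind of score reported for this benchmark, can be computed as below. The function name, direction labels, and pair weights are made up for illustration; this sketch does not implement the sorted-conditional compression baseline itself.

```python
from typing import Sequence


def weighted_accuracy(predicted: Sequence[str],
                      truth: Sequence[str],
                      weights: Sequence[float]) -> float:
    """Weighted fraction of pairs whose causal direction is predicted correctly.

    Every pair must receive a decision, so there is no abstention option.
    """
    if not (len(predicted) == len(truth) == len(weights)):
        raise ValueError("all inputs must have one entry per pair")
    correct = sum(w for p, t, w in zip(predicted, truth, weights) if p == t)
    return correct / sum(weights)


# Toy example with four pairs; two pairs share a source and are down-weighted.
truth = ["X->Y", "X->Y", "Y->X", "X->Y"]
pred = ["X->Y", "Y->X", "Y->X", "X->Y"]
weights = [1.0, 0.5, 0.5, 1.0]
acc = weighted_accuracy(pred, truth, weights)
```

Fixing the pair subset, the weights, and the no-abstention rule across all methods is what makes the resulting numbers directly comparable.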
Results
The baseline method achieved a weighted accuracy of 74.7% on the 102 pairs, while other methods like SLOPE and RECI scored 77.2% and 70.7%, respectively. The paper highlights that these scores are statistically indistinguishable from the baseline, indicating that the baseline method performs comparably to the best existing methods.
Implications
The findings suggest that the field of causal inference may need to adopt more standardized evaluation practices to ensure fair comparisons among methods. This could lead to more reliable conclusions about the effectiveness of different causal inference techniques.
Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks
Efficient ML
Optimization
Computer Vision
- Introduces Deep Shift Neural Networks (DSNNs) as a solution to reduce computational demands in deep learning.
- Combines multi-fidelity and multi-objective optimization techniques to optimize DSNN configurations.
- Achieves an approximately 20% increase in accuracy and an over 60% reduction in emissions compared to default DSNNs.
- Demonstrates that quantizing smaller network portions can optimize energy consumption while maintaining performance.
Summary
This paper addresses the environmental and resource challenges posed by deep learning (DL) models, particularly in low-resource environments. It introduces Deep Shift Neural Networks (DSNNs), which utilize shift operations to reduce computational complexity during inference. The authors leverage AutoML techniques, specifically multi-fidelity (MF) and multi-objective (MO) hyperparameter optimization (HPO), to explore the configuration space of DSNNs, focusing on image classification tasks. The study empirically investigates the trade-offs between accuracy and energy consumption, leading to the identification of optimal DSNN configurations that enhance performance by approximately 20% while reducing emissions by over 60%. The findings reveal that quantizing smaller portions of the network with low precision can yield significant energy savings without compromising performance. The paper contributes to the field of Green AutoML by providing a tailored configuration space for DSNNs and insights into efficient design choices that balance accuracy and energy efficiency.
Methodology
The authors employed a combination of multi-fidelity and multi-objective hyperparameter optimization techniques to explore the configuration space of DSNNs. They extended the SMAC3 HPO package to balance predictive accuracy and energy consumption, utilizing tools like CodeCarbon to assess energy usage and emissions during model training and evaluation.
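Multi-objective HPO returns a set of non-dominated configurations rather than a single winner. The sketch below extracts such a Pareto front for accuracy versus emissions; the configuration names and numbers are invented for illustration and this is not the SMAC3 or CodeCarbon API.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (name, accuracy, emissions)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates.

    A configuration is dominated if another one is at least as accurate and
    emits at most as much, and is strictly better on one of the two.
    """
    front: List[Config] = []
    for name, acc, em in configs:
        dominated = any(
            (a >= acc and e <= em) and (a > acc or e < em)
            for _, a, e in configs
        )
        if not dominated:
            front.append((name, acc, em))
    return front


# Hypothetical evaluated configurations (accuracy, emissions in arbitrary units).
configs = [("a", 0.90, 5.0), ("b", 0.85, 2.0), ("c", 0.80, 3.0), ("d", 0.92, 9.0)]
front = pareto_front(configs)
front_names = [name for name, _, _ in front]
```

Configuration "c" drops out because "b" is both more accurate and lower in emissions; the remaining three represent different trade-offs a practitioner could choose between.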
Results
The study found that optimized DSNN configurations led to a significant improvement in both accuracy (approximately 20% increase) and energy efficiency (over 60% reduction in emissions). The experiments revealed model-specific trade-offs in quantization strategies, indicating that lower precision in smaller network portions can be optimal for energy savings.
Implications
The findings have important implications for developing energy-efficient deep learning applications, particularly in resource-constrained environments such as edge computing and IoT. The insights into DSNN configurations can guide researchers and practitioners in designing models that minimize environmental impact while maintaining high performance.
Efficient Network Inference via Hardware-Aware Architecture Search, Model Pruning & Quantization
Efficient ML
- Investigates efficient network inference for GNSS interference characterization under strict resource constraints.
- Utilizes a combination of iterative structured pruning, static quantization, and hardware-aware zero-shot NAS.
- Demonstrates the trade-offs between predictive performance and deployment efficiency through experimental evaluations.
- Identifies compact model configurations that maintain competitive performance compared to uncompressed baselines.
Summary
This paper addresses the challenge of efficient network inference for embedded global navigation satellite system (GNSS) interference monitoring, which requires fast and memory-efficient processing of large volumes of raw in-phase and quadrature (IQ) samples. The authors propose a framework that combines iterative structured pruning, post-training static quantization, and hardware-aware zero-shot neural architecture search (NAS) to optimize deep neural networks (DNNs) for resource-constrained environments. Starting with MCUNet as a baseline, the study evaluates how model compression and architecture optimization impact model size, computational complexity, and memory usage while preserving performance. Experiments conducted on a GNSS interference dataset demonstrate the effectiveness of the proposed methods, revealing that the combination of compression techniques and hardware-aware design significantly enhances the deployability of ML models on embedded platforms. The findings provide practical insights for developing compact ML solutions for real-time GNSS interference monitoring.
Methodology
The authors employed a deployment-oriented compression pipeline that includes iterative structured pruning to reduce model complexity and post-training static quantization for efficient inference. They also applied hardware-aware zero-shot NAS to optimize network architecture and pruning configurations based on specific inference-related constraints, such as computational cost and memory requirements.
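The weight side of post-training quantization can be sketched with symmetric per-tensor int8 rounding. This is a simplified illustration with hypothetical function names: real static quantization also calibrates activation ranges on sample data, and the paper's exact scheme is not specified in this summary.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is bounded by half the scale, which is the kind of memory-versus-fidelity trade-off the pipeline evaluates.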
Results
The experiments revealed that the proposed methods effectively reduced model size and computational complexity while maintaining high classification performance on the GNSS interference dataset. The combined approach of compression and architecture optimization led to the identification of compact models that are suitable for deployment on embedded platforms like the iMXRT1062 MCU and Raspberry Pi devices.
Implications
The findings suggest that the proposed framework can significantly enhance the feasibility of deploying advanced ML models for real-time GNSS interference monitoring on resource-constrained hardware. This has implications for various applications in positioning, navigation, and telecommunications, where reliable signal processing is critical.
Information-Theoretic Classifier-Free Guidance with Adaptive Schedule Optimization
Generative Models
Optimization
Computer Vision
- Introduces an information-theoretic framework for optimizing Classifier-Free Guidance schedules.
- Derives trajectory-level formulas for estimating consistency and coverage without explicit density estimation.
- Develops an adaptive schedule optimization method that allocates guidance selectively across noise levels.
- Demonstrates improved consistency and coverage in image generation tasks compared to constant guidance methods.
Summary
This paper addresses the challenge of balancing consistency and coverage in conditional generation using diffusion models, specifically through the lens of Classifier-Free Guidance (CFG). While CFG enhances condition consistency by adjusting a guidance weight, it often compromises diversity and distributional coverage. The authors propose an information-theoretic framework for optimizing the CFG schedule, which allows for a more nuanced control over the trade-off between consistency and coverage. They introduce a clean endpoint reference to define the desired trade-off and derive trajectory-level formulas to estimate the objective from samples and score evaluations, circumventing the need for explicit density estimation. The proposed adaptive schedule optimization method learns non-uniform guidance weights across different noise levels, leading to improved performance in generating images on datasets like ImageNet-512 and COCO. The results demonstrate that the learned schedules achieve better consistency and coverage compared to constant guidance approaches, thereby enhancing the overall quality of generated samples.
Methodology
The authors formulate the CFG schedule optimization problem by using a clean endpoint reference to specify the desired consistency-coverage trade-off. They derive formulas to estimate the consistency and coverage terms from generated samples and score evaluations, avoiding explicit density estimation. The optimization leads to an adaptive guidance schedule that varies across different noise levels.
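For orientation, the sketch below shows where a guidance schedule enters a CFG sampler. It assumes the common convention in which weight 1 recovers the conditional prediction and weight 0 the unconditional one. The `interval_schedule` helper is a hypothetical hand-written schedule used only to contrast with constant guidance; it is not the learned schedule from the paper.

```python
import numpy as np

def guided_score(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def constant_schedule(w):
    """Baseline: the same guidance weight at every noise level t."""
    return lambda t: w

def interval_schedule(w_high, t_lo, t_hi, w_base=1.0):
    """Hypothetical non-uniform schedule: strong guidance only for t in [t_lo, t_hi]."""
    return lambda t: w_high if t_lo <= t <= t_hi else w_base

# Usage inside a sampler loop (the model calls are placeholders):
#   w_t = schedule(t)
#   eps = guided_score(eps_uncond, eps_cond, w_t)
```

An adaptive method of the kind described would replace `interval_schedule` with weights optimized per noise level.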
Results
Experiments conducted on ImageNet-512 and COCO datasets show that the learned guidance schedules outperform constant guidance baselines in terms of condition consistency and distributional coverage, leading to higher quality generated samples.
Implications
The proposed method has significant implications for improving generative models in various applications, including image and text-to-image generation, by allowing for better control over the trade-offs between fidelity to conditions and diversity of outputs.
EML Trees Are Universal Approximators
Theory
Interpretability
Optimization
- EML trees can universally approximate functions in Sobolev spaces W^{k,∞}.
- A generalization of the EML function incorporates six learnable parameters per unit.
- The paper provides an explicit construction of EML representations of multivariate polynomials.
- Empirical validation shows accurate approximation capabilities of EML trees on benchmark functions.
Summary
This paper investigates the expressive power of tree-structured compositions of the recently introduced Exp-Minus-Log (EML) function, which serves as a continuous analogue of NAND gates. The authors demonstrate that EML trees possess a universal approximation property for functions in Sobolev spaces W^{k,∞} for k ∈ N. They establish a theoretical framework that allows for the explicit construction of EML trees that can mimic polynomial representations, thereby proving that these trees can approximate a broad class of functions to arbitrary accuracy. The authors also propose a learning algorithm for EML-type trees with fitting parameters and validate its effectiveness through empirical tests on benchmark functions. The findings position EML trees as a robust framework for function approximation, bridging the gap between theoretical understanding and practical application in machine learning.
Methodology
The authors develop a generalization of the EML function that includes six learnable parameters. They prove a universal approximation theorem for EML trees by constructing explicit representations of multivariate polynomials and analyzing their performance in Sobolev spaces. The methodology includes both theoretical proofs and empirical validation through optimization problems.
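To make the tree idea concrete, the sketch below assumes the plain two-input form eml(x, y) = exp(x) - ln(y), which is what the name Exp-Minus-Log suggests. The summary does not spell the definition out, so treat it as an assumption. The sketch does not cover the six-parameter generalization, whose exact form is not given here.

```python
import math

def eml(x, y):
    """Assumed base unit: exp(x) - ln(y), defined for y > 0."""
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    """A depth-1 tree: eml(x, 1) = exp(x) - ln(1) = exp(x)."""
    return eml(x, 1.0)

# The constant e as a tree with constant leaves: eml(1, 1) = e - 0.
E = eml(1.0, 1.0)

def log_via_eml(x):
    """A depth-3 tree that recovers ln(x) for x > 0.

    eml(1, x)        = e - ln(x)
    eml(e - ln x, 1) = exp(e - ln x) = e**e / x
    eml(1, e**e / x) = e - (e - ln x) = ln(x)
    """
    return eml(1.0, eml(eml(1.0, x), 1.0))
```

Small trees like these show how nested EML units can reproduce elementary functions, the kind of building block an explicit polynomial construction would rely on.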
Results
The paper establishes a universal approximation theorem for EML trees, demonstrating that they can approximate functions in Sobolev spaces W^{k,∞} arbitrarily well. The authors also extend their results to closed domains and discuss potential extensions to positive and non-positive domains. Empirical tests confirm the effectiveness of EML trees in approximating benchmark functions with limited depth.
Implications
The findings suggest that EML trees could be utilized in various applications requiring function approximation, particularly in contexts where interpretability and compact analytic representations are desired. This work may influence future research in symbolic and interpretable machine learning, as well as optimization problems.
RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting
Time Series
- RAVEN addresses the limitations of fixed context windows in financial time series forecasting.
- The framework uses a Mixture-of-Experts approach with adaptive context selection.
- RAVEN significantly improves predictive performance on financial datasets, outperforming state-of-the-art models.
- The introduction of a Global Compressed Representation branch enhances temporal coherence.
Summary
The paper introduces RAVEN, a novel Mixture-of-Experts (MoE) framework aimed at improving financial time series forecasting by addressing the limitations of existing models that rely on fixed context windows. Financial time series are non-stationary and exhibit regime-dependent dependencies, so models with static look-back periods often predict poorly. RAVEN adapts to the temporal context of each input sample by constructing a hierarchy of nested windows based on learned importance scores, allowing for dynamic context selection. The framework employs a Cumulative Importance Thresholding (CIT) mechanism to derive these nested windows, which are then routed to specialized experts. Additionally, a Global Compressed Representation (GCR) branch maintains global temporal coherence. The model also incorporates a Correlation-Aware Weighting (CAW) method to align outputs from variable-length experts before aggregation. Through experiments on cumulative log-return prediction and fund sales forecasting, RAVEN demonstrates significant improvements over state-of-the-art methods, achieving higher Pearson correlation and lower mean squared error (MSE) across multiple benchmarks.
Methodology
RAVEN employs a Mixture-of-Experts framework that dynamically determines the temporal context for each input sample using learned importance scores. It constructs nested context windows through a Cumulative Importance Thresholding mechanism and utilizes a Global Compressed Representation branch for coherence. The model also implements Correlation-Aware Weighting to align outputs from variable-length experts.
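One plausible reading of cumulative importance thresholding is sketched below. The softmax normalization, the backward accumulation from the most recent step, and the threshold values are all assumptions made for illustration; the paper's actual CIT rule may differ.

```python
import numpy as np

def nested_windows(scores, thresholds):
    """Return look-back lengths, one per threshold, that are nested by construction.

    scores: importance per time step, oldest first (assumed to come from the model).
    thresholds: cumulative-importance levels in (0, 1], hypothetical values.
    """
    p = np.exp(scores - np.max(scores))
    p /= p.sum()                      # normalize importance to sum to 1
    cum = np.cumsum(p[::-1])          # accumulate from the most recent step backwards
    lengths = []
    for tau in sorted(thresholds):
        idx = int(np.searchsorted(cum, tau, side="left"))  # first index with cum >= tau
        lengths.append(min(idx + 1, len(scores)))
    return lengths
```

Each returned length would define the suffix of the input that one expert sees, so a larger threshold always yields a window containing the smaller ones.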
Results
RAVEN achieved a 9.2% improvement in Pearson correlation on the HS300 dataset and a 20.2% improvement on the S&P500 dataset. It also reduced MSE by 18.2% in fund sales forecasting and outperformed state-of-the-art models in 14 out of 16 metrics across four PEMS traffic benchmarks.
Implications
The adaptive context selection mechanism of RAVEN can be applied to various financial forecasting tasks, potentially enhancing the performance of trading algorithms and risk management systems. Its approach may also inspire new methodologies in other domains where non-stationary data is prevalent.
Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets
NLP
Large Language Models
Efficient ML
- Deterministic top-K eviction methods lead to permanent loss of important tokens due to their myopic nature.
- Nexus Sampling combines Nexus scoring and weighted reservoir sampling to improve token retention.
- The proposed method shows theoretical advantages in long-run token survival compared to deterministic top-K.
- Empirical results indicate that Nexus Sampling performs comparably to dense attention while being more memory efficient.
Summary
This paper addresses the challenge of Key-Value (KV) cache eviction in long-context and agentic LLM workloads that exceed fixed memory budgets. Existing methods rely on deterministic top-K selection, which can lead to the irreversible loss of subtly important tokens. The authors propose Nexus Sampling, a training-free eviction method that combines Nexus scoring (an iterative approach to identifying bridge tokens) with weighted reservoir sampling, which retains tokens according to their inclusion probability. Theoretical analysis shows that Nexus Sampling outperforms deterministic top-K in preserving token importance over time. Empirical results demonstrate that at 80% KV cache eviction, Nexus Sampling achieves performance within 1 point of dense attention on LongBench and outperforms top-K baselines in retrieval-heavy tasks, while requiring up to 10× less per-sequence cache memory.
Methodology
The methodology involves two main components: Nexus scoring, which identifies bridge tokens through an iterative process over direct attention scores, and weighted reservoir sampling, which retains tokens based on their inclusion probability rather than deterministic selection. This approach allows for better retention of tokens that may be temporarily less important but are crucial for maintaining context.
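The sampling half of the method can be illustrated with the classic Efraimidis-Spirakis weighted reservoir scheme, sketched below. Whether the paper uses exactly this variant is an assumption, and the weights would come from Nexus scoring, which is not reproduced here.

```python
import heapq
import random

def weighted_reservoir(stream, budget, rng=None):
    """Keep at most `budget` token ids from a stream of (token_id, weight) pairs.

    Each token draws key = u ** (1 / weight) with u uniform in [0, 1); the tokens
    with the largest keys survive. Higher weight means a higher chance of
    retention, but low-weight tokens are never excluded with certainty.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, token_id); heap[0] is the next eviction candidate
    for token_id, weight in stream:
        if weight <= 0:
            continue  # zero-weight tokens are never retained in this sketch
        key = rng.random() ** (1.0 / weight)
        if len(heap) < budget:
            heapq.heappush(heap, (key, token_id))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, token_id))
    return sorted(token_id for _, token_id in heap)
```

In contrast to deterministic top-K, a token whose score is briefly low still has a nonzero probability of staying in the cache, which matches the motivation given above.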
Results
Nexus Sampling maintains performance within approximately 1 point of dense attention on LongBench while outperforming traditional top-K methods on retrieval-heavy tasks. It achieves this with up to 10× smaller per-sequence cache memory usage compared to dense FlashAttention-2.
Implications
The findings suggest that Nexus Sampling could significantly enhance the efficiency of KV cache management in LLMs, particularly in applications requiring long-context processing, such as multi-turn dialogue systems and persistent reasoning tasks. This method could lead to more effective use of memory resources in real-time applications.
Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation
Computer Vision
NLP
Multimodal
- CCPL introduces a lightweight regularization framework for few-shot CLIP adaptation.
- The method anchors class prompts to frozen concept prototypes, enhancing generalization to unseen classes.
- Concept dropout is utilized to reduce over-reliance on specific concepts during training.
- Empirical results show significant improvements in new-class accuracy on certain datasets.
Summary
This paper introduces Concept-Constrained Prompt Learning (CCPL), a novel framework designed to enhance few-shot adaptation of the CLIP model to downstream tasks. The authors identify that traditional class-only prompt optimization can lead to overfitting on base-class supervision, which negatively impacts the model's ability to generalize to unseen classes. CCPL addresses this issue by anchoring learnable class prompts to frozen concept-level text prototypes derived from a pre-constructed concept bank. This method does not require updating the CLIP encoders, thus maintaining efficiency. The framework employs a text-space cosine consistency objective to align class-prompt embeddings with these frozen prototypes, while concept dropout is introduced to mitigate over-reliance on specific concepts. The results demonstrate that CCPL improves the base-to-new harmonic mean on datasets like DTD and EuroSAT, indicating enhanced transferability to new classes, while showing near-neutral performance on OxfordPets. The findings suggest that concept constraints are most effective when they align with the dataset's semantics, highlighting the importance of fine-grained categories in prompt learning.
Methodology
The CCPL framework involves learning shared context tokens and instantiating class prompts by appending class names. Frozen concept prototypes are constructed from a class-level concept bank and are not updated during training. A cosine consistency objective aligns class-prompt embeddings with these prototypes, while concept dropout is applied to enhance robustness. At inference, class-prompt logits can be fused with concept-prototype logits using a tunable weight.
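The two text-space operations described above can be sketched with NumPy on precomputed embeddings. The `1 - cosine` form of the loss and the convex combination used for fusion are assumptions, as are all names; concept dropout and the CLIP encoders themselves are left out.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_consistency_loss(class_emb, proto_emb):
    """Mean (1 - cosine) between each class-prompt embedding and its frozen prototype."""
    cos = np.sum(l2_normalize(class_emb) * l2_normalize(proto_emb), axis=-1)
    return float(np.mean(1.0 - cos))

def fused_logits(image_emb, class_emb, proto_emb, lam):
    """Inference-time fusion with a tunable weight lam in [0, 1] (assumed convex)."""
    img = l2_normalize(image_emb)
    class_logits = img @ l2_normalize(class_emb).T
    proto_logits = img @ l2_normalize(proto_emb).T
    return (1.0 - lam) * class_logits + lam * proto_logits
```

During training only `class_emb` would receive gradients, since the prototypes and encoders stay frozen. Setting `lam = 0` recovers plain class-prompt classification.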
Results
CCPL achieved improvements in the base-to-new harmonic mean on the DTD dataset (+0.6) and EuroSAT (+2.9) compared to the baseline method CoOp. The EuroSAT improvement was primarily driven by enhanced accuracy on new classes. On the OxfordPets dataset, the performance remained near-neutral (−0.1). Ablation studies confirmed the consistent benefits of text-space concept regularization, while the optimal fusion weight was found to be dataset-dependent.
Implications
The findings suggest that incorporating concept constraints can significantly enhance the adaptability of CLIP models in few-shot learning scenarios, particularly in tasks where semantic alignment with dataset characteristics is crucial. This approach could be beneficial in various applications requiring efficient model adaptation with limited labeled data.
Solve for the Hyperparameter, Skip the Search: Kolmogorov-Optimal Scaling Laws for Spline Regression
Theory
Efficient ML
- KORE provides an analytical solution for optimal resolution in spline regression, bypassing the need for exhaustive hyperparameter search.
- The method achieves comparable accuracy to traditional cross-validation techniques while significantly reducing the number of model fits required.
- The approach is based on classical approximation theory, linking bias and variance to the resolution parameter in a closed-form expression.
- KORE is effective across a wide range of input dimensions and interaction orders, outperforming 21 other methods in terms of accuracy per compute unit.
Summary
This paper presents a novel approach to hyperparameter tuning in spline regression, proposing a method that eliminates the need for exhaustive search. The authors derive a closed-form solution for the optimal resolution of spline models, leveraging classical approximation theory to establish a relationship between bias and resolution. By utilizing the Kolmogorov n-width of the smoothness class, they demonstrate that the optimal resolution can be computed analytically, significantly reducing computational costs. The proposed method, KORE (Kolmogorov-optimal Order-aware Resolution Estimation), requires only a small number of model fits to achieve results comparable to traditional cross-validation methods. The paper details how KORE can be applied across various input dimensions and interaction orders, achieving high accuracy while fitting fewer models. The findings suggest that for targets with low interaction complexity, solving for the resolution is more efficient than searching for it.
Methodology
The authors utilize classical approximation theory to derive a closed-form expression for the bias-variance tradeoff in spline regression. They introduce KORE, which fits two pilot resolutions to calibrate bias and noise scales, allowing for the computation of the optimal resolution without exhaustive search. The method is validated through extensive experiments on real datasets.
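The "solve instead of search" idea can be illustrated with a textbook univariate risk model, R(m) = A * m^(-2s) + C * m, where m is the resolution, s the smoothness, the first term the squared bias and the second the variance (with the 1/n factor absorbed into C). This model, the function name, and the calibration from exactly two pilot risks are simplifying assumptions, not the paper's formula, which also accounts for interaction order.

```python
def solve_resolution(m1, r1, m2, r2, s):
    """Calibrate A and C from two pilot fits, then minimize R(m) in closed form.

    Pilot equations:  r_i = A * m_i**(-2s) + C * m_i   for i = 1, 2
    Stationarity:     dR/dm = -2s * A * m**(-2s - 1) + C = 0
                      =>  m* = (2s * A / C) ** (1 / (2s + 1))
    """
    a1, a2 = m1 ** (-2 * s), m2 ** (-2 * s)
    det = a1 * m2 - a2 * m1
    A = (r1 * m2 - r2 * m1) / det  # Cramer's rule on the 2x2 linear system
    C = (a1 * r2 - a2 * r1) / det
    m_star = (2 * s * A / C) ** (1.0 / (2 * s + 1))
    return A, C, m_star
```

With noisy pilot risks the solved A or C can come out non-positive, in which case this sketch breaks down and a fallback would be needed.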
Results
KORE matches the accuracy of exhaustive 3-fold cross-validation and other traditional model selection criteria (like AIC and BIC) while fitting approximately 8 times fewer models. It ranks first among 21 methods in terms of accuracy delivered per unit of compute across 36 real tabular datasets.
Implications
The findings suggest that KORE can streamline the model selection process in spline regression, making it more efficient and accessible for practitioners. This approach could be extended to other model classes where similar analytical relationships can be established, potentially transforming hyperparameter tuning across various machine learning applications.
Expressivity Saturation: Reduced Affine Region Usage Under Increasing Task Complexity
Theory
Optimization
- The paper introduces a rigorous theorem that bounds the number of affine regions in piecewise-affine MLPs.
- Empirical evidence shows that increasing task complexity leads to reduced usage of affine regions, termed expressivity saturation.
- The reduction in realized affine regions correlates with degraded decision boundaries, impacting classification performance.
- The study connects theoretical insights with empirical observations, highlighting the gap between expressive capacity and utilization.
Summary
This paper investigates the phenomenon of expressivity saturation in piecewise-affine neural networks, particularly focusing on multilayer perceptrons (MLPs) with ReLU activations. The authors highlight a significant gap between the theoretical capacity of these networks to realize affine regions and the actual number of regions utilized during training. They present a rigorous theorem that establishes an upper bound on the number of affine pieces that can be realized along a one-dimensional affine line-segment probe, leading to a neuron-threshold lower bound for representing target functions with specific piece complexity. Empirical analysis reveals that as task complexity increases, the number of realized affine regions decreases, a phenomenon termed expressivity saturation. This reduction in region usage is observed even when the worst-case theoretical capacity of the architecture remains unchanged. The authors also provide geometric and dynamical evidence linking the collapse of realized affine regions to failures in forming effective decision boundaries, indicating that optimization may yield low region usage alongside poor classification performance. Through visualizations of training dynamics, the paper illustrates a consistent refinement process in the affine-region partitions and decision boundaries throughout the optimization process.
Methodology
The authors employ a combination of theoretical analysis and empirical enumeration of affine regions in both one-dimensional and higher-dimensional settings. They analyze the structure of piecewise-affine networks using affine line-segment probes and conduct experiments to observe the effects of increasing task complexity on the number of realized affine regions.
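A line-segment probe of the kind described can be approximated numerically by sampling points along the segment and counting changes in the ReLU activation pattern. The sketch below is an assumption about how such a probe could be implemented, not the authors' enumeration procedure. Dense sampling can miss very short pieces, and two adjacent patterns may occasionally share the same affine map, so the count is an estimate.

```python
import numpy as np

def activation_pattern(x, hidden_layers):
    """Binary on/off pattern of every hidden ReLU unit for each row of x."""
    patterns = []
    h = x
    for W, b in hidden_layers:       # W: (units, in_dim), b: (units,)
        z = h @ W.T + b
        patterns.append(z > 0)
        h = np.maximum(z, 0.0)
    return np.concatenate(patterns, axis=-1)

def count_pieces_on_segment(a, b, hidden_layers, n_samples=10001):
    """Estimate the number of affine pieces along the segment from a to b."""
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    points = (1.0 - t) * a + t * b
    pats = activation_pattern(points, hidden_layers)
    changes = np.any(pats[1:] != pats[:-1], axis=1)  # pattern flips between samples
    return 1 + int(changes.sum())
```

Running this probe on a trained network for tasks of increasing difficulty is one way to compare realized piece counts against the architecture's worst-case bound.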
Results
The study establishes a deterministic upper bound on the number of affine pieces realized by MLPs and demonstrates that increasing task complexity leads to a significant reduction in the number of realized regions. This reduction is associated with failures in forming effective decision boundaries, particularly in challenging tasks.
Implications
The findings suggest that understanding the limitations of neural network expressivity under complex tasks is crucial for designing more effective architectures and training protocols. The concept of expressivity saturation may inform future research on optimizing neural networks for complex input-label relationships.
Causal Variational Deep Embedding: A Family of Interventional Generators for Confounded Images
Generative Models
Computer Vision
Theory
- Introduces CAUVADE, a framework for generating images that account for unobserved confounders.
- Proves that the proposed canonical augmented SCM is dense in the class of all augmented SCMs compatible with a given causal diagram.
- Demonstrates the ability to produce diverse interventional distributions that span the feasible region of causal explanations.
- Shows improved performance in generating unconfounded images compared to traditional generative models.
Summary
This paper addresses the challenge of generating images that are free from confounding biases inherent in training data. Traditional deep generative models often reproduce spurious associations due to unobserved confounders, leading to misleading outputs when users request specific attributes. The authors propose a novel framework called CAUVADE (Causal Variational Deep Embedding), which utilizes a canonical augmented Structural Causal Model (SCM) where the confounder is represented as a discrete latent variable. This approach allows for the exploration of a feasible region of interventional distributions rather than committing to a single causal mechanism. The authors demonstrate that their model is capable of producing diverse interventional samples that better reflect the underlying causal structure of the data. Through experiments on datasets like Color-MNIST, CelebA, and MIMIC-CXR-JPG, CAUVADE shows significant improvements in generating unconfounded images compared to existing models, effectively addressing the issue of confounding in image generation.
Methodology
The authors develop CAUVADE as a Gaussian-mixture Variational Autoencoder (VAE) that incorporates a discrete latent variable representing the confounder. An entropy regularizer is applied to the cluster posterior, allowing the model to explore a family of generators indexed by a parameter γ. This setup enables the model to fit the observational data while spanning the feasible region of interventional distributions.
Results
Experiments reveal that CAUVADE generates diverse interventional samples across various datasets, significantly improving the Fréchet Inception Distance (FID) measured against an unconfounded reference. The model effectively captures the underlying causal relationships in the data, demonstrating its capability to navigate the feasible region of causal explanations.
Implications
The findings suggest that CAUVADE can be applied in scenarios requiring robust image generation free from confounding biases, such as in medical imaging, where accurate representations are crucial. This framework can also enhance the interpretability of generative models by providing a clearer understanding of the causal relationships in the data.
Learning to Place Guards by Reinforcement: A Geo-Free Neural Policy for the Vertex-Guard Art Gallery Problem
Reinforcement Learning
Optimization
Theory
- Introduces a geo-free neural policy for the vertex-guard Art Gallery Problem using reinforcement learning.
- Demonstrates that the policy can achieve competitive guard placements without explicit geometric input during inference.
- Utilizes a probing method to analyze the encoder's representation, revealing its capacity to encode necessary geometric information.
- Shows significant improvements in feasibility when using a classifier on the encoder's embeddings, reducing under-covered polygons.
Summary
This paper investigates the application of neural combinatorial optimization (NCO) to the vertex-guard Art Gallery Problem (AGP), which is an NP-hard problem requiring the selection of polygon vertices to ensure complete visibility of a region. The authors propose a geo-free neural policy trained via reinforcement learning, specifically using a pointer-network architecture. The policy is designed to operate without access to geometric information at inference time, relying solely on vertex coordinates. The study reveals that while the policy can produce competitive guard placements, it struggles with under-covered polygons, particularly when tested on instances outside the training range. To address this, the authors employ a probing technique that freezes the encoder and utilizes a classifier to predict vertex inclusion probabilities based on the learned embeddings. This approach significantly reduces the number of under-covered polygons, demonstrating that the encoder captures sufficient geometric structure for feasibility decisions. The findings suggest that the limitations observed are primarily due to decoder calibration rather than a lack of representational capacity in the encoder. The paper contributes to the understanding of representation and decoder dynamics in reinforcement-trained combinatorial solvers, highlighting the potential for geo-free inference in practical applications.
Methodology
The authors employ a reinforcement learning approach using a pointer-network policy trained with Preference Optimization (PO) on Bradley–Terry preference pairs derived from the policy's rollouts. The policy operates under geo-free inference conditions, relying solely on vertex coordinates without querying a visibility oracle. A classifier is trained on the frozen encoder's embeddings to predict vertex inclusion probabilities, further isolating the representation's effectiveness.
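The Bradley-Terry model behind the preference pairs gives a simple per-pair loss, sketched below. How the two scores are obtained (for example, from the policy's log-probability of each rollout) is an assumption and is not specified in this summary.

```python
import math

def bt_preference_loss(score_preferred, score_rejected):
    """Negative log-likelihood of a Bradley-Terry pair: -log sigmoid(s_w - s_l).

    Computed in a numerically stable way for large positive or negative margins.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the preferred rollout's score pulls ahead of the rejected one, and it equals ln 2 when the two scores are tied.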
Results
The geo-free policy achieves competitive guard counts compared to classical methods but exhibits a tail of under-covered polygons. The probing technique significantly improves feasibility rates, raising the fraction of polygons meeting the 0.95 feasibility threshold from 85% to 99%, even for out-of-distribution instances. This indicates that the encoder captures sufficient geometric structure, while the decoder's calibration contributes to the observed limitations.
Implications
The findings suggest that reinforcement learning can effectively learn combinatorial optimization tasks without explicit geometric knowledge, which has implications for applications in surveillance, sensor placement, and robotics. The methodology also provides a framework for analyzing neural representations in similar optimization problems.
Sesame: Structure-Aware Molecular Generation via Spatial Density-Map Conditioning
Generative Models
- Introduces a novel density map conditioning architecture for structure-aware molecular generation.
- Supports both de novo generation and fragment-conditioned growth through a unified mechanism.
- Implements a hybrid discrete-continuous diffusion process for effective molecular generation.
- Utilizes trajectory finetuning to enhance the quality of generated molecules.
Summary
The paper presents Sesame, a novel diffusion-based molecular generation model designed to enhance drug discovery by integrating structure-aware generation capabilities. Sesame employs a unique spatial pairformer module that conditions on both partial molecular structures and the surrounding protein pocket, represented as continuous spatial density maps. This approach allows for two primary functionalities: de novo molecular generation and fragment-conditioned lead optimization, enabling medicinal chemists to refine existing molecules effectively. The model's architecture supports a unified diffusion process that handles both discrete atom types and continuous atomic coordinates, addressing the challenges of generating chemically sensible structures. Additionally, a trajectory finetuning scheme is introduced to improve generation quality by training on the model's own sampling rollouts. Sesame is trained on extensive ligand-only and protein-ligand datasets, demonstrating its potential in real-world drug design scenarios.
Methodology
Sesame employs a diffusion-based approach that integrates a spatial pairformer module for conditioning on spatial density maps. The model uses a hybrid diffusion process to manage both discrete atom types and continuous coordinates, along with a self-distillation trajectory finetuning scheme to refine its outputs based on its own sampling trajectories.
Results
The model demonstrates a significant capability in generating molecules that honor the provided fragment structures, with 94.8% of generated molecules retaining the seeding fragment as a substructure. The integration of density maps allows for efficient modeling of protein pockets and enhances the quality of generated molecular structures.
Implications
Sesame's approach could revolutionize drug discovery by providing a robust tool for both de novo molecular design and lead optimization, allowing for more efficient exploration of chemical space and integration of human expertise in the drug development process.
A Survey on Federated Causal Discovery and Inference
Federated Learning
Graph Learning
Theory
- The paper provides a systematic review of Federated Causal Discovery and Inference, addressing the lack of comprehensive surveys in the field.
- FCD and FCI are organized along three axes: methodological paradigms, federation topologies, and structural scopes.
- The authors formalize the relationship between FCD and FCI as complementary stages of a unified federated causal reasoning pipeline.
- Key practical dimensions such as data heterogeneity and missing data are examined in the context of federated causal analysis.
Summary
This paper presents a comprehensive survey on Federated Causal Discovery (FCD) and Federated Causal Inference (FCI), two emerging fields that integrate causal reasoning with federated learning (FL) techniques. The authors highlight the importance of causal reasoning in data-driven decision-making and the challenges posed by data privacy regulations that prevent centralized data analysis. The survey organizes FCD and FCI methodologies along three axes: methodological paradigms (constraint-based, score-based, continuous-optimization, and hybrid), federation topologies (horizontal, vertical, and hybrid), and structural scopes (global versus local). The paper also discusses practical considerations such as temporal dynamics, data heterogeneity, and missing data. A key contribution is the formalization of the connection between FCD and FCI as complementary stages in a unified federated causal reasoning pipeline. The authors emphasize shared concerns regarding privacy, communication efficiency, and theoretical guarantees, and conclude by identifying open challenges for future research in this interdisciplinary field.
Methodology
The authors conducted a systematic review of existing literature on FCD and FCI, categorizing methodologies based on design decisions and practical considerations. They developed multi-dimensional taxonomies to organize the findings and formalized the connection between FCD and FCI.
Results
The survey reveals the rapid growth of the field and highlights the distinct challenges faced in integrating causal methods with federated learning, including structural constraints, heterogeneity in causal mechanisms, and issues of identifiability. The authors provide a structured overview of methodologies and identify gaps in current research.
Implications
The findings of this survey have significant implications for researchers and practitioners in fields requiring causal analysis across distributed datasets, such as healthcare, finance, and social sciences. The insights can guide the development of more effective federated causal analysis methods that respect data privacy while enabling robust decision-making.
GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series
Time Series
Graph Learning
Theory
- GRACE improves causal edge discovery in high-dimensional time series using a two-stage framework.
- The method employs Hard Concrete gates with L0 regularization for robust binary decision-making.
- GRACE significantly enhances F1 scores while maintaining high precision compared to existing methods.
- The framework is computationally efficient, achieving results comparable to nonlinear CI tests while running 75x faster.
Summary
The paper introduces GRACE, a novel framework designed to enhance causal edge discovery in high-dimensional time series data. Traditional methods face challenges due to the complexity of nonlinear conditional independence (CI) tests and the need for arbitrary thresholds in score-based approaches. GRACE addresses these issues by employing a two-stage process: first, it generates a candidate skeleton using a high-recall constraint-based method, and then refines this skeleton using a gated neural model with Hard Concrete gates and L0 regularization. This approach allows for a robust binary decision-making process, effectively distinguishing true causal relationships from false positives. The authors demonstrate that GRACE significantly improves the F1 score over existing CI methods while maintaining high precision. Additionally, it outperforms attention-based and score-based alternatives, achieving results comparable to costly nonlinear CI tests at a fraction of the computational expense. The framework is validated through systematic experiments on synthetic benchmarks and a real-world river flow dataset, showcasing its effectiveness in recovering causal edges amidst confounding factors and distributional shifts.
Methodology
GRACE operates in two stages: first, it uses a constraint-based method (like CDNOTS or PCMCI) to generate a candidate skeleton of causal edges. In the second stage, a gated neural model refines this skeleton by assigning independent Hard Concrete gates to each candidate edge, trained with L0 regularization to achieve a bimodal distribution of gate values, facilitating robust binary decisions.
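The gating stage can be illustrated with a minimal NumPy sketch of Hard Concrete gates. This is not the GRACE implementation: the stretch limits, the temperature, and the `log_alpha` values below are assumptions (common defaults from the Hard Concrete literature and made-up gate logits), and no neural model is trained here.

```python
import numpy as np

# Stretch limits and temperature: assumed defaults, not values from the paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_gates(log_alpha, rng):
    """Draw one stochastic gate in [0, 1] per candidate edge (training mode)."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (GAMMA, ZETA), then clip so exact 0 and 1 have non-zero mass.
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def expected_l0(log_alpha):
    """Probability that each gate is non-zero; the sum is the L0 penalty."""
    return sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA))

def deterministic_gates(log_alpha):
    """Noise-free gates used when making the final keep/drop decision."""
    return np.clip(sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

rng = np.random.default_rng(0)
log_alpha = np.array([4.0, -4.0, -1.0])  # hypothetical logits for 3 candidate edges
gates = sample_gates(log_alpha, rng)
penalty = expected_l0(log_alpha).sum()
keep = deterministic_gates(log_alpha) > 0.5  # binary edge decision
```

Because the stretched distribution is clipped, gate values pile up at exactly 0 or 1, which is what makes a bimodal, threshold-insensitive decision plausible.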
Results
Experimental results show that GRACE improves the F1 score over its base CI method while maintaining high precision. It outperforms attention-based and score-based alternatives and matches or exceeds the performance of nonlinear CI tests at a significantly reduced computational cost (75x faster). In a real-world application to the Elbe River dataset, GRACE successfully identified 9 out of 11 causal edges with only 1 false positive, achieving an F1 score of 0.86 and an AUROC of 0.99.
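The reported Elbe River F1 score can be checked from the edge counts given above. The sketch uses only the three numbers in the text; the variable names are illustrative.

```python
true_edges = 11      # causal edges in the ground truth
true_positives = 9   # edges GRACE recovered
false_positives = 1  # spurious edges it reported

precision = true_positives / (true_positives + false_positives)
recall = true_positives / true_edges
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# prints: precision=0.90 recall=0.82 F1=0.86
```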
Implications
The GRACE framework has significant implications for various fields requiring causal inference from time series data, such as climate science, finance, and gene regulation. Its ability to handle high-dimensional data efficiently makes it a valuable tool for researchers and practitioners aiming to uncover causal relationships in complex systems.
Stage-dependent integer-binary encoding in factorization-machine black-box optimization
Optimization
- Introduces a stage-dependent encoding framework for black-box optimization using factorization machines.
- Derives conversion formulas between different integer-binary encoding methods to maintain surrogate objectives.
- Demonstrates that one-hot encoding consistently outperforms other encoding methods during the learning stage.
- Shows that the effectiveness of domain-wall encoding for solution search varies based on problem conditions.
Summary
This paper introduces a novel framework for black-box optimization (BBO) using factorization machines, specifically focusing on the stage-dependent integer-binary encoding in the optimization process. The authors highlight the limitations of conventional approaches that utilize a single integer-binary encoding throughout the optimization stages. They propose a stage-dependent FMQA (factorization machine with quadratic-optimization annealing) framework that allows for different encodings during the surrogate learning and solution search phases. The study derives conversion formulas between one-hot and domain-wall QUBO matrices, ensuring that the surrogate objective is preserved across feasible integer states. The performance of the proposed OhDw variant, which combines one-hot encoding for learning and domain-wall encoding for search, is evaluated on the Rastrigin function across various input dimensions and discretization levels. Results indicate that one-hot encoding significantly enhances optimization performance during the learning stage, while the benefits of switching to domain-wall encoding for search are condition-dependent. The findings suggest that the encoding choice is crucial for optimization success, particularly under finer discretization conditions.
Methodology
The authors propose a stage-dependent FMQA framework that employs different integer-binary encodings for the surrogate learning and solution search stages. They derive conversion formulas between one-hot and domain-wall QUBO matrices and evaluate the OhDw variant on the Rastrigin function using various input dimensions and discretization levels.
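To make the two encodings concrete, here is a small sketch of how an integer level can be written in one-hot and domain-wall form, and how feasible states map between them. It does not reproduce the paper's QUBO conversion formulas; the bit convention (ones before the wall) and the function names are assumptions.

```python
import numpy as np

def one_hot(k, q):
    """q bits, exactly one of which (index k) is set."""
    x = np.zeros(q, dtype=int)
    x[k] = 1
    return x

def domain_wall(k, q):
    """q - 1 bits: the first k are 1, the rest 0 (assumed convention)."""
    d = np.zeros(q - 1, dtype=int)
    d[:k] = 1
    return d

def domain_wall_to_one_hot(d):
    """x_k = d_{k-1} - d_k, with fixed boundary bits d_{-1} = 1 and d_{q-1} = 0."""
    padded = np.concatenate(([1], d, [0]))
    return padded[:-1] - padded[1:]

q = 5
for k in range(q):
    assert np.array_equal(domain_wall_to_one_hot(domain_wall(k, q)), one_hot(k, q))
```

Since the map between feasible states is linear in the bits, a quadratic objective in one encoding stays quadratic in the other, which is consistent with the summary's statement that the surrogate objective can be preserved across feasible integer states.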
Results
The experiments reveal that one-hot encoding leads to lower residual errors compared to domain-wall and binary encoding during the learning stage. For specific conditions (N = 5, q = 301), the OhDw variant achieves better optimization results than the one-hot-only FMQA, while for other conditions, the performance varies. Overall, the encoding used in the learning stage is identified as the primary factor influencing optimization performance.
Implications
The findings suggest that adopting a stage-dependent encoding strategy can enhance the efficiency of black-box optimization methods, particularly in complex combinatorial problems. This approach could be beneficial in various applications, including materials design, engineering, and machine learning.
Noise is Signal: Density-Based Outliers as Leading Indicators of Occupational Emergence in Labor Market Text
NLP
- Introduces the Emergence-Density Inversion (EDI) hypothesis, suggesting that noise in job postings indicates novelty.
- Demonstrates that high-EOS outlier groups transition to stable clusters faster than low-EOS groups.
- Implements an extended Emerging Occupation Score (EOS) that improves prediction accuracy for cluster formation.
- Validates the EOS metric through retrospective studies of emerging roles, indicating its predictive capabilities.
Summary
This paper challenges the conventional approach in natural language processing (NLP) that discards job postings classified as noise by density-based clustering methods. The author introduces the Emergence-Density Inversion (EDI) hypothesis, positing that low posting density in rapidly evolving job markets signals novelty rather than incoherence. The hypothesis is tested on a longitudinal dataset of 84,988 job postings over eight quarters (Q4 2022–Q3 2024). Results indicate that high-EOS (Emerging Occupation Score) outlier groups transition to stable clusters significantly faster than low-EOS groups. The study also extends the EOS metric by incorporating Temporal Velocity and Cross-Platform Convergence, improving the prediction of cluster formation from an F1 score of 0.61 to 0.74, outperforming several baseline methods. A retrospective analysis confirms that EOS can predict the emergence of new roles 2–3 quarters in advance, with a held-out annotator panel validating the coherence of identified emerging occupations. The findings suggest that the noise class in job postings can serve as an early warning system for occupational changes, highlighting the need for updated occupational classification systems to accommodate emerging roles.
Methodology
The study employs a longitudinal analysis of job postings collected from various recruitment platforms, applying density-based clustering methods (HDBSCAN) to identify noise and emerging roles. The Emerging Occupation Score (EOS) is extended with additional metrics, and the performance is compared against baseline methods such as Isolation Forest and LOF. A failure analysis is conducted to characterize non-emerging groups.
Results
The EDI hypothesis is partially confirmed, showing that high-EOS groups transition to stable clusters in an average of 1.4 quarters, compared to 4.1 quarters for low-EOS groups (p < 0.001). The extended EOS metric achieves an F1 score of 0.74 for predicting cluster formation, significantly outperforming baseline methods. Retrospective validation indicates that EOS can signal emerging occupations 2–3 quarters in advance with 77% precision.
Implications
The findings suggest that labor market analytics can benefit from recognizing noise in job postings as a predictive signal for emerging occupations. This has implications for workforce policy, education funding, and immigration programs, necessitating updates to occupational classification systems to include new roles driven by technological advancements.
Formalizing Task-Space Complexity for Zero-Shot Generalization
Reinforcement Learning
Robotics
Theory
- Introduces signed divergence as a performance-based measure of task dissimilarity.
- Defines task-space complexity in terms of ε-tolerance sets and provides geometric certificates.
- Develops a greedy selection strategy for source contexts with an H(n) approximation guarantee.
- Demonstrates the effectiveness of the proposed methods on both linear and nonlinear control systems.
Summary
This paper addresses the challenge of zero-shot generalization in contextual dynamical systems, where policies must adapt to diverse conditions without retraining. The authors introduce a novel performance-centric measure of task dissimilarity called signed divergence, which upper-bounds the generalization gap when transferring policies from one context to another. This leads to a formal definition of task-space complexity, defined as the minimum number of source contexts required to ensure that every target context incurs at most an ε generalization gap. The paper establishes ε-tolerance sets that certify when a source policy class can generalize effectively. Under a mild local smoothness assumption, the authors derive geometric certificates and volume bounds for task-space complexity. They propose a greedy selection strategy for source contexts that achieves efficient coverage of the context space, validated through experiments on a Mass-Spring-Damper system and a nonlinear CartPole system. The results demonstrate that this greedy approach outperforms uniform and random selection methods in achieving the same ε-coverage with fewer policies, providing a practical framework for building generalizable control systems with simple policies.
Methodology
The authors develop a performance-based measure of task dissimilarity, signed divergence, which is used to define task-space complexity. They formulate the problem of source selection as a set cover problem and propose a greedy algorithm to select source contexts. The methodology includes deriving ε-tolerance sets and using geometric bounds to certify generalization capabilities.
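The set-cover view of source selection can be sketched with the standard greedy algorithm. The `coverage` dictionary below is a hypothetical stand-in for ε-tolerance sets (which target contexts each source policy serves within tolerance); computing those sets from signed divergence is not shown.

```python
def greedy_cover(coverage):
    """Pick source contexts until every coverable target context is covered.

    coverage maps a source context to the set of target contexts it covers.
    """
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        # Take the source that covers the most still-uncovered targets.
        best = max(coverage, key=lambda s: len(coverage[s] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

# Hypothetical tolerance sets over six target contexts.
coverage = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}
chosen = greedy_cover(coverage)  # ["a", "c"]
```

Greedy set cover is known to use at most H(n) times the optimal number of sets, where H(n) is the n-th harmonic number, which matches the guarantee quoted in the highlights.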
Results
The experiments show that the greedy selection strategy achieves the same ε-coverage with fewer policies than uniform or random baselines. The results support the theoretical framework and demonstrate its applicability on the Mass-Spring-Damper and nonlinear CartPole systems.
Implications
This work has significant implications for the design of control systems that need to operate across varying conditions without extensive retraining. It provides a theoretical foundation for developing more efficient and generalizable reinforcement learning policies, potentially enhancing the performance of robotic systems and autonomous vehicles.
One-Step Flow Matching for Generative Modeling of Path-Dependent Physical Fields
Generative Models
- Introduction of a transformer-based flow matching model for generating path-dependent stress fields.
- Direct generation of stress fields across all time steps, improving efficiency and reducing computational costs.
- Non-Gaussian source distribution reduces complexity in training, allowing for one-step generation of samples.
- Demonstrated significant speedup over traditional finite element methods, making it feasible for complex simulations.
Summary
This paper addresses the challenges of simulating path-dependent physical fields, particularly in the context of plastic stress fields, which are computationally intensive when using traditional methods like finite element analysis (FEM). The authors propose a novel flow matching (FM) model based on a transformer architecture, which operates within the latent space of a variational autoencoder (VAE). This model formulates the generation of plastic stress fields as a video synthesis task, enabling the direct generation of stress fields across all time steps. A key innovation is the introduction of a non-Gaussian source distribution for flow matching, which reduces the complexity of conditional transport paths during training. The model also incorporates token-level loading embeddings and auxiliary networks to enhance performance. The results indicate that the proposed model can generate high-resolution path-dependent fields efficiently, achieving a computational speedup of 6 to 7 times over traditional FEM on CPUs and approximately two orders of magnitude on consumer-grade GPUs, even with limited training data.
Methodology
The authors developed a flow matching model utilizing a transformer backbone, operating in the latent space of a VAE. The model treats the simulation of plastic fields as a video synthesis task, employing a non-Gaussian source distribution to streamline training and enhance sample generation efficiency.
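A minimal sketch of the flow matching objective and one-step sampling follows, assuming the common linear interpolation path. The paper's transformer, VAE latent space, and non-Gaussian source distribution are not modeled; the `oracle` velocity function and the toy data are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 and its target velocity."""
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

def fm_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity."""
    xt, target = fm_training_pair(x0, x1, t)
    return float(np.mean((velocity_fn(xt, t) - target) ** 2))

def one_step_sample(velocity_fn, x0):
    """Single Euler step from t = 0 to t = 1."""
    t0 = np.zeros((x0.shape[0], 1))
    return x0 + velocity_fn(x0, t0)

rng = np.random.default_rng(0)
shift = np.array([1.0, -2.0, 0.5])
x0 = rng.normal(size=(4, 3))   # source samples
x1 = x0 + shift                # toy targets: a constant displacement
t = rng.uniform(size=(4, 1))

# Toy "model" that already knows the constant velocity field.
oracle = lambda xt, t: np.broadcast_to(shift, xt.shape)

loss = fm_loss(oracle, x0, x1, t)
generated = one_step_sample(oracle, x0)
```

In this toy case the transport paths are perfectly straight, so one Euler step lands on the target. This is the intuition for why a source distribution that simplifies the conditional paths can make one-step generation viable.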
Results
The proposed model generates high-resolution path-dependent stress fields with a speedup of 6 to 7 times over FEM on CPUs and roughly 100 times on consumer-grade GPUs, even with a limited training dataset.
Implications
This work has significant implications for engineering fields requiring fast and reliable simulations of complex materials and structures, particularly in mechanical, aerospace, and civil engineering applications. The efficiency of the proposed model could facilitate real-time simulations and enhance the design and analysis processes.
PG-MAP: Joint MAP Optimization for Inference-Time Alignment of Diffusion and Flow-Matching Models
Generative Models
Optimization
Computer Vision
- PG-MAP enables joint optimization of conditioning and latent states during inference, improving generative model performance.
- The framework is training-free and adapts to both diffusion and flow-matching models, demonstrating versatility.
- Empirical results show significant improvements in alignment metrics and human preference evaluations.
- The methodology includes a schedule-adaptive trust region for optimizing variables at each denoising step.
Summary
The paper introduces PG-MAP, a novel framework designed for inference-time alignment of pretrained text-to-image models, specifically addressing the limitations of existing methods that operate along a single control axis. Traditional approaches often fail to model the joint dependencies between conditioning and latent variables, which can hinder the performance of generative models. PG-MAP formulates the alignment as a trajectory-level Gibbs-MAP optimization problem, allowing for coordinated updates across modalities. This framework is compatible with both diffusion and flow-matching models, adapting to the specifics of each transport type. The authors demonstrate that PG-MAP significantly improves alignment metrics, such as PickScore and Aesthetic, across various diffusion backbones, and achieves high performance on flow-matching models. The framework's effectiveness is further validated through human evaluations, which show a consistent preference for PG-MAP over strong baselines. Additionally, an oracle-routing analysis reveals that the importance of conditioning and latent optimization varies with different prompt types, suggesting potential for further optimization.
Methodology
PG-MAP employs a training-free approach that reformulates inference-time alignment as a proximal MAP optimization problem. It utilizes a forward-consistency coupling to coordinate updates between conditioning and latent states at each denoising step. The framework adapts to the specific requirements of diffusion and flow-matching models, allowing for a unified objective that encompasses both transport types. The optimization process is dynamic, with schedule-adaptive trust regions and a step-dependent active set that selects which variables to refine at each step.
Results
PG-MAP consistently outperforms existing methods across various diffusion models, achieving notable improvements in alignment metrics such as PickScore and Aesthetic. Specifically, it reaches 91.9% PickScore and 75.7% HPS win rates on flow-matching models, with human evaluations indicating a 60-67% preference over strong baselines. The framework's ability to jointly optimize conditioning and latent variables leads to enhanced visual quality in generated images.
Implications
The PG-MAP framework has significant implications for the development of more effective text-to-image generation models. By enabling joint optimization of conditioning and latent variables, it can enhance the quality and coherence of generated images, making it a valuable tool for applications in creative industries, content generation, and interactive media.
The Two-Hump Problem: Bridging the Difficulty Gap in Mathematical Reinforcement Learning
Reinforcement Learning
Theory
Optimization
- Identification of the 'Two-Hump' difficulty distribution in the Andrews-Curtis (AC) problem.
- Introduction of substitution supermoves to enhance the action space for RL agents.
- Development of the Dual-Ring Transformer architecture to effectively handle large action spaces.
- Creation of two large benchmark datasets, AC-19 and AC-1M, for training and evaluation.
Summary
This paper addresses the challenges of applying Reinforcement Learning (RL) to mathematical search problems, specifically focusing on the Andrews-Curtis (AC) conjecture. The authors identify a critical 'Two-Hump' difficulty distribution in the AC landscape, where problem instances are either trivially solvable or effectively impossible, with a lack of intermediate 'hard-but-solvable' instances. To bridge this difficulty gap, the authors propose novel data generation techniques and algorithmic enhancements, including the introduction of supermoves and a specialized Transformer-based architecture. They present two new benchmark datasets, AC-19 and AC-1M, which provide a comprehensive resource for training RL agents. The results demonstrate significant performance improvements over previous baselines, enabling the trivialization of over 100 previously unsolved presentations and making progress toward resolving the AC conjecture. This work highlights the potential of RL in discovering mathematical structures in challenging search spaces.
Methodology
The authors employed a combination of targeted data generation techniques and algorithmic innovations. They introduced substitution supermoves to create a more effective action space and developed the Dual-Ring Transformer architecture to process the problem's structure efficiently. They also utilized exhaustive enumeration and automorphisms to generate high-quality training data.
Results
The proposed methods led to substantial performance improvements, allowing the RL agent to find shorter solution paths than previous approaches. The authors successfully trivialized over 100 new presentations from a benchmark dataset and reduced the number of unsolved examples significantly, providing a clearer path for future research.
Implications
This work has implications for advancing the application of RL in mathematical reasoning and problem-solving. The new datasets and methodologies can facilitate further research in RL for complex mathematical conjectures and potentially lead to breakthroughs in understanding and resolving longstanding mathematical problems.
EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games
Reinforcement Learning
- EMAgnet introduces an adaptive regularization target using an exponential moving average of policy parameters.
- The method outperforms traditional uniform regularization in terms of exploitability in various game environments.
- EMAgnet effectively discards dominated strategies while maintaining coverage over strategically relevant options.
- The approach is applicable to deep reinforcement learning, extending previous tabular methods to more complex settings.
Summary
The paper introduces EMAgnet, a novel regularization technique for policy gradient methods in self-play scenarios, particularly in two-player zero-sum imperfect-information games. Traditional methods like Proximal Policy Optimization (PPO) utilize a uniform distribution for regularization, which can lead to inefficiencies by treating all actions equally, regardless of their strategic relevance. EMAgnet addresses this limitation by employing an exponential moving average (EMA) of the last-iterate policy's parameters as a dynamic regularization target. This adaptive approach allows the regularization to evolve alongside the agent's strategy, focusing on viable actions while discarding dominated strategies. The authors evaluate EMAgnet against standard benchmarks and modified environments with exploration challenges, demonstrating that it consistently achieves lower exploitability and improved performance compared to uniform regularization methods. The findings suggest that EMAgnet can enhance the efficacy of self-play training in complex games, making it a significant advancement in the field of reinforcement learning.
Methodology
The authors extend the concept of a moving magnet from tabular settings to deep reinforcement learning by implementing EMA regularization in the PPO framework. They replace the uniform regularization target with an EMA of the policy's parameters, allowing the regularization to adapt as the policy improves. The method involves updating the EMA after each PPO iteration, ensuring that the regularization focuses on viable strategies while gradually forgetting dominated ones.
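The moving-magnet idea can be sketched on a toy policy whose parameters are just action logits. This is an illustration under assumptions, not the paper's method: the decay value, the KL direction, and the use of logits in place of network weights are all choices made here for brevity.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def update_magnet(magnet_params, policy_params, decay=0.9):
    """Exponential moving average of policy parameters (assumed decay)."""
    return decay * magnet_params + (1.0 - decay) * policy_params

# Toy policy over three actions; the third is strongly disfavoured.
policy_params = np.array([2.0, 0.0, -3.0])
magnet_params = np.zeros(3)  # starts as the uniform magnet

kl_before = kl(softmax(policy_params), softmax(magnet_params))
for _ in range(200):  # one magnet update per (hypothetical) PPO iteration
    magnet_params = update_magnet(magnet_params, policy_params)
kl_after = kl(softmax(policy_params), softmax(magnet_params))
```

With a uniform magnet the penalty keeps pulling probability back onto the disfavoured action. As the EMA tracks the policy, that pull fades, which is one way to read "gradually forgetting dominated strategies".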
Results
EMAgnet was evaluated against PPO with uniform-magnet regularization across standard two-player zero-sum benchmarks and modified environments with many dominated strategies. The results showed that EMAgnet achieved lower exploitability in most tested environments and demonstrated consistent performance gains, particularly in games with complex strategy spaces.
Implications
The introduction of EMAgnet has the potential to significantly enhance the training of AI agents in complex games, making it a valuable tool for developing more effective self-play strategies. This could lead to advancements in AI applications in various domains, including competitive gaming, strategic decision-making, and other areas requiring robust reinforcement learning techniques.
Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning
Large Language Models
Reinforcement Learning
Optimization
- Introduction of the Holistic Data Scheduler (HDS) for LLM pre-training.
- HDS utilizes a multi-objective reward function to optimize data mixing.
- Achieved 44% fewer training iterations on The Pile dataset compared to existing methods.
- Demonstrated a 7.2% improvement in 0-shot accuracy on the MMLU benchmark.
Summary
This paper introduces the Holistic Data Scheduler (HDS), a novel framework for online data mixing in the pre-training of Large Language Models (LLMs). Recognizing the limitations of existing methods that optimize data mixing from a singular perspective, HDS formulates the data scheduling challenge as a reinforcement learning problem, utilizing the Soft Actor-Critic (SAC) algorithm for its stability and sample efficiency. The core innovation of HDS is its multi-objective reward function, which integrates three critical perspectives: data quality, inter-domain influence, and model performance. Through systematic experiments on various LLM sizes using The Pile dataset, HDS demonstrated significant improvements in training efficiency and model performance, achieving a 44% reduction in training iterations compared to the next best method and a 7.2% increase in 0-shot accuracy on the MMLU benchmark. This work highlights the importance of a holistic approach to data composition in LLM pre-training, paving the way for more efficient and capable models.
Methodology
The HDS framework formulates the data mixing challenge as a reinforcement learning task modeled as a Markov Decision Process (MDP). It employs the Soft Actor-Critic (SAC) algorithm to explore the high-dimensional policy space effectively. The state representation captures the model's performance, learning velocity, and stability, while the agent's actions correspond to the sampling probabilities for different data domains during training.
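As a rough sketch of the action and reward interface described above: the summary does not say how actions are mapped to sampling probabilities or how the three reward perspectives are combined, so the softmax mapping, the weighted sum, the domain names, and all numbers below are assumptions. The SAC agent itself is omitted.

```python
import numpy as np

def action_to_mixture(action):
    """Map an unconstrained action vector to domain sampling probabilities."""
    e = np.exp(action - action.max())
    return e / e.sum()

def scalar_reward(quality, influence, performance, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine the three reward perspectives into one scalar (assumed weights)."""
    return float(np.dot(weights, [quality, influence, performance]))

rng = np.random.default_rng(0)
domains = ["web", "code", "papers"]   # hypothetical data domains
action = np.array([0.5, 1.5, -1.0])   # hypothetical agent output
mixture = action_to_mixture(action)

# Draw the source domain for each of the next 8 training batches.
batch_domains = rng.choice(domains, size=8, p=mixture)
reward = scalar_reward(quality=0.2, influence=0.1, performance=0.4)
```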
Results
HDS reached its final validation perplexity on The Pile dataset with 44% fewer training iterations than the next best method. It also improved 0-shot accuracy on the MMLU benchmark by 7.2%, with consistent gains across other benchmarks, demonstrating enhanced training efficiency and model capability.
Implications
The findings suggest that a holistic approach to data scheduling can significantly improve the efficiency and effectiveness of LLM pre-training, potentially reducing costs and environmental impact associated with large-scale model training. This framework could be applied to various LLMs and adapted for different datasets, enhancing their performance across diverse tasks.
NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction
Multimodal
Audio & Speech
Generative Models
- NeuroSonic introduces a conditional flow-matching framework for EEG-to-speech reconstruction.
- The method learns a deterministic velocity field for transporting corrupted acoustic states to clean speech.
- Utilizes a time-conditioned gated Transformer for joint processing of EEG and audio signals.
- Demonstrates significant improvements over existing methods, especially in challenging artifact-heavy conditions.
Summary
The paper presents NeuroSonic, a novel framework for reconstructing continuous speech from EEG signals, addressing the inherent challenges posed by the weak and variable nature of EEG measurements compared to the structured nature of speech. Traditional methods like GANs and diffusion models struggle with the stochasticity and variability of EEG data, leading to instability in waveform generation. NeuroSonic introduces a conditional flow-matching approach that learns a deterministic probability-flow velocity field to transport noise-corrupted acoustic states towards clean speech based on EEG conditioning. This method utilizes a time-conditioned gated Transformer to process EEG and audio signals embedded in a shared latent space, allowing for effective modeling of acoustic trajectory evolution. The framework was evaluated on the CineBrain and EAV benchmarks, demonstrating significant improvements in distributional realism, spectral fidelity, and perceptual quality, particularly in artifact-heavy segments. The results indicate that deterministic conditional transport is a robust solution for EEG-driven speech reconstruction.
Methodology
NeuroSonic employs a conditional flow-matching approach, where EEG and audio signals are tokenized and embedded into a shared latent space. A time-conditioned gated Transformer processes these embeddings to parameterize a transport ordinary differential equation, allowing for the direct modeling of acoustic trajectory evolution without iterative sampling. The framework focuses on learning a deterministic velocity field that guides the transformation of noise-corrupted audio states into clean speech based on EEG signals.
Results
NeuroSonic outperformed GAN, diffusion, and mean-flow baselines across the CineBrain and EAV datasets, achieving up to a 26.3% improvement in overall perceptual quality. The method showed particularly strong performance in segments with high artifact presence, indicating its robustness against conditioning variability.
Implications
The findings suggest that NeuroSonic could enhance applications in brain-computer interfaces, particularly for individuals with speech impairments, by providing a more stable and effective means of translating neural activity into coherent speech. This could lead to advancements in assistive technologies and communication aids.
When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs
NLP
Large Language Models
Generative Models
- Top-1 argmax concentration is ineffective as a stability warning in DLM fine-tuning.
- No training collapses occurred in any of the 816 configurations, even though the top-1 warning fired in all of them.
- Max gradient norm is proposed as a more reliable indicator of training stability, achieving higher precision and F1 scores.
- Calibration of monitoring thresholds should be done per DLM family rather than using a universal constant.
Summary
This paper investigates the effectiveness of using top-1 argmax concentration as a stability warning in the context of fine-tuning discrete diffusion language models (DLMs) with low-rank adaptation (LoRA). The authors conducted extensive experiments across 816 configurations from three DLM families, finding that while the top-1 warning consistently triggered for all configurations, it failed to predict any actual training collapses, resulting in zero precision. This failure is attributed to pre-equilibrium saturation, where top-1 concentration is already high before optimization begins. To address this issue, the authors propose using the maximum LoRA gradient norm as a more reliable signal for monitoring training stability. Their evaluation shows that this method can effectively identify stable configurations with a precision of 0.68 and an F1 score of 0.79, outperforming the top-1 baseline. The paper concludes with a workflow for practitioners, recommending the abandonment of the top-1 warning in favor of the max-gradient approach, which should be calibrated per DLM family for optimal results.
Methodology
The authors conducted experiments across 816 configurations from three DLM families, analyzing the effectiveness of top-1 concentration as a stability warning. They also evaluated the max LoRA gradient norm as a parameter-side signal to differentiate between stable and unstable training runs. Statistical tests, including Mann-Whitney U tests, were employed to assess the significance of their findings.
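A max-gradient-norm monitor with per-family calibration might look like the sketch below. The quantile rule, the family names, and all numbers are hypothetical; the paper's actual thresholds and calibration procedure are not given in this summary.

```python
import numpy as np

def max_grad_norm(step_grads):
    """Largest L2 norm of the LoRA gradient observed across training steps."""
    return max(float(np.linalg.norm(g)) for g in step_grads)

def calibrate(stable_scores_by_family, quantile=0.95):
    """One threshold per model family, from runs known to be stable."""
    return {
        family: float(np.quantile(scores, quantile))
        for family, scores in stable_scores_by_family.items()
    }

def is_flagged(score, family, thresholds):
    """Flag a run whose score exceeds its own family's threshold."""
    return score > thresholds[family]

# Hypothetical max-gradient-norm scores from stable calibration runs.
calibration = {
    "family_a": [0.8, 1.0, 1.2, 0.9],
    "family_b": [4.0, 5.5, 5.0, 4.5],
}
thresholds = calibrate(calibration)

# Hypothetical per-step LoRA gradients for one new run.
run = [np.array([0.3, 0.4]), np.array([1.2, 1.6]), np.array([0.6, 0.8])]
score = max_grad_norm(run)

flag_a = is_flagged(score, "family_a", thresholds)
flag_b = is_flagged(score, "family_b", thresholds)
```

The same score is flagged under one family's threshold and not the other's, which illustrates why a single universal constant can misfire.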
Results
The study found that the top-1 warning had zero precision, as it indicated potential collapses in all configurations without any actual occurrences. In contrast, the max gradient norm method demonstrated a precision of 0.68 and an F1 score of 0.79 in identifying stable configurations on a held-out dataset, significantly outperforming the top-1 baseline.
Implications
The findings suggest that practitioners should reconsider the reliance on top-1 concentration as a diagnostic tool in DLM fine-tuning. By adopting the max gradient norm approach, they can improve the reliability of their training monitoring processes, leading to better outcomes in model performance and stability.
SOAP-Bubbles: Structured Weight Uncertainty for Neural Networks
Optimization
Large Language Models
Theory
- SOAP-Bubbles provide a method for transforming diagonal covariances into non-diagonal ones using SOAP's preconditioner.
- The Eigenspace-VON (EVON) optimizer allows for efficient optimization of structured posteriors without extensive changes to training pipelines.
- EVON shows improved performance over existing methods like IVON in both training speed and final loss for language models.
- The approach captures richer representations of weight uncertainty, beneficial for various applications in deep learning.
Summary
The paper introduces SOAP-Bubbles, a novel approach to structured weight uncertainty in neural networks, addressing the challenges of cost and implementation complexity associated with existing methods. The authors adapt the SOAP optimizer to run the IVON variational method in the eigenspace of SOAP's preconditioner, allowing for the transformation of diagonal covariance estimates into non-diagonal ones. This results in a new optimizer, Eigenspace-VON (EVON), which maintains computational efficiency similar to SOAP while enabling more expressive posterior distributions. The authors demonstrate that EVON can recover the exact Gaussian covariance for logistic regression and significantly outperforms existing diagonal-covariance methods in language model pretraining. The findings suggest that SOAP-Bubbles can enhance the estimation of weight uncertainty in deep learning, making it more practical for real-world applications.
Methodology
The authors propose a method that leverages the SOAP optimizer's preconditioner to transform diagonal covariance estimates into non-diagonal structured posteriors, termed SOAP-Bubbles. They introduce the Eigenspace-VON (EVON) algorithm, which operates in the eigenspace of SOAP's preconditioner and utilizes the IVON variational method to achieve this transformation efficiently.
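The step from a diagonal covariance to a non-diagonal one can be illustrated with a small sketch. It assumes only the generic linear-algebra fact that a diagonal covariance expressed in an orthogonal basis Q becomes Q diag(s) Q^T in the original coordinates. The 2x2 rotation standing in for the preconditioner eigenbasis and the function names are hypothetical, and this is not the authors' EVON implementation.

```python
import math

def rotation(theta):
    """2x2 orthogonal matrix used as a stand-in eigenbasis (hypothetical)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def covariance_from_eigenspace(Q, diag_var):
    """Return Sigma = Q diag(diag_var) Q^T as a nested list."""
    n = len(Q)
    return [[sum(Q[i][k] * diag_var[k] * Q[j][k] for k in range(n))
             for j in range(n)]
            for i in range(n)]

# Variances 4 and 1 are diagonal in the rotated basis ...
Q = rotation(math.pi / 4)
sigma = covariance_from_eigenspace(Q, [4.0, 1.0])
# ... but the covariance in the original coordinates has off-diagonal terms.
print(sigma)
```

With a 45-degree rotation the result is approximately [[2.5, 1.5], [1.5, 2.5]]: the weights are correlated in parameter space even though only two variances were stored.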
Results
The experiments show that EVON allows SOAP-Bubbles to scale effectively to language model training, yielding better validation loss compared to IVON under the same computational budget. Additionally, ensembling models sampled from SOAP-Bubbles outperforms ensembling from IVON's diagonal posterior, indicating improved training dynamics and uncertainty representation.
Implications
The proposed method can enhance the practical application of weight uncertainty estimation in deep learning, potentially benefiting areas such as reinforcement learning, model merging, and language generation. It simplifies the implementation of structured variational methods, making them more accessible for large-scale models.
Topological Out-of-Domain Generalization in Dynamical Systems Reconstruction
Theory
Time Series
- Identified three core shortcomings in existing DSR models that limit out-of-domain generalization.
- Proposed feature splitting as a key remedy to improve model performance across different dynamical regimes.
- Derived a closed-form bound on the reliable extrapolation range for predictions.
- Demonstrated significant improvements in zero-shot prediction capabilities through empirical validation.
Summary
This paper addresses the challenge of predicting the behavior of dynamical systems (DS) beyond the regimes observed during training, a critical issue in scientific machine learning. The authors analyze the limitations of existing dynamical systems reconstruction (DSR) models, particularly focusing on hierarchical and hyper-network approaches. They identify three main structural discrepancies that hinder out-of-domain (OOD) generalization: dense dependence of model Jacobians on latent features, mismatched geometrical properties between models and real systems, and nonlinear parameter dependencies introduced by time discretization. To overcome these issues, the authors propose a combination of remedies, including feature splitting, and derive a closed-form bound on the reliable extrapolation range. Empirical results demonstrate that these techniques enable accurate zero-shot predictions into new dynamical regimes, effectively addressing the limitations of previous models and enhancing OOD generalization capabilities in DSR.
Methodology
The authors conducted a mathematical analysis of hierarchical DSR frameworks, extending their focus from piecewise linear RNNs to more general discrete and continuous-time models like Neural ODEs. They identified structural discrepancies affecting OOD generalization and proposed feature splitting as a solution. The methodology involved deriving bounds on extrapolation reliability and validating the proposed techniques through empirical experiments.
Results
The proposed methods significantly improved zero-shot out-of-domain generalization across tipping points in dynamical systems. The empirical results confirmed that the adjustments made to the model structure allowed for accurate predictions in previously unseen dynamical regimes, demonstrating the effectiveness of the feature splitting approach and the theoretical bounds established.
Implications
The findings have significant implications for scientific modeling and machine learning applications, particularly in fields requiring accurate predictions of complex dynamical systems under novel conditions. The improved OOD generalization capabilities can enhance the reliability of models used in climate science, neuroscience, and other domains where understanding system behavior under varying parameters is crucial.
Rapid FinFET Modelling Using an Autoencoder
Efficient ML
- Utilizes an autoencoder for efficient FinFET modeling.
- Incorporates drain-to-source voltage (VDS) as an input feature.
- Achieves high accuracy in reconstructing I-V curves with minimal training data.
- Extracts critical device metrics directly from the model.
Summary
This paper presents a novel machine learning framework utilizing an autoencoder (AE) for efficient modeling of FinFET devices. The authors first calibrated a BSIM-CMG model to generate a dataset of current-voltage (ID-VG) characteristics, which served as the training data for the autoencoder. The AE compresses full I-V curves into a low-dimensional latent space, effectively encoding essential device physics. A significant innovation in this work is the incorporation of the drain-to-source voltage (VDS) as an input feature, which enhances the model's ability to capture bias-dependent variations. The trained autoencoder successfully reconstructs full I-V curves and extracts critical device metrics such as threshold voltage (VTH), subthreshold slope (SS), and peak transconductance (gm). The results demonstrate that data-driven compact models, derived from actual characterization data, can achieve high accuracy with minimal training data, providing a powerful tool for rapid device characterization, modeling, and circuit-level simulation.
Methodology
The methodology involves calibrating a BSIM-CMG model to generate ID-VG data, which is then preprocessed for training the autoencoder. The data is normalized and structured to facilitate effective learning. The autoencoder architecture features a symmetric encoder-decoder structure that compresses the I-V curves into a latent representation, allowing for the reconstruction of full curves and extraction of device parameters.
Results
The trained autoencoder successfully reconstructs full I-V curves and accurately extracts key metrics such as threshold voltage (VTH), subthreshold slope (SS), and peak transconductance (gm). The model demonstrates high accuracy with a limited amount of training data, showcasing its effectiveness in rapid device modeling.
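For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them from a sampled ID-VG curve using finite differences. It is an illustration under stated assumptions, not the paper's extraction procedure: the constant-current definition of VTH, the criterion current, the function names, and the synthetic exponential curve are all assumptions made here.

```python
import math

def subthreshold_slope_mv_per_dec(vg, i_d):
    """Smallest finite-difference slope dVG / dlog10(ID), in mV/decade."""
    slopes = []
    for k in range(len(vg) - 1):
        dlog = math.log10(i_d[k + 1]) - math.log10(i_d[k])
        if dlog > 0:
            slopes.append(1000.0 * (vg[k + 1] - vg[k]) / dlog)
    return min(slopes)

def peak_transconductance(vg, i_d):
    """Largest finite-difference slope dID / dVG, in A/V."""
    return max((i_d[k + 1] - i_d[k]) / (vg[k + 1] - vg[k])
               for k in range(len(vg) - 1))

def vth_constant_current(vg, i_d, i_crit):
    """VG where ID first crosses i_crit, interpolated linearly in log10(ID)."""
    for k in range(len(vg) - 1):
        if i_d[k] < i_crit <= i_d[k + 1]:
            lo, hi = math.log10(i_d[k]), math.log10(i_d[k + 1])
            frac = (math.log10(i_crit) - lo) / (hi - lo)
            return vg[k] + frac * (vg[k + 1] - vg[k])
    return None

# Synthetic, purely exponential curve with a 70 mV/decade slope (not real device data).
vg = [0.01 * k for k in range(41)]                 # 0.00 V to 0.40 V
i_d = [1e-12 * 10 ** (v / 0.07) for v in vg]

ss = subthreshold_slope_mv_per_dec(vg, i_d)
gm = peak_transconductance(vg, i_d)
vth = vth_constant_current(vg, i_d, i_crit=1e-9)   # hypothetical criterion current
print(ss, gm, vth)
```

On this synthetic curve the slope comes out at about 70 mV/decade and the constant-current threshold at about 0.21 V, which follows directly from how the curve was constructed.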
Implications
The findings suggest that this autoencoder-based approach can significantly reduce the time and expertise required for FinFET modeling and characterization, making it a valuable tool for circuit designers and researchers in semiconductor technology.
An LLM-based Two-Stage Transformer Framework for Cross-Domain Bearing Fault Diagnosis with Limited Data
Time Series
- Introduces a knowledge-guided framework for simultaneous dataset and condition shifts in bearing fault diagnosis.
- Establishes explicit knowledge transfer mechanisms that outperform traditional implicit alignment methods.
- Develops a dynamic classification head for seamless adaptation across heterogeneous fault taxonomies.
- Demonstrates superior performance with limited labeled data, achieving 92.61% accuracy.
Summary
This paper addresses the challenges of bearing fault diagnosis in industrial environments, particularly when faced with dataset heterogeneity, variations in operating conditions, and limited labeled data. Existing methods typically tackle these issues in isolation, which limits their effectiveness. The authors propose a novel knowledge-guided two-stage transfer learning framework that utilizes a lightweight GPT-2-style Transformer architecture with causal self-attention for hierarchical feature extraction from vibration signals. This framework establishes explicit pathways for transferring knowledge from multi-source pre-training to target adaptation, effectively addressing the dual-shift challenge through multi-source learning, prototype-based knowledge modulation, and taxonomy-adaptive classification. Experimental results on four real-world datasets demonstrate that the proposed framework achieves an average accuracy of 92.61% with only 10% labeled target data, significantly outperforming state-of-the-art methods by 17.24 percentage points. This work provides a practical approach for cost-effective predictive maintenance in Industry 4.0 applications.
Methodology
The proposed framework consists of two stages: the first focuses on multi-source learning to create generalizable representations from diverse datasets, while the second employs prototype-based knowledge modulation for adapting to the target domain. The architecture leverages a lightweight Transformer model for feature extraction and incorporates explicit knowledge transfer mechanisms at both parameter and feature levels.
Results
The framework was validated on four real-world datasets, achieving an average accuracy of 92.61% with only 10% of the labeled target data. This performance surpasses existing state-of-the-art methods by 17.24 percentage points, demonstrating the effectiveness of the proposed approach in handling dual-shift challenges.
Implications
The findings suggest that the proposed framework can significantly enhance predictive maintenance strategies in industrial settings, particularly in scenarios where labeled data is scarce. This could lead to reduced downtime and maintenance costs, facilitating the adoption of advanced diagnostic techniques in Industry 4.0.
Systematic Exploration of 4-Expert Heterogeneous Mixture-of-Experts via Automated Pipeline Search
Computer Vision
Efficient ML
Optimization
- Introduces an automated pipeline for exploring heterogeneous MoE4 architectures.
- Identifies a coverage bias in the search space, anchored to a single architecture family.
- Proposes a stratified random sampling method to mitigate coverage bias.
- Finds ShuffleNet and MobileNetV3 as high-yield families for ensemble accuracy.
Summary
This paper introduces an automated search pipeline for heterogeneous 4-Expert Mixture-of-Experts (MoE4) architectures, leveraging the LEMUR neural network dataset. The authors replace manual model design with a deterministic code-assembly generator that systematically combines various base architecture families into MoE4 ensembles. The pipeline incorporates a convolutional gating network, temperature scaling, mixup augmentation, and cosine-annealed learning rate scheduling. Over a 28-day evaluation period on an NVIDIA RTX 4090, the pipeline generated 4,463 candidate models, of which 1,021 were successfully evaluated. A significant finding was that the explored search space was anchored to a single family, AirNet, and covered only 4.8% of the theoretical combinations. The authors identified the cause of this bias and proposed a stratified random sampling method to address it. Within the evaluated models, ShuffleNet and MobileNetV3 were found to produce the highest accuracy ensembles, while FractalNet and MNASNet were deemed less effective. The best-performing model achieved a Top-1 accuracy of 68.0% on CIFAR-10, demonstrating the effectiveness of automated assembly in generating competitive models without manual tuning.
Methodology
The authors developed a deterministic code-assembly generator that programmatically assembles combinations of four base model families from the LEMUR database into valid MoE4 models. A multi-stage validation pipeline was implemented to ensure model integrity before GPU evaluation. The automated campaign ran for 28 days, generating and evaluating models with persistent state management and fault tolerance.
Results
The automated pipeline generated 4,463 candidate models, with 1,021 successfully evaluated. The analysis revealed a significant coverage bias: the explored models were anchored to the AirNet family and covered only 4.8% of the possible combinations. The highest accuracy ensembles were produced by ShuffleNet and MobileNetV3, while FractalNet and MNASNet were less effective. The best model achieved a Top-1 accuracy of 68.0% on CIFAR-10.
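One way to picture stratified random sampling over family combinations is sketched below. The summary does not describe the authors' exact scheme, so the details are assumptions: only the five families named in the text are used, experts may repeat a family, and each family defines one stratum from which the same number of combinations is drawn.

```python
import itertools
import random

# Only the families named in the summary; the real pool is likely larger.
FAMILIES = ["AirNet", "ShuffleNet", "MobileNetV3", "FractalNet", "MNASNet"]

def stratified_sample(families, per_family, seed=0):
    """Draw `per_family` distinct 4-expert combinations from each family's stratum."""
    rng = random.Random(seed)
    # Assumption: order does not matter and a family may fill several expert slots.
    combos = list(itertools.combinations_with_replacement(families, 4))
    picked = []
    for fam in families:
        # Stratum = combinations containing this family that are not yet picked.
        stratum = [c for c in combos if fam in c and c not in picked]
        picked.extend(rng.sample(stratum, per_family))
    return picked

sample = stratified_sample(FAMILIES, per_family=3)
for combo in sample:
    print(combo)
```

Because every family contributes its own stratum, no single family can dominate the sample the way an enumeration order anchored to one family would.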
Implications
The findings suggest that automated architecture search can effectively identify high-performing model combinations, while also highlighting the importance of addressing biases in search methodologies. This work may influence future research in neural architecture search and the development of more robust automated systems for model generation.
A Verifiable Search Is Not a Learnable Chain-of-Thought
Theory
Large Language Models
Reinforcement Learning
- The assumption that all solvable tasks can be learned as a chain-of-thought is challenged.
- A significant gap exists between the accuracy of verifiable solvers and the performance of fine-tuned models, especially for cryptarithm tasks.
- The concept of 'verdict-as-token' highlights the limitations of model outputs in decision-making despite high arithmetic accuracy.
- Forward-derivable tasks are learnable, while those requiring backtracking search are not, unless the search is precomputed.
Summary
This paper challenges the assumption that any task solvable by a short program can be effectively taught to a model as a chain-of-thought (CoT). The author investigates this through a controlled study involving nine reasoning tasks generated by deterministic processes. The study reveals that while some tasks can be learned effectively, others, particularly those requiring backtracking search, cannot be distilled into a forward-derivable CoT. The author reverse-engineers these tasks into Python solvers, achieving high accuracy on most but struggling with cryptarithm tasks. The findings indicate that the model can perform arithmetic correctly but fails to carry out the search process as a coherent left-to-right derivation. The paper introduces the concept of 'verdict-as-token,' where the model's outputs are correct in arithmetic but lack fidelity in decision-making. The author concludes that the learnability of a task is contingent upon its ability to provide a faithful forward chain-of-thought, and suggests that removing the search from the trace can make such tasks learnable through memorization and verification rather than direct search.
Methodology
The author reverse-engineers nine reasoning tasks into Python solvers, evaluates their performance, and attempts to distill their procedures into a low-rank adaptation (LoRA) over a large language model. The study employs supervised fine-tuning (SFT), reinforcement learning from verifiable rewards (RLVR), and self-training (STaR) to assess the learnability of the tasks.
Results
The results show that while five of the nine tasks achieved over 98% accuracy with their solvers, the cryptarithm task remained resistant to learning, with model performance plateauing at 0.01-0.07 across various training methods. The introduction of a controlled intervention demonstrated that revealing the cipher key significantly improved performance, indicating that the search process itself was the binding constraint.
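To make the distinction between a verifiable solver and a forward derivation concrete, here is a minimal backtracking solver for a small cryptarithm. The puzzle (TO + GO = OUT) and the function names are illustrative choices made here, not taken from the paper. The solver tries a digit, recurses, and undoes the choice on failure, which is exactly the kind of trace that does not read as a left-to-right derivation.

```python
def solve_cryptarithm(addends, total):
    """Return a letter-to-digit mapping satisfying sum(addends) == total, or None."""
    letters = sorted(set("".join(addends) + total))
    leading = {word[0] for word in addends + [total]}  # leading letters cannot be 0

    def value(word, assign):
        n = 0
        for ch in word:
            n = 10 * n + assign[ch]
        return n

    def backtrack(idx, assign, used):
        if idx == len(letters):
            # Verification step: cheap to check once all letters are assigned.
            ok = sum(value(w, assign) for w in addends) == value(total, assign)
            return dict(assign) if ok else None
        ch = letters[idx]
        for digit in range(10):
            if digit in used or (digit == 0 and ch in leading):
                continue
            assign[ch] = digit
            used.add(digit)
            found = backtrack(idx + 1, assign, used)
            if found is not None:
                return found
            # Dead end: undo the choice and try the next digit.
            used.discard(digit)
            del assign[ch]
        return None

    return backtrack(0, {}, set())

solution = solve_cryptarithm(["TO", "GO"], "OUT")
print(solution)  # {'G': 8, 'O': 1, 'T': 2, 'U': 0}, i.e. 21 + 81 = 102
```

Checking a proposed assignment takes one addition, while finding it requires exploring and abandoning many partial assignments. Supplying the key in advance removes the search and leaves only the check.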
Implications
The findings suggest that for certain complex reasoning tasks, traditional methods of teaching models through chain-of-thought may not be effective. Instead, precomputing and memorizing task structures could be a more viable approach. This has implications for the design of AI systems that need to tackle similar reasoning challenges.
Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study
Computer Vision
Large Language Models
Multimodal
- Physiology-aware CNNs outperform zero-shot LLMs in ECG image classification.
- LeadGroupECG model effectively captures anatomical relationships among ECG leads.
- Zero-shot multimodal LLMs show limited diagnostic discrimination capabilities.
- CNN models achieved high ROC-AUC scores, indicating strong classification performance.
Summary
This study investigates the effectiveness of zero-shot multimodal large language models (LLMs) and physiology-aware convolutional neural networks (CNNs) in classifying 12-lead ECG images as normal or abnormal. The authors highlight the unique challenges of ECG image interpretation, which relies on precise waveform morphology and lead relationships, distinguishing it from general image analysis. The research compares three prominent LLMs (GPT-5.2, GPT-4.1, and Gemini-2.5 Pro) under fixed zero-shot conditions against a newly developed physiology-aware CNN model, LeadGroupECG, which aggregates features from anatomical lead groups. The models were evaluated on both an internal test set and the external PTB-XL dataset. Results indicate that CNN-based models achieved stable discrimination with ROC-AUC scores ranging from 0.92 to 0.94 internally and 0.85 to 0.86 externally. In contrast, the zero-shot LLMs performed poorly, with ROC-AUC scores around 0.5, suggesting that while LLMs can generate narratives, their diagnostic capabilities are limited without task-specific training. The study concludes that domain-specific architectures are essential for reliable ECG interpretation.
Methodology
The study employed a comparative framework involving data preparation of ECG images, training of a physiology-aware CNN model (LeadGroupECG) alongside traditional CNN baselines (ResNet18, DenseNet121, VGG16), and evaluation of zero-shot multimodal LLMs. All models were tested on a large-scale ECG dataset, with performance assessed using ROC-AUC metrics.
Results
CNN-based models demonstrated robust performance with internal ROC-AUC scores of 0.92-0.94 and external scores of 0.85-0.86. The LeadGroupECG model significantly improved internal performance while maintaining external generalization. In contrast, zero-shot LLMs exhibited near-chance performance with ROC-AUC scores around 0.5.
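The meaning of an ROC-AUC near 0.5 can be seen from the metric's rank-based definition: it is the probability that a randomly chosen abnormal example receives a higher score than a randomly chosen normal one, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
print(perfect, uninformative)  # 1.0 0.5
```

A model whose scores are unrelated to the label lands at 0.5 regardless of how fluent its accompanying narrative is, which is why the metric separates the two model classes so sharply.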
Implications
The findings suggest that while multimodal LLMs can assist in generating ECG narratives, they are not yet reliable for diagnostic purposes without specific training. This highlights the importance of developing specialized models for ECG interpretation in clinical settings.
Weight-Space Geometry of Offline Reasoning Training
Reinforcement Learning
Large Language Models
Theory
- SFT, RFT, and RIFT produce nearly collinear weight updates with similar performance metrics.
- DFT diverges significantly in weight direction compared to reward-weighted methods.
- Offline GRPO introduces a substantial orthogonal component while remaining in the SFT loss basin.
- DPO achieves the highest accuracy on GSM8K and AIME26, despite using a smaller learning rate.
Summary
This paper investigates the weight-space geometry of various offline reinforcement learning (RL) losses used for reasoning distillation, specifically focusing on six methods: SFT, RFT, DFT, RIFT, Offline GRPO, and DPO. The authors analyze the weight updates produced by these methods when trained on identical data from a single base model (Qwen3-4B) using attention-only LoRA. The study employs several analytical techniques, including cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA, to assess the mechanistic differences between the methods. Key findings reveal that SFT, RFT, and RIFT yield nearly collinear weight updates and similar accuracy on the GSM8K benchmark, while DFT diverges significantly in weight direction. Offline GRPO introduces an orthogonal component to the SFT direction, and DPO operates in a near-orthogonal subspace, achieving the highest accuracy on both GSM8K and AIME26 benchmarks despite using a smaller learning rate. The results suggest that different loss formulations lead to distinct weight updates, which has implications for both practical implementation and theoretical understanding of offline RL methods.
Methodology
The authors conducted a controlled experiment using six offline reasoning loss methods, training them on identical math rollouts from a single base model. They analyzed the resulting weight updates using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA to understand the geometric relationships between the weight updates of different methods.
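Two of the listed diagnostics are simple enough to sketch on flattened weight vectors. The code below is a toy illustration, not the authors' analysis code: the update vectors, the double-well loss, and the barrier definition (maximum loss on the straight path minus the mean of the endpoint losses) are assumptions chosen for clarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened weight updates."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def interpolate(w_a, w_b, alpha):
    """Point on the straight line between two weight vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]

def loss_barrier(loss_fn, w_a, w_b, steps=11):
    """Max loss along the linear path minus the mean endpoint loss (assumed definition)."""
    path = [loss_fn(interpolate(w_a, w_b, k / (steps - 1))) for k in range(steps)]
    return max(path) - 0.5 * (path[0] + path[-1])

# Toy updates: the first two are nearly collinear, the third is orthogonal to the first.
delta_a = [1.0, 2.0, 0.0]
delta_b = [2.0, 4.0, 0.1]
delta_c = [-2.0, 1.0, 0.0]
near_collinear = cosine_similarity(delta_a, delta_b)
orthogonal = cosine_similarity(delta_a, delta_c)

# Toy double-well loss with minima at w = -1 and w = +1 and a bump in between.
double_well = lambda w: (w[0] ** 2 - 1.0) ** 2
barrier = loss_barrier(double_well, [-1.0], [1.0])
print(near_collinear, orthogonal, barrier)
```

A cosine near 1 means two methods move the weights in almost the same direction, while a positive barrier means the two solutions sit in different basins along the straight path between them.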
Results
The analysis showed that SFT, RFT, and RIFT had a cosine similarity of ≥0.97 in their weight updates, with comparable accuracy on GSM8K (87-88%). DFT produced more distinctive updates, while Offline GRPO added a significant orthogonal component. DPO, operating in a near-orthogonal subspace, achieved the highest accuracy (93.5% on GSM8K) and demonstrated a mode-connectivity barrier.
Implications
The findings suggest that the choice of loss function in offline reasoning training can lead to fundamentally different weight updates, which may affect model performance and interpretability. This has implications for practitioners in selecting appropriate loss functions and for researchers in understanding the underlying mechanisms of offline RL.
MGI: Member vs Generated Inference
Generative Models
Computer Vision
- Introduction of the Member vs Generated Inference (MGI) task.
- Existing membership inference methods are inadequate for distinguishing between training members and generated outputs.
- The proposed Data Circuit Breaker (DCB) method effectively addresses the MGI challenge.
- DCB combines signals from an autoencoder and latent generator to improve classification accuracy.
Summary
The paper introduces the Member vs Generated Inference (MGI) task, which aims to determine whether a given sample is a true training member of a generative model or an output generated by that same model. As generative models produce increasingly realistic samples, distinguishing between training data and generated outputs becomes challenging. The authors highlight that existing membership inference methods often misclassify generated samples as training members, while attribution-based methods may incorrectly label true members as generated. This misclassification arises from the reliance on likelihood-related signals that are similarly high for both training examples and generated outputs. To address these challenges, the authors propose a novel method called Data Circuit Breaker (DCB), which employs a three-stage approach: (1) an autoencoder-based filtering step to identify generated samples, (2) a membership inference step on non-generated samples using the latent generator, and (3) a cross-generator attribution step to compare log-probabilities across multiple model versions. The DCB method effectively distinguishes training members from generated samples across various generative models, even in cases where models reproduce near-duplicates of training data. The paper demonstrates that DCB outperforms existing methods and generalizes well to derivative model settings, where new models are trained on generated data.
Methodology
The authors propose the Data Circuit Breaker (DCB) method, which consists of three stages: (1) filtering generated samples using an autoencoder, (2) performing membership inference on non-generated samples with the latent generator, and (3) conducting cross-generator attribution to compare log-probabilities across different model versions.
Results
The DCB method consistently outperforms existing membership inference and attribution methods across multiple generative models, effectively distinguishing between training members and generated samples, even in challenging scenarios involving model memorization and derivative model settings.
Implications
The findings have significant implications for the security and reliability of generative models, particularly in applications where distinguishing between real and generated data is crucial, such as in content creation, data privacy, and model training practices.
Learning the Koopman Operator using Attention Free Transformers
Time Series
Theory
Optimization
- Introduction of an attention-free latent memory block for improved prediction accuracy.
- Dynamic re-encoding mechanism to correct latent drift and maintain model robustness.
- Demonstrated effectiveness across multiple benchmark systems with significant error reduction.
- Lower inference latency compared to traditional models while maintaining accuracy.
Summary
This paper addresses the challenges of learning Koopman operators with autoencoders, particularly the long-horizon prediction errors that arise when predictions drift off the learned manifold. The authors propose two components to make Koopman predictors more robust: an Attention Free Transformer (AFT) latent memory block and a dynamic re-encoding mechanism. The AFT block aggregates a short window of past latent states to correct the current latent before each Koopman update, achieving linear time complexity and requiring significantly fewer parameters than traditional multi-head attention. The dynamic re-encoding mechanism employs lightweight online change-point detection methods to identify latent drift and project predictions back onto the autoencoder manifold, thereby preventing catastrophic drift. The proposed model is evaluated on three benchmark systems (the Duffing oscillator, Repressilator, and IRMA), demonstrating a consistent reduction in error accumulation compared to standard Koopman autoencoders and multi-head attention models. The results indicate that the combination of AFT and dynamic re-encoding improves long-horizon prediction accuracy while maintaining lower inference latency, making the model a fast and compact predictor that remains on the learned manifold over extended prediction horizons.
Methodology
The authors augment a standard Koopman autoencoder with an attention-free latent memory block that captures local temporal context and a dynamic re-encoding mechanism that uses streaming change detection to correct latent drift. The model is evaluated on three benchmark systems, measuring mean squared error (MSE) and long-horizon mean cumulative absolute error (MCAE).
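The summary does not name the change detector, so the sketch below uses a one-sided CUSUM statistic as a stand-in for "streaming change detection". It consumes a stream of per-step drift scores (for example, a mismatch between a latent state and its re-encoded reconstruction) and returns the steps at which a re-encoding would be triggered. The score values, the allowance, and the threshold are hypothetical.

```python
def cusum_triggers(scores, drift_allowance, threshold):
    """Indices where the one-sided CUSUM statistic exceeds `threshold`.

    The statistic accumulates the excess of each score over `drift_allowance`,
    is floored at zero, and is reset after every trigger.
    """
    stat = 0.0
    triggers = []
    for t, score in enumerate(scores):
        stat = max(0.0, stat + score - drift_allowance)
        if stat > threshold:
            triggers.append(t)  # here a predictor would re-encode its latent state
            stat = 0.0
    return triggers

# Hypothetical drift scores: quiet, then a sustained rise, then quiet again.
scores = [0.1, 0.1, 0.1, 0.6, 0.7, 0.8, 0.1, 0.1]
triggers = cusum_triggers(scores, drift_allowance=0.2, threshold=0.8)
print(triggers)  # [4]
```

Accumulating the excess rather than thresholding each score separately lets the detector react to sustained drift while ignoring isolated spikes, at constant cost per step.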
Results
The proposed model consistently outperformed standard Koopman autoencoders and matched-capacity multi-head attention models across the three benchmark systems, showing marked reductions in long-horizon error. The AFT block was particularly effective in stabilizing predictions in systems with switching and feedback dynamics, while dynamic re-encoding further enhanced robustness.
Implications
The findings suggest that the proposed methods can significantly improve the performance of Koopman operators in predicting nonlinear dynamics, with potential applications in various fields such as control systems, biology, and fluid mechanics. The techniques may also be beneficial for other machine learning tasks that require long-horizon predictions.
Temporal-Spectral Alignment with Frequency Adaptation for Source-Free Time-Series Adaptation
Time Series
- Introduces a novel SFDA framework for time-series data that adapts at the signal level rather than the feature level.
- Proposes a lightweight Frequency Adaptation Layer (FAL) for spectral alignment, enhancing adaptation efficiency.
- Demonstrates superior performance on benchmark datasets, achieving state-of-the-art results in macro F1-score.
- Addresses both temporal and spectral shifts in time-series data, which are often overlooked in existing methods.
Summary
This paper addresses the challenge of source-free domain adaptation (SFDA) for time-series data, which involves transferring knowledge from a pre-trained source model to an unlabeled target domain without access to source data. The authors propose a novel approach called Temporal-Spectral Alignment with Frequency Adaptation (SAFA), which focuses on both temporal dependencies and spectral characteristics of time-series data. The method introduces a Frequency Adaptation Layer (FAL) that modulates the phase and amplitude of target signals in the frequency domain to align them with the source distribution. This approach is distinct from traditional methods that typically adapt in the feature space, which may overlook critical spectral shifts. The authors demonstrate the effectiveness of SAFA through extensive experiments on multiple benchmark datasets, showing that it achieves state-of-the-art performance in macro F1-score while maintaining robustness and generalization across diverse time-series applications.
Methodology
The proposed SAFA framework employs a Frequency Adaptation Layer (FAL) that recalibrates the spectral response of target time-series signals by adjusting their phase and amplitude in the frequency domain. During the adaptation phase, the entire source model remains frozen, and the adaptation process is driven solely by the FAL, which ensures that the target data aligns closely with the source distribution. This method contrasts with traditional approaches that typically involve fine-tuning the feature extractor or classifier.
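A signal-level adaptation of this kind can be sketched as: transform to the frequency domain, rescale the amplitude and shift the phase of each frequency bin, and transform back. The code below is a rough illustration with fixed per-bin parameters and a plain DFT. In the paper the parameters are learned and the layer sits in front of a frozen source model, neither of which is reproduced here; the function and parameter names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def frequency_adapt(signal, gains, phase_shifts):
    """Scale amplitude and shift phase per frequency bin, then return to the time domain.

    For a real-valued result the parameters should be conjugate-symmetric
    (gains[k] == gains[n-k], phase_shifts[k] == -phase_shifts[n-k]);
    any residual imaginary part is discarded.
    """
    spectrum = dft(signal)
    adapted = [g * cmath.exp(1j * p) * c
               for g, p, c in zip(gains, phase_shifts, spectrum)]
    return [v.real for v in idft(adapted)]

signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
n = len(signal)
identity = frequency_adapt(signal, [1.0] * n, [0.0] * n)  # leaves the signal unchanged
doubled = frequency_adapt(signal, [2.0] * n, [0.0] * n)   # uniform gain scales the signal
print(identity, doubled)
```

Starting from unit gains and zero phase shifts gives the identity map, a natural initialization since the adapted input then begins exactly at the unmodified target signal.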
Results
The experiments conducted on three heterogeneous time-series benchmarks (WISDM, MFD, and Boiler) demonstrate that SAFA consistently outperforms existing SFDA methods, achieving state-of-the-art performance in macro F1-score. The results indicate that the proposed method effectively addresses the challenges posed by spectral shifts and temporal dependencies in time-series data.
Implications
The findings suggest that the proposed SAFA framework can be effectively applied in various real-world scenarios where labeled source data is unavailable, such as in healthcare monitoring, financial forecasting, and environmental sensing. By addressing both temporal and spectral shifts, the method enhances the reliability of time-series classification in diverse applications.
When Is an LLM Worth It for Hyperparameter Optimization? A Budget-Matched Study on Tabular Data Finds the Warm-Start Is a Default Configuration, Not the Model
Optimization
Large Language Models
- The warm-start advantage in LLM-HPO is attributed to a fixed default configuration rather than the model's suggestions.
- LLM proposals add minimal improvement to cross-validation accuracy and none to held-out test accuracy.
- Classical search methods seeded with a sensible default outperform LLM advisors within a few evaluations.
- The study emphasizes the importance of proper baseline comparisons in evaluating LLM performance in HPO.
Summary
This paper investigates the effectiveness of large language models (LLMs) as advisors for hyperparameter optimization (HPO) in tabular data settings. The authors conduct a budget-matched, multi-seed study across eight PMLB tabular benchmarks, comparing an LLM advisor (LLM-OptFlow) against four classical optimization methods: random search, Optuna-TPE, Gaussian-process Bayesian optimization, and successive halving. The study reveals that the initial strong performance of the LLM advisor is primarily due to a fixed default configuration rather than the model's output. The LLM's own proposals contribute only marginally to cross-validation accuracy and do not improve held-out test performance. When classical search methods are seeded with the same default configuration, they quickly match and eventually surpass the LLM advisor's performance. The findings suggest that practitioners should prioritize classical search methods seeded with sensible defaults over LLM advisors for tabular HPO tasks.
Methodology
The authors employed a budget-matched, multi-seed experimental design to compare the performance of an LLM advisor against classical HPO methods across eight tabular benchmarks. They utilized paired tests and bootstrap 95% confidence intervals to analyze the results, ensuring a rigorous evaluation of the LLM's contributions versus classical methods.
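A percentile bootstrap for a paired comparison can be sketched as follows. This is a generic illustration, not the authors' analysis script: the accuracy values are made up for the example, and the resample count and seed are arbitrary choices.

```python
import random

def bootstrap_ci_mean_diff(a, b, n_resamples=2000, seed=0):
    """Mean paired difference a - b with a percentile bootstrap 95% interval."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]  # pairing: same benchmark and seed
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(diffs) for _ in diffs]  # sample pairs with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / len(diffs), lo, hi

# Made-up accuracies for two methods on eight matched runs (illustration only).
method_a = [0.90, 0.88, 0.91, 0.87, 0.89, 0.92, 0.86, 0.90]
method_b = [0.89, 0.87, 0.90, 0.85, 0.88, 0.90, 0.85, 0.89]
mean_diff, lo, hi = bootstrap_ci_mean_diff(method_a, method_b)
print(mean_diff, lo, hi)
```

Resampling the paired differences, rather than each method's scores independently, keeps the benchmark-to-benchmark variation out of the interval, which is what makes small but consistent gaps detectable.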
Results
The study found that the LLM advisor's initial strong performance was due to a fixed default configuration, achieving 88.7% mean best cross-validation accuracy. The LLM's own proposals only improved accuracy by 0.40 percentage points on cross-validation and had no significant impact on held-out test accuracy. When classical search methods were seeded with the same default, they matched the LLM advisor's performance within a few evaluations and surpassed it thereafter.
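Seeding a classical search with a default can be sketched as follows. This is a minimal illustration, not the paper's code: `random_search`, `toy_objective`, `sample_config`, `DEFAULT`, and the search space are all hypothetical, and the objective is a synthetic stand-in for cross-validation accuracy. The default consumes one evaluation of the budget, which keeps the comparison budget-matched.

```python
import math
import random

def random_search(objective, sample, budget, seed=0, default=None):
    """Random search; if `default` is given it is evaluated first (warm start)."""
    rng = random.Random(seed)
    candidates = [default] if default is not None else []
    while len(candidates) < budget:
        candidates.append(sample(rng))
    best_cfg, best_score, trace = None, float("-inf"), []
    for cfg in candidates:
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
        trace.append(best_score)  # best-so-far after each evaluation
    return best_cfg, best_score, trace

def toy_objective(cfg):
    # Synthetic stand-in for CV accuracy (an assumption for illustration).
    return 1.0 - 0.05 * (math.log10(cfg["lr"]) + 1.5) ** 2 - 0.01 * (cfg["depth"] - 7) ** 2

def sample_config(rng):
    # Hypothetical search space: log-uniform learning rate, integer depth.
    return {"lr": 10 ** rng.uniform(-4, 0), "depth": rng.randint(2, 12)}

DEFAULT = {"lr": 0.1, "depth": 6}  # hypothetical "sensible default"

_, seeded_best, seeded_trace = random_search(toy_objective, sample_config, 10, default=DEFAULT)
_, plain_best, plain_trace = random_search(toy_objective, sample_config, 10)
print("seeded:", round(seeded_best, 4), "unseeded:", round(plain_best, 4))
```

Comparing the two best-so-far traces at equal budgets shows how much of an early lead comes from the default alone, which is the kind of baseline the study argues for.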
Implications
The findings suggest that, for hyperparameter optimization on tabular data, practitioners should prefer classical search methods seeded with a sensible default configuration over LLM advisors, which showed no significant advantage in this study. Evaluations of LLM-based HPO should also include a default-seeded baseline.
KLip-PPO: A per-sample KL perspective on PPO-Clip
Reinforcement Learning
Optimization
Theory
- Establishes a per-sample equivalence between PPO-Clip and PPO-KL surrogates.
- Demonstrates that both surrogates produce indistinguishable training outcomes on benchmark tasks.
- Clarifies the implicit structure of the PPO-Clip surrogate as a per-sample KL penalty.
- Suggests new avenues for algorithmic generalization based on the identified structural features.
Summary
This paper presents KLip-PPO, a novel perspective on the Proximal Policy Optimization (PPO) algorithm, specifically focusing on the relationship between the clipped surrogate and the Kullback-Leibler (KL) penalty. The authors demonstrate that the gradient of the PPO-Clip surrogate can be exactly reproduced by a KL surrogate with a per-sample coefficient that varies based on the importance ratio and advantage. This finding clarifies the structural features of the clipped surrogate, revealing that it acts as a per-sample KL penalty with a step function shape at the trust region boundary. The empirical validation on five MuJoCo continuous-control benchmarks shows that both the clipped and KL surrogates yield indistinguishable training curves, reinforcing the idea that they are not merely alternative algorithms but are fundamentally linked. The paper also discusses potential extensions of this framework, suggesting that the per-sample coefficient can serve as a design axis for future developments in policy optimization algorithms.
Methodology
The authors derive a per-sample gradient identity that connects the PPO-Clip and PPO-KL surrogates. They validate this identity empirically through experiments on five MuJoCo continuous-control benchmarks, comparing the performance of both surrogates in terms of training curves and stability.
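One way such an identity can arise is sketched below, under assumptions that are not confirmed by the summary: the per-sample KL term is taken to be the estimator k(r) = r - 1 - log r, and the coefficient `beta` is treated as a constant (stop-gradient) when differentiating. With these choices, the derivative of the clipped surrogate with respect to the importance ratio r matches that of the penalized surrogate when `beta` is zero inside the trust region and A·r/(r - 1) outside it. All function names are hypothetical.

```python
import math

def is_clipped(r, adv, eps):
    """True where PPO-Clip's min() selects the clipped (flat) branch."""
    return (adv > 0 and r > 1 + eps) or (adv < 0 and r < 1 - eps)

def clip_surrogate(r, adv, eps=0.2):
    clipped_r = min(max(r, 1 - eps), 1 + eps)
    return min(r * adv, clipped_r * adv)

def clip_grad_wrt_ratio(r, adv, eps=0.2):
    # The clipped branch is constant in r, so its derivative is zero.
    return 0.0 if is_clipped(r, adv, eps) else adv

def per_sample_beta(r, adv, eps=0.2):
    # Step-shaped coefficient: zero inside the trust region, positive outside.
    return adv * r / (r - 1) if is_clipped(r, adv, eps) else 0.0

def kl_surrogate(r, adv, beta):
    # Assumed per-sample KL estimator: k(r) = r - 1 - log r.
    return r * adv - beta * (r - 1 - math.log(r))

def kl_grad_wrt_ratio(r, adv, eps=0.2):
    beta = per_sample_beta(r, adv, eps)  # held fixed (stop-gradient)
    return adv - beta * (1.0 - 1.0 / r)

for r, adv in [(0.9, 1.0), (1.5, 2.0), (0.5, -1.0), (1.3, -0.7)]:
    print(r, adv, clip_grad_wrt_ratio(r, adv), kl_grad_wrt_ratio(r, adv))
```

In this construction the coefficient is non-negative in both clipped cases, since A and r - 1 share the same sign there, which is consistent with reading the clip as a penalty that switches on at the trust region boundary.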
Results
The empirical results indicate that the training curves produced by both the PPO-Clip and PPO-KL surrogates are indistinguishable across all tested benchmarks, reinforcing the theoretical findings of the paper. This suggests that the two methods can be viewed as different manifestations of the same underlying principle.
Implications
The findings have significant implications for the design of reinforcement learning algorithms, suggesting that future research can leverage the identified per-sample KL penalty structure to create more robust and flexible policy optimization methods. This could lead to improved performance in various reinforcement learning applications.
CLIP-guided Diffusion Model for Backdoor Generation in Sensor-based Human Activity Recognition
Generative Models
Time Series
Multimodal
- Introduction of IMU-DM-CLIP, a backdoor training technique for HAR models.
- Demonstration of the effectiveness of backdoor attacks with low injection rates.
- Identification of challenges in generating unbiased synthetic data for HAR.
- Evaluation of the stealthiness and performance of the proposed backdoor generation technique.
Summary
This paper addresses the challenges of data scarcity in Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors by proposing a novel backdoor training technique called IMU-DM-CLIP. The authors leverage diffusion models, which are effective in generating high-quality synthetic data, to facilitate trigger-based attacks on HAR models. The proposed method allows for the generation of synthetic sensor data that can be manipulated to include backdoor triggers, enabling adversaries to influence model behavior during inference. The study demonstrates that the attack remains effective even at a low backdoor injection rate of 10%. The paper also discusses the implications of backdoor attacks for the integrity of HAR systems, particularly in the context of IoT and wearable devices, highlighting the need for robust defenses against such vulnerabilities.
Methodology
The authors propose a diffusion model-based approach to generate synthetic IMU data for HAR, integrating a backdoor attack mechanism. The method involves fine-tuning a CLIP-guided diffusion model to create data that includes specific triggers, allowing for targeted misclassification during inference. The evaluation includes empirical analysis to assess the attack's success rate and stealthiness.
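For readers unfamiliar with the evaluation terms, the sketch below shows how attack success rate and clean accuracy are commonly computed when auditing a model for backdoors. These are the usual definitions from the backdoor literature and are assumed here; the summary does not state the paper's exact metrics. The function names and toy labels are hypothetical.

```python
def attack_success_rate(true_labels, preds_on_triggered, target):
    """Fraction of non-target samples classified as `target` once the trigger is present."""
    hits = total = 0
    for y, p in zip(true_labels, preds_on_triggered):
        if y == target:
            continue  # samples already in the target class say nothing about the attack
        total += 1
        hits += int(p == target)
    return hits / total if total else 0.0

def clean_accuracy(true_labels, preds_on_clean):
    """Accuracy on untouched inputs; a large drop would make a backdoor easier to notice."""
    correct = sum(int(y == p) for y, p in zip(true_labels, preds_on_clean))
    return correct / len(true_labels)

# Toy labels for illustration only (not data from the paper).
y_true = [0, 1, 2, 1, 0]
pred_clean = [0, 1, 2, 1, 1]
pred_triggered = [2, 2, 2, 1, 2]

asr = attack_success_rate(y_true, pred_triggered, target=2)
acc = clean_accuracy(y_true, pred_clean)
print(f"attack success rate {asr:.2f}, clean accuracy {acc:.2f}")
```

Reporting both numbers together is what allows a defender to judge whether a suspected model behaves normally on clean data while misbehaving on triggered inputs.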
Results
The empirical analysis shows that the IMU-DM-CLIP technique successfully executes backdoor attacks on HAR models with a low backdoor injection rate of 10%. The results indicate that the generated synthetic data can effectively mislead the HAR model while maintaining a high level of realism.
Implications
The findings raise concerns about the security of HAR systems, particularly in applications involving IoT and wearable devices. The ability to inject backdoors into synthetic data generation processes necessitates the development of robust defenses to protect against such vulnerabilities, ensuring the reliability and safety of machine learning applications in sensitive areas like health monitoring and training.
Collapsed Effective Operators for Higher-order Structures
Graph Learning
Theory
- Introduction of Collapsed Effective Operators for higher-order structures.
- The operator condenses higher-order interactions into a single vertex-level representation.
- Preserves positive semi-definiteness and lowers system energy under higher-order connectivity.
- Empirical improvements in spectral clustering and signal smoothing.
Summary
This paper addresses the limitations of existing spectral operators in modeling higher-order structures, which are essential for capturing complex relational data. The authors introduce Collapsed Effective Operators, a novel approach that condenses higher-order degrees of freedom into a single vertex-level operator using Schur complementation of a graded Laplacian. This operator encodes long-range interactions mediated by topology, preserving positive semi-definiteness and lowering system energy under higher-order connectivity. The proposed method improves spectral clustering and signal smoothing, and facilitates the integration of topological features into neural network architectures through positional encoding. The paper emphasizes the importance of capturing the influence of higher-order structures on vertex-level dynamics, overcoming the challenges of fusing multi-rank information into coherent signals for node-level tasks.
Methodology
The authors derive the Collapsed Effective Operators by applying Schur complementation to a graded Laplacian, allowing for the integration of higher-order structures into a vertex-level operator. This approach contrasts with traditional methods that operate rank-by-rank or require ad hoc aggregation schemes.
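The Schur complement step can be illustrated on a toy example. The sketch below is an assumption-laden stand-in for the paper's graded Laplacian: a path graph on four vertices is augmented with one auxiliary node (index 4) that represents a higher-order cell touching vertices 0, 1, and 2, and that node is then eliminated. The helper names and the construction are hypothetical.

```python
def laplacian(n, weighted_edges):
    """Dense graph Laplacian from (i, j, weight) triples."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    return L

def schur_collapse(M, n_vertices):
    """Schur complement onto the first n_vertices indices, eliminating one pivot at a time."""
    S = [row[:] for row in M]
    for k in range(len(S) - 1, n_vertices - 1, -1):
        pivot = S[k][k]
        if abs(pivot) < 1e-12:
            raise ValueError("singular pivot; a pseudo-inverse would be needed")
        S = [[S[i][j] - S[i][k] * S[k][j] / pivot for j in range(k)] for i in range(k)]
    return S

def quad(M, x):
    """Quadratic form x^T M x, read here as the 'energy' of signal x."""
    return sum(x[i] * M[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0),   # path 0-1-2-3
         (0, 4, 1.0), (1, 4, 1.0), (2, 4, 1.0)]   # auxiliary higher-order node 4
M = laplacian(5, edges)
L_vv = [row[:4] for row in M[:4]]   # vertex block before collapsing
S = schur_collapse(M, 4)            # collapsed vertex-level operator

x = [1.0, -2.0, 0.5, 3.0]
print("new 0-2 coupling:", S[0][2])
print("energy before:", quad(L_vv, x), "after:", quad(S, x))
```

In this toy case the collapsed operator gains a direct coupling between vertices 0 and 2, which are not adjacent in the path, and its quadratic form never exceeds that of the uncollapsed vertex block, because the subtracted term is itself positive semi-definite.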
Results
The proposed operator demonstrates improved performance in spectral clustering and signal smoothing tasks. It effectively encodes the influence of higher-order structures on vertex dynamics, providing a principled construction for node-level predictions.
Implications
The findings suggest that the Collapsed Effective Operators can enhance various applications in machine learning that involve complex relational data, such as social networks, biological systems, and collaborative networks, by providing a more coherent representation of higher-order interactions.