AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48 papers today, updated every 8 hours, 7 days of history
Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis
Time Series
- ZEUS is a unified Time Series Foundation Model (TSFM) that operates without task-specific fine-tuning.
- It incorporates a multi-scale Transformer architecture for efficient long-sequence modeling.
- Multi-Objective Temporal Masking (MOTM) lets ZEUS learn diverse task-specific inductive biases in a single framework.
- Experimental results show competitive performance across five key time series tasks.
Summary
The paper introduces ZEUS, a unified tuning-free Time Series Foundation Model (TSFM) designed to enhance performance across various time series analysis tasks without requiring task-specific fine-tuning. ZEUS addresses two main challenges in multi-task generalization: the architectural dilemma between point-level granularity and long-sequence scalability, and the training dilemma posed by divergent inductive biases across different tasks. To tackle these issues, ZEUS employs a multi-scale Transformer architecture that utilizes point-wise tokenization and a U-shaped hierarchy, which balances fine-grained detail with computational efficiency. Additionally, it introduces Multi-Objective Temporal Masking (MOTM), a strategy that supports various tasks such as extrapolation, interpolation, and global abstraction within a single framework. Experimental results demonstrate that ZEUS achieves state-of-the-art performance across five benchmark tasks, showcasing its capability as a general-purpose TSFM without the need for tuning or retraining, thus paving the way for broader applications in time series analysis.
Methodology
The methodology involves a multi-scale Transformer architecture that combines point-wise tokenization with a U-shaped hierarchy to manage both fine-grained fidelity and scalability. The Multi-Objective Temporal Masking (MOTM) strategy is employed to expose the model to various corruption patterns during pretraining, enabling it to learn a versatile representation that supports multiple downstream tasks.
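As a rough illustration of what "various corruption patterns" could look like, the sketch below builds two kinds of boolean masks over a series: a suffix mask (extrapolation-style) and an interior-span mask (interpolation-style). The function names, the 50/50 sampling rule, and the zero fill value are assumptions made for this example, not details taken from the paper, and the global-abstraction objective is not covered.

```python
import random


def extrapolation_mask(length, horizon):
    """Hide the last `horizon` points, as in a forecasting-style objective."""
    if not 0 < horizon < length:
        raise ValueError("horizon must be between 1 and length - 1")
    return [False] * (length - horizon) + [True] * horizon


def interpolation_mask(length, span, rng):
    """Hide one contiguous interior span, keeping both endpoints visible."""
    if not 0 < span <= length - 2:
        raise ValueError("span must leave both endpoints visible")
    start = rng.randint(1, length - span - 1)
    return [start <= i < start + span for i in range(length)]


def sample_mask(length, rng):
    """Pick one corruption pattern at random (hypothetical 50/50 rule, length >= 4)."""
    size = rng.randint(1, length // 2)
    if rng.random() < 0.5:
        return extrapolation_mask(length, size)
    return interpolation_mask(length, size, rng)


def apply_mask(series, mask, fill=0.0):
    """Replace hidden points with a placeholder value."""
    return [fill if hidden else value for value, hidden in zip(series, mask)]
```

A pretraining loop would then ask the model to reconstruct the hidden points from `apply_mask(series, sample_mask(len(series), rng))`.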
Results
ZEUS consistently outperforms existing task-specific models in a tuning-free setting across five representative tasks: point forecasting, probabilistic forecasting, anomaly detection, imputation, and classification. The results indicate that ZEUS achieves state-of-the-art performance on well-established benchmarks, demonstrating its generalizability and effectiveness.
Implications
The development of ZEUS has significant implications for time series analysis, as it allows for out-of-the-box deployment across diverse tasks without the need for fine-tuning. This can streamline workflows in various applications, such as forecasting, anomaly detection, and data imputation, making advanced time series analysis more accessible and efficient.
A Filtered Mixture-of-Generators for Fully Synthetic Survival Training
Generative Models
Optimization
Time Series
- FoGS introduces a novel pipeline for synthetic data construction in survival analysis, focusing on sample selection from multiple generators.
- The method improves downstream performance metrics on the majority of evaluated datasets compared to traditional single-generator approaches.
- FoGS maintains privacy margins while providing a viable alternative to real-data training in clinical settings.
- The study identifies key trade-offs in synthetic data selection, emphasizing the importance of balancing plausibility and population coverage.
Summary
This paper addresses the challenges of training survival models in clinical settings, where data is scarce and privacy regulations limit data sharing. The authors propose a novel approach called FoGS (Filtered Mixture-of-Generators for Survival analysis), which reframes the construction of synthetic training data as a sample selection problem rather than a simple generation task. FoGS utilizes a heterogeneous pool of four distinct tabular generators and scores each generated sample using an ensemble of survival models trained on real data. The selection policy is optimized through an outer loop that adjusts generator quotas and scorer weights based on downstream performance, while an inner loop tunes the survival model itself. The evaluation of FoGS on 16 public datasets demonstrates that it can improve downstream performance metrics, such as the concordance index (C-index) and integrated Brier score (IBS), often matching or exceeding the performance of models trained on real data. The findings suggest that this approach can effectively substitute for real-data training in privacy-restricted environments, thereby facilitating the development of survival models in clinical research.
Methodology
FoGS employs a two-level pipeline where a candidate pool of synthetic samples is generated from four distinct tabular generators. Each sample is scored by an ensemble of survival models trained on real data, using proper scoring rules as a plausibility proxy. An outer optimization loop tunes the selection policy based on downstream performance, while an inner loop optimizes the survival model (XGBoost-Cox) using the selected synthetic data.
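The selection step can be pictured with a small sketch. It assumes each candidate carries one plausibility score per scorer (higher is better), that the policy consists of per-generator quotas and per-scorer weights, and that selection keeps the top-scoring candidates within each quota. The data layout and the name `select_samples` are hypothetical; the outer loop that tunes the policy and the inner survival-model fit are not shown.

```python
from collections import defaultdict


def select_samples(candidates, scorer_weights, quotas):
    """Keep the top `quotas[g]` candidates from each generator g.

    candidates: list of dicts with keys "generator", "scores", "row".
    scorer_weights: one weight per scorer, applied to each candidate's scores.
    quotas: dict mapping generator name to the number of samples to keep.
    """
    by_generator = defaultdict(list)
    for cand in candidates:
        combined = sum(w * s for w, s in zip(scorer_weights, cand["scores"]))
        by_generator[cand["generator"]].append((combined, cand))

    selected = []
    for generator, quota in quotas.items():
        ranked = sorted(by_generator.get(generator, []),
                        key=lambda pair: pair[0], reverse=True)
        selected.extend(cand for _, cand in ranked[:quota])
    return selected
```

An outer optimizer would call `select_samples` with different quotas and weights, train the downstream model on the result, and keep the policy that scores best on validation data.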
Results
FoGS achieved mean improvements of +2.17 in C-index and +0.67 in IBS across 16 public datasets, outperforming real-data training on most cohorts. Specifically, it improved both metrics on 9 of the 16 datasets and at least one metric on 13 datasets, with statistical significance indicated by one-sided Wilcoxon tests (p = 0.039 and p = 0.035).
Implications
The findings suggest that FoGS can significantly enhance the training of survival models in clinical research, particularly in scenarios where data is limited and privacy concerns are paramount. This approach may facilitate better decision-making in healthcare by enabling the use of synthetic cohorts for model training.
Fixed-Set Robustness in Programming by Example: Example Corruption and Semantic Partition Recovery
Theory
- Introduces the concept of fixed-set worst-case corruption in Programming by Example (PBE) systems.
- Proposes version-space partition aggregation (VPA) as a defense mechanism.
- Demonstrates that low-margin PBE tasks are particularly susceptible to adversarial attacks.
- Shows that VPA can recover from certain corruptions but struggles with low semantic vote margins.
Summary
This paper addresses the robustness of Programming by Example (PBE) systems against adversarial corruption of input-output examples. Unlike traditional approaches that model noise as stochastic processes, the authors focus on worst-case corruption scenarios where an adversary strategically alters examples to degrade the performance of the synthesizer. They formalize the concept of fixed-set worst-case corruption and propose a defense mechanism called version-space partition aggregation (VPA), which synthesizes programs from disjoint groups of examples and evaluates them based on semantic signatures. The study reveals that low-margin PBE tasks are particularly vulnerable to adversarial attacks, which are often overlooked by existing noise-tolerant models. Through experiments on various datasets, including curated and generated tasks, the authors demonstrate that while VPA can recover from certain types of corruption, its effectiveness diminishes when the semantic vote margin is low. The findings highlight the need for robust defenses in PBE systems, especially in real-world applications where adversarial manipulation is a concern.
Methodology
The authors implemented a deterministic string domain-specific language (DSL) for PBE, along with exact-within-bounded-pool and heuristic corruption search methods. They conducted experiments using VPA to synthesize programs from disjoint example groups and evaluated their performance against various corruption scenarios using curated and generated datasets.
Results
The results indicate that targeted adversarial corruption can significantly degrade the performance of PBE systems, with VPA showing promise in recovering from certain corruptions. However, VPA's effectiveness depends on the semantic vote margin, and that margin is often too small in realistic tasks. The experiments revealed that random noise models achieved only modest success rates, while VPA struggled under low-margin conditions, leading to a drop in accuracy.
Implications
The findings suggest that PBE systems need to incorporate robust defenses against adversarial attacks, particularly in applications involving shared or sensitive data. The study emphasizes the importance of understanding the vulnerabilities of PBE systems to improve their reliability and security in real-world scenarios.
Neural Certificate Pricing for Combinatorial Optimization Problems
Optimization
Theory
Graph Learning
- Neural Certificate Pricing (NCP) transforms the certification process into a learnable optimization pipeline.
- The framework separates learned price signals from structural certificates for effective recovery.
- Local stability results indicate robustness of the recovery process against price prediction errors.
- NCP outperforms state-of-the-art methods in various combinatorial optimization (CO) problem classes.
Summary
This paper introduces Neural Certificate Pricing (NCP), a novel unsupervised learning framework designed to tackle combinatorial optimization (CO) problems. Traditional CO approaches struggle with the exponential search space that must be explored to find and certify an optimal solution, even though a certifiable discrete structure can be verified in polynomial time. NCP leverages this asymmetry by training a neural network to predict certificate-level dual prices, which are then used in a structured recovery layer to construct a primal marginal that is certificate-consistent. The authors demonstrate that NCP can effectively transform the certification process into a mechanism for recovering optimal solutions. They establish a local stability result indicating that small errors in predicted prices lead to minimal degradation in objective value. The framework is validated across three classes of CO problems, showing significant improvements over existing neural and heuristic baselines in terms of performance and computational efficiency.
Methodology
NCP employs an unsupervised learning framework where a neural network predicts dual prices that guide the recovery of a certificate-consistent primal marginal. The method utilizes a structured recovery layer to ensure that the recovered solution adheres to the necessary structural constraints of the CO problem.
Results
The experimental results indicate that NCP consistently outperforms existing neural baselines by significant margins or matches their performance with reduced computational costs. The framework also exhibits enhanced generalization to out-of-distribution scenarios, demonstrating its robustness and adaptability.
Implications
NCP could make approaches to complex combinatorial optimization problems more efficient and accessible. Its unsupervised nature allows for broader applications in real-world scenarios where labeled data is scarce or unavailable.
Predicting Early Stages Of Alzheimer's Disease And Identifying Key Biomarkers Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies
Theory
- Developed an automatic diagnostic system for early-stage Alzheimer's Disease.
- Addressed data challenges including missing values and class imbalance.
- Utilized advanced feature selection techniques to identify significant biomarkers.
- Implemented both ensemble and deep learning models for comparative analysis.
Summary
This thesis presents a comprehensive study aimed at developing an automatic diagnostic system for the early stages of Alzheimer's Disease (AD) using advanced machine learning techniques. The study highlights the challenges associated with AD, including its unpredictable onset and progression, which complicates timely diagnosis and treatment. The research utilizes a dataset from the Alzheimer’s Disease Neuroimaging Initiative, addressing issues such as missing values and class imbalance through iterative imputation and the borderline SVM-SMOTE algorithm, respectively. Feature selection is performed using both wrapper-based and embedded techniques to identify significant biomarkers associated with AD. The methodology includes a stacking-based ensemble model comprising Logistic Regression, Extra Tree, Bagging KNN, and LightGBM classifiers, alongside a deep learning model based on Artificial Neural Networks (ANN). A comparative analysis of these models is conducted using performance metrics such as precision, recall, F1-score, and AUC-ROC to identify the most effective classifier for early AD diagnosis. The study aims to assist clinicians in understanding key biomarkers and improving diagnostic accuracy, thereby facilitating early intervention and management of the disease.
Methodology
The study employed a dataset from the Alzheimer’s Disease Neuroimaging Initiative, applying iterative imputation for missing values and the borderline SVM-SMOTE algorithm to tackle class imbalance. Feature selection was performed using wrapper-based and embedded techniques. A stacking-based ensemble model was constructed with various classifiers, and a deep learning model (ANN) was also implemented for comparison.
Results
The comparative analysis revealed the performance of different models in diagnosing early stages of Alzheimer's Disease, with specific classifiers demonstrating superior accuracy based on metrics like precision, recall, F1-score, and AUC-ROC. The study successfully identified key biomarkers associated with the disease.
Implications
The findings of this research have significant implications for clinical practice, as they provide a framework for early diagnosis of Alzheimer's Disease, potentially leading to timely interventions and better management of the condition. The identification of key biomarkers can also enhance understanding of the disease's progression and inform future research.
SA-HGNN: Sample-Adaptive Hyperbolic Graph Neural Network for EEG-Based Depression Recognition
Graph Learning
- SA-HGNN introduces a Sample-Adaptive Graph Construction module for personalized brain network topologies.
- Utilizes hyperbolic graph convolution to effectively capture hierarchical relationships in brain connectivity.
- Incorporates an Attention Pooling module to mitigate noise interference in EEG signals.
- Demonstrates superior performance over traditional GNNs in EEG-based depression recognition tasks.
Summary
This paper introduces the Sample-Adaptive Hyperbolic Graph Neural Network (SA-HGNN), a novel approach aimed at enhancing EEG-based depression recognition by accurately modeling the hierarchical structure of brain networks affected by Major Depressive Disorder (MDD). Traditional Graph Neural Networks (GNNs) struggle to capture the complex hierarchical relationships inherent in brain connectivity due to their reliance on Euclidean space metrics. The SA-HGNN addresses this limitation by employing a hyperbolic space metric, which allows for better representation of hierarchical structures. The model consists of three main components: a Sample-Adaptive Graph Construction module that dynamically creates personalized brain network topologies, a Hyperbolic Graph Convolution module that leverages hyperbolic geometry to capture latent hierarchical relationships, and an Attention Pooling module that filters out redundant noise from EEG signals. Extensive experiments conducted on public EEG datasets demonstrate that SA-HGNN outperforms traditional GNNs in both resting-state and task-related paradigms, showcasing its robustness against noise and its effectiveness in identifying abnormal functional connectivity patterns in the brains of patients with depression.
Methodology
The methodology involves constructing a hyperbolic graph neural network that integrates three core modules: a Sample-Adaptive Graph Construction module for dynamic topology creation, a Hyperbolic Graph Convolution module for capturing hierarchical relationships, and an Attention Pooling module for noise reduction in EEG data.
Results
The results indicate that SA-HGNN significantly outperforms traditional GNNs based on Euclidean metrics across various EEG datasets, demonstrating its effectiveness in recognizing depression-related brain connectivity patterns.
Implications
The findings suggest that SA-HGNN could lead to more accurate and timely diagnostic tools for Major Depressive Disorder, potentially improving patient outcomes through better understanding and recognition of brain connectivity disruptions.
Learning Generalizable Skill Policy with Data-Efficient Unsupervised RL
Reinforcement Learning
Robotics
Efficient ML
- Introduction of GenDa framework to enhance data efficiency and generalizability in unsupervised RL.
- Skill relabeling mechanism to address non-stationary skill semantics and improve pre-training efficiency.
- Complementary Information Bottleneck (CIB) to ensure robustness against distribution shifts.
- Demonstrated superior performance on diverse benchmarks compared to state-of-the-art methods.
Summary
This paper addresses the limitations of current off-policy unsupervised reinforcement learning (URL) methods, specifically focusing on the issues of non-stationary skill semantics and brittle generalization. The authors propose a unified framework called GenDa (Generalizable Data-efficient Agent) to enhance the scalability of URL. GenDa introduces a skill relabeling mechanism to improve data efficiency during pre-training by mitigating semantic drift, which occurs when the same skill leads to different behaviors over time. Additionally, the paper presents a Complementary Information Bottleneck (CIB) that encourages the learned skill policy to focus on ego-centric features, thus enhancing robustness against distribution shifts in downstream tasks. The proposed methods are evaluated across various benchmarks, demonstrating significant improvements in data efficiency and generalizability compared to existing approaches. The results indicate that GenDa can effectively learn coherent and reusable skill policies, even in high-dimensional state environments where previous methods struggle.
Methodology
The authors formalize two critical challenges in off-policy URL: sample inefficiency due to semantic drift and brittle generalization from overfitting to global context. They introduce a skill relabeling mechanism to counteract semantic drift and a CIB module to focus on ego-centric features, preventing reliance on global contextual information. The framework is evaluated through experiments on various state and pixel benchmarks to assess data efficiency and skill policy generalization.
Results
GenDa significantly outperforms existing unsupervised RL methods in terms of data efficiency and generalizability. The experiments show that the proposed skill relabeling and CIB approaches lead to more consistent skill execution across different environments, particularly in high-dimensional state settings where prior methods fail to discover meaningful skills.
Implications
The findings suggest that GenDa can serve as a robust foundation for developing scalable unsupervised RL systems, potentially impacting various applications in robotics and autonomous systems where efficient skill learning and adaptability are crucial.
ZO-Act: Efficient Zeroth-Order Fine-Tuning via One-Shot Activation-Informed Low-Rank Subspaces
NLP
Large Language Models
Optimization
- ZO-Act utilizes activation-informed low-rank subspaces for efficient fine-tuning of large language models.
- The method reduces perturbation dimensions, leading to lower variance in gradient estimation.
- It supports momentum-based optimizers and quantized model fine-tuning by freezing original weights.
- Experiments show ZO-Act outperforms strong ZO fine-tuning baselines across multiple tasks.
Summary
The paper introduces ZO-Act, a novel zeroth-order (ZO) fine-tuning method designed to optimize large language models (LLMs) without the need for backpropagation, which can be memory-intensive. Traditional ZO methods often suffer from high variance due to perturbations in the full parameter space. ZO-Act addresses this by restricting perturbations to a low-rank subspace informed by input activations. This approach allows for a one-shot computation of a fixed activation basis at initialization, enabling the optimization of lightweight coefficient matrices instead of full model weights. The authors demonstrate that this method reduces the effective perturbation dimension, enhances convergence stability, and supports quantized model fine-tuning by keeping low-bit weights frozen. Through experiments on models like Llama-3-8B and OPT-13B, ZO-Act shows consistent improvements over existing ZO fine-tuning baselines across various language tasks, including understanding, question answering, and commonsense reasoning.
Methodology
ZO-Act computes a low-rank basis from input activations during initialization and optimizes only the corresponding lightweight coefficient matrices. This method transforms the fine-tuning process into subspace optimization, allowing for effective zeroth-order optimization while maintaining memory efficiency.
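To make the subspace idea concrete, here is a minimal sketch of a two-point zeroth-order update that perturbs only a small coefficient vector while the original weights and the basis stay fixed. It is a simplification under stated assumptions: weights are a flat vector rather than matrices, the basis is hand-written rather than derived from activations, and the names `compose_weights` and `zo_subspace_step` are hypothetical. It is not the paper's algorithm.

```python
import random


def compose_weights(w0, basis, coeffs):
    """Frozen weights plus a combination of fixed basis directions."""
    return [w + sum(c * b[j] for c, b in zip(coeffs, basis))
            for j, w in enumerate(w0)]


def zo_subspace_step(loss, w0, basis, coeffs, lr, eps, rng):
    """One two-point zeroth-order update applied to the coefficients only."""
    z = [rng.gauss(0.0, 1.0) for _ in coeffs]  # random direction in coefficient space
    plus = loss(compose_weights(w0, basis, [c + eps * zi for c, zi in zip(coeffs, z)]))
    minus = loss(compose_weights(w0, basis, [c - eps * zi for c, zi in zip(coeffs, z)]))
    directional = (plus - minus) / (2.0 * eps)  # finite-difference slope along z
    return [c - lr * directional * zi for c, zi in zip(coeffs, z)]


def run_demo(steps=200, seed=0):
    """Minimize a toy quadratic using forward evaluations only."""
    rng = random.Random(seed)
    w0 = [0.0, 0.0, 0.0]                        # frozen "pretrained" weights
    basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # assumed fixed rank-2 basis
    target = [1.0, -2.0, 0.0]
    loss = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    coeffs = [0.0, 0.0]
    initial = loss(compose_weights(w0, basis, coeffs))
    for _ in range(steps):
        coeffs = zo_subspace_step(loss, w0, basis, coeffs, lr=0.05, eps=1e-3, rng=rng)
    return initial, loss(compose_weights(w0, basis, coeffs))
```

Because the random direction `z` has as many entries as there are coefficients rather than as many as there are weights, the estimate is taken over far fewer dimensions, which is the variance argument made in the summary.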
Results
The experiments conducted on Llama-3-8B, OPT-13B, and INT4 Llama-3-8B demonstrate that ZO-Act consistently achieves better performance than existing ZO fine-tuning methods across language understanding, question answering, and commonsense reasoning tasks, indicating its effectiveness for both full-precision and quantized models.
Implications
ZO-Act presents a significant advancement in the fine-tuning of large language models, particularly in scenarios where memory constraints are critical. Its ability to maintain performance while reducing computational overhead opens new avenues for deploying LLMs in resource-limited environments.
Self-Gating Attention for Efficient Time Series Forecasting
Time Series
Efficient ML
- Introduces Self-Gating Attention (SGA) to improve efficiency in time series forecasting.
- Reduces computational complexity from quadratic to linear with respect to look-back length.
- Utilizes a shared attention matrix for common patterns and a residual component for input-specific variations.
- Demonstrates competitive performance against state-of-the-art attention mechanisms across multiple datasets.
Summary
This paper addresses the inefficiencies of standard self-attention mechanisms in time series forecasting, which exhibit quadratic time and memory complexity relative to the look-back length. The authors propose a novel attention mechanism called Self-Gating Attention (SGA) that reduces computational costs while maintaining forecasting performance. SGA utilizes a shared learnable matrix to capture common attention patterns across timestamps, complemented by an input-dependent residual component to account for variations. This approach allows SGA to achieve linear time and memory complexity. The authors integrate SGA into various forecasting models and evaluate its performance against standard self-attention and lightweight attention variants across nine real-world datasets, including those from electricity, finance, and weather domains. The results demonstrate that SGA significantly enhances inference efficiency while preserving competitive forecasting accuracy, thus providing a viable solution for resource-constrained environments.
Methodology
The authors conducted qualitative and quantitative analyses to identify redundancy in standard self-attention score computations. They designed SGA, which replaces traditional query and key projections with a shared learnable matrix and an input-dependent residual component. The effectiveness of SGA was tested by integrating it into various forecasting backbones and comparing performance metrics against standard self-attention and lightweight variants on multiple datasets.
Results
SGA achieved significant improvements in inference efficiency while maintaining competitive forecasting performance. The experiments showed that SGA outperformed standard self-attention mechanisms in terms of computational efficiency without sacrificing accuracy, as evidenced by results on nine publicly available datasets.
Implications
The proposed SGA mechanism has the potential to enhance the deployment of time series forecasting models in environments with limited computational resources, such as edge devices and high-throughput systems. This could lead to broader applications in industries requiring real-time predictive analytics, such as finance, healthcare, and smart grid management.
DecompRL: Solving Harder Problems by Learning Modular Code Generation
Reinforcement Learning
Large Language Models
Generative Models
- DecompRL decomposes complex problems into smaller, independently solvable modules.
- The framework significantly reduces GPU costs by shifting the computational burden to CPU evaluations.
- DecompRL outperforms traditional RL methods and achieves higher success rates on challenging benchmarks.
- The approach enhances exploration and maximizes the utility of recombined solutions.
Summary
The paper introduces DecompRL, a novel reinforcement learning (RL) framework designed to tackle complex problems by decomposing them into smaller, independently solvable sub-functions. Traditional approaches, such as repeated sampling and RL with verifiable rewards, struggle with high computational costs and limited diversity in solutions. DecompRL addresses these challenges by shifting the focus from generating monolithic solutions to creating modular code that can be recombined. This method allows for a significant reduction in GPU costs while increasing the number of candidate solutions. The authors demonstrate that DecompRL outperforms existing RL baselines on benchmarks like LiveCodeBench and CodeContests, achieving higher problem-solving rates and efficiency. The hierarchical training process involves learning both a decomposition policy to break down problems and an implementation policy to generate code for each sub-function. This approach not only enhances exploration but also maximizes the utility of recombinations, making it a valuable tool for solving hard tasks that standard methods cannot address.
Methodology
DecompRL employs a hierarchical generation approach where problems are decomposed into sub-functions. It trains two policies: a decomposition policy that identifies sub-problems and an implementation policy that generates code for these sub-functions. The framework allows for the recombination of multiple implementations of each sub-function, leading to a combinatorial explosion of candidate solutions while maintaining a manageable computational cost.
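The recombination step is easy to see in a small sketch: with k sampled implementations for each of m sub-functions, there are k^m candidate programs, and each one can be checked against tests on the CPU without further model calls. The `recombine` helper, the `compose` callback, and the toy problem in the usage note are assumptions for illustration, not the paper's code.

```python
from itertools import product


def recombine(implementations, compose, tests):
    """Try every combination of sub-function implementations.

    implementations: dict mapping sub-function name to a list of callables.
    compose: builds a full solution from a dict {name: chosen callable}.
    tests: list of (input, expected_output) pairs.
    Returns (number of candidates tried, list of passing combinations).
    """
    names = list(implementations)
    tried = 0
    passing = []
    for combo in product(*(implementations[name] for name in names)):
        tried += 1
        candidate = compose(dict(zip(names, combo)))
        try:
            ok = all(candidate(x) == y for x, y in tests)
        except Exception:
            ok = False  # a crashing candidate counts as a failure
        if ok:
            passing.append(combo)
    return tried, passing
```

For example, for "sum the squares of the even numbers", two sampled implementations each of `keep_even`, `square`, and `total` give 2 × 2 × 2 = 8 candidates from only 6 generated functions.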
Results
DecompRL demonstrated superior performance compared to standard and diversity-optimized RL baselines, solving up to 35% of the hard subset of problems in LiveCodeBench. The method effectively reduced GPU token costs by approximately 50 times, enabling the generation of a vast number of candidate solutions without proportionally increasing computational expenses.
Implications
The findings suggest that modular code generation can significantly enhance the capabilities of large language models in solving complex programming tasks. DecompRL's approach could be applied in various domains requiring hierarchical problem-solving, such as competitive programming, formal proofs, and scientific computing, potentially leading to more efficient and scalable AI systems.
Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis
Large Language Models
Graph Learning
- LLMs exhibit limited generalization in the molecular domain, with performance sensitive to small structural changes.
- The Molecular Perturbation framework allows for systematic evaluation of model robustness through controlled structural edits.
- In-Context Tuning (ICT) can enhance model stability by anchoring predictions to structurally similar molecules.
- The study highlights the disconnect between probabilistic modeling in LLMs and the rigid topological constraints of chemical structures.
Summary
This paper investigates the generalization capabilities of Large Language Models (LLMs) in the molecular domain, particularly focusing on their performance under structural perturbations. The authors introduce a Molecular Perturbation framework that generates valid structural variants of training molecules using controlled Graph Edit Distance (GED) to assess the robustness of molecular LLMs. The findings reveal that even minor structural edits can lead to significant performance drops, indicating a narrow local trust region and high sensitivity to structural changes. To address this issue, the authors explore In-Context Tuning (ICT), which conditions predictions on structurally similar molecules, showing that it can partially mitigate the observed fragility and expand the local trust region. The study emphasizes the importance of aligning model predictions with chemically meaningful similarities to improve the stability of molecular LLMs against structural variations.
Methodology
The authors developed a Molecular Perturbation framework that generates syntax-valid structural variants of molecules using controlled Graph Edit Distance (GED). They conducted empirical analyses to evaluate model performance under various perturbations and examined the effects of In-Context Tuning (ICT) on model robustness.
Results
The analysis demonstrated that even a single structural edit could lead to substantial performance degradation in molecular tasks. The introduction of ICT showed promise in partially expanding the local trust region and improving performance under structural perturbations, although it did not completely eliminate sensitivity to changes.
Implications
The findings suggest that enhancing LLMs with mechanisms like ICT could improve their applicability in molecular discovery and related fields, where understanding structural variations is crucial. This research points toward the need for models that can better align with the complexities of chemical space.
Unveiling the Non-Monotonic Effect of Privacy on Generalization under Byzantine Robustness
Federated Learning
Theory
Optimization
- The privacy-robustness-optimization trilemma does not extend to generalization error.
- In high-noise regimes, increasing privacy improves generalization performance.
- In low-noise regimes, increased privacy can lead to worse generalization due to the influence of Byzantine participants.
- The effectiveness of membership inference attacks is critical in determining generalization behavior.
Summary
This paper investigates the interplay between local differential privacy (LDP) and Byzantine robustness in distributed learning systems, revealing a non-monotonic relationship between privacy and generalization error. The authors establish that while increasing privacy reduces generalization error in high-noise regimes (strong privacy), it can degrade generalization in low-noise regimes (weak privacy). They analyze the algorithmic stability of Byzantine-robust distributed learning under LDP constraints, demonstrating that the effectiveness of membership inference attacks (MIA) plays a crucial role in this relationship. The study identifies a threshold that separates two distinct privacy regimes, each exhibiting different generalization behaviors. Empirical evaluations support the theoretical findings, refining the understanding of the privacy-robustness-utility trade-off in distributed learning contexts.
Methodology
The authors analyze the algorithmic stability of a Byzantine-robust distributed learning algorithm under LDP constraints, deriving lower and upper bounds on generalization error. They identify a threshold for membership inference attacks that delineates two privacy regimes and conduct empirical evaluations to validate their theoretical insights.
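The interaction between local noise and robust aggregation can be illustrated with a small sketch. The summary does not say which aggregation rule or noise mechanism the paper analyzes, so the coordinate-wise median and the Gaussian noise below are assumptions chosen for illustration. The function names are hypothetical, and gradient clipping and the calibration of `sigma` to a privacy budget are omitted.

```python
import random
import statistics

def privatize(gradient, sigma, rng):
    """Local noise step (sketch): a client perturbs its gradient before sending it."""
    return [g + rng.gauss(0.0, sigma) for g in gradient]

def coordinate_median(updates):
    """Robust aggregation (assumed rule): median of each coordinate across clients."""
    return [statistics.median(column) for column in zip(*updates)]

rng = random.Random(0)
true_gradient = [1.0, -2.0, 0.5]

# Seven honest clients send noisy gradients; two Byzantine clients send garbage.
honest = [privatize(true_gradient, sigma=0.1, rng=rng) for _ in range(7)]
byzantine = [[1e6, 1e6, 1e6] for _ in range(2)]
updates = honest + byzantine

robust = coordinate_median(updates)
naive = [sum(column) / len(column) for column in zip(*updates)]
```

With a minority of Byzantine clients, the median stays inside the range of the honest values, while the plain mean is dragged far away. Raising `sigma` widens that honest range, which is one informal way to see why privacy noise and robustness interact.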
Results
The study reveals a non-monotonic relationship between privacy and generalization error, with distinct behaviors in weak and strong privacy regimes. In the weak privacy regime, increasing noise worsens generalization, while in the strong privacy regime, it improves generalization. Empirical results corroborate the theoretical predictions, highlighting the relevance of the identified worst-case scenarios in practical applications.
Implications
The findings have significant implications for the design of distributed learning systems, particularly in balancing privacy and robustness. Understanding the non-monotonic effects of privacy can guide the implementation of effective privacy-preserving mechanisms in federated learning and other distributed frameworks.
Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling
NLP
Large Language Models
Reinforcement Learning
- Identifies 'dimensional blind spots' as a critical failure in single-voiced rubric generation.
- Introduces Multi-Role Rubric Generation (MRRG) to aggregate diverse evaluative perspectives.
- Demonstrates that MRRG outperforms existing single-role rubric generation methods.
- Provides a unified scoring interface applicable to both LLM evaluation and RLVR.
Summary
The paper addresses the challenge of generating reliable reward and preference signals for evaluating large language models (LLMs) on open-ended tasks. It critiques existing annotation-free rubric generators that rely on a single evaluator, which can lead to 'dimensional blind spots'—overlooking important aspects of human preference. To overcome this, the authors propose a novel framework called Multi-Role Rubric Generation (MRRG), which elicits evaluation criteria from multiple complementary roles (e.g., user, domain expert, educator) to create a more comprehensive rubric. This rubric serves as an auditable scorer for validating pairwise preferences and providing rewards for Reinforcement Learning with Verifiable Rewards (RLVR). The empirical results demonstrate that MRRG consistently outperforms single-role baselines across various models and benchmarks, leading to improved reward signals for open-ended generation tasks.
Methodology
The authors develop MRRG as a training-free and reference-free framework that involves multiple evaluative roles. Each role generates specific rubric items, which are then pooled and deduplicated to form a comprehensive rubric-based scorer. This scorer can be utilized for both validating preferences and as a reward model in RLVR settings.
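The pool-and-deduplicate step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the role names, the rubric items, the string-based deduplication, and the keyword `toy_judge` are all hypothetical stand-ins. In practice an LLM would presumably generate the items and judge each one.

```python
def normalize(item):
    """Canonical form used to detect duplicate rubric items (assumed heuristic)."""
    return " ".join(item.lower().split()).rstrip(".")

def pool_rubric(role_items):
    """Pool items from every role, keeping the first copy of each duplicate."""
    seen, rubric = set(), []
    for items in role_items.values():
        for item in items:
            key = normalize(item)
            if key not in seen:
                seen.add(key)
                rubric.append(item)
    return rubric

def rubric_score(response, rubric, judge):
    """Fraction of rubric items that the judge marks as satisfied."""
    return sum(judge(response, item) for item in rubric) / len(rubric)

def toy_judge(response, item):
    """Placeholder judge: checks whether the item's last word appears in the response."""
    return item.lower().rstrip(".").split()[-1] in response.lower()

role_items = {
    "user": ["Mentions dosage", "Uses plain language"],
    "domain expert": ["mentions dosage.", "Explains risks"],
}
rubric = pool_rubric(role_items)
score_a = rubric_score("The dosage is 5 mg; risks include nausea.", rubric, toy_judge)
score_b = rubric_score("Take it.", rubric, toy_judge)
preferred = "A" if score_a >= score_b else "B"
```

The same `rubric_score` value could serve either as a pairwise preference check or as a scalar reward, which is the unified interface the summary refers to.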
Results
MRRG consistently outperformed single-role rubric generation baselines across multiple benchmarks, achieving improvements of 3.1–16.4 percentage points on preference validation tasks and enhancing reward signals by 1.7 points on BiGGen Bench and 3.4 points on HealthBench-Hard in RLVR experiments.
Implications
The proposed MRRG framework has the potential to enhance the evaluation and optimization of LLMs in open-ended tasks by providing a more nuanced understanding of human preferences. This could lead to better alignment of LLM outputs with user expectations and improved performance in diverse application scenarios.
QFedAgent: Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition
Federated Learning
Multimodal
Robotics
- Introduction of QFedAgent, a quantum-enhanced personalized federated learning framework.
- Utilization of variational quantum circuits for efficient multimodal data fusion.
- Achieved a 10× reduction in parameters compared to classical fusion methods.
- Demonstrated high accuracy (97.7%) on the OPPORTUNITY dataset under non-IID conditions.
Summary
The paper presents QFedAgent, a novel hybrid quantum-classical framework for personalized federated learning (FL) aimed at multi-agent activity recognition. Traditional FL methods struggle with the heterogeneous, non-independent and identically distributed (non-IID) data generated by multi-agent systems, particularly in multimodal sensor applications. QFedAgent addresses these challenges by integrating a variational quantum circuit (VQC) fusion module that efficiently models interactions between accelerometer and gyroscope data using quantum state encoding and entanglement. This approach significantly reduces the number of parameters required for model training, achieving a 10× reduction compared to classical methods. The framework employs a shared encoder for feature extraction, while maintaining client-specific adapters and classifiers to ensure personalization. Evaluations on the OPPORTUNITY dataset demonstrate that QFedAgent achieves a mean test accuracy of 97.7% under subject-based non-IID conditions, showcasing its competitive performance against conventional federated learning baselines while maintaining lower fusion complexity.
Methodology
The QFedAgent architecture consists of dual CNN encoders that process accelerometer and gyroscope signals to produce modality embeddings. These embeddings are fused using a VQC layer that captures cross-modal interactions through quantum entanglement. The framework employs a personalized federated learning approach, where only the encoder and VQC parameters are shared for aggregation, while client-specific adapters and classifiers remain private.
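The split between shared and private parameters can be sketched as a partial federated average. The key prefixes (`encoder.`, `vqc.`), the flat-list parameter format, and the sample-size weighting are assumptions for illustration; the summary only states that encoder and VQC parameters are aggregated while adapters and classifiers stay on the client.

```python
SHARED_PREFIXES = ("encoder.", "vqc.")  # assumed naming convention

def aggregate_shared(client_states, client_sizes):
    """Weighted average of shared parameters only (FedAvg-style, assumed weighting)."""
    total = sum(client_sizes)
    shared_keys = [k for k in client_states[0] if k.startswith(SHARED_PREFIXES)]
    return {
        key: [
            sum(size * state[key][i] for state, size in zip(client_states, client_sizes)) / total
            for i in range(len(client_states[0][key]))
        ]
        for key in shared_keys
    }

def apply_global(state, global_shared):
    """Overwrite shared entries; private adapters and classifiers are left untouched."""
    merged = dict(state)
    merged.update(global_shared)
    return merged

clients = [
    {"encoder.w": [1.0, 2.0], "vqc.theta": [0.0], "classifier.w": [9.0]},
    {"encoder.w": [3.0, 4.0], "vqc.theta": [4.0], "classifier.w": [-9.0]},
]
sizes = [1, 3]  # hypothetical local sample counts
global_shared = aggregate_shared(clients, sizes)
updated = [apply_global(state, global_shared) for state in clients]
```

After a round, every client holds the same encoder and VQC values but keeps its own classifier, which is what allows personalization under non-IID data.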
Results
The QFedAgent framework achieved a mean test accuracy of 97.7% on the OPPORTUNITY dataset, demonstrating its effectiveness in handling non-IID data distributions while significantly reducing the complexity of the fusion module.
Implications
QFedAgent has potential applications in privacy-sensitive domains such as healthcare monitoring and industrial automation, where efficient and accurate activity recognition is crucial. The integration of quantum computing techniques into federated learning frameworks could pave the way for more advanced and efficient machine learning models in distributed environments.
WARP: Weight-Space Analysis for Recovering Training Data Portfolios
NLP
Large Language Models
Interpretability
- WARP recovers domain mixtures from fine-tuned model weights, addressing the access asymmetry in AI research.
- The framework generates pseudo-checkpoints through model merging, allowing for the estimation of training data distributions.
- WARP outperforms traditional membership inference methods and variants with access to true training trajectories.
- The method remains robust across different training recipes, including overtraining scenarios.
Summary
The paper introduces WARP (Weight-space Analysis for Recovering Training Data Portfolios), a novel framework designed to recover the training data mixtures used to fine-tune foundation models from their released weights. The authors highlight the issue of access asymmetry in AI, where the training data recipes behind models are often proprietary and undisclosed, limiting researchers' ability to understand model behaviors and capabilities. WARP addresses this by interpolating between base and fine-tuned models using model merging techniques to generate pseudo-checkpoints that approximate the training trajectory. From these checkpoints, WARP extracts geometric features that represent the training data distribution and maps them to domain proportions using either a parameter-free softmax readout or a trained MLP projector. The framework was empirically validated using BERT and GPT-2, recovering domain mixtures with lower mean absolute error (MAE) than existing methods, including membership inference. WARP's robustness across various training scenarios, including early-stop and overtrained checkpoints, further underscores its utility in analyzing model training data.
Methodology
WARP employs model merging techniques to interpolate between base and fine-tuned models, creating pseudo-checkpoints that simulate the training trajectory. It then extracts geometric features from these checkpoints and maps them to domain proportions using either a parameter-free softmax readout or a multi-layer perceptron (MLP) trained on synthetic mixtures.
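A rough sketch of the two ends of this pipeline follows. Plain linear interpolation stands in for the unspecified model merging technique, and the per-domain scores fed to the softmax are hypothetical, since the summary does not define the geometric features. Only the interpolation and the parameter-free readout are shown.

```python
import math

def pseudo_checkpoints(base, tuned, num_steps):
    """Interpolate base + alpha * (tuned - base) for alpha = 1/num_steps, ..., 1."""
    checkpoints = []
    for step in range(1, num_steps + 1):
        alpha = step / num_steps
        checkpoints.append([b + alpha * (t - b) for b, t in zip(base, tuned)])
    return checkpoints

def softmax(scores):
    """Parameter-free readout: turn per-domain scores into proportions."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

base_weights = [0.0, 1.0]    # flattened base model weights (toy values)
tuned_weights = [2.0, 5.0]   # flattened fine-tuned weights (toy values)
trajectory = pseudo_checkpoints(base_weights, tuned_weights, num_steps=4)

domain_scores = [1.2, 0.3, -0.5]  # hypothetical features, one per candidate domain
mixture = softmax(domain_scores)
```

The last pseudo-checkpoint coincides with the fine-tuned model, and the readout always returns non-negative proportions that sum to one.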
Results
In controlled experiments with BERT and GPT-2, WARP achieved a mean absolute error (MAE) of 0.046 for BERT and 0.104 for GPT-2 across four text datasets, outperforming sample-level membership inference baselines and a variant that utilized the true training trajectory.
Implications
WARP's ability to recover training data portfolios from model weights can enhance transparency in AI, facilitate auditing for data contamination, and improve understanding of model behaviors, ultimately aiding in the responsible deployment of foundation models.
TiRex-2: Generalizing TiRex to Multivariate Data and Streaming
Time Series
- TiRex-2 generalizes the original TiRex model to multivariate time series forecasting.
- The model allows for streaming inference with constant computational costs per time step.
- It incorporates both past and future covariates while preserving causality.
- A synthetic coupling pipeline is introduced for scalable multivariate pretraining.
Summary
The paper introduces TiRex-2, a novel recurrent xLSTM-based time series foundation model designed to extend the capabilities of the original TiRex model to handle multivariate forecasting with both past and future covariates. Unlike existing Transformer-based models that struggle with quadratic complexity and full-history recomputation, TiRex-2 employs a memory-centric architecture that allows for constant per-patch cost during streaming. The model integrates a bidirectional time mixer and an asymmetric grouped-attention variate mixer to maintain causality while incorporating future covariates. A synthetic coupling pipeline is proposed to generate diverse multivariate training instances from univariate datasets, enhancing the model's training distribution. Empirical evaluations demonstrate that TiRex-2 achieves state-of-the-art zero-shot performance on benchmark datasets and maintains stable performance during streaming with constant inference costs.
Methodology
TiRex-2 utilizes a recurrent architecture based on xLSTM, combining a bidirectional time mixer with an asymmetric grouped-attention variate mixer. This design enables the model to process incoming data in a streaming fashion while maintaining causality and efficiently integrating future covariates. The synthetic coupling pipeline generates diverse multivariate samples from univariate datasets to enhance training.
Results
TiRex-2 demonstrates state-of-the-art zero-shot performance on GIFT-Eval and fev-bench datasets. The model remains stable when streaming to arbitrary context lengths and maintains a constant inference cost per patch, with 38.4M parameters in univariate mode and an additional 44.1M for multivariate forecasting.
Implications
The advancements presented in TiRex-2 have significant implications for real-time forecasting applications across various domains, such as finance, healthcare, and industrial monitoring, where timely and accurate predictions are critical. The model's ability to efficiently handle multivariate data and future covariates opens new avenues for developing robust forecasting systems.
GAIA: Geometry-Adaptive Operator Learning for Forward and Inverse Problems
Optimization
Theory
Efficient ML
- GAIA provides a unified framework for solving both forward and inverse problems on arbitrary geometries without retraining.
- The model utilizes a dual-pathway tokenization to explicitly encode geometric information, enhancing adaptability to varying geometries.
- GAIA sets new state-of-the-art results on multiple benchmarks, significantly reducing error rates in inverse problem tasks.
- The approach maintains competitive performance on forward problems while ensuring stable accuracy across varying resolutions.
Summary
The paper introduces GAIA, a novel operator learning model designed to address the limitations of existing geometry-adaptive neural operators, which primarily focus on forward problems where inputs and outputs share the same spatial domain. GAIA is capable of handling both forward and inverse problems on arbitrary geometries in a single pass, eliminating the need for retraining or iterative optimization. The model employs a dual-pathway tokenization that encodes both the domain boundary and the interior field distribution into geometry tokens, which are then used to condition integral transform layers through cross-attention mechanisms. This approach allows GAIA to adapt locally to geometric features, making it suitable for boundary value problems (BVPs) and inverse problems where inputs and outputs may reside on different domains. The authors validate GAIA across seven benchmarks, including new or extended tasks for electrical impedance tomography, optical tomography, and 3D Darcy flow, achieving state-of-the-art results and demonstrating significant improvements in accuracy compared to existing methods.
Methodology
GAIA employs a geometry-adaptive integral autoencoder architecture that encodes domain boundaries and interior field distributions into geometry tokens. It utilizes multi-head cross-attention to condition integral kernels on these tokens, allowing for local adaptation to geometric features. This design enables efficient single-pass solutions for both forward and inverse problems without the need for iterative optimization or retraining.
Results
GAIA achieves state-of-the-art performance on all evaluated inverse and boundary value problem tasks, reducing median relative L2 error by 64% on airfoil flow reconstruction and 27% on electrical impedance tomography compared to the next best method. It also outperforms all baselines across various shape categories in the modified mechanical components benchmark, while remaining competitive with specialized solvers on forward problems.
Implications
The development of GAIA has significant implications for fields requiring efficient and accurate solutions to PDEs, such as fluid dynamics, medical imaging, and structural analysis. Its ability to handle varying geometries in a single model could streamline workflows in these domains, reducing computational costs and improving accessibility to advanced simulation techniques.
Geometry-Aware R-Structured Kolmogorov-Arnold Networks
Theory
Interpretability
Efficient ML
- Introduction of GRS-KAN, integrating R-functions into KAN for enhanced interpretability and accuracy.
- Explicit analytical representation of geometric constraints improves predictive performance on regression tasks.
- Demonstrated up to 67% reduction in test RMSE in comparison to traditional KANs.
- Agnostic variant can automatically determine the relevance of geometric priors for learning tasks.
Summary
This paper introduces the Geometry-aware R-Structured Kolmogorov-Arnold Network (GRS-KAN), a novel hybrid neural architecture that integrates R-functions into the Kolmogorov-Arnold Network (KAN) framework. The GRS-KAN architecture combines smooth nonlinear structures learned by KAN branches with analytically encoded geometric or logical constraints using differentiable R-functions. This integration allows for explicit representation of discontinuities, feasible regions, and implicit geometric boundaries within a trainable neural model. The authors propose several variants of GRS-KAN, including additive, multiplicative, and agnostic branch-weighted architectures. The framework is evaluated on regression problems characterized by discontinuities with circular and rectangular supports. Numerical experiments demonstrate that the GRS-KAN models significantly enhance predictive accuracy and boundary localization, achieving up to a 67% reduction in test RMSE compared to standard KANs. Furthermore, the agnostic variant shows the capability to automatically assess the utility of geometric priors for specific learning tasks, thereby improving interpretability and performance in applications requiring formal verification.
Methodology
The GRS-KAN architecture employs KAN branches to learn smooth nonlinear structures while R-functions are used to analytically encode geometric and logical constraints. The architecture includes differentiable logical operations through R-conjunctions and R-disjunctions, allowing for the representation of complex geometric supports. The paper presents three architectural variants and conducts numerical experiments to evaluate their performance on regression problems with known discontinuities.
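To make R-conjunctions and R-disjunctions concrete, here is a small sketch using one common R-function system, x + y ∓ sqrt(x² + y²). Which system the paper uses is not stated in the summary, so this choice is an assumption, as are the example shapes. Each shape is given by a function that is positive inside and negative outside.

```python
import math

def r_and(x, y):
    """R-conjunction: positive only when both x and y are positive."""
    return x + y - math.sqrt(x * x + y * y)

def r_or(x, y):
    """R-disjunction: positive when at least one of x and y is positive."""
    return x + y + math.sqrt(x * x + y * y)

def rectangle(x, y, half_width=1.0, half_height=1.0):
    """Implicit rectangle: the intersection of a vertical and a horizontal strip."""
    return r_and(half_width ** 2 - x ** 2, half_height ** 2 - y ** 2)

def circle(x, y, cx=3.0, cy=0.0, radius=1.0):
    """Implicit circle: positive inside, negative outside."""
    return radius ** 2 - (x - cx) ** 2 - (y - cy) ** 2

def union(x, y):
    """Support made of the rectangle or the circle, as one differentiable formula."""
    return r_or(rectangle(x, y), circle(x, y))

inside_rect = rectangle(0.0, 0.0)
outside_rect = rectangle(2.0, 0.0)
inside_union = union(3.0, 0.0)
outside_union = union(0.0, 5.0)
```

Because the sign of the output encodes membership and the formula is differentiable away from the origin, such an expression can sit inside a trainable model as an explicit geometric prior.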
Results
The GRS-KAN models exhibited substantial improvements in predictive accuracy and boundary localization, with test RMSE reductions of up to 67% compared to standard KANs. The explicit geometric encoding facilitated better interpretability of the learned structures, and the agnostic variant successfully identified the benefits of geometric priors for various regression tasks.
Implications
The GRS-KAN framework has potential applications in safety-critical fields such as pharmaceutical manufacturing and scientific computing, where interpretability and formal verification are essential. The ability to incorporate explicit geometric constraints could enhance the robustness and reliability of machine learning models in these domains.
Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability
Interpretability
Efficient ML
Large Language Models
- Introduction of Expander SAEs, a parameter-efficient architecture for sparse coding.
- Demonstrated a significant reduction in learned decoder values while maintaining high reconstruction fidelity.
- Proposed a parallel implementation of OMP that optimizes inference speed and fidelity.
- Provided theoretical guarantees for identifiability of sparse codes under specific conditions.
Summary
This paper introduces Expander Sparse Autoencoders (Expander SAEs), a novel architecture designed to enhance the mechanistic interpretability of neural networks while maintaining parameter efficiency. Traditional Sparse Autoencoders (SAEs) rely on dense decoders that require a significant number of learned values, which can be computationally expensive, especially with large feature counts. Expander SAEs utilize a left-d-regular expander mask to reduce the number of learned decoder values from mn to dn, where d is much smaller than m. This architecture not only minimizes storage requirements but also improves the efficiency of the matching-pursuit correlation step in sparse coding. The authors demonstrate that varying the sparsity parameter d leads to a consistent trade-off between storage and reconstruction fidelity across several language models, achieving significant reductions in learned decoder values while retaining high fidelity in reconstruction. The paper also presents a parallel implementation of Orthogonal Matching Pursuit (OMP) that leverages the expander structure, yielding a balance between fast inference and high-fidelity decoding. The theoretical contributions include proofs of identifiability for noiseless k-sparse codes under specific conditions, supporting the architecture's effectiveness.
Methodology
The authors developed Expander SAEs by applying a left-d-regular expander mask to the encoder and decoder of traditional SAEs, which reduces the number of learned values. They conducted experiments on various language models to analyze the trade-off between storage and fidelity, and implemented a parallel version of OMP to enhance decoding efficiency. Theoretical proofs were provided to establish conditions for the identifiability of sparse codes.
Results
Experiments showed that the Expander SAE architecture, particularly with d=7 on the Qwen2.5-3B model, achieved 84% of the reconstruction fidelity of a full dense decoder while using 293 times fewer learned decoder values. The findings indicate a smooth storage-fidelity frontier across different models, confirming the architecture's effectiveness in balancing efficiency and performance.
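The storage arithmetic behind the reduction from mn to dn can be sketched as follows. The sizes m = 2048 and n = 4096 are assumed for illustration and are not taken from the paper. The mask is sampled at random with left degree d, which makes it left-d-regular but does not verify any expansion property.

```python
import random

def left_regular_support(m, n, d, seed=0):
    """For each of n features, pick d distinct coordinates out of m (left degree d)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(m), d)) for _ in range(n)]

def decode(code, support, values, m):
    """Reconstruct x = sum_j code[j] * atom_j, touching only d entries per active atom."""
    x = [0.0] * m
    for j, activation in code.items():  # sparse code: feature index -> activation
        for row, weight in zip(support[j], values[j]):
            x[row] += activation * weight
    return x

m, n, d = 2048, 4096, 7  # assumed sizes, for illustration only
support = left_regular_support(m, n, d)
values = [[1.0] * d for _ in range(n)]  # placeholder for the learned decoder values

dense_count = m * n    # learned values in a dense decoder
sparse_count = d * n   # learned values under the mask
ratio = dense_count / sparse_count  # equals m / d regardless of n

x = decode({0: 2.0}, support, values, m)
```

Under these assumed sizes the ratio is 2048 / 7, roughly 293, which is of the same order as the reduction reported above. The ratio depends only on m and d, not on the number of features.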
Implications
The proposed Expander SAEs could significantly impact the design of interpretable neural networks, particularly in applications requiring efficient storage and high fidelity in representation learning. This architecture may facilitate better understanding of neural network internals, aiding in the development of more interpretable AI systems.
EVOTS: Evolutionary Transformer Search for Time Series Forecasting
Time Series
Optimization
- Introduction of EVOTS, a modular evolutionary architecture search framework for time-series forecasting.
- Demonstration of superior performance of evolved architectures over hand-designed Transformer variants.
- Effective exploration of diverse architecture space without fixed design constraints.
- Strong performance gains in long-horizon and multivariate forecasting settings.
Summary
The paper introduces EVOTS, an evolutionary neural architecture search framework designed for multivariate time-series forecasting. Unlike traditional approaches that rely on fixed Transformer architectures, EVOTS employs a modular genome representation to dynamically compose attention, feed-forward, and projection components. This allows for the exploration of diverse architectural configurations without the constraints of hand-crafted design rules. The framework is evaluated on four benchmark datasets from the ETT family, demonstrating that the evolved architectures can achieve competitive or superior performance compared to a strong Transformer-based baseline, particularly in multivariate-to-multivariate forecasting settings. The results indicate that evolutionary search can effectively discover flexible and high-performing Transformer-like architectures, especially beneficial for long-horizon forecasting tasks. The paper emphasizes the potential of evolutionary computation to automate architectural design, reducing the need for extensive manual tuning and enhancing adaptability across various forecasting scenarios.
Methodology
The EVOTS framework encodes neural architectures as modular genomes and utilizes a steady-state evolutionary algorithm with weight inheritance to evolve their composition. A repair mechanism ensures structural validity during the evolutionary process, allowing for flexible architectural exploration.
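The loop below sketches a steady-state evolutionary search with a repair step on a toy genome. The block vocabulary follows the components named in the summary, but the validity rule in `repair`, the mutation operators, and `toy_loss` are hypothetical. Weight inheritance and real model training are left out.

```python
import random

BLOCKS = ("attention", "feed_forward", "projection")

def repair(genome):
    """Hypothetical validity rule: contain an attention block and end with a projection."""
    genome = list(genome)
    if "attention" not in genome:
        genome.insert(0, "attention")
    if not genome or genome[-1] != "projection":
        genome.append("projection")
    return genome

def mutate(genome, rng):
    """Insert, delete, or swap one block, then repair the result."""
    genome = list(genome)
    op = rng.choice(("insert", "delete", "swap"))
    if op == "insert" or len(genome) < 2:
        genome.insert(rng.randrange(len(genome) + 1), rng.choice(BLOCKS))
    elif op == "delete":
        del genome[rng.randrange(len(genome))]
    else:
        genome[rng.randrange(len(genome))] = rng.choice(BLOCKS)
    return repair(genome)

def toy_loss(genome):
    """Stand-in for validation error; a real run would train and evaluate the model."""
    return abs(genome.count("attention") - 2) + abs(len(genome) - 5)

def evolve(steps=200, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = [repair([rng.choice(BLOCKS)]) for _ in range(pop_size)]
    for _ in range(steps):
        parent = min(rng.sample(population, 2), key=toy_loss)  # binary tournament
        child = mutate(parent, rng)
        worst = max(range(pop_size), key=lambda i: toy_loss(population[i]))
        if toy_loss(child) <= toy_loss(population[worst]):
            population[worst] = child  # steady state: one replacement per step
    return min(population, key=toy_loss)

best = evolve()
```

Replacing one individual per step, rather than a whole generation, is what makes the scheme steady-state. The repair call guarantees that every genome entering the population is structurally valid.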
Results
The evolved architectures consistently achieved strong performance on the ETT benchmark datasets, particularly excelling in the multivariate-to-multivariate forecasting setting. They matched or exceeded the performance of the strongest Transformer-based models under identical experimental conditions, with notable improvements at longer prediction horizons.
Implications
The findings suggest that evolutionary architecture search can significantly enhance the adaptability and performance of time-series forecasting models. This approach may lead to more efficient model design processes and better forecasting accuracy across diverse applications, including energy management and industrial monitoring.
Learning the Supports for Categorical Critic in Reinforcement Learning
Reinforcement Learning
Optimization
Theory
- Introduces a dynamic support learning method that eliminates the need for pre-defined support intervals in value function estimation.
- Demonstrates that the mean-squared Bellman error is upper-bounded by the HL-Gauss loss, motivating the need for tighter support intervals.
- Formulates the dynamic support learning as a constrained optimization problem, allowing for automatic adaptation of supports.
- Empirical results show that the proposed method matches or improves upon existing HL-Gauss-based algorithms in continuous-control tasks.
Summary
This paper addresses the challenges of value function estimation in actor-critic based deep reinforcement learning (RL) by proposing a novel approach that dynamically learns the support intervals for value estimation. Traditionally, value functions are trained using mean squared error (MSE) against bootstrapped targets, but this method struggles with the non-stationary and stochastic nature of RL tasks. The authors investigate the Gaussian Histogram Loss (HL-Gauss), which reformulates value estimation as a classification problem, yet it requires pre-defined support intervals that can hinder performance. To overcome this limitation, the authors introduce a method that dynamically learns the lower and upper bounds of the support interval, allowing for a more flexible and adaptive approach. They derive a learning objective that simultaneously learns these bounds and the categorical representation of scalar values, proving that this objective provides a tighter upper bound on the mean-squared Bellman error compared to fixed supports. Empirical evaluations demonstrate that their dynamic support learning approach is competitive with existing methods and often outperforms them on various continuous-control tasks, highlighting the importance of adapting support intervals to the evolving nature of policies during training.
Methodology
The authors propose a constrained optimization framework that formulates dynamic support learning as an adversarial min-max game. This involves introducing a Lagrange multiplier to enforce a probability mass coverage constraint, enabling the neural network to adjust the support intervals dynamically while penalizing excessive width.
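For context, the fixed-support HL-Gauss target that the method builds on can be sketched as below: a scalar value is spread over histogram bins according to a Gaussian centered on it. The bounds `v_min` and `v_max` are fixed here. The paper's contribution is to learn them, which this sketch does not attempt. The bin count and `sigma` are assumed values.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hl_gauss_target(y, v_min, v_max, num_bins, sigma):
    """Mass of N(y, sigma^2) in each bin, renormalized over [v_min, v_max]."""
    width = (v_max - v_min) / num_bins
    edges = [v_min + i * width for i in range(num_bins + 1)]
    cdf = [normal_cdf((edge - y) / sigma) for edge in edges]
    total = cdf[-1] - cdf[0]
    return [(hi - lo) / total for lo, hi in zip(cdf, cdf[1:])]

def expected_value(probs, v_min, v_max):
    """Decode a categorical distribution back to a scalar using bin centers."""
    width = (v_max - v_min) / len(probs)
    centers = [v_min + (i + 0.5) * width for i in range(len(probs))]
    return sum(p * c for p, c in zip(probs, centers))

target = hl_gauss_target(y=3.0, v_min=0.0, v_max=10.0, num_bins=20, sigma=0.375)
decoded = expected_value(target, v_min=0.0, v_max=10.0)
```

The critic is then trained with cross-entropy against `target` instead of a squared error against `y`. If the true values fall outside or occupy only a small part of the chosen interval, many bins are wasted or the target is truncated, which is the motivation for adapting the support.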
Results
The proposed dynamic support learning method is competitive with fixed-support HL-Gauss algorithms on most continuous-control tasks and improves on them in several cases. The results indicate that dynamically adapted supports can outperform any fixed support, emphasizing the need for adaptability in reinforcement learning environments.
Implications
This work has significant implications for reinforcement learning, particularly in environments with sparse rewards or non-stationary dynamics. By allowing for dynamic adaptation of support intervals, the proposed method can enhance the robustness and efficiency of value function estimation, potentially leading to improved performance in various RL applications.
Muon as a Residual Connection
Optimization
Theory
- Muon can be interpreted as an implicit residual connection, enhancing representation preservation.
- Orthogonalizing updates sacrifices immediate gradient fidelity for better downstream usability.
- The paper provides a mechanistic explanation of Muon's effectiveness, accessible to a broader audience.
- The findings suggest new avenues for optimizer design that balance local and downstream performance.
Summary
In this paper, Hao Huang presents a novel interpretation of the Muon optimizer, proposing that it functions as an implicit residual connection during the training of neural networks. The Muon optimizer, known for its effectiveness in training large models, orthogonalizes updates to matrix parameters, which has led to various interpretations of its success. Huang argues that this orthogonalization sacrifices some immediate gradient fidelity but enhances representation preservation for downstream layers. By studying this trade-off in controlled linear optimization settings, the paper reveals that while Muon may slow down the fitting of local targets, it facilitates easier exploitation of learned representations by subsequent layers. This perspective not only provides a conceptual understanding of Muon but also suggests new design principles for optimizers that balance local descent with downstream usability, complementing existing explanations of Muon's behavior.
Methodology
The author conducts a theoretical analysis of the Muon optimizer, comparing its orthogonalized update mechanism to traditional residual connections in deep learning. The study involves controlled linear optimization settings to explore the trade-offs between gradient fidelity and representation usability.
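The orthogonalization step itself can be sketched as follows. This uses the classic cubic Newton-Schulz iteration to approximate the orthogonal polar factor of a gradient matrix. Muon is usually described with a tuned higher-order variant plus momentum, so treat this as a simplified stand-in rather than the optimizer's actual update.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def orthogonalize(grad, steps=20):
    """Approximate the orthogonal polar factor of grad via cubic Newton-Schulz."""
    norm = sum(v * v for row in grad for v in row) ** 0.5
    x = [[v / norm for v in row] for row in grad]  # singular values are now at most 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 * X - 0.5 * X X^T X pushes every singular value toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

gradient = [[3.0, 1.0], [0.0, 2.0]]
update = orthogonalize(gradient)
gram = matmul(update, transpose(update))  # close to the identity matrix
```

The result keeps the singular directions of the gradient but sets all singular values to roughly one. This discards magnitude information, which is the loss of immediate gradient fidelity that the paper weighs against downstream usability.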
Results
The analysis indicates that while Muon may lead to slower convergence on local targets, it significantly enhances the usability of representations for downstream layers. This finding supports the hypothesis that Muon functions similarly to residual connections, providing a clearer understanding of its optimization dynamics.
Implications
This work has implications for the design of future optimizers in deep learning, suggesting that incorporating principles from residual connections could lead to more effective training strategies. It also encourages researchers to explore new optimization frameworks that balance local descent with the usability of learned representations.
SemiScope: Disentangling Classifier Tuning and Joint Optimization in Semi-Supervised Security Classification
Optimization
Theory
Efficient ML
- SemiScope effectively disentangles the contributions of SSL optimization and classifier tuning.
- Significant performance improvements were observed with SemiScope compared to default SSL methods.
- Classifier hyperparameter optimization alone accounts for a substantial portion of the gains from the joint pipeline.
- A simpler approach using Self-Training and classifier tuning can achieve similar results with less complexity.
Summary
The paper introduces SemiScope, a framework designed to analyze and optimize semi-supervised learning (SSL) in the context of security classification, where labeled data is scarce. Traditional approaches often treat SSL as a black box, using default parameters and failing to address issues like class imbalance from pseudo-labeling. The authors aim to disentangle the effects of SSL pipeline optimization and classifier tuning to better understand their contributions to performance improvements. SemiScope employs Bayesian Optimization to jointly select SSL settings, confidence filtering, oversampling strategies, classifier families, and hyperparameters. The study compares SemiScope against a control method, Tuned-Clf, which fixes SSL settings to defaults while allowing for extensive classifier tuning. Results show that SemiScope outperforms all default SSL baselines across five datasets, achieving significant gains in g-measure. However, when compared to Tuned-Clf under equal budgets, the performance gains are largely attributable to classifier tuning rather than joint optimization. The findings suggest that a simpler approach—using Self-Training and tuning the classifier—can yield comparable results, indicating that the complexity of joint optimization may not be necessary for practical applications.
Methodology
The authors developed SemiScope as a controlled analysis tool that utilizes Bayesian Optimization to optimize various components of the SSL pipeline, including SSL settings, classifier families, and hyperparameters. They compared the performance of SemiScope against a control method (Tuned-Clf) that maintained default SSL settings while allowing for extensive classifier tuning, using a fixed budget for both methods.
Results
SemiScope outperformed all default SSL baselines on five datasets, with improvements in g-measure ranging from 0.7 to 12.7 points. However, when compared to Tuned-Clf under equal budget conditions, the performance was statistically equivalent on four out of five datasets, indicating that most of the observed gains could be attributed to classifier tuning rather than joint optimization.
Implications
The findings suggest that security practitioners can achieve effective classification results by focusing on tuning classifiers rather than relying on complex joint optimization of SSL pipelines. This could lead to more efficient use of resources and simpler implementation in real-world security applications.
I2RiMA: Spectral Riemannian Representation with Temporal Attention for Mental Stress Detection based on EEG Signals
Time Series
- I2RiMA constructs frequency-specific spatial covariance matrices and maps them to the SPD tangent space.
- The model employs frequency cluster aggregation for effective feature selection and redundancy reduction.
- An intra-inter slice attention module captures both local and global temporal dependencies in EEG data.
- I2RiMA achieves state-of-the-art performance in cross-subject EEG stress detection.
I2RiMA: Spectral Riemannian Representation with Temporal Attention for Mental Stress Detection based on EEG Signals
Summary
The paper presents I2RiMA, an innovative Intra-Inter Riemannian Manifold Attention Network designed for detecting mental stress through EEG signals. The authors identify significant challenges in cross-subject EEG stress detection due to the subject-dependent and frequency-specific nature of stress-related patterns. Traditional Riemannian methods primarily focus on time-domain spatial covariance, neglecting the critical role of neural oscillations in high-level cognitive state decoding. Furthermore, existing tokenization approaches fragment temporal coherence, which is vital for accurate classification. I2RiMA addresses these issues by constructing spatial covariance matrices at each frequency point and mapping them to the SPD tangent space, thereby preserving channel-wise geometry and frequency-specific cues. The model incorporates frequency cluster aggregation to select informative spectral components and reduce redundancy, resulting in compact representations. Additionally, an intra-inter slice attention module is introduced to integrate local slice-level dynamics with global temporal context across EEG sequences. Experimental results demonstrate that I2RiMA outperforms five state-of-the-art baselines, achieving balanced accuracies of 77.59%, 75.88%, and 82.78% on three datasets while maintaining efficiency with only 1.60M parameters and 31.95M FLOPs.
Methodology
I2RiMA utilizes a two-pronged approach: it constructs spatial covariance matrices independently at each frequency point, preserving the Riemannian geometry of EEG signals, and implements a frequency cluster aggregation module to enhance feature selection. The intra-inter slice attention fusion module is designed to integrate local slice-level spectral dynamics with global temporal context, allowing for a more comprehensive understanding of EEG sequences.
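The per-frequency covariance and tangent-space steps can be sketched as follows. This is an assumption-laden illustration rather than the paper's implementation: the covariance at each frequency is taken to be the real part of a window-averaged cross-spectral matrix, the tangent space is taken at the identity (a log-Euclidean map), and the function names and window length are hypothetical.

```python
import numpy as np

def cospectral_covariances(X, win=64, eps=1e-6):
    """X: (channels, samples). Returns (freqs, channels, channels) SPD matrices."""
    n_ch, n_s = X.shape
    n_win = n_s // win
    segs = X[:, :n_win * win].reshape(n_ch, n_win, win)
    F = np.fft.rfft(segs * np.hanning(win), axis=2)      # (channels, windows, freqs)
    S = np.einsum("iwf,jwf->fij", F, F.conj()) / n_win   # cross-spectral matrices
    C = S.real                                           # real symmetric, PSD
    ridge = eps * np.trace(C, axis1=1, axis2=2) / n_ch   # small ridge keeps C SPD
    return C + ridge[:, None, None] * np.eye(n_ch)

def spd_logm(C):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def tangent_vector(C):
    """Vectorize log(C); off-diagonals are scaled by sqrt(2) to keep the norm."""
    L = spd_logm(C)
    iu = np.triu_indices(L.shape[0], k=1)
    return np.concatenate([np.diag(L), np.sqrt(2.0) * L[iu]])

# Toy EEG-like segment: 4 channels, 1024 samples of white noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 1024))
covs = cospectral_covariances(X)
feats = np.stack([tangent_vector(C) for C in covs])  # one feature row per frequency
```

Each row of `feats` is a Euclidean vector, so ordinary layers such as attention or clustering can operate on it while the covariance geometry is still respected.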
Results
I2RiMA achieved balanced accuracies of 77.59%, 75.88%, and 82.78% on the MIST Control, MIST Stress, and SEED datasets, respectively, outperforming five state-of-the-art methods while maintaining a low computational footprint with only 1.60M parameters and 31.95M FLOPs.
Implications
The findings suggest that I2RiMA could significantly enhance the accuracy of mental stress detection in real-world applications, particularly in non-clinical settings where continuous monitoring is essential. This model could pave the way for more effective mental health interventions and stress management strategies.
Decomposer: Learning to Decompile Symbolic Music to Programs
Generative Models
Reinforcement Learning
Audio & Speech
- DECOMPOSER addresses the challenge of converting MIDI to Strudel code, enhancing the readability and editability of musical programs.
- The framework utilizes a two-stage approach combining supervised fine-tuning and reinforcement learning to optimize both faithfulness and readability.
- A synthetic corpus, STRUDEL-SYNTH, is created to facilitate the training process, addressing the lack of naturally paired data.
- Experimental results show DECOMPOSER achieves superior performance compared to existing LLMs and heuristic converters.
Decomposer: Learning to Decompile Symbolic Music to Programs
Summary
The paper presents DECOMPOSER, a novel framework designed to tackle the inverse problem of recovering high-level musical instructions from symbolic music, specifically through MIDI-to-Strudel decompilation. The authors identify two main challenges: the scarcity of paired MIDI-Strudel data and the risk of generating unreadable code through simple note-by-note transliteration. To address these issues, the framework employs a two-stage training process. First, it utilizes a synthetic corpus, STRUDEL-SYNTH, to provide supervised fine-tuning for the model, enabling it to produce valid Strudel code. Second, it employs reinforcement learning to optimize the decompilation process, focusing on both the faithfulness of MIDI reconstruction and the readability of the generated code. The evaluation demonstrates that DECOMPOSER significantly outperforms existing closed-source large language models (LLMs) in terms of MIDI reconstruction accuracy while also producing more readable and diverse Strudel code. This work not only advances the field of symbolic music processing but also highlights the potential for applying similar decompilation techniques across various modalities.
Methodology
The DECOMPOSER framework consists of two main stages: (1) supervised fine-tuning using a synthetic corpus of paired MIDI and Strudel programs to teach the model valid code generation, and (2) reinforcement learning that optimizes the decompilation objective by rewarding the model for generating programs that faithfully reconstruct the input MIDI while maintaining code readability.
Results
DECOMPOSER significantly improves MIDI-to-Strudel decompilation, achieving higher MIDI reconstruction faithfulness and producing more readable code compared to both closed-source LLMs and heuristic converters. The evaluation on synthetic and real-world benchmarks indicates a substantial enhancement in performance.
Implications
The findings suggest that DECOMPOSER can facilitate more intuitive and accessible music programming, allowing musicians and developers to better manipulate and understand musical structures. Additionally, the approach may inspire similar decompilation frameworks in other fields, enhancing the usability of generated artifacts.
QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling
NLP
Large Language Models
Reinforcement Learning
Efficient ML
- QuasiMoTTo improves sample efficiency by generating correlated samples instead of independent ones.
- The method utilizes quasi-Monte Carlo techniques to ensure better coverage of the output space.
- Empirical results show that QuasiMoTTo can achieve similar accuracy with significantly fewer samples.
- The approach is applicable to both language model inference and reinforcement learning.
QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling
Summary
The paper introduces QuasiMoTTo, a novel approach to improve sample efficiency in scaling inference compute for language models and reinforcement learning (RL). Traditional parallel sampling methods generate independent samples, leading to redundancy and wasted compute resources. QuasiMoTTo addresses this by generating correlated samples using a reparameterization of autoregressive sampling combined with quasi-Monte Carlo (QMC) methods. This approach allows for the generation of samples that are more evenly distributed across the output space, reducing redundancy while maintaining the marginal distribution of the language model. The authors develop an unbiased bootstrap estimator to evaluate the performance of correlated samplers, demonstrating that QuasiMoTTo can achieve pass@k accuracy comparable to independent samples with 25-47% fewer samples. Additionally, in policy-gradient RL applications, QuasiMoTTo matches the performance of independent sampling with 50% fewer training steps, highlighting its potential for enhancing learning efficiency.
Methodology
QuasiMoTTo reparameterizes autoregressive sampling as inverse-CDF sampling and uses quasi-Monte Carlo (QMC) methods to draw the underlying uniform variates, so that they cover the unit hypercube more evenly than independent draws. This yields correlated samples that maintain the marginal distribution of the language model while improving coverage and reducing redundancy.
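To make the inverse-CDF view concrete, here is a small sketch under stated assumptions. The toy next-token model, the Halton construction, and the shared random shift are illustrative choices, not necessarily the QMC scheme the paper uses. Each sequence is driven by one point in the unit hypercube, with one coordinate per decoding step.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    x, f = 0.0, 1.0 / base
    while i > 0:
        x += (i % base) * f
        i //= base
        f /= base
    return x

def shifted_halton(n, primes, rng):
    """n Halton points with one shared random shift (mod 1) per dimension.

    The shift makes every individual point uniform on the unit cube, so each
    sample keeps the model's marginal distribution.
    """
    shift = [rng.random() for _ in primes]
    return [[(radical_inverse(i, p) + s) % 1.0 for p, s in zip(primes, shift)]
            for i in range(n)]

def inverse_cdf_token(probs, u):
    """Pick the first token whose cumulative probability exceeds u."""
    cdf = list(accumulate(probs))
    return min(bisect_right(cdf, u), len(probs) - 1)

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a language model's next-token distribution."""
    if not prefix:
        return [0.5, 0.3, 0.2]
    return [0.6, 0.3, 0.1] if prefix[-1] == 0 else [0.2, 0.3, 0.5]

def sample_sequence(us):
    """Decode one sequence, consuming one uniform variate per step."""
    seq = []
    for u in us:
        seq.append(inverse_cdf_token(toy_next_token_probs(seq), u))
    return tuple(seq)

points = shifted_halton(8, primes=[2, 3], rng=random.Random(0))
seqs = [sample_sequence(p) for p in points]
first_token_zero = sum(1 for s in seqs if s[0] == 0)
```

Because the eight first coordinates are evenly spaced around the unit interval, exactly half of the sequences start with token 0, matching its probability of 0.5. Independent sampling would only match that proportion on average.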
Results
QuasiMoTTo matches the pass@k accuracy of independent sampling with 25-47% fewer samples across four reasoning benchmarks. In reinforcement learning, it achieves comparable performance to independent sampling with 50% fewer training steps, demonstrating enhanced sample efficiency.
Implications
The findings suggest that QuasiMoTTo could significantly reduce computational costs in both language model inference and reinforcement learning, making it a valuable tool for improving the efficiency of machine learning applications. This method could lead to advancements in the capabilities of language models and more effective training strategies in RL.
DemoPSD: Disagreement-Modulated Policy Self-Distillation
NLP
Large Language Models
Reinforcement Learning
- DemoPSD selectively adopts teacher guidance based on distribution consistency.
- The framework mitigates privileged information leakage and preserves exploration.
- DemoPSD outperforms existing methods like GRPO and SDPO in experiments.
- The approach balances learning from the teacher with the student's reasoning.
DemoPSD: Disagreement-Modulated Policy Self-Distillation
Summary
The paper introduces DemoPSD, a novel framework for on-policy self-distillation (OPSD) aimed at improving the training of large language models (LLMs) by addressing issues related to privileged information leakage and exploration suppression. Traditional OPSD methods have been found to lead to overfitting and hinder generalization due to the dense token-level supervision provided by the teacher model, which is conditioned on privileged information. DemoPSD proposes a selective adoption mechanism where the student model utilizes the teacher's guidance only when their output distributions are consistent. When significant divergence occurs, indicating potential over-reliance on privileged information, the student relies more on its own reasoning. This is achieved by steering the student towards a reverse-KL barycenter target, which balances the learning from the teacher with the preservation of the student's reasoning capabilities. The framework is shown to effectively mitigate privileged information leakage and maintain exploration capacity, leading to improved performance across various scientific domains.
Methodology
DemoPSD employs a selective adoption mechanism where the student model adopts the teacher's guidance when their output distributions are similar and relies on its own reasoning when they diverge. It uses a reverse-KL barycenter target to adaptively control the blending of teacher and student distributions at each token position, thereby balancing the learning process.
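One way to read the reverse-KL barycenter is through a standard identity: the distribution q that minimizes lam * KL(q || teacher) + (1 - lam) * KL(q || student) is the normalized geometric mixture of the two. The sketch below assumes that reading. The `agreement_weight` gate is a hypothetical rule for choosing the blend weight from the disagreement, not the paper's formula.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions with strictly positive p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def reverse_kl_barycenter(teacher, student, lam):
    """Normalized geometric mixture teacher**lam * student**(1 - lam)."""
    unnorm = [t ** lam * s ** (1.0 - lam) for t, s in zip(teacher, student)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def objective(q, teacher, student, lam):
    """Weighted sum of reverse KL divergences that the barycenter minimizes."""
    return lam * kl(q, teacher) + (1.0 - lam) * kl(q, student)

def agreement_weight(teacher, student, temperature=1.0):
    """Hypothetical gate: trust the teacher less as the two distributions diverge."""
    return math.exp(-kl(student, teacher) / temperature)

# Toy next-token distributions at a single position.
teacher = [0.7, 0.2, 0.1]
student = [0.2, 0.5, 0.3]
lam = agreement_weight(teacher, student)
target = reverse_kl_barycenter(teacher, student, lam)
```

With `lam` near 1 the target collapses to the teacher, and with `lam` near 0 it stays at the student, which mirrors the selective-adoption behavior described above.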
Results
The experiments conducted on the SciKnowEval dataset across four scientific fields show that DemoPSD maintains higher training entropy (33-98% higher than SDPO) and achieves better validation accuracy, indicating improved exploration and generalization capabilities. The results demonstrate that DemoPSD effectively mitigates privileged information leakage and enhances the model's reasoning abilities.
Implications
The findings suggest that DemoPSD can be a valuable framework for training LLMs in various reasoning tasks, potentially leading to more robust models that generalize better across different domains. This approach could be applied in scenarios where maintaining independent reasoning is critical, such as in scientific research and complex problem-solving.
When Context Compensates for Sparse Event History: AlphaEarth for Spatio-Temporal Point-Process Forecasting
Time Series
- AlphaEarth embeddings provide a standardized method for incorporating spatial context into forecasting models.
- The integration of contextual information significantly improves predictive performance in spatio-temporal point-process models, especially when local event histories are sparse.
- The study demonstrates that the benefits of using external spatial context diminish as more event history is accumulated, but remain positive even with longer histories.
- The research emphasizes the need for models to leverage both event history and contextual information for better forecasting accuracy.
When Context Compensates for Sparse Event History: AlphaEarth for Spatio-Temporal Point-Process Forecasting
Summary
This paper investigates whether exogenous spatial context can improve spatio-temporal point-process (STPP) models, particularly when local event histories are sparse. The authors integrate AlphaEarth (AE) embeddings, a standardized geospatial representation, into a log-Gaussian Cox process (LGCP) framework. They compare an event-only LGCP model with an AE-augmented version across various history lengths when forecasting emergency medical services (EMS) demand in Montgomery County, Pennsylvania. The study finds that AE embeddings significantly improve predictive performance, especially with limited historical data: contextual information stabilizes forecasts when local event history is insufficient, and the largest gains appear in sparse-history scenarios. As more event history accumulates, the advantage of AE diminishes but remains positive. This research highlights the importance of integrating contextual data into forecasting models to improve their robustness and accuracy in real-world applications.
Methodology
The authors utilize a log-Gaussian Cox process (LGCP) framework to model spatio-temporal events, comparing two configurations: an event-only model and an AE-augmented model that incorporates spatial context through AlphaEarth embeddings. They conduct experiments across eight held-out regions, varying the length of historical data used for forecasting and evaluating the models' predictive performance.
Results
The AE-augmented model shows a 2–6× improvement in predictive density in scenarios with only 1–2 weeks of training data, tapering to a 10–20% improvement with 20–104 weeks of training history. These results indicate that contextual information significantly stabilizes spatially transferred forecasts when local event history is limited.
Implications
The findings suggest that integrating external spatial context can enhance the reliability of forecasting models in various applications, particularly in emergency services and other domains where event histories may be sparse. This approach could lead to more effective resource allocation and planning in critical services.
Efficient Temporal Point Processes via Monotone Alternating Splines
Time Series
Efficient ML
Theory
- Identifies fundamental limitations of Monotone Neural Networks in CCIF modeling.
- Proposes Monotone Alternating Splines (MAS) to enhance flexibility and efficiency.
- Establishes a theoretical foundation for MAS, including generalization error analysis.
- Demonstrates superior performance of MAS on synthetic and real-world datasets.
Efficient Temporal Point Processes via Monotone Alternating Splines
Summary
This paper addresses the limitations of existing Monotone Neural Networks (MNNs) in modeling Cumulative Conditional Intensity Functions (CCIFs) for Temporal Point Processes (TPPs). The authors identify three structural deadlocks in MNNs: convexity restrictions, saturation limits, and violations of CCIF requirements, which hinder their ability to capture complex temporal dynamics. To overcome these challenges, the paper introduces a novel framework called Monotone Alternating Splines (MAS). MAS separates the modeling of CCIF into two components: an interpolation component that uses piecewise monotone splines for accurate fitting of complex TPP sequences, and an extrapolation component that ensures global monotonicity and robust generalization for future predictions. The authors provide a theoretical foundation for MAS, analyzing its generalization error and demonstrating its superior fitting capabilities compared to MNNs. Extensive experiments on synthetic and real-world datasets show that MAS significantly outperforms existing methods, offering a more flexible and computationally efficient approach to TPP modeling.
Methodology
The authors propose the Monotone Alternating Splines (MAS) framework, which consists of an interpolation component using piecewise monotone splines for accurate fitting and an extrapolation component that maintains global monotonicity. Theoretical analysis is conducted to decompose generalization error and compare approximation capabilities with MNNs.
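The split between an interpolation part and a monotone extrapolation part can be illustrated with a deliberately simpler stand-in. The sketch below uses a piecewise-linear cumulative intensity instead of the paper's splines, with softplus-constrained slopes so the function can never decrease. The class name and parameterization are assumptions.

```python
import math
from bisect import bisect_right

def softplus(x):
    """Smooth map from any real number to a strictly positive slope."""
    return math.log1p(math.exp(x))

class MonotoneCCIF:
    """Piecewise-linear cumulative intensity with a linear tail (t >= knots[0])."""

    def __init__(self, knots, raw_slopes, raw_tail_slope):
        self.knots = knots                                  # increasing
        self.slopes = [softplus(r) for r in raw_slopes]     # one per interval
        self.tail = softplus(raw_tail_slope)                # extrapolation slope
        self.cum = [0.0]                                    # value at each knot
        for i, s in enumerate(self.slopes):
            self.cum.append(self.cum[-1] + s * (knots[i + 1] - knots[i]))

    def _segment(self, t):
        i = bisect_right(self.knots, t) - 1
        slope = self.slopes[i] if i < len(self.slopes) else self.tail
        return i, slope

    def intensity(self, t):
        """Derivative of the cumulative intensity at time t."""
        return self._segment(t)[1]

    def cumulative(self, t):
        i, slope = self._segment(t)
        return self.cum[i] + slope * (t - self.knots[i])

def log_likelihood(model, events, horizon):
    """Point-process log-likelihood: sum of log intensities minus total mass."""
    return sum(math.log(model.intensity(t)) for t in events) - model.cumulative(horizon)

# With all slopes equal to 2, the model reduces to a homogeneous Poisson process.
raw = math.log(math.exp(2.0) - 1.0)           # softplus(raw) is 2
model = MonotoneCCIF([0.0, 1.0, 2.0], [raw, raw], raw)
ll = log_likelihood(model, [0.3, 1.2, 2.7], horizon=3.0)
```

Positivity of every slope is what guarantees monotonicity here. A spline-based version would need an analogous constraint on its coefficients.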
Results
MAS achieves superior performance in modeling CCIFs, significantly reducing approximation errors and improving computational efficiency compared to existing MNN-based approaches. The framework demonstrates strong fitting accuracy and robust generalization across various datasets.
Implications
The MAS framework can be applied in various domains where temporal point processes are relevant, such as criminology, finance, and neuroscience, providing more accurate forecasting of future events based on historical data.
StateFlow: Dual-State Recurrent Modeling for Long-Horizon Time Series Forecasting
Time Series
- Introduces StateFlow, a dual-state recurrent framework for long-horizon time series forecasting.
- Extends VARNN to capture both primary temporal dynamics and structured local prediction deviations.
- Employs a two-stage optimization strategy to enhance forecasting stability and performance.
- Achieves competitive results against linear, recurrent, convolutional, and Transformer-based models.
StateFlow: Dual-State Recurrent Modeling for Long-Horizon Time Series Forecasting
Summary
The paper addresses the challenges of long-horizon multivariate time series forecasting (LTSF), particularly issues related to non-stationarity, regime shifts, and error accumulation. The authors extend the Variability-Aware Recursive Neural Network (VARNN) to create StateFlow, a recurrent forecasting framework that utilizes VARNN as a dual-state backbone. This framework captures two complementary signals from the lookback sequence: a hidden-state trajectory representing primary temporal dynamics (like trend and seasonality) and a residual-memory trajectory that accounts for structured local prediction deviations. A chunk-based decoder is employed to summarize these trajectories for direct multi-step forecasting. The authors implement a two-stage optimization strategy, first training the VARNN encoder for one-step predictions and then training a horizon-specific decoder for multi-step forecasting. Experimental results demonstrate that StateFlow achieves competitive performance against various strong baselines while maintaining a compact model design and linear computational complexity.
Methodology
The methodology involves extending the VARNN architecture to support long-horizon forecasting by using dual-state representations: a hidden-state trajectory for primary dynamics and a residual-memory trajectory for local deviations. A chunk-based decoder summarizes these trajectories for multi-step predictions. The two-stage optimization first trains the VARNN encoder with a one-step prediction objective, followed by training a horizon-specific decoder for direct multi-step forecasting.
Results
StateFlow demonstrated competitive performance on standard LTSF benchmarks against strong baselines, including linear, recurrent, convolutional, and Transformer-based models. The results indicate that incorporating deviation dynamics into recurrent models can provide an efficient alternative to attention-based methods for long-horizon forecasting.
Implications
The findings suggest that recurrent architectures can remain competitive in LTSF by explicitly modeling deviation dynamics, which may lead to more robust forecasting in various applications such as energy demand prediction, traffic analysis, and financial modeling.
PRISM: Prioritized Channel Importance with Semi-supervised Domain Adaptation for Cross-Subject EEG Emotion Recognition
Time Series
- PRISM utilizes a lightweight expert ensemble for adaptive channel prioritization in EEG emotion recognition.
- The framework integrates semi-supervised domain adaptation to enhance cross-subject generalization under limited labels.
- PRISM achieves superior performance on benchmark datasets compared to existing state-of-the-art methods.
- The model is designed to be plug-and-play, allowing easy integration with existing architectures.
PRISM: Prioritized Channel Importance with Semi-supervised Domain Adaptation for Cross-Subject EEG Emotion Recognition
Summary
The paper introduces PRISM, a novel framework designed to enhance cross-subject emotion recognition from EEG signals by addressing two main challenges: channel redundancy and inter-subject variability. PRISM employs a lightweight expert ensemble to assign differentiable, data-dependent weights to EEG channels, effectively amplifying informative electrodes while suppressing less relevant ones. Additionally, it incorporates a semi-supervised domain adaptation strategy that utilizes unlabeled data through confidence-filtered pseudo-labels to drive consistency regularization and domain alignment. This dual approach allows PRISM to improve model generalization and robustness in scenarios with limited labeled data. The framework is model-agnostic, making it compatible with various time-series architectures, and it has been validated on public benchmarks such as DEAP, DREAMER, and SEED, where it consistently outperformed state-of-the-art methods, demonstrating its effectiveness in real-world applications of EEG emotion recognition.
Methodology
PRISM employs a backbone network to encode spatiotemporal EEG features, augmented by a lightweight expert ensemble that learns adaptive per-channel weights. It also implements a semi-supervised domain adaptation strategy that combines supervised learning, consistency regularization, and domain alignment using confidence-filtered pseudo-labels from unlabeled data.
Results
PRISM demonstrated significant improvements in emotion recognition accuracy across multiple datasets (DEAP, DREAMER, SEED) compared to state-of-the-art methods, particularly in scenarios with limited annotations. The framework's ability to adaptively prioritize channels and align domains contributed to its robust performance.
Implications
The findings suggest that PRISM can be effectively utilized in real-world EEG emotion recognition applications, particularly in settings where labeled data is scarce. Its adaptability and integration capabilities make it a valuable tool for researchers and practitioners in neuropsychology and affective computing.
From Pixels to Temporal Correlations: Learning Informative Representations for Reinforcement Learning Pre-training
Reinforcement Learning
Computer Vision
Robotics
- Introduction of a temporal correlation space for better representation learning in RL.
- Development of Multi-scale Temporal Contrastive Learning (MTCL) to model temporal correlations.
- Balanced attention to different elements in videos enhances representation quality.
- Extensive experiments show significant improvements in sample efficiency and performance.
From Pixels to Temporal Correlations: Learning Informative Representations for Reinforcement Learning Pre-training
Summary
This paper addresses the challenge of improving sample efficiency and performance in Reinforcement Learning (RL) through unsupervised pre-training on large-scale action-free internet videos. Existing methods primarily rely on single-step transition prediction and image reconstruction, which tend to overlook crucial elements that occupy only a small proportion of the pixels. To tackle this issue, the authors propose a 'temporal correlation space' that differentiates elements in videos by their motion characteristics. They introduce Multi-scale Temporal Contrastive Learning (MTCL), which models temporal correlations at multiple scales so that attention is balanced across different elements. The resulting representations are more informative and improve policy learning in downstream tasks. Experiments show that MTCL significantly improves both sample efficiency and asymptotic performance across three benchmarks (DMControl Remastered, Meta-World, and CARLA), achieving state-of-the-art results.
Methodology
The authors propose a two-part contrastive learning framework within the MTCL method. It consists of Multi-scale Motion-aware Learning (MML) and Static Appearance-aware Learning (SAL), which separately model motion and static features across different temporal scales. This allows for a more comprehensive understanding of video elements, ensuring that both dynamic and static information are adequately represented.
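For readers unfamiliar with temporal contrastive objectives, the following is a generic sketch of an InfoNCE loss averaged over several temporal offsets. It is not the paper's MML or SAL module: the offsets, the temperature, and the use of other frames from the same clip as negatives are all assumptions made for illustration.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Row i of `positives` is the positive for row i of `anchors`.

    All other rows of `positives` serve as negatives for that anchor.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def multi_scale_temporal_loss(z, scales=(1, 4, 16)):
    """z: (T, d) per-frame embeddings of one clip; needs T > min(scales).

    Pairs each frame with the frame k steps later, for every offset k.
    """
    losses = [info_nce(z[:-k], z[k:]) for k in scales if k < len(z)]
    return float(np.mean(losses))

# Toy embeddings: 32 frames of 16-dimensional features.
rng = np.random.default_rng(0)
z = rng.standard_normal((32, 16))
loss = multi_scale_temporal_loss(z)
matched = info_nce(z, z)                               # positives equal anchors
unmatched = info_nce(z, rng.standard_normal((32, 16)))  # unrelated positives
```

Small offsets reward features that track fast motion, while large offsets reward features that persist over time. This is one intuition for why several scales are combined.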
Results
The experimental results indicate that the MTCL method leads to significant improvements in sample efficiency and asymptotic performance across the evaluated downstream tasks. The method outperforms existing approaches, demonstrating its effectiveness in learning informative representations from action-free video data.
Implications
The findings suggest that leveraging temporal correlations in video data can substantially enhance the pre-training process for RL, potentially leading to more efficient learning and better performance in real-world applications. This approach could be applied to various domains where video data is available but lacks action labels.
Automatic Detection of Stress from Speech in the Trier Social Stress Test
Audio & Speech
- Automatic speech analysis can effectively differentiate between stressed and non-stressed speech.
- Physiological stress responses can be predicted from acoustic-prosodic features of speech.
- The study utilized a between-subject design to enhance the reliability of stress detection.
- Feature importance analysis identified key predictors for stress detection performance.
Automatic Detection of Stress from Speech in the Trier Social Stress Test
Summary
This study explores the automatic detection of stress through speech during the Trier Social Stress Test (TSST) and a non-stressful control condition (friendly-TSST). The research involved 50 participants, with the aim of differentiating between stressful and non-stressful speech and predicting physiological and affective stress responses. The authors developed a processing pipeline that included speaker diarization and machine learning models, achieving stress detection performance significantly above baseline levels. The study found that acoustic-prosodic features could partially predict physiological responses, such as salivary cortisol and alpha-amylase levels, as well as self-reported affect. Feature importance analyses highlighted key predictors that contributed to the model's performance, demonstrating that speech can serve as a reliable and unobtrusive indicator of stress across multiple dimensions.
Methodology
The study involved collecting speech data from 50 healthy German-speaking university students who were randomly assigned to either the TSST or the friendly-TSST. Speech was recorded during a simulated job interview, and physiological measures (salivary cortisol and alpha-amylase) were collected before and after the tasks. Machine learning models were employed to analyze the acoustic features of the speech data, with a focus on speaker diarization and feature importance analysis.
Results
The results indicated that the machine learning models achieved a stress detection performance significantly above the mean baseline. Additionally, the acoustic-prosodic features extracted from speech were found to partially predict physiological stress responses, such as salivary cortisol, as well as self-reported affect. The feature importance analysis revealed specific predictors that were most informative for distinguishing between stressed and non-stressed speech.
Implications
The findings suggest that speech can be a valuable and unobtrusive biomarker for assessing stress in both research and clinical settings. This approach could facilitate more frequent and standardized stress assessments without the need for invasive physiological measures. The study opens avenues for further research into speech-based stress detection and its applications in mental health monitoring and intervention.
Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space
Optimization
Theory
Efficient ML
- Introduces a probabilistic framework for model merging that improves upon traditional geometric methods.
- Models each task-specific solution as an energy-based expert, allowing for better aggregation of update directions.
- Addresses the limitations of Gaussian assumptions in existing methods by employing a heavy-tailed PoE design.
- Demonstrates significant performance improvements over state-of-the-art merging techniques in empirical tests.
Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space
Summary
This paper presents a novel approach to model merging, which combines multiple task-specific models into a single multi-task model without requiring additional data-driven fine-tuning. Traditional methods rely on geometric properties of local solution spaces, which often fail to account for the statistical usefulness of each task-specific update direction. The authors propose a probabilistic framework for model merging that treats each task-specific model as an energy-based model (EBM) expert within a product of experts (PoE). This framework allows for a more nuanced aggregation of update directions, where the confidence in each direction is informed by its support across tasks. The authors observe that existing methods often assume Gaussian distributions for directional residuals, an assumption that does not match the heavy-tailed behavior these residuals actually show. To address this, they introduce a heavy-tailed PoE design using Cauchy experts, which better captures the residual behavior and ensures a convergent inference procedure. Empirical evaluations demonstrate that this new approach significantly outperforms state-of-the-art baselines across various tasks and architectures, highlighting its effectiveness in practical applications.
Methodology
The authors formulate model merging as MAP inference in the fine-tuning parameter space using a product of task-specific energy-based experts. They analyze existing merging methods as special cases of their framework and develop a heavy-tailed PoE design to better model the distribution of directional residuals.
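The contrast between Gaussian and Cauchy experts can be seen in a small coordinate-wise sketch. This simplifies the paper's setting, which works with directional residuals. The per-coordinate treatment, the shared scale, and the fixed-point iteration below are assumptions for illustration. For equal-variance Gaussian experts, the MAP estimate is the plain mean. For Cauchy experts, it satisfies a reweighted-mean fixed point that downweights distant task vectors.

```python
import numpy as np

def gaussian_poe_merge(task_vectors):
    """Equal-variance Gaussian experts: the MAP estimate is the mean."""
    return np.mean(np.asarray(task_vectors, dtype=float), axis=0)

def cauchy_poe_merge(task_vectors, scale=1.0, iters=50):
    """Coordinate-wise MAP under Cauchy experts by iterative reweighting.

    Each step solves the stationarity condition sum_i w_i * (theta - tau_i) = 0
    with w_i = 1 / (1 + r_i**2). It converges to a local mode, which need not
    be the global one.
    """
    tv = np.asarray(task_vectors, dtype=float)   # (n_tasks, n_params)
    theta = np.median(tv, axis=0)                # robust starting point
    for _ in range(iters):
        r = (tv - theta) / scale
        w = 1.0 / (1.0 + r ** 2)                 # distant experts get small weight
        theta = (w * tv).sum(axis=0) / w.sum(axis=0)
    return theta

# Three toy task vectors; the second coordinate has one outlying update.
task_vectors = [[1.0, 0.0], [1.1, 0.1], [0.9, 10.0]]
merged_gauss = gaussian_poe_merge(task_vectors)
merged_cauchy = cauchy_poe_merge(task_vectors)
```

On the second coordinate the Gaussian merge is pulled toward the outlier, while the Cauchy merge stays near the two tasks that agree.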
Results
The proposed method shows substantial improvements in performance compared to existing state-of-the-art model merging techniques across multiple tasks and architectures, validating the effectiveness of the heavy-tailed PoE design.
Implications
This work has significant implications for the development of multi-task models in scenarios where data cannot be centralized or where fine-tuning is impractical. It opens avenues for more efficient model merging techniques that can be applied in real-world applications, particularly in resource-constrained environments.
Fourier Neural Operators for Rayleigh-Bénard Convection
Theory
Efficient ML
Time Series
- Introduction of a lean FNO architecture for predicting time increments in RBC.
- Achieved higher accuracy than standard FNOs while maintaining a compact model size.
- Demonstrated the model's ability to generalize across spatial and temporal resolutions.
- Ablation study indicates that multi-layer 1D convolutional layers enhance performance.
Summary
This paper presents an enhanced Fourier Neural Operator (FNO) model designed for simulating two-dimensional Rayleigh-Bénard convection (RBC). The authors predict time increments rather than full solutions, which improves accuracy over a standard FNO baseline. The model is compact, with 314k parameters and an inference time of 7 ms, while achieving accuracy comparable to previous benchmarks. The study notes that accuracy remains limited by the resolution of the training data, even though FNOs generalize to finer meshes. An ablation study shows that multi-layer 1D convolutional scaling operators are more accurate than linear layers. The findings suggest that the lean FNO architecture is suitable for iterative numerical methods, although its error may accumulate faster than that of 3D FNOs during longer rollouts or with larger time steps.
Methodology
The authors utilized Fourier Neural Operators to learn mappings between function spaces for the simulation of RBC. They generated training data using high-resolution numerical simulations and compared two learning objectives: predicting full solutions versus predicting increments. The model architecture includes input lifting layers, multiple Fourier layers, and a final projection layer, with an emphasis on using convolutional layers for improved accuracy.
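The difference between the two learning objectives can be sketched without any neural network. In the snippet below, `increment_model` is a hypothetical stand-in for the trained FNO, and states are plain lists rather than 2D fields. It shows how increment targets would be built from a trajectory and how an increment predictor is rolled out in time.

```python
def increment_targets(trajectory):
    """Turn a list of states into per-step increments u[t+1] - u[t]."""
    return [
        [nxt_v - cur_v for cur_v, nxt_v in zip(cur, nxt)]
        for cur, nxt in zip(trajectory, trajectory[1:])
    ]


def rollout(increment_model, state, n_steps):
    """Autoregressive rollout: u[t+1] = u[t] + increment_model(u[t])."""
    states = [list(state)]
    for _ in range(n_steps):
        current = states[-1]
        delta = increment_model(current)
        states.append([u + du for u, du in zip(current, delta)])
    return states


# Toy stand-in model that always predicts a unit increment.
print(rollout(lambda s: [1.0 for _ in s], [0.0, 0.0], 3)[-1])
```

Because each step feeds on the previous prediction, errors in the predicted increments compound over long rollouts.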
Results
The proposed lean FNO model predicted future states of the system more accurately than a standard FNO baseline. The ablation study confirmed that multi-layer 1D convolutional scaling operators performed better than linear layers. The model kept a low memory footprint and fast inference times, making it efficient for practical applications.
Implications
The findings suggest that the lean FNO model can be effectively used in real-time simulations of turbulent convection phenomena, which have applications in atmospheric science and industrial processes. Its efficiency and accuracy make it a promising tool for researchers and engineers working on complex fluid dynamics problems.
A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data
Time Series
Efficient ML
- Introduction of ER-JEPA, a hierarchical SSL framework for ECG data analysis.
- Two-stage structure allows for efficient multichannel and temporal analysis.
- Achieves state-of-the-art performance on ECG benchmarks with minimal resource usage.
- Demonstrates the effectiveness of hierarchical representation learning without representation collapse.
Summary
This paper introduces the Event Reconstruction Joint-Embedding Predictive Architecture (ER-JEPA), a lightweight self-supervised learning (SSL) framework specifically designed for analyzing multivariate time series data, particularly electrocardiogram (ECG) data. The framework addresses the challenge of limited annotated datasets by leveraging large unannotated datasets through a hierarchical structure inspired by cardiologists' diagnostic approaches. ER-JEPA employs a two-stage structure that first constructs representations for each time interval and then processes these representations as a univariate time series. This hierarchical integration of two Joint-Embedding Predictive Architectures (JEPAs) allows for the encoding of multiple levels of abstract representations, enhancing prediction capabilities for complex tasks. The model was pretrained on approximately 180,000 10-second ECG recordings and demonstrated state-of-the-art performance on the ST-MEM benchmark, achieving high accuracy with reduced computational resources. The study highlights the effectiveness of hierarchical representation learning in ECG analysis, showcasing the potential of ER-JEPA to mitigate representation collapse and improve model performance across various pretraining strategies.
Methodology
The methodology involves a two-stage hierarchical structure where the first stage focuses on multichannel processing and the second on temporal analysis. The model utilizes two separate JEPAs to avoid representation collapse, allowing for efficient encoding of multivariate time series into univariate sequences. The framework is pretrained on a large dataset of ECG recordings, and various hyperparameters and strategies are assessed to optimize performance.
Results
ER-JEPA achieved competitive performance compared to transformer-based SSL models, matching state-of-the-art results on the ST-MEM benchmark. It recorded an AUC of 0.936 for multi-label and 0.943 for multi-class evaluations on the PTB-XL fine-tuning downstream task. The model also demonstrated significant reductions in memory usage and inference time, establishing itself as a lightweight solution.
Implications
The findings suggest that ER-JEPA can be effectively applied in medical domains where annotated data is scarce, particularly in ECG analysis. Its lightweight nature and efficiency make it suitable for real-time applications in healthcare settings, potentially improving diagnostic processes and patient monitoring.
AdaBoosting Text Prompts for Vision-Language Models
Multimodal
Computer Vision
NLP
- TPB combines AdaBoost principles with natural-language text prompts for enhanced few-shot learning.
- The framework explicitly focuses on misclassified examples to improve prompt quality and classification accuracy.
- TPB demonstrates superior shot scalability and cross-model transferability compared to existing methods.
- Extensive evaluations across multiple benchmarks confirm TPB's effectiveness in various VLM architectures.
Summary
This paper introduces Text Prompt Boosting (TPB), an innovative framework designed to enhance the performance of Vision-Language Models (VLMs) through improved text prompt construction. Recognizing that the classification accuracy of VLMs is heavily influenced by the quality of text prompts, the authors critique existing few-shot prompting methods for their failure to focus on misclassified examples. TPB draws inspiration from the AdaBoost algorithm, treating each text prompt-based classifier as a weak learner and sequentially aggregating them into a strong ensemble. By explicitly targeting hard examples during prompt construction, TPB maximizes the utility of few-shot supervision, leading to broader visual coverage and improved model-agnostic task knowledge. The framework demonstrates significant improvements in accuracy across eleven classification benchmarks, showcasing its ability to maintain performance gains when transferred to larger VLMs, where existing methods struggle. The authors conduct extensive experiments to validate TPB's effectiveness, highlighting its robustness and scalability in cross-model transfer scenarios.
Methodology
The authors propose Text Prompt Boosting (TPB), which iteratively constructs an ensemble of text prompts by focusing on misclassified examples. This approach reweights training examples at each boosting round, allowing the selection of new prompts that specifically address errors made by previous prompts. The methodology leverages a pool of prompts generated by a Large Language Model (LLM) and systematically builds a robust classifier that captures diverse visual variations.
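The boosting loop described above can be sketched as follows. This is an illustrative reading, not the paper's exact procedure: it assumes each candidate prompt's predictions on the few-shot examples are precomputed, uses the standard multi-class AdaBoost (SAMME) vote weight, and the names `boost_prompts` and `ensemble_predict` are hypothetical.

```python
import math


def boost_prompts(pool_predictions, labels, n_classes, n_rounds):
    """Select prompts AdaBoost-style from a pool of candidates.

    pool_predictions maps a prompt name to its predicted labels on the
    few-shot examples (assumed to be precomputed with the VLM).
    """
    n = len(labels)
    weights = [1.0 / n] * n
    ensemble = []  # list of (prompt name, vote weight alpha)
    for _ in range(n_rounds):
        # Weighted error of every candidate under the current weights.
        errors = {
            name: sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            for name, preds in pool_predictions.items()
        }
        best = min(errors, key=errors.get)
        err = min(max(errors[best], 1e-10), 1 - 1e-10)
        alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
        if alpha <= 0:  # no better than chance: stop adding prompts
            break
        ensemble.append((best, alpha))
        # Up-weight the examples the chosen prompt got wrong.
        preds = pool_predictions[best]
        weights = [
            w * math.exp(alpha) if p != y else w
            for w, p, y in zip(weights, preds, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, pool_predictions, n_examples, n_classes):
    """Weighted vote of the selected prompts on each example."""
    out = []
    for i in range(n_examples):
        votes = [0.0] * n_classes
        for name, alpha in ensemble:
            votes[pool_predictions[name][i]] += alpha
        out.append(votes.index(max(votes)))
    return out
```

Each round favors a prompt that corrects the examples earlier prompts missed, which is the mechanism the summary credits for broader visual coverage.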
Results
TPB consistently outperforms state-of-the-art baselines across eleven classification benchmarks. It shows significant improvements in accuracy on the source model and retains shot-driven gains when transferred to larger VLMs, demonstrating superior scalability and robustness compared to existing few-shot prompting methods.
Implications
The findings suggest that TPB can be effectively utilized in various applications requiring robust image classification, particularly in scenarios with limited labeled data. The framework's ability to transfer knowledge across different VLMs could facilitate the development of more adaptable and efficient multimodal AI systems.
Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space
Graph Learning
Optimization
Theory
- Introduction of dynamic neural graphs for modeling neural network parameters.
- Development of the Dynamic Neural Graph Encoder (DNG-Encoder) to process these dynamic graphs.
- Creation of INR2JLS for mapping INR weights into a joint latent space.
- Demonstration of significant improvements in INR classification accuracy on CIFAR datasets.
Summary
This paper introduces a novel approach to modeling neural network parameters using dynamic graphs, addressing the challenges of processing high-dimensional weight spaces in neural networks. The authors propose the Dynamic Neural Graph Encoder (DNG-Encoder), which captures the temporal dynamics of inference processes by representing neural network parameters as dynamic graphs. This method preserves the sequential nature of layer-by-layer processing, which is often overlooked in existing static graph approaches. Additionally, the authors develop INR2JLS (Implicit Neural Representation to Joint Latent Space), a technique that facilitates downstream applications, such as classifying Implicit Neural Representations (INRs). The proposed methods demonstrate significant improvements in classification accuracy, achieving approximately 10% better performance on CIFAR-100-INR compared to state-of-the-art methods. The paper emphasizes the importance of capturing the temporal dynamics in neural processing and presents a comprehensive evaluation of the proposed methods across multiple tasks.
Methodology
The authors propose a recurrent-like graph neural network, the DNG-Encoder, which processes dynamic neural graphs that evolve over time, mirroring the forward propagation mechanism of neural networks. This approach captures the temporal dynamics of inference processes, allowing for more effective modeling of neural network parameters. The DNG-Encoder is then utilized to develop INR2JLS, which learns a joint latent space between deep weights and original data for improved downstream applications.
Results
The proposed methods, particularly the DNG-Encoder and INR2JLS, show substantial improvements in classification tasks, achieving a 9% and 10% increase in accuracy on CIFAR-10 and CIFAR-100 for INR classification, respectively, surpassing existing state-of-the-art methods.
Implications
The findings suggest that incorporating dynamic graphs into the modeling of neural network parameters can lead to better performance in various machine learning tasks, particularly in classifying implicit neural representations. This approach could have broader applications in optimizing neural networks and enhancing their interpretability.
EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems
Large Language Models
Reinforcement Learning
NLP
- Introduction of EPC, a standardized protocol for measuring evaluator preference dynamics.
- Establishment of a four-phase isolation paradigm for systematic evaluation.
- Provision of a versioned Reference Snapshot for reproducibility and comparison.
- Focus on community governance and versioning to maintain measurement validity.
Summary
This paper introduces EPC (Evaluator Preference Coupling), a standardized protocol designed to measure evaluator preference dynamics in large language model (LLM) agent systems. The author identifies a significant gap in the existing literature, where the absence of a uniform protocol hinders reproducibility, comparability, and the ability to track changes in evaluator performance over time. The EPC protocol comprises a four-phase isolation paradigm that allows for systematic evaluation of how evaluator feedback influences agent behavior through test-time reinforcement learning (TTRL). The paper details the configuration of evaluators and executors, the design of strategies and tasks, and the computation of various metrics such as the coupling coefficient (γ) and Jensen-Shannon divergence (JSD). Additionally, a versioned Reference Snapshot v1.0 is provided, containing coupling measurements across multiple evaluators and model versions, ensuring that researchers can reference time-bound data. The paper emphasizes that it does not present new empirical findings but rather establishes a framework for ongoing research in the field, promoting a community-governed approach to evaluation as models evolve.
Methodology
The EPC protocol is structured around a four-phase isolation paradigm that includes pure text and visual tasks, as well as coupling phases. It specifies the configuration of evaluators and executors, the design of strategies, and the computation of metrics to quantify evaluator preference coupling. The paper also outlines a versioning convention for tracking changes in evaluator performance over time.
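Of the metrics named in the summary, the Jensen-Shannon divergence has a standard definition and can be computed directly. The sketch below uses base-2 logarithms, so the value lies in [0, 1]. How EPC builds the two distributions, and how the coupling coefficient γ is defined, are not stated here, so the inputs are simply assumed to be discrete probability vectors over the same outcomes.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # disjoint support
```

Unlike plain KL divergence, this measure is symmetric and stays finite when one distribution assigns zero probability to an outcome.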
Results
The paper presents a Reference Snapshot v1.0 that includes coupling measurements for eight evaluator conditions across various LLMs, derived from five independent studies. This snapshot is designed to be time-bound, with measurements expected to decay as evaluators are updated, thus providing a clear framework for future comparisons.
Implications
The EPC protocol has the potential to enhance the reproducibility and comparability of research in LLM agent systems, facilitating better understanding of evaluator dynamics and improving the robustness of evaluations as models evolve. It encourages a community-driven approach to maintaining evaluation standards in the rapidly changing landscape of AI.
Adaptive Group-Based Counterfactual Explanations for Time-Series Rehabilitation Data
Time Series
Interpretability
Optimization
- Introduces a two-stage framework for generating group-based counterfactual explanations in rehabilitation data.
- Implements a Learnable Gate mechanism to optimize sensor group relevance and enhance interpretability.
- Demonstrates improved modality-group sparsity and validity over traditional channel-level methods.
- Validates the approach using the KneE-PAD dataset, showing clinically meaningful corrective feedback.
Summary
This paper addresses the challenge of generating interpretable counterfactual explanations (CEs) for multivariate time-series classifiers, particularly in the context of rehabilitation movement analysis using inertial measurement units (IMUs). Traditional counterfactual methods often operate at the channel level, leading to explanations that are biomechanically incoherent and difficult for clinicians to interpret. The authors propose a two-stage framework that includes Shapley-Adaptive (SA) group ranking and a Learnable Gate (LG) mechanism. The SA group ranking preserves counterfactual validity but lacks group-level sparsity, prompting the introduction of LG methods that optimize per-group relevance gates alongside perturbation masks. Experiments conducted on the KneE-PAD rehabilitation dataset demonstrate that the LG approach significantly enhances modality-group sparsity while maintaining or improving validity, temporal smoothness, and generation efficiency. The findings indicate that group-structured counterfactuals provide concise, clinically relevant corrective guidance, thereby improving interpretability and actionable insights for rehabilitation practitioners.
Methodology
The proposed framework consists of two main components: (1) Shapley-Adaptive group ranking for initial counterfactual generation, and (2) a Learnable Gate mechanism that incorporates trainable relevance parameters for sensor groups. This approach allows for dynamic selection of sensor groups while optimizing for validity, sparsity, and plausibility in counterfactual generation.
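One way to picture the Learnable Gate idea is a per-group scalar gate that scales the perturbation applied to every channel in that sensor group. The sketch below is an assumption-laden illustration only: the sigmoid parameterization, the group names, and both helper functions are hypothetical, and the gates are set by hand rather than trained.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def apply_group_gates(x, delta, channel_group, gate_logits):
    """Perturb each channel by its group's gate times the proposed change.

    x, delta: one list of time samples per channel.
    channel_group: group name for each channel (hypothetical names).
    gate_logits: one relevance logit per group (would be trained).
    """
    out = []
    for ch, (series, change) in enumerate(zip(x, delta)):
        gate = sigmoid(gate_logits[channel_group[ch]])
        out.append([v + gate * dv for v, dv in zip(series, change)])
    return out


def active_groups(gate_logits, threshold=0.5):
    """Groups whose gate is open; fewer open gates means a sparser explanation."""
    return [g for g, logit in gate_logits.items() if sigmoid(logit) > threshold]


gates = {"thigh": 20.0, "shank": -20.0}
x_cf = apply_group_gates(
    [[0.0, 0.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]], ["thigh", "shank"], gates
)
print(active_groups(gates), x_cf)
```

A closed gate leaves all channels of that group essentially untouched, so the resulting counterfactual changes whole sensor groups rather than scattered individual channels.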
Results
The experiments on the KneE-PAD dataset revealed that the Learnable Gate method significantly improved modality-group sparsity compared to the baseline channel-level M-CELS method. Additionally, it maintained or enhanced the validity, temporal smoothness, and efficiency of counterfactual generation. The structured counterfactuals provided actionable insights aligned with clinical reasoning, demonstrating their effectiveness in rehabilitation contexts.
Implications
The findings suggest that adaptive group-based counterfactual explanations can enhance the interpretability and reliability of machine learning models in clinical rehabilitation settings. This approach can facilitate personalized interventions by providing clinicians with clearer, biomechanically coherent guidance based on motion analysis.
Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition
Reinforcement Learning
Generative Models
Computer Vision
- Introduces Flow-Map GRPO, a framework for optimizing deterministic few-step flow-map generators using RL.
- Proposes Anchored Stochastic Flow Map Composition (ASFMC) to introduce stochasticity while preserving the original probability path.
- Demonstrates that existing SDE-based stochasticization techniques are not applicable to long-range flow maps.
- Empirical results show improvements in performance metrics for text-to-image generation tasks.
Summary
This paper introduces Flow-Map GRPO, an online reinforcement learning (RL) post-training framework designed for deterministic few-step flow-map generators, which are commonly used in generative models like consistency models and MeanFlow. Traditional deterministic models face challenges in optimization with RL due to their lack of stochastic trajectories and well-defined likelihood ratios. The authors propose a novel stochasticization mechanism called Anchored Stochastic Flow Map Composition (ASFMC), which introduces randomness while preserving the original marginal probability path of the deterministic flow map. This allows for the formulation of few-step flow-map sampling as a Markov decision process (MDP), enabling the use of RL techniques for optimization without altering the original model parameterization. The paper empirically validates Flow-Map GRPO on few-step FLUX-based text-to-image generators, demonstrating significant improvements in various evaluation metrics compared to pretrained models. Overall, the proposed method effectively aligns deterministic flow-map generators with RL post-training, enhancing their performance in generative tasks.
Methodology
The authors developed Flow-Map GRPO, which utilizes ASFMC to create stochastic transitions from deterministic flow maps. This mechanism employs anchor-based conditional resampling to maintain the original marginal distribution while allowing for trajectory-level exploration. The framework is designed to work with both single-time and two-time flow-map parameterizations, enabling the application of RL post-training techniques to enhance model performance.
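ASFMC itself is not specified in enough detail here to sketch, but the group-relative advantage that GRPO-style methods generally rely on is simple to show. The snippet below assumes a group of samples generated for the same prompt, each with a scalar reward, and normalizes by the population standard deviation. Whether Flow-Map GRPO uses exactly this normalization is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Samples that score above their group's average receive a positive advantage and are reinforced, which is why stochastic trajectories are needed in the first place: a deterministic generator would produce identical samples with zero advantage.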
Results
Experiments conducted on few-step FLUX-based text-to-image generators, including MeanFlow and sCM, showed that Flow-Map GRPO significantly improved performance across reward-based, perceptual, and task-level evaluation metrics compared to the original pretrained models.
Implications
The proposed framework has the potential to enhance the performance of various generative models in tasks requiring high-quality outputs, such as image and video generation. It opens avenues for further research into integrating RL with deterministic generative models without the need for extensive retraining.
CausalMix: Data Mixture as Causal Inference for Language Model Training
NLP
Large Language Models
Optimization
- CAUSALMIX optimizes data mixtures by framing it as a causal inference problem.
- The framework allows for dynamic adjustment of mixture weights based on the current data state.
- Extensive experiments show significant performance improvements over traditional methods.
- CAUSALMIX provides interpretability through the analysis of Conditional Average Treatment Effects.
Summary
The paper introduces CAUSALMIX, a novel framework for optimizing data mixtures in Large Language Model (LLM) training by framing the problem as a causal inference task. Traditional methods for data mixing often rely on static data distributions and require costly retraining when data pools shift. CAUSALMIX addresses this limitation by treating the mixture optimization as a causal marginal return estimation problem, where the statistical features of the data pool are considered covariates and the domain mixture is treated as the treatment. The authors conducted extensive experiments using the Qwen2.5-0.5B model to estimate the Conditional Average Treatment Effect (CATE) and extrapolated optimal mixtures for larger datasets, demonstrating the framework's ability to generalize to unseen data and larger model architectures. The results indicate that CAUSALMIX consistently improves performance across various downstream tasks compared to existing methods like RegMix. Additionally, the framework offers interpretability through the CATE Interpreter, revealing insights into the interactions between different data domains. Overall, CAUSALMIX presents a scalable, interpretable, and transferable approach to data mixture optimization in LLM training.
Methodology
CAUSALMIX formulates data mixture optimization as a causal marginal return estimation problem, utilizing Double Machine Learning (DML) and causal forests to orthogonalize treatment and outcome variables. The framework analyzes historical training runs as treatments and conditions on the data state to estimate the impact of domain proportions on downstream performance.
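The orthogonalization step in Double Machine Learning can be illustrated with its simplest linear form: residualize both the treatment and the outcome on the covariate, then regress one residual on the other. The sketch below swaps in ordinary least squares for the flexible nuisance models, omits cross-fitting, and estimates a single constant effect rather than a CATE from causal forests, so it is a teaching aid and not the CAUSALMIX pipeline. All names and the toy data are hypothetical.

```python
def ols_residuals(x, y):
    """Residuals of y after a least-squares fit on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def partialling_out_effect(covariate, treatment, outcome):
    """Effect of treatment on outcome after removing the covariate's influence."""
    t_res = ols_residuals(covariate, treatment)
    y_res = ols_residuals(covariate, outcome)
    return sum(t * y for t, y in zip(t_res, y_res)) / sum(t * t for t in t_res)


# Toy data: the true effect of the treatment (a domain proportion) is 3.
pool_feature = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
proportion = [0.5 * x + e for x, e in zip(pool_feature, noise)]
score = [3.0 * t + 5.0 * x + 1.0 for t, x in zip(proportion, pool_feature)]
print(partialling_out_effect(pool_feature, proportion, score))
```

Regressing the outcome on the treatment directly would mix in the covariate's own effect; residualizing both sides first isolates the part of the outcome explained by the treatment alone.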
Results
The application of CAUSALMIX led to improved performance across multiple downstream tasks, outperforming baseline methods such as RegMix. The framework successfully extrapolated optimal mixtures for larger datasets and demonstrated the ability to generalize to unseen data pools without requiring new proxy experiments.
Implications
CAUSALMIX offers a principled approach to data mixture optimization that can enhance the training of LLMs, making it easier to adapt to changing data distributions and improving model performance. The interpretability aspect allows researchers to understand the effects of different data domains on model outcomes, which can inform future data collection and training strategies.
Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials
Optimization
Efficient ML
- SOAP and SOAP-Muon optimizers outperform the AdamW baseline in training MLIPs, showing faster convergence and higher accuracy.
- These optimizers maintain strong performance even with reduced force supervision, indicating potential for label-efficient training.
- SOAP-Muon achieves robust results, particularly in scenarios where force labels are expensive or limited.
- The resulting MLIPs demonstrate physical fidelity, accurately reproducing ab initio calculations and experimental data.
Summary
This paper addresses the optimization of machine learning interatomic potentials (MLIPs), which are crucial for scientific simulations in chemistry and materials science. While the MLIP community has focused on improving model architectures and datasets, the choice of optimizer has been largely overlooked, with most training protocols defaulting to Adam and its variants. The authors implement and compare several matrix-structured optimizers (Muon, SOAP, and a hybrid SOAP-Muon) against AdamW in training NequIP and Allegro MLIP models. The study finds that SOAP and SOAP-Muon significantly outperform the AdamW baseline in convergence speed and final accuracy, particularly under partial force supervision. The results suggest that the choice of optimizer is a critical factor in the design of MLIPs, with implications for reducing the need for extensive force labels in training.
Methodology
The authors benchmarked the Muon, SOAP, and SOAP-Muon optimizers against AdamW on two physical systems: liquid water and the solid acid electrolyte CsH2PO4. They evaluated the optimizers under varying levels of force supervision, including energy-only training, to assess their robustness and efficiency in training MLIPs.
Results
The study revealed that SOAP and SOAP-Muon consistently improved energy and force accuracy while accelerating convergence compared to AdamW. SOAP demonstrated the most robust performance across different systems, while SOAP-Muon achieved the best results in specific settings. Notably, SOAP-Muon maintained high accuracy even when trained with only 50% of the force labels, matching the performance of AdamW trained with full supervision. In extreme cases, SOAP-Muon preserved fidelity with just 5% of force labels, while AdamW became unstable.
Implications
The findings suggest that optimizing the choice of training algorithms can lead to more efficient and effective MLIPs, particularly in scenarios where obtaining force labels is costly. This could enhance the applicability of MLIPs in various scientific fields, enabling more extensive simulations with fewer resources.
Interpretable vs Learned Encoders for High-Cardinality Fraud Detection
Interpretability
- Entity embeddings provided the highest AUC-ROC score, indicating their effectiveness in high-cardinality fraud detection.
- The study operationalized auditor-readable tier grouping, demonstrating its competitive performance against learned encodings.
- Controlled comparisons across different encoders highlight the importance of isolating encoding methods from model architectures.
- Interpretability and computational efficiency are critical factors in selecting encoding methods for fraud detection in regulated environments.
Summary
This paper investigates the performance of seven categorical encoding methods for high-cardinality fraud detection using the IEEE-CIS fraud benchmark dataset, which consists of 590,540 records with a 3.5% positive rate and eight high-cardinality columns. The authors conducted a controlled experiment with a fixed LightGBM learner to isolate the effects of different encoders, including entity embeddings, target encoding, and tier grouping. The study highlights the trade-offs between accuracy, interpretability, and computational efficiency in fraud detection models. The results indicate that entity embeddings achieved the highest AUC-ROC score of 0.9612, closely followed by CatBoost at 0.9602. The paper emphasizes the importance of auditor-readable encoding methods, particularly tier grouping, which maintained interpretability while performing competitively. The findings suggest that while deep learning approaches like TabNet did not outperform tree-based methods, the choice of encoding can significantly influence model performance and compliance with regulatory standards.
Methodology
The authors tested seven categorical encoding methods on the IEEE-CIS fraud dataset using stratified 5-fold cross-validation with three repetitions. Five encoders (one-hot encoding, target encoding, frequency encoding, tier grouping, and entity embeddings) were evaluated with a fixed LightGBM learner, while CatBoost and TabNet were included for cross-paradigm comparisons.
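Two of the simpler encoders named above can be written in a few lines. The sketch below shows frequency encoding and a smoothed target encoding. The smoothing constant `m` and the function names are illustrative choices, not settings taken from the paper.

```python
from collections import Counter, defaultdict


def frequency_encoding(values):
    """Map each category to its share of the rows."""
    counts = Counter(values)
    n = len(values)
    return {cat: c / n for cat, c in counts.items()}


def target_encoding(values, targets, m=1.0):
    """Map each category to a smoothed mean of the target.

    Rare categories are shrunk toward the global positive rate, with m
    acting as the weight of that prior (an assumed default).
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), Counter(values)
    for cat, y in zip(values, targets):
        sums[cat] += y
    return {cat: (sums[cat] + m * prior) / (counts[cat] + m) for cat in counts}


print(frequency_encoding(["a", "b", "a", "c"]))
print(target_encoding(["a", "a", "b"], [1, 0, 1]))
```

In a cross-validated setup like the one described, the target encoding map should be fitted on the training folds only and then applied to the held-out fold, since fitting it on all rows leaks the labels.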
Results
Entity embeddings achieved the highest AUC-ROC of 0.9612, with CatBoost closely following at 0.9602. Tier grouping performed competitively at 0.9548, while target encoding was slightly less effective. On AUC-PR, CatBoost led with a score of 0.822 compared to 0.793 for entity embeddings. The analysis confirmed that the advantage of embeddings arises from their ability to represent multiple columns jointly.
Implications
The findings suggest that organizations involved in fraud detection can benefit from using entity embeddings and tier grouping to enhance model performance while ensuring compliance with interpretability requirements. This research provides a framework for evaluating encoding methods in regulated financial environments, potentially influencing future practices in fraud detection.
EPnG: Adaptive Expert Prune-and-Grow for Parameter-Efficient MoE Fine-tuning
NLP
Large Language Models
Efficient ML
- EPnG optimizes parameter-efficient fine-tuning for Mixture-of-Experts models by reallocating resources based on expert importance.
- The prune-and-grow mechanism allows for dynamic adjustment of expert utilization while maintaining a fixed parameter budget.
- EPnG achieves performance comparable to full fine-tuning while updating significantly fewer parameters (0.55%–0.72%).
- The framework addresses the inefficiencies of existing PEFT methods that do not consider MoE routing dynamics.
Summary
The paper introduces EPnG, an innovative adaptive framework designed to enhance the parameter-efficient fine-tuning (PEFT) of Mixture-of-Experts (MoE) models. Traditional PEFT methods, such as LoRA, overlook the unique routing dynamics of MoE architectures, resulting in inefficient resource allocation during fine-tuning. EPnG addresses this issue by dynamically reallocating LoRA capacity based on the importance of experts, as determined by router gate probabilities. The framework operates through a prune-and-grow mechanism: it prunes under-utilized experts and reallocates the freed capacity to high-importance experts by expanding their ranks with orthogonal initialization. This approach maintains a fixed parameter budget while optimizing expert utilization. The authors demonstrate that EPnG consistently outperforms static LoRA under the same budget constraints and achieves performance levels comparable to full fine-tuning, all while updating only a small fraction (0.55%–0.72%) of parameters, which is significantly fewer than traditional methods. The findings suggest that aligning PEFT strategies with MoE routing dynamics can lead to more effective and scalable fine-tuning solutions.
Methodology
EPnG employs a prune-and-grow strategy that first estimates expert importance from router gate probabilities collected during training. It prunes low-importance experts and reallocates the released budget to high-importance experts by expanding their LoRA ranks with orthogonal initialization. This process is repeated iteratively to optimize expert utilization under a fixed parameter budget.
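The budget-preserving reallocation can be sketched as a pure bookkeeping step on LoRA ranks. The snippet below is a simplified illustration: importance scores are given directly rather than derived from router gate probabilities, freed rank is spread round-robin over the top experts, and the orthogonal initialization of the new rank dimensions is not shown. The function name and its parameters are hypothetical.

```python
def prune_and_grow(importance, ranks, n_prune, n_grow):
    """Move LoRA rank from the least to the most important experts.

    Assumes n_prune + n_grow <= number of experts, so that no expert is
    both pruned and grown. The total rank is left unchanged.
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    pruned = order[:n_prune]
    grown = order[::-1][:n_grow]
    new_ranks = list(ranks)
    freed = 0
    for i in pruned:
        freed += new_ranks[i]
        new_ranks[i] = 0
    # Hand the freed rank units to the top experts one at a time.
    for unit in range(freed):
        new_ranks[grown[unit % len(grown)]] += 1
    return new_ranks


print(prune_and_grow([0.4, 0.1, 0.3, 0.2], [8, 8, 8, 8], n_prune=1, n_grow=2))
```

Because rank is only moved between experts, the number of trainable LoRA parameters stays fixed when all experts share the same layer shapes.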
Results
EPnG consistently outperformed LoRA under the same parameter budget across OLMoE and Qwen1.5-MoE models, achieving performance comparable to full fine-tuning while updating only 0.55%–0.72% of parameters, which corresponds to roughly 140x to 180x fewer updated parameters.
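The reduction factor follows directly from the updated fraction, assuming the comparison is against full fine-tuning that updates every parameter. A quick check:

```python
def reduction_factor(updated_fraction):
    """How many times fewer parameters are updated than in full fine-tuning."""
    return 1.0 / updated_fraction


low = reduction_factor(0.0072)   # 0.72% of parameters updated
high = reduction_factor(0.0055)  # 0.55% of parameters updated
print(f"{low:.0f}x to {high:.0f}x")  # prints "139x to 182x"
```

Rounded to the nearest ten, this matches the 140x and 180x figures quoted in the summary.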
Implications
The proposed EPnG framework has significant implications for the efficient fine-tuning of large language models, particularly in scenarios where computational resources are limited. It can lead to more scalable and effective adaptations of MoE architectures in various applications, enhancing their performance without incurring high costs.
Diffeomorphic Optimization
Optimization
Generative Models
- Diffeomorphic optimization enables smoother optimization on low-dimensional manifolds by utilizing diffusion and flow models.
- The method maintains on-manifold trajectories, reducing the risk of drifting into out-of-distribution solutions.
- It extends to matrix Lie groups, facilitating efficient backpropagation for complex protein structures.
- Diffeomorphic optimization outperforms existing techniques in protein design tasks, achieving better results in less time.
Summary
This paper introduces diffeomorphic optimization, a novel method for optimizing differentiable objectives on low-dimensional manifolds embedded in high-dimensional spaces. Traditional optimization techniques struggle with the complex, non-convex landscapes of these manifolds, often leading to out-of-distribution solutions. The authors leverage diffusion and flow models to create a diffeomorphic map from a base space to the data manifold, allowing for gradient descent in a simpler space while ensuring that trajectories remain on the manifold. The method is particularly applicable to protein design, extending to matrix Lie groups SO(3) and SE(3) for efficient backpropagation through Lie-group ODE solvers. The authors demonstrate that diffeomorphic optimization significantly outperforms existing methods in various tasks, including secondary-structure targeting and peptide binding affinity optimization, while also reducing energy levels in protein structures across the PDB test set.
Methodology
The authors propose a framework that uses a diffeomorphic map from a simple base space to the target data manifold, allowing gradient descent to be performed in the base space. They derive methods for backpropagation through ODE solvers on matrix Lie groups, ensuring efficient computation of gradients.
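The idea of running gradient descent in the base space can be illustrated with a toy sketch. Everything here is a stand-in: a hand-written vector field that contracts points onto the unit circle plays the role of the learned diffusion or flow model, Euler integration plays the role of the ODE solver, and central finite differences replace backpropagation through the solver. The names `flow`, `objective`, `grad_wrt_base`, and `optimize` are hypothetical, and the Lie-group machinery from the paper is not covered.

```python
import numpy as np

def flow(z, steps=20, dt=0.2):
    """Euler-integrate a toy vector field that pulls points onto the unit circle."""
    x = np.asarray(z, dtype=float).copy()
    for _ in range(steps):
        r = np.linalg.norm(x)
        x = x + dt * (1.0 - r) * x / r
    return x

def objective(x, target):
    """Squared distance to a target that lies off the manifold."""
    return float(np.sum((x - target) ** 2))

def grad_wrt_base(z, target, eps=1e-5):
    """Central-difference gradient of objective(flow(z)) with respect to z."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        step = np.zeros_like(z)
        step[i] = eps
        g[i] = (objective(flow(z + step), target)
                - objective(flow(z - step), target)) / (2 * eps)
    return g

def optimize(z0, target, lr=0.1, iters=200):
    """Gradient descent on the base point z; the sample is always flow(z)."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(iters):
        z = z - lr * grad_wrt_base(z, target)
    return z, flow(z)

target = np.array([0.0, 2.0])        # deliberately off the unit circle
z_start = np.array([1.5, 0.0])
z_opt, x_opt = optimize(z_start, target)
radius = float(np.linalg.norm(x_opt))
```

Descending directly on `x` would walk to the off-manifold target, whereas descending on `z` moves the sample along the circle toward the closest on-manifold point, because every iterate is an output of the flow.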
Results
Diffeomorphic optimization achieved a 91.3% success rate in secondary-structure targeting compared to 63.3% with tuned guidance. It also outperformed OC-Flow on peptide binding affinity at twice the speed and significantly reduced Rosetta energies for large protein structures.
Implications
This method could make protein design more targeted: optimizing directly on the data manifold reduces the need for extensive sampling and uses computational resources more efficiently when generating high-quality protein samples.
Balancing Expressivity and Learnability in Quantum Kernel Bandit Optimization
Optimization
Theory
Efficient ML
- Identifies the expressivity of quantum kernels as a fundamental learnability barrier in GP bandit optimization.
- Proposes new algorithms that utilize lower-dimensional quantum subspaces and classical approximations to reduce model complexity.
- Derives regret bounds that quantify the trade-off between information gain and kernel misspecification.
- Empirical results show improved sample efficiency and reduced computational overhead compared to full quantum kernels.
Summary
This paper explores Gaussian process (GP) bandit optimization using quantum kernels, particularly in the context of noisy intermediate-scale quantum (NISQ) computing. The authors identify a significant challenge in using high-dimensional quantum kernels, which can lead to increased model complexity and cumulative regret, thus hindering learnability. To mitigate this issue, they propose projected quantum kernels and classical kernel approximation techniques that maintain essential quantum properties while reducing dimensionality. The paper introduces misspecified GP bandit algorithms and provides regret bounds that illustrate the trade-off between approximation error and information gain. Empirical results demonstrate that the proposed methods outperform traditional full quantum kernels in terms of sample efficiency and computational overhead, making them suitable for scalable optimization in quantum-native applications.
Methodology
The authors develop a framework for approximate GP optimization in quantum kernel bandits, combining quantum kernel approximation techniques with GP and linear bandit algorithms. They introduce linear projected quantum kernels (LPQKs) and quantum-inspired classical approximations such as Random Fourier Features (RFF) and Newton basis expansions. The analysis includes deriving regret bounds to guide the choice of approximation parameters.
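As a rough illustration of the classical-approximation route, the sketch below builds Random Fourier Features for a Gaussian RBF kernel and uses them in a linear-bandit style upper-confidence score. The RBF kernel is only a stand-in, since the summary does not specify the quantum feature map, and the names `make_rff` and `ucb_scores` and the parameters `beta` and `lam` are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Exact Gaussian RBF kernel matrix between rows of X and rows of Y."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-sq / (2.0 * lengthscale**2))

def make_rff(dim, n_features, lengthscale=1.0, seed=0):
    """Return a feature map phi with phi(x) . phi(y) approximating the RBF kernel."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, dim)) / lengthscale
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def features(X):
        return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)

    return features

def ucb_scores(Phi_obs, y_obs, Phi_cand, beta=2.0, lam=1.0):
    """Ridge-regression mean plus beta times the posterior width, in feature space."""
    A = Phi_obs.T @ Phi_obs + lam * np.eye(Phi_obs.shape[1])
    A_inv = np.linalg.inv(A)
    mean = Phi_cand @ (A_inv @ (Phi_obs.T @ y_obs))
    width = np.sqrt(np.sum((Phi_cand @ A_inv) * Phi_cand, axis=1))
    return mean + beta * width

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))

# More features give a tighter kernel approximation at a higher cost.
phi_large = make_rff(dim=3, n_features=4000)
max_err = float(np.abs(rbf_kernel(X, X) - phi_large(X) @ phi_large(X).T).max())

# A smaller feature map keeps the bandit update cheap.
phi_small = make_rff(dim=3, n_features=200)
y = np.sin(X[:, 0])                          # made-up reward observations
candidates = rng.standard_normal((10, 3))
scores = ucb_scores(phi_small(X), y, phi_small(candidates))
next_point = candidates[int(np.argmax(scores))]
```

The number of features plays the role of the approximation parameter: fewer features mean a simpler model but a larger kernel misspecification, which is the trade-off the regret bounds are said to quantify.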
Results
The proposed methods demonstrate lower cumulative regret compared to full quantum kernels in empirical tests, indicating better sample efficiency. The regret bounds derived provide a principled way to balance model complexity and information gain, showing that a well-chosen approximate kernel can outperform high-dimensional quantum kernels.
Implications
The findings suggest that approximate quantum kernel methods can enhance the scalability and efficiency of quantum machine learning applications, particularly in NISQ-era tasks such as quantum control and variational quantum algorithms. This work opens avenues for further research in optimizing quantum algorithms for practical applications.
GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache
NLP
Large Language Models
Efficient ML
- Introduces Gain-Shape K-means (GSKM) to address centroid shrinkage in high-dimensional vector quantization.
- Develops Gain-Shape Residual Quantization (GSRQ) for efficient KV cache compression in LLMs.
- Demonstrates substantial improvements in accuracy over existing quantization baselines, particularly at 1-bit quantization.
- Highlights the importance of directional preservation in high-dimensional quantization tasks.
Summary
The paper addresses the challenge of deploying Large Language Models (LLMs) with extended context windows, which is constrained by Key-Value (KV) cache memory that grows linearly with context length. The authors propose Gain-Shape Residual Quantization (GSRQ), which compresses the KV cache using a new codebook learning method called Gain-Shape K-means (GSKM). GSKM improves directional fidelity and reduces the centroid shrinkage of standard K-means, which can degrade performance in high-dimensional spaces. GSRQ integrates GSKM into a Residual Quantization (RQ) pipeline, improving reconstruction quality and downstream accuracy in LLM inference tasks. The authors report that GSRQ outperforms existing quantization methods, with a notable accuracy gain on LongBench tasks at a 1-bit quantization level.
Methodology
The authors propose GSKM as a replacement for standard K-means, which enhances directional fidelity in high-dimensional vector quantization. GSRQ is built by incorporating GSKM into a Residual Quantization framework, utilizing a robust gradient-based weighting scheme to optimize KV cache quantization.
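A gain-shape split can be sketched as follows, assuming the common formulation in which each vector is separated into its norm (gain) and unit direction (shape) and the codebook is learned on directions with unit-norm centroids. This is an illustrative spherical K-means variant, not the paper's GSKM: the function name `gain_shape_kmeans` is hypothetical, gains are stored unquantized, and the residual stages and gradient-based weighting are left out.

```python
import numpy as np

def gain_shape_kmeans(x, k, iters=20, seed=0):
    """Cluster unit directions (shapes) with unit-norm centroids; keep gains separately."""
    rng = np.random.default_rng(seed)
    gains = np.linalg.norm(x, axis=1, keepdims=True)
    shapes = x / np.maximum(gains, 1e-12)
    centroids = shapes[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each shape to the centroid with the highest cosine similarity.
        assign = np.argmax(shapes @ centroids.T, axis=1)
        for j in range(k):
            members = shapes[assign == j]
            if len(members) == 0:
                continue                      # keep the old centroid for empty clusters
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[j] = total / norm   # renormalize so the centroid never shrinks
    assign = np.argmax(shapes @ centroids.T, axis=1)
    return centroids, assign, gains

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 16)) * rng.uniform(0.5, 2.0, size=(256, 1))
centroids, assign, gains = gain_shape_kmeans(x, k=8)
recon = gains * centroids[assign]             # gain times quantized shape

# Centroid shrinkage in plain K-means: the mean of a cluster is shorter
# than its members are on average.
largest = int(np.bincount(assign, minlength=8).argmax())
members = x[assign == largest]
plain_centroid_norm = float(np.linalg.norm(members.mean(axis=0)))
mean_member_norm = float(np.linalg.norm(members, axis=1).mean())
```

Because the centroids have unit norm, each reconstructed vector keeps the norm of its original, while an averaged K-means centroid is shorter than the vectors it represents.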
Results
At 1-bit quantization, GSRQ raises average accuracy across LongBench tasks from 11.34 (the VQLLM baseline) to 33.54, a gain of 22.20 points. This indicates that the gain-shape codebook preserves far more task-relevant information in the KV cache than the baseline at the same bit budget.
Implications
The findings suggest that GSRQ can facilitate the deployment of LLMs with larger context windows by reducing memory requirements, thus enabling more efficient processing of long documents and complex tasks. This could lead to broader applications of LLMs in real-time systems where memory and computational efficiency are critical.