AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
59 papers today · 8-hour update frequency · 7 days of history
Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks
- Attention sinks are a necessary feature of softmax Transformers when computing trigger-conditional tasks.
- The study introduces a trigger-conditional task that reflects the behavior of attention heads in practical scenarios.
- Theoretical proofs demonstrate that single-layer and multi-layer softmax models must exhibit sink behavior to achieve task accuracy.
- ReLU attention can solve the same task without any sink formation, indicating that normalization is the key factor driving sink behavior.
Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks
Summary
In this paper, the author investigates the phenomenon of attention sinks in softmax Transformers, where attention probability mass concentrates on a fixed, content-agnostic position. The study proves that for softmax self-attention models, computing a trigger-conditional behavior necessitates the existence of an attention sink. This is demonstrated through a specific task where the model must output the average of preceding token representations at a designated trigger position and output zero elsewhere. The findings indicate that the normalization constraint inherent in softmax attention forces the model to collapse attention onto a stable anchor, particularly when a default state is required. In contrast, the author shows that non-normalized ReLU attention can achieve the same task without forming sinks, highlighting the normalization constraint as the primary cause of sink behavior. Empirical experiments corroborate the theoretical predictions, revealing that softmax models develop strong sinks while ReLU attention effectively eliminates them across various model architectures.
Methodology
The author introduces a trigger-conditional task to analyze attention behavior in softmax Transformers. Theoretical proofs are provided to establish the necessity of attention sinks in single-layer and multi-layer models. Empirical experiments are conducted to validate the theoretical claims, comparing softmax and ReLU attention mechanisms.
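To make the normalization argument concrete, here is a toy sketch (the scores are invented, not from the paper): softmax weights must be positive and sum to one, so even at a non-trigger position where nothing is relevant, attention mass has to land somewhere, whereas unnormalized ReLU attention can output all zeros, giving the model a genuine default state.

```python
import math

def softmax_attn(scores):
    # Softmax normalization: weights are strictly positive and sum to 1,
    # so attention mass must land somewhere even when "no-op" is desired.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_attn(scores):
    # ReLU attention is not normalized: every weight can be exactly zero,
    # giving the model a true default ("attend to nothing") state.
    return [max(0.0, s) for s in scores]

# Toy scores at a non-trigger position: nothing is relevant.
scores = [-2.0, -1.5, -3.0]
w_soft = softmax_attn(scores)
w_relu = relu_attn(scores)
```

Under softmax, the mass that cannot vanish is exactly what the paper argues gets parked on a fixed sink token.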
Results
The study proves that any single-layer softmax attention model achieving vanishing error on the trigger-conditional task must concentrate attention on a fixed sink token at all non-trigger positions. For multi-layer models, at least one layer must exhibit sink behavior. The experiments confirm that softmax models develop strong sinks, while ReLU attention models do not, thus supporting the hypothesis that normalization is the fundamental driver of sink behavior.
Implications
The findings have significant implications for the design and optimization of Transformer models, particularly in mitigating attention sinks that can adversely affect model performance and interpretability. Understanding the role of normalization in attention mechanisms may lead to improved architectures that avoid the pitfalls associated with attention sinks.
Huntington Disease Automatic Speech Recognition with Biomarker Supervision
- Introduces a high-fidelity clinical corpus for HD speech ASR, the first of its kind for end-to-end evaluation.
- Demonstrates that different ASR architectures exhibit unique error patterns when processing HD speech.
- Achieves a significant reduction in WER through HD-specific adaptations of the Parakeet-TDT model.
- Proposes the use of clinically grounded biomarkers as auxiliary supervision for ASR adaptation.
Huntington Disease Automatic Speech Recognition with Biomarker Supervision
Summary
This paper addresses the challenges of automatic speech recognition (ASR) for individuals with Huntington's disease (HD), a condition characterized by irregular speech patterns that complicate transcription. The authors conduct a systematic study using a high-fidelity clinical speech corpus, which includes recordings from 94 HD-positive individuals and 36 healthy controls. They compare various ASR architectures, revealing that HD speech induces architecture-specific error patterns. Notably, the Parakeet-TDT model outperforms traditional encoder-decoder and CTC baselines. The study also introduces HD-specific adaptations that significantly reduce word error rates (WER) from 6.99% to 4.95%. Furthermore, the authors propose a novel approach that utilizes biomarker-based auxiliary supervision to enhance ASR performance, analyzing how this supervision reshapes error behavior in a severity-dependent manner. The paper concludes by making all code and models open-source, contributing to the field of dysarthric ASR and potentially extending the framework to other atypical speech disorders.
Methodology
The authors utilized a clinical dataset comprising 4.5 hours of audio from 130 individuals, including both HD-positive and healthy controls. They compared multiple ASR architectures, focusing on their performance in recognizing HD speech. The study involved adapting the Parakeet-TDT model with encoder-side adapters and incorporating auxiliary supervision derived from clinically relevant biomarkers, which were extracted using tools like openSMILE and librosa. A detailed error analysis was conducted to understand the substitution, deletion, and insertion patterns across different model families and severity cohorts.
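The WER figures quoted above (6.99% to 4.95%) are the standard word-level edit distance divided by reference length; a minimal reference implementation, independent of the paper's code:

```python
def word_error_rate(ref, hyp):
    """WER = word-level Levenshtein distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i reference words into
    # the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution
            dp[i][j] = min(sub,
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(r)][len(h)] / len(r)
```

The same dynamic-programming table also yields the substitution/deletion/insertion breakdown the error analysis relies on.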
Results
The study found that the Parakeet-TDT model outperformed other ASR architectures, achieving a WER reduction from 6.99% to 4.95% with HD-specific adaptations. The introduction of biomarker-based auxiliary supervision reshaped error behavior in a manner that was dependent on the severity of the HD condition, rather than uniformly improving WER across all cases. This indicates that the integration of clinical biomarkers can enhance the adaptability and robustness of ASR systems for pathological speech.
Implications
The findings suggest that specialized ASR systems can significantly improve transcription accuracy for individuals with Huntington's disease, addressing a critical gap in the field. The open-source nature of the code and models allows for further research and development in dysarthric ASR, potentially benefiting other atypical speech disorders. This work highlights the importance of incorporating clinical insights and biomarker data into machine learning models to enhance their performance in real-world applications.
Duration Aware Scheduling for ASR Serving Under Workload Drift
- Duration-aware scheduling can significantly reduce end-to-end latency in ASR systems.
- SJF reduces median latency by up to 73% but may cause increased tail latency.
- HRRN provides a balanced approach, improving median latency while controlling tail latency degradation.
- The proposed methods maintain performance under workload drift with minimal scheduling overhead.
Duration Aware Scheduling for ASR Serving Under Workload Drift
Summary
This paper addresses the inefficiencies of first-come-first-served (FCFS) scheduling in Automatic Speech Recognition (ASR) systems, particularly under variable workloads. The authors demonstrate that audio duration serves as a reliable proxy for job processing time in ASR models like Whisper. By integrating two classical scheduling algorithms—Shortest Job First (SJF) and Highest Response Ratio Next (HRRN)—into the vLLM engine, the authors aim to enhance end-to-end latency performance. The study evaluates these algorithms under realistic workloads, revealing that SJF can reduce median end-to-end latency by up to 73% at high loads, although it may increase tail latency due to starvation of longer requests. HRRN mitigates this issue by balancing median latency improvements (up to 28%) while limiting tail latency degradation to 24%. The findings indicate that duration-aware scheduling can significantly improve ASR responsiveness without incurring throughput penalties or substantial scheduling overhead.
Methodology
The authors integrated SJF and HRRN scheduling algorithms into the vLLM engine, leveraging the correlation between audio duration and job processing time. They evaluated the performance of these algorithms on the LibriSpeech dataset and a synthetic workload to assess their robustness under varying workloads.
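The two classical policies can be sketched in a few lines; this toy batch-arrival simulation (durations invented) shows why SJF cuts median latency and why HRRN's response ratio bounds starvation:

```python
def sjf_order(durations):
    # Shortest Job First: serve jobs in increasing (predicted) duration.
    return sorted(range(len(durations)), key=lambda i: durations[i])

def hrrn_next(durations, waits):
    # Highest Response Ratio Next: ratio = (wait + service) / service.
    # Short jobs are still favored, but long-waiting jobs eventually win,
    # which limits the tail-latency starvation SJF can cause.
    return max(range(len(durations)),
               key=lambda i: (waits[i] + durations[i]) / durations[i])

def mean_latency(durations, order):
    # Mean completion time when all jobs arrive at t = 0.
    t, total = 0.0, 0.0
    for i in order:
        t += durations[i]
        total += t
    return total / len(durations)

# Audio durations (seconds) stand in for job processing times,
# the proxy relationship the paper exploits.
durations = [8.0, 1.0, 2.0, 4.0]
fcfs_latency = mean_latency(durations, range(len(durations)))
sjf_latency = mean_latency(durations, sjf_order(durations))
```

Even in this four-job example SJF roughly halves mean latency relative to FCFS.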
Results
The results showed that SJF achieved a reduction in median end-to-end latency by up to 73% at high loads, while HRRN provided a more balanced reduction of up to 28% in median latency with a maximum tail latency degradation of 24%. Both algorithms demonstrated consistent performance improvements across different workloads, with less than 0.1 ms scheduling overhead per request.
Implications
The findings suggest that implementing duration-aware scheduling can enhance the responsiveness of ASR systems, making them more efficient for real-time applications such as voice assistants and live captioning. This approach can be particularly beneficial in environments with variable workloads, improving user satisfaction and system performance.
Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models
- Introduces a scaling-law framework for analyzing jailbreak attacks in LLMs.
- Demonstrates that prompting-based attacks are more compute-efficient than optimization-based methods.
- Identifies distinct success-stealthiness operating points for different attack paradigms.
- Finds that the ease of eliciting harm is highly goal-dependent, with misinformation being the most accessible target.
Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models
Summary
This paper investigates the scaling behavior of jailbreak attacks on large language models (LLMs), focusing on how the success of these attacks correlates with the computational effort exerted by attackers. The authors propose a scaling-law framework that treats each jailbreak attempt as a compute-bounded optimization problem, measuring progress along a shared FLOPs (floating point operations) axis. They evaluate four distinct jailbreak paradigms: optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization, across various model families and harmful objectives. The study reveals that prompting-based methods are generally more compute-efficient than optimization-based approaches, and that different attack paradigms occupy unique success-stealthiness operating points. Additionally, the research highlights that the vulnerability of LLMs is highly dependent on the type of harm being targeted, with misinformation-related harms being the easiest to elicit. The findings contribute to a deeper understanding of the dynamics of jailbreak attacks and provide insights into the efficiency and effectiveness of various attack strategies.
Methodology
The authors conducted a systematic evaluation of various jailbreak attack methods by measuring their success rates against the computational resources (FLOPs) used. They fitted a saturating exponential function to the success trajectories of the attacks and analyzed the efficiency of different paradigms through comparative analysis. Additionally, they explored the optimization dynamics of prompt-based updates and assessed the goal-dependent nature of vulnerabilities in LLMs.
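A saturating exponential in compute has the form s(C) = s_max · (1 − exp(−C/τ)). The sketch below (synthetic data and a crude grid-search fit, not the paper's fitting procedure) shows how such a curve recovers the saturation level and compute scale from success trajectories:

```python
import math

def success_curve(flops, s_max, tau):
    # Saturating exponential: attack success rises with compute, then plateaus.
    return s_max * (1.0 - math.exp(-flops / tau))

def fit_grid(xs, ys, s_grid, tau_grid):
    # Illustrative grid-search least squares over (s_max, tau).
    best, best_err = None, float("inf")
    for s in s_grid:
        for tau in tau_grid:
            err = sum((success_curve(x, s, tau) - y) ** 2
                      for x, y in zip(xs, ys))
            if err < best_err:
                best, best_err = (s, tau), err
    return best

# Synthetic success trajectory generated with s_max = 0.8, tau = 2.0
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
ys = [success_curve(x, 0.8, 2.0) for x in xs]
s_max, tau = fit_grid(xs, ys, [0.6, 0.7, 0.8, 0.9], [1.0, 2.0, 3.0])
```

In this framing, a more compute-efficient attack paradigm is simply one whose fitted τ is smaller for a comparable s_max.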
Results
The study found that prompting-based methods achieved higher success rates with lower computational costs compared to optimization-based methods. The fitted scaling curves indicated predictable trends in attack success relative to the computational effort, with a notable distinction in the success-stealthiness balance among different attack types. Misinformation-related harms were identified as significantly easier to elicit than other forms of harm.
Implications
The findings of this research have significant implications for the development of more robust defenses against jailbreak attacks in LLMs. By understanding the scaling behaviors and efficiencies of various attack strategies, developers can better anticipate vulnerabilities and design more effective safety mechanisms. Furthermore, the insights into goal-dependent vulnerabilities can inform targeted strategies for mitigating specific types of harmful outputs in LLMs.
Multilingual Financial Fraud Detection Using Machine Learning and Transformer Models: A Bangla-English Study
- Investigation of financial fraud detection in a multilingual Bangla-English context.
- Comparison of classical machine learning models with transformer-based architectures.
- Identification of unique linguistic patterns in fraudulent messages.
- Demonstration of the effectiveness of classical models in low-resource language settings.
Multilingual Financial Fraud Detection Using Machine Learning and Transformer Models: A Bangla-English Study
Summary
This paper addresses the critical challenge of financial fraud detection in a multilingual context, specifically focusing on Bangla and English. As digital financial platforms expand, the need for effective fraud detection systems becomes paramount, yet most existing research has been limited to English-language data. The authors explore the performance of classical machine learning models and transformer-based architectures on a dataset of legitimate and fraudulent financial messages in a Bangla-English setting. They find that while classical models like Linear SVM outperform transformers in accuracy, the transformers achieve higher recall on fraudulent messages. The study highlights the unique characteristics of scam messages and the challenges posed by linguistic diversity, code-mixing, and the constraints of low-resource languages. The findings emphasize the importance of multilingual modeling and the continued relevance of classical machine learning techniques in this domain.
Methodology
The authors employed classical machine learning models, including Logistic Regression, Linear SVM, and Ensemble classifiers, using TF-IDF features. They also utilized transformer-based architectures to analyze their performance on a dataset containing both legitimate and fraudulent financial messages. The evaluation was conducted using 5-fold stratified cross-validation to ensure robust performance assessment.
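The TF-IDF features feeding the classical models can be sketched in plain Python (toy messages invented for illustration; real pipelines such as scikit-learn's add smoothing and length normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF: tf = raw term count, idf = log(N / document frequency)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))               # count each term once per document
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]

docs = ["urgent click this link now",
        "your payment of 500 taka is confirmed",
        "urgent verify your account now"]
feats = tfidf(docs)
```

Terms shared across messages (like "urgent") get down-weighted relative to message-specific terms, which is exactly what lets a linear classifier latch onto the urgency cues and URLs the exploratory analysis found in scam messages.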
Results
The study found that the Linear SVM model achieved the highest performance with an accuracy of 91.59% and an F1 score of 91.30%, surpassing the transformer model, which recorded 89.49% accuracy and 88.88% F1 score. However, the transformer model exhibited a higher fraud recall of 94.19%, indicating its potential for identifying fraudulent messages despite a higher false positive rate. Exploratory analysis revealed that scam messages were typically longer and contained urgency-inducing terms, URLs, and phone numbers, while legitimate messages included transactional confirmations and currency references.
Implications
The findings suggest that classical machine learning techniques remain competitive for multilingual fraud detection, particularly in low-resource language contexts. This research underscores the necessity for developing fraud detection systems that accommodate linguistic diversity and code-mixing, which are prevalent in many real-world financial communications. The study also opens avenues for further exploration of NLP applications in underrepresented languages like Bangla.
Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors
- Introduction of KProxNPLVM to improve soft sensor modeling accuracy.
- Utilization of Wasserstein distance as a proximal operator to relax the learning objective.
- Rigorous derivation and proof of convergence for the new variational inference strategy.
- Demonstration of improved performance through extensive experiments on various datasets.
Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors
Summary
This paper addresses the limitations of conventional Nonlinear Probabilistic Latent Variable Models (NPLVMs) in soft sensor modeling, particularly the approximation error introduced by using amortized variational inference (AVI). The authors propose a novel model called KProxNPLVM, which relaxes the optimization objective to improve performance. They demonstrate that the approximation error arises from the finite-dimensional representation of the variational posterior, which can lead to reduced accuracy in soft sensor predictions. By employing the Wasserstein distance as a proximal operator, the authors derive a new variational inference strategy that sidesteps the approximation error. The paper includes a rigorous derivation of the optimization implementation for KProxNPLVM and proves the convergence of the algorithm. Extensive experiments on both synthetic and real-world industrial datasets validate the effectiveness of the proposed model, showing significant improvements in prediction accuracy compared to traditional methods.
Methodology
The authors develop KProxNPLVM by relaxing the optimization objective using the Wasserstein distance, which allows for a more accurate representation of the latent variable distributions. They provide a detailed derivation of the optimization process and establish the convergence of the proposed method, ensuring it effectively sidesteps the approximation error associated with traditional variational inference techniques.
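For intuition about the distance doing the work here: in one dimension, the empirical p = 1 Wasserstein distance between equal-size samples reduces to the mean absolute difference between order statistics. This sketch shows only that distance, not the paper's proximal operator or its variational inference machinery:

```python
def wasserstein_1d(xs, ys):
    # For equal-size 1-D samples, the p=1 Wasserstein distance is the mean
    # absolute difference between sorted samples (order statistics).
    assert len(xs) == len(ys)
    return sum(abs(a - b)
               for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

Unlike a KL term, this distance stays finite and informative even when the two distributions have little overlap, which is one reason it is attractive as a relaxation of the learning objective.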
Results
The experiments conducted on synthetic and real-world datasets show that KProxNPLVM significantly outperforms conventional NPLVMs in terms of prediction accuracy. The proposed model effectively captures the underlying data distributions, leading to better soft sensor performance in industrial applications.
Implications
The findings suggest that KProxNPLVM can enhance the reliability and accuracy of soft sensors in industrial settings, potentially leading to improved product quality, reduced energy consumption, and optimized operational costs. This approach may also inspire further research into alternative optimization strategies for probabilistic models in various applications.
Survival Meets Classification: A Novel Framework for Early Risk Prediction Models of Chronic Diseases
Interpretability
- Integration of survival analysis with classification techniques for chronic disease risk prediction.
- Development of models using only EMR data, excluding lab results, for early disease alerts.
- Survival models match or outperform traditional classifiers on predictive performance metrics.
- Novel explanation methodology for model outputs validated by medical experts.
Survival Meets Classification: A Novel Framework for Early Risk Prediction Models of Chronic Diseases
Summary
This paper presents a novel framework for early risk prediction models of chronic diseases by integrating survival analysis with classification techniques. The authors focus on five prevalent chronic diseases: diabetes, hypertension, chronic kidney disease (CKD), chronic obstructive pulmonary disease (COPD), and chronic ischemic heart disease (CHD). Unlike traditional models that operate independently in either survival analysis or classification, this study re-engineers survival models to effectively perform classification tasks. The proposed models utilize Electronic Medical Records (EMR) data, excluding laboratory results, to provide timely alerts for preventive interventions. The authors demonstrate that their survival models achieve comparable or superior performance to state-of-the-art classifiers like LightGBM and XGBoost, measured by accuracy, F1 score, and AUROC. Additionally, the models incorporate a novel explanation methodology validated by expert physicians, enhancing their clinical relevance and usability in real-world settings.
Methodology
The authors re-engineered survival analysis methods to enable classification, creating early disease risk prediction models based on de-identified EMR data. They employed machine learning techniques to analyze patient data and validate their models using performance metrics such as accuracy, F1 score, and AUROC. The SHAP algorithm was utilized for generating explanations of the model predictions.
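The core re-engineering idea, turning a survival model into a classifier, can be illustrated with the simplest survival model there is; this exponential-hazard sketch is purely illustrative and is not the paper's model:

```python
import math

def survival_prob(hazard, t):
    # Exponential survival model: S(t) = exp(-hazard * t)
    return math.exp(-hazard * t)

def risk_label(hazard, horizon, threshold=0.4):
    # Survival-to-classification bridge: the predicted probability of the
    # event occurring within the horizon is 1 - S(horizon); thresholding
    # it yields a binary early-warning label.
    risk = 1.0 - survival_prob(hazard, horizon)
    return int(risk >= threshold), risk
```

The same bridge applies to any fitted survival function: evaluate it at a clinically chosen horizon and threshold the implied event probability.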
Results
The proposed survival models demonstrated performance metrics that were comparable to or better than existing state-of-the-art models, including LightGBM and XGBoost. The models were validated by a panel of three expert physicians, ensuring that the features and risk factors used were clinically relevant.
Implications
The framework developed in this study has the potential to significantly improve early risk prediction and management of chronic diseases, enabling healthcare providers to implement timely preventive measures. By relying solely on EMR data, the models can be applied in real-world clinical settings without the need for laboratory tests, thus enhancing their practicality and accessibility.
Procedural Fairness via Group Counterfactual Explanation
- Formalizes procedural fairness as group counterfactual explanation invariance.
- Introduces Group Counterfactual Integrated Gradients (GCIG) as a training-time regularization method.
- GCIG minimizes cross-group variation in feature attributions to ensure consistent reasoning.
- Empirical results show reduced explanation disparity and competitive predictive performance.
Procedural Fairness via Group Counterfactual Explanation
Summary
This paper addresses the gap in machine learning fairness research by focusing on procedural fairness, which emphasizes the consistency of decision-making processes across different protected groups. While existing fairness metrics often concentrate on outcome-oriented fairness, such as Equalized Odds, they do not account for the reasoning behind model predictions. The authors introduce a novel framework called Group Counterfactual Integrated Gradients (GCIG), which enforces explanation invariance across groups during the training process. GCIG computes explanations based on multiple group conditional baselines and penalizes variations in these attributions, thereby promoting stable explanations across different groups. The empirical evaluation of GCIG against six state-of-the-art methods demonstrates its effectiveness in reducing cross-group explanation disparity while maintaining competitive predictive performance. This work highlights the importance of aligning model reasoning across groups as a means to enhance fairness beyond mere outcome parity.
Methodology
The authors propose GCIG, which integrates explanation-based constraints directly into the learning process. It computes feature attributions using Integrated Gradients relative to group conditional baselines and penalizes discrepancies in these attributions during training. This approach ensures that the model's reasoning remains consistent across protected groups, thereby promoting procedural fairness.
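For a linear model f(x) = w·x, Integrated Gradients has the closed form (x_i − baseline_i)·w_i, which makes the cross-group penalty easy to see. The sketch below illustrates the idea only; the actual GCIG regularizer and its weighting are defined in the paper:

```python
def ig_linear(w, x, baseline):
    # Integrated Gradients for a linear model f(x) = w . x reduces to
    # (x_i - baseline_i) * w_i exactly (no path integral needed).
    return [(xi - bi) * wi for xi, bi, wi in zip(x, baseline, w)]

def explanation_disparity(w, x, baselines):
    # GCIG-style penalty (sketch): mean squared deviation of per-group
    # attributions from their cross-group mean. Zero disparity means the
    # model's stated reasons for a prediction do not depend on the group.
    attrs = [ig_linear(w, x, b) for b in baselines]
    k, d = len(attrs), len(w)
    mean = [sum(a[i] for a in attrs) / k for i in range(d)]
    return sum((a[i] - mean[i]) ** 2 for a in attrs for i in range(d)) / k
```

Adding this disparity term to the training loss is what pushes attributions computed against different group-conditional baselines toward agreement.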
Results
The evaluation of GCIG against six established fairness methods reveals that it significantly reduces cross-group explanation disparity while maintaining competitive predictive accuracy. The results indicate that GCIG effectively aligns model reasoning across different groups, providing a practical approach to enhancing fairness in machine learning.
Implications
The findings suggest that incorporating procedural fairness into machine learning models can help build trust in AI systems by ensuring consistent reasoning across diverse groups. This approach may have applications in various domains, including finance, healthcare, and criminal justice, where fairness and transparency in decision-making are critical.
Chemical Reaction Networks Learn Better than Spiking Neural Networks
Theory
- CRNs can learn classification tasks without requiring hidden layers, unlike SNNs.
- The paper provides mathematical guarantees for the learning behavior of CRNs.
- Numerical experiments show CRNs outperform SNNs in classifying handwritten digits.
- The study highlights the potential of CRNs in machine learning applications.
Chemical Reaction Networks Learn Better than Spiking Neural Networks
Summary
This paper presents a mathematical proof demonstrating that chemical reaction networks (CRNs) without hidden layers can effectively solve tasks that require hidden layers in spiking neural networks (SNNs). The authors utilize deterministic mass-action kinetics to establish that a specific CRN can learn a classification task previously achievable by an SNN with hidden layers. They provide analytical regret bounds for the network's global behavior and analyze its asymptotic behavior and Vapnik–Chervonenkis dimension. Through numerical experiments, the authors confirm that the proposed CRN can classify handwritten digits from pixel images more accurately and efficiently than an SNN with hidden layers. This work suggests that CRNs may offer a more efficient learning framework compared to traditional neural networks, providing insights into how biological cells might learn through biochemical processes.
Methodology
The authors establish a structural analogy between stochastic chemical kinetics and SNNs, allowing for a comparative analysis of learning frameworks. They derive theoretical guarantees for a CRN modeled using continuous mass-action kinetics, which operates through a selection phase followed by a learning phase, updating weights based on expert aggregation algorithms.
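Deterministic mass-action kinetics, the dynamical substrate of the paper's CRN, is just a system of polynomial ODEs. A minimal Euler integration of the toy reaction A + B → C (not the paper's learning network) shows the mechanics:

```python
def simulate_mass_action(a, b, c, k, dt, steps):
    # Deterministic mass-action kinetics for A + B -> C with rate constant k:
    #   d[A]/dt = d[B]/dt = -k[A][B],   d[C]/dt = +k[A][B]
    # integrated with forward Euler.
    for _ in range(steps):
        flux = k * a * b
        a -= flux * dt
        b -= flux * dt
        c += flux * dt
    return a, b, c
```

In the paper's construction, concentrations of designated species play the role of classifier weights, and reactions of this form implement the selection and learning phases.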
Results
The CRN proposed in the study successfully classifies handwritten digits, achieving higher accuracy and efficiency than an SNN with hidden layers. The authors provide analytical regret bounds and demonstrate that the CRN's learning guarantees hold under weaker assumptions than those required for SNNs.
Implications
This research opens avenues for utilizing CRNs in machine learning, particularly in biochemical computing. It suggests that CRNs could serve as more efficient learning systems, potentially influencing the design of future machine learning models and providing insights into biological learning mechanisms.
Causal Matrix Completion under Multiple Treatments via Mixed Synthetic Nearest Neighbors
Theory
Efficient ML
Optimization
- Introduction of Mixed Synthetic Nearest Neighbors (MSNN) for causal matrix completion under multiple treatments.
- MSNN retains the statistical properties of SNN while improving sample efficiency in data-scarce environments.
- The method allows for the sharing of imputation coefficients across treatments based on shared latent factors.
- Empirical results show MSNN's effectiveness in estimating causal effects where traditional methods fail.
Causal Matrix Completion under Multiple Treatments via Mixed Synthetic Nearest Neighbors
Summary
This paper addresses the challenge of causal matrix completion in scenarios with multiple treatments, particularly when data is missing not at random (MNAR). The authors introduce the Mixed Synthetic Nearest Neighbors (MSNN) algorithm, which enhances the existing Synthetic Nearest Neighbors (SNN) approach by integrating information across different treatment levels. The MSNN method allows for the estimation of imputation coefficients using data from multiple treatments, thereby overcoming the limitations posed by data scarcity in certain treatment levels. The theoretical foundation of MSNN retains the finite-sample error bounds and asymptotic normality guarantees of SNN while significantly improving sample efficiency, especially in sparse treatment scenarios. Empirical evaluations demonstrate that MSNN can reliably estimate causal effects in data-scarce environments, outperforming traditional methods like SNN. This work contributes to the field of causal inference by formalizing the problem of entry-wise causal matrix completion under multiple MNAR treatment levels and providing a robust solution that leverages shared latent structures across treatments.
Methodology
The authors propose the MSNN algorithm, which utilizes Mixed Anchor Rows (MAR) and Mixed Anchor Columns (MAC) to estimate imputation coefficients from a combination of data across treatment levels. This approach is grounded in the assumption of shared latent row factors, allowing for effective cross-treatment identifiability. The methodology addresses the challenges of heterogeneous scales and variances by employing appropriate weighting techniques.
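The anchor-row/anchor-column idea can be seen in a drastically simplified rank-1 case (the actual SNN/MSNN estimators use higher-rank least squares over subgroups of donors, pooled across treatments in MSNN):

```python
def impute_rank1(matrix, i, j, anchor_row, anchor_col):
    # Under a rank-1 latent factor model M[r][c] = u[r] * v[c], a missing
    # entry is identified from one fully observed anchor row and column:
    #   M[i][j] = M[i][ac] * M[ar][j] / M[ar][ac]
    return (matrix[i][anchor_col] * matrix[anchor_row][j]
            / matrix[anchor_row][anchor_col])

# Rank-1 matrix with u = [1, 2, 3], v = [4, 5]; pretend M[2][1] is missing.
M = [[4.0, 5.0],
     [8.0, 10.0],
     [12.0, 15.0]]
```

MSNN's contribution is that when one treatment level leaves too few observed entries to form such anchors, the shared latent row factors let anchors be assembled from other treatments' data.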
Results
The MSNN algorithm demonstrates an exponential improvement in sample efficiency compared to the SNN method, particularly in scenarios with sparse treatment data. The expected number of usable data subgroups for MSNN significantly exceeds that of SNN, enhancing the feasibility of causal estimation. Empirical evaluations, including simulations and a case study on California's tobacco control policy, confirm MSNN's reliability in estimating effects for data-scarce treatments.
Implications
The findings suggest that MSNN can be a valuable tool for researchers and practitioners in fields requiring causal inference from observational data, such as economics and public policy. By effectively leveraging data across treatment levels, MSNN can improve decision-making processes where data scarcity is a significant barrier.
Entropy-Preserving Reinforcement Learning
- Entropy collapse in policy gradient algorithms can hinder exploration and lead to suboptimal policies.
- Active monitoring and control of entropy during training is essential for maintaining diversity in learned trajectories.
- The paper introduces REPO and ADAPO as mechanisms for effective entropy regulation.
- Maintaining a steady entropy trajectory correlates with improved performance in language model reasoning tasks.
Entropy-Preserving Reinforcement Learning
Summary
This paper addresses the issue of entropy collapse in policy gradient reinforcement learning (RL) algorithms, which can limit the diversity of explored trajectories and lead to premature convergence to suboptimal solutions. The authors argue for the active monitoring and control of entropy during training to maintain exploration capabilities. They analyze how various policy gradient objectives affect entropy dynamics and identify empirical factors that influence entropy behavior. The paper introduces two main mechanisms for entropy control: REPO, which modifies the advantage function to regulate entropy, and ADAPO, an adaptive asymmetric clipping approach. These methods help maintain diversity in the learned policies, resulting in improved performance and better adaptability for sequential learning in new environments. The authors provide theoretical insights into the relationship between entropy and performance, emphasizing that the trajectory of entropy throughout training is a critical factor for achieving optimal results.
Methodology
The authors conducted a formal analysis of leading policy gradient objectives to understand their impact on entropy dynamics. They identified critical factors affecting entropy behavior, such as numerical precision and implementation details. The proposed methods, REPO and ADAPO, were developed to actively control entropy levels during training, using adaptive controllers to ensure target entropy is maintained.
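The quantity being monitored and controlled is the Shannon entropy of the policy's action distribution. The sketch below pairs it with a simple proportional controller on an entropy coefficient; this is an illustrative stand-in, not the paper's exact REPO or ADAPO update rule:

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a policy's action distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def adapt_coef(coef, current_entropy, target_entropy, lr=0.1):
    # Illustrative proportional controller: strengthen the entropy-shaping
    # term when entropy dips below target, relax it when above.
    return max(0.0, coef + lr * (target_entropy - current_entropy))
```

Tracking `entropy` over training and feeding it through a controller like `adapt_coef` is the "active monitoring and control" loop the paper argues for, as opposed to a fixed entropy bonus.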
Results
The proposed entropy-preserving methods, REPO and ADAPO, demonstrated significant improvements in performance across various benchmarks, achieving state-of-the-art results on the AppWorld dataset with 79% Test Normal and 71% Test Challenge accuracy. The analysis showed that models trained with these methods maintained higher entropy throughout training, leading to better generalization and performance in sequential learning tasks.
Implications
The findings suggest that actively managing entropy in reinforcement learning can enhance the performance of language models and other RL applications. This approach could be beneficial in scenarios requiring robust exploration and adaptability, such as robotics, game playing, and interactive AI systems.
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
NLP
Large Language Models
Optimization
- Introduces a feature-matching loss for fine-tuning language models that targets sequence-level statistics.
- Proposes Energy-Based Fine-Tuning (EBFT) as a practical method to optimize the feature-matching loss.
- EBFT outperforms traditional supervised fine-tuning (SFT) and matches reinforcement learning with verifiable rewards (RLVR) in downstream tasks.
- Demonstrates lower validation cross-entropy compared to both SFT and RLVR.
Read more
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
Summary
This paper addresses the limitations of traditional cross-entropy (CE) training in language models, which focuses on next-token prediction and can lead to a distribution shift during deployment. The authors propose a novel feature-matching objective for fine-tuning language models that emphasizes sequence-level statistics of the completion distribution, providing dense semantic feedback without the need for task-specific verifiers. They introduce Energy-Based Fine-Tuning (EBFT), a method that employs strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, allowing for efficient feature extraction and on-policy policy-gradient updates. The theoretical framework connects EBFT to KL-regularized feature-matching and energy-based modeling. Empirical results demonstrate that EBFT matches the performance of reinforcement learning with verifiable rewards (RLVR) while outperforming supervised fine-tuning (SFT) in downstream accuracy and achieving lower validation cross-entropy across various tasks, including Q&A coding, unstructured coding, and translation.
Methodology
The authors developed Energy-Based Fine-Tuning (EBFT) which optimizes a feature-matching loss by generating multiple rollouts from nested prefixes using strided block-parallel sampling. This method allows for efficient feature extraction and employs a REINFORCE-style gradient estimator for on-policy updates, targeting the sequence-level statistics of the rollout distribution.
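The actual feature extractor, sampler, and loss used by EBFT are beyond this summary. As a toy illustration of the core estimator, a REINFORCE-style gradient of a feature-matching objective over a categorical distribution (standing in for the rollout distribution) could be:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_matching_grad(logits, phi, target_mean, n_samples=4096):
    """REINFORCE estimate of d/dlogits ||E_pi[phi(x)] - target_mean||^2.

    logits: (K,) parameters of a categorical "policy".
    phi:    (K, D) feature vector for each outcome.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()
    xs = rng.choice(len(p), size=n_samples, p=p)
    mean_phi = phi[xs].mean(axis=0)            # Monte-Carlo E[phi]
    residual = mean_phi - target_mean          # (D,)
    # Per-sample "reward": alignment of phi(x) with the residual; the
    # score function of a softmax categorical is (one_hot(x) - p).
    # Reusing the same samples for the residual is slightly biased,
    # which is fine for a sketch.
    reward = phi[xs] @ (2.0 * residual)        # (n_samples,)
    one_hot = np.eye(len(p))[xs]
    grad = ((one_hot - p) * reward[:, None]).mean(axis=0)
    return grad, residual
```

Gradient descent on this estimate pushes the model's expected features toward `target_mean`, with no per-token supervision involved.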
Results
EBFT achieved lower feature-matching loss across various completion lengths compared to SFT and RLVR, indicating better calibration of the model's rollout distribution. It also demonstrated superior downstream accuracy in tasks such as Q&A coding, unstructured coding, and translation, while maintaining a lower validation cross-entropy.
Implications
The proposed EBFT method could enhance the performance of language models in open-ended tasks where traditional training methods struggle with distribution shifts. This approach may lead to more robust and reliable language generation systems that better align with real-world applications.
Client-Conditional Federated Learning via Local Training Data Statistics
Federated Learning
- Proposes a method that conditions a global model on PCA statistics of local training data.
- Achieves performance comparable to an Oracle baseline across various heterogeneity types.
- Demonstrates unique robustness to data sparsity, maintaining accuracy with reduced client data.
- Avoids the need for additional communication and complex client clustering.
Read more
Client-Conditional Federated Learning via Local Training Data Statistics
Summary
This paper addresses the challenges of federated learning (FL) in the presence of data heterogeneity, which can severely impact model performance. Traditional methods like FedAvg fail to account for client differences, leading to significant accuracy drops as the number of client clusters increases. The author proposes a novel approach that conditions a single global model on locally computed PCA statistics of each client's training data, thereby avoiding the need for additional communication or complex cluster discovery. The method is evaluated across 97 configurations involving four types of heterogeneity (label shift, covariate shift, concept shift, and combined), using four datasets (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100) and seven baseline methods. The results demonstrate that the proposed method matches or exceeds the performance of an Oracle baseline that knows true cluster assignments, particularly excelling in scenarios with combined heterogeneity. Additionally, it shows remarkable robustness to data sparsity, maintaining accuracy even as client data decreases significantly.

Methodology
The proposed method involves each client computing PCA eigenvalues from their local training data, which serve as a low-dimensional representation of the data distribution. This PCA vector is concatenated with the model's input before the fully-connected layers, allowing the global model to adapt its predictions based on the local data characteristics without maintaining separate models or requiring cluster discovery.
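The client-side statistic is cheap to compute. A sketch, where the number of retained eigenvalues and the concatenation point are assumptions rather than the paper's exact choices:

```python
import numpy as np

def pca_statistics(X, k=8):
    """Top-k PCA eigenvalues of a client's local training data.

    X: (n_samples, n_features) local dataset. The covariance eigenvalues
    act as a compact, communication-free signature of the local data
    distribution.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / max(len(X) - 1, 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # sorted descending
    return eigvals[:k]

def conditioned_input(features, pca_vec):
    # The paper concatenates the PCA vector with the activations feeding
    # the fully-connected head; the exact placement is an assumption here.
    return np.concatenate([features, pca_vec])
```

Because the statistic is computed locally and only changes the model's input, no extra communication rounds or cluster-discovery steps are needed.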
Results
The proposed method consistently matches or exceeds the Oracle baseline across all experimental configurations, achieving a 1-6% improvement in scenarios with combined heterogeneity. It also demonstrates exceptional performance in sparsity scenarios, maintaining near-constant accuracy as client data decreases from 6,000 to 200 samples, while other methods experience significant degradation.
Implications
This approach has the potential to enhance federated learning applications in environments with diverse and sparse data distributions, making it suitable for real-world scenarios where data privacy and communication efficiency are critical. It could be particularly beneficial in sectors like healthcare, finance, and IoT, where data heterogeneity is common.
Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference
- PFNs can exhibit prior-induced confounding bias, affecting their frequentist consistency.
- A one-step posterior correction (OSPC) is proposed to recalibrate PFNs and restore consistency.
- The OSPC leads to a semi-parametric Bernstein-von Mises theorem for calibrated PFNs.
- Martingale posteriors are utilized to implement the OSPC effectively.
Read more
Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference
Summary
This paper investigates the frequentist consistency of prior-data fitted networks (PFNs) in estimating the average treatment effect (ATE) for causal inference. While PFNs have demonstrated strong empirical performance, their reliability in providing uncertainty quantification consistent with classical frequentist estimators is questioned. The authors identify that existing PFNs can suffer from prior-induced confounding bias, where the prior does not asymptotically yield to data, hindering frequentist consistency. To address this issue, they propose a one-step posterior correction (OSPC) calibration procedure that restores frequentist consistency and establishes a semi-parametric Bernstein-von Mises theorem for calibrated PFNs. The OSPC is implemented using martingale posteriors, allowing for the recovery of functional nuisance posteriors necessary for calibration. Through multiple experiments, the calibrated PFNs demonstrate ATE uncertainty that aligns asymptotically with frequentist uncertainty and remains well-calibrated in finite samples compared to other Bayesian ATE estimators.
Methodology
The authors analyze the frequentist consistency of PFNs by first identifying the prior-induced confounding bias in existing PFNs. They then propose a calibration procedure (OSPC) based on efficient influence functions to restore consistency. The implementation of OSPC is achieved through martingale posteriors, which allow for the recovery of necessary posteriors for nuisance functions. The methodology is validated through multiple semi-synthetic experiments comparing the performance of calibrated PFNs with classical frequentist estimators.
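OSPC builds on the efficient influence function of the ATE. The classical one-step (AIPW-style) correction it adapts can be sketched as follows; the paper applies an analogous update to posterior draws from the PFN rather than to a single point estimate:

```python
import numpy as np

def one_step_ate(y, t, propensity, mu1, mu0):
    """One-step correction of a plug-in ATE estimate.

    y: outcomes, t: binary treatments, propensity: P(T=1|X),
    mu1/mu0: outcome-model predictions under treatment/control.
    The plug-in estimate is debiased by adding the sample mean of the
    efficient influence function.
    """
    plug_in = np.mean(mu1 - mu0)
    eif = (t / propensity * (y - mu1)
           - (1 - t) / (1 - propensity) * (y - mu0))
    return plug_in + eif.mean()
```

When the outcome models are exactly correct the correction term vanishes and the plug-in estimate is returned unchanged; otherwise the influence-function term removes the first-order bias.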
Results
The study finds that existing PFNs, when used as Bayesian ATE estimators, are prone to prior-induced confounding bias, which prevents frequentist consistency. The proposed OSPC successfully restores this consistency, yielding ATE posteriors that asymptotically match the normal distribution of frequentist estimators. The experiments demonstrate that the uncertainty estimates from OSPC-calibrated PFNs are well-calibrated in finite samples and align with frequentist uncertainty as sample sizes increase.
Implications
The findings suggest that PFNs can be effectively utilized for causal inference in various fields such as marketing, public policy, and medicine, provided they are calibrated appropriately. The proposed OSPC method enhances the reliability of uncertainty quantification in causal estimates, making PFNs a more robust tool for decision-making based on observational data.
Separable neural architectures as a primitive for unified predictive and generative intelligence
- Introduction of Separable Neural Architectures (SNA) as a new neural primitive.
- SNAs exploit latent factorisable structures in various domains, enhancing predictive and generative capabilities.
- Demonstrated effectiveness across multiple applications, including reinforcement learning and turbulent flow modeling.
- Establishes a structural analogy between chaotic dynamics and linguistic autoregression.
Read more
Separable neural architectures as a primitive for unified predictive and generative intelligence
Summary
This paper introduces the concept of Separable Neural Architectures (SNA), a novel framework that formalizes a representational class capable of unifying predictive and generative intelligence across various domains such as physics, language, and perception. Traditional monolithic neural architectures often overlook the latent factorisable structures inherent in these systems. SNAs address this by employing a rank- and interaction-controlled operator that allows for the construction of high-dimensional mappings from low-arity components, thereby imposing a structural inductive bias. This approach not only facilitates the modeling of chaotic spatiotemporal dynamics but also reveals a structural analogy between such dynamics and linguistic autoregression. The authors demonstrate the versatility of SNAs through applications in four distinct domains: reinforcement learning for autonomous waypoint navigation, inverse generation of multifunctional microstructures, distributional modeling of turbulent flow, and neural language modeling. The results indicate that SNAs can effectively unify deterministic and distributional representations, providing a domain-agnostic primitive for intelligent systems.
Methodology
The paper presents a formal framework for SNAs that combines low-arity learnable components governed by an interaction tensor. The architecture allows for control over interaction order and tensor rank, enabling the modeling of high-dimensional mappings. The authors validate the approach through various applications, including KHRONOS, a standalone model that integrates predictive and generative intelligence, and Variational Separable Neural Architectures (VSNAs) for operator-driven learning.
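At its simplest, a separable construction builds a multivariate map from sums of products of low-arity components. A rank-R sketch for two variables, omitting the interaction tensor that controls which components may combine:

```python
import numpy as np

def separable_map(x, y, G, H):
    """Rank-R separable approximation f(x, y) = sum_r G_r(x) * H_r(y).

    G, H: lists of callables, one pair per rank component. Building a
    high-dimensional mapping from these low-arity pieces is the kind of
    factorised construction an SNA formalises; the full framework also
    controls interaction order via an interaction tensor.
    """
    return sum(g(x) * h(y) for g, h in zip(G, H))
```

Any function of the form x*y + sin(x)*cos(y) is representable exactly at rank 2, while the rank bound acts as the structural inductive bias.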
Results
The authors demonstrate that SNAs can effectively model complex systems with fewer parameters while maintaining high accuracy in predictions and generative tasks. KHRONOS, as an example of SNA, shows the ability to perform real-time predictions and generative inversions, while VSNAs successfully learn high-dimensional fields from governing operators. The results across the four domains highlight the versatility and effectiveness of SNAs in unifying predictive and generative modeling.
Implications
The introduction of SNAs has significant implications for the development of intelligent systems, allowing for more efficient modeling of complex phenomena across various fields. This framework can enhance the performance of applications in autonomous navigation, material design, and natural language processing, potentially leading to advancements in AI that leverage the underlying structures of the systems being modeled.
H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code
Large Language Models
Generative Models
NLP
- Development of a specialized training corpus for embedded systems code using repository-datasheet pairs.
- Successful application of continual pretraining to adapt a large language model for a niche domain.
- Significant performance improvements in perplexity and code generation accuracy compared to existing models.
- Open-source release of the model checkpoint to facilitate further research in embedded systems LLMs.
Read more
H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code
Summary
The paper presents H2LooP Spark Preview, a continual pretraining (CPT) pipeline designed to adapt the OLMo-3-7B language model for low-level embedded systems programming. The authors highlight the limitations of existing large language models (LLMs) in generating code for specialized domains like embedded systems, which involve unique hardware interactions and vendor-specific APIs. To address this, they constructed a large-scale training corpus from 818 repository-datasheet pairs, totaling 76.4 GB of data across 117 manufacturers. The training utilized BF16 LoRA with Rank-Stabilized scaling on NVIDIA H100 GPUs, with extensive hyperparameter tuning across 1,400 runs. The results showed significant reductions in perplexity and improved generative code completion accuracy, with the model outperforming larger competitors in several categories. The authors also released a production checkpoint for community research, emphasizing the potential of CPT for enhancing LLMs in specialized technical tasks.
Methodology
The methodology involved constructing a large-scale training corpus from embedded systems data, utilizing a continual pretraining approach with BF16 LoRA and Rank-Stabilized scaling. Hyperparameter optimization was performed through Bayesian exploration and grid searches across multiple projects, focusing on LoRA rank, learning rates, and data mixtures.
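For orientation, rank-stabilized LoRA replaces the original alpha/r adapter scaling with alpha/sqrt(r). A minimal sketch of the adapter arithmetic, with illustrative shapes and alpha rather than the paper's configuration:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA forward pass with rank-stabilized scaling.

    x: (F_in,) input, W: (F_out, F_in) frozen base weight,
    A: (r, F_in), B: (F_out, r) trainable low-rank factors.
    Rank-stabilized LoRA scales the update by alpha / sqrt(r) instead
    of the original alpha / r.
    """
    r = A.shape[0]
    scale = alpha / np.sqrt(r)
    return W @ x + scale * (B @ (A @ x))
```

With the standard zero-initialization of B, training starts exactly at the frozen base model and the low-rank update grows from there.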
Results
The H2LooP Spark Preview model achieved a 70.4% reduction in in-domain perplexity and a 66.1% reduction on held-out repositories. It surpassed larger models like Claude Opus 4.6 and Qwen3-Coder-30B in generative code completion benchmarks across 8 out of 13 embedded categories, demonstrating the effectiveness of targeted continual pretraining.
Implications
The findings suggest that continual pretraining can significantly enhance the capabilities of LLMs in specialized domains, making them more applicable for tasks in embedded systems programming. This approach could lead to improved automation and efficiency in software development for hardware-specific applications.
On the Role of Reversible Instance Normalization
Time Series
- Identifies three key challenges in normalization for time series forecasting: temporal, spatial, and conditional distribution shifts.
- Conducts ablation studies on RevIN, revealing redundancies and detrimental components.
- Challenges the effectiveness of RevIN in mitigating distribution shifts.
- Proposes improvements for RevIN to enhance robustness and generalization in forecasting models.
Read more
On the Role of Reversible Instance Normalization
Summary
This paper investigates the role of Reversible Instance Normalization (RevIN) in time series forecasting, addressing the inadequacies of standard normalization techniques in handling distribution shifts. The authors identify three primary challenges: temporal input distribution shift, spatial input distribution shift, and conditional output distribution shift. Through extensive ablation studies, they analyze the components of RevIN, revealing that some are redundant or detrimental to its performance. The findings suggest that while RevIN aims to mitigate distribution shifts, its effectiveness can be improved by refining its components. The paper contributes to a better understanding of normalization strategies in time series forecasting and proposes new perspectives for enhancing the robustness and generalization of forecasting models.
Methodology
The authors conducted ablation studies on the RevIN method using standard forecasting benchmarks to evaluate the necessity of its components. They analyzed the effects of instance normalization on data distributions and compared RevIN's performance against traditional normalization techniques.
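For reference, the transform under ablation is compact. A minimal sketch of RevIN's normalize/denormalize pair, omitting the learnable affine parameters of the full method:

```python
import numpy as np

class RevIN:
    """Reversible instance normalization (sketch, no learnable affine).

    Each series instance is normalized by its own statistics before the
    forecaster, and the forecast is de-normalized with the same
    statistics, making the transform reversible per instance.
    """
    def normalize(self, x, eps=1e-5):
        # x: (batch, length) raw input windows
        self.mean = x.mean(axis=1, keepdims=True)
        self.std = x.std(axis=1, keepdims=True) + eps
        return (x - self.mean) / self.std

    def denormalize(self, y):
        # y: (batch, horizon) model output in normalized space
        return y * self.std + self.mean
```

The ablations in the paper target individual components of this pipeline (and the affine step omitted above) to test which ones actually help under each type of distribution shift.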
Results
The ablation studies indicated that several components of RevIN do not contribute positively to its performance and may even hinder it. The authors provided evidence that standard normalization methods fail to address the identified distribution shifts effectively, highlighting the limitations of RevIN in its current form.
Implications
The findings suggest that improvements to normalization techniques like RevIN could lead to better performance in time series forecasting models. This has implications for various applications, including energy consumption forecasting, traffic prediction, and other domains where time series data is prevalent.
Harnessing Data Asymmetry: Manifold Learning in the Finsler World
- Introduction of Finsler geometry to capture asymmetric dissimilarities in manifold learning.
- Development of a Finsler manifold learning pipeline that enhances existing asymmetric embedding techniques.
- Experimental validation showing superior performance of Finsler embeddings over traditional Euclidean methods.
- Revelation of hidden structures and information in data that symmetric methods fail to capture.
Read more
Harnessing Data Asymmetry: Manifold Learning in the Finsler World
Summary
This paper addresses the limitations of traditional manifold learning methods that rely on symmetric Riemannian geometry, which often overlook valuable asymmetric information inherent in data. The authors propose a novel approach using Finsler geometry, an asymmetric generalization of Riemannian geometry, to construct asymmetric dissimilarities and perform embeddings in a Finsler space. This new framework broadens the applicability of existing asymmetric embedding techniques, such as Finsler t-SNE and Finsler Umap, to a wider range of data types. The authors demonstrate that their Finsler manifold learning pipeline not only captures the underlying structure of high-dimensional data more effectively but also reveals additional information lost in traditional symmetric approaches. Through experiments on synthetic and real datasets, the proposed methods consistently outperform Euclidean counterparts, showcasing the advantages of embracing data asymmetry in manifold learning.
Methodology
The authors propose a three-stage pipeline for Finsler manifold learning: (1) Data construction to compute asymmetric dissimilarities, (2) Embedding definition to establish a Finsler space for embedding, and (3) Optimization to fit the embedding to the asymmetric dissimilarities. They modernize existing asymmetric embedding techniques, specifically adapting t-SNE and Umap to work within the Finsler framework.
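The paper's data-driven dissimilarities are not detailed in this summary, but the standard Randers metric, a textbook Finsler example, shows the kind of asymmetry a Finsler length encodes:

```python
import numpy as np

def randers_length(v, A=None, b=None):
    """Randers metric F(v) = sqrt(v^T A v) + b^T v.

    A standard Finsler example: an asymmetric length with
    F(v) != F(-v) whenever b != 0 (A must be positive definite and
    b small enough, |b|_A < 1, for F to stay positive).
    """
    v = np.asarray(v, dtype=float)
    if A is None:
        A = np.eye(len(v))
    if b is None:
        b = np.zeros(len(v))
    return float(np.sqrt(v @ A @ v) + b @ v)
```

Traveling "with" the drift vector b is cheaper than traveling against it, which is exactly the kind of directional information a symmetric Riemannian metric throws away.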
Results
The results indicate that the Finsler manifold learning pipeline reveals valuable information, such as density hierarchies, that traditional symmetric methods overlook. In extensive classification benchmarks, the proposed methods consistently outperform Euclidean baselines, demonstrating the superior quality of the embeddings produced by the Finsler approach.
Implications
The findings suggest that embracing data asymmetry can significantly enhance the performance of manifold learning techniques, leading to better data analysis and visualization outcomes. This approach opens new avenues for applying asymmetric methods to a broader range of datasets, potentially improving insights in fields such as computer vision, bioinformatics, and social network analysis.
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
- Introduction of the Latent Color Subspace (LCS) in the VAE latent space of the FLUX model.
- Demonstration that color can be represented in a three-dimensional subspace closely resembling the HSL color model.
- Development of a training-free method for color intervention based on the LCS interpretation.
- Validation of the LCS through mid-generation color observation and targeted interventions.
Read more
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
Summary
This paper addresses the challenge of achieving fine-grained control over generated images in text-to-image (T2I) generation models, specifically focusing on the color representation within the Variational Autoencoder (VAE) latent space of the FLUX model. The authors introduce the concept of the Latent Color Subspace (LCS), which reveals a three-dimensional structure in the VAE latent space that corresponds to the Hue, Saturation, and Lightness (HSL) color model. By interpreting color in this way, the authors demonstrate that it is possible to predict and control color in generated images without additional training. They validate their approach by showing that the LCS can be used for targeted interventions in the image generation process, allowing for fine control over the colors of specific objects. This work not only enhances the understanding of color encoding in T2I models but also provides a training-free method for localized color intervention, simplifying the process of image generation and editing.
Methodology
The authors analyze the VAE latent space of the FLUX model to identify a three-dimensional subspace that represents color in terms of Hue, Saturation, and Lightness (HSL). They utilize a mechanistic understanding of how image patches evolve during the generation process to construct a functional interpretation of color. This interpretation allows for lightweight transformations in the latent space, enabling direct observation and intervention without the need for the VAE decoder.
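The basis of the identified subspace is specific to FLUX's VAE and is not reproduced here; structurally, the training-free intervention amounts to a linear shift along an assumed orthonormal color basis:

```python
import numpy as np

def shift_along_subspace(latent, basis, delta):
    """Move a latent vector along an identified low-dimensional subspace.

    basis: (3, D) orthonormal rows spanning the (assumed) color subspace;
    delta: (3,) desired change in subspace coordinates (e.g. H, S, L).
    The actual LCS basis would have to be identified from the model, as
    in the paper; this only shows the form of the intervention.
    """
    return latent + basis.T @ delta
```

Because the shift is a fixed linear map, it can be applied mid-generation to selected image patches without invoking the VAE decoder or any retraining.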
Results
The study successfully identifies a clear structure for color representation in the VAE latent space, confirming that color can be manipulated effectively using the LCS. The authors demonstrate that their method allows for precise control over the colors of specific objects in generated images, validating the accuracy and causal nature of their interpretation.
Implications
The findings have significant implications for the field of image generation, particularly in enhancing the controllability and interpretability of T2I models. The training-free approach to color manipulation can simplify workflows in applications requiring precise image editing, such as graphic design, gaming, and virtual reality. Additionally, this work contributes to a deeper understanding of the internal mechanisms of generative models, fostering trust and reliability in their outputs.
Deep Learning Network-Temporal Models For Traffic Prediction
Time Series
Graph Learning
Large Language Models
- Introduction of two deep learning models for multivariate time series prediction in network traffic.
- The GAT model captures both temporal and topological correlations, while the LLM model excels in generalization.
- Extensive performance evaluations demonstrate the superiority of the LLM model over traditional methods.
- Insights into correlation variability and prediction distribution discrepancies are provided.
Read more
Deep Learning Network-Temporal Models For Traffic Prediction
Summary
This paper addresses the limitations of existing statistical and shallow machine learning models in predicting multivariate time series data, particularly in the context of network traffic. The authors propose two novel deep learning architectures: a customized network-temporal graph attention network (GAT) and a fine-tuned multi-modal large language model (LLM) enhanced with a clustering approach. These models are designed to capture both temporal patterns and topological correlations in network data. The study compares the performance of these models against a Long Short-Term Memory (LSTM) model, which has previously outperformed statistical methods. Through extensive training on a real-world network dataset, the LLM-based model exhibits superior prediction and generalization capabilities, while the GAT model effectively reduces prediction variance across different time series and forecasting horizons. The paper also provides insights into correlation variability and prediction distribution discrepancies, emphasizing the importance of model structure and hyperparameter optimization in achieving reliable performance in network traffic prediction.
Methodology
The authors developed two deep learning models: a customized spatial-temporal graph attention network (ST-GAT) to capture temporal and topological correlations, and a fine-tuned multi-modal large language model (LLM) with a clustering pre-training step. They conducted comprehensive performance evaluations on a real-world network traffic dataset, optimizing model structures and hyperparameters.
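The customized ST-GAT is not specified in detail in this summary; the single-head graph-attention update it builds on can be sketched as follows, with all weights hypothetical:

```python
import numpy as np

def gat_layer(h, adj, W, a, leak=0.2):
    """Single-head graph attention layer (GAT mechanism sketch).

    h: (N, F) node features, adj: (N, N) adjacency with self-loops,
    W: (F, F2) projection, a: (2*F2,) attention vector.
    """
    z = h @ W                                   # projected features
    F2 = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into two dot products
    e = (z @ a[:F2])[:, None] + (z @ a[F2:])[None, :]
    e = np.where(e > 0, e, leak * e)            # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)           # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # row-wise softmax
    return alpha @ z                            # attention-weighted mix
```

Each node aggregates its neighbors with learned, data-dependent weights, which is how the network-temporal model captures topological correlations alongside the temporal ones.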
Results
The LLM-based model demonstrated superior overall prediction and generalization performance compared to the LSTM model, while the GAT model effectively reduced prediction variance across time series and different forecasting horizons. Detailed analyses revealed important insights into correlation variability and prediction distribution discrepancies.
Implications
The findings suggest that deep learning models, particularly LLMs and GATs, can significantly enhance the accuracy and reliability of network traffic predictions, which is crucial for effective network management and control. This research could inform future developments in AI-driven network analytics and traffic engineering.
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Reinforcement Learning
Large Language Models
Optimization
- HAPO addresses the dilemma of sparse rewards in reinforcement learning by integrating hindsight mechanisms.
- The Synthetic Success Injection (SSI) operator allows for dynamic anchoring to teacher demonstrations during failures.
- A Thompson sampling-inspired gating mechanism governs the intervention, enabling a self-paced learning curriculum.
- HAPO demonstrates asymptotic consistency, recovering unbiased gradients as the policy improves.
Read more
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Summary
This paper introduces Hindsight-Anchored Policy Optimization (HAPO), a novel framework designed to enhance reinforcement learning (RL) in sparse reward environments by integrating reinforcement learning with supervised fine-tuning (SFT). The authors identify the challenges faced by existing group-based methods like Group Relative Policy Optimization (GRPO), which struggle with high variance and distributional bias in sparse-reward settings. HAPO employs a Synthetic Success Injection (SSI) operator that selectively anchors optimization to teacher demonstrations during failure modes, guided by a Thompson sampling-inspired gating mechanism. This dynamic approach allows the model to adaptively switch between leveraging SFT guidance and pure RL exploration based on its confidence level. The theoretical foundation of HAPO ensures asymptotic consistency, allowing the model to recover unbiased on-policy gradients as it improves. Preliminary evaluations demonstrate that HAPO achieves competitive performance on mathematical reasoning benchmarks, outperforming static mixed-policy methods significantly. The paper highlights the importance of adaptive integration of RL and SFT to mitigate issues like catastrophic forgetting and distribution drift, ultimately enhancing the reasoning capabilities of large language models.
Methodology
The methodology involves the introduction of HAPO, which utilizes the SSI operator to dynamically anchor optimization to teacher demonstrations during low-confidence scenarios. The gating mechanism inspired by Thompson sampling allows the model to adaptively determine when to leverage SFT versus RL exploration, addressing distribution drift and catastrophic forgetting.
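The precise prior and trigger used by HAPO's gate are not given in this summary. A Beta-Bernoulli Thompson-sampling gate with the described self-paced behavior, under an assumed prior and threshold, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonGate:
    """Beta-Bernoulli gate deciding when to inject teacher demonstrations.

    Tracks the policy's recent success rate; sampling from the Beta
    posterior makes intervention frequent early (low confidence) and
    rare once the policy improves. The Beta(1, 1) prior and the 0.5
    threshold are assumptions, not the paper's settings.
    """
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta   # Beta posterior parameters

    def update(self, success):
        self.alpha += float(success)
        self.beta += 1.0 - float(success)

    def inject_teacher(self):
        # Sample a success-probability estimate; intervene when it is low.
        return rng.beta(self.alpha, self.beta) < 0.5
```

Because injection probability decays as successes accumulate, the SSI operator fades out on its own, which is how the on-policy gradient becomes unbiased in the limit.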
Results
Preliminary evaluations indicate that HAPO achieves competitive performance on mathematical reasoning benchmarks, matching LUFFY on AIME2024 and outperforming it by 2.4 points on MATH-500.
Implications
The findings suggest that HAPO could enhance the training of large language models in environments with sparse rewards, potentially improving their reasoning capabilities and adaptability. This approach may lead to more robust models capable of better generalization beyond teacher distributions.
Teleodynamic Learning: A New Paradigm for Interpretable AI
- Teleodynamic Learning shifts the focus from static optimization to dynamic co-evolution of structure, parameters, and resources.
- The approach models learning as navigation in a constrained dynamical system with coupled inner and outer dynamics.
- Three central phenomena emerge: stabilization without external criteria, phase-structured behavior, and geometry-based convergence guarantees.
- The Distinction Engine (DE11) demonstrates the effectiveness of this paradigm, achieving high accuracy on benchmark datasets.
Read more
Teleodynamic Learning: A New Paradigm for Interpretable AI
Summary
This paper introduces Teleodynamic Learning, a novel paradigm in machine learning that shifts the focus from static objective minimization to the emergence and stabilization of functional organization under constraints. The authors argue that traditional optimization frameworks fail to capture the dynamic interplay between structure, parameters, and resources in adaptive systems. Instead, they propose modeling learning as navigation within a constrained dynamical system characterized by two coupled timescales: inner dynamics (continuous parametric adaptation) and outer dynamics (discrete structural modification). This approach leads to three significant phenomena: emergent stabilization without external stopping criteria, phase-structured behavior identifiable through dynamical signatures, and convergence guarantees based on the geometry of the parameter manifold. The Distinction Engine (DE11) is presented as an instantiation of this paradigm, achieving competitive accuracy on standard benchmarks while generating interpretable logical rules that arise from the system's dynamics. Overall, Teleodynamic Learning offers a unified framework for understanding learning as a co-evolution of structure, parameters, and resources, paving the way for more adaptive and interpretable AI systems.
Methodology
The authors model learning as navigation in a constrained dynamical system with two coupled timescales, utilizing concepts from information geometry and tropical optimization. The Distinction Engine (DE11) is developed as a teleodynamic learner that incorporates these principles.
Results
DE11 achieves 93.3% test accuracy on the IRIS dataset, 92.6% on the WINE dataset, and 94.7% on the Breast Cancer dataset, outperforming traditional logistic regression. The system generates interpretable logical rules that emerge from its internal dynamics.
Implications
Teleodynamic Learning has the potential to revolutionize the field of AI by providing a framework that better aligns with biological learning processes, enhancing the interpretability and adaptability of AI systems. This paradigm could lead to more robust and efficient learning algorithms that are capable of evolving their structures and parameters in real-time.
Effective Resistance Rewiring: A Simple Topological Correction for Over-Squashing
Graph Learning
- Introduces Effective Resistance Rewiring (ERR) to address over-squashing in GNNs.
- Utilizes effective resistance as a global measure to identify structural bottlenecks.
- Demonstrates a trade-off between over-squashing and oversmoothing in GNNs.
- Combines ERR with normalization techniques to improve model performance.
Read more
Effective Resistance Rewiring: A Simple Topological Correction for Over-Squashing
Summary
The paper addresses the challenge of over-squashing in Graph Neural Networks (GNNs), where information from an expanding neighborhood must pass through limited structural bottlenecks, hindering long-range dependencies. The authors propose Effective Resistance Rewiring (ERR), a novel topology correction strategy that utilizes effective resistance as a global signal to identify and mitigate these bottlenecks. ERR iteratively rewires the graph by adding edges between node pairs with the highest resistance while removing those with minimal resistance, thus enhancing weak communication pathways while adhering to a fixed edge budget. The methodology is parameter-free apart from the rewiring budget and relies on a single global measure that aggregates all paths between node pairs. The authors evaluate the predictive performance of ERR on Graph Convolutional Networks (GCNs) and analyze its impact on message propagation by examining cosine similarity between node embeddings across layers. Experiments conducted on both homophilic (Cora, CiteSeer) and heterophilic (Cornell, Texas) graphs, including directed settings with DirGCN, reveal a trade-off between over-squashing and oversmoothing, where increased connectivity can lead to accelerated representation mixing in deeper models. The combination of ERR with normalization techniques, such as PairNorm, stabilizes this trade-off and enhances performance, particularly in heterophilic contexts.
Methodology
The authors developed Effective Resistance Rewiring (ERR), which rewires the graph by adding edges between node pairs with the highest effective resistance and removing edges with minimal resistance. This approach is parameter-free except for the rewiring budget and relies on a global measure of connectivity to enhance information flow in GNNs. The impact of rewiring on message propagation is analyzed through cosine similarity of node embeddings across layers.
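The resistance computation and the add/remove rule can be sketched directly from the Laplacian pseudoinverse. The tie-breaking and single-step budget handling below are simplifications of the paper's iterative procedure.

```python
import numpy as np

def effective_resistance(A):
    # R[u, v] = L+_uu + L+_vv - 2 * L+_uv, where L+ is the Moore-Penrose
    # pseudoinverse of the graph Laplacian L = D - A.
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

def err_step(A):
    # One ERR iteration: add the non-edge with the highest effective
    # resistance (a structural bottleneck) and remove the edge with the
    # lowest one (a redundant connection), keeping the edge count fixed.
    R = effective_resistance(A)
    iu = zip(*np.triu_indices(len(A), 1))
    pairs = [(i, j) for i, j in iu]
    add = max((p for p in pairs if A[p] == 0), key=lambda p: R[p])
    drop = min((p for p in pairs if A[p] == 1), key=lambda p: R[p])
    A = A.copy()
    A[add], A[add[::-1]] = 1, 1
    A[drop], A[drop[::-1]] = 0, 0
    return A

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3:
# the bridge is the bottleneck every cross-triangle message must cross.
A = np.zeros((6, 6), int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
A2 = err_step(A)
```

On this graph the rewiring adds a shortcut between the two triangles (where effective resistance is highest) while keeping the bridge, since intra-triangle edges have lower resistance.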
Results
The experiments showed that ERR significantly improves connectivity and signal propagation in GNNs, particularly in heterophilic settings. However, it also revealed that while ERR enhances long-range communication, it can lead to accelerated representation mixing in deeper models. The combination of ERR with normalization techniques like PairNorm effectively stabilizes the trade-off between over-squashing and oversmoothing, leading to improved predictive performance.
Implications
The findings suggest that ERR can be a powerful tool for enhancing GNN performance, particularly in scenarios where long-range dependencies are crucial. The methodology could be applied to various domains involving graph data, such as social networks, biological networks, and recommendation systems, where effective information propagation is essential.
Temporal Straightening for Latent Planning
Robotics
Optimization
Theory
- Introduces temporal straightening to improve representation learning for latent planning.
- Utilizes a curvature regularizer to create straighter latent trajectories.
- Demonstrates improved alignment between Euclidean and geodesic distances in latent space.
- Achieves significant performance gains in goal-reaching tasks with gradient-based planning.
Summary
This paper addresses the challenge of representation learning for latent planning in world models, where pretrained visual encoders often contain irrelevant information that complicates planning tasks. The authors introduce a novel approach called 'temporal straightening,' inspired by the perceptual straightening hypothesis in human vision. This method employs a curvature regularizer to encourage locally straightened latent trajectories during the training of an encoder and a predictor. By reducing the curvature of these trajectories, the authors demonstrate that the Euclidean distance in latent space becomes a more accurate representation of geodesic distance, thereby improving the conditioning of the planning objective. Empirical results show that temporal straightening enhances the stability of gradient-based planning and significantly increases success rates in various goal-reaching tasks, with improvements of 20-60% in open-loop planning and 20-30% in model predictive control (MPC). This work highlights the importance of tailored representations for effective planning in latent spaces, paving the way for more efficient gradient-based optimization methods.
Methodology
The authors jointly learn an encoder and a predictor while imposing a curvature regularization on the latent trajectories during training. This approach encourages the formation of straighter trajectories, which enhances the predictive capabilities of the model and aligns the Euclidean distance more closely with the geodesic distance.
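A minimal form of such a curvature regularizer penalizes the turning angle between successive latent displacements; the paper's exact regularizer may be parameterized differently.

```python
import numpy as np

def curvature_penalty(z):
    # z: (T, D) latent trajectory. Penalize 1 - cos(angle) between
    # successive displacement vectors; zero for a straight trajectory.
    v = np.diff(z, axis=0)                                   # displacements
    v = v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    cos = (v[:-1] * v[1:]).sum(-1)                           # turning cosines
    return (1.0 - cos).mean()

straight = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.]])
bent = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])    # two 90° turns
```

Added to the prediction loss, this term pushes latent trajectories toward straight lines, which is what makes Euclidean distance a better proxy for geodesic distance in the learned space.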
Results
The implementation of temporal straightening resulted in a 20-60% increase in success rates for open-loop planning and a 20-30% improvement for model predictive control (MPC) tasks, demonstrating the effectiveness of the proposed method in enhancing planning stability and efficiency.
Implications
The findings suggest that optimizing representations for latent planning can lead to more efficient and reliable planning methods, potentially reducing the reliance on computationally expensive search-based techniques. This could have applications in robotics, autonomous systems, and any domain where effective planning in high-dimensional spaces is crucial.
On the Robustness of Langevin Dynamics to Score Function Error
Generative Models
Theory
- Langevin dynamics is not robust to L2 (or Lp) errors in score function estimates, unlike diffusion models.
- Even small L2 errors can lead to significant deviations from the target distribution in Langevin dynamics.
- The results caution against the use of Langevin dynamics with estimated scores in high-dimensional generative modeling.
- The paper provides a formal proof of the limitations of Langevin dynamics regarding score estimation errors.
Summary
This paper investigates the robustness of Langevin dynamics, a popular sampling algorithm, to errors in the estimated score function, focusing on L2 and, more generally, Lp errors. The authors demonstrate that unlike diffusion models, which can still sample accurately from the target distribution with small L2 errors, Langevin dynamics fails to produce samples close to the target distribution even with arbitrarily small L2 errors in high-dimensional spaces. This finding raises concerns about the practical application of Langevin dynamics when score functions are estimated from data, emphasizing the need for caution in using this method in generative modeling. The paper formalizes the limitations of Langevin dynamics in the context of score estimation errors, contrasting it with the established robustness of diffusion models under similar conditions.
Methodology
The authors analyze the performance of Langevin dynamics in high-dimensional settings by examining the impact of L2 and Lp errors in score function estimates. They provide theoretical results that quantify the relationship between score estimation errors and the Total Variation distance from the target distribution, contrasting these findings with existing results for diffusion models.
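For reference, the unadjusted Langevin update the analysis concerns looks like this. With the exact score of a standard Gaussian it samples the target; the paper's point is that replacing `score` with an estimate that is close only in L2 can break this guarantee in high dimension, since a small-L2 error can still be large on a low-probability region the chain passes through.

```python
import numpy as np

def langevin(score, x0, step=0.01, n_steps=5000, rng=None):
    # Unadjusted Langevin algorithm:
    #   x <- x + step * score(x) + sqrt(2 * step) * noise
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

# Exact score of a standard Gaussian: grad log p(x) = -x.
exact_score = lambda x: -x
samples = langevin(exact_score, np.zeros(1000))  # 1000 independent 1-D chains
```

With the exact score the chains equilibrate to (approximately) N(0, 1); the theoretical results concern how this behavior degrades under Lp score error.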
Results
The main result shows that Langevin dynamics, when run for any polynomial time horizon, produces a distribution that is far from the target distribution in Total Variation distance, even with small L2 errors in the score function estimate. This is in stark contrast to diffusion models, which can still sample accurately under similar conditions.
Implications
The findings suggest that practitioners should be cautious when using Langevin dynamics for generative modeling, particularly in high-dimensional spaces where score estimation errors are inevitable. The results advocate for the preference of diffusion models over Langevin dynamics in scenarios where score functions are learned from data.
Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives
Reinforcement Learning
Robotics
Optimization
- Introduction of MMDDPG framework for robust policy learning in RL.
- Formulation of training as a minimax optimization problem between user and adversary.
- Use of a fractional objective to balance performance and disturbance magnitude.
- Demonstrated improved robustness in MuJoCo environments against disturbances.
Summary
This paper addresses the challenge of ensuring robust performance in reinforcement learning (RL) agents when faced with external disturbances and model uncertainties. The authors propose a novel framework called minimax deep deterministic policy gradient (MMDDPG) that formulates the training process as a minimax optimization problem between a user policy and an adversarial disturbance policy. The key innovation is the introduction of a fractional objective that balances task performance with disturbance magnitude, preventing excessively aggressive disturbances from destabilizing the learning process. Experimental evaluations in MuJoCo environments demonstrate that MMDDPG significantly enhances robustness against external force perturbations and parametric variations compared to conventional RL methods. The findings indicate that the proposed approach effectively stabilizes the interaction between the user and adversary, leading to improved performance in continuous control tasks.
Methodology
The authors develop the MMDDPG framework, which formulates the RL training process as a two-player zero-sum game. The user policy aims to minimize an objective function while the adversary seeks to maximize it. A fractional objective is introduced to stabilize the training by limiting the magnitude of disturbances, allowing for effective learning of robust policies in continuous control tasks.
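One plausible shape for such a fractional objective divides the adversary's induced cost by a regularized disturbance energy, so arbitrarily large disturbances stop being optimal; the exact functional form in the paper may differ.

```python
import numpy as np

def fractional_objective(task_cost, disturbance, c=1.0):
    # Hypothetical fractional form: the adversary maximizes the cost it
    # causes *per unit of disturbance energy*, so unbounded disturbances
    # are no longer optimal for it.
    return task_cost / (c + np.sum(disturbance ** 2))

# Toy 1-D case: induced cost grows linearly with disturbance magnitude m,
# so the ratio 3m / (1 + m^2) peaks at a finite magnitude (m = 1).
mags = np.linspace(0, 10, 1000)
ratios = [fractional_objective(3 * m, np.array([m])) for m in mags]
best = mags[int(np.argmax(ratios))]
```

The adversary's best response is an interior point rather than the largest allowed disturbance, which is the stabilizing effect the paper attributes to the fractional objective.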
Results
Experimental results show that MMDDPG achieves significantly improved robustness against external force perturbations and resilience to parametric mismatches. The proposed method outperforms conventional RL baselines in terms of stability and performance in continuous control tasks within the MuJoCo environments.
Implications
The findings suggest that MMDDPG can be applied to real-world scenarios where RL agents must operate under uncertain conditions, such as robotics and autonomous systems. The framework's ability to enhance robustness could lead to safer and more reliable deployment of RL in critical applications.
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
NLP
Large Language Models
Efficient ML
- Identification of within-sentence support stability, where attention support remains stable over short coherent spans.
- Introduction of Slow-Fast Inference (SFI), a training-free framework that alternates between fast and slow decoding steps.
- Development of a training-free Selector that converts dense-attention evidence into reusable memory.
- Significant improvements in decoding throughput without sacrificing quality, achieving 1.6× to 14.4× speedup.
Summary
The paper addresses the computational challenges of long-context autoregressive decoding in large language models (LLMs), which typically require extensive processing for each decoding step due to the growing history of tokens. The authors introduce a novel framework called Slow-Fast Inference (SFI), which leverages the observation that attention support within short semantically coherent spans remains stable. SFI operates by alternating between frequent low-cost fast steps that utilize a compact sparse memory and occasional dense slow steps that refresh this memory at semantic boundaries. This approach allows for significant reductions in computational overhead while maintaining output quality comparable to traditional methods. The framework is training-free, meaning it can be applied directly to existing model checkpoints without retraining. The authors demonstrate that SFI achieves a decoding throughput increase of approximately 1.6× to 14.4× across various context lengths, making it a practical solution for enhancing inference efficiency in long-context tasks.
Methodology
The SFI framework decouples the decoding process into frequent low-cost fast steps and occasional dense slow steps. Fast steps utilize a compact cache of selected tokens, while slow steps involve full attention over the broader context. A Selector is employed during slow steps to create reusable memory from dense-attention evidence using a KL-based fusion objective. The system also incorporates latency-hiding and memory-coalesced designs to optimize performance.
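The alternation can be sketched as a simple control loop; the scripted "model" and last-k Selector below are stand-ins for the real dense/sparse attention steps and the KL-based Selector.

```python
def slow_fast_decode(next_token, select_support, prompt, max_new=12,
                     boundaries=(".",)):
    # Hypothetical control loop: occasional dense "slow" steps rebuild a
    # compact memory at semantic boundaries; frequent "fast" steps attend
    # only to that memory. The policies here are illustrative.
    memory, out, slow_steps = list(prompt), [], 0
    for _ in range(max_new):
        if out and out[-1] in boundaries:
            memory = select_support(prompt + out)   # slow: full context
            slow_steps += 1
            tok = next_token(prompt + out)          # dense attention
        else:
            tok = next_token(memory + out)          # fast: sparse memory
        out.append(tok)
    return out, slow_steps

# Toy stand-ins: a scripted "model" and a last-4-token Selector.
script = iter(["the", "cat", "sat", ".", "it", "slept", ".",
               "end", "x", "y", "z", "w"])
out, slow = slow_fast_decode(lambda ctx: next(script),
                             lambda ctx: ctx[-4:], ["<bos>"])
```

Only two of the twelve decoding steps are dense (one per sentence boundary), which is the source of the throughput gain: the per-step cost is dominated by the cheap fast steps.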
Results
SFI achieves a decoding throughput improvement of 1.6× to 14.4× across various context lengths while maintaining quality on par with the full-KV baseline. The method demonstrates effectiveness in long-context understanding and long-chain-of-thought reasoning tasks without requiring retraining.
Implications
The SFI framework offers a practical approach to enhance the efficiency of autoregressive models in long-context applications, making it suitable for real-time processing in scenarios such as multi-agent systems and extensive reasoning tasks. This could lead to broader adoption of LLMs in resource-constrained environments.
Automatic Generation of High-Performance RL Environments
Reinforcement Learning
Efficient ML
Robotics
- High-performance RL environments can be generated automatically and cheaply, reducing the engineering burden.
- The proposed methodology includes hierarchical verification to ensure semantic equivalence across environments.
- Significant speedups were achieved in various environments, with performance improvements ranging from 1.5× to 42×.
- The approach allows for the creation of entirely new environments, such as TCGJax, which did not exist prior to this work.
Summary
This paper addresses the challenge of translating complex reinforcement learning (RL) environments into high-performance implementations, which has traditionally required extensive engineering efforts. The authors propose a reusable methodology that combines a generic prompt template, hierarchical verification, and iterative agent-assisted repair to produce semantically equivalent high-performance environments at a low cost (less than $10 in compute). They demonstrate this approach through three distinct workflows across five environments, including EmuRust, a Game Boy emulator achieving a 1.5× speedup, and PokeJAX, a GPU-parallel Pokemon battle simulator with a 22,320× speedup over its TypeScript reference. The paper also introduces TCGJax, a new deployable JAX Pokemon trading card game engine, synthesized from a web-extracted specification. The authors emphasize the importance of hierarchical verification in ensuring semantic equivalence and confirm that their methodology significantly reduces environment overhead, allowing for more efficient RL training. The paper provides detailed methodologies, including prompts and verification techniques, enabling reproducibility of the results.
Methodology
The authors developed a methodology that includes a generic prompt template for coding agents, hierarchical verification processes, and iterative agent-assisted repair. This approach allows for the translation of complex RL environments into high-performance versions while ensuring semantic equivalence through rigorous testing.
Results
The paper reports substantial performance improvements across five environments, including a 1.5× speedup for EmuRust, a 22,320× speedup for PokeJAX, and a 6.6× speedup for TCGJax. The methodology confirmed throughput parity with existing implementations and demonstrated zero sim-to-sim gaps across environments.
Implications
The findings suggest that the automatic generation of high-performance RL environments can streamline the development process in the RL community, making it more accessible and efficient. This could lead to faster experimentation and deployment of RL algorithms in various applications, including gaming and robotics.
STAMP: Selective Task-Aware Mechanism for Text Privacy
- STAMP selectively allocates privacy budgets based on token importance and sensitivity.
- The polar mechanism preserves the magnitude of embeddings while perturbing their direction.
- Experimental results show STAMP outperforms existing methods in maintaining utility while ensuring privacy.
- The framework is applicable in scenarios requiring client-side text privacy protection.
Summary
The paper introduces STAMP, a novel framework designed for task-aware text privatization that enhances the balance between privacy and utility. STAMP employs a selective approach to allocate privacy budgets across tokens based on their significance to downstream tasks and their privacy sensitivity. This token-level partitioning allows for nuanced control over the noise applied to different parts of the input text, ensuring that crucial tokens receive appropriate privacy protection without compromising task relevance. The authors propose the polar mechanism for privatizing token embeddings, which perturbs only the direction of embeddings while maintaining their magnitude. This approach preserves semantic relationships in the embedding space, thereby enhancing downstream utility compared to traditional isotropic noise mechanisms. The effectiveness of STAMP is validated through experiments on various datasets, including SQuAD, Yelp, and AG News, demonstrating its superior privacy-utility trade-offs across different privacy budgets.
Methodology
STAMP utilizes a selective task-aware mechanism that evaluates each token's importance to the downstream task and its privacy sensitivity. The polar mechanism is introduced to perturb token embeddings directionally, while decoding is achieved through cosine nearest-neighbor search, aligning the perturbation with the decoding process. The framework is evaluated using local differential privacy principles and tested on multiple datasets.
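The magnitude-preserving idea behind the polar mechanism can be sketched as follows; this toy version perturbs only the direction of an embedding and is not calibrated to any formal privacy budget.

```python
import numpy as np

def polar_perturb(e, kappa, rng):
    # Illustrative direction-only perturbation: add noise to the unit
    # direction, renormalize, and restore the original magnitude. The
    # paper's polar mechanism additionally calibrates the noise to a
    # local differential privacy budget.
    r = np.linalg.norm(e)
    u = e / r + rng.normal(scale=1.0 / kappa, size=e.shape)
    return r * u / np.linalg.norm(u)

rng = np.random.default_rng(0)
e = rng.normal(size=16)           # a token embedding
noisy = polar_perturb(e, 20.0, rng)
```

Because the magnitude is untouched, cosine nearest-neighbor decoding (the decoding rule the paper pairs with this mechanism) only has to cope with the angular perturbation.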
Results
The experimental evaluations indicate that STAMP, particularly when combined with the normalized polar mechanism, consistently achieves better privacy-utility trade-offs compared to existing methods. The results highlight STAMP's ability to maintain semantic integrity while providing effective privacy protection across various tasks and datasets.
Implications
STAMP's approach to selective task-aware text privatization has significant implications for applications involving sensitive user-generated content, such as chatbots, recommendation systems, and any service utilizing large language models. By ensuring that privacy is maintained without sacrificing the utility of the text, STAMP can enhance user trust and compliance with privacy regulations.
Personalized Federated Learning via Gaussian Generative Modeling
Federated Learning
Generative Models
- Introduces pFedGM, a method for personalized federated learning using Gaussian generative modeling.
- Balances global collaboration and personalization through a dual objective approach.
- Decouples the Gaussian classifier into a navigator and a statistic extractor for improved representation learning.
- Demonstrates effectiveness across diverse scenarios and datasets, outperforming existing methods.
Summary
This paper addresses the challenges of personalized federated learning (PFL) in the context of data heterogeneity across clients. Traditional federated learning approaches often struggle with non-IID data distributions, leading to poor generalization. The authors propose a novel method called pFedGM, which utilizes Gaussian generative modeling to capture client-specific data characteristics. The method involves training a Gaussian generator that models client heterogeneity through weighted re-sampling. A dual objective is employed to balance global collaboration and personalization: maximizing inter-class distance across clients while minimizing intra-class distance within them. The authors decouple the Gaussian classifier into a navigator for global optimization and a statistic extractor for capturing distributional statistics. This dual-scale fusion framework allows each client to develop a personalized classifier head, leveraging Bayesian inference for class probability estimation. The evaluation of pFedGM demonstrates its effectiveness across various scenarios, including class count heterogeneity and environmental corruption, achieving superior or competitive performance compared to existing state-of-the-art methods.
Methodology
The methodology involves training a Gaussian generator to model client heterogeneity, employing a dual objective that maximizes inter-class distance and minimizes intra-class distance. The Gaussian classifier is decoupled into components for global optimization and local statistics extraction, facilitating personalized classifier head development for each client.
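The dual objective can be sketched as a loss that trades intra-class spread against inter-class mean separation; the weighting and distance choices below are illustrative, not the paper's exact formulation.

```python
import numpy as np

def dual_objective(feats, labels, alpha=1.0):
    # Loss = intra-class spread - alpha * inter-class mean separation:
    # minimizing it tightens each class while pushing class means apart.
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(0) for c in classes])
    intra = np.mean([np.linalg.norm(feats[labels == c] - means[i],
                                    axis=1).mean()
                     for i, c in enumerate(classes)])
    diffs = means[:, None, :] - means[None, :, :]
    inter = np.linalg.norm(diffs, axis=-1).sum() / (
        len(classes) * (len(classes) - 1))
    return intra - alpha * inter

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)
tight = np.concatenate([rng.normal(0, 0.1, (50, 2)),
                        rng.normal(5, 0.1, (50, 2))])   # separated, compact
loose = np.concatenate([rng.normal(0, 1.0, (50, 2)),
                        rng.normal(1, 1.0, (50, 2))])   # overlapping
```

Well-separated compact classes score a lower loss than overlapping diffuse ones, which is the behavior the dual objective is designed to reward.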
Results
The pFedGM method achieved superior or competitive performance compared to state-of-the-art methods across various benchmarks, effectively addressing the challenges posed by data heterogeneity in personalized federated learning.
Implications
The proposed method can enhance the performance of federated learning systems in real-world applications where data is distributed and heterogeneous, such as in healthcare, finance, and IoT, while maintaining privacy and security.
Flowcean - Model Learning for Cyber-Physical Systems
- Flowcean automates model generation for Cyber-Physical Systems using data-driven learning.
- The framework emphasizes modularity and usability, allowing integration of various learning libraries.
- Flowcean addresses the limitations of existing machine learning frameworks in handling diverse CPS applications.
- The framework supports customization of data-driven learning pipelines for specific CPS characteristics.
Summary
The paper presents Flowcean, a novel framework aimed at automating the generation of models for Cyber-Physical Systems (CPS) through data-driven learning. The authors highlight the challenges of constructing effective models for CPS due to their inherent complexity and the manual effort required in traditional modeling approaches. Flowcean addresses these challenges by offering a modular and flexible architecture that integrates various learning strategies, data processing methods, and evaluation metrics. This framework is designed to streamline the model generation and evaluation process, making it more efficient and accessible for a wide range of CPS applications. The authors emphasize the importance of adapting the data-driven learning pipeline to the unique characteristics of each CPS, thereby facilitating the development of tailored solutions. The paper also discusses the limitations of existing machine learning frameworks, which often lack the flexibility to accommodate diverse modeling strategies required for different CPS scenarios. Flowcean aims to fill this gap by providing a comprehensive solution that enhances the usability and adaptability of machine learning in CPS contexts.
Methodology
The authors developed Flowcean as a modular framework that incorporates multiple learning strategies, data processing techniques, and evaluation metrics. The framework allows for the integration of different machine learning libraries, enabling users to customize their modeling approaches based on the specific requirements of their CPS applications. The methodology includes observing the system, preprocessing data, and learning models through a structured pipeline tailored to the unique characteristics of each CPS.
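A minimal observe -> preprocess -> learn -> evaluate pipeline in this spirit might look as follows; the class and field names are hypothetical and do not reflect Flowcean's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Pipeline:
    # Illustrative modular pipeline: each stage is a pluggable callable,
    # mirroring the framework's idea of swapping learning strategies,
    # data processing steps, and evaluation metrics per CPS.
    observe: Callable[[], Any]
    transforms: list = field(default_factory=list)
    learner: Callable[[Any], Any] = None
    metric: Callable[[Any, Any], float] = None

    def run(self):
        data = self.observe()
        for t in self.transforms:
            data = t(data)
        model = self.learner(data)
        return model, self.metric(model, data)

# Toy CPS example: fit a mean-predictor to centered sensor readings.
pipe = Pipeline(
    observe=lambda: [20.1, 19.8, 20.4, 20.0],
    transforms=[lambda xs: [x - 20.0 for x in xs]],         # center readings
    learner=lambda xs: sum(xs) / len(xs),                   # "model" = mean
    metric=lambda m, xs: sum((x - m) ** 2 for x in xs) / len(xs),
)
model, mse = pipe.run()
```

Swapping any stage (a different preprocessor, a different learner, a different metric) leaves the rest of the pipeline untouched, which is the kind of modularity the framework advocates.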
Results
The paper does not provide specific quantitative results but emphasizes the framework's ability to facilitate efficient model generation and evaluation for various CPS scenarios. Flowcean's modular architecture is designed to enhance adaptability and usability, potentially leading to improved outcomes in CPS modeling tasks.
Implications
Flowcean has significant implications for industries relying on Cyber-Physical Systems, such as energy, mobility, and logistics. By automating model generation and providing a flexible framework, it can reduce the time and expertise required for CPS modeling, enabling more efficient design, verification, and testing processes. The framework's adaptability may also foster innovation in CPS applications by allowing for rapid prototyping and experimentation with different modeling strategies.
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
- Introduction of Knowledge-Guided Time Series Event Detection (K-TSED) to leverage natural language descriptions for event detection.
- Development of the Event Logic Tree (ELT) framework to model the temporal-logic structures of events.
- Creation of a neuro-symbolic VLM agent system (SELA) that combines logic analysis and signal inspection.
- Establishment of a benchmark dataset from real-world time series data to validate the proposed method.
Summary
This paper addresses the challenge of Time Series Event Detection (TSED), which is crucial in high-stakes domains such as healthcare and energy production. Traditional methods rely heavily on labeled data for training, which can be scarce in real-world scenarios. The authors propose a novel approach termed Knowledge-Guided Time Series Event Detection (K-TSED), where a model is provided with natural language event descriptions to identify corresponding intervals in multivariate time series data without extensive training data. Central to their approach is the Event Logic Tree (ELT), a knowledge representation framework that captures the temporal-logic structures of events. This framework facilitates the grounding of linguistic descriptions to time series data. The authors introduce a neuro-symbolic VLM agent system, SELA, which consists of Logic Analyst agents that construct ELT schemas and Signal Inspector agents that locate and refine time series intervals based on these schemas. To validate their method, they created a benchmark using real-world time series data, demonstrating that their approach outperforms existing supervised and zero-shot methods in terms of accuracy and explainability. The study highlights the importance of ELT in reducing the hallucination problem often encountered in VLMs, thereby enhancing the reliability of event detection.
Methodology
The authors propose a neuro-symbolic VLM agent framework that utilizes the Event Logic Tree (ELT) for grounding natural language event descriptions to time series data. The framework consists of Logic Analyst agents that create ELT schemas from event descriptions and Signal Inspector agents that identify and refine event intervals based on these schemas. This deductive approach allows for effective event detection without the need for extensive labeled training data.
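An Event Logic Tree can be sketched as a small recursive structure whose leaves are signal predicates and whose internal nodes are temporal-logic operators; the operators and schema below are illustrative, not the paper's exact grammar.

```python
from dataclasses import dataclass

@dataclass
class ELTNode:
    # Illustrative ELT node: leaves ("ATOM") hold a predicate over a
    # signal window; internal nodes combine children with simple
    # temporal-logic operators.
    op: str                    # "ATOM", "AND", or "SEQ" (ordered)
    children: tuple = ()
    predicate: object = None   # for leaves: window -> bool

def evaluate(node, window):
    if node.op == "ATOM":
        return node.predicate(window)
    if node.op == "AND":
        return all(evaluate(c, window) for c in node.children)
    if node.op == "SEQ":       # first child on the earlier half, second later
        half = len(window) // 2
        return (evaluate(node.children[0], window[:half])
                and evaluate(node.children[1], window[half:]))
    raise ValueError(node.op)

# "A spike followed by a recovery", grounded to a numeric series.
spike = ELTNode("ATOM", predicate=lambda w: max(w) > 3)
recovery = ELTNode("ATOM", predicate=lambda w: max(w) < 1)
event = ELTNode("SEQ", children=(spike, recovery))
```

A Logic Analyst agent would build such a tree from a natural language description; a Signal Inspector agent would then evaluate it against candidate intervals of the time series.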
Results
The proposed method significantly outperformed traditional inductive pattern recognition methods and existing zero-shot time series reasoning frameworks. The experiments showed that the K-TSED approach, utilizing the ELT framework, approached human-level performance in event detection tasks. An ablation study confirmed the critical role of ELT in mitigating the hallucination issues commonly faced by VLMs.
Implications
The findings suggest that K-TSED can be applied in various high-stakes domains where event detection is critical, such as healthcare monitoring and energy management. The ability to use natural language for event detection could enhance the interpretability and trustworthiness of AI systems in these fields, facilitating better collaboration between human experts and AI.
Resource-Efficient Iterative LLM-Based NAS with Feedback Memory
- Introduces a closed-loop iterative NAS pipeline using LLMs for architecture generation and refinement.
- Utilizes a historical feedback memory to learn from past attempts, improving the efficiency of the search process.
- Demonstrates significant performance improvements on image classification tasks with minimal computational resources.
- Establishes a framework that is accessible for resource-constrained environments, favoring compact models for edge deployment.
Summary
This paper presents a novel approach to Neural Architecture Search (NAS) that leverages large language models (LLMs) in a resource-efficient manner. Traditional NAS methods are often computationally expensive, requiring extensive GPU resources. The authors propose a closed-loop pipeline that iteratively generates, evaluates, and refines convolutional neural network architectures for image classification using a single consumer-grade GPU. Central to their approach is a historical feedback memory mechanism inspired by Markov chains, which maintains a sliding window of recent attempts to provide context for iterative learning. Unlike previous methods that discard unsuccessful attempts, this approach treats failures as valuable learning signals, recording structured diagnostic triples for each attempt. The pipeline consists of a Code Generator that produces executable PyTorch architectures, an Evaluator that assesses these architectures using one-epoch proxy accuracy on datasets like CIFAR-10, CIFAR-100, and ImageNette, and a Prompt Improver that uses historical feedback to suggest targeted improvements. The authors evaluate three frozen instruction-tuned LLMs and demonstrate significant improvements in architecture performance, establishing a low-budget, reproducible, and hardware-aware paradigm for NAS without the need for cloud infrastructure.
Methodology
The methodology involves a closed-loop pipeline with three main components: a Code Generator that creates PyTorch model implementations, an Evaluator that trains these models for a single epoch to assess their performance, and a Prompt Improver that analyzes results using a historical feedback memory to generate suggestions for future iterations. This iterative process allows the LLM to learn from both successes and failures, optimizing the architecture search.
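The sliding-window feedback memory reduces to a bounded queue of diagnostic records rendered into the next prompt; the record format here is an assumption, not the paper's exact schema.

```python
from collections import deque

class FeedbackMemory:
    # Sliding window of (architecture, accuracy, diagnostic) triples in
    # the spirit of the paper's Markov-chain-inspired memory: failures
    # are recorded alongside successes as learning signals.
    def __init__(self, window=5):
        self.window = deque(maxlen=window)

    def record(self, arch, accuracy, diagnostic):
        self.window.append((arch, accuracy, diagnostic))

    def to_prompt(self):
        # Render the most recent attempts as context for the Prompt
        # Improver's next generation request.
        return "\n".join(f"attempt: {a} | acc: {acc} | note: {d}"
                         for a, acc, d in self.window)

mem = FeedbackMemory(window=2)
mem.record("conv3x3-pool", 0.42, "underfits")
mem.record("conv3x3x2-pool", 0.55, "more depth helps")
mem.record("resblock-x2", None, "CUDA OOM at batch 128")  # failure kept too
```

Because the deque is bounded, only the most recent attempts survive into the next prompt, which keeps the context compact while still exposing recent failures to the LLM.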
Results
The empirical results show that the proposed pipeline significantly enhances architecture quality across three LLMs. For instance, on the CIFAR-10 dataset, the DeepSeek-Coder-6.7B model improved from 28.2% to 69.2% accuracy, Qwen2.5-7B improved from 50.0% to 71.5%, and GLM-5 improved from 43.2% to 62.0%. The entire search process was completed in approximately 18 GPU hours on a single RTX 4090, demonstrating the efficiency of the approach.
Implications
The findings suggest that this resource-efficient NAS approach can democratize access to advanced neural architecture design, enabling researchers and practitioners in resource-limited settings to leverage LLMs for effective model development. The framework may also facilitate the deployment of compact neural networks on edge devices, expanding the applicability of deep learning in various real-world scenarios.
Statistical and structural identifiability in representation learning
Theory
- Introduces distinct concepts of statistical and structural identifiability in representation learning.
- Proposes model-agnostic definitions of near-identifiability allowing for error tolerance.
- Demonstrates that ICA can resolve linear ambiguities in representations.
- Achieves state-of-the-art disentanglement using a simple combination of autoencoders and ICA.
Read more
Statistical and structural identifiability in representation learning
Summary
This paper addresses the stability of internal representations in representation learning models, distinguishing between statistical identifiability (consistency across runs) and structural identifiability (alignment with unobserved ground truth). The authors propose new model-agnostic definitions of statistical and structural near-identifiability, allowing for some error tolerance (ϵ). They prove that representations of models with nonlinear decoders can achieve statistical ϵ-near-identifiability, extending existing identifiability theory to intermediate representations in various models, including generative pre-trained transformers and autoencoders. The paper demonstrates that independent component analysis (ICA) can resolve remaining ambiguities, and empirically validates these claims through experiments. The findings suggest that with certain assumptions on the data-generating process, statistical identifiability can lead to structural identifiability, offering a practical method for disentanglement in latent representations. The proposed approach achieves state-of-the-art disentanglement on synthetic benchmarks and successfully disentangles biological variations from technical batch effects in cell microscopy applications.
Methodology
The authors formalize definitions of statistical and structural identifiability, proving new results regarding near-identifiability for various models. They conduct synthetic experiments to validate their theoretical claims, assessing the impact of hyperparameter selection and regularization on identifiability. They also apply ICA to latent representations to achieve disentanglement.
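As a toy illustration of the linear ambiguity that ICA resolves, the sketch below whitens a 2-D linear mixture of independent sources (which leaves an unknown rotation) and then runs a minimal one-unit FastICA iteration. This is the generic textbook procedure, not the authors' implementation; sources, mixing matrix, and contrast function are all invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two independent non-Gaussian (uniform) sources, linearly mixed: the kind
# of linear ambiguity a statistically identifiable representation retains.
S = rng.uniform(-1, 1, size=(2, n))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = A @ S

# Whitening removes scale and correlation but leaves an unknown rotation.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / n)
Z = (E @ np.diag(d ** -0.5) @ E.T) @ Xc

def fastica_unit(Z, w, iters=200):
    """One-unit FastICA with the cubic (kurtosis) contrast; finds one
    independent direction, resolving the rotation left by whitening."""
    for _ in range(iters):
        wx = w @ Z
        w_new = (Z * wx ** 3).mean(axis=1) - 3.0 * w
        w_new /= np.linalg.norm(w_new)
        done = abs(abs(w_new @ w) - 1.0) < 1e-10
        w = w_new
        if done:
            break
    return w

w1 = fastica_unit(Z, np.array([1.0, 0.0]))
w2 = np.array([-w1[1], w1[0]])   # orthogonal complement (2-D deflation)
Y = np.vstack([w1, w2]) @ Z      # recovered sources, up to order and sign
```

Each row of `Y` should correlate almost perfectly with one of the original sources, which is exactly the "up to permutation and sign" guarantee ICA provides.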
Results
The paper establishes that intermediate representations in models with statistically identifiable outputs are statistically ϵ-near-identifiable. It shows that ICA can resolve linear indeterminacies, leading to effective disentanglement in both synthetic and real-world applications, including cell microscopy.
Implications
The findings have significant implications for improving representation learning models, particularly in tasks requiring disentanglement of complex data, such as in biology and other domains where data variations can obscure meaningful patterns.
Graph Tokenization for Bridging Graphs and Transformers
Graph Learning
- Introduces a graph tokenization framework that combines reversible serialization with BPE.
- Guides serialization using global statistics to enhance structural representation.
- Enables standard Transformers to process graph data without architectural changes.
- Achieves state-of-the-art results on 14 benchmark datasets, surpassing existing models.
Read more
Graph Tokenization for Bridging Graphs and Transformers
Summary
This paper presents a novel graph tokenization framework aimed at integrating graph-structured data with Transformer architectures, particularly large pretrained models like BERT. The authors address the challenges of tokenizing graphs by proposing a method that combines reversible graph serialization with Byte Pair Encoding (BPE). This approach preserves essential graph information while allowing for the generation of sequential representations. The serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring patterns are prioritized in the tokenization process. The empirical results demonstrate that this framework enables standard Transformers to achieve state-of-the-art performance on 14 benchmark datasets for graph classification and regression tasks, often outperforming both traditional Graph Neural Networks and specialized Graph Transformers. The work effectively bridges the gap between graph data and the Transformer ecosystem, providing a robust interface for future applications.
Methodology
The proposed methodology involves a two-step process: first, reversible graph serialization is applied to convert graphs into sequential representations while preserving structural information. This is followed by the application of Byte Pair Encoding (BPE) to merge frequent substructures into meaningful tokens. The serialization process is guided by global statistics to ensure that the most common graph patterns are effectively captured and represented in the tokenized output.
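The serialize-then-BPE idea can be illustrated with a toy DFS serialization (a backtrack marker keeps the walk invertible) followed by a plain byte-pair merge loop. The walk scheme and token names here are invented for the sketch and are not the paper's actual serialization format.

```python
from collections import Counter

def serialize(graph, start):
    """Toy reversible serialization: DFS edge walk emitting node labels,
    with a backtrack marker so the sequence can be inverted to a tree."""
    seen, seq = set(), []
    def dfs(u):
        seen.add(u)
        seq.append(u)
        for v in sorted(graph[u]):
            if v not in seen:
                dfs(v)
                seq.append(")")  # backtrack marker keeps the walk invertible
    dfs(start)
    return seq

def bpe_merge(seqs, num_merges):
    """Byte Pair Encoding over serialized graphs: repeatedly fuse the
    globally most frequent adjacent token pair into one token."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        fused = a + "+" + b
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(fused)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs, merges

# Triangle A-B-C with a pendant node D; serialized from two start nodes.
g = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}
orig = [serialize(g, "A"), serialize(g, "B")]
tokenized, merges = bpe_merge([list(s) for s in orig], 2)
```

Frequent substructures shared across the corpus become single tokens, so the sequences handed to the Transformer get shorter as the vocabulary grows.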
Results
The proposed tokenizer allows standard Transformer models to achieve state-of-the-art performance across 14 benchmark datasets for graph classification and regression tasks. The results indicate that the framework frequently outperforms both established Graph Neural Networks and specialized Graph Transformers, demonstrating its effectiveness in handling graph-structured data.
Implications
This work has significant implications for the application of Transformer models to graph-structured data, potentially enhancing the performance of various tasks in domains such as social network analysis, molecular biology, and recommendation systems. It opens avenues for further research into integrating sequence models with complex data structures.
Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
- Introduction of cross-domain Bellman consistency to measure transferability of source-domain models.
- Development of the QAvatar framework that combines Q functions from source and target domains.
- Establishment of convergence properties for QAvatar, ensuring reliable knowledge transfer.
- Demonstration of QAvatar's superior performance over existing CDRL methods in benchmark tasks.
Read more
Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
Summary
This paper addresses the challenges of cross-domain reinforcement learning (CDRL), which aims to enhance data efficiency by transferring knowledge from a source domain to a target domain with potentially different state and action spaces. The authors identify two primary challenges: the distinct state/action spaces between domains and the uncertain transferability of source-domain models. To tackle these issues, they introduce the concept of cross-domain Bellman consistency to measure the transferability of source models. The proposed framework, QAvatar, integrates Q functions from both domains using an adaptive weight function to facilitate knowledge transfer. The authors demonstrate that QAvatar achieves reliable transfer by effectively leveraging source-domain Q functions, ensuring improved sample efficiency. Experimental results show that QAvatar outperforms existing CDRL benchmarks across various tasks, including locomotion and robot arm manipulation, validating its effectiveness in real-world scenarios.
Methodology
The authors propose the QAvatar framework, which utilizes a weighted combination of Q functions from both source and target domains to update the target-domain policy. They introduce a tabular prototype of QAvatar to establish its convergence properties and later develop a practical implementation that incorporates a normalizing flow-based mapping for learning state-action correspondence. The methodology emphasizes minimizing a cross-domain Bellman loss to facilitate effective transfer despite differences in state and action spaces.
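The weighted combination of source and target critics can be sketched as follows. The critics, the identity state-action correspondence, and the decaying weight schedule are all illustrative assumptions, not the paper's exact construction.

```python
def combined_q(q_src, q_tgt, s, a, map_sa, w):
    """Blend a source-domain critic (queried through a learned state-action
    correspondence) with the target-domain critic."""
    s_src, a_src = map_sa(s, a)
    return w * q_src(s_src, a_src) + (1.0 - w) * q_tgt(s, a)

# Toy critics and an identity correspondence, purely for illustration.
def q_src(s, a): return 1.0            # over-optimistic source critic
def q_tgt(s, a): return 0.2 * s + a    # target critic being learned
def map_sa(s, a): return s, a

# A decaying weight shifts trust from the source critic to the target one,
# so an unreliable source model cannot dominate forever.
weights = [0.8 * 0.5 ** k for k in range(5)]
values = [combined_q(q_src, q_tgt, 1.0, 0.0, map_sa, w) for w in weights]
```

As the weight decays, the blended value approaches the target critic's own estimate, which is the intuition behind the convergence guarantee.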
Results
Experimental evaluations indicate that QAvatar significantly outperforms existing CDRL benchmarks across various reinforcement learning tasks. The results highlight the framework's ability to achieve reliable knowledge transfer and improved sample efficiency, demonstrating its effectiveness in scenarios with distinct state and action spaces.
Implications
The findings suggest that QAvatar can be applied in various real-world scenarios where knowledge transfer between different domains is necessary, such as robotics, autonomous systems, and simulation-based learning. The framework's ability to handle distinct state and action spaces opens avenues for more flexible and efficient reinforcement learning applications.
Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control
- Exhaustive circuit tracing reveals a heavy-tailed hub architecture with significant annotation bias.
- 1.8% of features account for a disproportionate amount of connectivity in the model.
- Redundancy in feature interactions increases with interaction order, indicating a subadditive architecture.
- Late-layer features are causally linked to promoting cellular maturity, while early-layer features push away from it.
Read more
Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control
Summary
This paper addresses the limitations of mechanistic interpretability in biological foundation models by employing exhaustive circuit tracing, higher-order combinatorial ablation, and causal trajectory steering in Geneformer, a transformer-based single-cell model. The study reveals a significant expansion in the understanding of feature interactions, uncovering 1,393,850 significant downstream edges from 4,065 active features at layer 5, a 27-fold increase over previous selective sampling methods. The results indicate a heavy-tailed hub distribution where a small percentage of features account for most connectivity, highlighting systematic annotation biases in prior analyses. Additionally, the research demonstrates that redundancy in feature interactions increases with interaction order, confirming a fundamentally subadditive circuit architecture. Finally, trajectory-guided feature steering establishes a causal relationship between layer position and differentiation directionality, showing that late-layer features promote cellular maturity while early-layer features steer cell state away from it. These findings provide a deeper understanding of how biological foundation models process cellular information and suggest new avenues for research in single-cell biology.
Methodology
The study utilized exhaustive circuit tracing to analyze all active features in layer 5 of the Geneformer model, measuring causal downstream effects through a causal mediation framework. Three-way combinatorial ablation was performed to assess redundancy and interaction dynamics, while trajectory-guided feature steering was employed to establish causal links between layer position and differentiation outcomes.
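A toy readout illustrates what subadditive, redundant circuitry looks like in ablation terms: two features each suffice on their own, so their individual contributions sum to more than their joint contribution. The readout and feature names are invented; the paper's causal mediation analysis is far richer than this sketch.

```python
def model_output(active):
    """Toy readout with built-in redundancy: either f1 or f2 alone is
    enough to drive the downstream signal."""
    return 1.0 if active & {"f1", "f2"} else 0.0

def contribution(kept):
    """Output supported by `kept` features with all others ablated."""
    return model_output(set(kept))

c1 = contribution({"f1"})            # f1 alone suffices
c2 = contribution({"f2"})            # f2 alone suffices
joint = contribution({"f1", "f2"})   # together they add nothing extra
subadditive = (c1 + c2) > joint      # the signature of redundant circuitry
```

Higher-order combinatorial ablation generalizes this comparison to triples and beyond, which is how the study shows redundancy deepening with interaction order.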
Results
The exhaustive analysis produced 1,393,850 significant edges, revealing a heavy-tailed distribution of connectivity among features. Notably, 40% of the top 20 hub features lacked biological annotation, indicating systematic biases in previous analyses. Redundancy was found to deepen with interaction order, and late-layer features were shown to causally influence cell state maturation.
Implications
These findings have significant implications for the interpretation of biological foundation models, suggesting that current methodologies may overlook critical features and interactions. The results could inform future research directions in single-cell biology and enhance the design of models for better interpretability and biological relevance.
Fingerprinting Concepts in Data Streams with Supervised and Unsupervised Meta-Information
Time Series
- FiCSUM framework combines supervised and unsupervised meta-information for concept representation.
- Dynamic weighting strategy allows for flexible adaptation to different datasets.
- FiCSUM significantly outperforms existing methods in detecting concept drift and classification accuracy.
- The framework captures a wide range of concept behaviors, enhancing the identification of recurring concepts.
Read more
Fingerprinting Concepts in Data Streams with Supervised and Unsupervised Meta-Information
Summary
The paper addresses the challenge of concept drift in data streams, where the distribution of data changes over time, impacting the performance of classification algorithms. The authors propose a novel framework called FiCSUM (Fingerprinting with Combined Supervised and Unsupervised Meta-Information) that utilizes a comprehensive fingerprint vector made up of various meta-information features to uniquely represent concepts in data streams. This approach aims to enhance the detection of concept drift by allowing classifiers to adapt to new or recurring concepts effectively. The authors highlight that existing methods often rely on a limited number of meta-information features, which can lead to failures in distinguishing between concepts. FiCSUM combines both supervised and unsupervised meta-information, capturing at least 65 aspects of concept behavior. A dynamic weighting strategy is introduced to adjust the influence of each feature based on the dataset, ensuring flexibility and generalizability. The evaluation of FiCSUM across 11 real-world and synthetic datasets demonstrates its superiority in classification accuracy and its ability to model underlying concept drift compared to state-of-the-art methods.
Methodology
The FiCSUM framework constructs a fingerprint vector that integrates multiple meta-information features to represent concepts in data streams. A dynamic weighting scheme is employed to learn and adjust the importance of each feature online, allowing for effective concept drift detection and adaptation. The authors evaluate the framework against various datasets to assess its performance in classification and concept drift modeling.
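The fingerprint-and-compare idea can be sketched with a handful of meta-features and a per-feature weight vector. The three features and the distance function here are illustrative; FiCSUM's fingerprint spans dozens of meta-features and learns its weights online rather than fixing them.

```python
import statistics

def fingerprint(window, errors):
    """Toy fingerprint: a few unsupervised (level, spread) and supervised
    (error-rate) meta-features computed over a data window."""
    return [
        statistics.fmean(window),       # unsupervised: level
        statistics.pstdev(window),      # unsupervised: spread
        sum(errors) / len(errors),      # supervised: classifier error rate
    ]

def weighted_distance(f1, f2, weights):
    """Per-feature weights let the detector emphasise whichever
    meta-features actually separate concepts on a given stream."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, f1, f2))

concept_a = fingerprint([0.10, 0.20, 0.15, 0.12], [0, 0, 1, 0])
concept_b = fingerprint([0.90, 0.80, 0.85, 0.95], [1, 1, 0, 1])
same_a    = fingerprint([0.11, 0.19, 0.16, 0.13], [0, 1, 0, 0])

w = [1.0, 1.0, 1.0]
drift    = weighted_distance(concept_a, concept_b, w)  # new concept: far
no_drift = weighted_distance(concept_a, same_a, w)     # same concept: near
```

A large fingerprint distance flags drift (or a match against a stored fingerprint flags a recurring concept), while small distances let the active classifier keep running.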
Results
FiCSUM demonstrated significantly better classification accuracy and a higher ability to capture ground truth concepts compared to purely supervised or unsupervised methods. The framework effectively discriminated between concepts across diverse datasets, showcasing its robustness in handling different types of concept drifts.
Implications
The proposed FiCSUM framework has potential applications in real-time data analysis, such as in sensor networks, financial markets, and online services, where concept drift is prevalent. By improving the detection and adaptation to concept drift, it can enhance the performance of machine learning systems in dynamic environments.
Geometry-Aware Probabilistic Circuits via Voronoi Tessellations
- Introduces Voronoi tessellations as a method to enhance the geometric adaptability of probabilistic circuits.
- Formalizes the incompatibility between Voronoi-based routing and tractable inference in PCs.
- Presents two solutions: a certified approximate inference framework and a structural condition for exact inference.
- Develops a differentiable relaxation for Voronoi tessellations to enable gradient-based learning.
Read more
Geometry-Aware Probabilistic Circuits via Voronoi Tessellations
Summary
This paper addresses the limitations of traditional probabilistic circuits (PCs) that utilize data-independent mixture weights, which restrict their ability to adapt to the local geometry of data manifolds. The authors propose the use of Voronoi tessellations (VT) to incorporate geometric structure into the sum nodes of PCs. However, the naive integration of VT into PCs leads to tractability issues due to the complexity of computing integrals over Voronoi-defined regions. To resolve this, the authors formalize the incompatibility between Voronoi-based routing and tractable inference and present two complementary solutions: an approximate inference framework that provides certified bounds for inference and a structural condition that allows for exact tractable inference through Hierarchical Factorized Voronoi (HFV) circuits. Additionally, they introduce a differentiable relaxation for VT to facilitate gradient-based learning, enabling effective training on standard density estimation tasks. The paper concludes with proof-of-concept experiments demonstrating the effectiveness of the proposed methods.
Methodology
The authors develop an approximate inference framework that utilizes tractable axis-aligned box approximations to compute provable bounds on partition functions and marginals. They also identify a structural condition for Voronoi tessellations that aligns with the decomposition of the circuit, allowing for exact inference. A soft gating mechanism is introduced to facilitate gradient-based learning, transitioning from soft to hard assignments during training and testing.
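The soft-to-hard gating can be sketched as a temperature-controlled softmax over negative squared distances to the anchor points: at high temperature the routing is smooth and differentiable, and as the temperature approaches zero it recovers the hard nearest-anchor Voronoi assignment. The anchors and temperatures below are invented for the illustration.

```python
import numpy as np

def soft_voronoi(x, anchors, temperature):
    """Soft gating over Voronoi cells: responsibilities are a softmax of
    negative squared distances to the anchors. As temperature -> 0 this
    approaches the hard nearest-anchor (Voronoi) assignment."""
    d2 = ((x[None, :] - anchors) ** 2).sum(axis=1)
    logits = -d2 / temperature
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

anchors = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
x = np.array([0.4, 0.1])             # nearest to anchor 0

soft = soft_voronoi(x, anchors, temperature=1.0)   # smooth: trainable routing
hard = soft_voronoi(x, anchors, temperature=1e-3)  # ~one-hot: test-time routing
```

Because the soft responsibilities are differentiable in both the input and the anchors, the tessellation itself can be trained by gradient descent before being hardened.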
Results
The proposed methods demonstrate improved performance in density estimation tasks, with the approximate inference framework providing reliable bounds and the HFV circuits allowing for exact inference under certain conditions. The soft-to-hard transition mechanism for learning Voronoi tessellated PCs is shown to converge rapidly, ensuring effective training.
Implications
The findings suggest that incorporating geometric structures into probabilistic circuits can enhance their expressiveness and adaptability, making them more suitable for complex data distributions. This approach has potential applications in areas requiring reliable probabilistic reasoning, such as out-of-distribution detection, causal reasoning, and structured prediction.
Task-Conditioned Routing Signatures in Sparse Mixture-of-Experts Transformers
NLP
Large Language Models
Interpretability
- Introduction of routing signatures as a compact representation of expert activation patterns.
- Demonstration of strong task-conditioned clustering of routing signatures in the OLMoE model.
- Validation of routing patterns against permutation and load-balancing baselines.
- High accuracy in task classification using routing signatures.
Read more
Task-Conditioned Routing Signatures in Sparse Mixture-of-Experts Transformers
Summary
This paper explores the routing mechanisms in Sparse Mixture-of-Experts (MoE) architectures, which are crucial for the efficient scaling of large language models through conditional computation. The authors introduce 'routing signatures', a vector representation that summarizes expert activation patterns across layers for specific prompts. Using the OLMoE-1B-7B-0125-Instruct model, the study demonstrates that prompts belonging to the same task category generate highly similar routing signatures, while those from different categories show significantly lower similarity. The authors quantify this effect, finding that within-category routing similarity (0.8435 ± 0.0879) is notably higher than across-category similarity (0.6225 ± 0.1687), with a Cohen's d of 1.44. A logistic regression classifier trained on routing signatures achieves a cross-validated accuracy of 92.5% ± 6.1% for four-way task classification. The paper also introduces statistical baselines to validate the findings, showing that the observed routing patterns are not merely due to sparsity or balancing constraints. Additionally, the analysis reveals that task structure becomes more pronounced in deeper layers of the model. The authors conclude that routing in sparse transformers is a task-sensitive component of computation rather than just a balancing mechanism. They also release MOE-XRAY, a toolkit for routing telemetry and analysis.
Methodology
The authors conducted experiments using the OLMoE-1B-7B-0125-Instruct model, analyzing 80 prompts across four task categories: code, math, story, and factual question-answering. They introduced routing signatures to represent expert usage frequencies and employed logistic regression for task classification. Statistical baselines were established to ensure the validity of their findings.
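A routing signature is just the per-layer expert-usage frequencies flattened into one vector, after which similarity and classification are standard. The toy traces below (two layers, four experts) are invented to show the within-category versus across-category contrast; the real study uses 80 prompts and the OLMoE router's actual top-k choices.

```python
from collections import Counter

def routing_signature(expert_choices, num_experts):
    """Flatten per-layer expert-usage frequencies into one vector.
    `expert_choices[layer]` lists the expert picked for each token."""
    sig = []
    for layer in expert_choices:
        counts = Counter(layer)
        total = len(layer)
        sig.extend(counts.get(e, 0) / total for e in range(num_experts))
    return sig

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Two "code" prompts route alike; a "math" prompt routes differently.
code_1 = routing_signature([[0, 0, 1, 0], [2, 2, 3, 2]], 4)
code_2 = routing_signature([[0, 1, 0, 1], [2, 3, 2, 2]], 4)
math_1 = routing_signature([[3, 3, 2, 3], [1, 0, 1, 1]], 4)

within = cosine(code_1, code_2)   # same task category: high similarity
across = cosine(code_1, math_1)   # different category: low similarity
```

Stacking such vectors as rows gives the feature matrix on which the paper's four-way logistic-regression task classifier is trained.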
Results
The study found that routing signatures from the same task category exhibited significantly higher similarity compared to those from different categories. The logistic regression classifier achieved a high accuracy of 92.5% ± 6.1% in classifying tasks based solely on routing signatures. The results indicate that routing behavior is influenced by task structure, particularly in deeper layers of the model.
Implications
The findings suggest that routing mechanisms in sparse transformers can be leveraged for improved interpretability and debugging of large language models. Understanding task-conditioned routing may enhance the design of more efficient and effective MoE architectures.
Single molecule localization microscopy challenge: a biologically inspired benchmark for long-sequence modeling
- Introduction of SMLM-C as a benchmark for evaluating long-sequence models in biological imaging.
- Demonstration of significant performance degradation in state space models under conditions of high temporal discontinuity.
- Highlighting the unique challenges posed by sparse, irregular, and noise-corrupted temporal signals in SMLM data.
- Emphasis on the necessity for improved sequence modeling methodologies to address the complexities of biological data.
Read more
Single molecule localization microscopy challenge: a biologically inspired benchmark for long-sequence modeling
Summary
This paper introduces the Single Molecule Localization Microscopy Challenge (SMLM-C), a novel benchmark designed to evaluate state space models (SSMs) on sparse spatiotemporal localization data derived from Single Molecule Localization Microscopy (SMLM) techniques. The authors highlight that while SSMs have shown promise in long-sequence modeling tasks, their performance has primarily been assessed in synthetic environments or domains with dense temporal signals, leaving a gap in understanding their efficacy in biologically realistic scenarios characterized by irregular and sparse data. SMLM-C consists of ten simulation scenarios that model the blinking dynamics of fluorophores in dSTORM and DNA-PAINT modalities, providing a controlled environment to assess the ability of SSMs to predict ground-truth emitter positions from observed localization sequences. The study evaluates two state space models, S5 and Mamba, focusing on the impact of temporal discontinuity on model performance. Results indicate that current long-context sequence models struggle significantly with the challenges posed by heavy-tailed blinking dynamics and realistic detection noise, underscoring the need for further advancements in modeling techniques to effectively handle such complex biological data.
Methodology
The authors developed SMLM-C using a simulation engine that models fluorophore blinking kinetics, emitter density variations, and localization uncertainties. The benchmark includes ten scenarios with sequences extending up to 10,000 frames, designed to evaluate the performance of state space models in predicting emitter positions from sparse localization sequences. Two models, S5 and Mamba, were tested under varying conditions of temporal discontinuity to assess their capabilities.
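The sparsity challenge can be made concrete with a crude two-state ON/OFF blinking simulation. This is a stand-in for the benchmark's simulation engine: the real scenarios use heavy-tailed kinetics and detection noise, and the transition probabilities below are invented.

```python
import random

def simulate_blinking(frames, p_on, p_off, seed=0):
    """Two-state (ON/OFF) blinking kinetics: returns the sparse list of
    frames in which the emitter is visible."""
    rng = random.Random(seed)
    on, visible = False, []
    for t in range(frames):
        if on:
            on = rng.random() >= p_off   # chance of switching OFF
        else:
            on = rng.random() < p_on     # chance of switching ON
        if on:
            visible.append(t)
    return visible

frames = 10_000
visible = simulate_blinking(frames, p_on=0.002, p_off=0.3)
sparsity = 1 - len(visible) / frames     # fraction of frames with no signal
```

Even this simplified kinetics leaves the emitter dark in well over 95% of a 10,000-frame sequence, which is the temporal discontinuity regime in which the evaluated state space models degrade.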
Results
The evaluation revealed that both S5 and Mamba struggled with the combined effects of extreme sparsity, heavy-tailed blinking dynamics, and realistic detection noise. The performance of these models significantly degraded as the temporal discontinuity increased, indicating limitations in their ability to aggregate information over long sequences in biologically realistic contexts.
Implications
The findings suggest a critical need for the development of more robust sequence models that can effectively handle the complexities of sparse and irregular temporal data encountered in biological imaging. The SMLM-C benchmark could serve as a valuable tool for researchers aiming to advance methodologies in long-sequence modeling, particularly in the field of biological data analysis.
High-resolution weather-guided surrogate modeling for data-efficient cross-location building energy prediction
Optimization
Time Series
Efficient ML
- Introduces a high-resolution weather-informed surrogate modeling approach for building energy prediction.
- Achieves cross-location generalization with minimal simulation effort, enabling zero-shot predictions.
- Utilizes weekly weather data to capture fine-grained weather-energy relationships.
- Evaluates multiple time-series learning strategies for optimal weather input encoding.
Read more
High-resolution weather-guided surrogate modeling for data-efficient cross-location building energy prediction
Summary
This paper addresses the challenges of building energy prediction across different locations using surrogate models, which are typically location-specific and require extensive simulations. The authors propose a novel high-resolution weather-informed surrogate modeling approach that captures short-term weather-energy demand patterns shared across regions. This method allows for the development of a generalized surrogate model that can predict energy performance in unseen locations with minimal simulation effort. By focusing on weekly weather data instead of annual aggregates, the model enhances its ability to generalize across different climates. The study evaluates various time-series learning strategies, including Temporal Convolutional Networks, Transformer-based encoders, and convolutional autoencoders, to determine the best approach for encoding weather inputs. The results indicate that the proposed model maintains high predictive accuracy for buildings in the same climate zone and exhibits only slight performance degradation in different climate zones, demonstrating its potential for scalable and reusable applications in building design optimization.
Methodology
The study employs a high-resolution weather-informed surrogate modeling approach that leverages weekly weather data to learn energy demand patterns. It compares three time-series learning strategies: Temporal Convolutional Networks (TCNs), Transformer-based encoders, and convolutional autoencoders, to identify the most effective method for encoding weather inputs. The model is trained on data from a few representative sites to facilitate generalization to unseen locations.
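The switch from annual aggregates to weekly inputs amounts to slicing the hourly weather series into week-long windows before encoding. A minimal sketch, with a synthetic temperature trace standing in for real weather files:

```python
def weekly_windows(hourly, hours_per_week=168):
    """Slice an hourly weather series into week-long windows: the
    high-resolution inputs used instead of annual aggregates."""
    return [
        hourly[i:i + hours_per_week]
        for i in range(0, len(hourly) - hours_per_week + 1, hours_per_week)
    ]

# Synthetic hourly temperature with a daily cycle, three weeks long.
hourly_temp = [10 + (h % 24) * 0.5 for h in range(168 * 3)]
weeks = weekly_windows(hourly_temp)
```

Each window is then fed to one of the compared encoders (TCN, Transformer, or convolutional autoencoder) to produce the weather embedding for that week's energy prediction.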
Results
The experimental results show that the proposed surrogate model maintains high predictive accuracy for buildings within the same climate zone when trained on a single location. There is minimal degradation in performance when applied to buildings in different climate zones, indicating strong generalization capabilities.
Implications
This research has significant implications for building design optimization, as it enables faster and more efficient energy performance evaluations across various locations. The scalable and reusable nature of the proposed surrogate model can facilitate more sustainable building practices and support climate-responsive design strategies.
Sharpness-Aware Minimization for Generalized Embedding Learning in Federated Recommendation
- Introduces FedRecGEL, a framework focusing on generalized item embedding learning in federated recommendation systems.
- Reformulates the federated recommendation problem as a multi-task learning challenge, emphasizing item-centered perspectives.
- Utilizes sharpness-aware minimization to enhance the stability and generalization of item embeddings.
- Demonstrates significant performance improvements over existing federated recommendation methods through extensive experiments.
Read more
Sharpness-Aware Minimization for Generalized Embedding Learning in Federated Recommendation
Summary
This paper addresses the challenges of learning generalized item embeddings in federated recommender systems, which are crucial for effective knowledge sharing across clients while maintaining user privacy. The authors propose a novel framework called Federated Recommendation with Generalized Embedding Learning (FedRecGEL), which reformulates the federated recommendation problem from an item-centered perspective and treats it as a multi-task learning problem. This approach emphasizes the importance of stable learning of generalized embeddings amidst the heterogeneity and sparsity of local data distributions in a cross-device setting. The authors employ sharpness-aware minimization to enhance the generalization capability of the learned embeddings, thereby stabilizing the training process and improving recommendation performance. Extensive experiments conducted on four datasets demonstrate that FedRecGEL significantly outperforms existing federated recommendation methods, showcasing its effectiveness in addressing the inherent difficulties of learning generalized embeddings.
Methodology
The authors reformulate the federated recommendation problem as a multi-task learning problem from an item-centered perspective. They utilize sharpness-aware minimization to stabilize the training process and improve the generalization of item embeddings, modifying both local training and global aggregation processes to incorporate this technique.
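The sharpness-aware minimization step at the heart of the method can be sketched generically: ascend to the worst-case point within a small ball around the current parameters, then descend using the gradient taken there. This is textbook SAM in one dimension, not the paper's federated modification; the learning rate, radius, and objective are illustrative.

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step (sketch): perturb toward the
    worst case within an L2 ball of radius rho, then update with the
    gradient evaluated at the perturbed point."""
    g = grad_fn(w)
    scale = rho / (abs(g) + 1e-12)   # L2 normalisation (1-D case)
    w_adv = w + scale * g            # inner maximisation step
    g_adv = grad_fn(w_adv)           # gradient at the perturbed point
    return w - lr * g_adv            # outer minimisation step

grad = lambda w: 2.0 * (w - 1.0)     # d/dw of the toy loss (w - 1)^2

w = 3.0
for _ in range(50):
    w = sam_step(w, grad)
```

Seeking parameters whose entire neighborhood has low loss is what stabilizes the shared item embeddings against the heterogeneous local updates of individual clients.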
Results
The experiments conducted on four datasets show that FedRecGEL significantly improves the performance of federated recommendation systems compared to existing methods, effectively addressing the challenges posed by data heterogeneity and sparsity.
Implications
The proposed framework has potential applications in various domains where federated learning is applicable, such as personalized content recommendation, privacy-preserving data sharing, and collaborative filtering, enhancing user experience while ensuring data privacy.
Beyond the Class Subspace: Teacher-Guided Training for Reliable Out-of-Distribution Detection in Single-Domain Models
Computer Vision
Theory
Efficient ML
- Identification of Domain-Sensitivity Collapse (DSC) as a critical failure mode in single-domain OOD detection.
- Introduction of Teacher-Guided Training (TGT) to enhance domain sensitivity in single-domain models.
- Demonstrated significant improvements in OOD detection performance across multiple benchmarks.
- TGT maintains in-domain classification accuracy while reducing false positive rates for OOD detection.
Read more
Beyond the Class Subspace: Teacher-Guided Training for Reliable Out-of-Distribution Detection in Single-Domain Models
Summary
This paper addresses the challenge of out-of-distribution (OOD) detection in single-domain models, which often suffer from a phenomenon termed Domain-Sensitivity Collapse (DSC). DSC occurs when supervised training compresses features into a low-rank class subspace, leading to a loss of sensitivity to domain shifts. The authors propose a novel approach called Teacher-Guided Training (TGT), which utilizes a frozen multi-domain teacher model (DINOv2) to distill class-suppressed residual structures into a student model during training. This method aims to restore domain-sensitive geometry without adding inference overhead. The effectiveness of TGT is validated across eight single-domain benchmarks, demonstrating significant reductions in false positive rates for distance-based OOD detection methods while maintaining or slightly improving in-domain accuracy.
Methodology
The authors formalize the concept of Domain-Sensitivity Collapse (DSC) and introduce Teacher-Guided Training (TGT), which involves training a student model to capture domain-sensitive features by leveraging a frozen multi-domain teacher model. The teacher's class-suppressed residuals are used to guide the student, enhancing its ability to detect OOD samples without the need for the teacher during inference.
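One way to picture a class-suppressed residual is to remove the per-class mean component from each teacher feature, so that what remains carries domain rather than class information. This is a deliberate simplification: the paper's construction over DINOv2 features is more involved, and the data below is synthetic.

```python
import numpy as np

def class_suppressed_residual(feats, labels, num_classes):
    """Subtract each class's mean feature: the leftover residual is the
    class-agnostic structure distilled into the student (simplified)."""
    feats = feats.copy()
    for c in range(num_classes):
        idx = labels == c
        feats[idx] -= feats[idx].mean(axis=0, keepdims=True)
    return feats

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=60)
class_means = rng.normal(size=(3, 4)) * 5               # strong class signal
feats = class_means[labels] + rng.normal(size=(60, 4))  # + residual structure

resid = class_suppressed_residual(feats, labels, 3)
```

After suppression, the dominant class directions are gone and the residual geometry, which is what distinguishes in-domain from shifted-domain inputs, is what the student is trained to reproduce.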
Results
TGT significantly reduces the false positive rate at 95% recall (FPR@95) for distance-based OOD detection methods, with improvements of 11.61 percentage points for MDS, 10.78 for ViM, and 12.87 for kNN on average across ResNet-50. The method also shows consistent performance across eight single-domain benchmarks, closing the gap to a teacher-feature oracle while maintaining in-domain classification accuracy.
Implications
The findings suggest that TGT can be effectively applied in practical systems trained on single-domain data, such as medical imaging and industrial inspection, where reliable OOD detection is critical. This approach may enhance the robustness of machine learning models in real-world applications by improving their ability to identify unseen classes and samples from different domains.
Monitoring and Prediction of Mood in Elderly People during Daily Life Activities
Time Series
- Development of a wearable system for mood monitoring in elderly individuals.
- Utilization of ecological momentary assessment (EMA) for real-time mood evaluation.
- Machine learning classifier trained on physiological data from a wristband.
- Promising results in mood prediction accuracy, especially for happiness and activeness.
Summary
This paper presents an intelligent wearable system designed to monitor and predict mood states in elderly individuals during their daily activities. The system consists of a wristband that records various physiological signals and a mobile application for ecological momentary assessment (EMA). The authors utilize machine learning to train a classifier that predicts mood states based solely on data from the wristband. The study addresses the growing concern of mental health among the elderly, particularly those living alone, and aims to provide a solution for continuous mood monitoring. The results indicate that the system achieves promising accuracy in mood prediction, particularly for happiness and activeness, and is comparable to existing state-of-the-art methods. The approach emphasizes the importance of integrating mental health monitoring into wearable technology to enhance the quality of life for older adults.
Methodology
The system comprises an Empatica E4 wristband that collects physiological data (e.g., heart rate, skin temperature, accelerometer data) and a mobile app for EMA input. Users report their mood state through two simple questions five times a day. The collected data is processed offline, and features are extracted using a sliding window approach. A machine learning classifier is then trained to predict mood states based on the physiological signals.
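The sliding-window feature extraction step can be sketched as follows. The window length, step size, and the specific statistics are assumptions for illustration; the paper's exact feature set is not specified here.

```python
import numpy as np

# Hypothetical 1-D physiological signal, e.g. heart rate sampled at 1 Hz.
rng = np.random.default_rng(1)
signal = 70 + rng.normal(scale=2.0, size=600)   # 10 minutes of samples

def sliding_window_features(x, win, step):
    """Extract simple per-window statistics (mean, std, min, max), a
    common first step before training a classifier on wearable data."""
    feats = []
    for start in range(0, len(x) - win + 1, step):
        w = x[start:start + win]
        feats.append([w.mean(), w.std(), w.min(), w.max()])
    return np.array(feats)

X = sliding_window_features(signal, win=60, step=30)  # 60 s windows, 50% overlap
print(X.shape)   # → (19, 4): one 4-feature row per window
```

Each row of `X` would then be paired with the nearest EMA self-report to form a labelled training example for the mood classifier.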
Results
The system demonstrates high accuracy in predicting mood states, particularly in detecting happiness and activeness. The results are comparable to existing methods in the field, indicating the effectiveness of the wearable device and the EMA approach in real-life settings.
Implications
This research has significant implications for enhancing mental health monitoring in elderly populations, potentially leading to improved interventions and support systems. The integration of wearable technology in daily life can facilitate early detection of mood disturbances, allowing for timely mental health support.
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
Reinforcement Learning
Large Language Models
Optimization
- Optimal allocation of sampling compute in LLM RL is crucial for maximizing performance.
- The number of parallel rollouts per problem increases with compute budget but saturates at higher levels.
- Easy and hard problem sets show similar scaling trends driven by different mechanisms.
- Performance is less sensitive to the number of unique problems per batch compared to rollouts per problem.
Summary
The paper addresses the challenge of optimizing compute allocation for reinforcement learning (RL) in large language models (LLMs), an area that lacks established scaling laws compared to pre-training. The authors propose the IsoCompute Playbook, which provides a framework for understanding how to allocate sampling compute effectively during on-policy RL post-training of LLMs. They explore three key resources: the number of parallel rollouts per problem, the number of problems per batch, and the number of sequential update steps. Through extensive experiments across various models and problem distributions, the authors identify that the optimal number of rollouts per problem increases with the compute budget but eventually saturates. They also find that while easy and hard problems exhibit similar scaling trends, the underlying mechanisms differ, with easy problems benefiting from performance sharpening and hard problems requiring more rollouts to discover successful trajectories. The study concludes with practical guidelines for compute-efficient RL post-training, emphasizing the importance of balancing the number of rollouts and problems per batch based on available compute resources.
Methodology
The authors conducted a series of experiments using three base models (Qwen2.5-7B-Instruct, Qwen3-4B-Instruct, and Llama 3.1-8B-Instruct) to analyze the scaling behavior of RL in LLMs. They framed the problem as a compute-constrained optimization and evaluated the effects of varying the number of parallel rollouts, problems per batch, and update steps on performance metrics across different problem distributions.
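The compute-accounting framing can be made concrete with a toy budget. The budget model and the numbers below are illustrative assumptions, not the paper's experimental settings.

```python
# Illustrative accounting of the sampling-compute knobs studied in the
# paper. Fix a per-update-step rollout budget and enumerate the
# (rollouts-per-problem k, problems-per-batch B) splits that spend it.
per_step_budget = 1024          # total rollouts sampled per update step

allocations = [(k, per_step_budget // k)
               for k in (1, 2, 4, 8, 16, 32, 64)
               if per_step_budget % k == 0]

for k, B in allocations:
    print(f"rollouts/problem={k:3d}  problems/batch={B:4d}")

# The paper's finding: the compute-optimal k grows with total budget but
# saturates, while performance is comparatively flat in B over a
# moderate range.
```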
Results
The study revealed that the compute-optimal number of rollouts per problem increases with the compute budget and saturates at higher levels. The findings indicated that easy problems benefit from performance sharpening while hard problems require more rollouts to discover rare successful trajectories. Additionally, the number of problems per batch had only a marginal effect on performance when kept within a moderate range, suggesting that as the budget grows, additional compute is better spent on rollouts per problem than on more problems per batch.
Implications
The findings provide a structured approach for practitioners in RL to allocate compute resources effectively, enhancing the performance of LLMs in various applications. This work contributes to the understanding of scaling laws in RL, potentially guiding future research and practical implementations in the field.
Security Considerations for Artificial Intelligence Agents
NLP
Large Language Models
Theory
- AI agents introduce unique security vulnerabilities distinct from traditional software systems.
- The blurring of code and data in LLM-powered agents creates new attack surfaces.
- Existing security mechanisms are often inadequate for the dynamic and autonomous nature of AI agents.
- A layered defense strategy is proposed to address the security challenges of AI agents.
Summary
This paper presents a response to the NIST/CAISI Request for Information regarding the security of AI agents, drawing on Perplexity's extensive experience with general-purpose agentic systems. The authors highlight the unique security challenges posed by AI agents, particularly those powered by large language models (LLMs). They discuss how these systems blur the lines between code and data, leading to new vulnerabilities such as indirect prompt injection and cascading failures in workflows. The paper maps out the principal attack surfaces and assesses current defenses, proposing a layered security approach that includes input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement. Furthermore, the authors identify significant gaps in current standards and research, emphasizing the need for adaptive security benchmarks and guidance for secure multi-agent system design aligned with NIST risk management principles.
Methodology
The authors conducted a comprehensive analysis of security threats and vulnerabilities specific to AI agent systems, leveraging their operational experience with these technologies. They mapped attack surfaces and evaluated existing security measures, proposing enhancements based on identified gaps.
Results
The paper outlines various security threats associated with AI agents, including the risks of prompt injection and the challenges posed by the non-deterministic behavior of LLMs. It also critiques current security mechanisms and suggests a layered defense approach to mitigate these risks effectively.
Implications
The findings of this paper have significant implications for the design and deployment of AI agents in both enterprise and consumer applications. By addressing the identified security gaps, developers can enhance the safety and reliability of AI systems, fostering greater trust and adoption in various sectors.
ARROW: Augmented Replay for RObust World models
- ARROW introduces a dual-buffer system for memory-efficient continual reinforcement learning.
- The algorithm is inspired by neuroscience, specifically the Complementary Learning Systems theory.
- ARROW demonstrates reduced forgetting in tasks without shared structure compared to traditional methods.
- The approach maintains comparable forward transfer, indicating effective knowledge retention.
Summary
The paper presents ARROW (Augmented Replay for RObust World models), a novel model-based continual reinforcement learning (CRL) algorithm designed to address the challenges of catastrophic forgetting while maintaining performance across tasks. Traditional model-free methods often rely on large replay buffers, which can be memory-intensive and inefficient. ARROW draws inspiration from neuroscience, particularly the Complementary Learning Systems theory, proposing a dual-buffer system: a short-term buffer for recent experiences and a long-term buffer to ensure task diversity through intelligent sampling. The algorithm extends DreamerV3 by incorporating a memory-efficient replay mechanism that balances the retention of recent experiences with the preservation of long-term knowledge. The authors evaluate ARROW in two distinct continual learning settings: tasks without shared structure (Atari) and tasks with shared structure (Procgen CoinRun variants). The results indicate that ARROW significantly reduces forgetting in tasks lacking shared structure while maintaining comparable forward transfer to existing model-free and model-based baselines. The findings suggest that model-based approaches, particularly those inspired by biological systems, hold promise for enhancing continual reinforcement learning.
Methodology
ARROW employs a model-based approach that extends DreamerV3, utilizing a strategically managed replay mechanism with two buffers: a short-term buffer for recent experiences and a long-term buffer for preserving task diversity. This design allows for efficient memory usage while balancing the retention of recent and long-term knowledge.
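A minimal sketch of such a dual-buffer scheme is shown below. The capacities, eviction rules, and mixing ratio are illustrative assumptions in the spirit of ARROW, not its exact sampling strategy.

```python
import random

class DualBuffer:
    """Short-term buffer for recency, long-term buffer for task diversity."""
    def __init__(self, short_cap=100, long_cap=100, long_every=10):
        self.short, self.long = [], []
        self.short_cap, self.long_cap = short_cap, long_cap
        self.long_every, self.count = long_every, 0

    def add(self, transition):
        self.short.append(transition)
        if len(self.short) > self.short_cap:
            self.short.pop(0)                     # FIFO: keep only recent steps
        self.count += 1
        if self.count % self.long_every == 0:     # thinned subsample preserves
            self.long.append(transition)          # experiences from old tasks
            if len(self.long) > self.long_cap:
                self.long.pop(random.randrange(len(self.long)))

    def sample(self, n, long_frac=0.5):
        n_long = min(int(n * long_frac), len(self.long))
        return (random.sample(self.short, n - n_long)
                + random.sample(self.long, n_long))

random.seed(0)
buf = DualBuffer()
for t in range(500):                  # stream of transitions from 4 "tasks"
    buf.add(("task", t // 125, t))
batch = buf.sample(16)
# The short-term buffer holds only the latest steps, while the long-term
# buffer still spans the earliest tasks.
print(min(t for _, _, t in buf.short), len(buf.long))
```

Mixing both buffers in each world-model update is what lets recent learning proceed without overwriting older task knowledge.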
Results
ARROW was evaluated against model-free and model-based baselines in two continual learning settings. The results showed that ARROW significantly mitigated forgetting in tasks without shared structure while achieving similar levels of forward transfer, demonstrating its effectiveness in continual reinforcement learning scenarios.
Implications
The development of ARROW suggests that model-based reinforcement learning can effectively address the challenges of continual learning, particularly in dynamic environments. This has potential applications in robotics, adaptive systems, and any domain requiring agents to learn and adapt to new tasks without losing previously acquired knowledge.
Disentangled Representation Learning through Unsupervised Symmetry Group Discovery
Reinforcement Learning
Robotics
Theory
- Introduces a method for autonomous discovery of symmetry group structures in representation learning.
- Proves the identifiability of the true symmetry group decomposition under minimal assumptions.
- Develops two algorithms: one for symmetry group discovery and another for LSBD representation learning.
- Demonstrates superior performance of the proposed method over existing LSBD approaches in various environments.
Summary
This paper presents a novel approach to symmetry-based disentangled representation learning that eliminates the need for prior knowledge of the symmetry group's structure. The authors propose a method where an embodied agent autonomously discovers the group structure of its action space through unsupervised interactions with the environment. They prove the identifiability of the true symmetry group decomposition under minimal assumptions and introduce two algorithms: one for discovering the group decomposition from interaction data and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without specific subgroup assumptions. The method is validated across three environments with varying group decompositions, demonstrating superior performance compared to existing LSBD approaches. This work addresses the limitations of previous methods that relied on strong prior knowledge or restrictive assumptions, thereby advancing the field of unsupervised disentangled representation learning.
Methodology
The authors derive two algorithms based on the identifiability of the symmetry group decomposition. The first algorithm discovers the symmetry group structure from interaction data, while the second learns LSBD representations without imposing structural assumptions on subgroups. The methodology leverages transitions in the environment, aligning with reinforcement learning principles, where agents actively interact to induce state changes.
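The simplest instance of discovering group structure from interaction data is detecting the order of a cyclic action. The toy environment below is invented for illustration and is far simpler than the paper's setting.

```python
# Toy environment: the state is a position on a discrete circle and one
# action rotates it by a fixed step, i.e. the action generates a cyclic
# group. Recovering the group's order purely from observed transitions is
# a (much simplified) instance of the structure discovery the paper
# formalizes.
def step(state, n_positions=12):
    return (state + 1) % n_positions

def discover_cycle_order(initial_state, max_steps=100):
    """Apply the action until the initial state recurs; the number of
    applications is the order of the cyclic subgroup it generates."""
    state, order = step(initial_state), 1
    while state != initial_state and order < max_steps:
        state, order = step(state), order + 1
    return order

print(discover_cycle_order(0))   # the agent infers a C_12 factor
```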
Results
The proposed method outperformed existing LSBD approaches in three distinct environments with different group decompositions. The experimental validation confirmed the effectiveness of the algorithms in discovering symmetry structures and learning disentangled representations.
Implications
This research has significant implications for improving interpretability and transferability in machine learning models. By enabling unsupervised discovery of symmetry groups, the approach can enhance the understanding of latent factors in various applications, including robotics and reinforcement learning, where environment interactions are crucial.
Structure-Aware Epistemic Uncertainty Quantification for Neural Operator PDE Surrogates
Theory
Optimization
Efficient ML
- Introduces a structure-aware UQ scheme for neural operators in PDE modeling.
- Focuses on epistemic uncertainty arising from limited data and imperfect training.
- Implements targeted perturbations in the lifting module to improve uncertainty estimates.
- Demonstrates superior performance in uncertainty coverage and alignment on PDE benchmarks.
Summary
This paper addresses the challenge of epistemic uncertainty in neural operators (NOs) used for surrogate modeling of partial differential equations (PDEs). The authors propose a novel structure-aware epistemic uncertainty quantification (UQ) scheme that enhances the reliability of uncertainty bands in NO predictions. Traditional methods often apply unstructured perturbations across the entire network, which can lead to inaccurate uncertainty estimates. Instead, the proposed method restricts Monte Carlo sampling to a module-aligned subspace, injecting stochasticity only into the lifting module while treating the propagation and recovery modules as deterministic. This approach is instantiated with two lightweight perturbations: channel-wise multiplicative feature dropout and Gaussian feature perturbation. The authors validate their method on challenging PDE benchmarks, demonstrating that their structure-aware design yields more reliable coverage and tighter uncertainty bands compared to existing baselines, while maintaining computational efficiency.
Methodology
The authors propose a structure-aware epistemic uncertainty quantification scheme that restricts Monte Carlo sampling to a module-aligned subspace. They implement two perturbation techniques: channel-wise multiplicative feature dropout and Gaussian feature perturbation, followed by standard calibration to construct uncertainty bands. This targeted approach contrasts with traditional methods that apply unstructured perturbations across the entire network.
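The module-aligned sampling idea can be sketched with a toy stand-in for a neural operator. The linear modules below are illustrative placeholders, not an actual FNO; the key point is that stochasticity enters only the lifting stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "neural operator": lifting -> propagation -> recovery, all linear here.
W_lift = rng.normal(size=(8, 16))        # lifting module (stochastic at test time)
W_prop = rng.normal(size=(16, 16)) / 4   # propagation: deterministic
W_rec  = rng.normal(size=(16, 1)) / 4    # recovery:    deterministic

def forward(x, drop_p=0.0, rng=None):
    h = x @ W_lift
    if drop_p > 0:
        # Channel-wise multiplicative dropout injected ONLY after lifting.
        mask = rng.random(h.shape[1]) >= drop_p
        h = h * mask / (1 - drop_p)
    return np.tanh(h @ W_prop) @ W_rec

x = rng.normal(size=(4, 8))
samples = np.stack([forward(x, drop_p=0.1, rng=rng) for _ in range(64)])
mean, std = samples.mean(axis=0), samples.std(axis=0)
# std is the module-aligned epistemic band; a calibration step would
# rescale it into the final uncertainty interval.
print(mean.shape, std.shape)
```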
Results
Experiments on PDE benchmarks, including discontinuous-coefficient Darcy flow and geometry-shifted 3D car CFD surrogates, show that the proposed method achieves more reliable uncertainty coverage, tighter uncertainty bands, and improved alignment of residuals with uncertainty compared to common baseline methods.
Implications
The proposed structure-aware UQ scheme can enhance the deployment of neural operators in scientific computing applications, such as aerospace CFD design and safety-critical monitoring, by providing well-calibrated uncertainty estimates that inform risk management.
Higher-Order Modular Attention: Fusing Pairwise and Triadic Interactions for Protein Sequences
Theory
Efficient ML
- HOMA introduces triadic interactions to enhance the representation of protein sequences.
- The method is designed to be computationally efficient, suitable for long biological sequences.
- HOMA shows consistent performance improvements across multiple protein-related tasks.
- The framework allows for controlled comparisons among different attention mechanisms.
Summary
The paper introduces Higher-Order Modular Attention (HOMA), a novel attention mechanism designed to enhance the representation of protein sequences by incorporating triadic interactions alongside traditional pairwise attention. Traditional transformer architectures primarily focus on pairwise interactions, which can overlook the complex dependencies present in biological sequences, particularly in the context of protein folding and function. HOMA addresses this limitation by integrating an explicit triadic interaction pathway while maintaining computational efficiency through block-structured, windowed attention. The authors evaluate HOMA on three TAPE benchmarks—Secondary Structure, Fluorescence, and Stability—demonstrating that it consistently outperforms standard self-attention and other efficient variants like block-wise attention and Linformer. The results indicate that the inclusion of triadic terms significantly enhances the model's representational capacity for protein sequence prediction without incurring prohibitive computational costs.
Methodology
HOMA augments standard pairwise attention in transformers with a triadic interaction pathway. It employs block-structured computation with overlapping blocks to compute triadic interactions efficiently. The model uses learned projections for pairwise and triadic pathways and allows for tunable hyperparameters to balance accuracy and efficiency.
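A toy version of fusing pairwise and triadic pathways is sketched below. HOMA's actual parameterization is block-structured and more efficient; the triadic score and the 0.5 mixing weight here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, w = 12, 8, 4          # sequence length, model dim, triadic window size

X = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Standard pairwise pathway.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
pairwise = softmax(Q @ K.T / np.sqrt(d)) @ V

# Illustrative triadic pathway: for each position i, score pairs (j, k)
# inside a local window and aggregate an elementwise product of values,
# so the output depends jointly on triples (i, j, k).
triadic = np.zeros_like(pairwise)
for i in range(L):
    idx = range(max(0, i - w), i + 1)
    scores, vals = [], []
    for j in idx:
        for k in idx:
            scores.append(Q[i] @ (K[j] * K[k]) / d)   # order-3 interaction score
            vals.append(V[j] * V[k])
    a = softmax(np.array(scores))
    triadic[i] = a @ np.stack(vals)

out = pairwise + 0.5 * triadic   # fused pathways
print(out.shape)
```

Restricting the triple sums to local windows is what keeps the triadic term from blowing up to cubic cost over the full sequence.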
Results
HOMA was evaluated on three TAPE benchmarks, showing significant improvements in accuracy for the Secondary Structure task and Spearman’s correlation for the Fluorescence and Stability tasks. The method also demonstrated enhanced token processing speed and reduced peak memory usage compared to traditional self-attention mechanisms.
Implications
The introduction of HOMA could lead to more accurate models for predicting protein functions and interactions, which is crucial for advancements in molecular biology and bioinformatics. This approach may also inspire new architectures in other domains requiring the modeling of complex interactions.
Learning Tree-Based Models with Gradient Descent
Optimization
Interpretability
Reinforcement Learning
- Introduces a gradient descent approach for learning decision trees, overcoming limitations of traditional methods.
- Utilizes backpropagation on a dense decision tree representation for joint optimization of tree parameters.
- Extends the method to tree ensembles with instance-wise weighting for improved performance and interpretability.
- Achieves state-of-the-art results in multiple domains, including multimodal and reinforcement learning.
Summary
This doctoral thesis presents a novel approach to learning tree-based models, specifically hard, axis-aligned decision trees, using gradient descent. Traditional methods like CART have limitations due to their reliance on greedy search procedures, which often lead to suboptimal tree structures and hinder integration with modern machine learning techniques that utilize gradient descent. The proposed method leverages backpropagation with a straight-through operator on a dense decision tree representation, allowing for the joint optimization of all tree parameters. This approach not only overcomes the constraints of sequentially selecting locally optimal splits but also facilitates seamless integration into multimodal and reinforcement learning frameworks. The thesis further extends this method to tree ensembles with an instance-wise weighting scheme, balancing performance and interpretability. The results demonstrate state-of-the-art performance across various domains, including interpretable decision trees for small datasets, advanced models for complex tabular data, and applications in multimodal learning and interpretable reinforcement learning, thereby significantly enhancing the applicability of tree-based models in machine learning.
Methodology
The methodology involves formulating a dense decision tree representation and applying backpropagation with a straight-through operator to enable gradient-based optimization. This allows for the joint optimization of all parameters in the decision tree, rather than relying on sequential, locally optimal splits. The approach is further extended to tree ensembles with an instance-wise weighting scheme.
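The dense tree representation can be sketched as below. The shapes are illustrative (the thesis' exact parameterization may differ); only the hard forward pass is shown, with the straight-through training trick noted in comments.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, n_feat = 3, 5
n_internal, n_leaves = 2**depth - 1, 2**depth

# Dense representation: EVERY internal node holds feature-selection logits
# and a threshold; every leaf holds a prediction. All parameters exist up
# front and are optimized jointly by gradient descent.
feat_logits = rng.normal(size=(n_internal, n_feat))
thresholds  = rng.normal(size=n_internal)
leaf_values = rng.normal(size=n_leaves)

def predict(x):
    node = 0
    while node < n_internal:                  # heap indexing: children 2n+1, 2n+2
        f = feat_logits[node].argmax()        # hard, axis-aligned feature choice
        go_right = x[f] > thresholds[node]    # hard routing in the forward pass;
        node = 2 * node + 1 + int(go_right)   # a straight-through estimator would
    return leaf_values[node - n_internal]     # backprop through a soft relaxation

X = rng.normal(size=(10, n_feat))
preds = np.array([predict(x) for x in X])
print(preds.shape)
```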
Results
The proposed method achieves state-of-the-art performance across various domains, including small tabular datasets, complex tabular data, multimodal learning, and interpretable reinforcement learning, demonstrating significant improvements in both performance and applicability of tree-based models.
Implications
The findings suggest that integrating gradient descent with decision tree learning can enhance the interpretability and performance of models in high-stakes domains, making them more suitable for applications in multimodal and reinforcement learning contexts.
Deep Learning-Based Metamodeling of Nonlinear Stochastic Dynamic Systems under Parametric and Predictive Uncertainty
Time Series
- Introduces three metamodeling frameworks for nonlinear dynamic systems that account for both loading and parameter uncertainties.
- Employs deep learning techniques to enhance metamodeling capabilities, overcoming limitations of traditional methods.
- Demonstrates effective prediction uncertainty quantification through Monte Carlo dropout integrated with LSTM.
- Validates the proposed frameworks on two distinct case studies, showcasing their adaptability and performance.
Summary
This paper addresses the challenges of modeling high-dimensional, nonlinear dynamic structural systems under natural hazards, particularly focusing on uncertainties in external loads and structural parameters. The authors propose three novel metamodeling frameworks that integrate feature extraction modules (multi-layer perceptron, message-passing neural network, and autoencoder) with a long short-term memory (LSTM) network. These frameworks utilize Monte Carlo dropout and a negative log-likelihood loss function to quantify both epistemic and aleatoric uncertainties in predictions. The proposed architectures (MLP-LSTM, MPNN-LSTM, and AE-LSTM) are validated through two case studies: a multi-degree-of-freedom Bouc–Wen system and a 37-story fiber-discretized nonlinear steel moment-resisting frame, both subjected to stochastic seismic excitation and structural parameter uncertainty. The results demonstrate that the MLP-LSTM architecture performs best for the simpler Bouc–Wen system, while the MPNN-LSTM and AE-LSTM excel in the more complex steel-frame model. Furthermore, a strong correlation between predictive variance and actual error indicates the frameworks' potential for active-learning strategies and assessing model confidence in structural response predictions.
Methodology
The study develops three metamodeling frameworks (MLP-LSTM, MPNN-LSTM, AE-LSTM) that combine feature extraction modules with LSTM networks. These frameworks utilize Monte Carlo dropout for uncertainty quantification and are trained using a negative log-likelihood loss function. Wavelet-based approximations are incorporated to enhance training efficiency.
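The negative log-likelihood objective used by these frameworks is the standard heteroscedastic Gaussian NLL; a minimal version (constants dropped, toy numbers) is:

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of y under N(mu, exp(log_var)), up to a
    constant. Predicting a log-variance alongside the mean lets the
    network express aleatoric uncertainty."""
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var)).mean()

# Epistemic spread comes separately, from Monte Carlo dropout: run the
# trained network T times with dropout active and take the variance of
# the predicted means across passes.
y  = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.9, 3.2])
print(round(gaussian_nll(y, mu, log_var=np.zeros(3)), 4))   # → 0.01
```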
Results
The MLP-LSTM framework achieved the lowest prediction errors for the Bouc–Wen system, while the MPNN-LSTM and AE-LSTM frameworks outperformed it in the more complex steel-frame model. The correlation between predictive variance and actual error was consistently strong, supporting the frameworks' use for uncertainty quantification.
Implications
The proposed metamodeling frameworks can significantly improve the reliability and efficiency of dynamic response predictions in structural engineering, particularly in the context of uncertainty quantification. This can lead to better-informed decision-making in design and risk assessment under natural hazards.
Language Generation with Replay: A Learning-Theoretic View of Model Collapse
NLP
Large Language Models
Theory
- Introduces a learning-theoretic framework for analyzing model collapse in LLMs.
- Demonstrates that replay can limit generatability in non-uniform and limit generation contexts.
- Findings support existing practical methods like data cleaning and watermarking but also identify their limitations.
- Establishes a clear distinction between the effects of replay on different generative tasks.
Summary
This paper addresses the phenomenon of model collapse in large language models (LLMs), which occurs when models are trained on their own generated outputs, leading to performance degradation. The authors introduce a learning-theoretic framework for understanding this issue, termed 'language generation with replay.' This framework incorporates a replay adversary that injects the generator's past outputs into the training data stream. The study characterizes the impact of replay on various notions of generatability, revealing that while replay does not affect uniform generation, it creates separations in non-uniform generation and generation in the limit. The findings align with practical heuristics such as data cleaning and output filtering, while also highlighting scenarios where these methods may fail. This theoretical exploration provides a deeper understanding of the challenges posed by the increasing prevalence of machine-generated content in training datasets.
Methodology
The authors develop a replay variant of the language generation game, where a generator is tasked with producing outputs from a target language while facing an adversary that can include previous outputs in the training stream. This approach allows for a theoretical exploration of the effects of replay on generatability across different definitions.
Results
The study finds that replay does not impact uniform generation, but it does create separations for non-uniform generation and generation in the limit. The results indicate that while certain practical strategies can mitigate the effects of model collapse, they may not be universally effective.
Implications
The findings have significant implications for the training of LLMs, particularly in understanding how to manage the risks associated with training on machine-generated content. This work could inform strategies for data curation and model training to prevent performance degradation.
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
Reinforcement Learning
Large Language Models
Efficient ML
- REOPOLD stabilizes on-policy distillation by relaxing strict imitation constraints.
- The framework integrates modern RL insights to enhance training efficiency.
- Empirical results show significant improvements in sample efficiency and test-time scaling.
- REOPOLD allows smaller models to perform comparably to larger models in reasoning tasks.
Summary
This paper introduces REOPOLD (Relaxed On-Policy Distillation), a novel framework designed to enhance the efficiency and stability of on-policy distillation for transferring reasoning capabilities from large teacher models to smaller student models. The authors analyze the limitations of traditional on-policy distillation, which often leads to instability and negative transfer, particularly in smaller language models (SLMs). By interpreting on-policy distillation through the lens of reinforcement learning (RL), the authors propose a method that relaxes strict imitation constraints, allowing for a more stable learning environment. REOPOLD employs techniques such as mixture-based reward clipping, entropy-based dynamic sampling, and a unified exploration-to-refinement strategy to improve sample efficiency and test-time scaling. Empirical results demonstrate that REOPOLD significantly outperforms existing methods, achieving up to 12 times greater sample efficiency and enabling smaller models to approach the performance of much larger teacher models in various reasoning tasks.
Methodology
The authors analyze on-policy distillation as a form of policy optimization, where the teacher-student log-likelihood ratio serves as a reward signal. REOPOLD is developed by implementing reward clipping, dynamic sampling based on entropy, and a multi-stage training approach to selectively apply teacher signals, thereby stabilizing the optimization process.
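The reward signal at the heart of this view can be sketched in a few lines. The clip range below is an illustrative assumption, and the log-probabilities are toy numbers rather than real model outputs.

```python
import numpy as np

def distillation_reward(logp_teacher, logp_student, clip=2.0):
    """Token-level reward from the teacher-student log-likelihood ratio,
    clipped so that rare extreme disagreements cannot dominate the
    policy update."""
    r = logp_teacher - logp_student
    return np.clip(r, -clip, clip)

# Toy per-token log-probs for one sampled student rollout.
logp_t = np.array([-0.5, -1.0, -6.0, -0.2])
logp_s = np.array([-0.6, -3.0, -0.5, -0.2])
print(distillation_reward(logp_t, logp_s))   # → [ 0.1  2.  -2.   0. ]
```

Positive rewards push the student toward tokens the teacher prefers, while the clipping (together with entropy-based sampling) is what relaxes strict imitation and stabilizes training.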
Results
REOPOLD achieves state-of-the-art performance on the AIME-25 benchmark, demonstrating a 6.7 to 12 times increase in sample efficiency compared to traditional methods. Additionally, it enables a 7B student model to match the performance of a 32B teacher model with a 3.32 times speedup in inference.
Implications
The findings suggest that REOPOLD could be applied to improve the performance of smaller models in various reasoning tasks, making advanced reasoning capabilities more accessible in resource-constrained environments. This could have significant implications for deploying AI in practical applications where computational resources are limited.
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
NLP
Large Language Models
Theory
- Adversarial prompt-injection can amplify attack success rates from polynomial to exponential growth.
- A theoretical model based on spin-glass theory provides insights into the behavior of LLMs under adversarial conditions.
- Short prompts lead to polynomial scaling of attack success rates, while long prompts result in exponential scaling.
- Empirical validation shows varying attack susceptibility across different LLMs, correlating with their reasoning abilities.
Summary
This paper investigates the scaling laws of adversarial prompt-injection attacks on large language models (LLMs), specifically focusing on how these attacks can transition from polynomial to exponential growth in success rates based on the number of inference-time samples. The authors propose a theoretical generative model inspired by spin-glass theory to explain this phenomenon, where short prompts correspond to weak magnetic fields leading to polynomial scaling, while longer prompts correspond to strong magnetic fields resulting in exponential scaling. The model, termed SpinLLM, captures the dynamics of adversarial prompt injection and its effects on language model behavior. Empirical validation is conducted on various LLMs, revealing differences in attack susceptibility and reasoning capabilities across models. The findings highlight the potential for adversarial prompts to significantly enhance the likelihood of unsafe outputs from LLMs, raising concerns about their safety and alignment with intended use.
Methodology
The authors developed a generative model based on spin-glass theory, termed SpinLLM, to analyze the effects of adversarial prompt injection on language models. They derived analytical expressions for attack success rates in both weak and strong magnetic field regimes and validated these findings empirically on various LLMs.
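The role of inference-time sampling can be illustrated with the generic best-of-n relation. Note this is NOT the paper's spin-glass derivation; it only shows how a higher per-sample success probability (e.g. induced by prompt injection) reshapes the ASR curve.

```python
import numpy as np

def best_of_n_asr(p, n):
    """Probability that at least one of n independent samples succeeds,
    given per-sample success probability p."""
    return 1 - (1 - p) ** n

n = np.array([1, 10, 100, 1000])
print(best_of_n_asr(1e-3, n))   # weak regime: slow, near-linear growth
print(best_of_n_asr(5e-2, n))   # strong regime: rapid saturation toward 1
```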
Results
The study found that the attack success rate (ASR) scales polynomially with the number of inference-time samples in the absence of prompt injection, while adversarial prompt injection can lead to an exponential increase in ASR for weaker models. The empirical results confirmed the theoretical predictions, demonstrating a clear transition in scaling behavior based on the length of the injected prompts.
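The crossover can be illustrated with a toy model far simpler than SpinLLM (the per-sample success probability `p` and the independence assumption below are ours, not the authors'): if each of `n` inference-time samples jailbreaks the model independently with probability `p`, then ASR(n) = 1 − (1 − p)^n, which grows roughly linearly while p·n ≪ 1 and approaches 1 exponentially fast once `p` is large.

```python
def best_of_n_asr(p: float, n: int) -> float:
    """Attack success rate after n independent inference-time samples,
    each succeeding with probability p (illustrative i.i.d. model,
    not the paper's spin-glass dynamics)."""
    return 1.0 - (1.0 - p) ** n

# Weak injection (tiny p): ASR grows roughly linearly in n.
weak = [best_of_n_asr(1e-4, n) for n in (10, 100, 1000)]

# Strong injection (large p): the failure probability (1 - p)^n
# decays exponentially in n, so ASR saturates almost immediately.
strong = [best_of_n_asr(0.3, n) for n in (1, 5, 20)]
```

In this caricature, a longer or stronger injected prompt corresponds to a larger `p`, which is what moves the curve from the near-linear regime into the exponentially saturating one.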
Implications
The findings of this study have significant implications for the safety and robustness of large language models, particularly in applications where adversarial attacks could lead to harmful outputs. Understanding the scaling laws of adversarial attacks can inform the design of more resilient AI systems and improve safety mechanisms.
AutoScout: Structured Optimization for Automating ML System Configuration
Optimization
Efficient ML
- AutoScout formulates ML system configuration as a mixed-discrete/continuous optimization problem.
- It employs a hybrid optimization framework that integrates sparse and dense parameter optimization.
- AutoScout achieves 2.7–3.0× training speedup compared to expert-tuned settings.
- The system is 13.7–16.5× faster than existing system configurators.
Read more
AutoScout: Structured Optimization for Automating ML System Configuration
Summary
The paper introduces AutoScout, a novel system configurator designed to optimize machine learning (ML) system configurations across various models and hardware platforms. As ML systems grow increasingly complex, the configuration space expands significantly, making it challenging to identify high-performance settings. AutoScout addresses this by framing the configuration optimization as a mixed-discrete/continuous problem with hierarchical dependencies. It employs a hybrid optimization framework that combines tree-based search for sparse structural decisions with gradient-guided optimization for dense execution parameters. Additionally, AutoScout incorporates adaptive exploration strategies to prioritize impactful features and utilizes ensemble simulations to reduce profiling costs. The results demonstrate that AutoScout consistently outperforms existing methods, achieving significant speedups in training while being considerably faster than traditional configurators.
Methodology
AutoScout combines tree-based search for sparse configuration choices with coordinate-wise stochastic gradient descent for dense parameters. A bandit mechanism coordinates exploration between the sparse and dense optimizers, and a tournament-based design prioritizes high-impact features.
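The alternating sparse/dense structure can be sketched as follows; the coin-flip arm choice, greedy acceptance rule, and all names below are illustrative stand-ins for AutoScout's actual tree search, coordinate-wise SGD, and bandit coordination, which the summary does not fully specify.

```python
import random

def hybrid_configure(objective, discrete_space, cont_init, steps=200, seed=0):
    """Minimize objective(discrete, cont) by alternating a sparse optimizer
    (re-sample one discrete choice) and a dense optimizer (perturb one
    continuous coordinate), with a coin flip standing in for the bandit."""
    rng = random.Random(seed)
    discrete = {k: rng.choice(v) for k, v in discrete_space.items()}
    cont = dict(cont_init)
    best = objective(discrete, cont)
    for _ in range(steps):
        if rng.random() < 0.5:  # sparse arm: mutate one structural choice
            cand = dict(discrete)
            key = rng.choice(list(discrete_space))
            cand[key] = rng.choice(discrete_space[key])
            score = objective(cand, cont)
            if score < best:
                discrete, best = cand, score
        else:  # dense arm: Gaussian perturbation of one execution parameter
            cand = dict(cont)
            key = rng.choice(list(cont))
            cand[key] += rng.gauss(0.0, 0.1)
            score = objective(discrete, cand)
            if score < best:
                cont, best = cand, score
    return discrete, cont, best

# Toy objective: the discrete choice adds a constant penalty, and the
# continuous "lr" has an optimum at 0.7 (numbers made up for the sketch).
def toy_cost(disc, cont):
    return (cont["lr"] - 0.7) ** 2 + (0.3 if disc["kernel"] == "a" else 0.0)

disc, cont, best = hybrid_configure(
    toy_cost, {"kernel": ["a", "b"]}, {"lr": 0.0}, steps=300)
```

The point of separating the two arms is that discrete choices change the search landscape discontinuously and cannot be handled by gradient steps, while dense parameters respond well to local perturbation.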
Results
AutoScout consistently identifies high-performance configurations across diverse models and deployment objectives, achieving training speedups of 2.7–3.0× over expert-tuned settings while running 13.7–16.5× faster than existing configurators.
Implications
The development of AutoScout has the potential to streamline the configuration process for ML systems, making it easier for practitioners to achieve optimal performance without extensive manual tuning. This could lead to more efficient training and inference processes in various ML applications.
Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem
Theory
- Introduction of Topological DeepONets for approximating nonlinear operators in locally convex spaces.
- Generalization of the Chen-Chen operator approximation theorem to encompass broader function spaces.
- Construction of a neural network architecture that utilizes continuous linear functionals for input processing.
- Demonstration of uniform approximation capabilities for continuous operators on compact sets.
Read more
Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem
Summary
This paper introduces Topological DeepONets, an extension of the Deep Operator Networks (DeepONets) framework, which is designed to approximate nonlinear operators between function spaces. Traditionally, DeepONets operate within the realm of continuous functions defined on compact sets, but this work expands the applicability to arbitrary Hausdorff locally convex spaces. The author constructs a topological feedforward neural network that utilizes continuous linear functionals from the dual space of the input space, allowing the branch component of the network to process inputs through these linear measurements while the trunk component handles the output in a Euclidean domain. The main contribution is a theorem demonstrating that continuous operators can be uniformly approximated by these topological DeepONets, thereby generalizing the classical Chen-Chen operator approximation theorem to a broader context beyond Banach spaces. This advancement opens new avenues for applying DeepONets in various scientific and engineering fields where inputs may not conform to traditional Euclidean structures.
Methodology
The paper develops a topological framework for DeepONets, where the input is treated as an element of a locally convex space. The architecture employs continuous linear functionals from the dual space to construct a feedforward neural network that approximates operators acting between function spaces. Theoretical results are derived to establish the uniform approximation of continuous operators using this new architecture.
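The branch-trunk evaluation G(u)(y) ≈ ⟨branch(ℓ₁(u), …, ℓₘ(u)), trunk(y)⟩ can be sketched as below. The sine test functions, network sizes, and random untrained weights are illustrative assumptions; only the structure matches the paper, namely a branch that sees the input function solely through continuous linear functionals while the trunk sees the Euclidean output location.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_functionals(u, xs):
    """Apply m = 4 continuous linear functionals to the input function u:
    Riemann-sum integrals of u against fixed sine test functions (the
    particular choice of functionals is an illustrative assumption)."""
    dx = xs[1] - xs[0]
    tests = [np.sin((k + 1) * np.pi * xs) for k in range(4)]
    return np.array([np.sum(u(xs) * t) * dx for t in tests])

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer tanh network."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Random, untrained branch (4 -> p) and trunk (1 -> p) networks, p = 8.
p = 8
Wb1, bb1 = rng.normal(size=(16, 4)), rng.normal(size=16)
Wb2, bb2 = rng.normal(size=(p, 16)), rng.normal(size=p)
Wt1, bt1 = rng.normal(size=(16, 1)), rng.normal(size=16)
Wt2, bt2 = rng.normal(size=(p, 16)), rng.normal(size=p)

def deeponet(u, y, xs):
    """G(u)(y) ~ <branch(L(u)), trunk(y)>: the branch processes the input
    only through linear measurements; the trunk processes the output
    location y in a Euclidean domain."""
    b = mlp(linear_functionals(u, xs), Wb1, bb1, Wb2, bb2)
    t = mlp(np.array([y]), Wt1, bt1, Wt2, bt2)
    return float(b @ t)

xs = np.linspace(0.0, 1.0, 101)
val = deeponet(np.sin, 0.5, xs)
```

Training would fit the branch and trunk weights jointly on operator data; the theorem concerns what this architecture can uniformly approximate, not how it is trained.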
Results
The main result is a universal approximation theorem that shows continuous operators defined on compact sets can be approximated by topological DeepONets. This result extends previous findings from the classical operator approximation theory, demonstrating that the branch-trunk architecture can effectively approximate operators beyond the confines of Banach spaces.
Implications
The findings suggest that Topological DeepONets can be applied to a wider range of problems in scientific computing, particularly in areas involving complex function spaces such as partial differential equations and other nonlinear operator learning scenarios. This could lead to more robust models in engineering applications, including fluid dynamics and multiphysics simulations.
Exploiting Expertise of Non-Expert and Diverse Agents in Social Bandit Learning: A Free Energy Approach
Reinforcement Learning
Theory
Optimization
- Introduces a free energy-based social bandit learning algorithm that integrates individual and social learning.
- Proves theoretical convergence to the optimal policy without requiring shared rewards or social norms.
- Demonstrates improved learning performance in the presence of non-expert agents.
- Maintains logarithmic regret, indicating efficient exploration and exploitation.
Read more
Exploiting Expertise of Non-Expert and Diverse Agents in Social Bandit Learning: A Free Energy Approach
Summary
This paper addresses the limitations of traditional reinforcement learning (RL) algorithms in social contexts, particularly in social bandit learning scenarios where agents can observe each other's actions without knowledge of their rewards. The authors propose a novel free energy-based algorithm that allows a social agent to evaluate the expertise of other agents while integrating its own experiences. The algorithm operates in a policy space and does not require any oracle or social norms, making it applicable in real-world settings where agents may not share private information. The theoretical convergence of the proposed method to the optimal policy is established, and empirical evaluations demonstrate its superiority over existing approaches. The algorithm effectively identifies relevant agents, even among random or suboptimal ones, and enhances learning performance, particularly in environments with non-expert agents. The method maintains logarithmic regret, indicating efficient learning in complex social environments.
Methodology
The proposed method utilizes a free energy minimization framework to evaluate the suitability of observed behaviors from other agents while integrating the social agent's own experiences. It models decision-making as a balance between expected utility and information-processing costs, incorporating both self-referenced evaluations and global measures of expertise.
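The generic KL-regularized decision rule behind such free-energy formulations has a simple closed form worth sketching; the paper's specific expertise measures are not reproduced here, and the prior built from an observed agent's action frequencies is our illustrative assumption. Minimizing F(π) = KL(π‖π₀)/β − E_π[Q] over policies yields π(a) ∝ π₀(a)·exp(β·Q(a)).

```python
import math

def free_energy_policy(q, prior, beta):
    """Closed-form minimizer of the free energy
    F(pi) = KL(pi || prior) / beta - E_pi[q]:
    pi(a) is proportional to prior(a) * exp(beta * q(a))."""
    w = [p * math.exp(beta * qa) for p, qa in zip(prior, q)]
    z = sum(w)
    return [x / z for x in w]

# Own value estimates for 3 arms, plus a prior built from an observed
# agent's empirical action frequencies (numbers made up for the sketch).
q = [0.2, 1.0, 0.4]
observed_freqs = [0.1, 0.6, 0.3]
pi = free_energy_policy(q, observed_freqs, beta=2.0)
```

Here β plays the role of the utility-versus-information-cost trade-off in the methodology above: β → 0 recovers pure imitation of the observed agent, while large β recovers greedy maximization of the agent's own value estimates.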
Results
The empirical results indicate that the proposed algorithm outperforms existing social learning methods in various scenarios, effectively leveraging the expertise of both expert and non-expert agents. The algorithm successfully maintains logarithmic regret, showcasing its efficiency in learning and decision-making.
Implications
The findings suggest that integrating social learning methods into reinforcement learning can significantly enhance performance in personalized AI applications, such as educational systems and recommendation engines, where agents operate in diverse and dynamic environments.