AI-generated summaries
Today's ML research,
without the noise.
Summaries of the latest machine learning papers from arXiv, refreshed every 8 hours.
55
Papers today
8h
Update frequency
7
Days of history
Resource-Efficient Iterative LLM-Based NAS with Feedback Memory
Computer Vision
Large Language Models
Efficient ML
- Introduces a closed-loop iterative NAS pipeline utilizing LLMs for architecture generation and refinement.
- Employs a historical feedback memory mechanism to learn from past attempts, enhancing iterative learning.
- Achieves significant improvements in model accuracy on CIFAR datasets with minimal computational resources.
- Demonstrates the feasibility of conducting NAS on a single consumer-grade GPU without cloud infrastructure.
Summary
This paper presents a novel approach to Neural Architecture Search (NAS) that leverages large language models (LLMs) in a resource-efficient manner. The proposed method utilizes a closed-loop pipeline that iteratively generates, evaluates, and refines convolutional neural network architectures for image classification, specifically designed to operate on a single consumer-grade GPU without the need for LLM fine-tuning. A key innovation is the introduction of a historical feedback memory mechanism, inspired by Markov chains, which maintains a sliding window of recent improvement attempts. This allows the system to learn from both successes and failures, treating execution errors as valuable learning signals. The pipeline consists of three main components: a Code Generator that produces executable PyTorch architectures, an Evaluator that assesses these architectures using a one-epoch proxy accuracy metric, and a Prompt Improver that analyzes results to generate targeted improvement suggestions. The method was evaluated using three frozen instruction-tuned LLMs across multiple iterations, demonstrating significant improvements in architecture quality while maintaining low computational costs. The results indicate that the proposed NAS approach is not only effective but also accessible for resource-constrained environments.
Methodology
The methodology involves a closed-loop pipeline with three components: a Code Generator that creates PyTorch model implementations, an Evaluator that trains these models for one epoch to assess their performance, and a Prompt Improver that utilizes historical feedback memory to refine future architecture suggestions based on past results.
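The closed loop above can be sketched in a few lines. Everything here is illustrative: the function names and the toy scoring rule are stand-ins, since in the paper the Code Generator queries a frozen LLM for executable PyTorch code and the Evaluator runs a one-epoch training pass on CIFAR. The sliding-window feedback memory, however, maps directly onto a bounded deque.

```python
from collections import deque

# Hypothetical stand-ins for the paper's three components.
def generate_architecture(prompt):
    # In the paper: an LLM emitting an executable PyTorch architecture.
    return {"depth": len(prompt) % 4 + 2}

def evaluate(arch):
    # In the paper: one-epoch proxy accuracy; here, a toy score.
    return 0.1 * arch["depth"]

def improve_prompt(prompt, memory):
    # The Prompt Improver conditions on recent attempts, successes and
    # failures alike (execution errors would also land in `memory`).
    best = max(memory, key=lambda r: r["score"])
    return prompt + f" best_depth={best['arch']['depth']}"

memory = deque(maxlen=5)      # sliding window of recent improvement attempts
prompt = "design a CNN"
for step in range(3):
    arch = generate_architecture(prompt)
    score = evaluate(arch)
    memory.append({"arch": arch, "score": score})
    prompt = improve_prompt(prompt, memory)
```

The `maxlen` bound is what gives the memory its Markov-chain flavor: only a fixed window of recent attempts influences the next generation.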
Results
The evaluation showed substantial improvements in architecture performance, with DeepSeek-Coder-6.7B achieving an accuracy increase from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0% on the CIFAR-10 dataset. The entire search process was completed in approximately 18 GPU hours on a single RTX 4090.
Implications
The findings suggest that NAS can be made more accessible and efficient, particularly for users with limited computational resources. This approach could facilitate the deployment of optimized neural networks in edge computing scenarios, where hardware constraints are a significant consideration.
Entropy-Preserving Reinforcement Learning
Reinforcement Learning
NLP
Large Language Models
- Entropy reduction in policy gradient algorithms can limit exploration and lead to suboptimal policies.
- Active monitoring and control of entropy during training can enhance policy performance.
- The paper introduces REPO and ADAPO as mechanisms for effective entropy regulation.
- Maintaining diversity in explored trajectories is crucial for robust learning in RL.
Summary
This paper addresses a critical issue in policy gradient reinforcement learning (RL) where the entropy of explored trajectories tends to decrease during training, leading to reduced diversity and premature convergence to suboptimal policies. The authors argue for the active monitoring and control of entropy throughout the training process. They analyze how various policy gradient objectives influence entropy dynamics and identify empirical factors that affect entropy behavior, such as numerical precision. The paper introduces two main mechanisms for entropy control: REPO, which modifies the advantage function to regulate entropy, and ADAPO, an adaptive asymmetric clipping approach. These methods aim to maintain diversity in the training process, resulting in more effective policies that are capable of sequential learning in new environments. The findings suggest that maintaining a steady entropy trajectory during training correlates positively with performance, highlighting the importance of entropy dynamics in RL optimization.
Methodology
The authors conducted a formal analysis of leading policy gradient objectives to understand their impact on entropy dynamics. They proposed two new algorithms, REPO and ADAPO, designed to regulate entropy through modifications to the advantage function and adaptive clipping, respectively. The study also involved empirical evaluations to assess the effects of numerical precision and other implementation factors on entropy behavior.
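The summary does not give ADAPO's exact formulation, but the general idea of asymmetric clipping can be sketched against the familiar PPO objective. The bounds below are assumed fixed for illustration (the paper's scheme is adaptive): widening the upper clip lets low-probability tokens gain mass, which slows entropy collapse.

```python
def asymmetric_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.3):
    # Standard PPO clips the importance ratio to [1-eps, 1+eps] symmetrically.
    # An asymmetric scheme (in the spirit of ADAPO; exact form assumed)
    # uses a wider upper bound, easing upward probability moves.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # PPO's pessimistic min over the unclipped and clipped surrogate.
    return min(ratio * advantage, clipped * advantage)
```

For example, a ratio of 1.5 with positive advantage is clipped at 1.3 rather than PPO's usual 1.2, while the pessimistic `min` still penalizes the raw ratio when the advantage is negative.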
Results
The proposed entropy-preserving methods, REPO and ADAPO, demonstrated superior performance on benchmark tasks, achieving state-of-the-art results on the AppWorld dataset. The experiments showed that models trained with these methods maintained higher entropy levels throughout training, leading to better performance metrics compared to traditional algorithms that experienced entropy collapse.
Implications
The findings of this paper suggest that actively managing entropy in reinforcement learning can lead to more robust and adaptable models, particularly in dynamic environments. This has potential applications in various fields, including natural language processing and robotics, where maintaining diversity in decision-making is critical.
Task-Conditioned Routing Signatures in Sparse Mixture-of-Experts Transformers
Large Language Models
Interpretability
Efficient ML
- Introduction of routing signatures as a representation of expert activation patterns.
- Demonstration of strong task-conditioned clustering of routing signatures in MoE transformers.
- Validation of routing patterns against permutation and load-balancing baselines.
- High accuracy in task classification using routing signatures.
Summary
This paper investigates the routing mechanisms in Sparse Mixture-of-Experts (MoE) transformers, which are crucial for the efficient scaling of large language models through conditional computation. The authors introduce 'routing signatures', a vector representation that summarizes expert activation patterns across layers for specific prompts. Using the OLMoE-1B-7B-0125-Instruct model, the study demonstrates that prompts from the same task category yield highly similar routing signatures, while those from different categories show significantly lower similarity. The authors quantify this effect, finding that within-category routing similarity (0.8435 ± 0.0879) is notably higher than across-category similarity (0.6225 ± 0.1687), with a Cohen’s d of 1.44. A logistic regression classifier trained on routing signatures achieves a cross-validated accuracy of 92.5% ± 6.1% for four-way task classification. The paper also introduces statistical baselines to validate the findings, showing that the observed routing patterns cannot be solely attributed to sparsity or balancing constraints. Additionally, the analysis reveals that task structure becomes more pronounced in deeper layers of the model. The authors conclude that routing in sparse transformers is a task-sensitive component of computation rather than merely a balancing mechanism. They also release MOE-XRAY, a toolkit for routing telemetry and analysis.
Methodology
The authors conducted experiments using the OLMoE-1B-7B-0125-Instruct model, analyzing 80 prompts across four task categories: code, math, story, and factual question-answering. They introduced routing signatures to represent expert activation patterns and employed logistic regression for task classification. Statistical baselines were established to ensure the validity of the results.
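One plausible reading of the signature construction can be sketched directly: concatenate per-layer expert activation frequencies into a single vector, then compare prompts by cosine similarity (the exact definition in the paper is assumed here, not quoted).

```python
import math

def routing_signature(expert_ids, n_experts):
    # expert_ids[l] = expert indices chosen across all tokens at layer l.
    # Signature = concatenated per-layer activation frequencies.
    sig = []
    for layer in expert_ids:
        counts = [0] * n_experts
        for e in layer:
            counts[e] += 1
        total = max(1, len(layer))
        sig.extend(c / total for c in counts)
    return sig

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))
```

Two prompts that route identically score 1.0; prompts that favor disjoint experts score near 0, which is the separation the task-classification result exploits.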
Results
The study found that routing signatures from prompts of the same task category exhibited significantly higher similarity compared to those from different categories. The logistic regression classifier achieved an accuracy of 92.5% ± 6.1% in classifying tasks based solely on routing signatures, indicating that routing behavior is task-sensitive.
Implications
The findings suggest that routing mechanisms in MoE transformers can be leveraged for improved interpretability and debugging of models, as well as for enhancing the efficiency of task-specific computations in large language models. The release of the MOE-XRAY toolkit provides a resource for further exploration of routing behaviors.
abx_amr_simulator: A simulation environment for antibiotic prescribing policy optimization under antimicrobial resistance
Reinforcement Learning
Optimization
Theory
- Introduces a simulation environment for antibiotic prescribing policy optimization under AMR.
- Allows customization of patient populations and antibiotic resistance dynamics.
- Compatible with Gymnasium RL API for training reinforcement learning agents.
- Models antibiotic prescribing as a Markov Decision Process (MDP) with partial observability.
Summary
The paper presents the abx_amr_simulator, a Python-based simulation package aimed at modeling antibiotic prescribing and antimicrobial resistance (AMR) dynamics. This simulator addresses the global health threat posed by AMR, which complicates clinical decision-making and reduces the effectiveness of antibiotics. The abx_amr_simulator allows users to customize patient populations, antibiotic-specific AMR response curves, and reward functions that balance immediate clinical benefits with long-term resistance management. Key features include a modular design for patient attributes, a leaky-balloon model for resistance dynamics, and tools for exploring partial observability through noise, bias, and delay in observations. The simulator is compatible with the Gymnasium reinforcement learning API, enabling the training and testing of reinforcement learning agents in various clinical scenarios. The abx_amr_simulator serves as a benchmark environment for sequential decision-making under uncertainty, providing a valuable tool for studying AMR dynamics and optimizing antibiotic stewardship strategies.
Methodology
The abx_amr_simulator models antibiotic prescribing and AMR dynamics as a Markov Decision Process (MDP). It features a modular architecture with components such as PatientGenerator for synthetic population modeling, RewardCalculator for evaluating prescribing decisions, and AMR_LeakyBalloon for tracking resistance levels. The simulator allows for the exploration of various clinical scenarios with customizable parameters and is designed to support reinforcement learning methods.
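A toy version of the MDP can be sketched against the Gymnasium-style `reset`/`step` interface the package targets. Everything below is hypothetical (class name, constants, dynamics); the real simulator's PatientGenerator, RewardCalculator, and AMR_LeakyBalloon components are far richer. The library itself is not imported so the sketch stays self-contained.

```python
import random

class ToyAbxEnv:
    # Hypothetical miniature of the simulator's MDP: prescribing a drug
    # treats the current patient but inflates population resistance to it.
    def __init__(self, n_antibiotics=2, horizon=10, seed=0):
        self.n_antibiotics = n_antibiotics
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.resistance = [0.1] * self.n_antibiotics
        return self._obs(), {}

    def _obs(self):
        # Partial observability: the agent sees a noisy resistance reading.
        return [r + self.rng.gauss(0, 0.01) for r in self.resistance]

    def step(self, action):
        # Crude "leaky balloon": resistance rises with use, leaks back
        # down while a drug rests.
        cure_prob = 1.0 - self.resistance[action]
        reward = cure_prob                       # immediate clinical benefit
        self.resistance[action] = min(1.0, self.resistance[action] + 0.05)
        for i in range(self.n_antibiotics):
            if i != action:
                self.resistance[i] = max(0.0, self.resistance[i] - 0.01)
        self.t += 1
        terminated = self.t >= self.horizon
        return self._obs(), reward, terminated, False, {}
```

The five-tuple return (`obs, reward, terminated, truncated, info`) matches the Gymnasium API, so a standard RL training loop can drive an environment shaped like this unchanged.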
Results
The simulator provides a flexible and extensible framework for studying AMR dynamics and evaluating antibiotic prescribing policies. It enables researchers to simulate different patient populations and antibiotic resistance scenarios, facilitating the assessment of long-term effects of prescribing interventions under realistic uncertainties.
Implications
The abx_amr_simulator has significant implications for public health by offering a tool to optimize antibiotic prescribing strategies and improve antibiotic stewardship programs. It can help researchers and policymakers understand the dynamics of AMR and develop effective interventions to mitigate its impact on global health.
Graph Tokenization for Bridging Graphs and Transformers
Graph Learning
- Introduction of a graph tokenization framework that combines reversible graph serialization with BPE.
- Structure-guided serialization process that addresses ordering ambiguities in graphs.
- Enables standard Transformer models to achieve state-of-the-art results on 14 graph benchmarks.
- Outperforms traditional GNNs and specialized Graph Transformers in various tasks.
Summary
This paper presents a novel graph tokenization framework aimed at integrating graph-structured data with Transformer models, particularly large pretrained Transformers. The authors propose a method that combines reversible graph serialization with Byte Pair Encoding (BPE) to create sequential representations of graphs while preserving essential structural information. The serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring patterns are represented effectively in the tokenization process. The framework allows standard Transformer architectures, such as BERT, to be applied directly to graph benchmarks without requiring modifications. The empirical results demonstrate that this approach achieves state-of-the-art performance across 14 benchmark datasets for graph classification and regression tasks, often outperforming both traditional Graph Neural Networks (GNNs) and specialized Graph Transformers. This work effectively bridges the gap between graph-structured data and the Transformer ecosystem, providing a new interface for processing graphs in a manner compatible with existing sequence models.
Methodology
The methodology involves a graph tokenization framework that integrates reversible graph serialization techniques, such as extended Euler circuits and minimal-weight graph traversals, with Byte Pair Encoding (BPE). The serialization process is guided by global statistics of graph substructures to ensure that frequently occurring patterns are adjacent in the sequence, allowing BPE to effectively merge them into meaningful tokens.
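The two halves of the pipeline, serialization then BPE, can be illustrated in miniature. The serialization below is a naive edge-list walk, not the paper's reversible Euler-circuit traversal, but the BPE step is the standard merge: find the most frequent adjacent token pair and fuse it, which is exactly why the paper's structure-guided ordering matters (frequent substructures must end up adjacent to be merged).

```python
from collections import Counter

def serialize_graph(edges):
    # Toy serialization: emit node tokens along the edge list. The paper
    # instead uses reversible traversals ordered by global substructure
    # statistics; this linear walk is only illustrative.
    tokens = []
    for u, v in edges:
        tokens += [f"n{u}", f"n{v}"]
    return tokens

def bpe_merge_once(tokens):
    # One Byte Pair Encoding step: merge the most frequent adjacent pair.
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            out.append(a + "+" + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Iterating the merge step builds a vocabulary of multi-node tokens, after which the sequence can be fed to an unmodified Transformer such as BERT.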
Results
The proposed tokenizer achieves state-of-the-art performance on 14 benchmark datasets for graph classification and regression, demonstrating significant improvements over both established Graph Neural Networks and specialized Graph Transformers. The results indicate that the framework allows standard Transformer architectures to effectively process graph-structured data.
Implications
The implications of this work suggest that it is now feasible to apply Transformer models directly to graph-structured data without architectural changes, potentially expanding the applicability of Transformers in various domains that utilize graph data, such as social network analysis, molecular chemistry, and recommendation systems.
Effective Resistance Rewiring: A Simple Topological Correction for Over-Squashing
Graph Learning
- Introduces Effective Resistance Rewiring (ERR) to address over-squashing in GNNs.
- ERR uses effective resistance as a global measure to identify structural bottlenecks.
- Demonstrates a trade-off between over-squashing and oversmoothing in GNNs.
- Combining ERR with normalization techniques enhances model performance.
Summary
This paper addresses the challenge of over-squashing in Graph Neural Networks (GNNs), which hampers their ability to capture long-range dependencies due to structural bottlenecks. The authors propose Effective Resistance Rewiring (ERR), a novel topology correction strategy that utilizes effective resistance as a global signal to identify and mitigate these bottlenecks. ERR operates by iteratively adding edges between node pairs with the highest resistance while removing those with the least resistance, thereby enhancing weak communication pathways while adhering to a fixed edge budget. The method is parameter-free aside from the rewiring budget and relies on a single global measure that aggregates all paths between node pairs. The authors evaluate the predictive performance of ERR on Graph Convolutional Networks (GCNs) and analyze its impact on message propagation by examining cosine similarity between node embeddings across layers. Experiments conducted on both homophilic (Cora, CiteSeer) and heterophilic (Cornell, Texas) graphs, including directed settings with DirGCN, reveal a trade-off between over-squashing and oversmoothing, where excessive representation mixing can occur in deeper models. The study finds that resistance-guided rewiring enhances connectivity and signal propagation but may accelerate representation mixing. To stabilize this trade-off, combining ERR with normalization techniques like PairNorm is shown to improve performance, especially in heterophilic contexts.
Methodology
The methodology involves the ERR algorithm, which iteratively rewires the graph by adding edges between node pairs with the highest effective resistance and removing those with the lowest. The authors analyze the impact of this rewiring on message propagation and node embeddings through cosine similarity metrics across layers, comparing results with and without rewiring.
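One round of the rewiring loop can be sketched using the standard formula for effective resistance via the Laplacian pseudoinverse. This is a sketch of the loop as described in the summary, not the authors' implementation; in particular, recomputation schedules and tie-breaking are assumed.

```python
import numpy as np

def effective_resistance(adj):
    # R(u, v) = (e_u - e_v)^T L^+ (e_u - e_v), with L^+ the Moore-Penrose
    # pseudoinverse of the graph Laplacian. High R marks a bottlenecked pair.
    L = np.diag(adj.sum(1)) - adj
    Lp = np.linalg.pinv(L)
    n = len(adj)
    R = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            R[u, v] = Lp[u, u] + Lp[v, v] - 2 * Lp[u, v]
    return R

def err_rewire(adj, budget=1):
    # One ERR round: add the non-edge with the highest resistance,
    # remove the existing edge with the lowest.
    adj = adj.copy()
    R = effective_resistance(adj)
    n = len(adj)
    non_edges = [(u, v) for u in range(n) for v in range(u + 1, n)
                 if adj[u, v] == 0]
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if adj[u, v] == 1]
    for _ in range(budget):
        u, v = max(non_edges, key=lambda e: R[e])
        adj[u, v] = adj[v, u] = 1
        a, b = min(edges, key=lambda e: R[e])
        adj[a, b] = adj[b, a] = 0
    return adj
```

On a four-node path, the endpoints have resistance 3 (three unit resistors in series), so the first added edge shortcuts exactly that bottleneck while the edge budget keeps the graph size fixed.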
Results
The experiments show that ERR significantly improves connectivity and signal propagation in GNNs, particularly in heterophilic settings. However, it also reveals that deeper models may experience accelerated representation mixing, leading to oversmoothing. The combination of ERR with normalization techniques like PairNorm effectively stabilizes this trade-off and enhances overall performance.
Implications
The findings suggest that ERR can be a valuable tool for improving GNN performance, particularly in scenarios where long-range dependencies are critical. The approach may have applications in various domains that utilize GNNs, such as social network analysis, recommendation systems, and biological network modeling.
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
NLP
Large Language Models
Reinforcement Learning
- Introduces a feature-matching loss for fine-tuning language models targeting sequence-level statistics.
- Proposes Energy-Based Fine-Tuning (EBFT) as an efficient method to optimize the feature-matching objective.
- EBFT outperforms traditional supervised fine-tuning (SFT) and matches reinforcement learning with verifiable rewards (RLVR) in downstream tasks.
- Demonstrates that EBFT achieves lower validation cross-entropy while improving downstream accuracy.
Summary
This paper addresses the limitations of traditional cross-entropy (CE) training for language models, which optimizes next-token prediction but fails to ensure sequence-level behavior during model rollouts. The authors propose a novel feature-matching objective for fine-tuning language models that focuses on matching sequence-level statistics of the completion distribution. This approach, termed Energy-Based Fine-Tuning (EBFT), utilizes strided block-parallel sampling to efficiently generate multiple rollouts from nested prefixes, allowing for concurrent feature extraction. The resulting embeddings are then used for on-policy policy-gradient updates. The theoretical foundation connects EBFT to KL-regularized feature-matching and energy-based modeling. Empirical results demonstrate that EBFT matches the performance of reinforcement learning with verifiable rewards (RLVR) while outperforming supervised fine-tuning (SFT) in downstream tasks, achieving lower validation cross-entropy and better accuracy across various applications, including Q&A coding, unstructured coding, and translation.
Methodology
The methodology involves defining a feature-matching loss that measures the squared error between the mean feature embeddings of model rollouts and ground-truth completions. EBFT employs strided block-parallel sampling for efficient rollout generation and uses a REINFORCE-style gradient estimator for training, optimizing the feature-matching loss directly without requiring task-specific verifiers.
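The loss itself is simple to state: the squared error between the mean feature embedding of model rollouts and that of ground-truth completions. The sketch below shows only the scalar objective; in EBFT this quantity is pushed down with a REINFORCE-style estimator, since sampling rollouts is not differentiable.

```python
def mean_embedding(feats):
    # feats: list of feature vectors, one per rollout or completion.
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

def feature_matching_loss(rollout_feats, target_feats):
    # Squared error between the mean embedding of model rollouts and the
    # mean embedding of ground-truth completions.
    mu_r = mean_embedding(rollout_feats)
    mu_t = mean_embedding(target_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_r, mu_t))
```

Because the target is a sequence-level statistic rather than a per-token prediction, the loss is zero whenever the rollout distribution matches the completion distribution in feature space, even if individual tokens differ.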
Results
The empirical evaluation shows that EBFT achieves lower feature-matching loss across various completion lengths compared to SFT and RLVR. It demonstrates improved downstream performance in tasks such as Q&A coding, unstructured coding, and translation, while maintaining a lower validation cross-entropy.
Implications
The findings suggest that optimizing for feature-matching rather than token-level predictions can lead to better-calibrated language models, enhancing their performance in open-ended tasks. This approach may have applications in various NLP tasks where sequence-level accuracy is crucial.
Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks
Theory
NLP
Large Language Models
- Attention sinks are a necessary feature of softmax transformers for certain tasks.
- A trigger-conditional task is introduced to formalize the need for attention sinks.
- Softmax normalization drives the formation of attention sinks, unlike ReLU attention.
- Empirical experiments validate theoretical predictions regarding attention behavior.
Summary
This paper investigates the phenomenon of attention sinks in softmax transformers, where attention probability mass concentrates on a fixed, content-agnostic position. The author proves that for softmax self-attention models, computing a specific trigger-conditional behavior necessitates the existence of these sinks. The study introduces a trigger-conditional task where a model must output the average of all preceding token representations when a designated trigger token appears, and output zero otherwise. This task reflects the behavior of attention heads observed in real-world applications. The author demonstrates that softmax attention models must exhibit sink behavior to achieve effective performance on this task, while non-normalized ReLU attention can solve the same task without forming sinks. The findings are supported by experiments showing that softmax models develop strong sinks, whereas ReLU attention eliminates them, confirming that the normalization constraint is the primary cause of sink behavior. Overall, the paper highlights the necessity of attention sinks in softmax transformers and provides insights into their implications for model performance and design.
Methodology
The author introduces a theoretical framework based on a trigger-conditional task to analyze the necessity of attention sinks in softmax transformers. The study includes necessity theorems for single-layer and multi-layer models, complemented by empirical experiments comparing softmax and ReLU attention mechanisms.
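The trigger-conditional task and the core normalization argument can both be stated in a few lines. The `target` function below is a direct transcription of the task description; the softmax-vs-ReLU contrast is the intuition behind the necessity theorem, not the proof itself.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def relu(xs):
    return [max(0.0, x) for x in xs]

def target(tokens, trigger, pos):
    # Trigger-conditional task: average all preceding token values when
    # the current token is the trigger; output zero otherwise.
    if tokens[pos] == trigger:
        prev = tokens[:pos]
        return sum(prev) / len(prev) if prev else 0.0
    return 0.0

# Why softmax needs a sink: its weights always sum to 1, so the only way
# to emit (near-)zero at a non-trigger position is to park the mass on a
# token whose value is ~0 -- a sink. ReLU attention can simply go silent.
scores = [-5.0, -5.0, -5.0]
assert abs(sum(softmax(scores)) - 1.0) < 1e-9   # mass cannot vanish
assert sum(relu(scores)) == 0.0                 # mass can vanish
```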
Results
The paper establishes that single-layer softmax attention models must concentrate attention on a fixed sink token at non-trigger positions to achieve low error rates. For multi-layer models, at least one layer must exhibit sink behavior. The experiments confirm that softmax models develop attention sinks while ReLU attention models do not, validating the theoretical claims.
Implications
The findings suggest that attention sinks are not merely an artifact of model training but are essential for certain functional behaviors in softmax transformers. This has implications for model design, optimization, and understanding attention mechanisms in various applications, including NLP and multimodal tasks.
Security Considerations for Artificial Intelligence Agents
Large Language Models
Theory
- AI agents introduce new security vulnerabilities distinct from traditional software systems.
- The distinction between code and data is increasingly blurred in LLM-powered systems.
- Existing security mechanisms may not be suitable for the autonomous and adaptable nature of AI agents.
- A layered defense strategy is necessary to address the unique risks associated with AI agents.
Summary
This paper discusses the unique security challenges posed by AI agents, particularly those powered by Large Language Models (LLMs). It highlights how these systems blur the lines between code and data, leading to new vulnerabilities that traditional security mechanisms may not adequately address. The authors identify key attack surfaces, including indirect prompt injection and cascading failures in workflows, and propose a layered defense strategy that includes input-level mitigations, sandboxed execution, and policy enforcement. They also point out gaps in current standards and research, advocating for adaptive security benchmarks and improved policy models for delegation and privilege control. The insights are drawn from Perplexity's extensive experience with AI agents used by millions, emphasizing the need for tailored security solutions in this evolving landscape.
Methodology
The authors conducted a comprehensive analysis of the security threats, risks, and vulnerabilities specific to AI agent systems, drawing from their operational experience and existing literature. They mapped attack surfaces and assessed current defenses, proposing a layered security framework.
Results
The paper identifies critical vulnerabilities in AI agent systems, such as the risks of prompt injection and the challenges of maintaining code-data separation. It also outlines a multi-layered defense strategy and highlights the need for new standards and research to address these emerging security concerns.
Implications
The findings suggest that as AI agents become more prevalent, there is an urgent need for tailored security mechanisms that can effectively mitigate risks. This could lead to the development of new standards and practices in AI security, influencing both research and industry applications.
Survival Meets Classification: A Novel Framework for Early Risk Prediction Models of Chronic Diseases
Theory
Interpretability
- Integration of survival analysis and classification for chronic disease risk prediction.
- Development of models that do not rely on lab results, enhancing early intervention capabilities.
- Performance metrics of the proposed models are competitive with leading machine learning models.
- Clinically validated explanations of model predictions using SHAP, ensuring relevance to healthcare.
Summary
This paper presents a novel framework for early risk prediction models of chronic diseases by integrating survival analysis with classification techniques. The authors focus on five prevalent chronic diseases: diabetes, hypertension, chronic kidney disease (CKD), chronic obstructive pulmonary disease (COPD), and chronic ischemic heart disease (CHD). Traditional models have typically approached disease prediction through either survival analysis or classification independently, but this study re-engineers survival models to also function as classifiers. By utilizing big electronic medical record (EMR) data, the authors aim to create models that can predict disease onset without relying on laboratory test results, thus allowing for timely medical interventions. The proposed models demonstrate performance metrics such as accuracy, F1 score, and AUROC that are comparable to or exceed those of state-of-the-art models like LightGBM and XGBoost. Additionally, the study introduces a novel methodology for generating clinically validated explanations of the model outputs using the SHAP algorithm, ensuring that the findings are relevant and actionable for healthcare providers.
Methodology
The authors utilized electronic medical records (EMR) data to develop early disease risk prediction models by re-engineering survival analysis methods to also provide classification outputs. The models were trained on de-identified patient data, focusing on common chronic diseases while excluding laboratory results. The SHAP algorithm was employed to generate explanations for the model predictions, which were validated by a panel of expert physicians.
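The re-engineering idea, a survival model doubling as a classifier, can be shown in miniature with a discrete-time hazard model. The hazard values and threshold below are made up for illustration; the point is the conversion `risk = 1 - S(horizon)`.

```python
def survival_curve(hazards):
    # S(t) = prod_{k<=t} (1 - h_k): probability of remaining disease-free
    # through interval t, given per-interval hazards h_k.
    S, surv = 1.0, []
    for h in hazards:
        S *= (1.0 - h)
        surv.append(S)
    return surv

def classify_at_horizon(hazards, horizon, threshold=0.5):
    # Turn the survival model into a classifier: threshold the cumulative
    # risk of disease onset by the chosen horizon.
    risk = 1.0 - survival_curve(hazards)[horizon]
    return int(risk >= threshold), risk
```

Because the same fitted hazards yield both the survival curve and the binary label, the one model serves both evaluation regimes (AUROC/F1 as a classifier, time-to-event analysis as a survival model).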
Results
The proposed survival models achieved performance metrics (accuracy, F1 score, AUROC) that are comparable to or better than existing state-of-the-art models like LightGBM and XGBoost. The models effectively predicted the risk of chronic diseases using only routine patient data recorded in EMRs, demonstrating their potential for early intervention.
Implications
The findings suggest that integrating survival analysis with classification can significantly enhance early risk prediction for chronic diseases, allowing healthcare providers to implement timely preventive measures. This approach could lead to improved patient outcomes and more efficient healthcare management by identifying at-risk patients before the onset of severe conditions.
STAMP: Selective Task-Aware Mechanism for Text Privacy
NLP
Large Language Models
Efficient ML
- STAMP provides a selective approach to text privatization, enhancing privacy without sacrificing task utility.
- The polar mechanism allows for direction-only perturbations of embeddings, preserving semantic meaning.
- Experimental results show STAMP outperforms traditional methods in maintaining privacy-utility balance.
- The framework is applicable to various contexts, including inference-time privacy and privacy-preserving text rewriting.
Summary
The paper introduces STAMP (Selective Task-Aware Mechanism for Text Privacy), a novel framework designed to enhance the privacy-utility trade-off in text privatization. STAMP operates by selectively distributing privacy budgets across tokens based on their importance to specific downstream tasks and their privacy sensitivity. This token-level approach allows for nuanced control over the noise applied to different parts of the input text, thereby balancing the need for privacy protection with the relevance of the text for the task at hand. The authors propose a new perturbation technique called the polar mechanism, which modifies only the direction of token embeddings while preserving their magnitude. This method aligns the perturbation with the decoding process, maintaining semantic relationships in the embedding space better than traditional isotropic noise methods. The effectiveness of STAMP is validated through experiments on various datasets, including SQuAD, Yelp, and AG News, demonstrating its superior performance in achieving favorable privacy-utility trade-offs across different privacy budgets.
Methodology
STAMP employs a selective allocation of privacy budgets across tokens based on their task importance and privacy sensitivity. It introduces the polar mechanism for perturbing token embeddings, which modifies only the direction of embeddings while keeping their magnitude intact. The decoding process utilizes cosine nearest-neighbor search to align with the perturbation geometry, ensuring semantic neighborhoods are preserved.
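The direction-only property of the polar mechanism can be sketched as follows. The noise distribution here is plain Gaussian for illustration; the paper's differentially private noise calibration is assumed, not reproduced. What the sketch preserves is the defining invariant: the perturbed embedding keeps the original magnitude, so only its direction moves.

```python
import math, random

def polar_perturb(vec, sigma, rng):
    # Add noise, then rescale back to the ORIGINAL magnitude, so the
    # perturbation changes direction only (sketch of the polar mechanism;
    # the exact noise distribution is assumed).
    noisy = [x + rng.gauss(0, sigma) for x in vec]
    orig_norm = math.sqrt(sum(x * x for x in vec))
    new_norm = math.sqrt(sum(x * x for x in noisy)) or 1.0
    return [x * orig_norm / new_norm for x in noisy]
```

Keeping magnitudes intact is what lets decoding use cosine nearest-neighbor search: the perturbation geometry and the decoding geometry agree on the unit sphere.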
Results
The experimental evaluations reveal that STAMP, when combined with the normalized polar mechanism, consistently achieves better privacy-utility trade-offs compared to existing methods across multiple datasets. The results indicate that STAMP effectively balances the need for privacy with the relevance of the text for downstream tasks.
Implications
STAMP's approach to selective text privatization could significantly enhance privacy in applications involving sensitive user-generated content, such as chatbots and recommendation systems. Its ability to maintain task utility while protecting privacy makes it a valuable tool for deploying large language models in real-world scenarios.
Representation Finetuning for Continual Learning
Efficient ML
Robotics
Theory
- CoRe is the first framework to integrate representation finetuning into continual learning.
- It performs task-specific interventions in low-rank subspaces of hidden representations.
- CoRe achieves superior parameter efficiency and mitigates catastrophic forgetting.
- Extensive experiments show CoRe outperforms existing parameter-efficient fine-tuning methods.

Summary
This paper introduces Continual Representation Learning (CoRe), a novel framework that shifts the finetuning paradigm in continual learning from weight space to representation space. Traditional Parameter-Efficient Fine-Tuning (PEFT) methods often lead to catastrophic forgetting and lack interpretability due to their reliance on black-box optimization. CoRe addresses these issues by performing task-specific interventions within a low-rank linear subspace of hidden representations, ensuring stability for past tasks while allowing for adaptability to new tasks. The framework employs explicit optimization objectives to guide representation evolution, achieving superior parameter efficiency. Extensive experiments across multiple continual learning benchmarks demonstrate that CoRe significantly outperforms existing state-of-the-art methods, enhancing the adaptability of pre-trained models in dynamic environments. This work presents a more effective and interpretable approach to continual learning, making it suitable for applications in autonomous systems, robotics, and personalized AI assistants.
Methodology
The CoRe framework constructs task-specific low-rank intervention subspaces within critical representation layers of the model. It utilizes explicit optimization goals to guide the evolution of representations, allowing for efficient adaptation to new tasks while preserving knowledge from previous tasks.
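The low-rank intervention idea can be sketched as follows. The linear form below is borrowed from generic representation-finetuning (ReFT-style) methods and is an assumption — the summary does not specify CoRe's exact parameterization — but it shows the key property: the hidden state is edited only inside a rank-r subspace.

```python
import numpy as np

class LowRankIntervention:
    """Task-specific edit in an r-dimensional subspace of a hidden state,
    a ReFT-style sketch (the exact form is an assumption, not the paper's):
        h' = h + R^T (W h + b - R h)
    so h changes only inside the rank-r subspace spanned by R's rows."""
    def __init__(self, hidden_dim, rank, rng):
        # Orthonormal rows spanning the intervention subspace.
        q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, rank)))
        self.R = q.T                      # (rank, hidden_dim)
        self.W = rng.normal(scale=0.01, size=(rank, hidden_dim))
        self.b = np.zeros(rank)

    def __call__(self, h):
        return h + self.R.T @ (self.W @ h + self.b - self.R @ h)

rng = np.random.default_rng(0)
iv = LowRankIntervention(hidden_dim=64, rank=4, rng=rng)
h = rng.normal(size=64)
delta = iv(h) - h
# The edit lives entirely in the rank-r subspace: projecting the delta onto
# the orthogonal complement leaves nothing.
residual = delta - iv.R.T @ (iv.R @ delta)
print(np.allclose(residual, 0))          # → True
```

Keeping edits confined to a small, task-specific subspace is what lets each task's intervention leave the representations used by earlier tasks untouched.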
Results
CoRe consistently outperformed existing state-of-the-art methods across various continual learning benchmarks, demonstrating both enhanced performance and parameter efficiency.
Implications
The CoRe framework enhances the adaptability of pre-trained models in dynamic environments, making it particularly suitable for real-world applications such as autonomous systems, robotics, and personalized AI assistants, where efficient lifelong learning is crucial.
Heavy-Tailed Principal Component Analysis
Theory
- Introduces a robust PCA framework for heavy-tailed data using a superstatistical model.
- Formulates PCA with a logarithmic loss function, applicable even without finite moments.
- Demonstrates that principal components from heavy-tailed data coincide with those from Gaussian covariance.
- Proposes new robust covariance estimators that outperform classical methods in challenging noise conditions.
Summary
This paper presents a novel approach to Principal Component Analysis (PCA) that addresses the limitations of classical PCA when dealing with heavy-tailed data and impulsive noise. Traditional PCA relies on second-order moments, making it sensitive to outliers and noise. The authors propose a framework based on a superstatistical dependent model, where data is represented as X = A^(1/2)G, with A being a positive random scalar and G a Gaussian vector. This model captures a wide range of heavy-tailed distributions, including multivariate-t and sub-Gaussian α-stable laws. The authors formulate PCA under a logarithmic loss function, which remains applicable even when moments do not exist. They demonstrate that the principal components derived from heavy-tailed observations align with those obtained from the covariance matrix of the underlying Gaussian generator. The paper introduces robust estimators for this covariance matrix directly from heavy-tailed data and compares their performance against traditional empirical covariance and Tyler’s scatter estimator. Extensive experiments, including background denoising tasks, show that the proposed method effectively recovers principal directions and significantly outperforms classical PCA in the presence of heavy-tailed and impulsive noise, while maintaining competitive performance under Gaussian noise.
Methodology
The authors develop a PCA framework based on a superstatistical model of heavy-tailed data, employing a logarithmic loss function for robustness. They derive robust covariance estimators from heavy-tailed observations and compare these with traditional methods, including empirical covariance and Tyler's scatter estimator. The methodology is validated through extensive experiments focused on background denoising tasks.
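The superstatistical model X = A^(1/2)G, and the claim that robust scatter estimation recovers the Gaussian generator's principal directions, can be illustrated with Tyler's estimator — the classical baseline the authors compare against (the paper's own estimators are not reproduced here; the inverse-gamma scale mixture below is one example of a heavy-tailed A):

```python
import numpy as np

def tyler_estimator(X, iters=50):
    """Tyler's fixed-point scatter estimator, the classical robust baseline.
    Per-sample weights 1/(x^T S^{-1} x) make it invariant to the heavy-tailed
    scale factor; scale is fixed by trace normalization."""
    n, d = X.shape
    sigma = np.eye(d)
    for _ in range(iters):
        inv = np.linalg.inv(sigma)
        w = 1.0 / np.einsum('ij,jk,ik->i', X, inv, X)   # 1 / (x^T S^-1 x)
        sigma = (d / n) * (X * w[:, None]).T @ X
        sigma *= d / np.trace(sigma)
    return sigma

# Superstatistical heavy-tailed data: x = sqrt(a) * g with g ~ N(0, C).
rng = np.random.default_rng(1)
d, n = 5, 20000
C = np.diag([10.0, 5.0, 1.0, 0.5, 0.1])            # Gaussian generator covariance
G = rng.normal(size=(n, d)) @ np.linalg.cholesky(C).T
a = 1.0 / rng.gamma(shape=1.0, scale=1.0, size=n)  # heavy-tailed random scale
X = np.sqrt(a)[:, None] * G

top_true = np.linalg.eigh(C)[1][:, -1]
top_est = np.linalg.eigh(tyler_estimator(X))[1][:, -1]
print(abs(top_true @ top_est) > 0.99)   # principal direction recovered
```

The empirical covariance of X would be dominated by the extreme scale factors, whereas the robust estimator's per-sample normalization recovers the shape of C — the behavior the paper's log-loss framework formalizes and improves upon.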
Results
The proposed robust PCA approach successfully recovers principal components from heavy-tailed data, demonstrating significant improvements over classical PCA, particularly in the presence of heavy-tailed and impulsive noise. The new covariance estimators show enhanced performance, effectively capturing the underlying data structure compared to traditional methods.
Implications
This work has significant implications for fields that deal with high-dimensional data prone to heavy-tailed distributions, such as finance, environmental monitoring, and image processing. The robust PCA framework can enhance data analysis and interpretation in these domains, leading to more reliable outcomes in the presence of noise and outliers.
LongFlow: Efficient KV Cache Compression for Reasoning Models
NLP
Large Language Models
Efficient ML
- Introduction of LongFlow, a lightweight KV cache compression algorithm tailored for long-output generation.
- Efficient importance estimation derived from attention computation, requiring negligible overhead.
- Development of a custom Triton kernel that fuses multiple operations to enhance performance.
- Achieves up to 11.8× throughput improvement and 80% KV cache compression with minimal accuracy loss.
Summary
The paper introduces LongFlow, a novel KV cache compression method designed specifically for reasoning models that generate long output sequences. Traditional KV cache optimization techniques are inadequate for these models, which face significant memory and bandwidth challenges due to their extensive output requirements. LongFlow addresses these issues by proposing an efficient importance estimation metric derived from intermediate attention computation results, allowing for substantial KV cache compression without auxiliary storage or significant computational overhead. Additionally, the authors develop a custom Triton kernel that integrates various operations into a single optimized operator, enhancing system-level efficiency. Experimental results demonstrate that LongFlow achieves up to an 11.8× improvement in throughput while compressing the KV cache by 80%, all with minimal impact on model accuracy. This advancement is particularly relevant for applications in mathematical reasoning and code generation, where long outputs are common.
Methodology
LongFlow employs a novel importance estimation metric based on intermediate attention computation results, which is computed using only the current query. The method avoids auxiliary storage and introduces minimal computational overhead. The authors also implement a custom Triton kernel that integrates attention computation, importance estimation, and token eviction into a single optimized operator, enhancing efficiency.
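The attention-derived eviction idea can be sketched as follows. This is a simplified stand-in: LongFlow's actual importance metric, fused Triton kernel, and eviction schedule are more involved, and the protected "sink" tokens are a common heuristic assumed here, not a detail from the summary.

```python
import numpy as np

def evict_kv(K, V, query, keep_ratio=0.2, n_sink=4):
    """Score each cached token by the current query's softmax attention over
    it and keep only the highest-scoring fraction of the KV cache."""
    scores = K @ query / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    n_keep = max(n_sink, int(len(weights) * keep_ratio))
    top = np.argsort(weights)[::-1][:n_keep]
    kept = np.union1d(top, np.arange(n_sink))   # always keep sink tokens
    return K[kept], V[kept], kept

rng = np.random.default_rng(0)
K = rng.normal(size=(1000, 64))                 # cached keys
V = rng.normal(size=(1000, 64))                 # cached values
q = rng.normal(size=64)                         # current decoding query
K2, V2, kept = evict_kv(K, V, q)
print(K2.shape[0] <= 204 and K2.shape == V2.shape)  # ~80% of the cache evicted
```

Because the scores are byproducts of the attention computation the model performs anyway, estimating importance this way adds essentially no overhead — the property the paper's fused kernel exploits.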
Results
LongFlow demonstrates significant improvements in performance metrics, achieving up to an 11.8× increase in throughput and an 80% reduction in KV cache size, while maintaining model accuracy. The custom kernel reduces attention computation latency from 47 ms to 8 ms.
Implications
The advancements presented in LongFlow have the potential to significantly improve the efficiency of reasoning models in various applications, particularly in fields requiring extensive output generation such as mathematical reasoning and code generation. This could lead to more cost-effective deployment of large language models in real-world scenarios.
Context-dependent manifold learning: A neuromodulated constrained autoencoder approach
Theory
Interpretability
Robotics
- Introduction of the Neuromodulated Constrained Autoencoder (NcAE) for context-dependent manifold learning.
- Integration of a neuromodulatory mechanism to adaptively tune geometric constraints based on static context.
- Demonstrated effectiveness on dynamical systems, capturing manifold geometry variations.
- Maintains rigorous projection properties, ensuring physical consistency in latent space.
Summary
This paper introduces the Neuromodulated Constrained Autoencoder (NcAE), a novel framework designed to enhance manifold learning by incorporating a neuromodulatory mechanism into the constrained autoencoder (cAE) architecture. Traditional cAEs struggle to adapt to varying environmental conditions without conflating contextual shifts with primary input data. The NcAE addresses this limitation by allowing for context-dependent geometric constraints, enabling the model to learn distinct manifolds that are parameterized by static contextual information. The authors demonstrate the effectiveness of the NcAE through experiments on two dynamical systems: a 16-degree-of-freedom pendulum and the Lorenz96 system. Results indicate that the NcAE successfully captures variations in manifold geometry across different regimes while preserving rigorous projection properties, thus decoupling global contextual parameters from local manifold representations. This advancement lays the groundwork for more flexible, physics-informed representations in systems influenced by non-stationary environmental constraints.
Methodology
The NcAE framework integrates a neuromodulatory mechanism into the cAE architecture, allowing for adaptive tuning of network activation functions based on a static context vector. This approach enables the model to learn a family of context-specific manifolds while enforcing geometric constraints that ensure idempotency and smooth manifold representations. The architecture was evaluated on two dynamical systems to assess its performance against baseline models.
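The neuromodulatory tuning of activations by a static context can be sketched as follows. The affine gain-and-shift gating is an assumption made for illustration; the paper's modulation form and its geometric (idempotency) constraints are richer than this toy layer.

```python
import numpy as np

class NeuromodulatedLayer:
    """One encoder layer whose nonlinearity is gain/shift-modulated by a
    static context vector -- a minimal sketch of the neuromodulatory idea
    (the specific gating form is an assumption)."""
    def __init__(self, d_in, d_out, d_ctx, rng):
        self.W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_out, d_in))
        self.G = rng.normal(scale=0.1, size=(d_out, d_ctx))  # context -> gain
        self.B = rng.normal(scale=0.1, size=(d_out, d_ctx))  # context -> shift
    def __call__(self, x, ctx):
        gain = 1.0 + self.G @ ctx
        return gain * np.tanh(self.W @ x) + self.B @ ctx

rng = np.random.default_rng(0)
layer = NeuromodulatedLayer(d_in=8, d_out=4, d_ctx=2, rng=rng)
x = rng.normal(size=8)
z_a = layer(x, np.array([1.0, 0.0]))   # same input under context A...
z_b = layer(x, np.array([0.0, 1.0]))   # ...versus context B
print(not np.allclose(z_a, z_b))       # contexts induce distinct mappings
```

The essential point is the separation of pathways: the context never mixes with the primary input `x`; it only reshapes the mapping applied to it, which is how the model learns a family of context-indexed manifolds.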
Results
The NcAE effectively captured the variations in manifold geometry across different regimes in the tested dynamical systems, demonstrating superior performance compared to traditional models. The results confirmed that the neuromodulatory mechanism successfully decouples global contextual parameters from local manifold representations, maintaining rigorous projection properties.
Implications
The NcAE framework has significant implications for reduced-order modeling and system identification in various fields, particularly in scenarios where environmental conditions are non-stationary. It opens avenues for developing more adaptable and interpretable machine learning models that can better represent complex physical systems.
Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
Reinforcement Learning
Large Language Models
Efficient ML
- REOPOLD stabilizes on-policy distillation by relaxing strict imitation constraints.
- The framework utilizes modern RL insights to improve sample efficiency and test-time scaling.
- Empirical results show REOPOLD outperforms traditional methods in various reasoning tasks.
- The approach allows smaller models to achieve performance levels comparable to much larger models.
Summary
This paper introduces REOPOLD (Relaxed On-Policy Distillation), a novel framework designed to enhance the efficiency of reasoning capabilities in smaller language models (SLMs) by addressing the limitations of traditional on-policy distillation methods. The authors analyze the instability and negative transfer issues associated with standard on-policy distillation, interpreting it through the lens of reinforcement learning (RL) as a form of policy optimization. By relaxing strict imitation constraints, REOPOLD employs a mixture-based reward clipping, entropy-based dynamic sampling, and a unified exploration-to-refinement training strategy to stabilize the optimization process. Empirical results demonstrate that REOPOLD significantly outperforms existing methods, achieving 6.7 to 12 times greater sample efficiency and enabling a 7B student model to match the performance of a 32B teacher model in visual reasoning tasks, while also providing a 3.32 times speedup in inference.
Methodology
The authors propose REOPOLD, which interprets on-policy distillation as policy optimization. It employs techniques such as mixture-based reward clipping, entropy-based dynamic sampling, and a multi-stage training approach to filter harmful signals and stabilize the learning process. This framework is designed to enhance sample efficiency and scalability during training and inference.
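The entropy-based dynamic sampling component can be sketched as follows. The summary names the mechanism but not its exact form, so the entropy band and the per-rollout statistic below are assumptions: the idea is to discard both collapsed (near-deterministic) and degenerate high-entropy rollouts before they enter the distillation update.

```python
import numpy as np

def entropy_filter(rollout_logprobs, lo=0.3, hi=3.85):
    """Keep only rollouts whose mean per-token entropy lies inside a band
    (band values are illustrative assumptions)."""
    kept = []
    for lp in rollout_logprobs:            # lp: (tokens, vocab) log-probs
        ent = -(np.exp(lp) * lp).sum(axis=1).mean()
        if lo <= ent <= hi:
            kept.append(lp)
    return kept

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
collapsed = np.full((16, 50), -10.0)
collapsed[:, 0] = 10.0                     # near one-hot: entropy ~ 0
degenerate = log_softmax(collapsed)
healthy = log_softmax(rng.normal(size=(16, 50)))
print(len(entropy_filter([degenerate, healthy])))   # → 1
```

Filtering on a cheap statistic of the student's own rollouts is one way to "relax" strict imitation: harmful low-signal samples are dropped rather than forced into the objective.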
Results
REOPOLD achieves state-of-the-art performance on the AIME-25 benchmark, demonstrating superior sample efficiency (6.7 to 12 times better than baselines) and enabling a 7B student model to closely match the performance of a 32B teacher model in visual reasoning tasks. Additionally, it provides a significant inference speedup of approximately 3.32 times.
Implications
The findings suggest that REOPOLD can be effectively utilized to enhance the reasoning capabilities of smaller models, making them more competitive with larger models. This has potential applications in various domains requiring efficient reasoning, such as educational tools, automated reasoning systems, and AI-driven decision-making processes.
Efficient Generative Modeling with Unitary Matrix Product States Using Riemannian Optimization
Generative Models
Optimization
Efficient ML
- Introduction of a unitary MPS framework that enhances generative modeling by enforcing tensor-norm constraints.
- Development of a Riemannian optimization technique that improves training stability and efficiency for MPS.
- Demonstration of strong generative performance on benchmark datasets, validating the advantages of the proposed method.
Summary
This paper explores the application of unitary matrix product states (MPS) for generative modeling, leveraging their strong expressive capacity and physical interpretability. The authors propose a novel Riemannian optimization approach to enhance the training efficiency of MPS, addressing the limitations of standard gradient-based methods. By enforcing manifold constraints, the unitary MPS framework reduces ambiguity in parameter updates, leading to improved stability and efficiency during training. The proposed methodology combines DMRG-inspired updates with a space-decoupling strategy, allowing for parallel optimization of MPS cores. Experimental results on the Bars-and-Stripes and EMNIST datasets demonstrate the effectiveness of the approach, showcasing rapid adaptation to data structures and strong generative performance while maintaining the advantages of MPS. Overall, the study highlights the potential of tensor networks in generative modeling, particularly in high-dimensional and quantum-structured data contexts.
Methodology
The authors developed a Riemannian optimization approach that reformulates the generative modeling task as an optimization problem with manifold constraints. This includes a unitary MPS framework that limits optimization directions to those that adjust relative weights among MPS cores, combined with a space-decoupling strategy for efficient updates.
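A generic Riemannian gradient step under a unitarity constraint can be sketched as follows (shown for real orthogonal matrices). This illustrates the style of manifold-constrained update; the paper's MPS-core algorithm and space-decoupling strategy are not reproduced here.

```python
import numpy as np

def riemannian_step(X, euclid_grad, lr=0.1):
    """One Riemannian gradient step on the orthogonal-matrix manifold:
    project the Euclidean gradient onto the tangent space (the skew part
    of X^T G) and retract with a Cayley transform, which keeps X exactly
    orthogonal."""
    A = X.T @ euclid_grad
    A = 0.5 * (A - A.T)                          # skew-symmetric tangent direction
    I = np.eye(X.shape[0])
    cayley = np.linalg.solve(I + 0.5 * lr * A, I - 0.5 * lr * A)
    return X @ cayley                            # retraction stays on the manifold

rng = np.random.default_rng(0)
X = np.linalg.qr(rng.normal(size=(6, 6)))[0]     # start on the manifold
G = rng.normal(size=(6, 6))                      # stand-in loss gradient
X1 = riemannian_step(X, G)
print(np.allclose(X1.T @ X1, np.eye(6)))         # → True: still orthogonal
```

Projecting out the non-tangent component is what removes the update ambiguity the summary mentions: directions that merely rescale or rotate within the gauge freedom of the MPS are excluded before the step is taken.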
Results
The experiments conducted on the Bars-and-Stripes and EMNIST datasets showed that the proposed Riemannian optimization method significantly outperformed traditional Euclidean gradient descent in terms of convergence speed, stability, and generative quality, confirming the benefits of the unitary MPS framework.
Implications
The findings suggest that unitary MPS can serve as a powerful tool for generative modeling, particularly in applications involving high-dimensional data or quantum systems. The Riemannian optimization framework may also be applicable to other machine learning models that require efficient training under manifold constraints.
H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code
Large Language Models
NLP
Generative Models
- Development of a large-scale training corpus for embedded systems code using repository-datasheet pairs.
- Identification of optimal hyperparameters for continual pretraining using Bayesian optimization and grid search.
- Significant improvements in perplexity and generative accuracy over existing models in specialized embedded domains.
- Demonstration that smaller models can rival larger frontier models in specific technical tasks.
Summary
The paper introduces H2LooP Spark Preview, a continual pretraining (CPT) pipeline designed to adapt the OLMo-3-7B language model for low-level embedded systems programming. It addresses the limitations of large language models (LLMs) in generating code for specialized domains, particularly embedded systems, which involve unique hardware-specific patterns and APIs. The authors constructed a training corpus from 818 repository-datasheet pairs, totaling 76.4 GB of data across 117 manufacturers and 19 component categories, resulting in approximately 23.5 billion tokens. They employed BF16 LoRA with Rank-Stabilized scaling on NVIDIA H100 GPUs for training. Through extensive hyperparameter exploration involving over 1,400 runs, the study found that high-rank LoRA with conservative learning rates optimally adapted the model to the embedded domain. The results showed a significant reduction in perplexity and demonstrated that the 7B-parameter model outperformed larger models like Claude Opus 4.6 and Qwen3-Coder-30B in generative code completion tasks across 13 embedded domains. The authors also released a production checkpoint as open-source to facilitate further research in this area.
Methodology
The authors utilized a continual pretraining approach on the OLMo-3-7B model, employing BF16 LoRA with Rank-Stabilized scaling on NVIDIA H100 GPUs. They constructed a training corpus from repository-datasheet pairs and conducted systematic hyperparameter exploration across multiple runs to identify the best configurations for domain adaptation.
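The rank-stabilized scaling mentioned above differs from classic LoRA only in how the low-rank update is scaled, which a few lines make concrete. The random matrices below are stand-ins (real LoRA initializes B to zero); only the scaling rule is the point.

```python
import numpy as np

def lora_delta(A, B, alpha, rank_stabilized=True):
    """LoRA weight update Delta = s * (B @ A). Classic LoRA scales by
    s = alpha / r; rank-stabilized LoRA (rsLoRA) uses s = alpha / sqrt(r)
    so the update's magnitude does not shrink as the rank r grows."""
    r = A.shape[0]
    s = alpha / np.sqrt(r) if rank_stabilized else alpha / r
    return s * (B @ A)

rng = np.random.default_rng(0)
d, r, alpha = 64, 16, 16
A = rng.normal(scale=1.0 / np.sqrt(d), size=(r, d))
B = rng.normal(scale=1.0 / np.sqrt(r), size=(d, r))
n_rs = np.linalg.norm(lora_delta(A, B, alpha, rank_stabilized=True))
n_classic = np.linalg.norm(lora_delta(A, B, alpha, rank_stabilized=False))
print(round(n_rs / n_classic, 6))                # → 4.0 (= sqrt(r))
```

The sqrt(r) scaling is what makes the high-rank configurations the study found optimal trainable at ordinary learning rates.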
Results
The model achieved a 70.4% reduction in in-domain perplexity and a 66.1% reduction on held-out repositories. It outperformed larger models in generative code completion benchmarks, achieving the highest accuracy in 8 out of 13 embedded categories.
Implications
The findings suggest that continual pretraining can effectively adapt language models to specialized domains like embedded systems, enabling better automated code generation and understanding. This could lead to advancements in software development for hardware-specific applications.
Disentangled Representation Learning through Unsupervised Symmetry Group Discovery
Robotics
Reinforcement Learning
Theory
- Introduces a method for autonomous discovery of symmetry group structures in representation learning.
- Proves the identifiability of true symmetry group decomposition under minimal assumptions.
- Develops two algorithms: one for symmetry group discovery and another for LSBD representation learning.
- Demonstrates improved performance over existing LSBD methods in various environments.
Summary
This paper presents a novel approach to symmetry-based disentangled representation learning that eliminates the need for prior knowledge of the symmetry group's structure. The authors propose a method where an embodied agent autonomously discovers the group structure of its action space through unsupervised interactions with the environment. They prove the identifiability of the true symmetry group decomposition under minimal assumptions and introduce two algorithms: one for discovering the group decomposition from interaction data and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without specific subgroup assumptions. The proposed method is validated across three environments with varying group decompositions, demonstrating superior performance compared to existing LSBD approaches. This work addresses the limitations of prior methods that relied on strong assumptions about symmetry groups, thereby advancing the field of unsupervised disentangled representation learning.
Methodology
The authors derive two algorithms based on theoretical proofs: one for discovering the symmetry group decomposition from interaction data and another for learning LSBD representations without imposing structural assumptions on subgroups. The methodology relies on an embodied agent interacting with the environment to gather transition data, which is then used to identify the underlying symmetry group.
Results
The proposed method outperformed existing LSBD approaches in three distinct environments, each exhibiting different group decompositions. The experimental validation confirms the effectiveness of the algorithms in discovering symmetry groups and learning disentangled representations.
Implications
This research has significant implications for improving interpretability, fairness, and transferability in machine learning models. By enabling unsupervised discovery of symmetry groups, it can enhance the ability to manipulate latent spaces and improve the performance of representation learning in various applications, particularly in robotics and reinforcement learning.
Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control
Interpretability
- Exhaustive circuit tracing reveals a heavy-tailed hub distribution in feature connectivity.
- Massive redundancy in feature interactions is confirmed, with no synergy found in higher-order interactions.
- Late-layer features are causally linked to promoting cellular maturity, while early-layer features push cells away from maturity.
- Systematic annotation bias is identified, with many significant features lacking biological annotations.
Summary
This paper addresses limitations in the mechanistic interpretability of biological foundation models by introducing a comprehensive approach to circuit mapping. The study employs exhaustive circuit tracing, higher-order combinatorial ablation, and causal trajectory steering in Geneformer, a transformer-based single-cell model. The exhaustive tracing of 4,065 features at layer 5 resulted in 1,393,850 significant downstream edges, revealing a heavy-tailed hub distribution where a small percentage of features dominate connectivity. The analysis also demonstrated that redundancy increases with interaction order, confirming a subadditive architecture. Additionally, trajectory-guided feature steering established a causal relationship between layer position and differentiation directionality, showing that late-layer features promote cellular maturity while early-layer features do the opposite. These findings challenge previous selective analyses and provide a deeper understanding of the model's organization and processing of cellular information.
Methodology
The study utilized exhaustive circuit tracing to analyze all active features at layer 5 of Geneformer, measuring causal effects on downstream features. Three-way combinatorial ablation was performed to assess redundancy and interaction orders. Causal trajectory steering was employed to investigate the relationship between layer position and differentiation directionality.
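The edge-tracing procedure can be sketched on a toy model: ablate each upstream feature in turn and record which downstream features shift beyond a threshold. This is a drastically simplified stand-in for the paper's tracing over Geneformer's features, with fabricated toy weights chosen so the planted "hub" connections are recovered.

```python
import numpy as np

def ablation_edges(W1, W2, x, threshold):
    """Exhaustive pairwise circuit tracing on a two-layer linear toy model:
    zero out each upstream feature and record significantly shifted
    downstream features as causal edges."""
    h = W1 @ x                                  # upstream feature activations
    base = W2 @ h                               # downstream activations
    edges = []
    for i in range(len(h)):
        h_ablated = h.copy()
        h_ablated[i] = 0.0                      # ablate upstream feature i
        effect = np.abs(W2 @ h_ablated - base)
        for j in np.where(effect > threshold)[0]:
            edges.append((i, int(j)))
    return edges

W1 = np.eye(6, 10)                              # toy encoder: h = x[:6]
W2 = np.zeros((4, 6))
W2[0, 2] = 3.0                                  # upstream 2 drives downstream 0
W2[1, 4] = 2.0                                  # upstream 4 drives downstream 1
x = np.arange(1.0, 11.0)
print(ablation_edges(W1, W2, x, threshold=0.5))  # → [(2, 0), (4, 1)]
```

At the paper's scale this loop runs over thousands of features per input, which is why exhaustive tracing yields over a million edges rather than the narrow slices earlier selective analyses reported.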
Results
The exhaustive analysis produced 1,393,850 significant edges, a 27-fold increase over previous selective tracing methods. A heavy-tailed distribution of connectivity was observed, with 1.8% of features accounting for a majority of edges. Redundancy increased with interaction order, and late-layer features were found to universally push cell states toward maturity, contrasting with early-layer features.
Implications
These findings have significant implications for the interpretability of biological foundation models, suggesting that many unannotated features may play critical roles in cellular processes. The results could inform future research on cellular differentiation and the design of more effective models for biological data analysis.
Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors
Generative Models
Optimization
Theory
- Introduction of KProxNPLVM to improve soft sensor modeling accuracy.
- Theoretical proof of approximation error in conventional NPLVM training methods.
- Utilization of Wasserstein distance as a proximal operator for objective relaxation.
- Rigorous derivation of optimization implementation and convergence proof.
Summary
This paper addresses the limitations of conventional Nonlinear Probabilistic Latent Variable Models (NPLVMs) in soft sensor modeling, particularly the approximation error introduced by using amortized variational inference (AVI). The authors propose a novel model called KProxNPLVM, which improves the performance of NPLVMs by relaxing the optimization objective through the use of Wasserstein distance as a proximal operator. The paper begins with a theoretical proof of the approximation error in traditional methods and then presents a new variational inference strategy that arises from solving the relaxed optimization problem. The authors provide a rigorous derivation of the optimization implementation for KProxNPLVM and demonstrate its convergence, effectively sidestepping the approximation error. Extensive experiments on both synthetic and real-world industrial datasets validate the efficacy of the proposed model, showing significant improvements in soft sensor modeling accuracy compared to traditional approaches.
Methodology
The authors developed KProxNPLVM by relaxing the learning objective using Wasserstein distance, leading to a new variational inference strategy. They provided a theoretical framework for the optimization process and proved the convergence of the algorithm, allowing it to bypass the approximation error inherent in traditional methods.
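The idea of relaxing an objective with a Wasserstein proximal term can be illustrated on 1-D Gaussians, where the 2-Wasserstein distance has a closed form. This is a toy sketch under strong assumptions (scalar Gaussian variational family, quadratic loss, plain inner gradient descent); the paper's kernelized losses and update derivation are far richer.

```python
import numpy as np

def w2_sq_gauss(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between 1-D Gaussians (closed form)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def proximal_step(m, s, loss_grad, lam=0.5, lr=0.05, inner=200):
    """Solve argmin_q L(q) + W2^2(q, q_current) / (2*lam) by inner gradient
    descent on the variational parameters (mean m, std s)."""
    m_new, s_new = m, s
    for _ in range(inner):
        gm, gs = loss_grad(m_new, s_new)
        m_new -= lr * (gm + (m_new - m) / lam)
        s_new -= lr * (gs + (s_new - s) / lam)
    return m_new, s_new

# Pull N(0, 1) toward a target N(2, 0.5): the proximal term anchors the
# update, so the minimizer lands between the current and target Gaussians.
loss_grad = lambda m, s: (2.0 * (m - 2.0), 2.0 * (s - 0.5))
m1, s1 = proximal_step(0.0, 1.0, loss_grad)
print(round(m1, 2), round(s1, 2))               # → 1.0 0.75
print(round(w2_sq_gauss(0.0, 1.0, m1, s1), 2))  # → 1.06
```

The slack introduced by the proximal term is the "relaxation" in the title: the iterate is never forced all the way to the raw loss minimizer in one step, which is what lets the method sidestep the amortization error of a rigid objective.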
Results
The KProxNPLVM model showed significant improvements in modeling accuracy for soft sensors when tested on both synthetic and real-world datasets, outperforming conventional NPLVMs that utilize amortized variational inference.
Implications
The proposed KProxNPLVM has potential applications in industrial soft sensor modeling, improving the accuracy of predictions related to product quality and operational efficiency, which can lead to reduced energy consumption and enhanced economic outcomes.
Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference
Theory
- PFNs can exhibit prior-induced confounding bias, hindering frequentist consistency.
- A one-step posterior correction (OSPC) is proposed to address this bias.
- The OSPC restores frequentist consistency and leads to a semi-parametric Bernstein-von Mises theorem.
- Martingale posteriors are utilized to implement the OSPC effectively.
Summary
This paper investigates the frequentist consistency of prior-data fitted networks (PFNs) in the context of causal inference, specifically focusing on the average treatment effect (ATE). The authors identify a significant issue where existing PFN-based estimators can exhibit prior-induced confounding bias, which prevents them from achieving frequentist consistency. To address this, they propose a one-step posterior correction (OSPC) calibration procedure that effectively mitigates this bias. The OSPC allows for the recalibration of uncertainty without the need for full retraining of the PFNs. The authors demonstrate that this approach leads to a semi-parametric Bernstein-von Mises theorem for calibrated PFNs, indicating that the calibrated ATE posteriors asymptotically match the normal distribution of classical frequentist estimators. The implementation of OSPC through martingale posteriors enables the recovery of functional nuisance posteriors necessary for the calibration. Empirical results show that the calibrated PFNs yield ATE uncertainty that aligns with frequentist uncertainty in large samples and is well-calibrated in finite samples compared to other Bayesian ATE estimators.
Methodology
The authors analyze the frequentist consistency of PFNs by identifying the prior-induced confounding bias and proposing a calibration procedure (OSPC) based on efficient influence functions. They implement this calibration using martingale posteriors to recover necessary functional nuisance posteriors.
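The efficient-influence-function correction underlying OSPC is, in its frequentist point-estimate form, the classical AIPW one-step estimator, which a short simulation makes concrete. The paper applies this kind of correction to PFN posterior draws; the point-estimate version and the synthetic data below are simplifications.

```python
import numpy as np

def one_step_ate(Y, T, mu1, mu0, e):
    """One-step efficient-influence-function (AIPW) correction of a plug-in
    ATE estimate: plug-in mean plus the mean of the EIF residual term."""
    plug_in = np.mean(mu1 - mu0)
    eif = T * (Y - mu1) / e - (1 - T) * (Y - mu0) / (1 - e)
    return plug_in + np.mean(eif)

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-X))            # true propensity score
T = rng.binomial(1, e)
Y = 2.0 * T + X + rng.normal(size=n)    # true ATE = 2
mu0 = X                                  # correct control outcome model
mu1_biased = 1.5 + X                     # treated model misspecified by -0.5
naive = np.mean(mu1_biased - mu0)
fixed = one_step_ate(Y, T, mu1_biased, mu0, e)
print(round(naive, 1), round(fixed, 1))  # → 1.5 2.0
```

The biased plug-in (here, bias from a misspecified outcome model standing in for prior-induced confounding bias) is repaired by the one-step term without refitting anything — the same "correct without retraining" property the paper exploits for PFNs.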
Results
The study shows that PFNs calibrated with the OSPC produce ATE uncertainty that asymptotically matches frequentist uncertainty and is well-calibrated in finite samples, outperforming other Bayesian ATE estimators.
Implications
The findings suggest that PFNs can be reliably used for causal inference in various fields such as marketing, public policy, and medicine, provided that the proposed calibration method is applied to ensure consistent uncertainty quantification.
Causal Matrix Completion under Multiple Treatments via Mixed Synthetic Nearest Neighbors
Theory
Efficient ML
- Introduction of Mixed Synthetic Nearest Neighbors (MSNN) for causal matrix completion under multiple treatments.
- MSNN retains the statistical properties of SNN while improving sample efficiency for sparse treatment levels.
- Demonstrates the feasibility of estimating causal effects using data from multiple treatment levels.
- Empirical results show MSNN's effectiveness in real-world applications, particularly in data-scarce scenarios.
Summary
This paper addresses the challenge of causal matrix completion under multiple treatments, particularly in scenarios where data is missing not at random (MNAR). The authors introduce the Mixed Synthetic Nearest Neighbors (MSNN) algorithm, which enhances the existing Synthetic Nearest Neighbors (SNN) method by integrating information across different treatment levels. The MSNN approach allows for the estimation of imputation coefficients using data from multiple treatments, thereby overcoming the limitations posed by data scarcity in certain treatment levels. The authors demonstrate that MSNN retains the statistical properties of SNN, such as finite-sample error bounds and asymptotic normality, while significantly improving sample efficiency. Empirical evaluations on both synthetic and real-world datasets, including a case study on California's tobacco control policy, reveal that MSNN effectively estimates causal effects in data-scarce environments where traditional methods fail. This work contributes to the field of causal inference by formalizing the problem of entry-wise causal matrix completion and providing a robust solution that leverages shared latent structures across treatments.
Methodology
The authors propose the MSNN algorithm, which utilizes Mixed Anchor Rows (MAR) and Mixed Anchor Columns (MAC) to estimate imputation coefficients across multiple treatment levels. This method is based on the assumption of shared latent row factors, allowing for effective data integration and overcoming the limitations of existing methods that require treatment-specific data.
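The underlying synthetic-nearest-neighbor imputation step can be sketched as follows. The function name and this minimal regression form are illustrative assumptions; MSNN's extension, only hinted at here, is that the anchor rows may come from other treatment levels, since shared latent row factors make their information transferable.

```python
import numpy as np

def snn_impute(M, mask, i, j, anchors):
    """Impute M[i, j]: regress row i on anchor rows over columns where all
    of them are observed, then apply the learned weights at column j."""
    obs = np.where(mask[i] & mask[anchors].all(axis=0))[0]
    obs = obs[obs != j]                          # never use the target entry
    A = M[np.ix_(anchors, obs)]                  # anchors x shared columns
    w, *_ = np.linalg.lstsq(A.T, M[i, obs], rcond=None)
    return w @ M[anchors, j]

rng = np.random.default_rng(0)
u, v = rng.normal(size=8), rng.normal(size=12)
M = np.outer(u, v)                               # rank-1 ground truth
mask = np.ones(M.shape, dtype=bool)
mask[3, 5] = False                               # treat this entry as missing
est = snn_impute(M, mask, i=3, j=5, anchors=np.array([0, 1]))
print(np.isclose(est, u[3] * v[5]))              # exact recovery at rank 1
```

When the target treatment level has too few observed entries to supply anchors of its own, pooling anchors across treatments in this way is what gives MSNN its sample-efficiency advantage over treatment-specific SNN.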
Results
The MSNN algorithm demonstrates exponential improvements in sample efficiency compared to SNN, particularly in scenarios with sparse treatment data. The empirical evaluations confirm that MSNN can reliably estimate causal effects where SNN fails, showcasing its practical applicability in real-world contexts.
Implications
The findings suggest that MSNN can be a valuable tool for researchers and practitioners in fields requiring causal inference from observational data, such as economics and public policy. By effectively leveraging data across treatment levels, MSNN can enhance decision-making processes in environments with limited data availability.
Flowcean - Model Learning for Cyber-Physical Systems
Optimization
Theory
Efficient ML
- Flowcean automates model generation for Cyber-Physical Systems, addressing the complexity and diversity of these systems.
- The framework supports a variety of learning strategies and data processing methods, enhancing flexibility and usability.
- Flowcean integrates multiple learning libraries, streamlining the modeling process and making it more efficient.
- Data-driven modeling reduces the need for manual effort and domain expertise, facilitating easier model generation.
Read more
Flowcean - Model Learning for Cyber-Physical Systems
Summary
The paper introduces Flowcean, a novel framework aimed at automating the generation of models for Cyber-Physical Systems (CPS) through data-driven learning. CPS are complex systems integrating physical components with digital logic, making traditional modeling approaches labor-intensive and requiring extensive domain knowledge. Flowcean addresses these challenges by providing a modular and flexible architecture that supports various learning strategies, data processing methods, and evaluation metrics tailored to CPS scenarios. The framework facilitates the integration of diverse learning libraries and tools, streamlining the model generation and evaluation process, thus enhancing efficiency and accessibility. The authors emphasize the importance of data-driven modeling as a means to reduce manual effort and specialized knowledge, allowing for the automatic generation of models from system data. Flowcean's design aims to accommodate the unique characteristics of different CPS, making it a versatile solution for a wide range of applications in industries such as energy, mobility, and logistics.
Methodology
The authors developed Flowcean as a modular framework that incorporates various machine learning strategies and data processing techniques. The framework allows users to customize the data-driven learning pipeline according to the specific characteristics of the CPS being modeled, facilitating the automatic generation of models from collected data.
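A modular pipeline of the kind described can be sketched as interchangeable data transforms, a learner, and an evaluation metric. The `Pipeline` class below is a hypothetical illustration of that design, not Flowcean's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Pipeline:
    """Illustrative modular learning pipeline: each stage is swappable."""
    transforms: Sequence[Callable]   # data processing stages, applied in order
    learner: Callable                # fits a model from the processed data
    metric: Callable                 # scores the learned model on held-out data

    def run(self, data, held_out):
        for transform in self.transforms:
            data = transform(data)
        model = self.learner(data)
        return model, self.metric(model, held_out)
```

Because stages are plain callables, exchanging a learner or preprocessing step does not touch the rest of the pipeline, which is the flexibility the framework aims for.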
Results
Flowcean demonstrates the ability to streamline the model generation and evaluation process for CPS, making it more efficient and accessible. The framework's modular architecture allows for the integration of diverse learning tools, which enhances its adaptability to various modeling tasks.
Implications
Flowcean has the potential to significantly improve the design and operation of Cyber-Physical Systems across multiple industries by simplifying the modeling process. Its data-driven approach can lead to more reliable and efficient CPS, ultimately contributing to advancements in fields such as Industry 4.0, logistics, and energy management.
Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
Large Language Models
Optimization
Efficient ML
- Large pretrained models have a dense neighborhood of task-specific solutions, unlike small models.
- The density of effective solutions scales with model size, making random sampling feasible for post-training.
- RandOpt, a simple ensemble method based on random perturbations, achieves competitive performance with traditional methods.
- Diversity in the neighborhood allows for task-specific improvements, where a perturbation can enhance performance on some tasks while degrading it on others.
Read more
Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
Summary
This paper investigates the distribution of task-specific solutions in the weight space of pretrained neural networks, particularly focusing on the differences between small and large models. The authors propose that while small models have a sparse distribution of effective solutions, large pretrained models exhibit a dense 'thicket' of task-expert solutions surrounding their weights. This density allows for a novel post-training method called RandOpt, which utilizes random sampling of parameter perturbations to identify effective adaptations for various tasks. The study finds that the density of task-improving solutions increases with model size, making random guessing a viable strategy for post-training in large models. The paper demonstrates that RandOpt is competitive with traditional methods like PPO and GRPO, achieving similar accuracy with significantly reduced training time. The findings suggest that once a model enters the 'thicket regime', the choice of post-training method becomes less critical, as various approaches can yield effective results.
Methodology
The authors analyze the density and diversity of task-improving solutions in the Gaussian neighborhood around pretrained weights. They propose the RandOpt algorithm, which generates N random perturbations of the pretrained weights, evaluates them on specific tasks, selects the top K perturbations, and ensembles their predictions through majority voting. The performance of RandOpt is compared against traditional post-training methods like PPO, GRPO, and ES.
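The RandOpt loop described above can be sketched directly: sample N Gaussian perturbations, keep the top K by task score, and combine their predictions by majority vote. `evaluate` stands in for the one-task scoring function, and the voting interface is an assumption about how the ensemble is used.

```python
import random
from collections import Counter

def randopt(weights, evaluate, n_samples=32, top_k=5, sigma=0.01, seed=0):
    """RandOpt sketch: draw N Gaussian perturbations of the pretrained
    weights and keep the K highest-scoring ones as ensemble members."""
    rng = random.Random(seed)
    perturbed = [[w + sigma * rng.gauss(0.0, 1.0) for w in weights]
                 for _ in range(n_samples)]
    return sorted(perturbed, key=evaluate, reverse=True)[:top_k]

def majority_vote(predictions):
    """Combine per-member class predictions (one list per member) by
    majority vote at each position."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]
```

Since all N evaluations are independent, this is trivially parallel, which is where the O(1) wall-clock claim (versus O(T) sequential updates) comes from.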
Results
The results indicate that in large models, the density of task-improving solutions is significantly higher, allowing random guessing to be effective for post-training. RandOpt achieves competitive accuracy with traditional methods while requiring O(1) training time, compared to O(T) for sequential methods. The paper also shows that ensembling multiple perturbations can lead to substantial performance improvements.
Implications
The findings suggest that the effectiveness of post-training methods can be enhanced by leveraging the density of solutions in large pretrained models. This could lead to more efficient training processes and broader applications of large language models across various tasks, as the choice of post-training method becomes less critical in the thicket regime.
Duration Aware Scheduling for ASR Serving Under Workload Drift
Audio & Speech
Optimization
Efficient ML
- Duration-aware scheduling can significantly reduce end-to-end latency in ASR systems.
- Shortest Job First (SJF) reduces median latency by up to 73% but can increase tail latency.
- Highest Response Ratio Next (HRRN) provides a balanced approach, reducing median latency by up to 28% while controlling tail latency degradation.
- The proposed methods incur less than 0.1 ms scheduling overhead per request.
Read more
Duration Aware Scheduling for ASR Serving Under Workload Drift
Summary
This paper addresses the inefficiencies of first-come-first-served (FCFS) scheduling in Automatic Speech Recognition (ASR) systems, particularly under variable workloads. The authors demonstrate that audio duration can serve as a reliable proxy for job processing time in ASR models, such as Whisper. By integrating two classical scheduling algorithms—Shortest Job First (SJF) and Highest Response Ratio Next (HRRN)—into the vLLM engine, the authors evaluate their performance against realistic workloads. The results show that SJF significantly reduces median end-to-end latency by up to 73% at high loads but increases tail latency due to starvation of longer requests. In contrast, HRRN balances latency reduction with tail latency control, achieving up to 28% median latency reduction while limiting tail latency degradation to 24%. Both algorithms maintain low scheduling overhead and demonstrate robustness under workload drift, highlighting the effectiveness of duration-aware scheduling in enhancing ASR responsiveness.
Methodology
The authors analyze the correlation between audio duration and job processing time in ASR models. They implement two scheduling algorithms, SJF and HRRN, into the vLLM engine and evaluate their performance on the LibriSpeech dataset and a synthetic workload. The evaluation focuses on metrics such as median end-to-end latency and tail latency under different workload conditions.
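Both policies are classical and easy to sketch once audio duration stands in for the service-time estimate: SJF serves the shortest clip, while HRRN ranks requests by response ratio, (wait + duration) / duration, so long-waiting requests eventually win and starvation is bounded. The queue layout below is an assumption, not the vLLM integration.

```python
def hrrn_pick(now, queue):
    """Highest Response Ratio Next with audio duration as the service-time
    proxy. `queue` holds (arrival_time, audio_duration, request_id) tuples."""
    def response_ratio(req):
        arrival, duration, _ = req
        wait = now - arrival
        return (wait + duration) / duration
    return max(queue, key=response_ratio)

def sjf_pick(queue):
    """Shortest Job First: serve the shortest audio clip next."""
    return min(queue, key=lambda req: req[1])
```

Note how the ratio grows without bound as a request waits, which is exactly the mechanism that caps the tail-latency degradation SJF suffers from.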
Results
The integration of SJF into the ASR pipeline results in a median end-to-end latency reduction of up to 73% at high loads, while HRRN achieves a median latency reduction of up to 28% with a maximum tail latency degradation of 24%. Both algorithms show consistent performance across different workloads without incurring throughput penalties.
Implications
The findings suggest that adopting duration-aware scheduling can significantly enhance the responsiveness of ASR systems, making them more efficient for real-time applications such as voice assistants and real-time captioning. This approach can lead to improved user satisfaction by minimizing delays in interactive applications.
UniHetCO: A Unified Heterogeneous Representation for Multi-Problem Learning in Unsupervised Neural Combinatorial Optimization
Optimization
Graph Learning
- UniHetCO introduces a unified heterogeneous graph representation for multiple combinatorial optimization problems.
- The framework allows for unsupervised learning without requiring ground-truth solutions.
- Dynamic weighting based on gradient norms is employed to balance contributions from different problem classes during training.
- Experiments show competitive performance against existing unsupervised NCO methods and effective cross-problem adaptation.
Read more
UniHetCO: A Unified Heterogeneous Representation for Multi-Problem Learning in Unsupervised Neural Combinatorial Optimization
Summary
The paper presents UniHetCO, a novel framework for unsupervised neural combinatorial optimization (NCO) that addresses the limitations of existing methods which typically focus on single problem classes. By introducing a unified heterogeneous graph representation, UniHetCO encodes the structure, objectives, and constraints of various combinatorial optimization problems into a single input format. This allows for the training of a single model across multiple problem classes without the need for ground-truth solutions. A key innovation is the implementation of a gradient-norm-based dynamic weighting scheme that mitigates gradient imbalance during multi-problem learning, ensuring that no single problem class dominates the training process. The authors demonstrate the effectiveness of their approach through experiments on diverse datasets, showing that UniHetCO achieves competitive performance compared to state-of-the-art unsupervised NCO methods, exhibits strong cross-problem adaptation capabilities, and serves as an effective warm start for classical solvers under time constraints.
Methodology
The authors developed a heterogeneous graph input representation based on the general Quadratic Programming (QP) formulation, which allows for the encoding of multiple combinatorial optimization problems into a single model. They utilized Quadratic Unconstrained Binary Optimization (QUBO) to create a universal unsupervised loss function applicable across different problem classes. To address gradient imbalance during training, a dynamic weighting strategy was implemented, normalizing the contributions of each problem class based on the Euclidean norm of their gradients.
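The two ingredients can be sketched generically: a relaxed QUBO loss in which binary variables are replaced by probabilities, and a weighting that divides each problem class's contribution by its gradient norm. This is a standard formulation in unsupervised neural CO; the paper's exact loss and weighting may differ in detail.

```python
import math

def qubo_loss(p, Q):
    """Relaxed QUBO objective p^T Q p, with probabilities p in [0, 1]
    standing in for binary decision variables x."""
    n = len(p)
    return sum(p[i] * Q[i][j] * p[j] for i in range(n) for j in range(n))

def grad_norm_weights(grads, eps=1e-8):
    """Dynamic weighting sketch: weight each problem class inversely to the
    Euclidean norm of its gradient (one flat gradient vector per class),
    normalized to sum to one, so no class dominates the shared update."""
    inv = [1.0 / (math.sqrt(sum(g * g for g in grad)) + eps) for grad in grads]
    total = sum(inv)
    return [i / total for i in inv]
```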
Results
The experiments conducted across various datasets and problem classes demonstrated that UniHetCO outperformed existing unsupervised NCO baselines. The framework showed strong adaptability to different problem classes and provided effective warm starts for classical solvers, achieving high-quality approximations within constrained time limits.
Implications
The development of a unified model for multiple combinatorial optimization problems has significant implications for practical applications in logistics, network design, and resource allocation. It reduces the need for separate models for each problem class, thereby lowering training and deployment costs while facilitating knowledge transfer across related problems.
Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives
Reinforcement Learning
Robotics
Optimization
- Introduction of MMDDPG framework for robust policy learning in continuous control tasks.
- Formulation of training as a minimax optimization problem between user and adversary.
- Use of a fractional objective to balance performance and disturbance magnitude.
- Demonstrated improved robustness against external disturbances and model uncertainties.
Read more
Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives
Summary
This paper addresses the challenge of ensuring robust performance in reinforcement learning (RL) agents when faced with external disturbances and model uncertainties. The authors propose a novel framework called minimax deep deterministic policy gradient (MMDDPG), which formulates the training process as a minimax optimization problem between a user policy and an adversarial disturbance policy. The key innovation is the introduction of a fractional objective that balances task performance with disturbance magnitude, preventing excessively aggressive disturbances while still allowing the adversary to effectively challenge the controller. The experimental results demonstrate that MMDDPG significantly enhances robustness against external force perturbations and parametric variations in continuous control tasks, particularly in MuJoCo environments. This approach not only stabilizes the training process but also improves the sample efficiency of off-policy deterministic policy gradient methods, making it a promising solution for real-world applications where reliability under uncertainty is critical.
Methodology
The authors developed the MMDDPG framework, which involves a two-player zero-sum game setup where the user policy aims to minimize a cost function while the adversarial policy seeks to maximize it. The fractional objective introduced helps to regulate the magnitude of disturbances, ensuring that the adversary's actions do not destabilize the learning process.
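One plausible form of such a fractional objective, shown purely for illustration (the paper's exact functional form is not reproduced here): the adversary's payoff is the task cost divided by a regularized disturbance magnitude, so ever-larger disturbances yield diminishing returns instead of destabilizing training.

```python
def fractional_adversary_objective(cost, disturbance, eps=1.0):
    """Illustrative fractional payoff for the adversary: cost inflicted per
    unit of squared disturbance magnitude. `eps` keeps the ratio finite as
    the disturbance shrinks. Assumed form, not the paper's equation."""
    return cost / (eps + sum(d * d for d in disturbance))
```

Under this shape, doubling the disturbance only pays off if it more than quadruples the inflicted cost, which is the intended brake on excessively aggressive adversaries.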
Results
Experimental evaluations in MuJoCo environments showed that MMDDPG outperformed conventional RL baselines, achieving significantly improved robustness to external force perturbations and resilience to variations in actuator parameters. The results indicate that the proposed method effectively stabilizes the learning process and enhances the agent's performance in uncertain environments.
Implications
The findings suggest that MMDDPG can be applied in safety-critical domains such as robotics and autonomous systems, where reliable performance under uncertainty is essential. This framework could lead to more resilient RL agents capable of operating effectively in dynamic and unpredictable environments.
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
NLP
Large Language Models
Theory
- Adversarial prompt injection can significantly amplify the attack success rate of large language models.
- The scaling of attack success rates transitions from polynomial to exponential growth depending on the strength of the injected prompts.
- A theoretical model based on spin-glass theory provides insights into the dynamics of language generation and adversarial behavior in LLMs.
- The proposed SpinLLM model allows for the analysis of inference-time scaling and the effects of prompt injection on attack success rates.
Read more
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Summary
This paper investigates the scaling laws of adversarial attacks on large language models (LLMs), particularly focusing on prompt-injection attacks that can lead to unsafe behaviors. The authors empirically demonstrate that the attack success rate (ASR) transitions from polynomial to exponential growth as the number of inference-time samples increases, particularly under adversarial prompt injection. To explain this phenomenon, they propose a theoretical generative model based on spin-glass theory, which captures the dynamics of language generation in LLMs. The model, termed SpinLLM, treats tokens as spins in a spin-glass system, where the influence of injected prompts is likened to applying a magnetic field that biases the model towards unsafe outputs. The authors derive analytical expressions for the ASR in both weak and strong magnetic field regimes, confirming their theoretical predictions with empirical data from various LLMs. This work not only elucidates the mechanisms behind adversarial prompt injection but also provides a framework for understanding the scaling behavior of attack success rates in LLMs.
Methodology
The authors developed a generative model inspired by spin-glass theory, where language generation is modeled as a system of spins. They analyzed the effects of prompt injections by comparing a teacher model (defining safe and unsafe outputs) with a student model (subject to prompt injections). The scaling of attack success rates was derived analytically and validated empirically across different LLMs.
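For intuition only: under a simple i.i.d. sampling assumption, repeated generation amplifies attack success via the standard best-of-N relation below. The paper's polynomial-exponential crossover is derived from its spin-glass model, not from this relation; it is included just to show why inference-time sampling is the relevant axis.

```python
def best_of_n_asr(p, n):
    """Probability that at least one of n independent generations is unsafe,
    given per-sample unsafe probability p: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n
```

Even a small per-sample probability compounds quickly with the sample budget, which is why ASR-versus-N curves are the natural object for jailbreak scaling laws.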
Results
The study found that for certain models, the attack success rate grows polynomially with the number of samples in the absence of prompt injection, while for others, particularly weaker models, the growth can become exponential with prompt injection. The theoretical model accurately predicted these behaviors, demonstrating a clear transition between polynomial and exponential scaling based on the strength of the injected prompts.
Implications
This research has significant implications for the development of safer AI systems, as it highlights vulnerabilities in large language models to adversarial attacks. Understanding the scaling laws can inform the design of more robust safety mechanisms and guide future research on adversarial resilience in AI.
FlexRec: Adapting LLM-based Recommenders for Flexible Needs via Reinforcement Learning
Reinforcement Learning
Large Language Models
NLP
- FlexRec enables LLM-based recommenders to adapt to dynamic user needs and business objectives.
- The framework introduces item-level rewards and uncertainty modeling to enhance training stability.
- FlexRec outperforms traditional recommenders and LLM-based baselines in multiple recommendation scenarios.
- The approach allows for efficient generalization to unseen needs with a single LLM model.
Read more
FlexRec: Adapting LLM-based Recommenders for Flexible Needs via Reinforcement Learning
Summary
The paper introduces FlexRec, a novel framework designed to enhance LLM-based recommender systems by enabling them to adapt to dynamic and need-specific objectives. Traditional recommender systems often struggle with flexibility, as they are typically optimized for a single static target. FlexRec leverages reinforcement learning (RL) to align LLMs with complex recommendation goals, specifically addressing challenges related to sequence-level rewards and sparse interaction feedback. The authors propose two key innovations: (1) a causally grounded item-level reward system that utilizes counterfactual swaps to provide fine-grained training signals, and (2) a critic-guided, uncertainty-aware scaling mechanism that models reward uncertainty to stabilize learning. The experimental results demonstrate that FlexRec significantly improves recommendation performance, achieving up to 59% gains in NDCG@5 and 109.4% in Recall@5 across various scenarios, while also showing strong generalization capabilities to unseen needs. This positions FlexRec as a competitive solution for dynamic recommendation tasks, capable of adapting its strategies without the need for retraining on specific objectives.
Methodology
FlexRec employs a post-training reinforcement learning framework that incorporates two main components: (1) an item-level reward system based on counterfactual swaps to provide detailed feedback on item placements, and (2) a critic model that predicts reward values along with their uncertainties, which helps to down-weight unreliable rewards during training. This dual approach addresses the challenges of sparse feedback and coarse credit assignment in traditional recommendation systems.
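A counterfactual-swap reward can be sketched generically: score the ranked list, swap two items, and credit the metric difference to the swapped position. `dcg` below is a standard stand-in list metric, not necessarily the one FlexRec uses, and the function names are illustrative.

```python
import math

def dcg(ranking, rels):
    """Discounted cumulative gain of a ranking, given per-item relevances."""
    return sum(rels[item] / math.log2(pos + 2)
               for pos, item in enumerate(ranking))

def swap_reward(ranking, i, j, metric):
    """Counterfactual-swap credit: the metric change caused by swapping the
    items at positions i and j. Positive means position i was well placed."""
    base = metric(ranking)
    swapped = list(ranking)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return base - metric(swapped)
```

This turns a single sequence-level score into per-position signals, which is the fine-grained credit assignment the sequence-level reward lacks.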
Results
FlexRec achieves significant improvements in recommendation metrics, including up to 59% enhancement in NDCG@5 and up to 109.4% improvement in Recall@5 for need-specific ranking tasks. Additionally, it shows a 24.1% increase in Recall@5 under generalization settings, outperforming both traditional and LLM-based recommendation systems.
Implications
The findings suggest that FlexRec can be effectively utilized in real-world recommendation systems that require adaptability to varying user intents and business goals. Its ability to generalize across different needs makes it a promising candidate for universal recommendation applications, potentially transforming how recommender systems are designed and implemented.
CAETC: Causal Autoencoding and Treatment Conditioning for Counterfactual Estimation over Time
Time Series
Theory
Optimization
- CAETC addresses time-dependent confounding bias in counterfactual estimation.
- The method is model-agnostic and can be applied to various sequence architectures.
- An entropy maximization adversarial game is proposed to ensure balanced representation.
- CAETC shows significant improvements over existing counterfactual estimation methods.
Read more
CAETC: Causal Autoencoding and Treatment Conditioning for Counterfactual Estimation over Time
Summary
The paper introduces CAETC, a novel method for counterfactual estimation over time, addressing the challenge of time-dependent confounding bias in observational data. The authors propose a causal autoencoding and treatment conditioning framework that leverages an autoencoding architecture to learn a partially invertible and treatment-invariant representation. This representation allows for treatment-specific conditioning to predict outcomes effectively. The method is model-agnostic, meaning it can be applied to various architectures, including LSTMs and temporal convolution networks. The authors conduct extensive experiments on synthetic, semi-synthetic, and real-world datasets, demonstrating that CAETC significantly outperforms existing counterfactual estimation methods. The paper highlights the importance of balancing representation across treatment regimes and provides a theoretical foundation for the proposed adversarial game that minimizes the generalized Jensen-Shannon divergence, ensuring a balanced representation for accurate outcome predictions.
Methodology
The CAETC method employs a causal autoencoding architecture to learn a partially invertible representation that is invariant to treatment. It incorporates treatment conditioning to predict future outcomes and utilizes an entropy maximization adversarial game to achieve balanced representation across treatment regimes. This approach is independent of the underlying sequence model, allowing for flexibility in implementation.
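The balancing target can be written down concretely: the generalized Jensen-Shannon divergence of K distributions is the entropy of their mixture minus the mixture of their entropies, and it is zero exactly when the distributions coincide. A discrete-distribution sketch (the paper applies this to learned representations, not raw histograms):

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (nats)."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def generalized_jsd(dists, weights=None):
    """Generalized Jensen-Shannon divergence of K discrete distributions:
    H(sum_k w_k P_k) - sum_k w_k H(P_k). Zero iff all distributions agree."""
    k = len(dists)
    w = weights or [1.0 / k] * k
    mix = [sum(wi * d[j] for wi, d in zip(w, dists)) for j in range(len(dists[0]))]
    return entropy(mix) - sum(wi * entropy(d) for wi, d in zip(w, dists))
```

Driving this quantity toward zero across treatment regimes is what "balanced representation" means operationally in the adversarial game.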
Results
The experimental results indicate that CAETC achieves substantial improvements in counterfactual estimation accuracy compared to existing methods across various datasets, including synthetic, semi-synthetic, and real-world data. The theoretical analysis supports the effectiveness of the adversarial game in minimizing representation bias.
Implications
The findings suggest that CAETC can enhance personalized medicine and other applications requiring accurate counterfactual estimations by effectively managing time-dependent confounding biases. This method could lead to better decision-making processes in healthcare and other domains reliant on causal inference.
Ensuring Safety in Automated Mechanical Ventilation through Offline Reinforcement Learning and Digital Twin Verification
Reinforcement Learning
Time Series
Optimization
- Introduction of T-CQL, a novel offline RL framework that incorporates temporal modeling and safety measures.
- Development of a clinically relevant reward function that captures early indicators of VILI.
- Validation of the framework using digital twin simulations for real-time policy evaluation.
- Demonstration of improved performance over existing offline RL methods in optimizing mechanical ventilation.
Read more
Ensuring Safety in Automated Mechanical Ventilation through Offline Reinforcement Learning and Digital Twin Verification
Summary
This paper addresses the critical issue of ensuring safety in automated mechanical ventilation (MV) for patients with acute respiratory failure (ARF) in intensive care units (ICUs). The authors propose a novel offline reinforcement learning (RL) framework called Transformer-based Conservative Q-Learning (T-CQL), which integrates a Transformer encoder to effectively model temporal dependencies in patient dynamics. The framework employs conservative adaptive regularization to ensure safety and utilizes a clinically informed reward function that captures indicators of ventilator-induced lung injury (VILI) and patient severity. The study highlights the limitations of previous approaches that relied on static offline data and mortality-based rewards, which failed to adequately represent the dynamic nature of patient conditions. To validate their approach, the authors utilized interactive digital twins of ARF patients for real-time evaluation of the RL policies. The results indicate that T-CQL consistently outperforms existing offline RL methodologies, providing safer and more effective ventilatory adjustments, thereby demonstrating the potential of combining Transformer models with conservative RL strategies as a decision support tool in critical care settings.
Methodology
The methodology involves the development of the T-CQL framework, which combines a Transformer encoder for temporal feature extraction, a multi-head Q-network for action-value estimation, and uncertainty quantification modules for safe policy learning. The framework is validated using digital twin simulations of ARF patients to evaluate the RL policies in a dynamic clinical environment.
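The conservative ingredient of CQL-style methods has a standard form: penalize the gap between a soft maximum over all actions and the Q-value of the action actually logged in the dataset. A minimal discrete-action sketch (T-CQL adds the Transformer encoder and adaptive regularization on top of this):

```python
import math

def cql_penalty(q_values, data_actions):
    """Conservative penalty: mean over the batch of
    logsumexp_a Q(s, a) - Q(s, a_data). Large when the learned Q-function
    assigns high values to actions absent from the data."""
    total = 0.0
    for q_row, a in zip(q_values, data_actions):
        lse = math.log(sum(math.exp(q) for q in q_row))  # soft max over actions
        total += lse - q_row[a]
    return total / len(q_values)
```

Minimizing this alongside the usual TD loss pushes Q-values down on out-of-distribution actions, which is the safety mechanism that matters in an offline clinical setting.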
Results
The results show that T-CQL significantly outperforms state-of-the-art offline RL methods in providing safer and more effective ventilatory adjustments. The framework's ability to capture temporal dynamics and early indicators of VILI contributes to improved decision-making in mechanical ventilation.
Implications
The findings suggest that T-CQL can serve as a valuable decision support tool in critical care, potentially leading to better patient outcomes through personalized and automated mechanical ventilation strategies. The integration of digital twin technology for real-time evaluation also opens avenues for further research in clinical applications of RL.
Beyond the Class Subspace: Teacher-Guided Training for Reliable Out-of-Distribution Detection in Single-Domain Models
Computer Vision
Theory
- Identification of Domain-Sensitivity Collapse (DSC) as a critical failure mode in single-domain OOD detection.
- Introduction of Teacher-Guided Training (TGT) to enhance domain sensitivity in feature representations.
- Demonstration of significant improvements in OOD detection performance across multiple benchmarks.
- TGT maintains in-domain classification accuracy while reducing OOD detection false positives.
Read more
Beyond the Class Subspace: Teacher-Guided Training for Reliable Out-of-Distribution Detection in Single-Domain Models
Summary
This paper addresses the challenge of out-of-distribution (OOD) detection in single-domain models, which often suffer from a phenomenon termed Domain-Sensitivity Collapse (DSC). DSC occurs when supervised training compresses features into a low-rank class subspace, leading to a loss of sensitivity to domain shifts. The authors propose a novel approach called Teacher-Guided Training (TGT), which utilizes a frozen multi-domain teacher model (DINOv2) to distill class-suppressed residual structures into a student model during training. This method aims to restore domain-sensitive geometry without adding inference overhead. The results demonstrate that TGT significantly reduces the false positive rate at 95% recall (FPR@95) for various distance-based OOD detection methods across eight single-domain benchmarks while maintaining or slightly improving in-domain OOD and classification accuracy. The findings highlight the geometric nature of OOD detection failures and provide a practical solution for enhancing the reliability of single-domain models.
Methodology
The authors formalize the concept of Domain-Sensitivity Collapse (DSC) and propose Teacher-Guided Training (TGT) as a solution. TGT involves training a student model using residual structures from a frozen multi-domain teacher model, focusing on class-suppressed features to enhance domain sensitivity. The teacher and auxiliary components are discarded post-training, ensuring no inference overhead.
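The distance-based detectors that TGT improves share a simple scoring rule; the kNN score, for example, is the distance from a query feature to its k-th nearest training feature. A minimal sketch with feature extraction omitted (TGT changes the features these scores operate on, not the scores themselves):

```python
import math

def knn_ood_score(train_feats, query, k=1):
    """kNN OOD score: Euclidean distance from `query` to its k-th nearest
    training feature. Larger scores indicate more out-of-distribution inputs."""
    dists = sorted(math.dist(feat, query) for feat in train_feats)
    return dists[k - 1]
```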
Results
TGT achieved substantial reductions in FPR@95 for distance-based OOD detection methods: MDS improved by 11.61 percentage points, ViM by 10.78 percentage points, and kNN by 12.87 percentage points on average across ResNet-50. The method also maintained or slightly improved in-domain OOD and classification accuracy across eight benchmarks.
Implications
The findings suggest that TGT can be effectively applied in practical systems trained on single-domain data, improving OOD detection reliability in fields such as medical imaging, remote sensing, and industrial inspection. This work opens avenues for further research into geometric properties of representations in machine learning.
Personalized Federated Learning via Gaussian Generative Modeling
Federated Learning
Generative Models
Optimization
- Introduces pFedGM, a personalized federated learning method using Gaussian generative modeling.
- Balances global collaboration and personalization through a dual objective approach.
- Decouples the Gaussian classifier into a navigator and a statistic extractor for improved representation learning.
- Employs a dual-scale fusion framework for personalized classifier head development.
Read more
Personalized Federated Learning via Gaussian Generative Modeling
Summary
This paper addresses the challenges of personalized federated learning (PFL) in the context of data heterogeneity among clients. Traditional federated learning approaches often struggle with non-IID data distributions, leading to suboptimal model performance. The authors propose a novel method called pFedGM, which utilizes Gaussian generative modeling to better capture client-specific data characteristics. The method involves training a Gaussian generator that models client heterogeneity through weighted re-sampling. pFedGM employs a dual objective approach that maximizes inter-class distance across clients while minimizing intra-class distance within them. This is achieved by decoupling the Gaussian classifier into a navigator for global optimization and a statistic extractor for capturing distributional statistics. The framework incorporates a dual-scale fusion mechanism to provide each client with a personalized classifier head, enabling Bayesian inference for class probability estimation. The evaluation of pFedGM across various scenarios, including class count heterogeneity and environmental corruption, demonstrates its superior performance compared to existing state-of-the-art methods.
Methodology
The methodology involves training a Gaussian generator to model client heterogeneity, employing a dual objective that maximizes inter-class distance and minimizes intra-class distance. The Gaussian classifier is decoupled into a navigator for global optimization and a statistic extractor for capturing distributional statistics. A dual-scale fusion framework is used to personalize classifier heads for each client, enabling Bayesian inference for class probability estimation.
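Bayesian class-probability estimation with a Gaussian classifier can be sketched with isotropic per-class Gaussians and Bayes' rule; pFedGM's fused, per-client statistics are richer than this, so treat the sketch as the underlying principle rather than the method itself.

```python
import math

def gaussian_posterior(x, means, var=1.0, priors=None):
    """Posterior class probabilities for an isotropic Gaussian classifier:
    p(c | x) proportional to prior(c) * exp(-||x - mu_c||^2 / (2 var))."""
    k = len(means)
    pri = priors or [1.0 / k] * k
    loglik = [-sum((m - xi) ** 2 for m, xi in zip(mu, x)) / (2 * var)
              for mu in means]
    mx = max(loglik)  # subtract max for numerical stability
    unnorm = [p * math.exp(l - mx) for p, l in zip(pri, loglik)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```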
Results
pFedGM outperforms or matches the performance of state-of-the-art methods in various scenarios, including those with class count heterogeneity and environmental corruption, demonstrating its effectiveness in addressing the challenges of personalized federated learning.
Implications
The proposed method has significant implications for applications requiring privacy-preserving collaborative learning, particularly in environments with heterogeneous data distributions. It can enhance model performance in various domains, including healthcare, finance, and IoT, where data privacy and security are paramount.
Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT
Computer Vision
Interpretability
- The study reveals that pre-trained Video Vision Transformers can represent nuanced action outcomes distinctly, despite producing the same final classification.
- Mechanistic interpretability techniques, including delta analysis and activation patching, are employed to uncover the internal workings of the model.
- Attention Heads and MLP Blocks have distinct roles, with Attention Heads gathering evidence and MLP Blocks composing concepts for outcome representation.
- The findings highlight the potential for AI models to develop hidden knowledge, necessitating careful oversight for trustworthy AI deployment.
Read more
Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT
Summary
This paper investigates the internal mechanisms of video models, specifically a pre-trained Video Vision Transformer (ViViT), to understand how they represent nuanced semantic information related to action outcomes. The study employs mechanistic interpretability techniques to reverse-engineer the model's internal circuits responsible for determining success or failure in classification tasks. The analysis reveals that while low-level differences are present from the initial layers, the representation of outcomes is progressively amplified in the mid-layers (5 to 11). The causal analysis, utilizing activation patching and ablation studies, identifies a division of labor within the model: Attention Heads serve as 'evidence gatherers' for low-level information, while MLP Blocks act as 'concept composers' that generate the success signal. This sophisticated internal circuit demonstrates the model's ability to develop hidden knowledge beyond its explicit task, emphasizing the importance of mechanistic oversight for building trustworthy AI systems.
Methodology
The methodology combines observational techniques (attention visualization) and quantitative analysis (delta analysis) on contrastive video pairs to identify internal outcome signals. Activation patching is used to measure causal effects and determine the functional roles of specific components within the model.
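Activation patching can be illustrated on a toy model (all names and the three-layer MLP stand-in are assumptions, not the paper's ViViT setup): cache an activation from a clean run, splice it into a corrupted run, and measure how much of the clean output the patch restores.

```python
import numpy as np

rng = np.random.default_rng(1)
# A toy 3-layer MLP stands in for a stack of transformer blocks.
weights = [rng.normal(0, 0.5, (4, 4)) for _ in range(3)]

def forward(x, patch_layer=None, patch_value=None):
    """Run the model while caching activations; optionally overwrite one
    layer's activation with a cached value (the patching intervention)."""
    acts, h = [], x
    for i, w in enumerate(weights):
        h = np.tanh(h @ w)
        if i == patch_layer:
            h = patch_value            # splice in the cached activation
        acts.append(h)
    return h, acts

clean, corrupt = np.ones(4), -np.ones(4)
clean_out, clean_acts = forward(clean)
corrupt_out, _ = forward(corrupt)
# Patch the clean activation at layer 1 into the corrupted run; the gap
# it closes toward the clean output is that layer's causal effect.
patched_out, _ = forward(corrupt, patch_layer=1, patch_value=clean_acts[1])
```

Sweeping the patch over layers and components (attention heads vs. MLP blocks in the real model) is what localizes where the outcome signal lives.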
Results
The analysis demonstrates that the Video Vision Transformer distinctly represents action outcomes through a layered amplification process, with Attention Heads and MLP Blocks playing critical roles in the model's decision-making process. The results provide strong causal evidence for the internal circuit's resilience to simple ablations.
Implications
The findings suggest that video models can possess complex internal representations that extend beyond their training tasks, highlighting the need for interpretability in AI systems, especially in high-stakes applications. This research underscores the importance of mechanistic oversight to ensure the reliability and trustworthiness of AI models.
ARROW: Augmented Replay for RObust World models
Reinforcement Learning
Robotics
Efficient ML
- ARROW introduces a dual-buffer system for memory-efficient experience replay in continual reinforcement learning.
- The method significantly reduces catastrophic forgetting while maintaining comparable forward transfer to new tasks.
- ARROW is evaluated in both non-shared and shared structure environments, showcasing its versatility.
- The findings suggest that model-based reinforcement learning can effectively address challenges in continual learning.
Read more
ARROW: Augmented Replay for RObust World models
Summary
The paper introduces ARROW (Augmented Replay for RObust World models), a novel model-based continual reinforcement learning (CRL) algorithm that addresses the challenge of catastrophic forgetting in agents learning sequential tasks. Traditional model-free methods often rely on large replay buffers, which can be memory-intensive and inefficient. In contrast, ARROW draws inspiration from neuroscience, utilizing a memory-efficient replay mechanism that maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. The authors evaluate ARROW across two challenging CRL settings: tasks without shared structure (Atari) and tasks with shared structure (Procgen CoinRun variants). The results demonstrate that ARROW significantly reduces forgetting compared to both model-free and model-based baselines while maintaining comparable forward transfer, highlighting the effectiveness of model-based approaches and bio-inspired strategies in continual learning.
Methodology
ARROW extends the DreamerV3 algorithm by implementing a strategically managed replay mechanism that utilizes two buffers: a short-term buffer for recent experiences and a long-term buffer for preserving task diversity. This approach allows for efficient memory usage while enabling the agent to learn from both recent and diverse past experiences.
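A minimal sketch of a dual-buffer replay along these lines (illustrative only; ARROW's actual sampling strategy may differ) pairs a FIFO short-term buffer with a reservoir-sampled long-term buffer:

```python
import random
from collections import deque

class DualBuffer:
    """Toy dual-buffer replay: a FIFO short-term buffer for recent
    transitions plus a long-term buffer that reservoir-samples the whole
    stream to preserve task diversity under a fixed memory budget."""
    def __init__(self, short_cap, long_cap, seed=0):
        self.short = deque(maxlen=short_cap)
        self.long, self.long_cap = [], long_cap
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, transition):
        self.short.append(transition)
        self.seen += 1
        if len(self.long) < self.long_cap:
            self.long.append(transition)
        else:
            j = self.rng.randrange(self.seen)     # reservoir sampling
            if j < self.long_cap:
                self.long[j] = transition

    def sample(self, n, recent_frac=0.5):
        """Mix recent experience with diverse long-term experience."""
        k = min(int(n * recent_frac), len(self.short))
        batch = self.rng.sample(list(self.short), k)
        batch += self.rng.choices(self.long, k=n - k)
        return batch

buf = DualBuffer(short_cap=8, long_cap=16)
for task in range(4):                 # four sequential tasks
    for step in range(100):
        buf.add((task, step))
batch = buf.sample(8)
```

Reservoir sampling keeps each past transition in the long-term buffer with equal probability, which is one simple way to retain task diversity without growing memory.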
Results
ARROW was tested on two continual learning regimes, demonstrating significantly less forgetting in tasks without shared structure compared to existing model-free and model-based baselines. It also maintained comparable forward transfer, indicating that the method is effective in leveraging prior knowledge for new tasks.
Implications
The findings suggest that ARROW could be applied in various real-world scenarios where continual learning is essential, such as robotics and adaptive AI systems. The model-based approach may lead to more efficient learning in dynamic environments, reducing memory requirements and improving performance.
Huntington Disease Automatic Speech Recognition with Biomarker Supervision
Audio & Speech
- Introduces a high-fidelity clinical corpus for HD speech ASR, the first of its kind for end-to-end evaluation.
- Demonstrates that different ASR architectures exhibit unique error patterns when processing HD speech.
- Achieves a significant reduction in WER through HD-specific adaptations of the Parakeet-TDT model.
- Proposes the use of biomarker-based auxiliary supervision to enhance ASR performance and analyzes its effects on error behavior.
Read more
Huntington Disease Automatic Speech Recognition with Biomarker Supervision
Summary
This paper addresses the challenges of automatic speech recognition (ASR) for individuals with Huntington's disease (HD), a condition characterized by irregular speech patterns due to motor-speech disorders. The authors present a systematic study utilizing a high-fidelity clinical speech corpus, which includes recordings from 94 HD-positive individuals and 36 healthy controls. They compare various ASR architectures, revealing that HD speech induces architecture-specific error patterns, with the Parakeet-TDT model outperforming traditional encoder-decoder and CTC baselines. The study introduces HD-specific adaptations that significantly reduce word error rates (WER) from 6.99% to 4.95%. Additionally, the authors propose a novel method for incorporating biomarker-based auxiliary supervision, analyzing how this supervision reshapes error behavior in a severity-dependent manner. The research highlights the importance of specialized ASR systems for pathological speech and opens avenues for future work in atypical speech recognition.
Methodology
The authors conducted a comparative analysis of multiple ASR architectures using a clinical speech corpus. They adapted the Parakeet-TDT model specifically for HD speech using encoder-side adapters and incorporated clinically grounded biomarkers as auxiliary supervision for adaptation. The study involved a detailed error analysis across different ASR models and severity cohorts.
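For reference, the WER metric reported throughout is word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words; a minimal implementation (not the paper's evaluation code) is:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# One substitution against a 5-word reference -> WER 0.2 (hypothetical example).
score = wer("the patient spoke slowly today", "the patient spoke slow today")
```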
Results
The study found that the Parakeet-TDT model significantly outperformed other ASR architectures, achieving a WER reduction from 6.99% to 4.95% with HD-specific adaptations. The incorporation of biomarker-based auxiliary supervision reshaped error patterns in a manner dependent on the severity of the speech disorder, rather than uniformly improving WER.
Implications
This research has significant implications for the development of ASR systems tailored for individuals with Huntington's disease and potentially other motor-speech disorders. The findings suggest that incorporating clinical biomarkers can enhance ASR performance and provide insights into the unique speech characteristics of affected individuals.
Chemical Reaction Networks Learn Better than Spiking Neural Networks
Theory
- CRNs can solve classification tasks without requiring hidden layers, unlike SNNs.
- The study provides mathematical guarantees for the learning behavior of CRNs.
- Numerical experiments show CRNs outperform SNNs in accuracy and efficiency for digit classification.
- The findings suggest potential advantages of biochemical networks over neuronal networks in learning tasks.
Read more
Chemical Reaction Networks Learn Better than Spiking Neural Networks
Summary
This paper presents a mathematical proof demonstrating that chemical reaction networks (CRNs) without hidden layers can effectively solve classification tasks that require hidden layers in spiking neural networks (SNNs). The authors utilize deterministic mass-action kinetics to establish that a specific CRN can learn a classification task previously achievable by an SNN with hidden layers. They provide analytical regret bounds for the network's global behavior and analyze its asymptotic behavior and Vapnik–Chervonenkis dimension. In numerical experiments, the proposed CRN outperforms an SNN with hidden layers in classifying handwritten digits, achieving higher accuracy and efficiency. This work suggests that chemical computers may offer more efficient learning mechanisms compared to traditional neuronal networks, providing insights into the learning capabilities of biological cells through biochemical processes.
Methodology
The authors establish a CRN model based on deterministic mass-action kinetics, drawing parallels to stochastic models of SNNs. They derive theoretical guarantees for the CRN's learning capabilities and conduct numerical experiments to validate its performance in supervised classification tasks.
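Deterministic mass-action kinetics can be illustrated with a toy reaction A + B -> C (a hypothetical example, not the paper's learning CRN): each species' rate of change is proportional to the product of the reactant concentrations.

```python
def simulate_mass_action(a, b, c, k=1.0, dt=1e-3, steps=5000):
    """Euler-integrate deterministic mass-action kinetics for the toy
    reaction A + B -> C: the instantaneous rate is k * [A] * [B], and
    each firing consumes one A and one B while producing one C."""
    for _ in range(steps):
        rate = k * a * b
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
    return a, b, c

a, b, c = simulate_mass_action(1.0, 0.5, 0.0)   # B is limiting, so C -> 0.5
```

Conservation laws such as [A] + [C] and [B] + [C] staying constant in this toy system are what make mass-action dynamics amenable to the kind of analytical guarantees the paper derives.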
Results
The CRN demonstrated the ability to classify handwritten digits with greater accuracy and efficiency than an SNN with hidden layers. The theoretical analysis provided regret bounds and insights into the CRN's learning dynamics, confirming its effectiveness as a learning machine.
Implications
This research opens avenues for utilizing CRNs in machine learning applications, suggesting that biochemical systems may provide more efficient learning mechanisms than traditional neural networks. It also encourages further exploration of CRNs in computational tasks and their potential biological relevance.
Meta-Reinforcement Learning with Self-Reflection for Agentic Search
Reinforcement Learning
Large Language Models
NLP
- MR-Search leverages self-reflection to improve exploration strategies in agentic search tasks.
- The method conditions on past episodes, allowing for adaptive learning across multiple interactions.
- A novel multi-turn RL algorithm is introduced for precise credit assignment during training.
- Empirical results show significant performance improvements over baseline RL methods.
Read more
Meta-Reinforcement Learning with Self-Reflection for Agentic Search
Summary
This paper presents MR-Search, a novel approach to meta-reinforcement learning (RL) designed to enhance agentic search through self-reflection. Unlike traditional RL methods that optimize policies within isolated episodes and rely on sparse rewards, MR-Search conditions its search strategy on past episodes, allowing agents to adapt and improve their exploration tactics over time. The key innovation is the incorporation of explicit self-reflections after each episode, which serve as additional context for guiding future search attempts. This iterative self-reflection process enables agents to consolidate knowledge across episodes, transforming exploration into a more informed and effective search process. The authors also introduce a multi-turn RL algorithm that provides fine-grained credit assignment at the turn level, facilitating better learning dynamics. Empirical evaluations across various benchmarks demonstrate that MR-Search significantly outperforms existing RL-based methods, achieving relative improvements of 9.2% to 19.3% across eight different tasks, thereby validating its effectiveness in enhancing agentic search capabilities.
Methodology
MR-Search employs a meta-reinforcement learning framework that integrates self-reflection after each episode. This allows agents to learn from previous experiences and adjust their strategies accordingly. The multi-turn RL algorithm estimates relative advantages at the turn level, enabling localized credit assignment without the need for auxiliary value models.
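One way to realize value-model-free, turn-level credit assignment is group-relative normalization, sketched below (an assumption for illustration; MR-Search's exact estimator may differ): each rollout's reward at turn t is compared against the mean and standard deviation of turn-t rewards across a group of rollouts.

```python
import statistics

def turn_level_advantages(turn_rewards):
    """Group-relative advantages: normalize each rollout's reward at turn
    t against the group's turn-t statistics, localizing credit per turn
    without an auxiliary value model."""
    n_turns = len(turn_rewards[0])
    advantages = []
    for rollout in turn_rewards:
        adv = []
        for t in range(n_turns):
            group = [r[t] for r in turn_rewards]
            mean = statistics.fmean(group)
            std = statistics.pstdev(group) or 1.0   # guard zero-variance turns
            adv.append((rollout[t] - mean) / std)
        advantages.append(adv)
    return advantages

# Four rollouts of the same query, two turns each, binary turn rewards.
rollouts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
advs = turn_level_advantages(rollouts)
```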
Results
The empirical results indicate that MR-Search achieves relative improvements of 9.2% to 19.3% over strong baseline methods across eight benchmarks, demonstrating its effectiveness in enhancing agentic search performance.
Implications
The findings suggest that incorporating self-reflection in meta-reinforcement learning can significantly improve the efficiency and effectiveness of search agents in complex decision-making tasks. This approach could be applied to various domains requiring multi-turn interactions and adaptive learning, such as information retrieval and autonomous systems.
A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization
Time Series
- Introduces a multi-label classification approach for predicting TF binding sites, moving beyond binary classification.
- Utilizes Temporal Convolutional Networks (TCNs) to capture correlations among multiple TFs effectively.
- Demonstrates that TCNs outperform traditional RNNs and attention-based models in biological sequence analysis.
- Reveals biologically meaningful motifs and novel TF interactions through model explainability.
Read more
A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization
Summary
This paper addresses the challenge of predicting transcription factor (TF) binding sites on DNA sequences, emphasizing the cooperative nature of TF interactions. Traditional approaches have primarily focused on binary classification for individual TFs, neglecting the complex interplay among multiple TFs. The authors propose a novel multi-label classification framework utilizing Temporal Convolutional Networks (TCNs) to simultaneously predict binding profiles for multiple TFs. This approach captures the correlations and cooperative regulatory mechanisms among TFs, offering insights into biologically meaningful motifs and co-binding patterns. The study demonstrates that TCNs outperform traditional recurrent neural networks (RNNs) and attention-based models in this context, providing reliable predictions with fewer data requirements. The integration of explainability methods further enhances the understanding of the biological relevance of the model's predictions, potentially guiding future experimental investigations.
Methodology
The authors employed Temporal Convolutional Networks (TCNs) for multi-label classification of TF binding sites, utilizing datasets derived from ChIP-seq data. They compared the performance of TCNs against traditional RNNs and attention-based models, focusing on their ability to learn correlations among TF labels from DNA sequence data. Explainability methods were applied to assess the biological relevance of the model's predictions.
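The core TCN building block is the dilated causal convolution, topped here with one independent sigmoid per TF label; a toy single-channel sketch (names, sizes, and the random weights are assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

def dilated_causal_conv(x, w, dilation):
    """Causal dilated 1-D convolution: output[t] sees only x[t], x[t-d],
    x[t-2d], ... (left zero-padding keeps the length fixed)."""
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
                     for t in range(len(x))])

def tiny_tcn_multilabel(x, layer_weights, head_weights):
    """Stack dilated layers with dilation 1, 2, 4 (so the receptive field
    grows exponentially), then map the last positions through an
    independent sigmoid per TF label."""
    h = x
    for i, w in enumerate(layer_weights):
        h = np.tanh(dilated_causal_conv(h, w, dilation=2 ** i))
    logits = head_weights @ h[-8:]           # one logit per TF label
    return 1 / (1 + np.exp(-logits))         # multi-label: independent sigmoids

x = rng.normal(size=32)                      # stand-in for an encoded sequence
layer_weights = [rng.normal(size=2) for _ in range(3)]
head_weights = rng.normal(size=(3, 8))       # 3 hypothetical TF labels
probs = tiny_tcn_multilabel(x, layer_weights, head_weights)
```

The multi-label part is simply that each TF gets its own sigmoid output trained jointly, which is what lets the shared representation encode correlations among TFs.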
Results
The study found that TCNs achieved reliable predictive performance for multiple TFs, revealing significant co-binding patterns and motifs consistent with known TF interactions. The results indicated that the multi-label learning framework could uncover novel relationships among TFs, enhancing the understanding of their cooperative regulatory mechanisms.
Implications
The findings suggest that the proposed multi-label TCN framework can significantly advance the field of transcriptional regulation research by providing a deeper understanding of TF interactions. This approach may facilitate the identification of new regulatory mechanisms and inform experimental designs in molecular biology.
Mitigating the Multiplicity Burden: The Role of Calibration in Reducing Predictive Multiplicity of Classifiers
Theory
Interpretability
- Predictive multiplicity can lead to conflicting outcomes for the same individual due to multiple near-optimal models.
- Minority class observations are disproportionately affected by predictive multiplicity.
- Post-hoc calibration methods can significantly reduce predictive multiplicity and improve prediction stability.
- Platt Scaling and Isotonic Regression are the most effective calibration techniques tested.
Read more
Mitigating the Multiplicity Burden: The Role of Calibration in Reducing Predictive Multiplicity of Classifiers
Summary
This paper investigates the relationship between classification calibration and predictive multiplicity, particularly in high-stakes environments such as credit risk assessment. Predictive multiplicity refers to the existence of multiple near-optimal models that yield conflicting predictions for the same instance, raising concerns about the reliability and stability of machine learning models. The study uses nine diverse credit risk benchmark datasets to analyze whether predictive multiplicity is concentrated in regions of low predictive confidence and how post-hoc calibration methods can mitigate this issue. The findings reveal that minority class observations experience a disproportionate multiplicity burden, with significant disparities in predictive multiplicity and confidence levels. The paper evaluates three calibration techniques—Platt Scaling, Isotonic Regression, and Temperature Scaling—and finds that Platt Scaling and Isotonic Regression are particularly effective in reducing predictive multiplicity. The results suggest that calibration can serve as a consensus-enforcing mechanism, promoting procedural fairness by addressing the arbitrariness inherent in predictive multiplicity.
Methodology
The study employs rigorous non-parametric statistical testing to analyze nine credit risk benchmark datasets, assessing the relationship between calibration and predictive multiplicity. It evaluates the effectiveness of three post-hoc calibration methods in reducing multiplicity and improving prediction confidence.
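Platt Scaling fits a logistic map p = sigmoid(a*s + b) from raw model scores to calibrated probabilities; a minimal gradient-descent version (a stand-in for library calibrators such as scikit-learn's, not the paper's experimental code):

```python
import math
import random

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a*s + b) to (score, label) pairs by gradient
    descent on the average logistic loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1 / (1 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# Synthetic overconfident classifier: raw scores are twice as extreme as
# the true log-odds (true slope 0.5), so calibration should shrink a below 1.
random.seed(0)
scores = [random.uniform(-4, 4) for _ in range(200)]
labels = [1 if random.random() < 1 / (1 + math.exp(-0.5 * s)) else 0
          for s in scores]
a, b = platt_scale(scores, labels)
```

Shrinking overconfident scores toward 0.5 is one intuition for why calibration can also reduce multiplicity: near-optimal models disagree most where confidence is unwarranted.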
Results
The analysis indicates that predictive multiplicity is concentrated in areas of low predictive confidence, particularly affecting minority class observations. Calibration methods, especially Platt Scaling and Isotonic Regression, significantly reduce predictive multiplicity and enhance the stability of credit decisions.
Implications
The findings highlight the importance of calibration in machine learning models deployed in high-stakes environments, suggesting that effective calibration can improve fairness and reliability in algorithmic decision-making processes, particularly in credit scoring and similar applications.
CFD-HAR: User-controllable Privacy through Conditional Feature Disentanglement
Time Series
- CFD allows for user-controllable privacy by separating sensitive attributes from activity features.
- The technique provides dynamic privacy filtering tailored to individual user preferences.
- CFD outperforms traditional perturbation methods by maintaining high recognition performance.
- A comparative analysis shows that few-shot HAR methods excel in label efficiency but compromise on privacy.
Read more
CFD-HAR: User-controllable Privacy through Conditional Feature Disentanglement
Summary
This paper addresses the dual challenges of user privacy and limited labeled data in Human Activity Recognition (HAR) systems deployed on wearable and mobile devices. The authors propose a novel technique called Conditional Feature Disentanglement (CFD), which allows users to control their privacy preferences by disentangling sensitive attributes from activity-related features in the data. This approach enables dynamic privacy filtering, allowing users to specify which attributes they consider sensitive and adjust the level of data perturbation accordingly. The authors compare CFD with a state-of-the-art few-shot HAR method that utilizes autoencoder-based representation learning. Their analysis reveals that while CFD provides explicit and tunable privacy controls, the autoencoder method offers superior label efficiency but lacks inherent privacy safeguards. The study highlights the security implications of both approaches in Internet of Things (IoT) settings, emphasizing the need for a unified framework that optimally balances privacy preservation, adaptability, and robustness for future IoT HAR systems.
Methodology
The authors developed a feature disentanglement technique that separates sensitive user attributes from activity-related information in the latent space. They conducted a comparative analysis with few-shot HAR using autoencoder representations, examining architectural designs, learning objectives, and privacy guarantees.
Results
The CFD-based HAR system demonstrated effective privacy controls by allowing users to specify sensitivity levels for different attributes. It maintained high recognition performance even with limited labeled data, while the autoencoder-based method showed better label efficiency but lacked privacy protections. The analysis revealed that neither method alone fully meets the requirements for next-generation IoT HAR systems.
Implications
The findings suggest that CFD could enhance privacy-preserving applications in sensitive domains like healthcare monitoring and personal wearables, where user trust and regulatory compliance are critical. The research points towards the need for unified frameworks that integrate privacy, adaptability, and robustness in IoT systems.
Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem
Theory
- Introduction of topological DeepONets that operate on locally convex spaces.
- Generalization of the Chen-Chen operator approximation theorem to a broader context.
- Construction of neural networks using continuous linear functionals from dual spaces.
- Demonstration of uniform approximation of continuous operators on compact sets.
Read more
Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem
Summary
This paper introduces a novel framework for Deep Operator Networks (DeepONets) that extends the classical operator approximation theory to locally convex spaces. Traditionally, DeepONets utilize a branch-trunk architecture to approximate nonlinear operators between function spaces, where the input is a function defined on a compact set. The author proposes a topological extension where the input function can belong to an arbitrary Hausdorff locally convex space. The architecture is constructed using continuous linear functionals from the dual space, allowing the branch component to act on the input space through linear measurements while the trunk component processes the output in a Euclidean domain. The main theorem demonstrates that continuous operators can be uniformly approximated by these topological DeepONets, thereby generalizing the Chen-Chen operator approximation theorem beyond the classical framework. This work not only broadens the applicability of DeepONets to more complex function spaces but also provides a theoretical foundation for their use in various scientific and engineering applications.
Methodology
The paper develops a topological framework for DeepONets by constructing feedforward neural networks that utilize continuous linear functionals from the dual space of a locally convex space. The architecture is designed to process inputs through linear measurements while maintaining the traditional branch-trunk structure for output generation.
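The classical branch-trunk pairing that the paper generalizes can be sketched as follows (random untrained weights, purely illustrative; the topological version replaces the point-evaluation measurements used here with continuous linear functionals from the dual space):

```python
import numpy as np

rng = np.random.default_rng(3)

def mlp(params, x):
    """Tiny two-layer tanh MLP."""
    w1, b1, w2, b2 = params
    return w2 @ np.tanh(w1 @ x + b1) + b2

def init(in_dim, hidden, out_dim):
    return (rng.normal(0, 0.5, (hidden, in_dim)), np.zeros(hidden),
            rng.normal(0, 0.5, (out_dim, hidden)), np.zeros(out_dim))

p = 16                       # number of branch/trunk basis pairs
branch = init(8, 32, p)      # branch net: acts on 8 measurements of u
trunk = init(1, 32, p)       # trunk net: acts on the query point y

def deeponet(u_measurements, y):
    """G(u)(y) ~= <branch(Lu), trunk(y)>, where Lu is a vector of linear
    measurements of the input function u. Classical DeepONets take Lu to
    be point evaluations at fixed sensors."""
    return float(mlp(branch, u_measurements) @ mlp(trunk, np.array([y])))

sensors = np.linspace(0.0, 1.0, 8)
out = deeponet(np.sin(sensors), 0.5)   # approximate G(sin)(0.5)
```

The branch component only ever touches the input through the linear measurements Lu, which is exactly the hook the paper exploits to move beyond normed function spaces.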
Results
The main result establishes that continuous operators mapping from a compact subset of a locally convex space to a compact Euclidean domain can be uniformly approximated by the proposed topological DeepONets. This result extends existing approximation theorems and provides a structured approach to operator learning in more complex spaces.
Implications
The findings have significant implications for operator learning in various scientific and engineering fields, particularly in scenarios where inputs are not confined to normed spaces. This framework can enhance the modeling of complex systems, such as those found in dynamical systems, multiphysics, and PDEs, thereby broadening the scope of applications for DeepONets.
Algorithmic Capture, Computational Complexity, and Inductive Bias of Infinite Transformers
Theory
Large Language Models
NLP
- Formal definition of Algorithmic Capture and algorithmic learning.
- Transformers show a bias towards low-complexity algorithms, limiting their ability to learn higher-complexity tasks.
- Upper bounds on inference-time complexity for infinite-width transformers are established.
- Examples of captured algorithms include induction head search and sorting, while complex problems like shortest path and max flow are not captured.
Read more
Algorithmic Capture, Computational Complexity, and Inductive Bias of Infinite Transformers
Summary
This paper introduces the concept of Algorithmic Capture, defined as a neural network's ability to generalize to arbitrary problem sizes with controlled error and minimal sample adaptation. The authors analyze infinite-width transformers in both lazy and rich regimes, deriving upper bounds on the inference-time computational complexity of the functions these networks can learn. They demonstrate that while transformers exhibit universal expressivity, they have an inductive bias towards low-complexity algorithms within the Efficient Polynomial Time Heuristic Scheme (EPTHS) class. This bias limits their ability to capture higher-complexity algorithms, although they succeed in simpler tasks like search, copy, and sort. The paper also provides a formal definition of algorithmic learning, examples of captured and non-captured algorithms, and upper complexity bounds for transformer inference. The findings suggest that transformers are biased towards lower complexity functions, which has implications for understanding their learning capabilities and limitations in algorithmic reasoning tasks.
Methodology
The authors analyze infinite-width transformers by examining their performance in lazy and rich regimes. They derive theoretical bounds on computational complexity and provide a formal definition of algorithmic learning. The analysis includes empirical examples of algorithms that transformers can and cannot capture, alongside complexity evaluations.
Results
The study finds that infinite-width transformers can represent complex functions but are limited to capturing algorithms of heuristic complexity no greater than O(T^(2+ϵ)). The results indicate that while transformers can perform well on simpler tasks, they struggle with more complex algorithmic challenges.
Implications
The findings have significant implications for the design and understanding of neural networks, particularly in the context of large language models. They suggest that while transformers are powerful, their inductive biases may restrict their ability to generalize to more complex algorithmic tasks, which is crucial for applications requiring robust algorithmic reasoning.
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
Multimodal
Generative Models
Efficient ML
- Cornserve is the first distributed serving system specifically for Any-to-Any multimodal models.
- It allows for flexible task abstraction and model fission, enabling independent scaling of model components.
- The system improves serving throughput by up to 3.81 times and reduces tail latency by up to 5.79 times.
- Built on Kubernetes, Cornserve supports diverse multimodal models and enhances resource efficiency.
Read more
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
Summary
The paper introduces Cornserve, a distributed serving system designed for Any-to-Any multimodal models that can process and generate various combinations of data modalities such as text, images, videos, and audio. The challenge in serving these models arises from the heterogeneous nature of requests and the differing scaling characteristics of model components. Cornserve addresses these challenges by providing a flexible task abstraction for model computation graphs, enabling model fission to disaggregate components for independent scaling, and employing a distributed runtime that utilizes a record-and-replay execution model. This system is built on Kubernetes and supports a variety of Any-to-Any models, significantly improving throughput and reducing latency. The implementation consists of approximately 23,000 lines of Python code, and the system is open-source, allowing for broader accessibility and experimentation.
Methodology
Cornserve employs a flexible task abstraction to express model computations, enabling model fission to disaggregate components for independent scaling. It utilizes a distributed runtime with a record-and-replay execution model to manage data dependencies and tensor data forwarding efficiently. The system is implemented on Kubernetes, facilitating resource management and orchestration.
Results
Cornserve demonstrates significant performance improvements, achieving up to 3.81 times higher throughput and 5.79 times lower tail latency compared to existing serving systems. It effectively supports a range of Any-to-Any models, showcasing its versatility and efficiency in handling multimodal data.
Implications
The development of Cornserve has the potential to enhance the deployment and scalability of multimodal AI applications, making it easier for developers to serve complex models in production environments. Its open-source nature encourages collaboration and innovation in the field of multimodal machine learning.
On the Role of Reversible Instance Normalization
Time Series
- Identifies three key challenges in normalization for time series forecasting: temporal, spatial, and conditional distribution shifts.
- Conducts ablation studies on RevIN, revealing redundancies and limitations in its components.
- Challenges the effectiveness of RevIN in mitigating distribution shifts in time series data.
- Proposes new perspectives for improving normalization strategies in forecasting applications.
Read more
On the Role of Reversible Instance Normalization
Summary
This paper investigates the role of Reversible Instance Normalization (RevIN) in the context of time series forecasting, identifying critical challenges associated with data normalization in this domain. The authors highlight three main distribution shifts that affect time series forecasting: temporal, spatial, and conditional shifts. They argue that standard normalization techniques, including RevIN, may not adequately address these challenges. Through extensive ablation studies, the authors analyze the components of RevIN, revealing that some are redundant or counterproductive. The findings suggest that while RevIN has been widely adopted in forecasting studies, its effectiveness may be overstated. The paper concludes by proposing new perspectives for enhancing the robustness and generalization of normalization techniques in time series forecasting.
Methodology
The authors conducted extensive ablation studies on the RevIN method using standard forecasting benchmarks. They analyzed the normalization techniques used in neural networks and their limitations, particularly in the context of time series data. The study involved reviewing existing literature and performing empirical evaluations to assess the impact of different normalization components.
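For context on what is being ablated: RevIN's core cycle normalizes each input instance by its own statistics, runs the forecaster on the normalized series, then denormalizes the forecast with the same statistics. A minimal pure-Python sketch under simplifying assumptions (the learnable affine step is omitted, and `naive_forecast` is a hypothetical stand-in for the forecasting model):

```python
import statistics

def revin_normalize(series):
    """Normalize one instance by its own mean and std (RevIN, affine omitted)."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0  # guard against constant series
    return [(x - mu) / sigma for x in series], (mu, sigma)

def revin_denormalize(forecast, stats):
    """Map the model's normalized forecast back to the original scale."""
    mu, sigma = stats
    return [y * sigma + mu for y in forecast]

def naive_forecast(norm_series, horizon):
    """Hypothetical forecaster: repeat the last normalized value."""
    return [norm_series[-1]] * horizon

series = [10.0, 12.0, 14.0, 16.0]
norm, stats = revin_normalize(series)
pred = revin_denormalize(naive_forecast(norm, 2), stats)
```

The instance-wise statistics are exactly the components whose necessity the paper's ablations call into question.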
Results
The ablation studies demonstrated that several components of RevIN are unnecessary or detrimental to its performance. The authors found that standard normalization techniques fail to address the unique challenges posed by time series data, leading to potential misinterpretations of the effectiveness of RevIN in real-world applications.
Implications
The findings suggest that researchers and practitioners should reconsider the reliance on RevIN and similar normalization techniques in time series forecasting. Improved understanding and development of normalization strategies could enhance the robustness and generalization of forecasting models, leading to better performance in practical applications.
Monitoring and Prediction of Mood in Elderly People during Daily Life Activities
Time Series
- Development of a wearable system for mood monitoring in elderly individuals.
- Utilization of ecological momentary assessment (EMA) to simplify mood state evaluation.
- Machine learning classifier trained on physiological data from a wristband.
- Promising results in mood prediction accuracy, especially for happiness and activeness.
Read more
Monitoring and Prediction of Mood in Elderly People during Daily Life Activities
Summary
This paper presents an intelligent wearable system designed to monitor and predict the mood states of elderly individuals during their daily activities. The system comprises a wristband that records various physiological signals and a mobile application for ecological momentary assessment (EMA). The authors employ machine learning techniques to train a classifier that predicts mood states based solely on data from the wristband. The study highlights the importance of addressing mental health in the elderly, particularly those living alone, who are at higher risk for mental health issues. The proposed system aims to provide a more effective means of monitoring mood states in real-life conditions, overcoming limitations of previous laboratory-based studies. The results indicate that the system achieves promising accuracy in mood prediction, particularly for happiness and activeness, comparable to existing state-of-the-art methods.
Methodology
The system integrates a wristband (Empatica E4) that captures physiological data (e.g., heart rate, skin temperature, accelerometer data) and a mobile app for EMA input. The app prompts users to report their mood state five times a day using a simplified two-question format. Data is processed offline, with features extracted from physiological signals using a sliding window approach. A machine learning classifier is trained to predict mood states based on these features.
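The sliding-window feature-extraction step can be sketched as follows; the window length, step size, and statistics chosen here are illustrative, not the paper's exact feature set:

```python
import statistics

def window_features(signal, win, step):
    """Slide a fixed-length window over a 1-D physiological signal and
    compute simple per-window statistics (illustrative feature set)."""
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        feats.append({
            "mean": statistics.fmean(w),
            "std": statistics.pstdev(w),
            "range": max(w) - min(w),
        })
    return feats

# Toy heart-rate trace sampled once per second
hr = [70, 72, 71, 75, 78, 77, 74, 73]
feats = window_features(hr, win=4, step=2)
```

Each feature dictionary would then become one training example for the mood classifier, labeled with the nearest EMA response.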
Results
The system demonstrated effective mood prediction capabilities, achieving accuracy levels comparable to existing methods for detecting happiness and activeness. The use of EMA allowed for frequent mood assessments without overwhelming users, thus maintaining the validity of the data collected.
Implications
The findings suggest that this wearable system could significantly enhance mental health monitoring and support for elderly individuals, particularly those living independently. By providing real-time mood assessments, the system may facilitate timely interventions and improve overall quality of life for the aging population.
Deep Learning Network-Temporal Models For Traffic Prediction
Time Series
Graph Learning
Large Language Models
- Introduction of two deep learning models for multivariate time series traffic prediction: a GAT and an LLM.
- GAT model effectively reduces prediction variance across time series and horizons.
- LLM model shows superior overall prediction and generalization performance compared to traditional methods.
- Comprehensive analysis reveals insights into correlation variability and prediction distribution discrepancies.
Read more
Deep Learning Network-Temporal Models For Traffic Prediction
Summary
This paper addresses the challenges of predicting network traffic using multivariate time series analysis, highlighting the limitations of traditional statistical and shallow machine learning models. The authors propose two innovative deep learning models: a customized network-temporal graph attention network (GAT) and a fine-tuned multi-modal large language model (LLM) enhanced with clustering techniques. These models are designed to simultaneously capture temporal patterns and network topological correlations. The study compares the performance of these models against a Long Short-Term Memory (LSTM) model, which has previously outperformed statistical methods. Through extensive evaluations on a real-world network dataset, the LLM-based model demonstrates superior prediction accuracy and generalization capabilities, while the GAT model excels in minimizing prediction variance across different time series and forecasting horizons. The paper also provides insights into correlation variability and discrepancies in prediction distributions over time, contributing to a deeper understanding of network traffic dynamics.
Methodology
The authors developed a customized spatial-temporal graph attention network (ST-GAT) and a fine-tuned multi-modal large language model (LLM) with a clustering pre-training step. They conducted extensive training and performance evaluations using a real-world network traffic dataset, optimizing hyperparameters and analyzing model performance across various prediction horizons.
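As a reference point for the graph-attention component, a single-head GAT-style attention computation looks like the sketch below. The linear transform W is folded into the node features for brevity, and the toy graph is hypothetical; the paper's ST-GAT additionally models temporal structure:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gat_attention(h, i, neighbors, a_src, a_dst):
    """Attention weights for node i: e_ij = LeakyReLU(a_src.h_i + a_dst.h_j),
    softmax-normalized over the neighborhood."""
    raw = [leaky_relu(dot(a_src, h[i]) + dot(a_dst, h[j])) for j in neighbors]
    m = max(raw)                              # subtract max for stability
    exp = [math.exp(e - m) for e in raw]
    z = sum(exp)
    return [e / z for e in exp]

# Toy 3-node graph: node 0 attends over neighbors {1, 2}.
h = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
alpha = gat_attention(h, 0, [1, 2], a_src=[1.0, 0.0], a_dst=[0.0, 1.0])
```

Aggregating neighbor features with these weights is what lets the model share information across the network topology.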
Results
The LLM-based model outperformed the LSTM model in terms of prediction accuracy and generalization, while the GAT model effectively reduced prediction variance. The study provided detailed insights into the variability of correlations and discrepancies in prediction distributions over time, enhancing the understanding of network traffic behavior.
Implications
The findings suggest that deep learning models, particularly LLMs and GATs, can significantly improve the accuracy and reliability of network traffic predictions, which is crucial for effective network management and control. This research could influence future developments in AI-driven network analytics and traffic engineering.
EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering
Optimization
Generative Models
- EvoFlows enables controlled edit-based mutations in protein sequences, predicting both mutation type and location.
- The model captures evolutionary patterns and protein distributions effectively, outperforming traditional models in generating realistic protein variants.
- A new grid-free inference procedure enhances the model's efficiency in vectorized operations.
- EvoFlows is particularly suited for protein optimization tasks, preserving structural and functional integrity while modifying properties.
Read more
EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering
Summary
EvoFlows presents a novel approach to protein engineering through a variable-length sequence-to-sequence model that focuses on edit-based mutations. Unlike traditional autoregressive and masked language models, EvoFlows allows for a controlled number of insertions, deletions, and substitutions on a template protein sequence, effectively predicting both the type and location of mutations. The model employs edit flows to learn mutational trajectories between evolutionarily related protein sequences, capturing the distributions of natural proteins and their mutational paths. Extensive evaluations on diverse protein communities demonstrate that EvoFlows achieves performance comparable to leading protein language models while exhibiting enhanced capabilities in generating non-trivial, natural-like protein mutants. This approach addresses the limitations of existing models by providing a more efficient and effective means of protein optimization, crucial for applications in pre-clinical drug development.
Methodology
EvoFlows utilizes a discrete flow-matching approach to learn edit-based sequence-to-sequence transition rules. It generates protein variants by applying controlled mutations (substitutions, insertions, deletions) to existing protein sequences, rather than modeling token-level likelihoods. The model is trained on natural protein communities to capture the desired sequence distributions.
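The edit budget described above (a controlled number of substitutions, insertions, and deletions applied to a template sequence) can be illustrated as follows; here edit positions and residues are drawn at random, whereas the model in the paper predicts both:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def apply_edits(seq, n_sub, n_ins, n_del, rng):
    """Apply a fixed budget of substitutions, insertions, and deletions
    to a template protein sequence."""
    s = list(seq)
    for _ in range(n_sub):
        i = rng.randrange(len(s))
        s[i] = rng.choice(AMINO_ACIDS.replace(s[i], ""))  # force a real change
    for _ in range(n_ins):
        s.insert(rng.randrange(len(s) + 1), rng.choice(AMINO_ACIDS))
    for _ in range(n_del):
        del s[rng.randrange(len(s))]
    return "".join(s)

template = "MKTAYIAKQR"  # hypothetical template sequence
variant = apply_edits(template, n_sub=2, n_ins=1, n_del=1, rng=random.Random(0))
```

Because the budget is explicit, the edit distance from the template stays controlled, which is the property that makes this formulation attractive for optimization around a known functional protein.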
Results
The evaluation results indicate that EvoFlows effectively learns the desired distributions of protein sequences and generates meaningful mutations. It matches the performance of state-of-the-art protein language models while providing enhanced capabilities for generating non-trivial protein variants.
Implications
The introduction of EvoFlows has significant implications for protein engineering, particularly in drug development and biotechnology. By enabling efficient and targeted protein optimization, this model can facilitate the design of proteins with improved functionalities, potentially accelerating the development of therapeutic proteins and enzymes.
Teleodynamic Learning: A New Paradigm for Interpretable AI
Theory
Interpretability
Optimization
- Teleodynamic Learning shifts the focus from optimization to the co-evolution of structure, parameters, and resources.
- The framework introduces emergent stabilization and phase-structured behavior in learning processes.
- DE11, the proposed teleodynamic learner, achieves high accuracy on benchmark datasets while producing interpretable rules.
- The approach integrates concepts from biology and physics to enhance understanding of adaptive systems.
Read more
Teleodynamic Learning: A New Paradigm for Interpretable AI
Summary
The paper introduces Teleodynamic Learning, a novel approach to machine learning that redefines learning as the emergence and stabilization of functional organization within constrained dynamical systems. Unlike traditional optimization methods that focus on minimizing a static objective, this paradigm emphasizes the co-evolution of structure, parameters, and resources. The authors model learning as navigation in a system characterized by two coupled timescales: inner dynamics (continuous adaptation) and outer dynamics (discrete structural modifications). This framework leads to three significant phenomena: emergent stabilization without external stopping criteria, phase-structured behavior identifiable through dynamical signatures, and convergence guarantees based on the geometry of the parameter manifold. The Distinction Engine (DE11) is presented as an instantiation of this paradigm, achieving competitive accuracy on standard benchmarks while generating interpretable logical rules that arise from the system's dynamics. Overall, Teleodynamic Learning offers a thermodynamically grounded approach to adaptive and interpretable AI, integrating regularization, architecture search, and resource-bounded inference into a unified framework.
Methodology
The authors model learning as navigation in a constrained dynamical system with two coupled timescales. They utilize concepts from information geometry and tropical optimization to develop the Distinction Engine (DE11), which embodies the principles of Teleodynamic Learning.
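The two coupled timescales can be caricatured as an inner loop of continuous parameter updates and an outer loop of discrete structural moves under a resource budget. The sketch below is only a schematic of that coupling, not DE11 itself:

```python
def inner_step(params, grad, lr=0.1):
    """Inner timescale: continuous parameter adaptation (plain gradient step)."""
    return [p - lr * g for p, g in zip(params, grad)]

def outer_step(params, budget):
    """Outer timescale: a discrete structural move -- here, prune the
    smallest-magnitude parameter whenever the resource budget is exceeded."""
    if len(params) > budget:
        params = sorted(params, key=abs)[1:]
    return params

# Toy quadratic loss sum(p^2), so the gradient is 2p per parameter.
params = [1.0, -2.0, 0.5, 3.0]
for t in range(20):
    grad = [2 * p for p in params]
    params = inner_step(params, grad)
    if t % 5 == 4:                       # occasional structural modification
        params = outer_step(params, budget=2)
```

In this caricature the structure (number of parameters) and the parameter values co-evolve, which is the qualitative behavior the paradigm emphasizes over pure objective minimization.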
Results
DE11 achieves 93.3% test accuracy on the IRIS dataset, 92.6% on the WINE dataset, and 94.7% on the Breast Cancer dataset, outperforming traditional logistic regression. The system generates interpretable logical rules endogenously from its dynamics.
Implications
Teleodynamic Learning has the potential to advance the field of interpretable AI by providing a framework that inherently produces interpretable models. It could lead to more adaptive systems that better mimic biological learning processes and enhance our understanding of machine learning dynamics.
Bayesian Optimization of Partially Known Systems using Hybrid Models
Optimization
- Introduction of a hybrid model-based Bayesian Optimization framework that combines mechanistic models with Gaussian processes.
- Demonstrated significant improvements in optimization efficiency, achieving convergence in as little as one iteration.
- The hybrid model formulation allows for the inclusion of physical constraints, enhancing the robustness of the optimization process.
- Outperformed standard BO methods in an in-silico optimization case study of a single-stage distillation.
Read more
Bayesian Optimization of Partially Known Systems using Hybrid Models
Summary
This paper presents a novel approach to Bayesian Optimization (BO) tailored for partially known systems by integrating mechanistic models with probabilistic data-driven models. The authors propose a hybrid model-based BO framework that utilizes known mechanistic equations alongside Gaussian processes (GPs) to infer missing variables. This method allows physical constraints to be incorporated into the optimization process, transforming it into a constrained, nonlinear stochastic program. The hybrid approach significantly enhances the efficiency of the optimization process, particularly in high-dimensional and nonlinear systems, as demonstrated through an in-silico optimization of a single-stage distillation process. The results indicate that the hybrid BO model can achieve convergence in as little as one iteration, whereas traditional BO methods often fail to converge even after many iterations. This advancement highlights the potential of combining mechanistic insights with data-driven optimization techniques to improve decision-making in complex systems.
Methodology
The authors developed a hybrid Bayesian Optimization framework that integrates known mechanistic equations with Gaussian processes to infer missing variables. This approach formulates the optimization problem as a constrained, nonlinear stochastic program, which is discretized using sample-average approximation. The method iteratively queries the system to optimize the design based on the combined model.
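The hybrid-model idea (known mechanistic equations plus a data-driven correction for what the physics misses) can be sketched as below. The `mechanistic` function and the kernel-weighted residual are hypothetical simplifications: the paper uses full GP posteriors inside a constrained stochastic program, not this plain posterior-mean-style correction:

```python
import math

def mechanistic(x):
    """Known physics (hypothetical): an idealized linear response."""
    return 2.0 * x

def rbf_residual(x, xs, resid, length=0.5):
    """GP-flavored residual correction: kernel-weighted average of observed
    residuals (a crude stand-in for a GP posterior mean)."""
    w = [math.exp(-((x - xi) ** 2) / (2 * length ** 2)) for xi in xs]
    z = sum(w)
    return sum(wi * ri for wi, ri in zip(w, resid)) / z if z else 0.0

def hybrid(x, xs, ys):
    """Mechanistic prediction plus the learned correction."""
    resid = [y - mechanistic(xi) for xi, y in zip(xs, ys)]
    return mechanistic(x) + rbf_residual(x, xs, resid)

# Observations from the true, partially unknown system y = 2x + sin(x)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + math.sin(x) for x in xs]
```

Even with a handful of observations the hybrid prediction tracks the unmodeled sin(x) term far better than the mechanistic model alone, which is the intuition behind the rapid convergence reported in the paper.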
Results
The hybrid Bayesian Optimization model demonstrated superior performance in an in-silico optimization of a single-stage distillation, yielding better designs compared to standard BO methods. The hybrid model achieved convergence in as little as one iteration, while standard BO did not converge within 25 iterations across various seeds.
Implications
This research has significant implications for optimizing complex engineering systems where partial knowledge exists. By leveraging both mechanistic and data-driven approaches, the proposed method can enhance decision-making processes in fields such as chemical engineering, robotics, and automated design, potentially leading to more efficient and effective system designs.
Multi-Task Anti-Causal Learning for Reconstructing Urban Events from Residents' Reports
Theory
Optimization
- MTAC framework effectively disentangles task-invariant and task-specific causal mechanisms.
- Utilizes a multi-task structural equation model (SEM) for causal discovery and inference.
- Demonstrates significant improvements in urban event reconstruction accuracy over strong baselines.
- Achieves up to 34.61% reduction in mean absolute error (MAE) in real-world applications.
Read more
Multi-Task Anti-Causal Learning for Reconstructing Urban Events from Residents' Reports
Summary
This paper introduces Multi-Task Anti-Causal Learning (MTAC), a framework designed to infer latent causes from observed effects in multi-task scenarios, particularly focusing on urban event reconstruction from residents' reports. The authors argue that many real-world machine learning tasks are anti-causal, requiring the estimation of causes based on effects. MTAC addresses the challenge of learning invariant causal mechanisms across multiple tasks by employing a structured multi-task structural equation model (SEM). This model separates the outcome-generation process into task-invariant and task-specific components, allowing for a shared backbone with task-specific heads. The framework first conducts causal discovery to establish a shared causal graph, followed by maximum a posteriori (MAP) inference to reconstruct causes by optimizing latent mechanism variables and cause magnitudes. The effectiveness of MTAC is demonstrated through its application to three urban event types: parking violations, abandoned properties, and unsanitary conditions, using real-world data from Manhattan and Newark. The results show that MTAC significantly enhances reconstruction accuracy compared to existing methods, achieving up to a 34.61% reduction in mean absolute error (MAE).
Methodology
The methodology involves causal discovery to learn a shared causal graph, followed by the implementation of a structured multi-task structural equation model (SEM) that separates the outcome-generation process into task-invariant and task-specific components. A MAP-based inference algorithm is then used to estimate causes by jointly optimizing the causes and shared mechanism variables under the learned causal structure.
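The MAP inference step can be illustrated on a toy linear effect model: with a Gaussian prior on the causes, MAP estimation reduces to ridge-penalized least squares, which plain gradient descent can solve. The loadings and observations below are hypothetical, and the sketch omits the task-invariant/task-specific split:

```python
def map_infer_cause(A, y, lam=0.1, lr=0.05, steps=500):
    """MAP estimate of latent causes c for a linear effect model y = A @ c
    with a Gaussian prior on c (i.e. ridge-penalized least squares)."""
    n = len(A[0])
    c = [0.0] * n
    for _ in range(steps):
        pred = [sum(aij * cj for aij, cj in zip(row, c)) for row in A]
        err = [p - yi for p, yi in zip(pred, y)]
        # gradient of 0.5*||err||^2 + (lam/2)*||c||^2
        grad = [sum(A[i][j] * err[i] for i in range(len(A))) + lam * c[j]
                for j in range(n)]
        c = [cj - lr * g for cj, g in zip(c, grad)]
    return c

# Toy: two report channels observing one latent urban cause with known loadings.
A = [[1.0], [0.5]]
y = [2.0, 1.0]            # consistent with a cause magnitude of 2.0
c_hat = map_infer_cause(A, y)
```

The ridge term plays the role of the prior over cause magnitudes; MTAC additionally optimizes shared mechanism variables under the learned causal graph.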
Results
MTAC outperformed state-of-the-art anti-causal learning methods in reconstructing urban events, achieving up to a 34.61% reduction in mean absolute error (MAE) across three tasks: parking violations, abandoned properties, and unsanitary conditions, based on real-world datasets from Manhattan and Newark.
Implications
The findings suggest that MTAC can enhance the accuracy of urban event reconstruction, which could improve municipal operations and public safety. The framework's ability to learn transferable causal mechanisms across tasks may also have broader applications in various domains where anti-causal learning is relevant.
A Learning-Based Superposition Operator for Non-Renewal Arrival Processes in Queueing Networks
Theory
Efficient ML
Time Series
- Introduces a learning-based superposition operator for non-renewal arrival processes in queueing networks.
- Utilizes deep learning to accurately reconstruct higher-order moments and dependence structures of merged arrival streams.
- Demonstrates superior performance compared to classical renewal-based approximations through extensive computational experiments.
- Enables decomposition-based evaluation of queueing networks, preserving critical variability and dependence information.
Read more
A Learning-Based Superposition Operator for Non-Renewal Arrival Processes in Queueing Networks
Summary
This paper addresses the challenge of superposing non-renewal arrival processes in queueing networks, a task that is analytically intractable with classical methods. The author proposes a novel, scalable, data-driven superposition operator that utilizes deep learning to map low-order moments and autocorrelation descriptors of multiple arrival streams to those of their merged process. The operator is trained on synthetically generated Markovian Arrival Processes (MAPs), allowing it to learn a compact representation that accurately reconstructs the first five moments and short-range dependence structure of the aggregate stream. The results from extensive computational experiments demonstrate that the proposed method significantly outperforms traditional renewal-based approximations, providing uniformly low prediction errors across various variability and correlation regimes. By integrating this operator with learning-based modules for departure-process and steady-state analysis, the framework enables a decomposition-based evaluation of feed-forward queueing networks with merging flows, thus offering a scalable alternative to traditional analytical approaches while retaining essential higher-order variability and dependence information necessary for accurate distributional performance analysis.
Methodology
The proposed superposition operator is a deep learning model trained on synthetically generated Markovian Arrival Processes (MAPs). It learns to map low-order statistical descriptors (moments and autocorrelation) of multiple input streams to those of their superposed process. The training involves generating a diverse set of MAPs, computing their exact superpositions, and using these pairs as labeled data for the neural network.
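The statistical descriptors involved (low-order moments plus short-lag autocorrelations of the interarrival times) can be computed as below; the paper uses the first five moments, while this sketch uses three for brevity, and the sample stream is hypothetical:

```python
def descriptors(interarrivals, n_moments=3, max_lag=2):
    """Low-order descriptors of an arrival stream: raw moments of the
    interarrival times plus short-lag autocorrelations -- the kind of
    inputs and outputs the learned superposition operator maps between."""
    n = len(interarrivals)
    moments = [sum(x ** k for x in interarrivals) / n
               for k in range(1, n_moments + 1)]
    mu = moments[0]
    var = sum((x - mu) ** 2 for x in interarrivals) / n
    acf = []
    for lag in range(1, max_lag + 1):
        cov = sum((interarrivals[t] - mu) * (interarrivals[t + lag] - mu)
                  for t in range(n - lag)) / n
        acf.append(cov / var if var else 0.0)
    return moments, acf

stream = [0.8, 1.2, 0.9, 1.1, 1.0, 0.7, 1.3]
moments, acf = descriptors(stream)
```

The neural operator is trained to map the descriptors of several input streams to the descriptors of their exact MAP superposition, sidestepping the intractable analytical merge.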
Results
The proposed operator achieves uniformly low prediction errors across different variability and correlation regimes, significantly outperforming classical methods. It successfully reconstructs the first five moments and short-range dependence structure of the merged arrival processes, demonstrating its effectiveness in practical applications.
Implications
The framework has the potential to enhance performance analysis in various fields such as manufacturing systems, communication networks, and service systems, where accurate modeling of arrival processes is crucial. It provides a more scalable and efficient approach to queueing analysis, allowing for better predictions in complex systems with multiple traffic streams.
Procedural Fairness via Group Counterfactual Explanation
Theory
Interpretability
- Formalizes procedural fairness as group counterfactual explanation invariance.
- Introduces GCIG, a regularization approach that minimizes cross-group variation in explanations.
- Demonstrates that GCIG effectively reduces explanation disparity while preserving predictive performance.
- Highlights the necessity of integrating fairness constraints during the training process.
Read more
Procedural Fairness via Group Counterfactual Explanation
Summary
This paper addresses the gap in machine learning fairness research by focusing on procedural fairness, which concerns the consistency of decision-making processes across different protected groups. The authors introduce Group Counterfactual Integrated Gradients (GCIG), a novel in-processing regularization framework that enforces explanation invariance across groups based on true labels. Unlike traditional outcome-oriented fairness metrics, GCIG minimizes cross-group variation in feature attributions during training, ensuring that models provide stable explanations regardless of group context. The authors empirically compare GCIG against six state-of-the-art methods and demonstrate that it significantly reduces explanation disparity while maintaining competitive predictive performance and accuracy-fairness trade-offs. This work highlights the importance of aligning model reasoning across groups as a means to enhance fairness beyond mere outcome parity.
Methodology
The authors propose GCIG, which computes explanations relative to multiple group conditional baselines and penalizes differences in these attributions during training. This approach integrates explanation-based constraints directly into the learning process to ensure consistent reasoning across protected groups.
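A schematic version of the penalty: for a linear model, integrated-gradients attributions reduce to w_j(x_j - b_j), so cross-group explanation disparity can be scored by comparing per-group attribution summaries computed against group-conditional baselines. The summary statistic and toy data here are illustrative, not GCIG's exact formulation:

```python
def attributions(w, x, baseline):
    """For a linear model, integrated gradients reduce to w_j * (x_j - b_j)."""
    return [wj * (xj - bj) for wj, xj, bj in zip(w, x, baseline)]

def gcig_penalty(w, groups):
    """Schematic explanation-invariance penalty: summarize each group by its
    mean absolute attribution per feature (baseline = group feature means),
    then penalize squared cross-group differences. In GCIG this term is
    added to the training loss; here it is just evaluated."""
    summaries = []
    for X in groups:
        base = [sum(col) / len(col) for col in zip(*X)]
        attrs = [attributions(w, x, base) for x in X]
        summaries.append([sum(abs(a) for a in col) / len(col)
                          for col in zip(*attrs)])
    g0, g1 = summaries
    return sum((a - b) ** 2 for a, b in zip(g0, g1))

w = [0.5, -1.0]
group_a = [[1.0, 2.0], [3.0, 4.0]]
group_b = [[0.0, 0.0], [4.0, 8.0]]
penalty = gcig_penalty(w, [group_a, group_b])
```

Identical groups yield zero penalty; the more the model's feature attributions diverge between groups, the larger the term, pushing training toward consistent reasoning.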
Results
Empirical evaluations show that GCIG significantly reduces cross-group explanation disparity compared to six state-of-the-art fairness methods, while also maintaining competitive predictive performance and favorable accuracy-fairness trade-offs.
Implications
The findings suggest that procedural fairness can be effectively integrated into machine learning models, potentially increasing trust in AI systems by ensuring consistent reasoning across diverse groups. This approach could be applied in various domains where fairness is critical, such as hiring algorithms, loan approvals, and criminal justice.
AutoScout: Structured Optimization for Automating ML System Configuration
Optimization
Efficient ML
- AutoScout formulates ML system configuration as a mixed discrete-continuous optimization problem.
- It employs a hybrid optimization framework that combines tree-based search and gradient-guided optimization.
- AutoScout achieves 2.7–3.0× training speedup over expert-tuned configurations.
- The system is 13.7–16.5× faster than existing system configurators in identifying optimal configurations.
Read more
AutoScout: Structured Optimization for Automating ML System Configuration
Summary
The paper presents AutoScout, a novel system configurator designed to automate the optimization of machine learning (ML) system configurations. As ML systems become increasingly complex, the configuration space expands significantly, incorporating various model-parallelism strategies, communication optimizations, and low-level runtime parameters. Identifying high-performance configurations is challenging due to the heterogeneous nature of features, conditional dependencies, and high profiling costs. AutoScout addresses these challenges by formulating the configuration problem as a mixed discrete-continuous optimization task with hierarchical dependencies. It employs a hybrid optimization framework that simultaneously refines sparse structural decisions and dense execution parameters. To enhance efficiency, AutoScout prioritizes high-impact configuration features and utilizes an ensemble of simulators with varying fidelity to minimize profiling costs. The results demonstrate that AutoScout consistently outperforms existing methods, achieving substantial speedups in training while being significantly faster than current system configurators.
Methodology
AutoScout utilizes a hybrid optimization approach that includes a tree-based search for sparse structural decisions and coordinate-wise stochastic gradient descent for optimizing dense execution parameters. It incorporates a hybrid bandit mechanism for adaptive exploration and employs a tournament-based design to prioritize impactful configuration features while using multiple simulators of varying fidelity to reduce profiling costs.
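A greatly simplified stand-in for the hybrid search: coordinate-wise refinement over a mixed discrete/continuous configuration space against a toy cost model. The configuration keys and cost function below are hypothetical, and greedy enumeration here replaces both the tree search and the gradient steps:

```python
def coordinate_search(cost, config, choices, rounds=5):
    """Coordinate-wise refinement over a mixed configuration space: for each
    key in turn, greedily try every allowed value and keep improvements."""
    best = dict(config)
    best_cost = cost(best)
    for _ in range(rounds):
        for key, options in choices.items():
            for v in options:
                trial = dict(best, **{key: v})
                c = cost(trial)
                if c < best_cost:
                    best, best_cost = trial, c
    return best, best_cost

# Toy cost: quadratic in microbatch size plus a discrete parallelism penalty.
def cost(cfg):
    penalty = {"dp": 3, "tp": 1, "pp": 2}[cfg["parallelism"]]
    return (cfg["microbatch"] - 8) ** 2 + penalty

choices = {"microbatch": [2, 4, 8, 16], "parallelism": ["dp", "tp", "pp"]}
best, best_cost = coordinate_search(
    cost, {"microbatch": 2, "parallelism": "dp"}, choices)
```

In AutoScout the `cost` call is the expensive part, which is why the system routes most evaluations to cheap low-fidelity simulators and reserves profiling for promising candidates.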
Results
AutoScout consistently identifies high-performance configurations across various models and hardware platforms, achieving training speedups of 2.7–3.0× compared to expert-tuned settings. It also demonstrates a significant improvement in efficiency, being 13.7–16.5× faster than existing system configurators.
Implications
The findings suggest that AutoScout can greatly enhance the efficiency of ML system configurations, making it a valuable tool for practitioners looking to optimize their ML training and inference processes. Its ability to adaptively navigate complex configuration spaces could lead to broader applications in various ML frameworks and deployment scenarios.