AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
63
Papers today
8h
Update frequency
7
Days of history
Smart Commander: A Hierarchical Reinforcement Learning Framework for Fleet-Level PHM Decision Optimization
Reinforcement Learning
Optimization
- Introduction of a hierarchical framework for fleet-level PHM decision-making.
- Two-tier architecture separating strategic and tactical decision-making.
- Integration of layered reward shaping and planning-enhanced neural networks.
- Demonstrated superior performance compared to conventional DRL and rule-based methods.
Summary
The paper introduces Smart Commander, a novel Hierarchical Reinforcement Learning (HRL) framework aimed at optimizing decision-making in military aviation Prognostics and Health Management (PHM). The authors address significant challenges posed by the 'curse of dimensionality' in large-scale fleet operations, characterized by sparse feedback and stochastic mission profiles. Smart Commander employs a two-tier hierarchical structure: a General Commander at the strategic level, responsible for fleet availability and cost objectives, and multiple Operation Commanders at the tactical level, executing specific actions related to sortie generation, maintenance scheduling, and resource allocation. The framework is validated through a high-fidelity discrete-event simulation that models aircraft dynamics and logistics. By integrating layered reward shaping with planning-enhanced neural networks, the framework effectively mitigates issues related to sparse and delayed rewards. Empirical evaluations indicate that Smart Commander outperforms traditional monolithic Deep Reinforcement Learning (DRL) and rule-based approaches, achieving significant reductions in training time while demonstrating enhanced scalability and robustness in failure-prone environments. These findings underscore the potential of HRL as a viable paradigm for advanced fleet management in military aviation.
Methodology
The methodology involves a two-tier HRL framework where a General Commander oversees strategic decisions and multiple Operation Commanders handle tactical actions. The framework utilizes a high-fidelity discrete-event simulation for validation, incorporating layered reward structures and planning-enhanced neural networks to address sparse feedback and complex operational dependencies.
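A minimal sketch of the two-tier split described above, with hand-written (not learned) decision rules standing in for the trained General and Operation Commander policies:

```python
class GeneralCommander:
    """Strategic tier: picks a fleet-level goal from aggregate state (toy rule)."""
    def select_goal(self, fleet_health):
        # Prefer maintenance when average health is low, otherwise fly sorties.
        avg_health = sum(fleet_health) / len(fleet_health)
        return "maintain" if avg_health < 0.5 else "sortie"

class OperationCommander:
    """Tactical tier: maps the strategic goal plus local aircraft state to an action."""
    def act(self, goal, health):
        if goal == "maintain":
            return "schedule_maintenance" if health < 0.7 else "standby"
        return "fly" if health > 0.3 else "ground"

def step(fleet_health):
    general = GeneralCommander()
    ops = [OperationCommander() for _ in fleet_health]
    goal = general.select_goal(fleet_health)
    return goal, [op.act(goal, h) for op, h in zip(ops, fleet_health)]

goal, actions = step([0.9, 0.2, 0.8])
# high average health -> strategic goal "sortie"; the weak aircraft is grounded
```

In the paper both tiers are trained policies; here the hierarchy itself is the point: the tactical agents never see fleet-wide state, only the goal handed down.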
Results
Smart Commander significantly outperformed conventional DRL and rule-based baselines in empirical evaluations, achieving a notable reduction in training time and demonstrating superior scalability and robustness in environments prone to failure.
Implications
The proposed HRL framework has the potential to transform fleet management practices in military aviation, offering a more efficient and effective approach to decision-making under uncertainty and complex operational conditions.
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
Computer Vision
- Introduction of the first foundation model for SEM image analysis.
- Utilizes a self-supervised transformer architecture with a Mixture of Experts mechanism.
- Pretrained on a large dataset of SEM images to enhance generalization across diverse conditions.
- Demonstrates superior performance in defocus-to-focus image translation tasks.
Summary
This paper presents the first foundation model specifically designed for analyzing Scanning Electron Microscopy (SEM) images, addressing the limitations of existing task-specific models and the labor-intensive nature of SEM image acquisition. The proposed model is pretrained on a diverse dataset of 125,000 unlabeled SEM images, utilizing a self-supervised transformer architecture augmented with a Mixture of Experts (MoE) mechanism. This allows the model to dynamically allocate computational resources based on the characteristics of the input data, enhancing its adaptability across various imaging conditions and material systems. A key application demonstrated is the defocus-to-focus image translation, which restores focused details from defocused inputs without requiring paired supervision. The model outperforms existing state-of-the-art techniques in this area, showcasing its potential to improve automated microscopy workflows and accelerate materials discovery by providing robust, generalizable analysis tools for SEM data.
Methodology
The model employs a masked autoencoding framework with a ViT-Large backbone, pretrained on a large corpus of SEM images. The integration of a Mixture of Experts mechanism allows for dynamic routing of model capacity based on input characteristics, enabling specialization for different imaging conditions.
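The MoE routing idea can be sketched in a few lines; the gate weights, expert functions, and top-1 routing below are illustrative stand-ins for the paper's learned components:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, gate_w, experts, top_k=1):
    """Toy Mixture-of-Experts layer: route the input to the top-k experts
    by gate score and return their probability-weighted combination."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in gate_w]
    probs = softmax(scores)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    outs = {i: experts[i](x) for i in top}
    return [sum(probs[i] / norm * outs[i][j] for i in top) for j in range(len(x))]

# Two toy experts: one doubles the input, one negates it.
experts = [lambda v: [2 * vi for vi in v], lambda v: [-vi for vi in v]]
gate_w = [[1.0, 0.0], [0.0, 1.0]]
out = moe_forward([3.0, 0.0], gate_w, experts)   # gate favors expert 0 -> doubling
```

The dynamic allocation described above corresponds to the top-k selection: only the chosen experts' computation runs for a given input.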
Results
The proposed model significantly outperforms state-of-the-art techniques in defocus-to-focus image translation, demonstrating its effectiveness in restoring image quality from defocused inputs. The model's architecture allows it to handle a wide range of imaging conditions and material types, showcasing its robustness and adaptability.
Implications
This work lays the foundation for developing adaptable SEM models that can enhance the efficiency and accuracy of automated microscopy workflows. It has the potential to accelerate materials discovery by providing more reliable analysis tools that can operate under realistic imaging conditions.
Learning to Query History: Nonstationary Classification via Learned Retrieval
Time Series
- Nonstationarity in classification is addressed by leveraging historical labeled examples.
- A learned retrieval mechanism samples relevant historical data, improving efficiency.
- The approach allows for adaptation to distribution shifts without retraining.
- Experiments show significant robustness improvements over standard classifiers.
Summary
This paper addresses the challenge of nonstationarity in classification tasks, where models often struggle to generalize to new data distributions that evolve over time. The authors propose a novel approach that reframes nonstationary classification as a time series prediction problem. Instead of relying solely on the current input, the classifier is conditioned on a sequence of historical labeled examples, allowing it to adapt to changes in data distribution without the need for retraining. To manage the potentially large volume of historical data, the authors introduce a learned discrete retrieval mechanism that samples relevant historical examples based on input-dependent queries. This mechanism is trained end-to-end with the classifier, enabling efficient use of historical data stored on arbitrary filesystems. The proposed method demonstrates improved robustness to distribution shifts compared to standard classifiers, particularly in experiments conducted on synthetic benchmarks and the Amazon Reviews '23 dataset in the electronics category.
Methodology
The authors develop a system that utilizes a query generator to create input-dependent queries, which are used to sample relevant historical examples. This sampling is done through a hard attention-like mechanism, allowing the model to condition on a sequence of historical data. The entire system is trained end-to-end to optimize classification performance, with historical data stored externally to manage memory constraints.
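A toy version of the retrieval step, with a fixed query in place of the learned query generator and a majority vote in place of the trained classifier:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, history, k=2):
    """Hard-attention-style retrieval (sketch): score every stored
    (features, label) pair against an input-dependent query, keep the top-k.
    In the paper the query generator is learned end-to-end; here it is given."""
    scored = sorted(history, key=lambda ex: dot(query, ex[0]), reverse=True)
    return scored[:k]

def classify(x, context):
    """Toy classifier conditioned on retrieved history: majority label."""
    labels = [label for _, label in context]
    return max(set(labels), key=labels.count)

history = [([1.0, 0.0], "spam"), ([0.9, 0.1], "spam"), ([0.0, 1.0], "ham")]
context = retrieve([1.0, 0.0], history, k=2)   # the two spam-like examples
pred = classify([1.0, 0.0], context)
```

The key property this preserves from the paper is that only the retrieved subset is ever loaded, so the full history can live on disk.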
Results
The experiments reveal that the proposed method successfully learns to retrieve and utilize relevant historical context, leading to enhanced robustness against distribution shifts. The results indicate that the VRAM usage scales predictably with the length of the historical data sequence, allowing for effective handling of large datasets.
Implications
This work has significant implications for real-world applications in fields such as fraud detection, policy violation monitoring, and any domain where data distributions evolve over time. The ability to leverage historical data without retraining can lead to more resilient and adaptable machine learning systems.
Equivariant Multi-agent Reinforcement Learning for Multimodal Vehicle-to-Infrastructure Systems
Reinforcement Learning
Multimodal
Graph Learning
- Introduces a self-supervised multimodal learning framework for V2I systems.
- Utilizes rotation symmetries to reduce the search space in decentralized MARL.
- Implements a graph neural network for policy computation and coordination among RSUs.
- Achieves significant accuracy and performance improvements over existing methods.
Summary
This paper investigates a vehicle-to-infrastructure (V2I) system where distributed base stations (BSs) act as road-side units (RSUs) to collect multimodal data from moving vehicles. The authors address a decentralized rate maximization problem, where each RSU optimizes its resources based on local observations while collaborating with other RSUs to enhance overall network performance. The problem is reformulated as a distributed multi-agent reinforcement learning (MARL) challenge, incorporating rotation symmetries related to vehicle locations. A novel self-supervised learning framework is proposed, enabling each BS agent to align latent features from multimodal observations to determine vehicle positions. The authors employ a graph neural network (GNN) with message passing layers to train an equivariant policy network, allowing local policy computation while coordinating across agents through a signaling scheme that addresses partial observability. Numerical simulations demonstrate the effectiveness of the proposed approach, achieving over two-fold accuracy improvements compared to baseline methods and more than 50% performance gains in MARL training over traditional techniques.
Methodology
The authors recast the decentralized rate maximization problem as a distributed multi-agent Markov decision process (MMDP) with symmetries. They propose a self-supervised learning framework for multimodal data, allowing BS agents to align features from wireless channel state information (CSI) and visual data. A graph neural network (GNN) is employed to compute policies locally while coordinating actions among agents through a signaling scheme that maintains equivariance.
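The rotation symmetry being exploited can be checked numerically. This toy beam-steering policy is an invented example, not the paper's GNN, but it satisfies the equivariance property the framework enforces: rotating the observation rotates the action by the same angle.

```python
import math

def rotate(p, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def beam_policy(vehicle_pos):
    """Toy equivariant policy: steer a unit beam vector toward the vehicle."""
    norm = math.hypot(*vehicle_pos)
    return (vehicle_pos[0] / norm, vehicle_pos[1] / norm)

theta = math.pi / 3
p = (3.0, 4.0)
a1 = rotate(beam_policy(p), theta)    # act, then rotate the action
a2 = beam_policy(rotate(p, theta))    # rotate the observation, then act
# a1 and a2 agree up to floating-point error: the policy is rotation-equivariant
```

Building this property into the policy network is what shrinks the search space: symmetric situations no longer have to be learned separately.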
Results
The proposed approach demonstrated over two-fold accuracy gains compared to baseline methods and achieved more than 50% performance improvements in the MARL training process compared to standard approaches, showcasing the effectiveness of the self-supervised multimodal sensing and equivariant training.
Implications
This research has potential applications in advanced wireless communication systems, particularly in enhancing the performance of V2I networks. The findings could lead to more efficient resource management in future wireless networks, supporting applications such as autonomous vehicles and smart city infrastructure.
PD-SOVNet: A Physics-Driven Second-Order Vibration Operator Network for Estimating Wheel Polygonal Roughness from Axle-Box Vibrations
Time Series
- Introduction of PD-SOVNet, a physics-guided framework for wheel roughness estimation.
- Integration of multiple innovative components, including second-order vibration kernels and a MIMO coupling module.
- Demonstrated competitive accuracy and stability across various datasets, especially under challenging conditions.
- Highlights the importance of structured physical priors in improving regression stability.
Summary
This paper introduces PD-SOVNet, a novel physics-guided gray-box framework designed to estimate multi-order wheel polygonal roughness from axle-box vibration signals. The proposed architecture integrates shared second-order vibration kernels, a 4 × 4 MIMO coupling module, an adaptive physical correction branch, and a Mamba-based temporal branch. This combination allows the model to embed modal-response priors while maintaining flexibility for data-driven corrections and capturing residual temporal dynamics. The authors address the challenge of continuous regression of wheel roughness spectra, which has been underexplored in existing literature that primarily focuses on detection and classification tasks. The effectiveness of PD-SOVNet is validated through experiments on three real-world datasets, demonstrating competitive accuracy and stable performance across different wheels, particularly excelling in more challenging scenarios. The results indicate that incorporating structured physical priors can enhance the robustness of roughness regression in practical rail-vehicle monitoring applications, although further validation is suggested for broader operational conditions.
Methodology
The methodology involves a gray-box modeling approach that combines physics-based principles with data-driven techniques. The architecture includes shared second-order vibration kernels to capture dynamic characteristics, a 4 × 4 MIMO coupling module for multi-input multi-output processing, an adaptive correction branch for sample-dependent adjustments, and a temporal branch to handle residual dynamics. This design allows for the regression of the 1st to 40th-order wheel roughness spectrum from axle-box vibrations.
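A rough sketch of what a second-order vibration kernel computes, using a fixed damped oscillator in place of the paper's learnable kernels (natural frequency, damping, and step size are illustrative values):

```python
import math

def second_order_kernel(omega_n, zeta, dt, n):
    """Discretized impulse response of a damped second-order oscillator,
    a toy stand-in for the learnable second-order vibration kernels."""
    omega_d = omega_n * math.sqrt(1.0 - zeta ** 2)
    return [math.exp(-zeta * omega_n * k * dt) * math.sin(omega_d * k * dt) / omega_d
            for k in range(n)]

def respond(roughness, kernel):
    """Axle-box response as a convolution of roughness excitation with the kernel."""
    out = []
    for t in range(len(roughness)):
        out.append(sum(roughness[t - k] * kernel[k]
                       for k in range(min(t + 1, len(kernel)))))
    return out

kernel = second_order_kernel(omega_n=50.0, zeta=0.05, dt=0.001, n=64)
response = respond([1.0] + [0.0] * 63, kernel)   # impulse in -> the kernel back out
```

The gray-box idea is that these oscillator-shaped responses act as a physical prior, while the correction and temporal branches absorb what the prior misses.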
Results
The experiments conducted on three real-world datasets show that PD-SOVNet achieves competitive prediction accuracy and stable performance across different wheel types. The method particularly excels in the more challenging Dataset III. Additionally, noise injection tests indicate that the Mamba temporal branch effectively mitigates performance degradation under perturbed inputs.
Implications
The findings suggest that incorporating structured physical priors into machine learning models can significantly enhance the reliability of continuous regression tasks in rail-vehicle monitoring. This approach could lead to improved maintenance decision-making and health management for rail vehicles, potentially reducing operational costs and enhancing safety.
Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space
Generative Models
- FOT-CFM generalizes Conditional Flow Matching to infinite-dimensional Hilbert spaces, enhancing turbulence modeling.
- The integration of Optimal Transport theory allows for efficient and accurate generation of turbulent fields.
- The method achieves high-quality sampling with fewer function evaluations compared to traditional diffusion-based approaches.
- FOT-CFM demonstrates superior fidelity in reproducing turbulent statistics across complex chaotic systems.
Summary
This paper presents Functional Optimal Transport Conditional Flow Matching (FOT-CFM), a novel generative framework designed for high-fidelity turbulence modeling in infinite-dimensional Hilbert spaces. Traditional generative models, such as diffusion models, struggle with the inherent functional nature of turbulence data, which is better represented as continuous fields rather than discrete vectors. FOT-CFM addresses this limitation by employing Optimal Transport (OT) theory to create deterministic paths between noise and target data measures, allowing for simulation-free training and efficient sampling. The authors demonstrate that FOT-CFM can accurately reproduce complex turbulent dynamics, achieving superior performance in generating high-order turbulent statistics and energy spectra across various chaotic systems, including the Navier-Stokes equations. The method's ability to learn continuous physical operators independent of discretization meshes enables zero-shot super-resolution, significantly reducing inference latency compared to existing methods.
Methodology
The authors develop FOT-CFM by formulating conditional-to-marginal path mixing in terms of probability measures and weak continuity equations, avoiding density-based approaches. They incorporate OT theory to construct straight-line probability paths in Hilbert space, enabling simulation-free training. Neural Operators are used to parameterize the vector field, allowing for resolution-invariant generative dynamics.
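The straight-line OT path construction at the heart of flow matching is simple to sketch in finite dimensions (the paper works with function-valued samples and Neural Operators, but the training pair has the same form):

```python
def cfm_pair(x0, x1, t):
    """Straight-line conditional path used in OT-guided flow matching (sketch):
    interpolate x_t = (1 - t) * x0 + t * x1; the regression target for the
    learned vector field at (x_t, t) is the constant velocity x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, target

x0 = [0.0, 2.0]        # a "noise" sample
x1 = [4.0, 0.0]        # a "data" sample
xt, v = cfm_pair(x0, x1, t=0.5)
# xt == [2.0, 1.0], v == [4.0, -2.0]
```

Because the target velocity is available in closed form, training needs no ODE simulation, which is the simulation-free property the summary mentions.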
Results
FOT-CFM was rigorously evaluated on chaotic dynamical systems, including the Navier-Stokes equations, Kolmogorov Flow, and Hasegawa-Wakatani equations. The results indicate that FOT-CFM accurately reproduces high-order turbulent statistics and energy spectra while achieving a significant reduction in inference latency compared to baseline methods.
Implications
The proposed framework has significant implications for turbulence modeling in various scientific and engineering applications, including climate prediction, energy technologies, and fluid dynamics. Its ability to generate high-fidelity turbulence data efficiently can enhance simulations and predictive modeling in complex systems.
Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning
Large Language Models
Reinforcement Learning
Efficient ML
- Introduces Turn-Adaptive Budgets (TAB) for efficient multi-turn reasoning in LLMs.
- Models multi-turn reasoning as a multi-objective Markov Decision Process (MDP).
- Achieves up to 35% token savings while maintaining accuracy on benchmarks.
- Proposes TAB All-SubQ for systems with prior knowledge of sub-questions, saving up to 40% tokens.
Summary
This paper addresses the challenge of improving inference-time compute efficiency in multi-turn reasoning tasks involving large language models (LLMs). As LLM reasoning performance plateaus, the authors propose a novel approach called Turn-Adaptive Budgets (TAB), which formulates multi-turn reasoning as a sequential compute allocation problem modeled as a multi-objective Markov Decision Process (MDP). TAB learns to allocate computational resources adaptively based on the difficulty of each turn, maximizing task accuracy while adhering to global token constraints. The authors demonstrate that TAB can save up to 35% of tokens while maintaining or improving accuracy on mathematical reasoning benchmarks compared to static and off-the-shelf LLM baselines. Additionally, they introduce a variant called TAB All-SubQ, which utilizes prior knowledge of all sub-questions to achieve even greater efficiency, saving up to 40% of tokens. This work highlights the importance of adaptive compute allocation in multi-turn reasoning contexts, where the sequential dependency of turns complicates traditional single-turn efficiency methods.
Methodology
The authors formulate multi-turn reasoning as a sequential compute allocation problem and model it using a multi-objective Markov Decision Process (MDP). They develop the TAB policy, which is trained via Group Relative Policy Optimization (GRPO) to maximize task accuracy while respecting token constraints. The policy adapts the token budget based on conversation history and the difficulty of sub-questions.
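A hand-written stand-in for the learned budget policy, showing the per-turn allocation under a global cap; the difficulty thresholds and budget sizes are invented for illustration (the paper learns this mapping with GRPO):

```python
def allocate_budget(difficulty, remaining, budgets=(128, 512, 2048)):
    """Toy TAB-style rule: pick a thinking budget for this turn from its
    estimated difficulty (0..1), capped by the remaining global token budget."""
    want = budgets[0] if difficulty < 0.3 else budgets[1] if difficulty < 0.7 else budgets[2]
    return min(want, remaining)

def run_dialogue(difficulties, global_budget=3000):
    spent, plan = 0, []
    for d in difficulties:
        b = allocate_budget(d, global_budget - spent)
        plan.append(b)
        spent += b
    return plan, spent

plan, spent = run_dialogue([0.1, 0.9, 0.5])
# easy turn gets 128 tokens, hard turn 2048, medium 512; total stays under 3000
```

The sequential dependency the summary highlights shows up here too: overspending on an early turn shrinks what later turns can receive.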
Results
TAB demonstrates a superior accuracy-tokens tradeoff, achieving up to 35% token savings while maintaining accuracy on mathematical reasoning tasks compared to static and off-the-shelf baselines. The TAB All-SubQ variant further enhances efficiency, saving up to 40% tokens by considering the entire conversation history and all sub-questions.
Implications
The findings suggest that adaptive compute allocation can significantly enhance the efficiency of multi-turn reasoning systems, potentially leading to reduced costs and improved performance in real-world applications involving LLMs. This approach could be applied in various domains requiring complex reasoning tasks, such as education, customer support, and automated problem-solving.
Drifting Fields are not Conservative
Generative Models
Optimization
Theory
- Drift fields in generative models are generally non-conservative and cannot be expressed as gradients of scalar potentials.
- The Gaussian kernel is an exception where the drift field is conservative.
- A new normalization method using the sharp kernel restores conservatism for radial kernels.
- The drifting field matching objective is more general than scalar loss minimization but offers minimal practical advantages.
Summary
This paper investigates the properties of drifting models in generative modeling, specifically focusing on the nature of drift fields used to transport generated samples toward a target data distribution. The authors establish that drift fields are generally non-conservative, meaning they cannot be expressed as the gradient of a scalar potential due to position-dependent normalization. They identify the Gaussian kernel as a unique case where the drift field is conservative. To address the non-conservatism issue, the authors propose a new normalization method using the sharp kernel, which restores conservatism for any radial kernel and allows for well-defined loss functions for training drifting models. Although the drifting field matching objective is more general than traditional loss minimization, the practical benefits of this generality are minimal. The authors recommend using simpler loss function formulations for training drifting models, which simplifies implementation and enhances interpretability.
Methodology
The authors analyze the properties of drift fields in generative models, particularly focusing on their conservatism. They derive the conditions under which drift fields can be expressed as gradients of scalar potentials and propose a new normalization method using the sharp kernel to restore conservatism. They conduct experiments to compare the performance of the sharp normalization against traditional drift fields.
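The conservativity claim can be probed numerically: a conservative field has a symmetric Jacobian, so the cross-partial mismatch should vanish for the Gaussian kernel but not for other radial kernels. Below is a small finite-difference check on a kernel-normalized drift of the general form discussed (the data points and kernels are chosen for illustration):

```python
import math

def drift(x, data, kernel):
    """Kernel-normalized drift toward data points: a weighted pull where the
    position-dependent normalization is what generally breaks conservatism."""
    w = [kernel(math.dist(x, y)) for y in data]
    s = sum(w)
    return [sum(wi * (yi[j] - x[j]) for wi, yi in zip(w, data)) / s
            for j in range(2)]

def jacobian_asymmetry(x, data, kernel, h=1e-5):
    """Return |dV0/dx1 - dV1/dx0|; zero (up to error) iff locally conservative."""
    def d(i, j):  # numerical dV_i / dx_j by central differences
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        return (drift(xp, data, kernel)[i] - drift(xm, data, kernel)[i]) / (2 * h)
    return abs(d(0, 1) - d(1, 0))

data = [(0.0, 0.0), (2.0, 1.0), (-1.0, 3.0)]
x = (0.5, 0.7)
gauss = jacobian_asymmetry(x, data, lambda r: math.exp(-r * r / 2))
laplace = jacobian_asymmetry(x, data, lambda r: math.exp(-r))
# gauss is ~0 (conservative); laplace is clearly nonzero (non-conservative)
```

For the Gaussian kernel this drift is exactly the gradient of log Σᵢ exp(−|x − yᵢ|²/2), which is why it is the exceptional conservative case the paper identifies.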
Results
The experiments demonstrate that the sharp normalization performs comparably to the original drift field, indicating that the non-conservative aspects do not significantly contribute to performance. Additionally, sharp normalization improves tail behavior in cases where the original drift field diverges far from the data distribution.
Implications
The findings suggest that while drifting models offer flexibility in sample transport, simpler loss-based training approaches may be more effective and interpretable. This could lead to more efficient implementations of generative models in practical applications.
Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning
NLP
Large Language Models
Theory
- Transformers can learn analogical reasoning through a meta-learning approach.
- Incorporating copying tasks in training data improves generalization to new alphabets.
- The proposed model outperforms many existing large language models on letter-string analogy tasks.
- Interpretability analyses reveal the model's reasoning mechanisms.
Summary
This paper investigates the ability of transformer models to perform analogical reasoning, a cognitive process crucial for human intelligence. The authors propose a novel approach using Meta-Learning for Compositionality (MLC) to train transformers on letter-string analogy tasks. They find that incorporating copying tasks into the training data significantly enhances the models' ability to generalize to new alphabets and combinations of transformations. The study reveals that while the models excel at generalizing to new contexts, they struggle with entirely novel transformations. Through interpretability analyses, the authors uncover the underlying mechanisms of the models' reasoning processes, providing insights into how these findings can inform the development of larger models and their analogical reasoning capabilities.
Methodology
The authors developed a suite of datasets for one- and few-shot letter-string analogy problems and trained small encoder-decoder transformers using the MLC approach. They systematically evaluated the models' performance on learning tasks and generalization capabilities across various targets, including new alphabets and transformations.
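A toy generator for the kind of copy and letter-string analogy episodes described; the A:B :: C:? episode format here is an assumption for illustration, not the paper's exact data schema:

```python
import string

def successor(s, alphabet=string.ascii_lowercase):
    """Apply the 'successor' transformation: shift each letter forward by one."""
    return "".join(alphabet[(alphabet.index(c) + 1) % len(alphabet)] for c in s)

def make_episode(source, transform):
    """One letter-string analogy episode: study pair (A, transform(A)),
    then query C with answer transform(C)."""
    a, c = source
    return {"study": (a, transform(a)), "query": c, "answer": transform(c)}

copy_task = make_episode(("abc", "ijk"), lambda s: s)   # pure copying: abc:abc :: ijk:ijk
analogy = make_episode(("abc", "ijk"), successor)       # abc:bcd :: ijk:jkl
```

The paper's finding is that mixing episodes like `copy_task` into training data is what unlocks generalization to unfamiliar alphabets on episodes like `analogy`.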
Results
The results indicate that the transformer models trained with MLC can effectively solve letter-string analogies, outperforming most frontier models. They demonstrate strong generalization to new alphabets and moderate success with combinations of known transformations, but struggle with entirely new transformations.
Implications
The findings suggest that enhancing training data with copying tasks can significantly improve the analogical reasoning capabilities of AI models. This has potential implications for developing more robust AI systems that can generalize knowledge across different contexts, similar to human reasoning.
VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts
Multimodal
- Introduction of the MAFE framework for effective multimodal feature extraction.
- Development of VLMShield as a lightweight and efficient safety detector.
- Demonstration of superior performance compared to existing defense methods.
- Identification of distinct patterns in benign vs. malicious prompts.
Summary
This paper addresses the vulnerabilities of Vision-Language Models (VLMs) to malicious prompt attacks, which exploit weakened safety alignments during visual integration. The authors propose a novel framework called Multimodal Aggregated Feature Extraction (MAFE) that allows the CLIP model to handle long text inputs and fuse multimodal information into unified representations. Through empirical analysis, they identify distinct distributional patterns between benign and malicious prompts, leading to the development of VLMShield, a lightweight safety detector that operates as a plug-and-play solution. Extensive experiments show that VLMShield significantly outperforms existing defenses in terms of robustness, efficiency, and utility, achieving very low attack success rates and high benign accuracy. The work aims to enhance the safety of multimodal AI applications by providing a more secure deployment strategy against diverse malicious attacks.
Methodology
The authors developed the MAFE framework to enable the CLIP model to process long text and integrate multimodal information. They conducted empirical analyses to discover distributional patterns in features extracted by MAFE, which informed the design of VLMShield, a three-layer neural network that identifies multimodal malicious attacks efficiently.
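The detector itself is a small feed-forward network; below is a sketch with invented weights and a two-dimensional input (the real model is trained on MAFE-fused features of much higher dimension):

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi for row, bi in zip(W, b)]

def detector(features, params):
    """Three-layer feed-forward safety detector over fused multimodal features,
    ending in a sigmoid probability that the prompt is malicious."""
    h = relu(linear(features, *params[0]))
    h = relu(linear(h, *params[1]))
    logit = linear(h, *params[2])[0]
    return 1.0 / (1.0 + math.exp(-logit))

# Toy weights that flag inputs whose first feature dominates the second.
params = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
score = detector([3.0, 1.0], params)   # > 0.5 -> treated as malicious
```

The plug-and-play property follows from this shape: the detector sits in front of the VLM and only reads features, never touching the model's weights.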
Results
VLMShield achieved in-domain attack success rates as low as 0.00-0.19% and out-of-domain rates of ≤ 2.13%, with benign accuracy ranging from 96.33% to 100%. The system demonstrated robust defense capabilities against adaptive attacks, with a maximum effective attack success rate of 1.41%. Overall, it showed exceptional efficiency and robustness compared to state-of-the-art defenses.
Implications
The findings suggest that VLMShield can significantly enhance the safety of VLMs in various applications, including medical diagnosis and educational tools, by providing a reliable defense mechanism against malicious prompts. This work paves the way for more secure deployments of multimodal AI systems in real-world scenarios.
Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
Generative Models
Reinforcement Learning
Efficient ML
- Introduction of Probabilistic Language Tries (PLTs) as a unified representation for generative models.
- PLTs enable optimal lossless compression, decision policy representation, and efficient execution reuse.
- A prior-guided caching theorem shows PLTs outperform empirical-frequency caches in inference cost.
- Hybrid compression architecture achieves description lengths below Shannon entropy when the model is accurate.
Summary
This paper introduces Probabilistic Language Tries (PLTs), a novel representation that explicitly captures the prefix structure defined by generative models over sequences. PLTs assign conditional probabilities to outgoing edges, enabling them to function as optimal lossless compressors through frequency-weighted interval encoding, serve as policy representations for sequential decision-making tasks, and act as memoization indices for efficient inference query handling. The central technical contribution is a prior-guided caching theorem, demonstrating that PLTs can achieve lower expected inference costs compared to traditional empirical-frequency caches. The paper also presents a hybrid compression architecture that separates datasets into a PLT-covered majority and a sparse residual store, achieving compression below the Shannon entropy of the empirical distribution when the generative model accurately reflects the source structure. The framework is instantiated across various domains, including chess, web search, robotics, and large language model inference systems, showcasing its versatility in unifying compression, decision-making, and computational reuse.
Methodology
The methodology involves defining PLTs with conditional probabilities for each edge, implementing frequency-weighted interval encoding for compression, and developing a caching mechanism based on prior probabilities. The paper also explores a hybrid architecture that separates data into a PLT-covered majority and a sparse residual store, connecting it to Shannon entropy and Kolmogorov-style representations.
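The core PLT data structure can be sketched directly: a trie whose edges carry conditional probabilities, so any prefix path yields a sequence probability and shared prefixes are stored once (caching and interval encoding are omitted from this sketch):

```python
class PLTNode:
    """Probabilistic Language Trie node: edges map a token to a
    (conditional probability, child) pair."""
    def __init__(self):
        self.children = {}   # token -> (probability, PLTNode)

class PLT:
    def __init__(self):
        self.root = PLTNode()

    def insert(self, seq, probs):
        """Add a sequence with its per-step conditional probabilities."""
        node = self.root
        for tok, p in zip(seq, probs):
            if tok not in node.children:
                node.children[tok] = (p, PLTNode())
            node = node.children[tok][1]

    def sequence_prob(self, seq):
        """Product of conditional edge probabilities along the prefix path."""
        node, prob = self.root, 1.0
        for tok in seq:
            if tok not in node.children:
                return 0.0
            p, node = node.children[tok]
            prob *= p
        return prob

plt = PLT()
plt.insert(["the", "cat", "sat"], [0.5, 0.2, 0.9])
plt.insert(["the", "dog"], [0.5, 0.3])
# shared prefix "the" is stored once; P("the cat") = 0.5 * 0.2 = 0.1
```

Prior-guided caching would attach cached results to high-probability nodes of this trie, which is what lets expected lookup cost beat frequency-only caches.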
Results
The results demonstrate that PLTs can significantly reduce expected inference costs from O(n^2) to O(log N) for practical queries, and the hybrid compression architecture achieves lower description lengths than the Shannon entropy of the empirical distribution when the generative model is accurate.
Implications
The implications of this work suggest that PLTs can enhance the efficiency of various machine learning applications, particularly in areas requiring sequential decision-making and inference reuse. The unification of compression and decision policies under a single framework may lead to more efficient algorithms in robotics, game playing, and large language model applications.
AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
Theory
Efficient ML
Optimization
- AE-ViT integrates convolutional encoding, transformer-based latent evolution, and decoding for parametric PDE modeling.
- The model employs multi-stage parameter injection and coordinate channel injection to improve conditioning and spatial awareness.
- AE-ViT outperforms existing deep learning reduced order models and latent transformers in multi-field predictions.
- The approach reduces relative rollout error by roughly a factor of five compared to baseline methods.
Summary
The paper presents AE-ViT, a novel approach for modeling parametric partial differential equations (PDEs) using deep learning reduced order models (ROMs). Traditional methods either rely on full solution fields, which are computationally expensive, or on compressed latent representations that struggle with evolving complex spatial interactions. AE-ViT addresses these challenges by integrating a convolutional encoder, a transformer for latent representation evolution, and a decoder for reconstruction. The model introduces multi-stage parameter injection and coordinate channel injection, allowing it to adaptively condition its computations based on the specific parameters of the PDEs. This approach enhances the model's ability to predict multiple solution components with varying magnitudes and sensitivities. The authors demonstrate AE-ViT's effectiveness through experiments on the Advection-Diffusion-Reaction equation and Navier-Stokes flow, showing a significant reduction in relative rollout error compared to existing methods, thus combining the efficiency of latent evolution with the fidelity of full-field models.
Methodology
The methodology involves a joint model architecture comprising a convolutional encoder to capture spatial features, a transformer to evolve latent representations, and a decoder for reconstructing the solution. The model incorporates multi-stage parameter injection and coordinate channel injection to enhance its adaptability to varying PDE parameters. The training process emphasizes autoregressive stability and joint learning of multiple solution components.
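The encode-evolve-decode pattern described above can be sketched generically; the components below are toy placeholders (a scalar "latent" that decays each step), not AE-ViT's actual modules:

```python
def rollout(encode, step, decode, u0, n_steps):
    """Autoregressive latent rollout: encode the initial field once, evolve the
    latent state repeatedly, decode each step back to physical space."""
    z = encode(u0)
    trajectory = []
    for _ in range(n_steps):
        z = step(z)  # in AE-ViT the transformer would evolve the latent here
        trajectory.append(decode(z))
    return trajectory

# Toy components: sum-pooling "encoder", halving "dynamics", singleton "decoder".
states = rollout(encode=sum, step=lambda z: 0.5 * z,
                 decode=lambda z: [z], u0=[2.0, 2.0], n_steps=3)
print(states)  # [[2.0], [1.0], [0.5]]
```

Training for autoregressive stability amounts to penalizing error accumulated across such multi-step rollouts rather than single-step predictions.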
Results
Experiments on the Advection-Diffusion-Reaction equation and Navier-Stokes flow indicate that AE-ViT reduces relative rollout error by a factor of approximately five compared to existing deep learning reduced order models, latent transformers, and plain vision transformers, demonstrating superior performance in multi-field prediction tasks.
Implications
The AE-ViT model has significant implications for applications requiring efficient and accurate simulations of parametric PDEs, such as in fields like hemodynamics and aerodynamics. Its ability to handle high-dimensional data and adapt to varying parameters could enhance the modeling of complex physical systems.
TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning
Reinforcement Learning
- TwinLoop leverages digital twins for accelerated policy adaptation in multi-agent systems.
- The framework enables cost-free exploration through simulation before applying changes in the real environment.
- Evaluation in vehicular edge computing scenarios shows significant improvements in adaptation efficiency.
- TwinLoop reduces the need for costly trial-and-error interactions in dynamic environments.
TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning
Summary
The paper introduces TwinLoop, a novel framework that integrates digital twins (DTs) into online multi-agent reinforcement learning (MARL) to enhance adaptation efficiency in dynamic environments. Traditional decentralized online learning methods often struggle with costly trial-and-error interactions when environmental conditions shift, leading to performance degradation. TwinLoop addresses this challenge by utilizing a simulation-in-the-loop approach, where a digital twin is activated to reconstruct the current system state and perform accelerated policy improvement through simulation-based what-if analysis. This allows agents to rehearse adaptations in a virtual environment before applying changes in the physical system. The framework is evaluated in a vehicular edge computing task-offloading scenario, demonstrating that TwinLoop significantly improves adaptation efficiency and reduces reliance on expensive real-world exploration. The findings suggest that digital twins can effectively facilitate policy rehearsal and enhance the robustness of multi-agent systems in fluctuating conditions.
Methodology
The TwinLoop framework employs a digital twin that mirrors the physical system's state and agent policies. Upon detecting a context shift, the digital twin is triggered to simulate various scenarios and perform what-if analyses to refine agent policies. The updated policies are then synchronized back to the physical agents, allowing for rapid adaptation without the need for extensive real-world trials.
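The control flow of that adaptation step can be sketched in a few lines (a schematic of the idea, not the paper's code; `twin_rollout` stands in for the digital twin's what-if simulation):

```python
def twinloop_adapt(policy, candidates, twin_rollout, shift_detected):
    """On a detected context shift, rehearse candidate policies in the digital
    twin and deploy the highest-scoring one; otherwise keep the current policy,
    avoiding any real-world trial-and-error."""
    if not shift_detected:
        return policy
    return max(candidates, key=twin_rollout)

# Toy twin: reward is higher the closer a (scalar) policy is to the new optimum 5.
best = twinloop_adapt(policy=1, candidates=[1, 5, 9],
                      twin_rollout=lambda c: -abs(c - 5), shift_detected=True)
print(best)  # 5
```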
Results
The evaluation of TwinLoop in a vehicular edge computing task-offloading scenario revealed that the framework significantly enhances post-shift adaptation efficiency. Agents using TwinLoop demonstrated improved performance and reduced reliance on costly real-world trial-and-error interactions, indicating a successful application of digital twins in multi-agent reinforcement learning.
Implications
The findings suggest that integrating digital twins into online learning frameworks can provide a robust mechanism for real-time adaptation in multi-agent systems, particularly in dynamic environments such as vehicular networks. This approach could be applied to various domains requiring adaptive decision-making under uncertainty, including robotics, smart cities, and autonomous systems.
Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game
Reinforcement Learning
- Successful adaptation of AlphaZero's self-play framework to Tablut, an asymmetric board game.
- Implementation of separate policy and value heads for each player to address the game's unique dynamics.
- Challenges of catastrophic forgetting were mitigated through data augmentation and an increased replay buffer.
- The model achieved a BayesElo rating of 1235, indicating steady improvement in performance over iterations.
Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game
Summary
This paper explores the adaptation of the AlphaZero reinforcement learning framework to Tablut, an asymmetric board game. The authors investigate whether the self-play methodology of AlphaZero can be effectively transferred to Tablut, which features distinct roles and objectives for players. The key modification involves implementing separate policy and value heads for each player to accommodate the game's asymmetry. The authors replicate the neural network architecture of AlphaZero while reducing its complexity to fit Tablut's requirements. They address challenges such as catastrophic forgetting during self-play and employ techniques like C4 data augmentation and an increased replay buffer to stabilize training. The model was trained over 100 iterations, achieving a BayesElo rating of 1235, indicating significant improvement in performance. The results suggest that while the attacker's strategies became more effective, the defender's strategies were harder to learn, highlighting the complexities of training in asymmetric games.
Methodology
The authors utilized a neural network architecture similar to AlphaZero, with modifications for Tablut's asymmetry, including separate policy and value heads for each player. Training involved MCTS with the GumbelMuZero variant, and techniques like C4 data augmentation were applied to stabilize learning. The model was implemented in JAX and trained on NVIDIA GPUs over 100 self-play iterations.
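The shared-trunk-with-per-player-heads structure can be illustrated with a toy sketch (everything below is a stand-in; real heads are network layers, not one-line functions):

```python
# Toy sketch of evaluation with separate policy/value heads per player, the
# asymmetry modification described above. All components are placeholders.

def shared_trunk(state):
    return sum(state) / len(state)  # stand-in feature extractor

heads = {
    "attacker": (lambda f: [f, -f], lambda f: max(-1.0, min(1.0, f))),
    "defender": (lambda f: [-f, f], lambda f: max(-1.0, min(1.0, -f))),
}

def evaluate(state, player):
    """Run the shared trunk, then route features to the head matching the
    player's role, yielding role-specific policy logits and value."""
    features = shared_trunk(state)
    policy_head, value_head = heads[player]
    return policy_head(features), value_head(features)

print(evaluate([1.0, 0.0], "attacker"))  # ([0.5, -0.5], 0.5)
```

The point of the separation is that the same board features can map to opposite evaluations depending on which role is to move.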
Results
The model reached a BayesElo rating of 1235 after 100 iterations, with a decrease in policy entropy and an increase in the attacker's win rate to 86% by the end of training. The defender's win rate declined to 52%, indicating a disparity in learning effectiveness between the two roles.
Implications
This research demonstrates the potential for applying reinforcement learning frameworks to asymmetric games, highlighting the need for tailored approaches in such contexts. The findings could inform future developments in game AI and self-play methodologies, particularly for games with distinct player roles.
Weighted Bayesian Conformal Prediction
Theory
- WBCP generalizes BQ-CP to importance-weighted settings, addressing the limitations of i.i.d. assumptions.
- The method replaces uniform Dirichlet distributions with weighted Dirichlet distributions for better threshold estimation.
- Theoretical results confirm calibration consistency and improved posterior concentration rates.
- WBCP is instantiated for spatial prediction, yielding interpretable diagnostics and effective sample size maps.
Weighted Bayesian Conformal Prediction
Summary
This paper introduces Weighted Bayesian Conformal Prediction (WBCP), a novel method that extends Bayesian Quadrature Conformal Prediction (BQ-CP) to handle distribution shifts through importance weighting. While BQ-CP provides data-conditional guarantees using Dirichlet posteriors, it relies on the assumption of independent and identically distributed (i.i.d.) data, which limits its applicability in real-world scenarios where data may not be i.i.d. On the other hand, traditional weighted conformal prediction addresses distribution shifts but lacks the Bayesian framework and does not account for meta-uncertainty regarding threshold reliability. WBCP bridges this gap by replacing the uniform Dirichlet distribution with a weighted Dirichlet distribution based on effective sample size and importance weights, allowing for a full posterior distribution over thresholds. The authors prove several theoretical results regarding the calibration consistency and posterior concentration of WBCP. They also demonstrate its application in spatial prediction, termed Geographical BQ-CP, which provides interpretable diagnostics. Experimental results on synthetic and real-world datasets show that WBCP maintains coverage guarantees while offering richer uncertainty information compared to existing methods.
Methodology
WBCP employs a weighted Dirichlet distribution to model the uncertainty in threshold estimation, utilizing importance weights derived from the effective sample size. The method builds on the principles of Bayesian Quadrature and weighted conformal prediction, allowing for a posterior distribution over thresholds rather than a single point estimate.
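The effective sample size that sets the concentration of the weighted Dirichlet can be computed with the standard Kish formula (the function name below is ours):

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2. Uniform weights recover
    the nominal n; skewed importance weights shrink it, which widens the WBCP
    threshold posterior accordingly."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

print(effective_sample_size([1.0] * 100))      # 100.0
print(effective_sample_size([1.0, 1.0, 8.0]))  # ~1.52: one weight dominates
```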
Results
The authors prove four key theoretical results: (1) the effective sample size is a unique concentration parameter aligning frequentist and Bayesian variances; (2) the posterior standard deviation decreases at a rate of O(1/√n_eff); (3) the stochastic dominance guarantee of BQ-CP extends to weighted settings; and (4) the highest posterior density (HPD) threshold improves conditional coverage by O(1/√n_eff). Experimental results demonstrate that WBCP maintains coverage guarantees while providing richer uncertainty information across various datasets.
Implications
WBCP has significant implications for fields requiring robust uncertainty quantification under distribution shifts, such as spatial prediction, domain adaptation, and high-stakes decision-making scenarios. Its ability to provide a full posterior distribution over thresholds enhances the reliability of prediction intervals, making it suitable for applications in finance, healthcare, and environmental modeling.
The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model
Theory
- Introduces the UNDO Flip-Flop task to evaluate reversible semantic state management.
- Demonstrates that existing models struggle to learn stack-based rollback mechanisms.
- Finds that models converge on a heuristic that fails under adversarial conditions.
- Establishes a distinction between theoretical expressibility and practical learnability.
The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model
Summary
This paper introduces the UNDO Flip-Flop task, which extends the standard Flip-Flop task to evaluate reversible semantic state management in state space models (SSMs). While existing benchmarks focus on monotonic state tracking or structural nesting, they do not address the ability to retrieve historical states under non-monotonic updates. The UNDO Flip-Flop requires models to maintain an implicit bounded stack and recover previous states, thus probing the limitations of gradient-based optimization in learning such mechanisms. The study evaluates the Mamba-2 architecture, both in one-layer and two-layer configurations, and finds that neither successfully acquires the necessary stack-based rollback mechanism. Instead, both configurations resort to a local toggle heuristic that merely inverts the current state rather than retrieving stored history. Under adversarial conditions, the two-layer model performs poorly, achieving only 41.10% accuracy, which is below random chance. Causal ablation indicates that the primary bottleneck lies in retrieval capabilities rather than storage. This work highlights the distinction between theoretical expressibility of architectures and what can be reliably learned through optimization, suggesting that gradient descent may not effectively capture complex state retrieval tasks.
Methodology
The study employs the UNDO Flip-Flop task to assess the performance of the Mamba-2 architecture in one-layer and two-layer configurations. It conducts evaluations under standard and adversarial conditions to analyze the models' ability to retrieve historical states. Causal ablation is used to identify the bottleneck in the models' performance.
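A toy version of the task's ground truth makes the reported failure mode concrete (our own construction, following the task description): targets come from a bounded stack, while the "local toggle" heuristic merely flips the current state.

```python
def undo_flipflop_targets(ops, depth=4):
    """Ground-truth states for a toy UNDO stream: 'w0'/'w1' push a bit, 'undo'
    pops back to the previous state; the implicit stack is bounded at `depth`."""
    stack = [0]
    out = []
    for op in ops:
        if op == "undo":
            if len(stack) > 1:
                stack.pop()
        else:  # 'w0' or 'w1'
            stack.append(int(op[1]))
            stack = stack[-depth:]
        out.append(stack[-1])
    return out

# After two identical writes, undo should restore 1; a toggle heuristic that
# just inverts the current state would wrongly output 0 here.
print(undo_flipflop_targets(["w1", "w1", "undo"]))  # [1, 1, 1]
```

Adversarial evaluation amounts to concentrating exactly these repeated-write-then-undo patterns, where toggling and true retrieval disagree.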
Results
Both one-layer and two-layer Mamba-2 models fail to learn the required stack-based rollback mechanism, instead relying on a local toggle heuristic. The two-layer model achieves only 41.10% accuracy under adversarial retraction pressure, indicating a systematic failure in retrieval capabilities.
Implications
The findings suggest that while SSMs may theoretically represent complex state management tasks, gradient descent optimization may not effectively learn these capabilities. This has implications for the design of future models and training algorithms aimed at improving state retrieval in sequential tasks.
Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation
Generative Models
Time Series
- Introduces a temporal extension of TabDDPM for generating synthetic time-series data.
- Incorporates lightweight temporal adapters and context-aware embeddings to model temporal dependencies.
- Demonstrates improved temporal coherence and diversity in synthetic data compared to baseline methods.
- Achieves competitive classification performance on the WISDM dataset, addressing data imbalance issues.
Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation
Summary
This paper addresses the limitations of existing Tabular Denoising Diffusion Probabilistic Models (TabDDPM) in generating synthetic time-series data, particularly for Human Activity Recognition (HAR) tasks. While TabDDPM effectively generates high-quality synthetic data from heterogeneous tabular datasets, it assumes independence between samples, which is inadequate for time-series data where temporal dependencies are crucial. The authors propose a temporal extension of TabDDPM that incorporates sequence awareness through lightweight temporal adapters and context-aware embedding modules. By reformulating sensor data into windowed sequences and modeling temporal context with timestep embeddings, conditional activity labels, and observed/missing masks, the proposed method generates temporally coherent synthetic sequences. Validation on the WISDM accelerometer dataset demonstrates that the new approach produces synthetic time-series data that closely resembles real-world sensor patterns, achieving a macro F1-score of 0.64 and an accuracy of 0.71. This method not only enhances temporal realism and diversity but also addresses issues of minority class representation and statistical alignment with real distributions. The findings suggest that diffusion-based models can be effectively adapted for sequential data synthesis, paving the way for future research to explore longer sequences and stronger temporal architectures.
Methodology
The authors modify the TabDDPM framework by integrating temporal embeddings, conditional context, and an observed value mask. They reformulate sensor data into windowed sequences to capture temporal dependencies and utilize bigram transition matrices and autocorrelation analysis for validation.
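One of the validation tools mentioned, the bigram transition matrix over activity labels, is straightforward to sketch (our minimal version):

```python
from collections import Counter

def bigram_transition_matrix(labels):
    """Row-normalized counts of consecutive label transitions; comparing the
    matrices of real vs. synthetic sequences gauges temporal coherence."""
    states = sorted(set(labels))
    counts = Counter(zip(labels, labels[1:]))
    matrix = {}
    for a in states:
        total = sum(counts[(a, b)] for b in states)
        matrix[a] = {b: (counts[(a, b)] / total if total else 0.0)
                     for b in states}
    return matrix

m = bigram_transition_matrix(["walk", "walk", "sit", "walk"])
print(m["walk"]["sit"])  # 0.5
```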
Results
The proposed method generates synthetic time-series data with a macro F1-score of 0.64 and accuracy of 0.71 on the WISDM dataset. It shows enhanced temporal coherence and realism compared to original TabDDPM, as evidenced by bigram transition matrices and autocorrelation metrics.
Implications
The findings indicate that diffusion-based models can effectively generate synthetic time-series data, which is crucial for applications in activity recognition, health monitoring, and privacy-preserving data augmentation. This approach can help mitigate data scarcity and imbalance in HAR research.
Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation
Generative Models
Theory
Efficient ML
- Jeffreys Flow mitigates mode collapse in Boltzmann generators by using symmetric Jeffreys divergence.
- The framework distills empirical data from Parallel Tempering to enhance sampling accuracy.
- Theoretical results confirm that Jeffreys Flow generates distributions closer to the target than empirical samples.
- Demonstrated scalability and accuracy on high-dimensional benchmarks.
Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation
Summary
The paper introduces Jeffreys Flow, a novel generative framework designed to enhance rare event sampling in physical systems characterized by complex energy landscapes. Traditional Boltzmann generators often suffer from mode collapse due to their reliance on reverse Kullback–Leibler divergence, which can lead to the omission of significant modes in multi-modal distributions. Jeffreys Flow addresses this issue by employing the symmetric Jeffreys divergence as its loss function, allowing for a more balanced approach between local precision and global mode coverage. The framework distills empirical sampling data from Parallel Tempering (PT) trajectories, effectively mitigating mode collapse and correcting inaccuracies in the sampling process. The authors demonstrate the scalability and accuracy of Jeffreys Flow on challenging multi-dimensional benchmarks, including applications in Replica Exchange Stochastic Gradient Langevin Dynamics and Path Integral Monte Carlo for quantum thermal states. Theoretical guarantees are provided to support the effectiveness of the method, highlighting its potential to significantly improve sampling efficiency in complex systems.
Methodology
The Jeffreys Flow framework utilizes the symmetric Jeffreys divergence as a loss function to train normalizing flows, distilling knowledge from Parallel Tempering samples. This approach allows the model to learn from both local and global features of the target distribution, effectively addressing the challenges posed by rare events and mode collapse.
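For discrete distributions, the divergence at the heart of the framework is just the sum of the two KL directions (a minimal sketch; the actual training objective is estimated over flow and Parallel Tempering samples):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(p, q):
    """Symmetric Jeffreys divergence: forward plus reverse KL, so neither the
    mode-seeking nor the mode-covering behavior dominates."""
    return kl(p, q) + kl(q, p)

p, q = [0.5, 0.5], [0.9, 0.1]
print(jeffreys(p, q) == jeffreys(q, p))  # True: symmetric by construction
```

The symmetry is what counteracts mode collapse: the forward-KL term keeps probability mass on every mode the Parallel Tempering samples visit.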
Results
The numerical tests show that Jeffreys Flow outperforms traditional Boltzmann generators in sampling from multi-modal distributions, significantly reducing the incidence of mode collapse. The method also demonstrates improved accuracy in applications such as Replica Exchange Stochastic Gradient Langevin Dynamics and Path Integral Monte Carlo, showcasing its effectiveness in complex sampling scenarios.
Implications
Jeffreys Flow has the potential to revolutionize rare event sampling in various fields, including statistical mechanics and computational physics, by providing a more reliable and efficient method for generating samples from complex distributions. Its ability to correct inaccuracies and avoid mode collapse could lead to advancements in simulations of physical systems and other applications requiring robust sampling techniques.
A Theory-guided Weighted $L^2$ Loss for solving the BGK model via Physics-informed neural networks
Theory
Optimization
- The standard L2 PINN loss is insufficient for ensuring accuracy in the BGK model.
- A new velocity-weighted L2 loss function is proposed to effectively penalize high-velocity errors.
- The paper establishes a rigorous stability estimate for the weighted loss, ensuring convergence.
- Numerical experiments demonstrate superior performance of the weighted loss over the standard approach.
A Theory-guided Weighted $L^2$ Loss for solving the BGK model via Physics-informed neural networks
Summary
This paper addresses the limitations of the standard L2 loss function when applied to the Bhatnagar-Gross-Krook (BGK) model in the context of Physics-Informed Neural Networks (PINNs). The authors demonstrate that minimizing the standard L2 loss does not guarantee accurate predictions of macroscopic moments, which are critical for capturing the true physical solutions in kinetic equations. To overcome this challenge, they propose a velocity-weighted L2 loss function that penalizes errors in high-velocity regions more effectively. The paper establishes a stability estimate for this new loss function, proving that minimizing it ensures convergence of the approximate solution to the true solution. Numerical experiments validate the effectiveness of the weighted loss, showing improved accuracy and robustness compared to the standard approach across various benchmarks. The findings highlight the necessity of tailored loss functions in ensuring the reliability of solutions derived from PINNs, particularly for complex kinetic models like the BGK model.
Methodology
The authors introduce a velocity-weighted L2 loss function that modifies the standard PINN loss by incorporating a velocity-dependent weighting scheme. They rigorously analyze the stability of this new loss function and provide theoretical guarantees for convergence. Numerical experiments are conducted to compare the performance of the weighted loss against the standard L2 loss across various benchmarks.
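The shape of such a loss is easy to sketch; the polynomial weight below is our illustrative choice, not necessarily the paper's exact weighting scheme:

```python
def weighted_l2_loss(residuals, velocities, k=2):
    """Velocity-weighted squared loss: the weight (1 + v^2)**k grows with |v|,
    so residuals in high-velocity regions are penalized more than under the
    plain L2 loss (which is the k = 0 special case)."""
    n = len(residuals)
    return sum(((1.0 + v * v) ** k) * r * r
               for r, v in zip(residuals, velocities)) / n

# The same residual costs far more at v = 3 than at v = 0.
print(weighted_l2_loss([0.1], [0.0]), weighted_l2_loss([0.1], [3.0]))
```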
Results
The proposed weighted PINN loss function significantly improves the accuracy and robustness of solutions for the BGK model compared to the standard L2 loss. The theoretical analysis confirms that minimizing the weighted loss leads to convergence of the approximate solution to the true solution, addressing the shortcomings of the standard approach.
Implications
This work has implications for the application of PINNs in solving complex kinetic equations, particularly in fields such as aerodynamics and gas flow modeling. The findings suggest that tailored loss functions can enhance the reliability of neural network-based solutions in physics-informed contexts.
$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models
NLP
Large Language Models
Generative Models
- S3 improves upon naive best-of-K sampling by reallocating compute during the denoising process.
- The method utilizes a verifier-guided search to enhance output quality without retraining the model.
- S3 achieves significant performance gains on benchmarks like MATH-500 and GSM8K.
- The approach maintains diversity in candidate outputs while favoring high-quality results.
$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models
Summary
This paper introduces S3 (Stratified Scaling Search), a novel method aimed at enhancing the performance of diffusion language models (DLMs) during test-time scaling without requiring additional training. The authors identify a critical limitation in existing best-of-K sampling techniques, which often fail to align high-probability regions of the model's output distribution with high-quality outputs. S3 addresses this issue by reallocating computational resources during the denoising process, rather than solely at the final output stage. The method involves expanding multiple candidate trajectories at each denoising step, evaluating them using a lightweight verifier that does not require ground-truth labels, and selectively resampling promising candidates to maintain diversity. Experimental results demonstrate that S3 significantly improves performance across various benchmarks, particularly in mathematical reasoning tasks, while keeping the underlying model and decoding schedule unchanged. The findings suggest that classical search methods over denoising trajectories can effectively facilitate test-time scaling in DLMs.
Methodology
The S3 method employs a verifier-guided particle search over denoising trajectories, where multiple candidate outputs are generated at each denoising step. A lightweight verifier evaluates these candidates, allowing the method to selectively resample the most promising ones while preserving diversity. This approach approximates a reward-tilted sampling distribution that prioritizes higher-quality outputs while remaining anchored to the model's prior.
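A deliberately tiny version of verifier-guided particle search over denoising steps (our schematic of the pattern, not the authors' implementation):

```python
import random

def particle_search(init, denoise_step, verifier, steps, k, seed=0):
    """Keep k candidate trajectories; at each denoising step, advance all of
    them, then resample in proportion to verifier scores so promising
    candidates multiply while some diversity survives."""
    rng = random.Random(seed)
    particles = [init() for _ in range(k)]
    for _ in range(steps):
        particles = [denoise_step(p, rng) for p in particles]
        scores = [max(verifier(p), 1e-9) for p in particles]  # keep weights positive
        particles = rng.choices(particles, weights=scores, k=k)
    return max(particles, key=verifier)

# Toy setting: "denoising" pulls a scalar halfway toward 3; the verifier rewards
# proximity to 3, so the search converges there.
best = particle_search(init=lambda: 0.0,
                       denoise_step=lambda p, rng: p + 0.5 * (3.0 - p),
                       verifier=lambda p: 1.0 / (1.0 + (p - 3.0) ** 2),
                       steps=30, k=8)
print(round(best, 6))  # 3.0
```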
Results
S3 demonstrated notable improvements in accuracy across several benchmarks: MATH-500 improved from 25.60% to 30.20%, GSM8K from 68.16% to 70.21%, and TruthfulQA from 46.49% to 49.57%. The method also achieved competitive results on ARC-Challenge, increasing performance from 76.11% to 77.86%. These results indicate that S3 effectively enhances the quality of outputs generated by DLMs.
Implications
The introduction of S3 suggests that existing DLMs can achieve better performance through strategic allocation of computational resources during inference, potentially leading to more efficient and effective applications in natural language processing tasks, especially those requiring complex reasoning.
BiScale-GTR: Fragment-Aware Graph Transformers for Multi-Scale Molecular Representation Learning
Graph Learning
- BiScale-GTR combines GNN-based local encoding with Transformer-based global reasoning for molecular representation learning.
- The framework employs a graph-based BPE tokenizer to ensure consistent and chemically valid fragment tokenization.
- It captures both atom-level and fragment-level structures, enhancing the model's ability to learn long-range dependencies.
- Experiments show that BiScale-GTR achieves state-of-the-art performance on multiple molecular property prediction benchmarks.
BiScale-GTR: Fragment-Aware Graph Transformers for Multi-Scale Molecular Representation Learning
Summary
The paper introduces BiScale-GTR, a novel framework for self-supervised molecular representation learning that integrates graph neural networks (GNNs) with Transformers to enhance molecular property prediction. Traditional GNNs often struggle with long-range dependencies due to over-smoothing and over-squashing issues, while existing hybrid models primarily focus on atom-level representations, limiting their ability to capture higher-level structural interactions. BiScale-GTR addresses these challenges by employing a fragment-aware approach that utilizes a graph-based Byte Pair Encoding (BPE) tokenizer to generate consistent and chemically valid fragment tokens. These tokens serve as inputs to a parallel GNN-Transformer architecture, allowing the model to learn both local chemical environments and long-range molecular dependencies. The framework demonstrates state-of-the-art performance on various benchmarks, including MoleculeNet and PharmaBench, and provides interpretable insights into the relationship between molecular structure and predicted properties through attribution analysis.
Methodology
BiScale-GTR utilizes a hybrid architecture that first encodes atom-level information using a GNN, which is then aggregated into fragment-level embeddings. These embeddings are fused with fragment token embeddings before being processed by Transformer layers, enabling the model to reason across multiple scales. The graph-based BPE tokenizer is designed to maintain chemical validity and consistency in fragment identification.
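The aggregation step between the two scales can be sketched as simple pooling (mean pooling is our simplification; the paper's aggregation may differ):

```python
def fragment_embeddings(atom_embs, fragments):
    """Pool atom-level embeddings into one embedding per fragment by averaging;
    `fragments` lists the atom indices belonging to each fragment token. The
    pooled vectors would then be fused with fragment token embeddings before
    the Transformer layers."""
    pooled = []
    for atom_ids in fragments:
        vecs = [atom_embs[i] for i in atom_ids]
        dim = len(vecs[0])
        pooled.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
    return pooled

atoms = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
print(fragment_embeddings(atoms, [[0, 1], [2]]))  # [[2.0, 0.0], [0.0, 2.0]]
```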
Results
The proposed framework achieved state-of-the-art results on classification and regression tasks across several benchmarks, including MoleculeNet, PharmaBench, and the Long Range Graph Benchmark (LRGB). Attribution analysis revealed that the model effectively highlights chemically meaningful motifs, linking molecular structures to predicted properties.
Implications
BiScale-GTR has significant implications for drug discovery and materials science, as it enhances the predictive capabilities of machine learning models in understanding molecular properties. The ability to provide interpretable insights into molecular structures can facilitate the design of new compounds and materials.
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
Large Language Models
Optimization
Efficient ML
- LoRA fine-tuning on a small dataset outperforms broader fine-tuning approaches.
- Sophia optimizer provides faster convergence but with marginal final performance differences compared to Adam.
- Fourier-based regularization significantly enhances cross-lingual transfer capabilities.
- The approach demonstrates practical strategies for deploying multilingual code-generation models.
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
Summary
The paper presents FLeX, a novel approach to enhance cross-lingual code generation by fine-tuning large language models (LLMs) using parameter-efficient techniques. The study focuses on the Code Llama 7B model and investigates the effectiveness of Low-Rank Adaptation (LoRA) combined with different optimizers, specifically Adam and Sophia, alongside a new Fourier-based regularization method. The research demonstrates that fine-tuning with LoRA on a high-quality Python dataset (MBPP) can surpass the performance of a more broadly fine-tuned model, achieving a pass@1 score of 40.1% compared to 38.4%. Additionally, while Sophia optimizer shows faster convergence, the final performance differences are marginal. The introduction of Fourier-based regularization significantly boosts cross-lingual transfer, achieving a remarkable 42.1% pass@1 score on Java tasks, compared to a baseline of 34.2%. These findings suggest that integrating LoRA, optimized training methods, and frequency-domain regularization can effectively adapt single-language LLMs for multilingual code generation in computationally constrained environments.
Methodology
The study employs Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of the Code Llama 7B model, comparing the performance of Adam and Sophia optimizers. A novel Fourier-based regularization technique is introduced to improve cross-lingual transfer, preserving low-frequency parameter updates during training.
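The idea of penalizing high-frequency content in a parameter update can be sketched with a plain DFT (our illustrative regularizer; the paper's exact formulation may differ):

```python
import cmath

def high_freq_penalty(delta, keep):
    """Energy of a parameter-update vector outside its `keep` lowest DFT
    frequencies; adding this term to the loss biases training toward smooth,
    low-frequency updates."""
    n = len(delta)
    spectrum = [sum(delta[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n))
                for f in range(n)]
    return sum(abs(spectrum[f]) ** 2 for f in range(keep, n)) / n

# A constant update is all low-frequency; an alternating one is all high-frequency.
print(high_freq_penalty([1.0, 1.0, 1.0, 1.0], keep=1))    # ~0.0
print(high_freq_penalty([1.0, -1.0, 1.0, -1.0], keep=1))  # ~4.0
```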
Results
The LoRA fine-tuned model achieved a pass@1 score of 40.1% on Python tasks, surpassing the Code Llama-Python-7B model's score of 38.4%. The Fourier-based regularization improved Java task performance to 42.1% pass@1, significantly higher than the baseline of 34.2%.
Implications
The findings provide a pathway for efficiently adapting LLMs for multilingual code generation, which is crucial for enterprise environments that rely on multiple programming languages. This research could lead to more reliable AI agents capable of maintaining infrastructure across heterogeneous systems.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
Generative Models
Optimization
Multimodal
- PromptEvolver is the first text-level evolutionary framework for prompt inversion, using image-aware VLM operators.
- The method operates without requiring access to model internals, making it applicable to both open-source and black-box models.
- PromptEvolver achieves state-of-the-art results in prompt inversion, with up to 7.8% improvement in image reconstruction scores compared to existing baselines.
- The genetic algorithm used in PromptEvolver promotes diversity in prompt generation, reducing the risk of local optima.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
Summary
The paper introduces PromptEvolver, a novel approach to prompt inversion in text-to-image (T2I) generation that utilizes evolutionary optimization in the natural-language space. The authors highlight the challenges of existing methods, which often yield suboptimal and hard-to-interpret prompts, making them unsuitable for practical applications. PromptEvolver employs a genetic algorithm to optimize prompts, leveraging a vision-language model (VLM) to guide the evolution process. This method operates in the space of human-readable text, allowing for greater transparency and controllability. The authors demonstrate that PromptEvolver consistently outperforms existing methods across multiple benchmarks, achieving significant improvements in image reconstruction fidelity. The paper emphasizes the importance of generating natural-language prompts that can be easily understood and edited by users, thereby enhancing the usability of T2I models.
Methodology
PromptEvolver employs a genetic algorithm for prompt inversion, utilizing a vision-language model to generate diverse initial prompts and perform crossover and mutation operations. This evolutionary optimization occurs entirely in the natural-language space, allowing for the generation of human-readable prompts without requiring access to the internal workings of the T2I model.
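The evolutionary loop itself is standard and small; below, toy word-list operators stand in for the VLM-driven crossover, mutation, and image-similarity fitness (everything here is illustrative):

```python
import random

def evolve(pool, fitness, crossover, mutate, generations, seed=0):
    """Elitist genetic loop: keep the top half each generation and refill the
    pool with mutated crossovers of elite parents, so the best prompt found
    never regresses."""
    rng = random.Random(seed)
    pool = list(pool)
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        half = len(pool) // 2
        next_pool = pool[:half]
        while len(next_pool) < len(pool):
            a, b = rng.sample(pool[:half], 2)
            next_pool.append(mutate(crossover(a, b, rng), rng))
        pool = next_pool
    return max(pool, key=fitness)

# Toy stand-ins: prompts are word tuples, fitness counts target words hit.
VOCAB = ["a", "red", "fox", "runs", "fast", "blue", "cat"]
TARGET = {"red", "fox", "runs"}
fitness = lambda prompt: len(set(prompt) & TARGET)
crossover = lambda a, b, rng: a[:rng.randrange(1, len(a))] + b[rng.randrange(1, len(a)):]
mutate = lambda p, rng: (lambda i: p[:i] + (rng.choice(VOCAB),) + p[i + 1:])(rng.randrange(len(p)))

start = [("a", "blue", "cat"), ("a", "red", "cat"),
         ("fox", "blue", "cat"), ("a", "blue", "runs")]
best = evolve(start, fitness, crossover, mutate, generations=25)
print(fitness(best) >= max(fitness(p) for p in start))  # True: elitism never regresses
```

Note the crossover here can change prompt length; in the real system the VLM is what keeps offspring fluent and human-readable.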
Results
The evaluation of PromptEvolver across multiple prompt inversion benchmarks shows that it consistently outperforms competing methods, achieving up to a 7.8% improvement in image reconstruction scores. The method demonstrates robustness in capturing fine details and complex concepts in the generated images.
Implications
PromptEvolver has significant implications for creative workflows in T2I generation, enabling users to efficiently generate and modify prompts for desired visual outputs. Additionally, it enhances the understanding and auditing of generative models by providing interpretable prompts that explain the generated images.
A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data
Time Series
Theory
Interpretability
- Introduces a hybrid framework combining symbolic regression and Gaussian processes.
- Successfully identifies both symbolic and stochastic components of dynamical systems.
- Demonstrates data efficiency, requiring only 10²–10³ data points.
- Validates the approach on both numerical benchmarks and experimental biological systems.
Read more
A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data
Summary
This paper presents a novel machine learning framework designed to uncover stochastic nonlinear dynamics from noisy data. The authors address the challenge of modeling real-world systems, which often exhibit noise due to various unpredictable factors. Traditional symbolic regression methods can identify governing equations but typically overlook uncertainty, while Gaussian processes provide uncertainty quantification but lack insights into the underlying dynamics. The proposed hybrid framework integrates deep symbolic regression with Gaussian process-based maximum likelihood estimation, allowing for the recovery of symbolic forms of governing equations while simultaneously inferring uncertainty in system parameters. The methodology is validated through numerical benchmarks, including harmonic, Duffing, and van der Pol oscillators, and is further tested on an experimental system of coupled biological oscillators. The results demonstrate that the framework is data-efficient, requiring only 10²–10³ data points, and is robust against noise, showcasing its potential applicability in domains where uncertainty is inherent and both the structure and variability of dynamical systems need to be understood.
Methodology
The framework combines deep symbolic regression, which generates symbolic expressions from data, with Gaussian process-based maximum likelihood estimation to model both deterministic dynamics and noise structure without prior assumptions about their functional forms.
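The two-stage idea, recover a symbolic form first and then infer the noise by maximum likelihood, can be illustrated on a toy problem. This sketch assumes a known linear candidate form and i.i.d. Gaussian noise, far simpler than the paper's deep symbolic regression and Gaussian-process machinery:

```python
import math, random

# Toy data: a known symbolic rule y = a*x + b corrupted by Gaussian noise.
rng = random.Random(1)
a_true, b_true, sigma_true = 2.0, -1.0, 0.3
xs = [i / 20 for i in range(200)]
ys = [a_true * x + b_true + rng.gauss(0, sigma_true) for x in xs]

# Stage 1 (symbolic part): least-squares fit of the candidate form y = a*x + b.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a_hat = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
b_hat = my - a_hat * mx

# Stage 2 (stochastic part): Gaussian maximum-likelihood estimate of the noise
# scale from the residuals; for i.i.d. Gaussian noise the MLE is the RMS residual.
residuals = [y - (a_hat * x + b_hat) for x, y in zip(xs, ys)]
sigma_hat = math.sqrt(sum(r * r for r in residuals) / n)
```

With a few hundred points both the deterministic coefficients and the noise scale are recovered closely, which matches the summary's point about data efficiency in this simple setting.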
Results
The framework successfully identifies governing equations and quantifies uncertainty in system parameters across various numerical benchmarks and an experimental setup, demonstrating robustness to noise and requiring significantly fewer data points than traditional methods.
Implications
This work has broad implications for fields requiring accurate modeling of complex dynamical systems under uncertainty, such as finance, biology, and environmental science, where understanding both the governing dynamics and the associated uncertainties is crucial.
Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering
Theory
Optimization
Multimodal
- Traditional clustering methods like K-means dominate EHR analysis but are limited in effectiveness.
- An ensemble-based deep clustering approach is proposed to enhance clustering performance by aggregating multiple embeddings.
- The study utilizes real EHR data from the All of Us Research Program to evaluate clustering methods.
- The proposed method outperforms traditional and deep learning methods across various metrics and patient cohorts.
Read more
Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering
Summary
This paper explores the effectiveness of various clustering methods applied to electronic health records (EHRs), particularly focusing on heart failure patient cohorts. Traditional clustering methods, especially K-means, have been the dominant approach in healthcare informatics but have shown limited success when applied to deep learning embeddings. The authors propose an ensemble-based deep clustering method that aggregates cluster assignments from multiple embedding dimensions, addressing the shortcomings of existing deep clustering techniques. By combining traditional and deep clustering approaches, the ensemble method outperforms 14 diverse clustering methods across multiple patient cohorts. The study emphasizes the importance of biological sex-specific clustering and demonstrates that a hybrid approach can yield better results than relying on a single method.
Methodology
The authors extend the Gaussian Cluster Embedding Autoencoder Latent Space (G-CEALS) for healthcare applications and evaluate it against traditional clustering methods using real-world EHR data. They introduce a novel ensemble framework that combines the strengths of both traditional and deep clustering methods, utilizing multiple embedding dimensions for improved clustering accuracy.
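One generic way to aggregate cluster assignments from multiple embeddings is consensus clustering over a co-association matrix; the sketch below is a standard illustration of that idea, not the authors' exact G-CEALS-based scheme:

```python
def co_association(labelings, n):
    # Fraction of base clusterings in which points i and j share a cluster.
    m = len(labelings)
    C = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / m
    return C

def consensus_clusters(labelings, n, threshold=0.5):
    # Connect points that co-cluster in more than `threshold` of the base
    # partitions, then take connected components via union-find.
    C = co_association(labelings, n)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if C[i][j] > threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

# Three base partitions of 6 points (e.g. from different embedding dimensions);
# the first two agree, the third dissents on point 2.
base = [[0, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 1, 1, 1, 1]]
labels = consensus_clusters(base, n=6)
```

The majority of base partitions outvotes the dissenting one, so point 2 stays with the first group — the same intuition as aggregating assignments across embedding dimensions.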
Results
The ensemble deep clustering approach demonstrated superior performance compared to traditional methods and other deep learning techniques across 14 clustering algorithms. The results indicate that the hybrid method effectively captures the complexity of EHR data, leading to better patient stratification and insights into heart failure cohorts.
Implications
This research has significant implications for clinical decision-making and patient management by providing more accurate clustering of patient data. The findings suggest that integrating traditional and deep learning methods can enhance the understanding of disease subtypes and improve healthcare outcomes.
Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression
Efficient ML
- Proposes a three-stage pipeline for neural network compression: pruning, quantization, and distillation.
- Demonstrates that the order of these stages significantly impacts the accuracy and efficiency of the model.
- Shows that traditional metrics may not accurately reflect real-world performance, advocating for runtime-based evaluations.
- Achieves competitive accuracy and low latency across multiple architectures and datasets.
Read more
Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression
Summary
This paper addresses the challenge of efficiently compressing neural networks while maintaining performance, particularly under CPU and memory constraints. The authors propose a systematic pipeline that combines three established techniques: unstructured pruning, INT8 quantization-aware training (QAT), and knowledge distillation (KD). The study highlights that traditional metrics like parameter count or FLOPs do not accurately predict inference time, especially when unstructured sparsity is involved. The proposed pipeline is structured as follows: first, global unstructured pruning reduces the model's capacity, which stabilizes the subsequent low-precision optimization; second, INT8 QAT is applied to achieve significant runtime benefits; and finally, KD is used to recover accuracy in the constrained sparse INT8 regime. The authors evaluate their approach on CIFAR-10/100 datasets using various backbone architectures (ResNet-18, WRN-28-10, and VGG-16-BN) and demonstrate that the ordered pipeline outperforms individual techniques in terms of accuracy, size, and latency. The results indicate that the order of the stages is crucial for achieving optimal performance, and they advocate for evaluating compression strategies based on measured runtime rather than proxy metrics.
Methodology
The methodology involves a three-stage process: first, global unstructured pruning is applied to reduce the model's parameter count; second, INT8 quantization-aware training is conducted to optimize the model for efficient inference; and finally, knowledge distillation is employed to recover any lost accuracy. The authors conduct controlled experiments to analyze the impact of stage ordering on performance.
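The first two stages can be sketched on a flat list of weights. This is only an illustration of the ordering idea (prune first, then quantize the sparser tensor); the distillation stage and QAT fine-tuning loop are omitted:

```python
def magnitude_prune(weights, sparsity):
    # Stage 1: global unstructured pruning — zero out the smallest-magnitude
    # fraction `sparsity` of the weights.
    k = int(len(weights) * sparsity)
    cutoff = sorted(abs(w) for w in weights)[k - 1] if k > 0 else -1.0
    return [0.0 if abs(w) <= cutoff else w for w in weights]

def int8_quantize(weights):
    # Stage 2: symmetric INT8 quantization — map the weight range onto
    # integers in [-127, 127] with a single per-tensor scale.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.08]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = int8_quantize(pruned)
dequant = [v * scale for v in q]
```

Pruning before quantization shrinks the dynamic range the INT8 scale must cover (surviving weights are the large-magnitude ones), which is consistent with the paper's finding that stage order matters.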
Results
The proposed pipeline achieves CPU latencies between 0.99 and 1.42 ms while maintaining competitive accuracy across different backbone architectures. The study confirms that the ordered approach yields better trade-offs in the accuracy-size-latency space compared to using any single technique alone. Controlled ablation studies further validate the importance of the stage order.
Implications
The findings suggest that practitioners should prioritize runtime evaluations when selecting compression strategies for neural networks, especially in resource-constrained environments. The proposed pipeline offers a practical guideline for deploying efficient models on edge devices.
Neural Computers
Generative Models
Theory
Multimodal
- Introduction of Neural Computers (NCs) as a unified computing paradigm.
- Demonstration of NC prototypes for command-line and GUI interactions.
- Identification of early runtime primitives learned from raw I/O data.
- Outline of challenges and roadmap towards Completely Neural Computers (CNCs).
Read more
Neural Computers
Summary
This paper introduces the concept of Neural Computers (NCs), a novel machine form that integrates computation, memory, and I/O into a learned runtime state. Unlike traditional computers that execute explicit programs, NCs aim to function as the computer itself, with the long-term vision of achieving a Completely Neural Computer (CNC) that is a general-purpose, stable, and reusable computing system. The authors investigate the feasibility of learning NC primitives from I/O traces without needing access to program states. They implement NCs as video models that simulate command-line and graphical user interface interactions, demonstrating that these models can learn early interface primitives such as I/O alignment and short-horizon control. The paper outlines the challenges that remain, including the need for robust long-horizon reasoning and reliable symbolic processing, and presents a roadmap for future development towards CNCs.
Methodology
The authors developed video-based prototypes of NCs that model terminal and desktop interactions. They utilized I/O traces to train the models, focusing on learning interface primitives without direct access to program states. The experiments involved two specific NC implementations: NCCLIGen for command-line interactions and NCGUIWorld for graphical user interfaces.
Results
The experiments revealed that the NCs could effectively render and execute basic workflows in both CLI and GUI settings. The models demonstrated an ability to align with terminal buffers and capture common user interactions, achieving significant improvements in performance with stronger conditioning. However, challenges such as symbolic stability and long-horizon reasoning were noted as areas needing further research.
Implications
If the challenges in developing CNCs can be addressed, this could lead to a transformative shift in computing paradigms, enabling machines that learn and adapt their programming dynamically. This could have broad applications in automation, user interface design, and interactive systems.
Improving Sparse Memory Finetuning
NLP
Large Language Models
Efficient ML
- Introduction of Sparse Memory Finetuning (SMF) to address catastrophic forgetting in LLMs.
- Development of an open-source pipeline for retrofitting pretrained models with sparse memory layers.
- Novel slot-selection mechanism based on KL divergence for prioritizing memory updates.
- Empirical validation showing improved stability on held-out benchmarks while learning new tasks.
Read more
Improving Sparse Memory Finetuning
Summary
This paper addresses the challenge of continual learning in Large Language Models (LLMs), which typically become static post-training. The authors propose Sparse Memory Finetuning (SMF) as a solution to mitigate catastrophic forgetting, a common issue where updates to a model degrade its performance on previously learned tasks. They introduce an open-source pipeline that retrofits existing pretrained models, specifically Qwen-2.5-0.5B, with sparse memory modules, allowing for effective continual learning on consumer hardware. A novel slot-selection mechanism based on Kullback-Leibler (KL) divergence is introduced to prioritize updates for tokens that provide significant new information. The experiments conducted demonstrate that models retrofitted with this approach can learn new factual knowledge while preserving their existing capabilities, thus validating the sparse update hypothesis in practical scenarios.
Methodology
The authors replace the standard Feed-Forward Networks (FFNs) in Transformers with sparse key-value memory layers. They implement a slot-selection mechanism based on KL divergence to identify which memory slots to update, focusing on those that provide the most informational gain relative to a background distribution. The retrofitting process includes a 'healing' stage to recover general capabilities after the introduction of sparse memory layers.
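The slot-selection idea can be sketched as scoring each token's slot-access distribution against a background distribution by KL divergence and updating memory only for the most surprising tokens. The distributions and `k` below are hypothetical, and this omits the memory layers themselves:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats between two discrete distributions over memory slots.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_tokens_for_update(token_slot_probs, background, k):
    # Rank tokens by how far their slot-access distribution departs from the
    # background; only the top-k most informative tokens trigger slot updates
    # (a sketch of the paper's KL-based selection idea).
    scored = [(kl_divergence(p, background), i)
              for i, p in enumerate(token_slot_probs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

background = [0.25, 0.25, 0.25, 0.25]          # typical slot usage
token_slot_probs = [
    [0.25, 0.25, 0.25, 0.25],                  # token 0: nothing new
    [0.97, 0.01, 0.01, 0.01],                  # token 1: highly atypical access
    [0.40, 0.30, 0.20, 0.10],                  # token 2: mildly atypical
]
updated = select_tokens_for_update(token_slot_probs, background, k=1)
```

Tokens whose access pattern matches the background contribute near-zero KL and are skipped, which is how sparse updates limit interference with previously learned knowledge.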
Results
The experiments show that the retrofitted models can learn new tasks, such as TriviaQA, while maintaining higher stability on held-out benchmarks like GSM8k and NaturalQuestions compared to traditional dense finetuning methods. This supports the hypothesis that sparse updates can minimize catastrophic forgetting.
Implications
The proposed approach has significant implications for the deployment of LLMs in dynamic environments where continual learning is essential. It allows for efficient updates without the risk of degrading previously learned knowledge, making it suitable for applications that require real-time adaptation to new information.
Asymptotic-Preserving Neural Networks for Viscoelastic Parameter Identification in Multiscale Blood Flow Modeling
Theory
- Introduction of Asymptotic-Preserving Neural Networks for viscoelastic parameter identification.
- Integration of physical principles into the neural network training process enhances model accuracy.
- Utilization of non-invasive patient-specific data for pressure waveform estimation.
- Demonstrated effectiveness through numerical simulations in synthetic and real-world scenarios.
Read more
Asymptotic-Preserving Neural Networks for Viscoelastic Parameter Identification in Multiscale Blood Flow Modeling
Summary
This paper presents a novel approach to identifying viscoelastic parameters in a one-dimensional multiscale blood flow model using Asymptotic-Preserving Neural Networks (APNNs). The study addresses the challenge of accurately determining the viscoelastic properties of arterial walls, which are crucial for understanding how arteries deform under pulsatile pressure. By embedding the governing physical principles of the blood flow model within the neural network training process, the authors enable the simultaneous inference of viscoelastic parameters and reconstruction of time-dependent state variables of blood vessels. The methodology leverages readily accessible patient-specific data, such as cross-sectional area and velocity measurements obtained from Doppler ultrasound, to estimate pressure waveforms in vascular segments where direct pressure measurements are not feasible. The effectiveness of the proposed APNN framework is demonstrated through various numerical simulations, both in synthetic and patient-specific scenarios, showcasing its potential for improving the practical applicability of blood flow modeling in clinical settings.
Methodology
The authors employ Asymptotic-Preserving Neural Networks (APNNs) that incorporate the governing equations of a multiscale viscoelastic blood flow model into the learning process. This allows the network to maintain physical consistency while inferring viscoelastic parameters and reconstructing state variables from available hemodynamic data.
Results
The numerical simulations indicate that the APNN framework successfully estimates pressure waveforms and identifies viscoelastic parameters, demonstrating improved accuracy over traditional methods. The results validate the approach in both synthetic datasets and real patient-specific scenarios.
Implications
The findings suggest that APNNs can significantly enhance non-invasive cardiovascular diagnostics by providing reliable estimates of hemodynamic variables, potentially leading to better patient outcomes and more effective monitoring of cardiovascular health.
Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
Time Series
Generative Models
- DynLMC generates synthetic multivariate time series with realistic, nonstationary correlation structures.
- The model incorporates time-varying correlations, regime-switching, and lagged dependencies.
- Fine-tuning on DynLMC-generated data improves forecasting performance across multiple benchmarks.
- The approach enhances the transferability of foundation models for time series analysis.
Read more
Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
Summary
The paper introduces DynLMC, a novel Dynamic Linear Model of Coregionalization designed to generate synthetic multivariate time series that accurately reflect the dynamic inter-channel dependencies found in real-world data. Traditional synthetic data generators often assume static correlations, which fail to capture the evolving relationships and lagged dependencies present in actual multivariate time series. DynLMC addresses this limitation by incorporating time-varying correlations, regime-switching mechanisms, and lagged dependencies into its generative process. The authors demonstrate that fine-tuning three foundational forecasting models on data generated by DynLMC leads to significant improvements in zero-shot forecasting performance across nine benchmarks. This highlights the importance of realistic synthetic data in enhancing the transferability and robustness of foundation models for time series analysis. The findings suggest that incorporating dynamic inter-channel correlations into synthetic data generation can lead to better performance in real-world applications across various domains such as finance, healthcare, and climate science.
Methodology
DynLMC extends the Linear Model of Coregionalization by allowing dynamic mixing weights and random lags between latent and observed channels. It employs autoregressive updates for correlation drift, a Hidden Markov Model for regime-switching correlations, and incorporates lagged dependencies to simulate lead-lag effects. The generative process is designed to produce datasets that reflect the temporal variability and complex interdependencies found in real-world multivariate time series.
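A minimal generator in the spirit of DynLMC, with latent AR factors, slowly drifting mixing weights, and one lagged channel, might look like the following. All constants are hypothetical and the HMM regime-switching component is omitted for brevity:

```python
import random

def generate_dynlmc_like(T=300, seed=0):
    # Illustrative generator: two latent AR(1) factors, a mixing weight that
    # itself drifts as an AR(1) process (time-varying correlations), and a
    # third channel observing the first factor with a fixed 5-step lag.
    rng = random.Random(seed)
    z1 = z2 = 0.0
    w = 0.5                                    # mixing weight, drifts over time
    buffer = [0.0] * 5                         # 5-step delay line (lead-lag effect)
    series = [[], [], []]
    for _ in range(T):
        z1 = 0.9 * z1 + rng.gauss(0, 0.3)      # latent factor 1
        z2 = 0.9 * z2 + rng.gauss(0, 0.3)      # latent factor 2
        w = 0.99 * w + rng.gauss(0, 0.02)      # correlation drift
        series[0].append(w * z1 + (1 - w) * z2 + rng.gauss(0, 0.05))
        series[1].append((1 - w) * z1 + w * z2 + rng.gauss(0, 0.05))
        series[2].append(buffer[0] + rng.gauss(0, 0.05))   # lagged channel
        buffer = buffer[1:] + [z1]
    return series

series = generate_dynlmc_like()
```

Because the mixing weight `w` wanders, the correlation between channels 0 and 1 changes over the sample — exactly the nonstationarity that static synthetic generators fail to produce.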
Results
The empirical evaluation shows that fine-tuning pretrained models on DynLMC-generated data consistently yields improvements in robustness and generalization, with significant performance gains observed on real-world test datasets across various forecasting tasks.
Implications
The development of DynLMC has significant implications for the field of time series analysis, particularly in enhancing the training of foundation models. By providing a more realistic synthetic data generation method, it can improve model adaptability and performance in diverse applications, including finance, healthcare, and environmental monitoring.
Learning $\text{AC}^0$ Under Graphical Models
Theory
- Introduces quasipolynomial-time algorithms for learning AC0 under graphical models with strong spatial mixing.
- Overcomes the limitations of Fourier analysis in learning from correlated distributions.
- Demonstrates the applicability of low-degree polynomial approximations beyond product structures.
- Extends results to other function classes, enhancing the generality of the findings.
Read more
Learning $\text{AC}^0$ Under Graphical Models
Summary
This paper addresses the challenge of learning constant-depth circuits (AC0) under more realistic correlated distributions, moving beyond the traditional reliance on product structures. Building on the foundational work of Linial, Mansour, and Nisan (1993), which provided a quasipolynomial-time algorithm for learning AC0 under uniform distribution, the authors propose new quasipolynomial-time algorithms applicable to inputs from any graphical model exhibiting strong spatial mixing. The key innovation lies in circumventing the limitations of Fourier analysis, which has been a barrier in extending learning guarantees to non-product distributions. By developing tailored sampling algorithms, the authors demonstrate that low-degree polynomial approximations can be effectively transferred from uniform settings to graphical models. This approach not only applies to AC0 but also extends to other function classes such as monotone functions and halfspaces, thereby broadening the scope of efficient learning in complex distributions.
Methodology
The authors utilize tailored sampling algorithms to analyze and approximate low-degree polynomials in the context of graphical models. This involves a detailed examination of the dependence structure of high-dimensional distributions, allowing for the transference of results from uniform distributions to more complex correlated settings.
Results
The paper establishes that it is possible to learn AC0 circuits efficiently under graphical models with polynomial growth and strong spatial mixing. The results indicate that low-degree polynomial approximations can be achieved even in the absence of product structure, thus providing a significant advancement in the field of computational learning theory.
Implications
The findings have potential applications in various fields that rely on learning from complex, correlated data distributions, including computer science, economics, and statistics. The ability to efficiently learn from such distributions could enhance machine learning models in real-world scenarios where data is not independently distributed.
Busemann energy-based attention for emotion analysis in Poincaré discs
NLP
Theory
Efficient ML
- EmBolic leverages hyperbolic geometry for emotion analysis, capturing hierarchical relationships between words and emotions.
- The model operates in a continuous space of emotions, avoiding the limitations of categorical representations.
- An attention mechanism based on Busemann energy is utilized to evaluate the alignment of textual messages with emotional classes.
- Experiments show strong generalization and prediction accuracy, even in small dimensions.
Read more
Busemann energy-based attention for emotion analysis in Poincaré discs
Summary
This paper introduces EmBolic, a novel fully hyperbolic deep learning architecture designed for fine-grained emotion analysis from textual messages. The authors argue that hyperbolic geometry effectively captures the hierarchical relationships between words and emotions, addressing the semantic ambiguities inherent in emotion analysis. Unlike traditional models that treat emotions as discrete categories, EmBolic represents emotions in a continuous hyperbolic space, allowing for a more nuanced understanding of emotional contexts. The architecture employs an attention mechanism within the hyperbolic disc, where queries are generated from textual inputs and keys are derived from these queries. Predictions are made based on the Busemann energy, which evaluates the alignment of textual messages with emotional class directions. The experiments conducted demonstrate that EmBolic exhibits strong generalization capabilities and achieves commendable prediction accuracy, even in lower-dimensional representation spaces. This research highlights the advantages of hyperbolic representations in affective computing, suggesting that they can significantly enhance the performance of emotion analysis tasks.
Methodology
The EmBolic architecture employs a fully hyperbolic deep learning framework that maps both words and emotions to hyperbolic manifolds. It generates queries from textual messages and derives keys from these queries, using Busemann energy to assess the alignment of messages with emotional classes. The model is trained on annotated data to learn the emotion map without imposing a priori assumptions on the distances between emotions.
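On the Poincaré disc, the Busemann function for an ideal boundary point ξ has the closed form B_ξ(x) = log(|ξ − x|² / (1 − |x|²)). A sketch of scoring a query against class directions with this energy follows; the class directions and query here are hypothetical stand-ins for quantities the model learns:

```python
import math

def busemann(x, xi):
    # Busemann function on the Poincaré disc for an ideal point `xi` on the
    # boundary (|xi| = 1) and a point `x` inside the disc (|x| < 1):
    #   B_xi(x) = log(|xi - x|^2 / (1 - |x|^2))
    # It is 0 at the origin and decreases toward -inf as x approaches xi.
    return math.log(abs(xi - x) ** 2 / (1 - abs(x) ** 2))

def classify(query, class_directions):
    # Busemann-energy scoring: assign the query to the emotion class whose
    # ideal direction gives the lowest energy (best alignment).
    energies = {name: busemann(query, xi) for name, xi in class_directions.items()}
    return min(energies, key=energies.get)

# Hypothetical boundary directions for three emotion classes (unit complex numbers).
directions = {
    "joy": complex(1, 0),
    "anger": complex(-1, 0),
    "fear": complex(0, 1),
}
query = 0.6 + 0.1j            # an embedded message drifting toward "joy"
label = classify(query, directions)
```

Points deep in the disc near a class's boundary direction get strongly negative energy for that class, which is why hyperbolic space can encode fine-grained hierarchies even in low dimension.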
Results
The experiments indicate that EmBolic achieves strong generalization properties and reasonable prediction accuracy, demonstrating the effectiveness of hyperbolic representations in emotion analysis tasks. The model's performance is particularly notable in lower-dimensional spaces, suggesting its efficiency and robustness.
Implications
The findings suggest that hyperbolic representations can significantly improve emotion analysis in natural language processing, potentially leading to advancements in affective computing applications. This could enhance systems in areas such as sentiment analysis, customer feedback interpretation, and emotional AI.
Bridging Theory and Practice in Crafting Robust Spiking Reservoirs
Theory
Time Series
Efficient ML
- Introduction of the robustness interval as a measure for hyperparameter tuning in spiking reservoirs.
- Identification of monotonic trends linking robustness interval width to presynaptic connection density and firing threshold.
- Discovery of iso-performance manifolds in the hyperparameter space that maintain performance near the critical point.
- Validation of the theoretical critical point as a robust starting coordinate for parameter search.
Read more
Bridging Theory and Practice in Crafting Robust Spiking Reservoirs
Summary
This paper addresses the challenges of tuning spiking reservoir computing systems, particularly Liquid State Machines (LSMs), to operate near the edge of chaos, which is crucial for optimal performance in temporal processing tasks. The authors introduce the concept of the 'robustness interval', an operational measure that quantifies the range of hyperparameters over which the reservoir maintains performance above specific thresholds. Through systematic evaluations of Leaky Integrate-and-Fire (LIF) networks on both static (MNIST) and temporal (synthetic Ball Trajectories) tasks, the study identifies consistent trends indicating that the width of the robustness interval decreases with increasing presynaptic connection density and firing threshold. The authors also explore specific pairs of connection density and firing threshold that preserve the theoretical critical point, revealing iso-performance manifolds in the hyperparameter space. Control experiments using Erdős–Rényi graphs confirm that these phenomena are intrinsic to the dynamics of the reservoirs, rather than dependent on specific network topologies. The findings validate the theoretical critical point as a reliable starting point for parameter tuning, and the authors provide their Python code publicly to promote reproducibility.
Methodology
The authors employed Leaky Integrate-and-Fire (LIF) networks with small-world topology and conducted experiments on two tasks: static (MNIST) and temporal (synthetic Ball Trajectories). They systematically varied hyperparameters such as presynaptic connection density, firing threshold, and external input to analyze the robustness interval and its relationship to performance. Control experiments were also performed using directed Erdős–Rényi graphs to validate the findings.
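The robustness interval is easy to state operationally. The sketch below assumes it is the contiguous parameter range around the best-performing configuration over which performance stays above the threshold; the sweep values are hypothetical:

```python
def robustness_interval(param_values, performances, threshold):
    # Width of the contiguous hyperparameter range around the best-performing
    # point where performance stays above `threshold` (values assumed sorted
    # by parameter) — a sketch of the paper's operational measure.
    best = max(range(len(performances)), key=lambda i: performances[i])
    lo = best
    while lo > 0 and performances[lo - 1] >= threshold:
        lo -= 1
    hi = best
    while hi < len(performances) - 1 and performances[hi + 1] >= threshold:
        hi += 1
    return param_values[hi] - param_values[lo]

# Hypothetical sweep over a firing threshold with a performance plateau.
params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
perf   = [0.55, 0.80, 0.92, 0.90, 0.78, 0.50]
width = robustness_interval(params, perf, threshold=0.75)
```

A wider interval means performance is less sensitive to mis-set hyperparameters, which is the practical sense in which the paper links interval width to connection density and firing threshold.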
Results
The study found that the width of the robustness interval decreases with increasing presynaptic connection density and firing threshold. Specific pairs of connection density and firing threshold were identified that preserve the theoretical critical point, leading to overlapping normalized performance across different configurations. The critical point consistently fell within high-performance regions, supporting its use as a starting point for parameter tuning.
Implications
The findings have significant implications for the design and tuning of spiking neural networks in practical applications, particularly in energy-efficient temporal processing tasks. The robustness interval provides a framework for ensuring stable performance despite uncertainties, which is crucial for real-world implementations of reservoir computing.
MO-RiskVAE: A Multi-Omics Variational Autoencoder for Survival Risk Modeling in Multiple Myeloma
Generative Models
Multimodal
- MO-RiskVAE improves survival risk modeling in multiple myeloma by addressing limitations in traditional VAE approaches.
- The study highlights the importance of latent regularization scale and structure in survival-driven training.
- Moderate relaxation of KL regularization consistently enhances survival discrimination.
- The model integrates multimodal omics data effectively, improving risk stratification without added complexity.
Read more
MO-RiskVAE: A Multi-Omics Variational Autoencoder for Survival Risk Modeling in Multiple Myeloma
Summary
The paper presents MO-RiskVAE, a novel multi-omics variational autoencoder designed for survival risk modeling in multiple myeloma (MM). The authors identify that traditional latent regularization strategies in VAEs often fail to maintain prognostically relevant variations when trained under survival supervision, leading to unstable representations. Through a systematic investigation of latent modeling choices, the study reveals that the scale and structure of latent regularization significantly influence survival-driven training outcomes. The authors demonstrate that moderate relaxation of Kullback-Leibler (KL) regularization enhances survival discrimination, while alternative divergence mechanisms provide limited benefits unless appropriately scaled. Additionally, structuring the latent space improves the alignment of learned representations with survival risk gradients. The proposed MO-RiskVAE model outperforms the original MyeVAE framework in risk stratification without requiring additional supervision or complex training heuristics, thus providing a robust tool for integrating heterogeneous omics data to improve survival predictions in MM.
Methodology
The authors utilize a multi-modal variational autoencoder framework, specifically extending the MyeVAE model. They systematically investigate the effects of latent regularization scale, posterior geometry, and latent space structure on survival prediction. The model is trained using a Cox proportional hazards objective, allowing for end-to-end optimization driven by survival outcomes.
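The Cox proportional hazards objective used for end-to-end training has a standard partial-likelihood form; a minimal sketch on hypothetical risk scores (Breslow-style risk sets, no tie correction) is:

```python
import math

def cox_partial_log_likelihood(risks, times, events):
    # Negative Cox partial log-likelihood for risk scores r_i (e.g. produced
    # by a latent-space risk head): for each observed event i,
    #   contribution = r_i - log( sum_{j: t_j >= t_i} exp(r_j) )
    # where the sum runs over all samples still under observation at t_i.
    ll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue                     # censored samples enter only risk sets
        risk_set = sum(math.exp(r) for r, t in zip(risks, times) if t >= t_i)
        ll += risks[i] - math.log(risk_set)
    return -ll                           # minimized during training

# Hypothetical toy cohort: higher risk scores should pair with earlier events.
risks  = [2.0, 0.5, -1.0]
times  = [1.0, 2.0, 3.0]
events = [1, 1, 0]                       # third patient censored
loss_good = cox_partial_log_likelihood(risks, times, events)
loss_bad  = cox_partial_log_likelihood(list(reversed(risks)), times, events)
```

Concordant risk ordering (early events carry the highest scores) yields a lower loss than the reversed ordering, which is the gradient signal that drives the latent representation toward survival-relevant structure.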
Results
The MO-RiskVAE model demonstrates consistent improvements in risk stratification compared to the original MyeVAE framework. The findings indicate that adjusting the scale of KL regularization enhances the model's ability to retain prognostically relevant variations, leading to better survival discrimination. Additionally, structuring the latent space improves the alignment of learned representations with survival risk gradients.
Implications
The development of MO-RiskVAE has significant implications for clinical practice in multiple myeloma, as it provides a more accurate tool for risk stratification based on integrated omics data. This could lead to improved treatment planning and prognosis assessment for patients with MM.
AdaBoost Does Not Always Cycle: A Computer-Assisted Counterexample
Theory
- The paper disproves the conjecture that exhaustive AdaBoost always converges to a finite cycle.
- A specific counterexample is constructed using a block-product matrix that demonstrates non-periodic behavior.
- The irrationality of the eigenvalue ratio in the linearized return maps is crucial to the findings.
- The results are supported by rigorous mathematical proofs and computational certificates.
Read more
AdaBoost Does Not Always Cycle: A Computer-Assisted Counterexample
Summary
This paper presents a computer-assisted counterexample to the conjecture that the AdaBoost algorithm always converges to a finite cycle. The author constructs a specific 20 × 8 sign matrix using a block-product gadget, demonstrating that the dynamics of the algorithm can lead to non-periodic behavior under certain conditions. The construction relies on the properties of the linearized return maps, which exhibit an irrational eigenvalue ratio, leading to an irrational asymptotic frequency in the burst-winner sequence. This irrationality prevents the convergence to a finite cycle, thus providing a negative answer to the open question posed by Rudin, Schapire, and Daubechies in 2012. The paper includes rigorous proofs and computational certificates to support the findings, showcasing the use of exact rational arithmetic and interval arithmetic to validate the results.
Methodology
The author constructs a counterexample using a 20 × 8 sign matrix derived from two smaller gadgets. The dynamics of the AdaBoost algorithm are analyzed through the lens of linearized return maps, and rigorous interval arithmetic is employed to certify the results. The proof involves demonstrating the irrationality of the eigenvalue ratio and its implications for the burst-winner sequence.
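The exhaustive-AdaBoost dynamics analyzed here admit an exact rational form: after the optimal-step reweighting, correctly classified examples are scaled by 1/(2(1−ε)) and mistakes by 1/(2ε), so the trajectory can be tracked in exact rational arithmetic as the paper's certificates do. A minimal sketch of one round, on an illustrative toy matrix (not the paper's 20 × 8 gadget):

```python
from fractions import Fraction

def adaboost_round(weights, H, labels):
    """One round of exhaustive AdaBoost in exact rational arithmetic.

    H[j][i] is the +1/-1 prediction of weak hypothesis j on example i; the
    'exhaustive' rule picks the hypothesis with lowest weighted error. After
    the optimal-alpha reweighting, correct examples are scaled by
    1/(2(1-eps)) and mistakes by 1/(2*eps), which stays rational."""
    errors = [
        sum(w for w, p, y in zip(weights, h, labels) if p != y) for h in H
    ]
    j = min(range(len(H)), key=lambda k: errors[k])  # best weak learner
    eps = errors[j]
    new = [
        w / (2 * (1 - eps)) if p == y else w / (2 * eps)
        for w, p, y in zip(weights, H[j], labels)
    ]
    return j, eps, new

# Toy data: 3 hypotheses, 4 examples, uniform initial weights.
labels = [1, 1, -1, -1]
H = [[1, 1, 1, -1], [1, -1, -1, -1], [1, 1, -1, 1]]
w0 = [Fraction(1, 4)] * 4
j, eps, w1 = adaboost_round(w0, H, labels)
```

By construction the updated weights renormalize exactly (total weight 1, with the misclassified examples carrying total weight 1/2), which is what makes exact certification of long trajectories feasible.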
Results
The main result shows that there exists a finite {−1, +1}-valued matrix for which exhaustive AdaBoost does not converge to any finite cycle, thereby answering the COLT question negatively. Additionally, a corollary extends this result to matrices closed under negation, reinforcing the findings.
Implications
These findings have significant implications for the theoretical understanding of AdaBoost and similar algorithms, suggesting that their long-term behavior may be more complex than previously thought. This could influence future research on boosting algorithms and their convergence properties.
The Rhetoric of Machine Learning
Theory
- Machine learning is inherently rhetorical, influencing perceptions and decisions.
- The concept of 'manipulation as a service' highlights the persuasive use of machine learning in business.
- Viewing machine learning through the lens of rhetoric can open new lines of inquiry and discussion.
- The paper challenges traditional narratives about the objectivity of machine learning technologies.
The Rhetoric of Machine Learning
Summary
In 'The Rhetoric of Machine Learning', Robert C. Williamson explores the intersection of machine learning technology and rhetoric, arguing that machine learning is not merely an objective tool for building world models but is inherently rhetorical in nature. The paper posits that machine learning systems are designed to persuade users by presenting data-driven outputs as factual, thereby influencing decision-making processes. Williamson discusses the concept of 'manipulation as a service' as a prevalent business model that utilizes machine learning for persuasive ends. He emphasizes the importance of understanding machine learning through a rhetorical lens, which can reveal the socio-technical implications of its deployment. The author draws on a diverse literature to challenge existing narratives and stimulate new discussions around the philosophical and rhetorical dimensions of machine learning, ultimately suggesting that this perspective can lead to a more nuanced understanding of its role in society.
Methodology
The paper employs a rhetorical analysis framework to examine machine learning technologies, drawing on literature from philosophy, science, and rhetoric to support its claims. It is based on discussions and presentations given by the author at various workshops and symposia.
Results
The main result of the paper is the assertion that machine learning systems function as persuasive technologies, rather than purely objective tools. This perspective encourages a critical examination of how machine learning outputs are accepted and utilized in various contexts.
Implications
The implications of this work suggest that stakeholders in machine learning should be more aware of the persuasive nature of these technologies, leading to more responsible development and deployment practices. It also calls for interdisciplinary dialogue to better understand the socio-technical impacts of machine learning.
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
NLP
Large Language Models
Theory
- LLMs can discover latent planning strategies with limited depth, with small transformers achieving up to three steps.
- Fine-tuned models like GPT-4o and Qwen3-32B reach a maximum of five latent planning steps.
- GPT-5.4 demonstrates the ability to generalize strategies to eight steps during testing, despite training on fewer steps.
- The study reveals a dissociation between the ability to discover strategies and the ability to execute them.
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
Summary
This paper investigates the limitations of large language models (LLMs) in discovering and executing multi-step planning strategies without supervision on intermediate steps. The authors focus on the ability of models to perform latent reasoning within a single forward pass, specifically through graph path-finding tasks that control the number of required latent planning steps. The study reveals that while smaller transformers can discover strategies requiring up to three latent steps, larger models like fine-tuned GPT-4o and Qwen3-32B can reach five steps, and GPT-5.4 can achieve seven steps under few-shot prompting. Notably, the maximum latent planning depth learned during training is five, but models can generalize strategies up to eight steps at test time. This indicates a significant gap between the discovery of latent strategies and their execution, suggesting that complex multi-step strategies may need explicit teaching or externalization. The findings challenge previous literature and highlight the need for further exploration of LLM capabilities in latent reasoning.
Methodology
The authors employed graph path-finding tasks to assess the latent planning capabilities of various LLMs. They controlled the number of required latent planning steps and provided supervision only on task success, allowing them to evaluate the models' ability to discover and execute multi-step strategies without intermediate supervision.
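A path-finding probe of this kind can be generated mechanically so that the required number of latent hops is controlled exactly. The sketch below is a hypothetical construction (the node pool, prompt format, and distractor scheme are illustrative, not the paper's benchmark): answering the query requires composing `depth` edge lookups in a single pass.

```python
import random

def make_path_task(num_nodes=12, depth=3, seed=0):
    """Hypothetical k-hop probe: present the edges of a hidden chain in
    shuffled order; the answer is the node `depth` hops from the start."""
    rng = random.Random(seed)
    nodes = rng.sample(range(100), num_nodes)   # distinct node labels
    chain = nodes[: depth + 1]                  # the true path
    edges = list(zip(chain, chain[1:]))
    rest = nodes[depth + 1:]                    # distractor chain
    edges += list(zip(rest, rest[1:]))
    rng.shuffle(edges)
    prompt = " ".join(f"{a}->{b}" for a, b in edges)
    prompt += f" | start {chain[0]}, hops {depth}?"
    return prompt, chain[-1]

prompt, answer = make_path_task(depth=3)
```

Varying `depth` while holding the surface format fixed is what lets a study of this kind isolate latent planning depth from other task difficulty factors.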
Results
The study found that small transformers can discover strategies requiring up to three latent steps, while larger models can achieve a maximum of five steps. GPT-5.4, evaluated under few-shot prompting, reached a latent planning depth of seven steps. Importantly, models trained on fewer steps could generalize to deeper strategies at test time, indicating a gap between discovery and execution capabilities.
Implications
The findings suggest that LLMs may not inherently possess the ability to discover complex multi-step strategies without explicit training, which has implications for the development of more effective training methodologies and oversight strategies in AI systems. This could influence how LLMs are utilized in applications requiring advanced reasoning and planning.
Extraction of linearized models from pre-trained networks via knowledge distillation
Efficient ML
Theory
- Proposes a framework for extracting linearized models from pre-trained neural networks.
- Integrates Koopman operator theory with knowledge distillation for improved classification tasks.
- Demonstrates superior performance over conventional least-squares-based Koopman approximations.
- Focuses on enhancing energy efficiency in machine learning architectures, particularly for optical devices.
Extraction of linearized models from pre-trained networks via knowledge distillation
Summary
This paper addresses the challenge of improving the energy efficiency of machine learning architectures, particularly in the context of optical devices that excel at linear operations. The authors propose a novel framework that extracts linearized models from pre-trained neural networks for classification tasks by integrating Koopman operator theory with knowledge distillation. The method approximates the non-linear transformations of hidden layers as a linear system in a higher-dimensional observable space, thereby enhancing classification accuracy. The authors demonstrate the effectiveness of their approach through numerical experiments on the MNIST and Fashion-MNIST datasets, showing that their model consistently outperforms conventional least-squares-based Koopman approximations in terms of both classification accuracy and numerical stability. This work aims to bridge the gap between high-performance deep learning and the physical constraints of optical devices, paving the way for more energy-efficient artificial intelligence solutions.
Methodology
The authors employ a combination of Koopman operator theory and knowledge distillation to extract linearized models from pre-trained neural networks. They approximate non-linear transformations of hidden layers as linear systems in a higher-dimensional observable space, enhancing the accuracy of classification tasks. Additionally, principal component analysis (PCA) is incorporated to facilitate initial-stage processing with weak nonlinearity.
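The least-squares Koopman baseline that the distillation approach improves on can be sketched in a few lines: lift states with a fixed dictionary of observables and fit a linear operator that mimics a nonlinear layer. Everything below is illustrative (the dictionary, the stand-in "teacher" layer, and the dimensions are assumptions; the paper's method learns the lift via distillation rather than fixing it):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_teacher = rng.normal(size=(d, d))
teacher = lambda X: np.tanh(X @ W_teacher)   # stands in for a pre-trained layer

def lift(X):
    # Hypothetical observable dictionary: state, tanh features, squares.
    return np.hstack([X, np.tanh(X), X ** 2])

# Fit a linear operator K in the lifted space by least squares:
#   lift(teacher(X)) ~= lift(X) @ K
X = rng.normal(size=(2000, d))
Phi_in, Phi_out = lift(X), lift(teacher(X))
K, *_ = np.linalg.lstsq(Phi_in, Phi_out, rcond=None)

# Evaluate the linear surrogate on held-out data.
X_test = rng.normal(size=(200, d))
err = np.mean((lift(X_test) @ K - lift(teacher(X_test))) ** 2)
```

The appeal for optical hardware is that, once fitted, the forward pass is a single matrix multiplication in the lifted space.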
Results
The proposed method was tested on the MNIST and Fashion-MNIST datasets, where it consistently outperformed traditional least-squares-based Koopman approximations in both classification accuracy and numerical stability, demonstrating the effectiveness of the approach in practical scenarios.
Implications
The findings suggest that it is possible to construct energy-efficient machine learning architectures that leverage optical devices by reducing reliance on non-linear operations. This could lead to advancements in green artificial intelligence and more sustainable computing solutions.
SBBTS: A Unified Schrödinger-Bass Framework for Synthetic Financial Time Series
Generative Models
Time Series
Optimization
- Introduces SBBTS, a unified framework for generating synthetic financial time series.
- Jointly models drift and stochastic volatility, overcoming limitations of existing methods.
- Demonstrates improved forecasting performance and data augmentation capabilities.
- Empirical evaluations show accurate recovery of volatility and correlation structures.
SBBTS: A Unified Schrödinger-Bass Framework for Synthetic Financial Time Series
Summary
This paper addresses the challenge of generating synthetic financial time series that accurately reflect both marginal distributions and temporal dynamics. Traditional methods often struggle to model drift and stochastic volatility simultaneously. The authors propose the Schrödinger-Bass Bridge for Time Series (SBBTS), which extends the Schrödinger-Bass framework to multi-step time series, allowing for the joint calibration of drift and volatility. The SBBTS framework constructs a diffusion process that can be decomposed into conditional transport problems, facilitating efficient learning. Numerical experiments demonstrate that SBBTS effectively recovers stochastic volatility and correlation parameters that previous methods failed to capture. When applied to S&P 500 data, synthetic time series generated by SBBTS significantly enhance downstream forecasting performance, leading to improved classification accuracy and Sharpe ratios compared to models trained solely on real data. The results indicate that SBBTS is a practical and effective approach for realistic time series generation and data augmentation in financial contexts.
Methodology
The SBBTS framework combines optimal transport with modern machine learning techniques to generate synthetic time series. It extends the Schrödinger-Bass formulation to full time series distributions, allowing for the joint calibration of drift and volatility. The method decomposes the problem into a sequence of conditional optimal transport problems, enabling scalable neural implementations that capture path-dependent dynamics.
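The "conditional optimal transport" building block can be illustrated with the classic entropic-OT solver. This is a generic Sinkhorn sketch between two hypothetical 1-D marginals, not the SBBTS algorithm itself (SBBTS chains such problems along a time series and uses neural parameterizations; the grid, marginals, and regularization strength below are illustrative):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, iters=2000):
    """Entropic optimal transport between histograms a and b with cost C:
    the coupling is P = diag(u) K diag(v) with K = exp(-C/eps), found by
    alternately rescaling to match each marginal."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Two hypothetical marginals on a 1-D grid (e.g. returns today vs. tomorrow).
x = np.linspace(-2, 2, 40)
a = np.exp(-x ** 2); a /= a.sum()
b = np.exp(-(x - 0.5) ** 2 / 0.5); b /= b.sum()
C = (x[:, None] - x[None, :]) ** 2      # quadratic transport cost
P = sinkhorn(a, b, C)
```

The coupling `P` transports today's distribution onto tomorrow's; a bridge-type generative model samples paths consistent with such couplings at every step.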
Results
The SBBTS framework successfully recovers stochastic volatility and correlation parameters in numerical experiments, outperforming prior methods. In applications to real financial data, synthetic time series generated by SBBTS lead to higher classification accuracy and improved Sharpe ratios, demonstrating its effectiveness for data augmentation.
Implications
The SBBTS framework has significant implications for financial machine learning, particularly in scenarios where real data is scarce or sensitive. It can be utilized for stress testing, risk management, and enhancing predictive models, thereby improving decision-making in financial markets.
From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Graph Learning
- Introduces a graph-based anomaly detection system for microservices using unsupervised learning.
- Utilizes GCN-GAE to learn structural representations from service interaction graphs.
- Achieves high precision and low false positive rates in anomaly detection.
- Addresses gaps in traditional load testing by focusing on real-world traffic patterns.
From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Summary
The paper presents a novel graph-based anomaly detection system designed for microservice architectures, particularly in the context of Prime Video's operations. Traditional load tests often fail to accurately simulate real-world traffic patterns, leading to potential service issues during peak events. To address this, the authors propose an unsupervised approach utilizing Graph Convolutional Networks (GCN) and Graph Autoencoders (GAE) to learn structural representations from service interaction graphs at a minute-level resolution. The system identifies anomalies by comparing embeddings from load tests with those from actual event traffic, focusing on cosine similarity for anomaly scoring. A synthetic anomaly injection framework is introduced for evaluation, achieving a precision of 96% and a low false positive rate of 0.08%, although recall is limited to 58%. The findings highlight the system's capability for early detection of service-related incidents and its practical utility within Prime Video, while also providing insights for broader applications in microservice ecosystems.
Methodology
The authors extend the GCN-GAE framework to train on multiple independently sampled, weighted graph snapshots. They compute node embeddings from directed, weighted service graphs and use cosine similarity for anomaly scoring. A synthetic anomaly injection framework is employed for controlled evaluation of the system's performance.
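The cosine-similarity scoring step is straightforward to sketch: compare each service's load-test embedding with its live-traffic embedding and flag large deviations. The embeddings below are synthetic stand-ins (the real ones come from the GCN-GAE), and the function name is illustrative:

```python
import numpy as np

def anomaly_scores(baseline_emb, live_emb):
    """Score each service (one row per service) by the cosine distance
    between its load-test embedding and its live-traffic embedding;
    higher means more anomalous."""
    num = np.sum(baseline_emb * live_emb, axis=1)
    den = np.linalg.norm(baseline_emb, axis=1) * np.linalg.norm(live_emb, axis=1)
    cos = num / np.clip(den, 1e-12, None)
    return 1.0 - cos   # cosine distance in [0, 2]

rng = np.random.default_rng(1)
base = rng.normal(size=(5, 16))   # 5 services, 16-dim embeddings
live = base.copy()
live[3] = -base[3]                # service 3 drifts to the opposite direction
scores = anomaly_scores(base, live)
flagged = int(np.argmax(scores))
```

Thresholding `scores` (rather than taking the argmax) is what yields the precision/recall trade-off reported in the evaluation.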
Results
The proposed system demonstrates a precision of 96% and a false positive rate of 0.08% in detecting anomalies, although the recall is limited to 58%. The system successfully identifies incident-related services and showcases early detection capabilities.
Implications
This research has significant implications for enhancing the reliability and performance of microservice architectures, particularly during high-demand events. The methodology can be applied to improve anomaly detection and incident response in various distributed systems, potentially leading to better customer experiences and operational efficiency.
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
Large Language Models
Optimization
Efficient ML
- AGENTOPT is the first framework-agnostic tool for client-side optimization of LLM-based agents.
- Model selection is identified as a critical factor, with significant cost differences between model combinations.
- The paper presents eight search algorithms to efficiently navigate the model assignment space.
- Empirical results show that Arm Elimination can reduce evaluation budgets significantly while maintaining accuracy.
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
Summary
The paper introduces AGENTOPT, a framework-agnostic Python package designed for client-side optimization in large language model (LLM)-based agents. While previous research has focused on server-side efficiency, the authors argue that client-side optimization is equally critical as developers increasingly compose agents from local tools, remote APIs, and diverse models. The study emphasizes the importance of model selection in multi-step agent pipelines, revealing that the cost-effectiveness of model combinations can vary significantly, with gaps of 13–32 times in cost at matched accuracy. AGENTOPT implements eight search algorithms to efficiently explore the exponentially growing combination space of model assignments. The empirical results demonstrate that the Arm Elimination algorithm can achieve near-optimal accuracy while reducing evaluation budgets by 24–67% compared to brute-force methods across various benchmarks. This highlights the necessity of considering client-side decisions for optimizing agent performance, as they directly influence application-specific quality, cost, and latency constraints.
Methodology
The authors developed AGENTOPT, which implements eight search algorithms, including Arm Elimination, Epsilon-LUCB, Threshold Successive Elimination, and Bayesian Optimization, to explore model combinations in multi-step agent pipelines. The effectiveness of these algorithms was evaluated across four benchmarks to assess their performance in terms of accuracy and evaluation budget.
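The elimination idea can be sketched as a successive-elimination bandit over candidate model assignments. The exact AGENTOPT algorithm is not specified in the summary, so the batch size, keep fraction, and toy arms below are all assumptions:

```python
import random

def arm_elimination(arms, evaluate, rounds=4, keep=0.5, seed=0):
    """Successive elimination over candidate model assignments ('arms').

    Each round evaluates surviving arms on a fresh batch of tasks and drops
    the worst-scoring fraction, spending far fewer evaluations than running
    every arm on every task."""
    rng = random.Random(seed)
    alive = list(arms)
    spent = 0
    for _ in range(rounds):
        scored = []
        for arm in alive:
            scored.append((evaluate(arm, batch=8, rng=rng), arm))
            spent += 8
        scored.sort(reverse=True)
        alive = [a for _, a in scored[: max(1, int(len(scored) * keep))]]
        if len(alive) == 1:
            break
    return alive[0], spent

# Hypothetical arms: (planner_model, executor_model) pairs with latent accuracy.
true_acc = {("large", "large"): 0.9, ("large", "small"): 0.8,
            ("small", "large"): 0.6, ("small", "small"): 0.4}

def evaluate(arm, batch, rng):
    return sum(rng.random() < true_acc[arm] for _ in range(batch)) / batch

best, budget = arm_elimination(list(true_acc), evaluate)
```

Here four arms are narrowed to one after 48 evaluations instead of the 128 a four-round brute-force sweep would cost, mirroring the budget reductions the paper reports at larger scale.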
Results
The results indicate that the Arm Elimination algorithm can recover near-optimal accuracy while reducing the evaluation budget by 24–67% compared to brute-force search methods. The study also reveals that the cost gap between the best and worst model combinations can reach 13–32 times at matched accuracy, underscoring the importance of model selection in client-side optimization.
Implications
The findings suggest that client-side optimization can significantly enhance the efficiency and effectiveness of LLM-based agents, allowing developers to make informed decisions about resource allocation based on specific application requirements. This approach could lead to more cost-effective and performant AI systems in real-world applications.
Training Without Orthogonalization, Inference With SVD: A Gradient Analysis of Rotation Representations
Computer Vision
Robotics
Theory
- Removing orthogonalization during training improves rotation estimation in deep learning.
- SVD orthogonalization introduces significant gradient distortions, particularly early in training.
- The SVD Jacobian has a rank of 3, indicating limited gradient information retention.
- Gram-Schmidt orthogonalization results in asymmetric gradient signals, favoring 9D over 6D parameterization.
Training Without Orthogonalization, Inference With SVD: A Gradient Analysis of Rotation Representations
Summary
This paper investigates the impact of orthogonalization on the training and inference of rotation representations in deep learning, particularly focusing on 3D rotations represented in SO(3). The author argues that removing orthogonalization during training while applying it only at inference leads to improved rotation estimation. The paper provides a detailed gradient analysis of SVD (Singular Value Decomposition) orthogonalization, revealing that it introduces quantifiable gradient distortions that are most pronounced when the predicted matrix deviates from SO(3). The author derives the exact spectrum of the SVD backward pass Jacobian, demonstrating that it retains only a fraction of the gradient energy and exhibits a rank of 3, which aligns with the dimensionality of SO(3). The paper also compares SVD with Gram-Schmidt orthogonalization, showing that the latter has an asymmetric Jacobian spectrum, thereby justifying the preference for 9D parameterization over 6D in unorthogonalized training. The findings support the approach of training with direct 9D regression and applying SVD only at inference, providing a theoretical foundation for this methodology.
Methodology
The author conducts a detailed Jacobian analysis of SVD and Gram-Schmidt orthogonalization mappings, focusing on 3x3 matrices and SO(3) projections. The analysis includes deriving the exact spectrum of the SVD Jacobian and comparing it with the Gram-Schmidt Jacobian to quantify gradient pathologies and information loss.
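The inference-time projection the paper advocates is the standard orthogonal Procrustes step: regress an unconstrained 3×3 matrix (9D) and snap it to the nearest rotation via SVD with a determinant correction. A minimal numpy sketch:

```python
import numpy as np

def svd_to_rotation(M):
    """Project an arbitrary 3x3 matrix onto SO(3): the nearest rotation in
    the Frobenius sense is U diag(1, 1, det(U V^T)) V^T, where the last
    diagonal entry guards against reflections (det = -1)."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))     # unconstrained 9D network output
R = svd_to_rotation(M)
```

Because this step runs only at inference, training gradients flow through the raw 9D regression and never through the SVD, which is exactly the gradient pathology the paper's Jacobian analysis quantifies.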
Results
The paper establishes that SVD backpropagation retains only one-third of the gradient energy and introduces gradient direction errors, while Gram-Schmidt's Jacobian has an asymmetric spectrum. These findings support the conclusion that training with direct 9D regression and applying SVD at inference is optimal for rotation representation tasks.
Implications
The results have significant implications for computer vision and robotics, particularly in tasks requiring accurate 3D rotation estimations, such as human pose estimation and object tracking. The findings could lead to improved training methodologies that enhance model performance in these areas.
Improving Robustness In Sparse Autoencoders via Masked Regularization
NLP
Large Language Models
Interpretability
- Sparse autoencoders are prone to feature absorption, degrading interpretability despite high reconstruction fidelity.
- The proposed masking-based regularization disrupts co-occurrence patterns, improving robustness and interpretability.
- The method enhances performance across multiple SAE architectures and reduces the OOD gap.
- Results indicate that stronger training objectives combined with architectural advances can mitigate shortcut learning in SAEs.
Improving Robustness In Sparse Autoencoders via Masked Regularization
Summary
This paper addresses the limitations of Sparse Autoencoders (SAEs) in mechanistic interpretability, particularly their susceptibility to feature absorption and poor out-of-distribution (OOD) performance. The authors propose a novel masking-based regularization technique that disrupts co-occurrence patterns during training by randomly replacing tokens in the input sequences. This approach mitigates feature absorption, enhances the robustness of the latent representations, and improves interpretability. The study demonstrates that the proposed method consistently reduces absorption across various SAE architectures and sparsity levels, leading to better probing performance and narrowing the OOD gap. The findings suggest a practical pathway for developing more reliable interpretability tools for large language models (LLMs).
Methodology
The authors introduce a masking-based regularization technique that randomly replaces tokens in input sequences with a fixed mask during training. This method aims to break spurious correlations and encourages the SAE to learn more generalizable structures, thereby reducing reliance on shortcuts. The training objective balances reconstruction fidelity with latent sparsity, while the masking strategy is applied across multiple large language models to evaluate its effectiveness.
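The masking step itself is simple to sketch: before computing the activations the SAE trains on, a random fraction of tokens is replaced by a fixed mask id. The mask id and rate below are illustrative choices, not the paper's settings:

```python
import numpy as np

def mask_tokens(token_ids, mask_id, p=0.15, rng=None):
    """Randomly replace a fraction p of tokens with a fixed mask id,
    breaking spurious token co-occurrence patterns in the sequences the
    SAE sees during training."""
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids).copy()
    hit = rng.random(token_ids.shape) < p
    token_ids[hit] = mask_id
    return token_ids, hit

rng = np.random.default_rng(0)
seq = np.arange(1, 101)                      # hypothetical token ids 1..100
masked, hit = mask_tokens(seq, mask_id=0, p=0.15, rng=rng)
```

The regularization effect comes from the downstream model's activations on the perturbed sequence, not from the masking itself: features that only fired because of co-occurring tokens lose their shortcut signal.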
Results
The proposed masking-based regularization consistently reduces feature absorption and improves performance on various evaluation metrics across different SAE architectures. It also enhances OOD performance, narrowing the gap with oracle probes, indicating that the method effectively promotes more robust and interpretable latent representations.
Implications
The findings suggest that integrating masking-based regularization into the training of sparse autoencoders can lead to more reliable interpretability tools for large language models. This has potential applications in areas requiring mechanistic interpretability, such as model auditing, debugging, and enhancing user trust in AI systems.
Time-Series Classification with Multivariate Statistical Dependence Features
Time Series
Audio & Speech
Efficient ML
- Introduces a framework for non-stationary time-series classification using multivariate statistical dependence features.
- Utilizes the cross density ratio (CDR) for robust statistical dependence measurement independent of sample order.
- Implements the functional maximal correlation algorithm (FMCA) to construct a projection space for feature extraction.
- Achieves competitive recognition accuracy on the TI-46 digit speech corpus with a lightweight neural network architecture.
Time-Series Classification with Multivariate Statistical Dependence Features
Summary
This paper introduces a novel framework for non-stationary time-series classification that leverages multivariate statistical dependence features, specifically through the cross density ratio (CDR). Traditional correlation-based methods struggle with non-stationary signals due to their reliance on fixed windows, which can mix statistics from different regimes. The proposed framework utilizes the functional maximal correlation algorithm (FMCA) to estimate the joint probability density function of input and target signals, allowing for robust statistical dependence measurement that is independent of sample order. The FMCA constructs a projection space that captures complex dependencies, from which multiscale features are extracted and classified using a lightweight single-hidden-layer perceptron. The framework is evaluated on the TI-46 digit speech corpus, demonstrating superior performance compared to hidden Markov models (HMMs) and state-of-the-art spiking neural networks, achieving high accuracy with minimal computational resources.
Methodology
The methodology involves the application of the functional maximal correlation algorithm (FMCA) to estimate the joint probability density function of input and target signals. This is achieved through an orthonormal spectral decomposition of the cross density ratio (CDR). Two neural networks are trained to project the input signals into a multivariate feature space, capturing complex dependencies. The resulting features are aggregated across multiple time scales and classified using a single-hidden-layer perceptron.
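FMCA generalizes maximal-correlation analysis: neural networks play the role of the projection maps, tied to the spectral decomposition of the cross density ratio. Its linear special case is classical canonical correlation analysis (CCA), sketched here on synthetic data as a stand-in (the dimensions and toy signals are illustrative, and this is CCA, not the paper's FMCA):

```python
import numpy as np

def linear_maximal_correlation(X, Y, k=2):
    """Linear special case of the maximal-correlation objective (classical
    CCA): whiten each block, then take the SVD of the whitened
    cross-covariance; singular values are the canonical correlations."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X)
    Cyy = Y.T @ Y / len(Y)
    Cxy = X.T @ Y / len(X)
    Lx = np.linalg.cholesky(Cxx + 1e-10 * np.eye(X.shape[1]))
    Ly = np.linalg.cholesky(Cyy + 1e-10 * np.eye(Y.shape[1]))
    Wx, Wy = np.linalg.inv(Lx).T, np.linalg.inv(Ly).T   # whitening maps
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k], s[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = X @ rng.normal(size=(3, 2))      # Y is a deterministic function of X
A, B, corrs = linear_maximal_correlation(X, Y, k=2)
```

When the target is a deterministic linear function of the input, the top canonical correlations are 1; FMCA's neural projections capture the analogous dependence for nonlinear, non-stationary signals without relying on sample order.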
Results
The proposed FMCA-based framework outperforms traditional hidden Markov models and state-of-the-art spiking neural networks on the TI-46 digit speech corpus, achieving higher classification accuracy with fewer than 10 layers and a storage footprint under 5 MB.
Implications
The framework has significant implications for real-time non-stationary time-series analysis, particularly in applications such as speech recognition, where robustness to regime changes is crucial. Its lightweight architecture also suggests potential for deployment in resource-constrained environments.
Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
NLP
Large Language Models
Optimization
- Identification of a capacity boundary in SLMs, limiting performance on complex reasoning tasks.
- Training on lower-difficulty samples yields competitive results with significantly reduced training effort.
- Cross-dataset generalization shows that easier training distributions can enhance numeric reasoning performance.
- GRPO's effectiveness is contingent on the base model's prior reasoning competence and dataset difficulty.
Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
Summary
This paper investigates the effectiveness of Group Relative Policy Optimization (GRPO) in enhancing mathematical reasoning in Small Language Models (SLMs) under resource constraints. The authors conduct experiments using SLMs with 0.5B to 3B parameters on the GSM8K and MATH datasets, focusing on difficulty-stratified analyses. They find that as problem difficulty increases, the accuracy of the models plateaus, indicating a capacity boundary where GRPO primarily reshapes output preferences without significantly improving performance on the hardest problems. Notably, training on lower-difficulty problems achieves comparable accuracy to full-dataset training while using only about 45% of the training steps. The study also reveals a cross-dataset generalization effect, where models trained on GSM8K outperform those trained on MATH when evaluated on numeric subsets. The findings suggest that the intrinsic capacity of SLMs limits their ability to benefit from complex reasoning tasks, emphasizing the importance of selecting appropriate training samples based on difficulty.
Methodology
The authors employ a two-stage training protocol involving Supervised Fine-Tuning (SFT) followed by alignment using GRPO. They utilize Low-Rank Adaptation (LoRA) for parameter efficiency and implement a difficulty-aware reward modeling system that weights correctness based on problem difficulty. The datasets are stratified into low and high difficulty tiers to analyze the impact of training data complexity on performance.
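GRPO's core normalization and the difficulty-aware reward can both be sketched in a few lines. The group-relative advantage is standard; the difficulty weighting shown is a hypothetical form, since the paper's exact scheme is not given in the summary:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize each sampled completion's
    reward against the mean/std of its own group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def difficulty_weighted_reward(correct, difficulty):
    """Hypothetical difficulty-aware reward: correct answers on harder
    problems earn proportionally more (the paper's weighting may differ)."""
    return float(correct) * (1.0 + difficulty)

# Eight completions sampled for one prompt of difficulty 0.5: three correct.
outcomes = [1, 0, 1, 0, 0, 0, 1, 0]
rewards = [difficulty_weighted_reward(c, 0.5) for c in outcomes]
adv = grpo_advantages(rewards)
```

Because advantages are centered within each group, a prompt where every rollout fails (or every rollout succeeds) contributes no gradient signal, which is one mechanism behind the saturation on the hardest problems described above.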
Results
The study demonstrates that GRPO alignment reaches a saturation point in SLMs, where further exposure to high-difficulty tasks does not yield performance improvements. Training exclusively on lower-difficulty problems achieves similar accuracy to full-dataset training while requiring fewer training steps. Additionally, models trained on GSM8K show better performance on the numeric subset of MATH compared to those trained on MATH.
Implications
These findings suggest that for resource-constrained SLMs, focusing on lower-difficulty training samples can lead to more efficient learning and better performance in mathematical reasoning tasks. This has implications for the design of training protocols in machine learning, particularly in scenarios where computational resources are limited.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
NLP
Large Language Models
Efficient ML
- Introduces the Master Key Hypothesis for capability transfer across models.
- Presents Unlock, a training-free and label-free framework for capability transfer.
- Demonstrates significant performance improvements in reasoning tasks without retraining.
- Shows that capability transfer can match or exceed post-training performance.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
Summary
This paper introduces the Master Key Hypothesis, which posits that model capabilities can be represented as directions in a low-dimensional latent subspace, allowing for the transfer of these capabilities across different models without retraining. The authors propose a novel framework called Unlock, which operates in three stages: extracting a capability direction from a source model by contrasting activations of capability-present and capability-absent variants, aligning this direction with a target model through a low-rank linear transformation, and applying the direction during inference to elicit the desired behavior. The framework is training-free and label-free, making it architecture-agnostic and efficient. The authors validate their approach through experiments on reasoning tasks, including Chain-of-Thought (CoT) and mathematical reasoning, demonstrating significant performance improvements across various model scales. For instance, transferring CoT reasoning from a larger model to a smaller one resulted in a 12.1% accuracy gain on mathematical tasks, showcasing the effectiveness of their method in enhancing model capabilities without additional training.
Methodology
The methodology involves three main steps: (1) extracting a capability direction from a source model by contrasting activations of capability-present and capability-absent variants using unlabeled prompts; (2) estimating a low-rank linear transformation to align this direction with the target model's latent space; and (3) applying the transferred direction at inference time to elicit the desired behavior, all without requiring training or labeled data.
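The three stages can be sketched end-to-end with synthetic activations standing in for real model states. Everything here is illustrative (the dimensions, the planted capability direction, the exactly-linear "paired activations", and the steering strength `alpha` are assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n = 32, 48, 500

# Stage 1: capability direction in the source model = mean activation
# difference between capability-present and capability-absent prompts.
acts_on = rng.normal(size=(n, d_src)) + 2.0 * np.eye(d_src)[0]  # shift on dim 0
acts_off = rng.normal(size=(n, d_src))
direction = acts_on.mean(0) - acts_off.mean(0)

# Stage 2: linear map aligning source activations with the target model's
# space, fit by least squares on paired activations from shared prompts.
A_true = rng.normal(size=(d_src, d_tgt)) / np.sqrt(d_src)
src_acts = rng.normal(size=(n, d_src))
tgt_acts = src_acts @ A_true            # stand-in paired activations
A, *_ = np.linalg.lstsq(src_acts, tgt_acts, rcond=None)

# Stage 3: steer the target model by adding the transferred direction to
# its hidden state at inference time (alpha is an illustrative strength).
alpha = 1.0
h_tgt = rng.normal(size=d_tgt)
steered = h_tgt + alpha * (direction @ A)
```

No gradients or labels are needed at any stage, which is what makes the framework training-free and architecture-agnostic in spirit.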
Results
The experiments showed that transferring capabilities such as Chain-of-Thought reasoning from larger models to smaller ones resulted in substantial accuracy gains, such as improving GSM8K accuracy from 9.2% to 56.0% without explicit prompting. Additionally, mathematical reasoning capabilities were successfully transferred, improving AGIEval Math accuracy from 61.1% to 71.3%, surpassing the performance of the post-trained model.
Implications
The findings suggest that the Unlock framework can significantly reduce the costs and time associated with training new models by enabling the reuse of existing capabilities. This could accelerate the development of AI systems and promote modularity in model design, allowing for more efficient deployment of advanced reasoning capabilities.
Selective Neuron Amplification for Training-Free Task Enhancement
NLP
Large Language Models
Efficient ML
- Selective Neuron Amplification (SNA) enhances transformer model performance without changing learned parameters.
- The method identifies and amplifies neurons with strong task-specific responses during inference.
- SNA shows significant improvements in low-confidence scenarios, with a mean improvement of 27.85% in certain tasks.
- The effectiveness of SNA varies across different performance zones, indicating a saturation effect.
Read more
Selective Neuron Amplification for Training-Free Task Enhancement
Summary
This paper introduces Selective Neuron Amplification (SNA), a novel framework aimed at enhancing the performance of transformer models without altering their learned parameters. The motivation stems from the observation that large language models, despite their capabilities, often struggle with tasks like arithmetic due to inconsistent activation of relevant circuits during inference. Traditional fine-tuning methods, while effective, come with significant costs, including the need for labeled data and the risk of degrading existing capabilities. SNA addresses this by identifying neurons that exhibit strong task-specific responses through differential activation analysis and amplifying their outputs during inference. This method is applied within a single forward pass and is fully reversible, allowing for practical scalability. The evaluation of SNA across 24,192 configurations spanning 12 tasks revealed that the model's baseline confidence is a critical predictor of improvement, with the strongest gains observed in low-confidence scenarios. The findings suggest a Three-Zone Saturation Model, indicating varying effectiveness of SNA based on baseline performance levels. Additionally, the results highlight layer-wise functional specialization and potential cross-task interference, suggesting that certain circuits may compete for representational capacity. Preliminary validation on a different model further supports the effectiveness of SNA, particularly in enhancing performance in low-confidence regimes.
Methodology
The methodology involves differential activation analysis to identify neurons with elevated task-specific responses. During inference, these neurons' outputs are scaled using activation hooks, allowing for localized amplification without parameter updates. This approach is tested across various tasks using GPT-2 models.
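A minimal sketch of the selection-then-amplification idea. The synthetic activations, the top-k size, and the gain of 1.5 are assumptions for illustration; the paper applies this to GPT-2 via activation hooks rather than explicit arrays:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic MLP-layer activations on task vs. control prompts; neurons
# 0-9 are made task-selective so the selection step has a ground truth.
acts_task = rng.normal(size=(200, 512))
acts_control = rng.normal(size=(200, 512))
acts_task[:, :10] += 2.0

# Differential activation analysis: mean task-minus-control response,
# then keep the top-k most task-selective neurons.
diff = acts_task.mean(axis=0) - acts_control.mean(axis=0)
selected = np.argsort(diff)[-10:]

def amplify(hidden, neurons, gain=1.5):
    """Scale selected neuron outputs at inference; reversible (divide by gain)."""
    out = hidden.copy()
    out[..., neurons] *= gain
    return out

h = rng.normal(size=(512,))
h_amp = amplify(h, selected)
```

Because the intervention is a multiplicative rescaling at inference time, removing the hook restores the original model exactly.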
Results
SNA was evaluated across 24,192 configurations, yielding a mean improvement of 27.85% in low-confidence scenarios. The analysis revealed a negative correlation between baseline confidence and SNA-driven improvement (Spearman ρ = -0.762). The Three-Zone Saturation Model categorized performance based on baseline levels, with significant gains in Zone 1 and limited improvements in Zone 3.
Implications
The findings suggest that SNA could be a practical alternative to fine-tuning for enhancing model performance in specific tasks, particularly in scenarios where confidence is low. This approach could streamline the process of adapting models to new tasks without the need for extensive retraining.
On the Geometry of Positional Encodings in Transformers
NLP
Large Language Models
Theory
- Positional information is essential for Transformers to perform order-sensitive tasks.
- Distinct positional encodings are learned during training, leading to effective representation of sequence positions.
- Optimal positional encodings can be approximated using MDS on Hellinger distances, although exact reproduction is unattainable.
- The sinusoidal encoding is theoretically justified as optimal for certain types of corpora.
Read more
On the Geometry of Positional Encodings in Transformers
Summary
This paper addresses the theoretical underpinnings of positional encodings in Transformers, which are crucial for processing sequences of words. The author establishes a mathematical framework to explore three main questions: the necessity of positional information, the structure of learned positional encodings, and the characteristics of an optimal positional encoding. Theorem 1 demonstrates that without positional signals, Transformers treat all permutations of input as equivalent, rendering them incapable of handling tasks sensitive to word order. The Positional Separation Theorem (Theorem 4) shows that training leads to distinct vector representations for different sequence positions. The paper also investigates the optimality of positional encodings, revealing that while exact reproduction of statistical distances between word distributions is impossible due to the curved geometry of the relevant space, a good approximation can be achieved using classical multidimensional scaling (MDS) on Hellinger distances. The stress criterion is introduced to evaluate the quality of various encodings, with sinusoidal encoding shown to be approximately optimal for smoothly varying positional statistics. The paper concludes with empirical validation on synthetic and real-world datasets, confirming theoretical predictions and highlighting the superior performance of Attention with Linear Biases (ALiBi) over traditional sinusoidal encodings.
Methodology
The paper employs theoretical proofs to establish the necessity and structure of positional encodings, alongside classical multidimensional scaling to approximate optimal encodings. Empirical validation is conducted using synthetic and real-world datasets, comparing various encoding methods.
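The classical-MDS-on-Hellinger-distances step can be sketched as follows, assuming a toy corpus of per-position word distributions (the vocabulary size, number of positions, embedding dimension, and Dirichlet-sampled distributions are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "corpus": a next-word distribution over a 50-word vocabulary
# at each of 20 sequence positions.
n, vocab, d = 20, 50, 4
probs = rng.dirichlet(np.ones(vocab), size=n)

# Hellinger distance between the word distributions at positions i and j.
sq = np.sqrt(probs)
D = np.sqrt(0.5 * ((sq[:, None, :] - sq[None, :, :]) ** 2).sum(axis=-1))

# Classical MDS: double-center the squared distance matrix, eigendecompose,
# and keep the top-d coordinates as the positional encoding.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
evals, evecs = np.linalg.eigh(B)
top = np.argsort(evals)[::-1][:d]
encoding = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

# Stress: relative mismatch between target and embedded distances.
Dhat = np.sqrt(((encoding[:, None, :] - encoding[None, :, :]) ** 2).sum(axis=-1))
stress = np.sqrt(((D - Dhat) ** 2).sum() / (D ** 2).sum())
```

The stress value computed at the end corresponds to the kind of criterion the paper uses to compare encoding schemes.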
Results
Theoretical results confirm that positional encodings are necessary for order-sensitive tasks, and distinct representations are learned during training. The sinusoidal encoding is shown to be approximately optimal for specific corpora, while empirical tests reveal that ALiBi encoding significantly reduces stress compared to sinusoidal and RoPE encodings.
Implications
The findings provide a foundational understanding of positional encodings, guiding future research and development of more effective encoding schemes in Transformers. This could enhance the performance of NLP models in tasks where word order is critical.
Information as Structural Alignment: A Dynamical Theory of Continual Learning
Theory
- Catastrophic forgetting is a structural issue, not merely an engineering failure.
- The Informational Buildup Framework (IBF) redefines knowledge retention through structural alignment.
- IBF demonstrates superior performance in continual learning tasks without relying on raw data storage.
- The agency mechanism's effectiveness is context-dependent, yielding varying outcomes based on the learning environment.
Read more
Information as Structural Alignment: A Dynamical Theory of Continual Learning
Summary
This paper addresses the issue of catastrophic forgetting in continual learning, proposing that it is a mathematical consequence of how knowledge is stored in neural networks. The author introduces the Informational Buildup Framework (IBF), which posits that information arises from structural alignment rather than being merely stored content. The framework is governed by two key equations: a Law of Motion that drives the system towards higher coherence and Modification Dynamics that adapt the coherence landscape based on discrepancy signals. The paper demonstrates the IBF through a two-dimensional toy model and validates it across three domains: a controlled non-stationary environment, chess, and Split-CIFAR-100. The results indicate that IBF achieves superior retention without the need for raw data storage, exhibiting near-zero forgetting on CIFAR-100, positive backward transfer in chess, and significantly less forgetting compared to traditional replay methods. The findings also reveal that the agency mechanism's effectiveness varies across different discrepancy regimes, highlighting the nuanced behavior of the framework. Overall, the IBF presents a novel approach to continual learning that integrates memory, agency, and self-correction as intrinsic properties of the learning dynamics.
Methodology
The paper introduces the Informational Buildup Framework (IBF) and validates it through a two-dimensional toy model and three practical domains. The framework is defined by two governing equations that dictate the learning dynamics, focusing on coherence and discrepancy signals. The performance is evaluated in a controlled non-stationary environment, chess, and Split-CIFAR-100, comparing results against traditional methods.
Results
IBF achieved near-zero forgetting on CIFAR-100 (BT = -0.004), a positive backward transfer of +38.5 cp in chess, and 43% less forgetting than replay methods in the controlled domain. In chess, it provided a mean behavioral advantage of +88.9 ± 2.8 cp, outperforming MLP and replay baselines.
Implications
The findings suggest that continual learning systems can be designed to inherently avoid catastrophic forgetting by leveraging structural alignment principles. This could lead to more efficient learning algorithms in dynamic environments, with applications in robotics, adaptive systems, and any domain requiring continual learning.
LLMs Should Express Uncertainty Explicitly
Large Language Models
NLP
Interpretability
- Uncertainty in LLMs should be explicitly communicated rather than inferred post-hoc.
- Two interfaces for uncertainty are proposed: global (verbalized confidence) and local (reasoning-time markers).
- The verbalized-confidence interface improves calibration and reduces overconfident errors.
- The reasoning-time interface enhances visibility of failures and aids in retrieval control.
Read more
LLMs Should Express Uncertainty Explicitly
Summary
This paper addresses the critical issue of how large language models (LLMs) handle uncertainty, particularly in decision-making contexts where uncertainty must be communicated effectively. The authors argue that existing methods treat uncertainty as a latent quantity to be estimated post-generation, rather than as a communicative signal that models should express during their operation. They propose two interfaces for expressing uncertainty: a global interface that verbalizes a calibrated confidence score for the final answer and a local interface that emits an explicit <uncertain> marker during reasoning when the model encounters high-risk states. The study demonstrates that these interfaces yield complementary benefits: the verbalized-confidence interface enhances calibration and reduces overconfident errors, while the reasoning-time interface reveals previously hidden failures and improves retrieval control. The authors provide analyses showing how these interfaces shift error types and improve model behavior, arguing for a task-matched approach to uncertainty communication in LLMs.
Methodology
The authors implemented both uncertainty interfaces within a unified post-training framework and evaluated their performance based on calibration quality, behavioral reliability, and downstream retrieval control. They conducted analyses to understand the mechanisms behind the observed improvements, including error type shifts and confidence manifold sharpening.
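Calibration quality of verbalized confidences is commonly scored with Expected Calibration Error. A minimal sketch, where the equal-width binning scheme and the toy data are illustrative (the paper's exact metric may differ):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error over equal-width confidence bins."""
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    bins = np.clip((c * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            # weight each bin's |accuracy - confidence| gap by its sample share
            err += m.mean() * abs(y[m].mean() - c[m].mean())
    return err

# An 80%-confident model that is right 80% of the time is calibrated (ECE ~ 0);
# a 100%-confident model that is right half the time is overconfident (ECE = 0.5).
calibrated = ece(np.full(10, 0.8), np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=float))
overconfident = ece(np.ones(4), np.array([1.0, 0.0, 1.0, 0.0]))
```

Lower ECE means the verbalized confidence scores track actual correctness more closely, which is the improvement the global interface targets.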
Results
The study found that the verbalized-confidence interface significantly improved calibration and reduced overconfident errors compared to strong baselines. The reasoning-time interface made previously silent failures visible, enhancing the model's ability to signal when intervention is needed. Overall, both interfaces contributed to a more effective uncertainty communication strategy in LLMs.
Implications
The findings suggest that explicitly communicating uncertainty can enhance the reliability of LLMs in applications requiring decision-making, such as information retrieval and verification tasks. This approach may lead to more robust AI systems that can better assist users in uncertain scenarios.
On the Price of Privacy for Language Identification and Generation
NLP
Large Language Models
Theory
- Approximate DP incurs no statistical penalty for language identification and generation tasks.
- Under pure DP, the degradation in performance is characterized by a factor of min{1,ε}.
- Generation tasks achieve a tighter privacy-utility tradeoff compared to identification tasks.
- The study provides a complete characterization of the price of privacy in language learning.
Read more
On the Price of Privacy for Language Identification and Generation
Summary
This paper investigates the cost of privacy in language identification and generation tasks using differential privacy (DP) in an agnostic statistical setting. The authors establish algorithms and lower bounds that quantify the privacy cost for these tasks. They demonstrate that under approximate (ε,δ)-DP, the error rates for both tasks recover non-private performance, while under pure ε-DP, the error rates degrade by a factor of min{1,ε}. The study reveals that the cost of privacy is minimal, being absent under approximate DP and only slightly affecting performance under pure DP. The authors provide a complete characterization of the privacy-utility tradeoff for language identification and generation, highlighting that generation benefits from a tighter tradeoff compared to identification. The paper addresses technical challenges in adapting non-private algorithms to a private setting, leading to significant insights into the implications of privacy in language learning.
Methodology
The authors utilize differential privacy frameworks to develop algorithms for language identification and generation. They adapt existing non-private algorithms by addressing their sensitivity issues and employing mechanisms such as the exponential and Gaussian mechanisms for privatization. The study operates within an agnostic statistical learning model, where the learner receives i.i.d. samples from an unknown distribution.
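The Gaussian mechanism mentioned above has a standard form; here is a sketch under the classic analytic calibration (valid for ε ≤ 1), where the released statistic, the sensitivity, and all parameter values are illustrative rather than taken from the paper:

```python
import numpy as np

def gaussian_mechanism(stat, l2_sensitivity, eps, delta, rng):
    """Classic Gaussian mechanism for (eps, delta)-DP (requires eps <= 1):
    noise scale is calibrated to the statistic's L2 sensitivity."""
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return stat + rng.normal(0.0, sigma, size=stat.shape), sigma

rng = np.random.default_rng(3)
counts = np.array([120.0, 45.0, 9.0, 3.0])  # e.g. a histogram to be released
noisy, sigma = gaussian_mechanism(counts, l2_sensitivity=1.0, eps=0.5, delta=1e-5, rng=rng)
```

The exponential mechanism, the other primitive the authors cite, instead samples a discrete output with probability weighted by a utility score; both are standard privatization building blocks.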
Results
The results indicate that under approximate DP, both language identification and generation tasks maintain their non-private error rates. Under pure DP, the error rates for identification and generation degrade by a factor of min{1,ε}, with generation achieving optimal rates that match upper and lower bounds. The findings suggest that the privacy cost is surprisingly mild, especially for generation tasks.
Implications
The findings have significant implications for the development of large language models trained on sensitive data, suggesting that privacy can be maintained without substantial performance loss. This research can inform future work on privacy-preserving techniques in natural language processing and machine learning.
Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction
NLP
Large Language Models
Efficient ML
- Introduces a hybrid attention module that combines exact anchors with Top-K retrieval and a fixed-size completion term.
- Maintains the original backbone language model and KV-cache format, ensuring compatibility with existing systems.
- Demonstrates that the proposed method improves performance in long-context benchmarks, particularly in high-entropy attention scenarios.
- Reduces decode-time KV payload reads by estimating contributions from unretrieved tokens, thus minimizing memory traffic.
Read more
Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction
Summary
This paper addresses the challenge of long-context generation in decoder-only Transformers, which is constrained by the traffic of key-value (KV) cache during decoding, especially when KV is offloaded beyond GPU memory. The authors propose a novel retrieval-completion attention module that maintains the backbone weights and KV-cache format while reducing KV-cache read traffic. The method computes exact attention over a small set of anchor tokens and a query-dependent Top-K selection of tokens, while estimating contributions from unretrieved mid-region tokens using a fixed-size feature-map summary created during prefill time. This approach allows for a single normalization step that recovers the missing softmax mass without requiring additional KV reads during attention computation. The proposed method shows significant improvements over traditional Top-K selection methods, particularly in scenarios with high-entropy attention heads, demonstrating its effectiveness in reducing memory traffic and enhancing long-context generation performance.
Methodology
The authors developed a retrieval-completion attention module that computes exact attention over a small fixed set of anchor tokens and a query-dependent Top-K set. They created a fixed-size cache during prefill time to estimate contributions from unretrieved tokens, allowing for a single normalization step that approximates the missing softmax mass without additional KV reads.
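A toy single-query sketch of the idea. The anchor count, Top-K size, the elu+1 feature map, and the way the exact and completion terms share one normalizer are illustrative simplifications of the paper's module, not its actual implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, n_anchor, k = 512, 64, 16, 32
K = rng.normal(size=(T, d)) / np.sqrt(d)
V = rng.normal(size=(T, d))
q = rng.normal(size=(d,))
scores = K @ q

# Exact attention set: a few anchor tokens plus a query-dependent Top-K.
anchors = np.arange(n_anchor)
rest = np.arange(n_anchor, T)
topk = rest[np.argsort(scores[rest])[-k:]]
exact = np.concatenate([anchors, topk])

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x)+1: a positive feature map

# Fixed-size completion summaries over the unretrieved mid-region tokens
# (built once at prefill time in the paper; recomputed here for clarity).
unret = np.setdiff1d(rest, topk)
S_kv = phi(K[unret]).T @ V[unret]   # (d, d) key-value summary
z = phi(K[unret]).sum(axis=0)       # (d,) normalizer summary

# Single normalization combining exact softmax terms and the completion term,
# so the missing softmax mass is approximated without extra KV reads.
w_exact = np.exp(scores[exact])
out = (w_exact @ V[exact] + phi(q) @ S_kv) / (w_exact.sum() + phi(q) @ z)
```

The point of the summaries is that their size is fixed in the sequence length, so decode-time reads scale with the anchor and Top-K budgets rather than with the full KV cache.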
Results
The proposed method outperformed selection-only Top-K approaches in long-context benchmarks, particularly showing the largest gains in high-entropy attention heads. The evaluation demonstrated improved efficiency in terms of KV-read budgets, indicating a significant reduction in memory traffic during decoding.
Implications
This work has potential applications in enhancing the efficiency of long-context generation in large language models, particularly in scenarios where memory constraints are a concern. The method could be integrated into existing Transformer architectures to improve performance without altering the backbone model.
Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning
NLP
Large Language Models
Reinforcement Learning
- Fine-tuning on single best moves leads to effective RL but unfaithful reasoning.
- Training on multi-move trajectories results in more stable RL and faithful reasoning.
- Reinforcement learning improves move quality and reduces hallucination rates.
- SFT-checkpoint metrics can predict final RL performance.
Read more
Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning
Summary
This paper investigates how reasoning capabilities in language models can be enhanced through supervised fine-tuning (SFT) and reinforcement learning (RL), specifically in the context of chess. The authors analyze the impact of different training datasets on model performance, revealing that fine-tuning a model to predict the best move leads to effective RL but results in unfaithful reasoning. In contrast, training on multi-move trajectories yields similar performance with more stable RL and faithful reasoning. The study demonstrates that RL significantly improves move quality and reduces hallucination rates. Additionally, several SFT-checkpoint metrics are identified as predictive of post-RL performance. The authors release their checkpoints, models, and training data, achieving superior performance compared to leading open-source reasoning models in chess with a 7B-parameter model.
Methodology
The authors trained a 7B-parameter language model using both supervised fine-tuning (SFT) and reinforcement learning (RL) on custom datasets related to chess. They compared the effects of training on single best moves versus multi-move trajectories and analyzed the resulting performance and reasoning quality.
Results
The study found that while fine-tuning on the best move led to strong performance, it resulted in unfaithful reasoning during RL. Conversely, training on multi-move trajectories yielded comparable performance with more stable RL and faithful reasoning. RL was shown to significantly enhance move quality and decrease hallucination rates. Additionally, specific SFT-checkpoint metrics were found to be predictive of the model's performance after RL.
Implications
The findings suggest that careful selection of training strategies can enhance reasoning capabilities in language models, particularly in complex domains like chess. This research could inform future developments in AI reasoning across various applications, potentially improving decision-making systems in other structured environments.
Feature-Aware Anisotropic Local Differential Privacy for Utility-Preserving Graph Representation Learning in Metal Additive Manufacturing
Graph Learning
- Introduction of FI-LDP, a feature-importance-aware anisotropic local differential privacy mechanism.
- Development of a stratified Hierarchical Graph Attention Network (HGAT) for capturing spatial and thermal dependencies in additive manufacturing.
- Demonstrated significant improvements in utility recovery and defect detection accuracy while ensuring privacy.
- Mechanistic analysis shows a strong correlation between feature importance and noise allocation, enhancing interpretability.
Read more
Feature-Aware Anisotropic Local Differential Privacy for Utility-Preserving Graph Representation Learning in Metal Additive Manufacturing
Summary
This paper addresses the challenges of quality assurance in metal additive manufacturing (AM) by proposing a novel framework called FI-LDP-HGAT. The framework integrates a Feature-Importance-guided Local Differential Privacy (FI-LDP) mechanism and a stratified Hierarchical Graph Attention Network (HGAT) to enhance defect prediction while preserving data privacy. Traditional defect detection methods often overlook the interdependencies of melt-pool observations, treating them as independent samples, which can lead to inaccuracies in defect prediction. The proposed FI-LDP mechanism improves upon conventional local differential privacy techniques by redistributing the privacy budget across feature dimensions based on their importance, allowing for lower noise in critical features and higher noise in less important ones. The HGAT component captures spatial and thermal dependencies in the manufacturing process, enabling more accurate predictions. Experimental results demonstrate that FI-LDP-HGAT achieves significant utility recovery and defect recall while maintaining strict privacy guarantees, outperforming classical machine learning models and other privacy mechanisms.
Methodology
The methodology combines two main components: (1) FI-LDP, which employs an anisotropic Gaussian mechanism for local feature privatization, redistributing privacy budgets based on feature importance; and (2) a stratified HGAT that constructs a hybrid graph to model spatial and thermal dependencies in the manufacturing process, facilitating context-aware inference.
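One plausible way to realize importance-weighted budget splitting is sketched below. The proportional allocation rule, the toy feature values, and the Gaussian calibration are assumptions for illustration, not the paper's exact mechanism:

```python
import numpy as np

def fi_ldp(x, importance, eps_total, delta, sensitivity, rng):
    """Anisotropic Gaussian LDP: split the privacy budget in proportion
    to feature importance, so important features receive less noise."""
    w = np.asarray(importance, dtype=float)
    eps = eps_total * w / w.sum()          # per-feature budget, sums to eps_total
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return x + rng.normal(0.0, sigma), sigma  # per-feature noise scales

rng = np.random.default_rng(5)
x = np.array([0.9, 0.4, 0.1, 0.05])      # toy melt-pool feature vector
imp = np.array([0.6, 0.25, 0.1, 0.05])   # higher importance -> less noise
noisy, sigma = fi_ldp(x, imp, eps_total=4.0, delta=1e-5, sensitivity=1.0, rng=rng)
```

Compared with an isotropic mechanism at the same total budget, this concentrates utility on the features the downstream HGAT relies on most.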
Results
The experiments conducted on a Directed Energy Deposition (DED) porosity dataset show that FI-LDP-HGAT achieves 81.5% utility recovery at a moderate privacy budget (ε = 4) and maintains a defect recall of 0.762 under strict privacy (ε = 2). The framework outperforms classical machine learning models, standard GNNs, and alternative privacy mechanisms across all evaluated metrics.
Implications
The proposed framework has significant implications for the field of metal additive manufacturing, particularly in enhancing data-driven quality assurance while protecting proprietary information. It enables collaborative data sharing without compromising sensitive process information, which is crucial for advancing certification-grade deployments in safety-critical applications.
A comparative analysis of machine learning models in SHAP analysis
Interpretability
- SHAP analysis provides a framework for interpreting predictions from complex machine learning models.
- The paper investigates the variability of SHAP values across different machine learning models.
- A novel high-dimensional waterfall plot is introduced for visualizing SHAP values in multi-classification scenarios.
- The study aims to enhance the understanding of model decision-making processes through SHAP analysis.
Read more
A comparative analysis of machine learning models in SHAP analysis
Summary
This paper presents a comparative analysis of SHapley Additive exPlanations (SHAP) across various machine learning models and datasets, addressing the challenge of interpreting predictions from complex black-box models. The authors emphasize the importance of explainable AI (XAI) methods, particularly SHAP, which quantifies the contribution of each feature to a model's predictions. The study investigates how SHAP values differ across different models and how these differences impact the analysis process. The authors explore three distinct machine learning models with varying complexities and apply them to three datasets, including the UCI adult income dataset. A novel generalization of the waterfall plot is introduced to visualize SHAP values for multi-classification problems, enhancing the interpretability of model predictions. The findings aim to empower analysts by providing insights into the nuances of SHAP analysis, ultimately facilitating the development of data-driven solutions tailored to specific subgroups.
Methodology
The authors conduct a comparative analysis of SHAP values derived from three different machine learning models applied to three datasets. They introduce a generalized waterfall plot for multi-classification problems to aid in the interpretation of SHAP values. The analysis is sample-by-sample, allowing for a detailed understanding of feature contributions to predictions.
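For intuition, the quantity SHAP approximates can be computed exactly by enumerating coalitions on a tiny model. This is a from-scratch illustration of the Shapley definition, not the SHAP library's algorithm or the paper's setup:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by coalition enumeration (tractable for small n)."""
    n = len(x)
    phi = np.zeros(n)

    def val(S):
        # Evaluate f with features in S taken from x, the rest from baseline.
        z = baseline.copy()
        z[list(S)] = x[list(S)]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[i] += weight * (val(S + (i,)) - val(S))
    return phi

f = lambda z: 2 * z[0] + z[1] * z[2]   # toy model with an interaction term
x = np.array([1.0, 2.0, 3.0])
base = np.zeros(3)
phi = shapley_values(f, x, base)       # the interaction's credit splits evenly
```

The efficiency property, that the values sum to f(x) minus f(baseline), is exactly what a waterfall plot visualizes step by step.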
Results
The study reveals that SHAP values vary significantly across different models, affecting the interpretation of model predictions. The introduction of the high-dimensional waterfall plot facilitates better visualization and understanding of SHAP values in multi-class settings, providing clearer insights into model behavior.
Implications
The findings of this research can enhance the trustworthiness of machine learning models in high-stakes applications by providing clearer explanations of model predictions. The insights gained from SHAP analysis can inform the development of personalized solutions in fields such as healthcare, finance, and beyond.
Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach
Theory
Graph Learning
Optimization
- Introduces a new method for bivariate causal discovery called rate-distortion MDL (RDMDL).
- Addresses limitations in existing MDL-based methods regarding the estimation of the cause variable's description length.
- Utilizes rate-distortion theory and histogram-based density estimation for improved causal direction determination.
- Demonstrates competitive performance of RDMDL on the Tübingen dataset.
Read more
Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach
Summary
This paper presents a novel approach to bivariate causal discovery using the minimum description length (MDL) principle, specifically addressing the estimation of the description length of the cause variable. The authors argue that existing MDL-based methods inadequately estimate this length, leading to biased causal direction decisions. They propose a new method, rate-distortion MDL (RDMDL), which incorporates rate-distortion theory to measure the description length of the cause variable. This involves determining the minimum rate required to achieve a distortion level representative of the underlying distribution, utilizing histogram-based density estimation. The RDMDL method combines this new approach with traditional methods for estimating the causal mechanism. Experimental results demonstrate that RDMDL performs competitively on the Tübingen dataset, showcasing its effectiveness in causal discovery tasks.
Methodology
The authors develop the RDMDL method by applying rate-distortion theory to measure the description length of the cause variable. They compute the minimum rate necessary to achieve a distortion level that reflects the underlying distribution, while also employing traditional approaches to estimate the causal mechanism. This combination allows for a more accurate assessment of causal direction in bivariate data.
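A one-variable sketch of the rate-distortion tradeoff with histogram coding. The bin counts, the distortion target, and the choice of empirical entropy as the rate are illustrative assumptions, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=2000)  # toy "cause" variable

def rate_distortion(x, n_bins):
    """Rate = entropy (bits/sample) of the bin index; distortion = MSE
    between samples and their bin centers."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    p = np.bincount(idx, minlength=n_bins) / len(x)
    rate = -(p[p > 0] * np.log2(p[p > 0])).sum()
    distortion = ((x - centers[idx]) ** 2).mean()
    return rate, distortion

# Description length of the cause: the smallest rate whose distortion
# meets a target level (target value here is arbitrary).
target = 0.01
curves = [rate_distortion(x, b) for b in (2, 4, 8, 16, 32, 64)]
min_rate = min(r for r, dist in curves if dist <= target)
```

Finer histograms raise the rate and lower the distortion; the minimum rate at a fixed distortion is the rate-distortion-style quantity RDMDL uses in place of a naive codelength.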
Results
The experimental evaluation of RDMDL on the TΓΌbingen dataset indicates that it achieves competitive performance compared to existing state-of-the-art methods in bivariate causal discovery, validating the effectiveness of the proposed approach.
Implications
The findings suggest that RDMDL can enhance the accuracy of causal discovery in observational data, particularly in scenarios where traditional experimental methods are not feasible. This has potential applications in various fields, including social sciences, epidemiology, and economics, where understanding causal relationships is crucial.
Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
Efficient ML
Large Language Models
Theory
- Introduces a theoretically grounded metric for expert-wise mixed-precision quantization based on router's L2 norm changes.
- Demonstrates that experts capturing less prevalent features require higher precision to maintain model performance.
- Empirical results show improved accuracy and reduced inference costs on large MoE models compared to existing methods.
- The proposed method incurs negligible computational overhead for determining expert bit-widths.
Read more
Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
Summary
This paper addresses the challenge of efficiently quantizing Sparse Mixture-of-Experts (MoE) models, which are increasingly used in language and vision tasks due to their ability to scale without incurring high training costs. The authors propose a novel expert-wise mixed-precision quantization strategy grounded in theoretical analysis, which assigns varying bit-widths to experts based on their sensitivity to quantization. Specifically, the method utilizes the change in the router's L2 norm during training as a metric to determine the importance of each expert. Experts that exhibit smaller changes are deemed to capture critical but less frequent features, thus requiring higher precision to maintain model performance. The approach also considers the maximum intra-neuron variance to further optimize bit-width allocation. Empirical evaluations on large-scale MoE models, including Switch Transformer and Mixtral, demonstrate that this method achieves superior accuracy compared to existing quantization strategies while significantly reducing inference costs and incurring minimal overhead for bit-width assignment.
Methodology
The authors developed a mixed-precision quantization strategy that assigns bit-widths to experts based on their sensitivity to quantization, measured by the change in the router's L2 norm during training. This theoretical framework was validated through empirical experiments on large MoE models, comparing the proposed method against existing heuristics for bit-width allocation.
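A minimal sketch of rank-based expert bit-width assignment. The bit levels and the rank-to-tier rule are illustrative assumptions; the paper's criterion additionally uses maximum intra-neuron variance to refine the allocation:

```python
import numpy as np

def assign_bits(router_l2_change, levels=(8, 4, 3, 2)):
    """Smaller router L2-norm change -> more sensitive expert -> more bits.
    Experts are ranked by their change and mapped to bit tiers."""
    n = len(router_l2_change)
    ranks = np.argsort(np.argsort(router_l2_change))  # 0 = smallest change
    tier = ranks * len(levels) // n                   # map rank to a bit tier
    return np.array(levels)[tier]

delta = np.array([0.1, 2.0, 0.5, 1.5])  # per-expert router L2-norm change
bits = assign_bits(delta)
```

The overhead of such a rule is negligible, since it needs only the router norms already available from training (or, per the results, from a pretrained checkpoint).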
Results
The proposed expert-wise mixed-precision quantization method outperformed existing uniform and heuristic-based approaches in terms of accuracy while allowing for significant reductions in inference costs. The method was shown to maintain model performance even at ultra-low-bit quantization levels (below 3-bit). Additionally, it demonstrated that using the router's L2 norm from pretrained models could yield comparable test accuracy without requiring extensive fine-tuning.
Implications
This research has significant implications for deploying large-scale MoE models in resource-constrained environments, as it allows for efficient memory usage and computational savings without sacrificing performance. The findings could enhance the practicality of MoE architectures in real-world applications, particularly in NLP and computer vision tasks.
Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence
Theory
- Introduction of new assumptions for MPE based on conditional independence, enhancing identifiability.
- Development of method of moments estimators with established asymptotic properties.
- Creation of weakly-supervised kernel tests for validating CI assumptions using unlabeled data.
- Empirical validation showing improved performance of proposed methods over existing approaches.
Read more
Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence
Summary
This paper addresses the problem of Mixture Proportion Estimation (MPE), which is crucial for weakly supervised learning tasks such as positive-unlabeled learning and domain adaptation. Traditional MPE methods rely on the irreducibility assumption for identifiability, which can be restrictive in practical applications. The authors propose new assumptions based on conditional independence (CI) given class labels, allowing for identifiability even when irreducibility does not hold. They develop method of moments estimators under these new assumptions and analyze their asymptotic properties. Additionally, the paper introduces weakly-supervised kernel tests to validate the CI assumptions, which are significant for applications in causal discovery and fairness evaluation. The empirical results demonstrate that the proposed estimators outperform existing methods and effectively control type I and type II errors in hypothesis testing.
Methodology
The authors propose method of moments estimators based on conditional independence and multivariate conditional independence assumptions. They also develop kernel tests for CI and MCI using unlabeled data, which are consistent under mild conditions. The testing methods are analyzed under scenarios with known and unknown mixture proportions, with gamma approximation methods derived for statistical inference.
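As a simplified illustration of moment matching for mixture proportions, the sketch below is the textbook one-dimensional case where samples from both components are available; the paper's CI-based estimators are more general and do not require this, so treat the setup as an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def mpe_moment(mix, comp_g, comp_h, f=np.mean):
    """Simplified method-of-moments sketch (not the paper's CI-based
    estimator): with samples from the mixture F = k*G + (1-k)*H and from
    both components, matching E_F[f] = k*E_G[f] + (1-k)*E_H[f] gives
        k_hat = (E_F[f] - E_H[f]) / (E_G[f] - E_H[f])."""
    ef, eg, eh = f(mix), f(comp_g), f(comp_h)
    return (ef - eh) / (eg - eh)

true_k = 0.3
g = rng.normal(2.0, 1.0, 50_000)                 # component G
h = rng.normal(0.0, 1.0, 50_000)                 # component H
pick = rng.random(50_000) < true_k
mix = np.where(pick, rng.normal(2.0, 1.0, 50_000),
                     rng.normal(0.0, 1.0, 50_000))

print(round(mpe_moment(mix, g, h), 2))           # close to the true k = 0.3
```

The paper's contribution is precisely to make such estimators identifiable when only weaker (conditionally independent) views of the data are observed, rather than clean component samples.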
Results
The proposed estimators show asymptotic normality and outperform existing MPE methods in empirical tests. The kernel tests successfully control both type I and type II errors, demonstrating their effectiveness in validating the CI assumptions.
Implications
The findings suggest that the new MPE methods can be applied in various weakly supervised learning scenarios, improving the robustness of classifiers trained on unlabeled data. The kernel tests also provide a valuable tool for researchers in causal inference and fairness assessment in machine learning.
RAGEN-2: Reasoning Collapse in Agentic RL
Reinforcement Learning
Large Language Models
NLP
- Identification of template collapse in multi-turn agent RL, where reasoning appears diverse but is input-agnostic.
- Introduction of a mutual information proxy for diagnosing reasoning quality, which outperforms traditional entropy measures.
- Explanation of template collapse through a signal-to-noise ratio mechanism, highlighting the impact of low reward variance.
- Development of SNR-Aware Filtering to enhance input dependence and task performance during training.
Summary
The paper addresses the instability in training multi-turn large language model (LLM) agents using reinforcement learning (RL), particularly focusing on the phenomenon termed 'template collapse'. This occurs when models exhibit high within-input diversity (as measured by entropy) but fail to adapt their reasoning across different inputs, relying instead on fixed templates. The authors propose a novel diagnostic framework that decomposes reasoning quality into two components: within-input diversity and cross-input distinguishability, the latter measured using mutual information (MI). They introduce a mutual information proxy for online diagnosis, which correlates more strongly with task performance than entropy. The paper explains the underlying causes of template collapse through a signal-to-noise ratio (SNR) mechanism, where low reward variance diminishes task gradients, allowing regularization to dominate and erase input-dependent reasoning. To mitigate this issue, the authors propose 'SNR-Aware Filtering', which selects high-signal prompts based on reward variance during training. The proposed methods are validated across various tasks, demonstrating consistent improvements in both input dependence and overall task performance.
Methodology
The authors decompose reasoning quality into within-input diversity and cross-input distinguishability, using mutual information as a proxy for the latter. They analyze the effects of reward variance on task gradients and propose SNR-Aware Filtering to select high-signal prompts based on reward variance. The methodology is validated through experiments across various tasks, including planning, math reasoning, web navigation, and code execution.
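The filtering step lends itself to a short sketch. The variance-then-top-k selection below is a minimal interpretation of "select high-signal prompts based on reward variance"; the `keep_frac` threshold and data layout are assumptions, not the paper's exact procedure:

```python
import numpy as np

def snr_aware_filter(prompt_rewards, keep_frac=0.5):
    """Illustrative SNR-aware prompt filtering: for each prompt, compute
    the variance of rewards across its sampled rollouts and keep the
    high-variance (high-signal) prompts for the policy update.
    `prompt_rewards` maps prompt_id -> list of rollout rewards."""
    variances = {p: np.var(r) for p, r in prompt_rewards.items()}
    k = max(1, int(len(variances) * keep_frac))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:k]

rollouts = {
    "p0": [1, 1, 1, 1],   # zero reward variance: no learning signal
    "p1": [0, 1, 0, 1],   # high variance: informative prompt
    "p2": [0, 0, 0, 1],   # some variance
    "p3": [0, 0, 0, 0],   # zero variance
}
print(snr_aware_filter(rollouts))  # keeps the two highest-variance prompts
```

Zero-variance prompts contribute no task gradient, which is exactly the regime where, per the paper's SNR analysis, regularization dominates and input-dependent reasoning erodes.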
Results
The mutual information proxy was found to correlate significantly better with task performance than entropy, validating its effectiveness in diagnosing template collapse. The SNR-Aware Filtering method consistently improved both input dependence and task performance across multiple tasks, algorithms, and model scales.
Implications
The findings suggest that improving the diagnostic tools for reasoning quality in RL can lead to more reliable and effective multi-turn LLM agents. The proposed methods could be applied to enhance the performance of agents in various applications, including planning, reasoning, and interactive systems.
Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem
Reinforcement Learning
Optimization
Theory
- Introduces a soft Bellman residual minimization framework using weighted Lp-norms.
- Establishes a connection between BRM and the contraction properties of the Bellman operator.
- Derives performance error bounds that improve error control in reinforcement learning.
- Demonstrates that the proposed method is compatible with gradient-based optimization.
Summary
This paper addresses the challenges of solving Markov Decision Processes (MDPs) under function approximation, particularly focusing on the geometric mismatch between the Bellman optimality operator and commonly used objectives in reinforcement learning. The authors propose a soft formulation of Bellman residual minimization (BRM) that utilizes a generalized weighted Lp-norm, which aligns the optimization objective with the contraction geometry of the Bellman operator as the parameter p increases. The analysis reveals that while traditional DP-based methods face difficulties due to the L2-norm projection, the proposed weighted Lp-norm formulation remains amenable to gradient-based BRM. The authors derive performance error bounds and demonstrate that the alignment improves error control, facilitating gradient-based optimization. Empirical results support the theoretical findings, showcasing the effectiveness of the proposed approach in managing error propagation in MDPs.
Methodology
The authors analyze the contraction properties of the soft Bellman operator under weighted Lp-norms and develop a soft Bellman residual minimization framework. They derive performance error bounds and propose a gradient-based optimization algorithm that can handle large values of p, ensuring stable optimization.
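A minimal tabular sketch of the ingredients above: a soft (log-sum-exp) Bellman backup and gradient steps on a weighted Lp residual. The semi-gradient simplification (target held fixed each step), the random MDP, and all constants are assumptions for illustration; the paper analyzes the full objective:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, tau, p = 4, 2, 0.9, 0.5, 3   # p: Lp exponent of the residual

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] -> dist over s'
R = rng.random((nS, nA))
w = np.full((nS, nA), 1.0 / (nS * nA))          # weighting distribution

def soft_backup(Q):
    """Soft Bellman operator: log-sum-exp over actions at temperature tau."""
    v = tau * np.log(np.exp(Q / tau).sum(axis=1))   # soft value per state
    return R + gamma * P @ v

def brm_step(Q, lr=1.0):
    """One semi-gradient step on the weighted Lp Bellman residual
    J(Q) = sum_{s,a} w(s,a) |Q - TQ|^p, with the target TQ held fixed
    (a common simplification of the full BRM gradient)."""
    delta = Q - soft_backup(Q)
    grad = w * p * np.abs(delta) ** (p - 1) * np.sign(delta)
    return Q - lr * grad

Q = np.zeros((nS, nA))
for _ in range(2000):
    Q = brm_step(Q)
residual = np.max(np.abs(Q - soft_backup(Q)))
print(residual)   # residual shrinks toward 0 as Q approaches the soft fixed point
```

Larger p weights big residual entries more heavily, mimicking the sup-norm geometry in which the Bellman operator contracts; the trade-off, visible here, is that gradients vanish faster near the fixed point.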
Results
The study shows that as the parameter p increases, the optimization objective aligns progressively with the Lβ-norm Bellman error. The proposed method demonstrates improved error control and stability in optimization, validated through empirical experiments.
Implications
This work has significant implications for reinforcement learning, particularly in environments with large state or action spaces where function approximation is necessary. The findings can enhance the stability and performance of algorithms used in various applications of MDPs.
Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem
Reinforcement Learning
Optimization
- Introduces a unified DRL framework for solving HFVRP and its variants.
- Develops the Vehicle-as-Prompt mechanism for efficient decision-making.
- Achieves superior performance compared to state-of-the-art DRL methods and traditional heuristics.
- Demonstrates strong zero-shot generalization across diverse problem settings.
Summary
This paper addresses the Heterogeneous Fleet Vehicle Routing Problem (HFVRP), which presents unique challenges due to varying fixed costs, variable travel costs, and capacity constraints across a heterogeneous fleet. Traditional methods often struggle with the complexity and computational demands of HFVRP, particularly when real-world logistics constraints are considered. The authors propose a novel Deep Reinforcement Learning (DRL) framework, termed Vehicle-as-Prompt (VaP), which formulates the routing problem as a single-stage autoregressive decision process. This framework integrates a cross-semantic encoder and a multi-view decoder to effectively manage the complexities of vehicle heterogeneity and customer attributes. The proposed VaP-CSMV framework demonstrates significant improvements over existing DRL methods and traditional heuristic solvers, achieving competitive solution quality while drastically reducing inference time. Furthermore, it showcases strong zero-shot generalization capabilities across various problem scales and previously unseen variants, indicating its robustness and adaptability. Ablation studies confirm the importance of each component within the framework, highlighting its potential for practical applications in logistics and transportation optimization.
Methodology
The authors propose a unified DRL framework that models HFVRP as a single-stage autoregressive decision process. The framework includes a cross-semantic encoder and a multi-view decoder, utilizing a dual-attention mechanism to capture relationships between vehicle attributes and customer node topology. This approach allows for joint optimization of vehicle dispatching and routing under complex constraints.
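The single-stage autoregressive decision process can be illustrated with a greedy hand-written policy in place of the learned one. The nearest-feasible-customer rule, the toy instance, and the `greedy_routes` helper are assumptions for illustration, not the VaP model itself:

```python
import numpy as np

def greedy_routes(coords, demands, capacities, fixed, var_cost):
    """Greedy autoregressive construction, a stand-in for the learned VaP
    policy: at each step the active vehicle visits the nearest feasible
    customer; when none fits its remaining capacity, the route closes and
    the next vehicle is dispatched, incurring its fixed cost."""
    depot, unvisited = coords[0], set(range(1, len(coords)))
    routes, total = [], 0.0
    for v, cap in enumerate(capacities):
        pos, load, route = depot, cap, [0]
        total += fixed[v]                     # dispatch cost of vehicle v
        while True:
            feas = [c for c in sorted(unvisited) if demands[c] <= load]
            if not feas:
                break
            c = min(feas, key=lambda j: np.linalg.norm(coords[j] - pos))
            total += var_cost[v] * np.linalg.norm(coords[c] - pos)
            pos, load = coords[c], load - demands[c]
            route.append(c)
            unvisited.remove(c)
        total += var_cost[v] * np.linalg.norm(depot - pos)  # return leg
        routes.append(route + [0])
        if not unvisited:
            break
    return routes, total

coords = np.array([[0, 0], [0, 1], [1.5, 0], [3, 0]], dtype=float)
demands = [0, 1, 1, 2]                        # index 0 is the depot
routes, cost = greedy_routes(coords, demands, capacities=[2, 2],
                             fixed=[1.0, 1.5], var_cost=[1.0, 0.8])
print(routes)   # → [[0, 1, 2, 0], [0, 3, 0]]
```

In the paper's framework, the greedy `min` choice is replaced by a neural policy whose decoder is conditioned on the active vehicle's attributes (the "prompt"), so vehicle dispatching and routing are optimized jointly rather than by a fixed heuristic.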
Results
VaP-CSMV significantly outperforms existing DRL-based neural solvers and achieves competitive solution quality compared to traditional heuristic solvers. It reduces inference time to seconds and exhibits strong zero-shot generalization capabilities on large-scale and previously unseen problem variants. Ablation studies confirm the effectiveness of each component in the framework.
Implications
The proposed framework has significant implications for logistics and transportation industries, enabling more efficient routing and dispatching in heterogeneous fleets. Its ability to generalize across various problem settings suggests potential applications in real-time logistics optimization and adaptive routing solutions.
ReLU Networks for Exact Generation of Similar Graphs
Generative Models
Graph Learning
Theory
- Introduces ReLU networks for exact graph generation within specified edit distances.
- Eliminates reliance on training data, ensuring validity of generated graphs.
- Demonstrates scalability with successful generation of graphs with up to 1400 vertices.
- Outperforms existing models like GraphRNN and GraphGDP in meeting edit distance constraints.
Summary
This paper addresses the challenge of generating graphs that maintain a specified graph edit distance from a source graph, which is crucial for applications in cheminformatics, network anomaly synthesis, and structured data augmentation. The authors propose a theoretical framework for ReLU neural networks that can deterministically generate graphs within a bounded edit distance, thus eliminating the reliance on training data that characterizes existing generative models. They demonstrate the existence of constant depth and O(n^2d) size ReLU networks capable of generating valid graphs with n vertices and an edit distance d. Experimental results show that these networks can generate valid graphs for instances with up to 1400 vertices and edit distance bounds up to 140, outperforming baseline models like GraphRNN and GraphGDP, which fail to meet the desired constraints. This work provides a new paradigm for graph generation, moving from probabilistic sampling to exact synthesis under similarity constraints, and offers a theoretical foundation for constructing compact generative models with guaranteed validity.
Methodology
The authors theoretically characterize ReLU neural networks that can generate graphs within a specified graph edit distance. They establish the existence of networks with constant depth and polynomial size that deterministically produce valid graphs, thus ensuring compliance with the edit distance constraints without the need for training data.
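One way to see how a constant-depth ReLU network can edit edges deterministically is the XOR gadget below, which flips adjacency bits using only additions and ReLUs. This is an illustrative gadget under the identity a XOR m = a + m - 2·relu(a + m - 1) for bits, not the paper's full construction:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def flip_edges(A, M):
    """Constant-depth ReLU computation that flips the adjacency entries
    selected by the binary mask M: for bits a, m,
        a XOR m = a + m - 2*relu(a + m - 1).
    Flipping exactly d entries of the upper triangle (mirrored below)
    yields a graph at edit distance d from A."""
    return A + M - 2.0 * relu(A + M - 1.0)

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
M = np.zeros_like(A)
M[0, 2] = M[2, 0] = 1.0       # flip one undirected edge -> edit distance 1
B = flip_edges(A, M)
print(int(np.abs(B - A).sum()) // 2)  # → 1
```

Because the map is exact on {0, 1} inputs, every output is a valid adjacency matrix at the prescribed distance, which is the validity guarantee that sampling-based generators like GraphRNN cannot provide.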
Results
The proposed ReLU networks successfully generated valid graphs for instances with up to 1400 vertices and edit distance bounds of up to 140. In contrast, baseline models such as GraphRNN and GraphGDP were unable to generate graphs that satisfied the desired edit distance constraints.
Implications
This research has significant implications for applications requiring exact graph generation, such as molecule design and network analysis, where maintaining structural validity is critical. The proposed method offers a new approach to generative modeling that guarantees the validity of outputs, potentially transforming practices in fields reliant on graph structures.
Graph Topology Information Enhanced Heterogeneous Graph Representation Learning
Graph Learning
- ToGRL improves the quality of graph structures for heterogeneous graphs by utilizing task-relevant topology information.
- The two-stage GSL approach separates adjacency matrix optimization from node representation learning, reducing memory usage.
- ToGRL incorporates prompt tuning to enhance the adaptability of learned representations for downstream tasks.
- Extensive experiments show ToGRL outperforms existing methods on five real-world datasets.
Summary
This paper addresses the challenges faced by Graph Neural Networks (GNNs) when applied to heterogeneous graphs, particularly the impact of noisy input graph structures on downstream tasks. The authors propose a novel framework called ToGRL (Graph Topology Information Enhanced Heterogeneous Graph Representation Learning) that enhances the quality of graph structures and representations by incorporating task-relevant latent topology information. The framework introduces a two-stage Graph Structure Learning (GSL) module that first extracts topology information from raw graphs and projects it into topology embeddings. These embeddings are then used to construct a new graph with smoother signals, which helps in reducing memory consumption issues prevalent in existing GSL models. Following this, a representation learning module learns embeddings from the newly constructed graph for various downstream tasks. The authors also implement prompt tuning to leverage the knowledge embedded in the learned representations, enhancing adaptability to different tasks. Extensive experiments on five real-world datasets demonstrate that ToGRL significantly outperforms state-of-the-art methods while also addressing memory consumption challenges.
Methodology
The methodology involves a two-stage approach where the first stage focuses on extracting task-related topology information from the raw graph structure to create topology embeddings. These embeddings are then used to construct a new graph with improved signal smoothness. The second stage employs a representation learning module that learns embeddings from this new graph. Additionally, prompt tuning is applied to enhance the adaptability of the learned representations.
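The first stage's graph construction can be sketched as a kNN graph over topology embeddings. The cosine-similarity kNN rule and the toy embeddings are illustrative assumptions, not ToGRL's exact construction:

```python
import numpy as np

def knn_graph(Z, k=2):
    """Stage-one sketch: given topology embeddings Z (one row per node),
    connect each node to its k most similar nodes by cosine similarity,
    producing a denoised adjacency for the stage-two representation
    learner. Illustrative only."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T                          # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)           # no self-loops
    A = np.zeros_like(S)
    for i in range(len(S)):
        for j in np.argsort(S[i])[-k:]:    # top-k most similar nodes
            A[i, j] = A[j, i] = 1.0        # symmetrize
    return A

# Two tight clusters in embedding space -> two connected pairs.
Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(knn_graph(Z, k=1))
```

Because the adjacency is built once from compact embeddings rather than optimized as a dense n x n variable, this style of two-stage construction is what lets the framework sidestep the memory costs of end-to-end GSL.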
Results
The results indicate that ToGRL outperforms state-of-the-art heterogeneous graph representation learning methods across five real-world datasets, demonstrating significant improvements in representation quality and memory efficiency.
Implications
The proposed framework has potential applications in various domains that utilize heterogeneous graphs, such as social networks, recommendation systems, and knowledge graphs. By improving the quality of graph representations, ToGRL can enhance the performance of downstream tasks like node classification and link prediction.