AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
58 papers today · updated every 8 hours · 7 days of history
Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization
Reinforcement Learning
Large Language Models
Optimization
- MOLREACT bridges the gap between property optimization and synthetic feasibility in drug discovery.
- The framework uses a tool-augmented LLM to propose feasible chemical transformations dynamically.
- A dedicated policy model optimizes multi-step reaction trajectories to maximize long-term rewards.
- The SMILES-based caching mechanism significantly reduces optimization time.
Summary
The paper presents MOLREACT, a novel framework for lead optimization in drug discovery that integrates reinforcement learning (RL) with large language models (LLMs) to ensure synthesizability of proposed molecular modifications. Traditional methods often fail to balance property improvement with synthetic feasibility, leading to chemically invalid structures. MOLREACT formulates lead optimization as a Markov Decision Process (MDP) over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent dynamically identifies reactive sites and proposes a targeted set of transformations, while a dedicated policy model trained via Group Relative Policy Optimization (GRPO) selects actions to maximize long-term rewards. The framework incorporates a SMILES-based caching mechanism to reduce computational costs during exploration. Evaluated on 13 property optimization tasks and one structure-based docking task, MOLREACT achieved an average Top-10 score of 0.563, outperforming the best synthesizable baseline by 10.4% and demonstrating superior sample efficiency across multiple tasks. The results indicate that the integration of tool-augmented reaction proposals and trajectory-level policy optimization significantly enhances the optimization process, producing chemically valid and property-improved molecules.
Methodology
MOLREACT formulates lead optimization as a Markov Decision Process, utilizing a tool-augmented LLM to identify feasible reactions based on validated templates. A dedicated policy model is trained using Group Relative Policy Optimization to select actions that maximize long-term rewards, while a caching mechanism reduces computational costs during exploration.
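The caching idea above can be sketched as simple memoization keyed by a molecule's SMILES string. This is an illustrative reconstruction, not the paper's implementation: the class name, the scoring function, and the assumption that a plain string key suffices (real pipelines would canonicalize SMILES first, e.g. with RDKit) are all mine.

```python
# Sketch of a SMILES-keyed reward cache (hypothetical names and scorer).
# Molecules revisited along different reaction trajectories reuse the cached
# score instead of paying for a fresh property evaluation.
class RewardCache:
    def __init__(self, scorer):
        self.scorer = scorer          # expensive property oracle
        self.cache = {}               # SMILES string -> cached score
        self.hits = 0

    def score(self, smiles):
        if smiles in self.cache:
            self.hits += 1
            return self.cache[smiles]
        value = self.scorer(smiles)   # only called on cache misses
        self.cache[smiles] = value
        return value

calls = []
def fake_scorer(s):
    calls.append(s)                   # record each expensive evaluation
    return len(s) / 10.0              # placeholder "property" score

cache = RewardCache(fake_scorer)
cache.score("CCO")
cache.score("CCO")                    # second lookup is a cache hit
cache.score("c1ccccc1")
```

During RL exploration the same intermediates recur constantly, which is why even this trivial lookup can cut wall-clock optimization time substantially.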
Results
MOLREACT achieved an average Top-10 score of 0.563 across 13 property optimization tasks, outperforming the strongest synthesizable baseline by 10.4% in relative improvement. It demonstrated the best sample efficiency on 10 out of 14 tasks, confirming the effectiveness of its approach.
Implications
The findings suggest that integrating LLMs with reinforcement learning can significantly enhance the efficiency and effectiveness of molecular optimization in drug discovery, potentially leading to faster development of viable therapeutic candidates.
Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences
Optimization
- Introduces a generalized probabilistic reparameterization method for mixed-variable optimization.
- Demonstrates the effectiveness of Bayesian optimization in handling non-equidistant discrete variables.
- Conducts extensive benchmarks to optimize kernel formulations and validate the proposed method.
- Shows that the approach can efficiently optimize complex objective landscapes in real-world scenarios.
Summary
This paper addresses the challenge of optimizing expensive black-box objectives in mixed-variable search spaces, which is prevalent in the natural sciences. The authors propose a generalized probabilistic reparameterization (PR) approach that extends existing methods to handle non-equidistant discrete variables, enabling gradient-based optimization in fully mixed-variable settings using Gaussian process (GP) surrogates. The study includes systematic benchmarks on both synthetic and experimental objectives to optimize kernel formulations and validate the robustness of the generalized PR method. The results demonstrate that when combined with a modified Bayesian optimization (BO) workflow, the proposed approach can efficiently optimize highly discontinuous and discretized objective landscapes. This work establishes a practical BO framework tailored for fully mixed optimization problems in scientific research, particularly suitable for autonomous laboratory environments where noise, discretization, and limited data are common.
Methodology
The authors extend the probabilistic reparameterization (PR) framework to support mixed-variable optimization, incorporating Gaussian process (GP) surrogates. They benchmark various acquisition functions, including Expected Improvement (EI) and Upper/Lower Confidence Bound (UCB/LCB), to evaluate their performance across synthetic and real-world optimization tasks.
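The core trick of probabilistic reparameterization can be shown on a toy problem: replace one discrete variable (with non-equidistant levels) by a categorical distribution over its levels, so the expected objective becomes a smooth function of continuous logits that gradient methods can climb. The levels, objective, and learning rate below are invented for illustration, and a GP surrogate is replaced by a known toy function.

```python
import numpy as np

# Toy probabilistic reparameterization: one discrete variable with
# non-equidistant levels, relaxed to a categorical distribution.
levels = np.array([0.1, 0.5, 4.0])          # non-equidistant discrete values

def objective(x):
    return -(x - 0.5) ** 2                  # toy objective, maximized at 0.5

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def expected_objective(theta):
    p = softmax(theta)                      # categorical probabilities
    return float(np.sum(p * objective(levels)))

theta = np.zeros(3)
# Finite-difference gradient ascent on the relaxed (smooth) objective.
for _ in range(200):
    grad = np.zeros(3)
    for i in range(3):
        d = np.zeros(3); d[i] = 1e-5
        grad[i] = (expected_objective(theta + d)
                   - expected_objective(theta - d)) / 2e-5
    theta += 5.0 * grad

best = levels[int(np.argmax(softmax(theta)))]   # most probable level
```

The probability mass concentrates on the best level (0.5 here), recovering a discrete decision from a purely continuous optimization loop — the property that lets PR handle mixed-variable spaces with gradient-based acquisition optimization.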
Results
The proposed generalized PR method successfully handles mixed-variable optimization, showing improved performance in optimizing discontinuous and discretized objective landscapes. The benchmarks reveal that the choice of acquisition function and kernel formulation significantly impacts the optimization process, with the method demonstrating robustness across diverse scenarios.
Implications
The findings have significant implications for optimizing experimental and simulation-based tasks in the natural sciences, particularly in autonomous laboratory settings. The developed framework can enhance the efficiency of material discovery and other scientific optimizations by reducing the number of costly evaluations needed.
Joint Task Offloading, Inference Optimization and UAV Trajectory Planning for Generative AI Empowered Intelligent Transportation Digital Twin
Reinforcement Learning
Generative Models
Optimization
- Integration of GAI with ITDT enhances data processing and fidelity.
- Joint optimization of task offloading, inference, and UAV trajectories is crucial for system performance.
- The SU-HATD3 algorithm effectively addresses the challenges of dynamic network environments.
- Numerical results indicate significant improvements in system utility and convergence compared to baseline algorithms.
Summary
This paper presents a novel framework for implementing an Intelligent Transportation Digital Twin (ITDT) that leverages Unmanned Aerial Vehicles (UAVs) to process sensing data from roadside sensors using Generative Artificial Intelligence (GAI) technologies, specifically diffusion models. The authors address the challenges of task offloading, inference optimization, and UAV trajectory planning as a joint optimization problem aimed at maximizing system utility while balancing fidelity and delay in data processing. The proposed solution is modeled as a heterogeneous-agent Markov decision process and utilizes a new algorithm called Sequential Update-based Heterogeneous-Agent Twin Delayed Deep Deterministic Policy Gradient (SU-HATD3). This algorithm enables efficient learning of near-optimal solutions in dynamic environments. Numerical experiments demonstrate that the SU-HATD3 algorithm significantly outperforms several baseline methods in terms of system utility and convergence rate, showcasing its effectiveness in real-time applications for intelligent transportation systems.
Methodology
The authors model the joint optimization problem as a heterogeneous-agent Markov decision process and propose the SU-HATD3 algorithm, which employs deep reinforcement learning techniques to learn optimal policies for task offloading, inference optimization, and UAV trajectory planning under dynamic conditions.
Results
The proposed SU-HATD3 algorithm demonstrated superior performance in improving system utility and convergence rates compared to several baseline algorithms, indicating its effectiveness in managing the complex dynamics of intelligent transportation systems.
Implications
The findings suggest that the GAI-empowered ITDT framework can significantly enhance the efficiency and accuracy of intelligent transportation systems, paving the way for more responsive and adaptive urban mobility solutions. This approach could be applied in various real-time data processing scenarios, improving decision-making and operational efficiency in smart cities.
Persistence-Augmented Neural Networks
Computer Vision
Graph Learning
Interpretability
- Introduces a persistence-based data augmentation framework for deep learning.
- Utilizes the Morse–Smale complex to retain local topological information.
- Demonstrates efficiency with a computational complexity of O(n log n).
- Achieves superior performance on histopathology image classification and 3D porous material regression compared to existing methods.
Summary
This paper addresses the challenge of integrating topological features from Topological Data Analysis (TDA) into deep learning frameworks, particularly focusing on preserving local geometric structures. The authors propose a persistence-based data augmentation framework that utilizes the Morse–Smale complex to encode local gradient flow regions and their hierarchical evolution. This approach is compatible with both Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), allowing for the retention of spatially localized topological information across multiple scales. The proposed augmentation method is computationally efficient, with a complexity of O(n log n), making it suitable for large datasets. The authors evaluate their method on histopathology image classification and 3D porous material regression tasks, demonstrating consistent performance improvements over baseline methods and global TDA descriptors. Additionally, they find that pruning the base level of the hierarchy can reduce memory usage while maintaining competitive performance. The results underscore the potential of local, structured topological augmentation for enhancing scalability and interpretability in machine learning across various data modalities.
Methodology
The authors compute the Morse–Smale complex for grayscale images or graphs with scalar functions, constructing a dual representation that encodes persistence information. This results in a hierarchical topological simplification, allowing for the integration of topological features into CNNs and GNNs. The framework is designed to be general-purpose, applicable to both image and graph data.
Results
The proposed method consistently outperforms baseline models and global TDA descriptors in both histopathology image classification and 3D porous material regression tasks. The ability to prune the base level of the hierarchy effectively reduces memory usage while maintaining competitive performance levels.
Implications
The findings suggest that incorporating local topological information can significantly enhance the performance of deep learning models across various domains, paving the way for more interpretable and scalable machine learning applications.
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification
Time Series
- Introduction of the ADAPT framework for efficient many-to-one pre-training in time-series classification.
- Demonstrated ability to train on 162 diverse time-series datasets simultaneously.
- Achieved state-of-the-art performance on classification benchmarks.
- Framework designed to be model-agnostic, allowing for future improvements in model architectures.
Summary
This paper introduces a novel pre-training paradigm called ADAPT for time-series classification, addressing the limitations of existing self-supervised training methods that struggle with generalization across multiple datasets. The authors highlight the challenges faced in many-to-one pre-training scenarios, where models often fail to adapt when exposed to diverse datasets. ADAPT aligns the physical properties of time-series data, enabling mixed-batch pre-training despite variations in input sizes and channel dimensions. The framework was tested on 162 time-series classification datasets, achieving state-of-the-art performance benchmarks. This approach not only facilitates the training of a single model across various time-series types but also lays the groundwork for developing generalist foundation models in the time-series domain.
Methodology
The authors developed the ADAPT framework, which employs average adaptive pooling during data loading to facilitate mixed-batch training. This model-agnostic approach allows the framework to handle varying input dimensions and modalities, leveraging modern parallel computing strategies to train large models efficiently.
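The pooling step above can be illustrated with a minimal adaptive average pool: every series, whatever its length, is binned down to a fixed number of output steps so that differently sized series can share one mini-batch. The binning below follows the common PyTorch-style scheme; ADAPT's exact data-loading details are not reproduced here.

```python
import numpy as np

# Adaptive average pooling to a fixed output length (illustrative sketch).
def adaptive_avg_pool_1d(x, out_len):
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Bin i covers [floor(i*n/out_len), ceil((i+1)*n/out_len))
    starts = (np.arange(out_len) * n) // out_len
    ends = -(-(np.arange(1, out_len + 1) * n) // out_len)   # ceil division
    return np.array([x[s:e].mean() for s, e in zip(starts, ends)])

short = adaptive_avg_pool_1d([1.0, 2.0, 3.0, 4.0], 2)   # length 4 -> 2
long = adaptive_avg_pool_1d(np.arange(100.0), 2)        # length 100 -> 2
batch = np.stack([short, long])                         # now stackable
```

Because both outputs have the same shape, series from heterogeneous datasets can be stacked into one batch, which is what enables the many-to-one mixed-batch pre-training described above.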
Results
The ADAPT framework set new state-of-the-art performance on time-series classification benchmarks, demonstrating significant improvements in model performance when trained on a wide range of datasets simultaneously. The results indicate that the proposed method effectively overcomes the limitations of previous pre-training strategies in the time-series domain.
Implications
The successful implementation of the ADAPT framework has the potential to revolutionize time-series analysis across various applications, including medical, financial, and environmental domains. It paves the way for the development of generalist foundation models that can adapt to diverse tasks and datasets, enhancing the applicability of machine learning in real-world scenarios.
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
Reinforcement Learning
- Prediction Arena benchmarks AI models in live prediction markets with real capital.
- Cohort 1 models showed significant performance differences between Kalshi and Polymarket.
- The model grok-4-20-checkpoint achieved the highest settlement win rate across platforms.
- Initial prediction accuracy is crucial for model success, while research volume does not correlate with outcomes.
Summary
The paper introduces Prediction Arena, a novel benchmark designed to evaluate AI models' predictive accuracy and decision-making capabilities by allowing them to autonomously trade on real prediction markets using actual capital. This benchmark addresses the limitations of synthetic benchmarks by testing models in live environments, providing objective ground truth that cannot be manipulated. The evaluation spans 57 days, tracking two cohorts of models: six frontier models engaged in live trading and four next-generation models in paper trading. Results indicate a significant performance disparity between platforms, with Cohort 1 models experiencing returns ranging from -16.0% to -30.8% on Kalshi, while performing better on Polymarket with an average loss of -1.1%. Notably, the model grok-4-20-checkpoint achieved a 71.4% settlement win rate on Polymarket, the highest across all platforms. The findings highlight that initial prediction accuracy and the ability to leverage correct predictions are critical for success, while research volume showed no correlation with outcomes. Additionally, the study provides insights into computational efficiency and trading behavior, revealing that the most capital-efficient models were not necessarily the most computationally intensive. Overall, Prediction Arena offers a comprehensive framework for assessing AI models in real-world financial contexts.
Methodology
The study involved deploying AI models as autonomous traders in live prediction markets (Kalshi and Polymarket) over a 57-day period. Two cohorts were evaluated: Cohort 1 with six frontier models in live trading and Cohort 2 with four next-generation models in paper trading. Performance metrics included account value, profit and loss (PnL), and win rates, alongside analyses of computational efficiency and trading behavior.
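The two headline metrics per model, settlement win rate and percentage return, reduce to simple arithmetic over settled trades; the trade data below is made up purely to show the computation.

```python
# Illustrative computation of settlement win rate and percentage return
# (hypothetical PnL values; not data from the benchmark).
def settlement_win_rate(settled_pnls):
    wins = sum(1 for pnl in settled_pnls if pnl > 0)
    return wins / len(settled_pnls)

def pct_return(start_capital, end_capital):
    return 100.0 * (end_capital - start_capital) / start_capital

trades = [12.0, -5.0, 3.0, -1.0, 8.0]        # hypothetical settled PnLs
rate = settlement_win_rate(trades)            # 3 winners out of 5
ret = pct_return(1000.0, 890.0)               # an 11% drawdown
```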
Results
Cohort 1 models experienced returns on Kalshi ranging from -16.0% to -30.8%, while on Polymarket, they averaged -1.1%. The model grok-4-20-checkpoint had a 71.4% win rate on Polymarket, and gemini-3.1-pro-preview achieved a +6.02% return on Polymarket in just 3 days despite executing no trades on Kalshi.
Implications
Prediction Arena provides a robust framework for evaluating AI models in real-world scenarios, potentially influencing the development of more effective AI trading strategies and enhancing the understanding of model behavior under financial pressures. It also highlights the importance of platform design in model performance.
Fraud Detection System for Banking Transactions
Theory
Optimization
Efficient ML
- The framework utilizes the PaySim synthetic dataset to model fraudulent transactions.
- Employs CRISP-DM methodology for structured analysis and model development.
- Implements SMOTE to address class imbalance in the dataset.
- Compares multiple machine learning models, highlighting the effectiveness of ensemble methods.
Summary
This paper addresses the growing challenge of fraud detection in digital banking transactions, exacerbated by the increasing complexity and volume of online financial activities. The authors propose a machine learning-based framework that utilizes the PaySim synthetic financial transaction dataset, following the CRISP-DM methodology. The study includes hypothesis-driven exploratory analysis, feature refinement, and a comparative assessment of various classification models, including Logistic Regression, Decision Tree, Random Forest, and XGBoost. To combat class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is employed, and model performance is optimized through hyperparameter tuning using GridSearchCV. The results indicate that the proposed framework significantly enhances fraud detection capabilities in FinTech environments, providing a scalable solution that addresses the challenges of evolving fraud strategies and imbalanced data. The study emphasizes the importance of behavioral modeling and the integration of advanced machine learning techniques to improve detection accuracy and reduce false-positive rates.
Methodology
The study follows the CRISP-DM framework, which includes stages such as Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. It involves exploratory data analysis, feature refinement, and the application of SMOTE for class imbalance, followed by the evaluation of various classification models and hyperparameter tuning.
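SMOTE's core move is interpolating between a minority-class sample and one of its nearest neighbours to synthesize new minority points. The sketch below is a toy reconstruction of that idea, not the library implementation the authors would use (imbalanced-learn's `SMOTE`); the dataset and parameters are invented.

```python
import random

# Minimal SMOTE-style oversampling sketch: new minority points are drawn
# on the line segment between a sample and one of its k nearest neighbours.
def smote_like(minority, n_new, k=2, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((u - v) ** 2 for u, v in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        t = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(tuple(u + t * (v - u) for u, v in zip(a, b)))
    return synthetic

fraud = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]   # toy minority-class points
new_points = smote_like(fraud, n_new=4)
```

Every synthetic point lies between two real minority samples, which is why SMOTE tends to densify the minority region rather than duplicating records outright.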
Results
The proposed fraud detection framework demonstrated improved performance metrics, particularly in recall and precision, when compared to baseline models. The use of ensemble methods, especially XGBoost, yielded the best results in detecting fraudulent transactions, effectively addressing the challenges posed by class imbalance.
Implications
The findings suggest that machine learning-based approaches can significantly enhance fraud detection in financial transactions, providing a scalable solution for FinTech companies. The research highlights the need for continuous adaptation of fraud detection systems to keep pace with evolving fraudulent strategies.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
NLP
Large Language Models
Reinforcement Learning
- Introduction of Guardian-as-an-Advisor (GaaA) as a soft-gating alternative to traditional hard-gated safety checkers.
- Development of GuardSet, a large-scale dataset with over 208,000 examples for training guardian models.
- GuardAdvisor model achieves competitive performance while reducing unnecessary refusals and maintaining low latency.
- The framework enhances the utility of LLMs by providing interpretable risk assessments without blocking generation.
Summary
This paper introduces the Guardian-as-an-Advisor (GaaA) framework, which aims to enhance the trustworthiness of large language models (LLMs) by addressing the limitations of traditional hard-gated safety checkers. The authors argue that existing models often over-refuse queries and misalign with vendor specifications, leading to systems that are safer in theory but less useful in practice. GaaA employs a soft-gating mechanism where a guardian model predicts a binary risk label and provides a concise explanation, which is then prepended to the original user query for re-inference. This approach maintains the base model's operational integrity while improving its utility. To support this framework, the authors developed GuardSet, a comprehensive dataset containing over 208,000 multi-domain examples, designed to train and evaluate guardian models with a focus on robustness and honesty. The GuardAdvisor model, trained through supervised fine-tuning followed by reinforcement learning, demonstrates competitive detection accuracy and improved response quality when augmenting inputs. The study also highlights that the advisor's inference incurs minimal latency, making it a practical solution for real-world applications.
Methodology
The authors constructed a three-stage pipeline for creating the GuardSet dataset, which includes collection, processing (label mapping and explanation synthesis), and validation (using LLMs for filtering). The GuardAdvisor model was trained using a two-stage approach: supervised fine-tuning for structured outputs followed by reinforcement learning to ensure consistency between risk labels and explanations.
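The soft-gating re-inference step itself is mechanically simple: the guardian's risk label and short explanation are prepended to the original query before the base model answers, rather than blocking the query outright. The advisory wording below is invented for illustration; the paper's prompt format is not reproduced here.

```python
# Sketch of GaaA-style soft gating: prepend the guardian's advisory to the
# user query instead of refusing it (advisory format is hypothetical).
def augment_query(query, risk_label, explanation):
    advisory = f"[guardian] risk={risk_label}; note: {explanation}"
    return advisory + "\n" + query

prompt = augment_query(
    "How do I store user passwords?",
    risk_label="safe",
    explanation="benign security-practices question",
)
```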
Results
GuardAdvisor demonstrated detection performance comparable to proprietary models while significantly reducing unnecessary refusals. The model also improved output quality related to robustness and honesty scenarios, with an added latency of only 2-10% under realistic harmful-input rates.
Implications
The GaaA framework has the potential to enhance the deployment of LLMs in various applications, such as search, coding, and healthcare, by providing a more trustworthy interaction model. It allows for safer and more useful outputs, addressing key concerns about the reliability of AI systems.
Zero-shot Multivariate Time Series Forecasting Using Tabular Prior Fitted Networks
Time Series
- Introduces a framework for zero-shot multivariate time series forecasting using tabular foundation models.
- Addresses the limitation of treating MTS as independent univariate problems by modeling inter-channel dependencies.
- Reformulates MTS forecasting as scalar regression problems, enabling the use of existing tabular models without retraining.
- Empirical results indicate improved performance over traditional methods and competitive results against specialized time series models.
Summary
This paper presents a novel framework for zero-shot multivariate time series forecasting using tabular foundation models, specifically Prior-data Fitted Networks (TabPFN). The authors address the common limitation in existing approaches that treat multivariate time series (MTS) forecasting as a series of independent univariate problems, thereby neglecting inter-channel dependencies. By reformulating the MTS forecasting task into a series of scalar regression problems, the proposed method allows for the effective modeling of both temporal and spatial dependencies without requiring retraining or architectural modifications. The authors demonstrate that their approach can leverage the capabilities of tabular foundation models to produce zero-shot predictions, thereby enhancing forecasting performance in scenarios with limited training data. Empirical evaluations show that their method outperforms traditional univariate decomposition approaches and competes favorably against specialized architectures designed for time series forecasting.
Methodology
The authors reformulate multivariate time series forecasting as a series of scalar regression problems by transforming the multivariate structure into a 'rolled out' tabular format. This involves flattening the multivariate vector into multiple rows, where each row contains the timestamp, covariate index, and value, with the regression target being the subsequent value in the sequence. This approach allows for the modeling of intra-sample dependencies while utilizing existing tabular foundation models for zero-shot predictions.
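The 'rolled out' encoding can be sketched directly. Below, each (timestamp, channel) observation becomes one tabular row whose features are the timestamp, the covariate index, and the value, with that channel's next value as the regression target. This is a simplified per-channel next-value variant of the transformation described above; the paper's exact feature layout may differ.

```python
# Sketch of rolling a multivariate series out into tabular regression rows.
def roll_out(series):
    """series: list of per-timestep tuples, one value per channel."""
    rows = []
    n_channels = len(series[0])
    for t in range(len(series) - 1):
        for c in range(n_channels):
            features = (t, c, series[t][c])   # timestamp, channel, value
            target = series[t + 1][c]         # next value of that channel
            rows.append((features, target))
    return rows

mts = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]   # 3 timesteps, 2 channels
table = roll_out(mts)
# table[0] is ((0, 0, 1.0), 2.0): channel 0 at t=0, target is its t=1 value
```

Because the result is an ordinary feature/target table, an off-the-shelf tabular model such as TabPFN can consume it zero-shot, with no retraining or architectural change.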
Results
The proposed method was benchmarked against the univariate decomposition baseline established by TabPFN-TS and other specialized architectures. The results demonstrated that the new framework not only outperformed the traditional univariate decomposition methods but also showed competitive performance against dedicated time series forecasting models, indicating its effectiveness in capturing the complexities of multivariate time series data.
Implications
This work has significant implications for various domains where multivariate time series forecasting is critical, such as finance, meteorology, and engineering. By enabling effective forecasting with limited data and without the need for extensive model retraining, this approach can facilitate more efficient and accurate decision-making processes in real-world applications.
DMax: Aggressive Parallel Decoding for dLLMs
NLP
Large Language Models
Generative Models
- DMax mitigates error accumulation in parallel decoding of dLLMs.
- Introduces On-Policy Uniform Training (OPUT) for effective self-correction.
- Proposes Soft Parallel Decoding (SPD) for robust intermediate state representation.
- Demonstrates significant improvements in TPF on multiple benchmarks.
Summary
DMax introduces a novel approach to enhance the efficiency of diffusion language models (dLLMs) by addressing the issue of error accumulation during parallel decoding. Traditional dLLMs utilize a binary mask-to-token decoding process, which can lead to cascading errors when predictions are incorrect. DMax reformulates this process into a self-revising mechanism that transitions from mask embeddings to token embeddings. The core innovation is the On-Policy Uniform Training (OPUT), which allows the model to learn from its own predictions, effectively bridging the gap between training and inference. Additionally, the Soft Parallel Decoding (SPD) method represents intermediate decoding states as interpolations between predicted token embeddings and mask embeddings, enabling iterative self-correction. Experimental results demonstrate that DMax significantly improves tokens per forward (TPF) on benchmarks like GSM8K and MBPP while maintaining accuracy, establishing a new baseline for future research in parallel decoding for dLLMs.
Methodology
The methodology involves two key components: On-Policy Uniform Training (OPUT), which allows the model to learn from its own predictions during training, and Soft Parallel Decoding (SPD), which enables the representation of intermediate decoding states as hybrid embeddings. This approach allows for iterative self-revision and reduces the impact of erroneous predictions during parallel decoding.
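The SPD interpolation can be illustrated with toy embeddings: each position's intermediate state is a confidence-weighted blend of the predicted token embedding and the mask embedding. All values below are made up; how the real model derives the per-position weight is not specified here.

```python
import numpy as np

# Toy SPD-style hybrid state: interpolate between predicted token embeddings
# and the mask embedding, weighted by per-position confidence.
def soft_state(token_emb, mask_emb, confidence):
    c = np.asarray(confidence)[:, None]          # one weight per position
    return c * token_emb + (1.0 - c) * mask_emb

mask_emb = np.zeros((3, 4))                      # shared mask embedding
token_emb = np.ones((3, 4))                      # predicted token embeddings
conf = [1.0, 0.5, 0.0]                           # confident / unsure / undecided
state = soft_state(token_emb, mask_emb, conf)
# Row 0 equals the token embedding, row 2 the mask embedding,
# and row 1 sits halfway between the two.
```

Keeping uncertain positions as soft mixtures rather than committed tokens is what lets later decoding steps revise earlier low-confidence predictions instead of cascading their errors.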
Results
DMax improves the tokens per forward (TPF) from 2.04 to 5.48 on the GSM8K benchmark while preserving high accuracy. On the MBPP benchmark, TPF increases from 2.71 to 5.86, maintaining comparable performance to the original model. The model achieves an average of 1,338 tokens per second (TPS) on two H200 GPUs at a batch size of 1.
Implications
The DMax framework has the potential to enhance the efficiency and accuracy of dLLMs in various applications, including text generation and code synthesis. Its self-corrective capabilities may lead to more robust models that can handle complex tasks with higher parallelism, paving the way for advancements in real-time applications.
GAN-based Domain Adaptation for Image-aware Layout Generation in Advertising Poster Design
Generative Models
Computer Vision
- Introduction of the CGL-Dataset for training image-aware layout generation models.
- Development of two GAN-based models: CGL-GAN and PDA-GAN, with the latter utilizing unsupervised domain adaptation.
- Proposal of three novel content-aware metrics for evaluating layout generation quality.
- PDA-GAN demonstrates significant improvements over CGL-GAN in generating aesthetically pleasing layouts.
Summary
This paper presents a novel approach to generating image-aware graphic layouts for advertising posters using Generative Adversarial Networks (GANs). The authors introduce the Content-aware Graphic Layout Dataset (CGL-Dataset), which consists of 60,548 paired inpainted posters and 121,000 clean product images. The challenge addressed is the domain gap between inpainted posters and clean images, which can hinder the quality of generated layouts. To bridge this gap, two GAN-based models are proposed: CGL-GAN, which applies Gaussian blur to inpainted regions, and PDA-GAN, which incorporates unsupervised domain adaptation with a pixel-level discriminator for improved layout generation. The paper also introduces three novel content-aware metrics for evaluating the relationship between graphic elements and image content. Experimental results show that PDA-GAN outperforms CGL-GAN, achieving state-of-the-art performance in generating high-quality layouts that are visually coherent with the product images.
Methodology
The authors developed two GAN-based models to generate advertising poster layouts. CGL-GAN employs Gaussian blurring to reduce the domain gap between inpainted posters and clean images. PDA-GAN enhances this approach by introducing a pixel-level discriminator that aligns feature spaces of the source and target domains, allowing for better modeling of visual textures. The paper also introduces new content-aware metrics to evaluate the generated layouts.
Results
PDA-GAN achieved state-of-the-art performance, outperforming CGL-GAN across various metrics, including background complexity and occlusion degrees. Notable relative improvements were reported, such as 19.07% on the content-aware Fréchet Inception Distance (cFID) metric, indicating enhanced visual quality in the generated layouts.
Implications
The findings suggest that GAN-based models can significantly improve the quality of graphic design layouts in advertising, making them more relevant and visually appealing. The proposed methods and dataset can be applied to other domains requiring layout generation, such as web design and marketing materials.
DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting
Time Series
Graph Learning
Interpretability
- DSPR effectively decouples stable trends from regime-dependent dynamics in industrial time series forecasting.
- The framework incorporates an Adaptive Window module for flow-dependent transport delays and a Physics-Guided Dynamic Graph for learning time-varying interactions.
- DSPR achieves state-of-the-art predictive performance with high Mean Conservation Accuracy and Total Variation Ratio.
- The model provides interpretable insights into physical mechanisms, enhancing understanding beyond mere prediction.
Summary
The paper introduces DSPR (Dual-Stream Physics-Residual Networks), a novel framework designed to enhance the accuracy and physical plausibility of industrial time series forecasting. Traditional data-driven models often excel in statistical performance but fail to account for the complex, regime-dependent dynamics present in real-world systems. DSPR addresses this by decoupling stable temporal patterns from regime-dependent residual dynamics through a dual-stream architecture. The first stream captures the statistical evolution of individual variables, while the second stream focuses on residual dynamics, utilizing an Adaptive Window module to estimate flow-dependent transport delays and a Physics-Guided Dynamic Graph to incorporate physical priors. The framework is validated on four industrial benchmarks, demonstrating significant improvements in forecasting accuracy and robustness during regime shifts, achieving state-of-the-art performance metrics. Additionally, DSPR provides interpretable insights into learned interaction structures and adaptive lags, aligning with known domain mechanisms. This work suggests that integrating physics-informed inductive biases into forecasting models can bridge the gap between advanced predictive capabilities and trustworthy autonomous control systems.
Methodology
DSPR employs a dual-stream architecture that separates the forecasting process into a Trend Stream for stable patterns and a Residual Stream for regime-dependent dynamics. It integrates an Adaptive Window module to learn transport delays and a Physics-Guided Dynamic Graph to model time-varying interactions, ensuring that the model respects physical laws while maintaining predictive accuracy.
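The trend/residual split described above can be illustrated with a toy decomposition; a moving-average trend plus a damped residual stands in for the two learned streams, and the window and damping factor are illustrative choices, not the paper's.

```python
import numpy as np

def dual_stream_forecast(series, window=4, damp=0.8):
    """Toy illustration of the dual-stream split: a moving-average
    trend stream plus a residual stream. The window and damping
    factor are illustrative; the paper learns both streams."""
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")   # Trend Stream
    residual = series - trend                          # Residual Stream
    # One-step forecast: linear trend extrapolation + damped residual
    trend_next = 2 * trend[-1] - trend[-2]
    return trend_next + damp * residual[-1]
```

In DSPR the residual stream is itself a learned model conditioned on the physics-guided graph; the persistence term here only marks where that model would plug in.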
Results
DSPR was validated on four diverse datasets, achieving a Mean Conservation Accuracy exceeding 99% and a Total Variation Ratio of up to 97.2%. The framework consistently outperformed existing models, particularly in scenarios with regime shifts, while maintaining strong physical plausibility and providing interpretable insights into the underlying dynamics.
Implications
The findings suggest that DSPR can be effectively utilized in safety-critical industrial applications where both predictive accuracy and physical fidelity are essential. The model's ability to provide interpretable insights into physical mechanisms could facilitate better decision-making in autonomous control systems and enhance the trustworthiness of AI in industrial settings.
Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks
NLP
Large Language Models
Reinforcement Learning
- Introduces HyTuning, a hybrid post-training framework for LLMs.
- Proposes Progressive Reasoning Gain (PRG) to measure the reliability of reasoning steps.
- Addresses challenges of data scarcity, overconfidence, and erroneous updates in high-stakes tasks.
- Demonstrates significant improvements in accuracy and confidence faithfulness through extensive experiments.
Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks
Summary
This paper addresses the critical issue of confidence faithfulness in large language models (LLMs) deployed in high-stakes tasks, where confidently incorrect inferences can lead to severe consequences. The authors propose a hybrid post-training framework called HyTuning, which integrates Reinforcement Learning from Internal Feedback (RLIF) with Reasoning Distillation (RD). The key innovation is the introduction of Progressive Reasoning Gain (PRG), a metric that measures whether reasoning steps progressively strengthen the support for the final answer. This approach balances LLM performance against the need for confidence faithfulness by adaptively reweighting RLIF and RD based on the PRG metric. The methodology addresses challenges such as the scarcity of high-quality training data, the tendency toward overconfidence in LLM outputs, and the risk of amplifying erroneous updates through indiscriminate fusion of training signals. The experimental results demonstrate that HyTuning significantly improves accuracy and confidence faithfulness across various benchmarks, supporting the notion that 'Less Approximates More' in high-stakes contexts.
Methodology
The methodology involves a hybrid approach that combines RLIF, which uses self-certainty as a reward signal, with RD, which employs high-quality reasoning traces. The PRG metric is used to adaptively weigh the contributions of RLIF and RD, ensuring that the model aligns its confidence with accurate reasoning paths while mitigating overconfidence and erroneous updates.
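A minimal sketch of the adaptive weighting idea: the PRG metric sets a blend weight between the two loss signals. The sigmoid mapping from PRG to weight is an illustrative assumption; the paper's exact scheme may differ.

```python
import math

def hybrid_loss(rlif_loss, rd_loss, prg, temperature=1.0):
    """Blend the two training signals with a PRG-dependent weight:
    high PRG (reasoning steps progressively support the answer)
    trusts internal feedback (RLIF); low PRG leans on distilled
    traces (RD). The sigmoid mapping is an illustrative choice."""
    w = 1.0 / (1.0 + math.exp(-prg / temperature))  # weight in (0, 1)
    return w * rlif_loss + (1.0 - w) * rd_loss
```

With PRG near zero the two signals contribute equally; strongly positive or negative PRG pushes the blend toward one signal, which mitigates the indiscriminate-fusion risk the summary describes.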
Results
The experiments conducted on domain-specific and general benchmarks show that HyTuning leads to significant improvements in both accuracy and confidence faithfulness, validating the effectiveness of the proposed framework in high-stakes applications.
Implications
The findings suggest that HyTuning can enhance the reliability of LLMs in critical applications, reducing the risk of harmful outcomes due to overconfident but incorrect predictions. This has important implications for fields such as cybersecurity, finance, and medicine, where accurate and trustworthy AI systems are essential.
Provably Adaptive Linear Approximation for the Shapley Value and Beyond
Theory
Efficient ML
Interpretability
- Introduces a theoretical framework for approximating semi-values with improved query complexities.
- Develops Adalina, an adaptive algorithm that achieves linear-time and linear-space efficiency.
- Establishes a connection between existing approximation algorithms and provides insights on paired sampling benefits.
- Demonstrates that the proposed methods can significantly reduce the number of utility queries required for accurate approximations.
Provably Adaptive Linear Approximation for the Shapley Value and Beyond
Summary
This paper addresses the challenge of efficiently approximating the Shapley value and its broader family of semi-values, which are crucial in various attribution problems but typically require an exponential number of utility queries for exact computation. The authors propose a theoretical framework that leverages a vector concentration inequality to achieve sharper query complexities for existing unbiased randomized algorithms under a Θ(n) space constraint. They develop a linear-space algorithm that requires O(n/ε² log(1/δ)) utility queries to ensure a specified accuracy level. This framework integrates various existing methods, including OFA and kernelSHAP, and characterizes the benefits of paired sampling. The authors introduce Adalina, the first adaptive, linear-time, linear-space randomized algorithm that minimizes mean square error for specific utility functions. The theoretical findings are supported by experimental validation, demonstrating the algorithm's efficiency and effectiveness in approximating semi-values.
Methodology
The authors utilize a vector concentration inequality to derive sharper query complexities for unbiased randomized algorithms. They systematically develop a linear-space algorithm, Adalina, which minimizes mean square error while adhering to a Θ(n) space constraint. The framework bridges existing algorithms and characterizes conditions under which paired sampling is advantageous.
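For context, the standard unbiased permutation-sampling estimator — the O(n)-space family the paper's framework analyzes, on top of which Adalina adds adaptivity — can be sketched as follows.

```python
import random

def shapley_mc(utility, n, num_perms=2000, seed=0):
    """Unbiased permutation-sampling estimator for Shapley values in
    O(n) space. `utility` maps a frozenset of player indices to a
    real value; each permutation contributes one marginal
    contribution per player."""
    rng = random.Random(seed)
    phi = [0.0] * n
    players = list(range(n))
    for _ in range(num_perms):
        rng.shuffle(players)
        coalition, prev = set(), utility(frozenset())
        for p in players:
            coalition.add(p)
            cur = utility(frozenset(coalition))
            phi[p] += cur - prev          # marginal contribution of p
            prev = cur
    return [v / num_perms for v in phi]
```

Each permutation costs n + 1 utility queries, so the query budget in the paper's O(n/ε² log(1/δ)) bound translates directly into a number of sampled permutations.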
Results
The proposed algorithm, Adalina, achieves a query complexity of O(n/ε² log(1/δ)) for approximating semi-values, significantly improving upon previous methods that required Θ(n²) space. The theoretical framework and algorithmic developments are experimentally validated, demonstrating their effectiveness in practical applications.
Implications
The findings have significant implications for large-scale applications in machine learning where efficient attribution methods are required. The ability to approximate semi-values with reduced query complexity and space requirements can enhance the scalability and applicability of these methods in real-world scenarios.
Learning is Forgetting: LLM Training As Lossy Compression
NLP
Large Language Models
Theory
- LLMs are conceptualized as instances of lossy compression, retaining only relevant information from training data.
- Pre-training follows a two-phase trajectory consistent with Information Bottleneck theory, with models approaching optimal compression over time.
- The degree of optimal compression correlates significantly with performance across multiple benchmarks for various LLM families.
- Quantifying preference information in models predicts downstream performance, indicating alignment with human-like preferences.
Learning is Forgetting: LLM Training As Lossy Compression
Summary
This paper presents a novel perspective on the training of large language models (LLMs) by framing it as a process of lossy compression. The authors argue that LLMs retain only the information relevant to their objectives during training, akin to how lossy compression techniques discard less critical data to optimize storage. They demonstrate that pre-training dynamics in LLMs align with theoretical predictions from Information Bottleneck theory, showing a two-phase trajectory where models first increase mutual information before compressing input information. The study reveals that different LLMs compress information differently based on their training data and methodologies, yet the optimality of compression correlates significantly with downstream performance across various benchmarks. By quantifying the preference information retained in models, the authors establish a strong predictive relationship between representation structure and model performance. This work provides a unified information-theoretic framework for understanding LLM training and offers insights into how these models achieve their impressive results.
Methodology
The authors employed Information Theory and the Information Bottleneck framework to analyze the pre-training dynamics of LLMs. They conducted empirical evaluations across multiple open-weight models, measuring mutual information and compression efficiency during training to establish correlations with downstream performance metrics.
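The mutual-information quantities tracked in such analyses can be estimated, in the simplest case, with a plug-in histogram estimator over discretized observations. This is a simplification for illustration; the paper's estimator may differ.

```python
import math

def mutual_information(xs, ys):
    """Plug-in (histogram) estimate of I(X;Y) in nats from paired
    discrete observations -- the kind of quantity tracked to follow
    the fitting-then-compression trajectory. Discretization of
    continuous representations is assumed to happen upstream."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi
```

Perfectly correlated variables give I(X;Y) = H(X) (ln 2 for a fair binary variable), while independent variables give zero, which is the contrast the two-phase trajectory is measured against.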
Results
The study found that LLMs, particularly those above 7 billion parameters, achieve optimal compression as training progresses, with significant correlations between compression efficiency and performance on six different benchmarks. The analysis of 47 LLMs revealed a strong predictive relationship (r = 0.76, p < 0.001) between the amount of preference information retained and model performance.
Implications
This research provides a deeper understanding of LLM training processes, suggesting that insights from information theory can guide the development of more efficient and interpretable models. It also opens avenues for enhancing model performance by focusing on the retention of relevant information during training.
Optimal Decay Spectra for Linear Recurrences
NLP
Large Language Models
Theory
- Introduces Position-Adaptive Spectral Tapering (PoST) to improve long-range memory in linear recurrent models.
- Establishes a design blueprint for optimal memory channel distribution based on logarithmic equipartition.
- Demonstrates minimax optimality through Spectral Reparameterization for geometrically spaced decay rates.
- Implements Position-Adaptive Scaling to eliminate scale mismatch and enhance approximation bounds.
Optimal Decay Spectra for Linear Recurrences
Summary
This paper addresses the limitations of linear recurrent models in sequence processing, particularly their suboptimal long-range memory due to decay spectrum issues. The author introduces Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework that combines two mechanisms: Spectral Reparameterization and Position-Adaptive Scaling. Spectral Reparameterization ensures geometrically spaced log-decay rates, achieving minimax optimality for long-range dependencies. Position-Adaptive Scaling addresses the scale mismatch in static spectra, enhancing the model's performance by sharpening approximation bounds. The proposed framework integrates seamlessly into various architectures, including Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet. Empirical evaluations demonstrate significant improvements in zero-shot language modeling and long-context retrieval tasks, showcasing PoST's effectiveness in enhancing memory retention and processing efficiency in linear recurrent models.
Methodology
The paper employs theoretical analysis to diagnose failure modes in existing linear recurrent models and proposes PoST as a solution. It combines Spectral Reparameterization to enforce optimal decay rates and Position-Adaptive Scaling to dynamically adjust the spectral contributions based on the position in the sequence, ensuring efficient memory utilization.
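A sketch of the two mechanisms under stated assumptions: geometric spacing of the decay rates (logarithmic equipartition of memory horizons), and a position-dependent rescaling whose exact functional form below is our illustrative choice, not the paper's parameterization.

```python
import numpy as np

def geometric_decay_spectrum(num_channels, lam_min=1e-4, lam_max=1.0):
    """Decay rates spaced geometrically, i.e. equipartitioned on a
    log scale -- the spectral layout the paper argues is minimax-
    optimal. Channel k forgets at rate lam_k, giving an effective
    memory horizon of roughly 1/lam_k."""
    return np.exp(np.linspace(np.log(lam_min), np.log(lam_max), num_channels))

def position_adaptive_scale(decays, t):
    """Illustrative position-dependent rescaling: at position t,
    upweight channels whose memory horizon matches t. The exact
    form here is an assumption."""
    log_horizons = -np.log(decays)           # log(1 / lam_k)
    return np.exp(-np.abs(log_horizons - np.log(max(t, 1))))
```

With geometric spacing, consecutive channels cover memory horizons in a constant ratio, so no band of timescales is over- or under-served; the rescaling then shifts emphasis toward the channels relevant at the current position.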
Results
The implementation of PoST across various architectures resulted in consistent improvements in zero-shot language modeling and significant gains in long-context retrieval tasks, particularly for Mamba-2. The framework demonstrated enhanced performance metrics compared to traditional linear recurrent models, validating its effectiveness.
Implications
The findings suggest that PoST can be widely applied to enhance the performance of linear recurrent models in various sequence processing tasks, potentially leading to advancements in natural language processing and other domains requiring efficient long-range memory retention.
A Systematic Framework for Tabular Data Disentanglement
Theory
Generative Models
Time Series
- Introduces a systematic framework for tabular data disentanglement.
- Modularizes the disentanglement process into four core components.
- Identifies limitations of existing methods and proposes a comprehensive view.
- Demonstrates the framework's applicability through a case study.
A Systematic Framework for Tabular Data Disentanglement
Summary
This paper addresses the challenges of disentangling tabular data, which is prevalent in various industries such as finance and industrial control systems. The authors propose a systematic framework that modularizes the disentanglement process into four core components: data extraction, data modeling, model analysis, and latent representation extrapolation. The need for a structured approach arises from the complex interrelationships among attributes in tabular data, which can lead to inefficiencies in existing methods like factor analysis, CT-GAN, and VAE. The proposed framework aims to provide a comprehensive understanding of tabular data disentanglement, identify research gaps, and facilitate the development of more robust techniques. A case study is presented to demonstrate the framework's applicability in synthetic tabular data generation, highlighting its potential for improving data synthesis tasks.
Methodology
The authors developed a framework that breaks down the disentanglement process into four essential components: data extraction, data modeling, model analysis, and latent representation extrapolation. They analyzed existing methods to identify their limitations and proposed a more comprehensive approach to tabular data disentanglement.
Results
The framework was successfully applied in a case study focused on synthetic tabular data generation, demonstrating its effectiveness in improving data synthesis tasks and providing insights into the complexities of tabular data relationships.
Implications
The proposed framework can enhance the understanding and processing of tabular data in various applications, including anomaly detection and real-time decision-making in industrial systems. It sets the stage for future research aimed at developing more efficient and scalable disentanglement techniques.
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
NLP
Large Language Models
Interpretability
- Introduces a multi-token activation patching framework for analyzing steering vectors in LLMs.
- Finds that refusal steering primarily interacts with the OV circuit, with minimal impact from the QK circuit.
- Demonstrates that different steering methodologies leverage highly interchangeable circuits.
- Shows that refusal steering vectors can be sparsified by 90-99% while maintaining performance.
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
Summary
This paper investigates the internal mechanisms of steering vectors applied to large language models (LLMs), focusing on their effectiveness in steering refusal responses. The authors propose a multi-token activation patching framework to analyze how different steering methodologies interact with model components, particularly the attention mechanism. They find that steering vectors primarily engage with the OV circuit while largely ignoring the QK circuit, leading to only a minor performance drop when attention scores are frozen. The study reveals that different steering methodologies utilize functionally interchangeable circuits with over 90% overlap. Furthermore, the authors introduce a steering value vector decomposition that provides semantic interpretability, even when the steering vector lacks clarity. They demonstrate that refusal steering vectors can be sparsified by 90-99% without significant performance loss, indicating a convergence on a small subset of important dimensions across methodologies. This mechanistic understanding of steering vectors not only advances the scientific knowledge of LLMs but also aids practitioners in assessing robustness and designing better steering interventions.
Methodology
The authors employ a multi-token activation patching framework to analyze the interactions of steering vectors with model components during inference. They conduct a case study focused on refusal steering and utilize mathematical decomposition to reveal semantically interpretable concepts.
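The sparsification result can be reproduced in spirit with a simple top-k-by-magnitude rule; the selection criterion is an assumption, and the paper may prune dimensions differently.

```python
import numpy as np

def sparsify_steering_vector(v, keep_fraction=0.05):
    """Zero all but the largest-magnitude dimensions of a steering
    vector -- the 90-99% sparsification regime the paper reports.
    Top-k by absolute value is an assumed selection rule."""
    k = max(1, int(round(len(v) * keep_fraction)))
    idx = np.argsort(np.abs(v))[-k:]          # indices of the k largest |v_i|
    sparse = np.zeros_like(v)
    sparse[idx] = v[idx]
    return sparse
```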
Results
The study finds that different steering methodologies have over 90% overlap in the circuits they utilize. Freezing attention scores results in only an 8.75% performance drop, indicating that the OV circuit is crucial for steering effectiveness. The authors also demonstrate that refusal steering vectors can be significantly sparsified without losing performance.
Implications
The findings provide insights into the mechanisms of steering vectors, which can enhance the alignment of LLMs with human intent. This understanding can help in diagnosing failures, improving robustness, and designing more effective steering interventions in LLM applications.
Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
Large Language Models
Efficient ML
Optimization
- Introduces a progressive QAT framework that enhances stability during low-bit training.
- Utilizes outlier channel splitting to reduce quantization errors.
- Enables a 'train once, deploy any precision' capability through nested quantization grids.
- Achieves significant performance improvements over existing QAT methods on LLaMA models.
Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
Summary
The paper presents BIT-BY-BIT, a novel progressive quantization-aware training (QAT) framework designed to address the challenges of training large language models (LLMs) at ultra-low precision. Traditional low-bit QAT methods often struggle with convergence instability and high training costs due to quantization noise and error accumulation. BIT-BY-BIT introduces a three-pronged approach: (1) a block-wise progressive training strategy that gradually reduces precision, ensuring stable initialization for low-bit optimization; (2) a nested structure of integer quantization grids that allows a single model to support multiple bit-widths without retraining; and (3) rounding-aware outlier channel splitting to mitigate quantization errors while preserving output integrity. The framework also incorporates microscaling groups aligned with OCP/NVIDIA standards to capture dynamic activation ranges. Custom operators for W2A2 and W2A16 configurations were developed, achieving significant speedups. Evaluations on LLaMA-2/3 demonstrate that BIT-BY-BIT outperforms existing QAT methods, achieving a minimal increase in perplexity compared to full-precision models, thus showcasing its effectiveness in ultra-low-bit regimes.
Methodology
The methodology involves a progressive training strategy that first quantizes weights and then activations, coupled with a nested quantization grid structure. Rounding-aware outlier channel splitting is employed to minimize quantization errors. The framework is evaluated using custom operators for low-bit configurations, demonstrating its efficiency and effectiveness in training LLMs.
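Outlier channel splitting itself has a compact form: duplicating a large-magnitude weight column and halving both copies leaves the layer's output unchanged (given a matching duplication of the input channel) while halving the range the quantizer must cover. A minimal numpy sketch, without the rounding-aware refinements the paper adds:

```python
import numpy as np

def split_outlier_channel(W, col):
    """Outlier channel splitting: duplicate weight column `col` and
    halve both copies. With the matching input channel duplicated,
    the layer output is unchanged, but the per-channel magnitude --
    and hence the quantization step size -- is halved."""
    half = W[:, col:col + 1] / 2.0
    return np.concatenate([W[:, :col], half, half, W[:, col + 1:]], axis=1)
```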
Results
BIT-BY-BIT significantly outperformed baseline methods like BitDistiller and EfficientQAT on LLaMA-2/3, achieving only a +2.25 increase in perplexity on WikiText2 compared to full-precision models. The method also demonstrated up to 11× speedup over BF16 in specific configurations.
Implications
The proposed framework has the potential to enhance the deployment of large language models in resource-constrained environments by enabling efficient low-bit quantization without sacrificing performance. This could lead to broader accessibility and application of LLMs in various domains.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Generative Models
Efficient ML
- Identifies the mismatch between data complexity and model readiness as a source of inefficiency in diffusion training.
- Introduces a semantic-aware image complexity metric that combines foreground dominance and typicality.
- Demonstrates significant improvements in IS and FID on ImageNet with a simple-to-complex training curriculum.
- Confirms that the order of image complexity is critical for performance, as reversing the curriculum harms results.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Summary
The paper addresses inefficiencies in diffusion training caused by randomly initialized networks encountering gradients from a full complexity spectrum. The authors propose 'Data Warmup', a curriculum learning strategy that schedules training images from simple to complex based on a semantic-aware complexity metric. This metric combines foreground dominance and typicality to score images, allowing a temperature-controlled sampler to prioritize low-complexity images initially, gradually transitioning to uniform sampling. Experiments on ImageNet 256×256 with various SiT backbones demonstrate that Data Warmup significantly improves Inception Score (IS) by up to 6.11 and Fréchet Inception Distance (FID) by up to 3.41, achieving baseline quality much earlier in training. The study also finds that reversing the curriculum order degrades performance, confirming the importance of the simple-to-complex progression. The method requires minimal preprocessing time and can be combined with other accelerators like REPA, highlighting its efficiency and effectiveness in enhancing diffusion model training.
Methodology
The authors developed a semantic-aware complexity metric that scores images based on foreground dominance and typicality. A temperature-controlled sampler is then used to prioritize low-complexity images during the initial training phase, transitioning to uniform sampling as training progresses. The method is implemented without modifying the model or loss function, and it involves a one-time preprocessing step for image scoring.
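A sketch of a temperature-controlled sampler of this kind; the linear annealing of the inverse temperature is an assumed schedule, not necessarily the paper's.

```python
import numpy as np

def warmup_sampling_probs(complexity, step, total_steps, beta0=5.0):
    """Temperature-controlled sampler: early in training, sharply
    favor low-complexity images; anneal toward uniform sampling by
    the end. The linear schedule for the inverse temperature is an
    illustrative assumption."""
    beta = beta0 * (1.0 - step / total_steps)      # high early, 0 at the end
    logits = -beta * np.asarray(complexity, dtype=float)
    p = np.exp(logits - logits.max())              # numerically stable softmax
    return p / p.sum()
```

At step 0 the distribution is peaked on simple images; at the final step it is exactly uniform, matching the simple-to-complex progression the ablation shows is essential.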
Results
Data Warmup led to improvements in IS by up to 6.11 and FID by up to 3.41 on ImageNet 256×256 across various SiT scales. The method allowed models to reach baseline quality significantly faster, with the simple-to-complex ordering proving essential for these gains.
Implications
The findings suggest that curriculum learning can be effectively applied to generative models, particularly in diffusion training, to enhance efficiency and performance. This approach could be beneficial in other domains where training costs are high and model readiness varies.
MIPT-SSM: Scaling Language Models with O(1) Inference Cache via Phase Transitions
NLP
Large Language Models
Theory
- Introduces a learned measurement rate for dynamic computation routing in sequence models.
- Proves the incompatibility of norm-preserving and selective forgetting in linear operators.
- Achieves significant performance improvements over Transformers in text classification tasks.
- Demonstrates a 42.8x reduction in memory usage compared to traditional Transformers.
MIPT-SSM: Scaling Language Models with O(1) Inference Cache via Phase Transitions
Summary
The paper introduces MIPT-SSM, a novel neural sequence architecture inspired by Measurement-Induced Phase Transitions (MIPT). The architecture employs a learned measurement rate, p_t, which dynamically routes computation between two distinct regimes: the wave phase, where information is distributed and complex, and the particle phase, where information is localized to the current token. This approach addresses the inherent incompatibility between norm-preserving and selective forgetting operations in sequence modeling, as proven in the paper. MIPT-SSM is predicted to undergo a phase transition at a critical sequence length of approximately 1024, aligning with observed memory scaling behaviors. Empirical results demonstrate that MIPT-SSM outperforms traditional Transformers in various tasks, achieving 90.5% accuracy on AG News compared to 73.6% for Transformers, while also significantly reducing memory requirements. The model's causal sparse key-value (KV) cache shows high efficiency, achieving a 99.8% sparsity rate and maintaining high accuracy in exact recall tasks. The findings suggest that MIPT-SSM can effectively balance memory efficiency and performance in language modeling tasks.
Methodology
The MIPT-SSM architecture utilizes a learned measurement rate, p_t, to switch between wave and particle modes of information processing. The model's state update is defined through a recurrence relation that incorporates both wave-like and particle-like behaviors. The training employs a parallel scan algorithm for efficiency, while inference operates in constant time per token. The model's performance is validated through empirical tests across various tasks, including text classification and language modeling.
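The wave/particle interpolation can be sketched as a convex mixture controlled by the measurement rate; this is an assumed simplification of the paper's recurrence, with illustrative parameter names.

```python
import numpy as np

def mipt_step(h, x, p, W_wave, W_in):
    """One recurrence step interpolating between a 'wave' update
    (information stays distributed across the state) and a 'particle'
    update (state collapses onto the current token). The convex
    mixture via the measurement rate p is an assumed simplification."""
    wave = W_wave @ h + W_in @ x   # distributed update: state carries history
    particle = W_in @ x            # localized update: state keeps only the input
    return (1.0 - p) * wave + p * particle
```

At p = 1 the state forgets everything but the current token; at p = 0 it behaves like a standard linear recurrence, which is the selective-forgetting trade-off the paper's incompatibility result formalizes.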
Results
MIPT-SSM achieved an accuracy of 90.5% on the AG News dataset, surpassing the Transformer model's 73.6%. It demonstrated a memory requirement of 810 MB compared to 34,651 MB for Transformers, indicating a 42.8x reduction. In exact-recall tasks, the model reached an accuracy of 96.8% with a highly sparse KV cache, averaging only 1.0 out of 512 slots used.
Implications
The MIPT-SSM architecture offers a promising approach to scaling language models by efficiently managing memory and improving performance. Its ability to dynamically adjust information processing modes could lead to advancements in various NLP applications, particularly in tasks requiring long-range dependencies and efficient memory usage.
Cognitive-Causal Multi-Task Learning with Psychological State Conditioning for Assistive Driving Perception
Multimodal
- Introduction of CauPsi, a cognitive science-grounded causal multi-task learning framework for ADAS.
- Implementation of a Causal Task Chain for hierarchical task dependency modeling.
- Incorporation of psychological state signals into multi-task learning through Cross-Task Psychological Conditioning.
- Achieved 82.71% mean accuracy on the AIDE dataset with only 5.05M parameters, surpassing prior work.
Cognitive-Causal Multi-Task Learning with Psychological State Conditioning for Assistive Driving Perception
Summary
This paper presents CauPsi, a novel cognitive-causal multi-task learning framework designed for advanced driver assistance systems (ADAS). The framework addresses the limitations of existing methods that treat recognition tasks as independent, failing to capture the cognitive causal structure of driving behavior. CauPsi incorporates two main mechanisms: a Causal Task Chain that facilitates the propagation of task predictions through a hierarchy of tasks—Traffic Context Recognition (TCR), Vehicle Context Recognition (VCR), Driver Emotion Recognition (DER), and Driver Behavior Recognition (DBR)—and Cross-Task Psychological Conditioning (CTPC), which integrates psychological state signals derived from driver facial expressions and body posture into all tasks. This approach models the influence of driver internal states on cognitive processes and decision-making. Evaluated on the AIDE dataset, CauPsi achieved a mean accuracy of 82.71% with only 5.05 million parameters, outperforming previous methods, particularly in DER and DBR. The study also includes ablation studies confirming the independent contributions of each component and demonstrates that the psychological state signal can learn task-specific patterns without explicit annotations.
Methodology
CauPsi employs a Causal Task Chain for soft-label propagation among tasks using learnable prototype embeddings, and Cross-Task Psychological Conditioning (CTPC) to inject psychological state signals into all tasks. The framework utilizes a bidirectional Cross-View Attention mechanism based on a MobileNetV3-Small backbone to enhance feature extraction from both driver and environmental perspectives.
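The soft-label propagation step can be sketched as mixing an upstream task's class distribution over learnable prototypes and fusing the result into the downstream task's features. Concatenation is our assumed fusion, and all names here are illustrative; the paper has its own mechanism.

```python
import numpy as np

def propagate_soft_labels(upstream_probs, prototypes, downstream_feat):
    """Causal Task Chain sketch: an upstream task's softmax
    distribution is mixed over learnable class prototypes, and the
    resulting context vector is fused into the downstream features.
    Fusion by concatenation is an assumption."""
    context = upstream_probs @ prototypes   # (C,) @ (C, d) -> (d,)
    return np.concatenate([downstream_feat, context])
```

Because the upstream distribution is soft rather than a hard label, uncertainty in early tasks (e.g. TCR) propagates to later ones (e.g. DBR) instead of being discarded.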
Results
CauPsi achieved a mean accuracy of 82.71% on the AIDE dataset, surpassing previous models by 1.0%. Notable improvements were observed in Driver Emotion Recognition (+3.65%) and Driver Behavior Recognition (+7.53%). Ablation studies confirmed the effectiveness of each component in the framework.
Implications
The findings suggest that incorporating cognitive and psychological factors into multi-task learning can significantly enhance the performance of ADAS. This approach could lead to more adaptive and responsive driving assistance technologies, improving overall driver safety and experience.
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
Generative Models
Graph Learning
Theory
- CausalVAE is introduced as a plug-in for latent world models to improve counterfactual dynamics.
- The model captures causal relationships among latent variables using a directed acyclic graph (DAG) structure.
- A staged training strategy is employed to stabilize sequential training and enhance interpretability.
- Significant improvements in counterfactual retrieval metrics, especially in the Physics benchmark.
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
Summary
This paper introduces CausalVAE as a structural module integrated into latent world models to enhance counterfactual dynamics and robustness under distribution shifts. The authors argue that traditional world models, while effective in factual predictions, often fail to capture causal relationships among latent variables, leading to poor performance in counterfactual scenarios. By incorporating CausalVAE, the model learns a directed acyclic graph (DAG) structure that delineates causal dependencies among latent factors. The proposed methodology includes a staged training strategy that first establishes predictive dynamics before activating structural regularization, which stabilizes training and improves interpretability. The results demonstrate significant improvements in counterfactual retrieval, particularly on the Physics benchmark, where the model achieved a 102.5% increase in CF-H@1 metric compared to paired baselines. This work emphasizes the importance of causal representation in enhancing the generalization and robustness of world models in dynamic environments.
Methodology
The authors integrate a structured causal disentanglement module into a world-model pipeline, utilizing the CausalVAE causal layer and a differentiable DAG constraint to enforce causal relationships among latent factors. A staged training approach is implemented, where predictive dynamics are learned first, followed by the activation of structural regularization. This method allows for alignment-anchored identifiability analysis, enhancing the interpretability of the learned causal structure.
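Differentiable DAG constraints of this family are typically enforced through an acyclicity penalty that vanishes exactly on acyclic weighted graphs; the polynomial NOTEARS-style form below is one standard choice, and the paper's exact variant is an assumption on our part.

```python
import numpy as np

def dag_penalty(A):
    """Polynomial acyclicity penalty h(A) = tr((I + A*A/d)^d) - d,
    a standard differentiable DAG constraint: h(A) = 0 iff the
    weighted adjacency matrix A encodes an acyclic graph."""
    d = A.shape[0]
    M = np.eye(d) + (A * A) / d     # elementwise square keeps the penalty smooth
    return np.trace(np.linalg.matrix_power(M, d)) - d
```

Adding this penalty to the training loss pushes the learned adjacency among latent factors toward a DAG, which is what makes the recovered causal structure interpretable.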
Results
The integration of CausalVAE into world models resulted in a 102.5% improvement in the CF-H@1 metric on the Physics benchmark, with a notable increase from 11.0 to 41.0 in a GNN-NLL setting, representing a 272.7% gain. The model demonstrated enhanced robustness under distribution shifts and improved counterfactual retrieval capabilities.
Implications
The findings suggest that incorporating causal structures into world models can significantly enhance their performance in dynamic environments, making them more reliable for applications requiring counterfactual reasoning and robust generalization. This could have implications for fields such as robotics, autonomous systems, and any domain where understanding causal relationships is critical for decision-making.
Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
Large Language Models
Optimization
Efficient ML
- Introduction of two efficient heuristics (GH and AGH) for LLM inference allocation.
- Incorporation of constraint-aware mechanisms to ensure feasibility under resource and SLO constraints.
- AGH achieves over 260× speedup compared to traditional MILP approaches.
- Robust performance under stress tests, maintaining stable costs and controlled SLO violations.
Read more
Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
Summary
This paper addresses the challenges of deploying large language model (LLM) inference at scale, focusing on the joint optimization of model selection, GPU provisioning, parallelism configuration, and workload distribution under strict service-level objectives (SLOs) related to latency, accuracy, and budget. The authors propose two novel constraint-aware heuristics: the Greedy Heuristic (GH) for single-pass allocation and the Adaptive Greedy Heuristic (AGH), which enhances GH through multi-start construction, relocate-based local search, and GPU consolidation. These heuristics incorporate three mechanisms to ensure feasibility under tightly coupled constraints: TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade. The proposed methods significantly outperform traditional mixed-integer linear programming (MILP) approaches, achieving feasible solutions in under one second while maintaining near-optimal costs. The paper demonstrates that AGH provides over 260× speedup on large-scale instances and remains robust under out-of-sample stress tests, contrasting sharply with the performance degradation of exact solvers under similar conditions.
Methodology
The authors formulated the allocation problem as a mixed-integer linear program (MILP) and developed two heuristics: GH and AGH. GH uses a single-pass allocation strategy, while AGH enhances this with multi-start construction, local search, and GPU consolidation. Three constraint-aware mechanisms are integrated to ensure feasibility regarding memory, delay, error, and budget constraints.
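A minimal sketch of the cost-per-effective-coverage greedy step described for GH (the config table, fields, and numbers below are illustrative assumptions, not values from the paper):

```python
configs = [
    # (name, cost_per_hour, throughput_req_s, latency_ms, mem_gb, tp_degree)
    ("small-tp1", 1.0, 40.0, 180.0, 24, 1),
    ("small-tp2", 2.1, 70.0, 120.0, 48, 2),
    ("large-tp4", 8.0, 150.0, 90.0, 160, 4),
]

def greedy_allocate(demand, latency_slo_ms, gpu_mem_gb):
    """Repeatedly pick the feasible config with the lowest cost per unit of
    remaining demand it can actually serve, until demand is covered."""
    plan, cost = [], 0.0
    while demand > 1e-9:
        feasible = [c for c in configs
                    if c[3] <= latency_slo_ms          # meets latency SLO
                    and c[4] <= gpu_mem_gb * c[5]]     # model fits across TP shards
        if not feasible:
            return None  # infeasible under the given SLO / memory limits
        # cost-per-effective-coverage ranking over the *remaining* demand
        best = min(feasible, key=lambda c: c[1] / min(c[2], demand))
        plan.append(best[0])
        demand -= best[2]
        cost += best[1]
    return plan, cost
```

AGH's multi-start construction would rerun a randomized variant of this loop from several orderings and keep the cheapest feasible plan.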
Results
The heuristics produced feasible solutions in under one second, with AGH closely approaching optimal costs while achieving over 260× speedup on large-scale instances. Under stress tests with parameter inflation, AGH maintained controlled SLO violations and stable costs, whereas the exact solver's performance degraded significantly.
Implications
The proposed methods can enhance the efficiency and scalability of LLM inference services, making them more adaptable to varying workloads and resource availability. This has significant implications for AI service providers looking to optimize their operations while meeting stringent performance requirements.
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
Computer Vision
Efficient ML
Theory
- Introduction of Kuramoto Oscillatory Phase Encoding (KoPE) to Vision Transformers.
- KoPE enhances learning efficiency through neuro-inspired synchronization mechanisms.
- Demonstrated improvements in training, parameter, and data efficiency.
- KoPE excels in structured understanding tasks like semantic segmentation and few-shot reasoning.
Read more
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
Summary
This paper introduces Kuramoto Oscillatory Phase Encoding (KoPE), a novel approach that integrates an evolving phase state into Vision Transformers to enhance learning efficiency through neuro-inspired synchronization. Unlike traditional deep learning architectures that primarily utilize activation values, KoPE incorporates phase dynamics, which are crucial for spatiotemporal neural processing and feature binding. The authors argue that synchronization can serve as an inductive bias that improves training, parameter, and data efficiency. KoPE is shown to significantly benefit various tasks requiring structured understanding, such as semantic segmentation, vision-language representation alignment, and few-shot abstract visual reasoning. The theoretical and empirical analyses indicate that KoPE facilitates attention concentration, leading to improved learning efficiency. Overall, this work bridges neuro-inspired principles with scalable neural network architectures, suggesting a practical route for enhancing state-of-the-art models.
Methodology
The methodology involves incorporating phase representations for each token in the Vision Transformer architecture, which are updated using Kuramoto dynamics. This phase evolution is coupled with token representations through complex-form rotations in the attention module, allowing synchronization dynamics to promote structure formation from data.
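A toy sketch of the Kuramoto update that drives this phase evolution (coupling strength, step size, and initial phases are illustrative; KoPE applies the same idea per token inside attention):

```python
import math

def kuramoto_step(phases, omegas, K, dt):
    # theta_i' = omega_i + (K/N) * sum_j sin(theta_j - theta_i)
    n = len(phases)
    return [phases[i] + dt * (omegas[i] +
            K / n * sum(math.sin(phases[j] - phases[i]) for j in range(n)))
            for i in range(n)]

def order_parameter(phases):
    # r in [0, 1]; r -> 1 means the oscillators are synchronized
    n = len(phases)
    re = sum(math.cos(p) for p in phases) / n
    im = sum(math.sin(p) for p in phases) / n
    return math.hypot(re, im)

phases = [0.0, 1.0, 2.0, 3.0]       # initially spread out
omegas = [0.0] * 4                  # identical natural frequencies
r0 = order_parameter(phases)
for _ in range(200):
    phases = kuramoto_step(phases, omegas, K=2.0, dt=0.05)
```

With identical frequencies and sufficient coupling, the order parameter rises toward 1, which is the synchronization behavior the paper leverages as an inductive bias.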
Results
The experiments conducted across supervised and self-supervised learning tasks show that KoPE significantly improves training efficiency, parameter efficiency, and data efficiency. It also outperforms traditional methods in tasks requiring structured understanding, such as semantic segmentation and few-shot abstract visual reasoning.
Implications
The findings suggest that integrating synchronization-based dynamics into neural architectures can enhance learning efficiency and generalization capabilities, potentially leading to more effective models in computer vision and related fields.
GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control
Reinforcement Learning
Generative Models
Robotics
- GIRL addresses imagination drift in model-based reinforcement learning by using a cross-modal grounding signal.
- The framework employs an uncertainty-adaptive trust-region bottleneck to control the imagination process.
- Theoretical contributions include a new value-gap bound that remains valid as the discount factor approaches one.
- Empirical results show GIRL outperforms DreamerV3 and TD-MPC2 across various tasks, demonstrating improved sample efficiency.
Read more
GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control
Summary
The paper introduces GIRL (Generative Imagination Reinforcement Learning), a novel framework designed to enhance model-based reinforcement learning (MBRL) by addressing the issue of imagination drift during long-horizon planning. The authors identify that traditional MBRL approaches, such as DreamerV3, suffer from compounded model errors that lead to unreliable value estimates and policies. To mitigate this, GIRL incorporates two key innovations: a cross-modal grounding signal from a frozen foundation model (DINOv2) that anchors the latent transition prior in a semantically consistent embedding space, and an uncertainty-adaptive trust-region bottleneck that constrains the imagination process based on learned trust regions. The theoretical contributions include a re-derivation of the value-gap bound that connects the I-ELBO objective to real-environment regret. Empirical evaluations across multiple benchmark suites demonstrate that GIRL significantly reduces latent rollout drift and achieves higher returns with fewer environment interactions compared to existing methods.
Methodology
GIRL utilizes a latent world-model framework that integrates a cross-modal grounding vector derived from DINOv2 to maintain semantic consistency in imagined rollouts. It employs a trust-region bottleneck that formulates the KL regularizer as a Lagrange multiplier in a constrained optimization problem, allowing for controlled imagination drift based on Expected Information Gain and Relative Performance Loss signals. The model is evaluated using a recurrent state-space model and incorporates a cross-modal consistency loss to penalize physics-defying hallucinations.
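The Lagrange-multiplier treatment of the KL regularizer can be sketched as a standard dual ascent step (the learning rate, trust-region radius, and KL trace below are illustrative, not from the paper):

```python
def dual_update(lam, measured_kl, eps, lr=0.1):
    # Raise the KL weight when measured imagination drift exceeds the
    # trust-region radius eps; lower it (never below 0) when drift is inside.
    return max(0.0, lam + lr * (measured_kl - eps))

lam = 1.0
kl_trace = [3.0, 2.5, 1.0, 0.5, 0.5]   # illustrative per-step drift measurements
for kl in kl_trace:
    lam = dual_update(lam, kl, eps=1.0)
```

The multiplier tightens the bottleneck while rollouts are drifting and relaxes it once they are back inside the trust region, which is the "uncertainty-adaptive" behavior described above.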
Results
GIRL demonstrated a reduction in latent rollout drift by 38–61% compared to DreamerV3 across various tasks. It achieved higher asymptotic returns with 40–55% fewer environment steps on tasks with a horizon of 500 or more. Additionally, GIRL outperformed TD-MPC2 in sparse-reward and high-contact settings, as measured by Interquartile Mean (IQM) and Probability of Improvement (PI). The distilled-prior variant significantly reduced DINOv2 inference overhead from 22% to under 4% of total wall-clock time.
Implications
The advancements presented in GIRL could lead to more robust and efficient reinforcement learning systems, particularly in environments where long-horizon planning is critical. The methods could be applied in robotics, autonomous systems, and any domain requiring reliable decision-making under uncertainty.
Sheaf-Laplacian Obstruction and Projection Hardness for Cross-Modal Compatibility on a Modality-Independent Site
Multimodal
Theory
Graph Learning
- Introduces a modality-independent site for evaluating cross-modal compatibility.
- Defines projection hardness and sheaf-Laplacian obstruction as key invariants for alignment.
- Establishes a connection between sheaf spectral gap and global alignment stability.
- Demonstrates non-transitivity in compatibility and the potential for bridging through intermediate modalities.
Read more
Sheaf-Laplacian Obstruction and Projection Hardness for Cross-Modal Compatibility on a Modality-Independent Site
Summary
This paper presents a unified framework for analyzing cross-modal compatibility in learned representations, focusing on a modality-independent neighborhood site equipped with a cellular sheaf of finite-dimensional real inner-product spaces. The author introduces two key concepts: projection hardness, which quantifies the minimal complexity required for a global map to align embeddings, and sheaf-Laplacian obstruction, which measures the spatial variation needed for local projections to achieve a target alignment error. The framework distinguishes between two failure modes: hardness failure, where no low-complexity global projection exists, and obstruction failure, where local projections cannot be made globally consistent. The author links the sheaf spectral gap to the stability of global alignment and derives bounds relating obstruction energy to excess global-map error under mild Lipschitz assumptions. Additionally, the paper demonstrates that compatibility is generally non-transitive and introduces bridging via composed projection families, showing that an intermediate modality can reduce effective hardness even when direct alignment is infeasible. The proposed framework is operational, allowing for direct computation of the obstruction invariant using established optimization procedures.
Methodology
The methodology involves defining a fixed neighborhood site for all modalities, using a nested projection formalism to evaluate global hardness, and employing a projection-parameter sheaf to compute the sheaf-Laplacian energy. The framework allows for the direct computation of obstruction invariants and utilizes established optimization techniques for practical implementation.
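The sheaf-Laplacian energy has a concrete computational form: the Dirichlet energy E(x) = Σ over edges e=(u,v) of ||F_{u→e} x_u − F_{v→e} x_v||², where the F's are the sheaf's restriction maps. A toy computation (all maps and sections below are illustrative values, not from the paper):

```python
def apply(M, x):
    # apply a restriction map (matrix) to a stalk vector
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def sheaf_energy(edges, restrictions, sections):
    # edges: list of (u, v); restrictions[(u, v)] is F_{u->e} restricting u's
    # stalk onto edge e = (u, v); sections[u] is the vector at node u
    total = 0.0
    for (u, v) in edges:
        ru = apply(restrictions[(u, v)], sections[u])
        rv = apply(restrictions[(v, u)], sections[v])
        total += sum((a - b) ** 2 for a, b in zip(ru, rv))
    return total

I2 = [[1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1)]
restrictions = {(0, 1): I2, (1, 0): I2}
aligned = {0: [1.0, 2.0], 1: [1.0, 2.0]}   # globally consistent section
shifted = {0: [1.0, 2.0], 1: [2.0, 2.0]}   # locally inconsistent section
```

Zero energy means the local embeddings glue into a global section; positive energy is exactly the obstruction the paper quantifies.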
Results
The paper establishes that projection hardness and sheaf-Laplacian obstruction are crucial for understanding cross-modal compatibility. It provides structural theorems linking these concepts to alignment stability and demonstrates that compatibility is generally non-transitive. The author also presents explicit constructions showing how bridging through intermediate modalities can reduce alignment complexity.
Implications
The findings have significant implications for the design of multi-modal systems, particularly in improving alignment strategies across different data types. The framework can inform future research in cross-modal representation learning and enhance the robustness of models in practical applications.
TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis
Reinforcement Learning
Large Language Models
- TTVS enhances self-exploring RL by dynamically augmenting training data from unlabeled test queries.
- The framework consists of two modules: Online Variational Synthesis and Test-time Hybrid Exploration.
- TTVS outperforms existing test-time adaptation methods and state-of-the-art supervised RL techniques.
- The approach is agnostic to policy optimization algorithms, allowing flexible integration with various methods.
Read more
TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis
Summary
The paper introduces Test-Time Variational Synthesis (TTVS), a novel framework designed to enhance self-exploring reinforcement learning (RL) in Large Reasoning Models (LRMs) by dynamically augmenting training data from unlabeled test queries. Traditional reinforcement learning with verifiable rewards (RLVR) faces limitations in specialized domains where obtaining labeled data is costly or impractical. Existing test-time adaptation methods often rely on static query sets, risking overfitting to superficial patterns. TTVS addresses this by incorporating two key modules: Online Variational Synthesis, which generates diverse and semantically-equivalent variations of test queries to encourage deeper learning of underlying problem logic, and Test-time Hybrid Exploration, which balances accuracy-driven exploitation with consistency-driven exploration. The framework is agnostic to policy optimization algorithms, allowing it to be integrated with various RL methods. Extensive experiments demonstrate that TTVS significantly outperforms both traditional test-time adaptation methods and state-of-the-art supervised RL techniques, achieving superior performance in mathematical reasoning tasks using only unlabeled test-time data.
Methodology
The TTVS framework comprises two main components: (1) Online Variational Synthesis, which transforms static test queries into a dynamic stream of semantically-equivalent variations to promote learning of problem logic, and (2) Test-time Hybrid Exploration, which employs a dual-mode update strategy for balancing exploitation and exploration in the augmented data space. This approach allows models to adapt and improve continuously during test time without relying on labeled data.
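A minimal sketch of a consistency-driven reward of the kind usable for label-free exploration (the majority-vote rule here is a common self-consistency heuristic and an assumption on our part, not necessarily TTVS's exact formulation):

```python
from collections import Counter

def consistency_rewards(answers):
    # Reward rollouts whose final answer agrees with the majority vote
    # across sampled generations; no ground-truth label is needed.
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# four rollouts on the same (unlabeled) query
rewards = consistency_rewards(["42", "42", "41", "42"])
```

Because the signal comes purely from agreement among the model's own samples, it can be computed on unlabeled test queries and their synthesized variations alike.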
Results
TTVS was tested across eight different model architectures, showing superior performance in mathematical reasoning tasks. The framework not only surpassed other test-time adaptation methods but also outperformed state-of-the-art RL-based post-training methods that depend on large-scale labeled datasets.
Implications
The TTVS framework has significant implications for domains where labeled data is scarce or expensive to obtain, such as clinical diagnostics and aerospace engineering. By enabling models to self-evolve and adapt to new, unlabeled data, TTVS could facilitate advancements in various specialized fields requiring robust reasoning capabilities.
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
Large Language Models
Efficient ML
Theory
- Output-length prediction is essential for efficient LLM serving and resource allocation.
- Existing methods treat output length as a deterministic scalar, which is statistically misaligned with the heavy-tailed, prompt-conditioned nature of LLM outputs.
- The proposed ProD methods leverage multiple generations to create robust training targets, improving prediction accuracy.
- Empirical results show significant improvements in prediction quality over previous state-of-the-art methods.
Read more
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
Summary
This paper addresses the challenge of output-length prediction for large language models (LLMs), which is crucial for efficient serving in various applications. Traditional methods typically treat output length as a deterministic scalar, using a one-shot sampled length as the label. However, the authors demonstrate that output lengths are drawn from a prompt-conditioned distribution that exhibits heavy-tailed behavior, leading to significant variability in lengths generated from the same prompt. To improve prediction reliability, the authors propose a novel approach called Prompt-conditioned length Distribution (ProD), which utilizes multiple independent generations of the same prompt to construct training targets. Two variants of ProD are introduced: ProD-M, which uses the median of the generated lengths for robust point prediction, and ProD-D, which employs a distributional target that captures the full uncertainty of the prompt-conditioned output length. Theoretical analysis supports the effectiveness of these methods, showing that increasing the number of samples reduces estimation error. Experiments conducted on two LLMs, Qwen-2.5-7B and Llama-3-8B, across four benchmarks reveal that the proposed methods achieve up to a 25% reduction in average prediction error compared to state-of-the-art techniques.
Methodology
The authors introduce Prompt-conditioned length Distribution (ProD) methods, which involve two variants: ProD-M uses the median of multiple independent generations as a robust point prediction target, while ProD-D uses a distributional target based on a histogram of generated lengths. Both methods utilize the last-layer hidden states of the LLM for input representation, avoiding the need for auxiliary models.
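The robustness of the ProD-M target is easy to see on a toy example (the sampled lengths below are illustrative of a heavy-tailed prompt, not data from the paper):

```python
import statistics

# k = 5 independent generations for the same prompt; one heavy-tail outlier
sampled_lengths = [120, 135, 128, 900, 131]

one_shot_target = sampled_lengths[0]                 # what a single draw gives
prod_m_target = statistics.median(sampled_lengths)   # robust to the outlier
mean_target = statistics.mean(sampled_lengths)       # dragged up by the tail
```

The median stays near the typical length while the mean (and any single draw) can be dominated by rare very long generations, which is why a one-shot label is a noisy training target.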
Results
The proposed methods were tested on Qwen-2.5-7B and Llama-3-8B across four benchmarks, demonstrating a reduction in average prediction error by up to 25% compared to existing state-of-the-art methods.
Implications
The findings suggest that adopting a distributional approach to length prediction can enhance the efficiency of LLM serving, potentially leading to better resource management and reduced latency in applications that rely on LLMs.
Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm
Reinforcement Learning
Optimization
Graph Learning
- Introduces the Replay Suppression Diagnostic (RSD) to analyze replay phenomena in RL.
- Establishes a theoretical framework showing that replay cannot be suppressed without changing action distributions.
- Proposes Regret-Aware Policy Optimization (RAPO) to modify transition dynamics based on historical harm.
- Demonstrates significant reduction in replay and retention of task performance in graph diffusion tasks.
Read more
Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm
Summary
This paper addresses the challenge of safety in reinforcement learning (RL) systems, particularly in the context of delayed harm caused by harmful content in recommendation systems. The authors introduce the Replay Suppression Diagnostic (RSD), a protocol designed to isolate and analyze replay phenomena, where harmful cascades can re-emerge after a washout period due to stationary observable transitions. They establish a theoretical no-go result indicating that replay cannot be structurally suppressed without altering action distributions. To mitigate this issue, the authors propose Regret-Aware Policy Optimization (RAPO), which incorporates persistent harm-trace and scar fields to modify transition dynamics based on historical harm. The methodology is validated through experiments on graph diffusion tasks, demonstrating that RAPO significantly reduces replay while maintaining high task returns. The findings suggest that traditional safe RL methods may be insufficient in preventing replay under stationary conditions, highlighting the need for environment-level interventions to ensure safety in RL applications.
Methodology
The authors developed the Replay Suppression Diagnostic (RSD) to conduct controlled exposure-washout-replay experiments, isolating the effects of environment-side memory on replay phenomena. They proposed RAPO, which utilizes persistent harm-trace and scar fields to deform transition dynamics, thereby reducing the reachability of historically harmful states. The methodology was tested on graph diffusion tasks with varying node counts to evaluate the effectiveness of replay suppression.
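A hypothetical sketch of an environment-level harm trace of the kind RAPO uses to deform transition dynamics (the deposit/decay/suppression forms and constants below are assumptions for illustration):

```python
def deposit(scar, state, amount=1.0):
    # visiting a harmful state leaves a persistent "scar"
    scar[state] = scar.get(state, 0.0) + amount

def decay(scar, rate=0.99):
    # scars fade slowly, so memory persists well past a washout period
    for s in scar:
        scar[s] *= rate

def deformed_probs(probs, scar, strength=1.0):
    # down-weight transitions into historically harmful states, renormalize
    w = {s: p / (1.0 + strength * scar.get(s, 0.0)) for s, p in probs.items()}
    z = sum(w.values())
    return {s: v / z for s, v in w.items()}

scar = {}
deposit(scar, "harmful")
probs = {"harmful": 0.5, "safe": 0.5}
new = deformed_probs(probs, scar)
```

Because the scar lives in the environment rather than the policy, replay is suppressed even when the observable transition kernel and the policy's action distribution look stationary.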
Results
RAPO achieved a reduction in replay amplification from 0.98 (using a standard policy) to 0.33 while retaining 82% of the task return on 250-node graphs. The experiments showed that disabling the deformation mechanism during replay restored replay amplification, providing causal evidence for the effectiveness of the proposed method.
Implications
The findings suggest that RL systems, particularly in content recommendation and similar applications, require advanced safety mechanisms that account for historical interactions. The proposed methods could enhance the robustness of RL systems against delayed harmful effects, making them more suitable for real-world applications where safety is critical.
Cluster Attention for Graph Machine Learning
Graph Learning
- Introduction of Cluster Attention (CLATT) to enhance graph machine learning models.
- CLATT allows nodes to attend to all nodes within their clusters, improving receptive fields.
- Augmentation of MPNNs and Graph Transformers with CLATT leads to significant performance gains.
- Experimental validation on 12 real-world graph datasets demonstrates the effectiveness of CLATT.
Read more
Cluster Attention for Graph Machine Learning
Summary
This paper introduces Cluster Attention (CLATT), a novel attention mechanism designed to enhance graph machine learning (GML) models by integrating graph structure into the attention process. Traditional Message Passing Neural Networks (MPNNs) have a limited receptive field constrained by the number of message passing layers, while Graph Transformers utilize global attention but often neglect graph topology. CLATT addresses these limitations by clustering graph nodes using community detection algorithms, allowing nodes to attend to all other nodes within their respective clusters. This approach retains the inductive biases of graph structures while enabling longer-range dependencies. The authors demonstrate that augmenting both MPNNs and Graph Transformers with CLATT significantly improves performance across various real-world graph datasets, including those from the GraphLand benchmark. The paper discusses the implementation of CLATT, the selection of clustering algorithms, and presents experimental results that validate the effectiveness of the proposed method.
Methodology
The authors propose a graph-based attention mechanism called Cluster Attention (CLATT), which involves clustering graph nodes using community detection algorithms. Each node attends to other nodes within its cluster, facilitating information exchange while preserving graph-structure-based inductive biases. The methodology includes augmenting existing MPNNs and Graph Transformers with CLATT and evaluating the performance on diverse graph datasets.
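The cluster-restricted attention pattern reduces to a boolean mask over node pairs (a sketch; the community detection pass that produces `clusters` is assumed to have already run):

```python
def cluster_attention_mask(clusters):
    # mask[i][j] is True iff nodes i and j share a cluster, i.e. node i
    # is allowed to attend to node j under CLATT-style attention
    n = len(clusters)
    return [[clusters[i] == clusters[j] for j in range(n)] for i in range(n)]

clusters = [0, 0, 1, 1, 1]   # 5 nodes, 2 communities
mask = cluster_attention_mask(clusters)
```

In a transformer implementation this mask would be applied additively (with −inf on False entries) to the attention logits, so each node's receptive field is its whole community in a single layer rather than its k-hop neighborhood.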
Results
The experimental results show that models augmented with CLATT outperform baseline MPNNs and Graph Transformers across a range of 12 real-world graph datasets. The integration of cluster attention significantly enhances the models' ability to capture long-range dependencies while maintaining the advantages of graph structure.
Implications
The introduction of CLATT has potential applications in various domains that utilize graph-structured data, such as social networks, biological networks, and recommendation systems. By improving the performance of GML models, CLATT could lead to more effective solutions in real-world applications where understanding complex relationships within data is crucial.
Validated Synthetic Patient Generation for Small Longitudinal Cohorts: Coagulation Dynamics Across Pregnancy
Generative Models
Time Series
Theory
- Introduces multiplicity-weighted Stochastic Attention (SA) for synthetic patient generation.
- SA preserves the geometry of small longitudinal cohorts while generating new patient profiles.
- Synthetic patients were validated against real data and found to be statistically indistinguishable.
- SA enables targeted amplification of rare clinical subgroups without retraining.
Read more
Validated Synthetic Patient Generation for Small Longitudinal Cohorts: Coagulation Dynamics Across Pregnancy
Summary
This paper addresses the challenge of generating synthetic patient data for small longitudinal cohorts, particularly in maternal health where sample sizes are often limited. The authors introduce a novel generative framework called multiplicity-weighted Stochastic Attention (SA), which leverages modern Hopfield network theory to create synthetic patient profiles that maintain the statistical and structural integrity of real patient data. By embedding real patient profiles as memory patterns in a continuous energy landscape, SA generates new synthetic patients through Langevin dynamics, allowing for interpolation between existing profiles while preserving the original cohort's geometry. The method was validated using a longitudinal coagulation dataset from 23 pregnant patients, demonstrating that the synthetic patients produced were statistically and mechanistically indistinguishable from real patients across various validation tests. Furthermore, a mechanistic model calibrated on synthetic data performed comparably to one calibrated on real data, highlighting the utility of SA in augmenting small datasets for clinical modeling and analysis.
Methodology
The authors developed the multiplicity-weighted Stochastic Attention (SA) framework, which treats real patient profiles as memory patterns in a continuous energy landscape. Using Langevin dynamics, SA generates synthetic patients by interpolating between these stored patterns. The method incorporates multiplicity weights to amplify rare clinical subgroups during inference, allowing for effective data augmentation without the need for retraining.
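A one-dimensional toy version of sampling from such an attractor landscape with Langevin dynamics (the log-sum-exp energy, constants, and 1-D setting are illustrative stand-ins for the patient-profile memory patterns described above):

```python
import math, random

def energy(x, patterns, weights):
    # soft-min over weighted pattern wells; weights play the role of
    # multiplicity weights that deepen the wells of amplified subgroups
    return -math.log(sum(w * math.exp(-(x - p) ** 2)
                         for p, w in zip(patterns, weights)))

def grad(x, patterns, weights, h=1e-5):
    return (energy(x + h, patterns, weights) -
            energy(x - h, patterns, weights)) / (2 * h)

def langevin(x, patterns, weights, steps=500, lr=0.05, noise=0.05, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        x += -lr * grad(x, patterns, weights) + noise * rng.gauss(0.0, 1.0)
    return x

patterns, weights = [0.0, 4.0], [1.0, 1.0]   # two stored "patient" patterns
sample = langevin(1.9, patterns, weights)
```

The sampler drifts downhill into one of the pattern wells while the noise term lets it explore the basin, generating new points near, but not identical to, the stored profiles.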
Results
The synthetic patients generated by SA were validated through multiple independent tests, showing that they were statistically, structurally, and mechanistically indistinguishable from real patients. A downstream utility test indicated that a mechanistic model calibrated on synthetic patients could predict real patient outcomes as effectively as one calibrated on actual data.
Implications
The findings suggest that SA can significantly enhance the ability to conduct analyses and modeling in small longitudinal cohorts, particularly in fields like maternal health where data is scarce. This approach could lead to improved understanding and management of complex conditions associated with pregnancy, such as coagulation disorders.
SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion
Optimization
- SPAMoE effectively decouples high and low-frequency information flows in FWI.
- The Spectral-Preserving DINO Encoder maintains balanced frequency content, improving model stability.
- The Adaptive Spectral Mixture-of-Experts enhances multi-scale geological structure reconstruction.
- SPAMoE outperforms existing FWI methods, achieving a 54.1% reduction in average MAE.
Read more
SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion
Summary
The paper introduces SPAMoE (Spectrum-Aware Hybrid Operator Framework), a novel approach to Full-Waveform Inversion (FWI) that addresses the challenges of frequency entanglement in multi-scale geological features. Traditional deep learning methods, particularly Convolutional Neural Networks (CNNs) and single-paradigm Neural Operators (NOs), struggle with the complexity of FWI due to their inability to effectively separate high and low-frequency information. SPAMoE employs a Spectral-Preserving DINO Encoder that ensures a balanced frequency representation, thereby mitigating high-frequency collapse. Additionally, it features an Adaptive Spectral Mixture-of-Experts (MoE) mechanism that dynamically assigns frequency bands to a combination of different neural operators (FNO, MNO, and LNO). This framework effectively decouples frequency information, enhancing the model's ability to reconstruct complex geological structures. Experimental evaluations on the OpenFWI benchmark demonstrate that SPAMoE significantly outperforms existing methods, achieving a 54.1% reduction in average Mean Absolute Error (MAE) compared to the best reported baseline, showcasing its potential for high-resolution subsurface imaging.
Methodology
The methodology involves a two-pronged approach: first, the use of a Spectral-Preserving DINO Encoder that enforces a lower bound on the high-to-low frequency energy ratio of the encoded representation; second, the implementation of an Adaptive Spectral Mixture-of-Experts mechanism that includes frequency decomposition, routing, and operator modeling to effectively manage multi-scale geological features.
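The high-to-low frequency energy ratio that the encoder is described as lower-bounding can be computed directly from a spectrum (the 1-D signals, cutoff, and plain-DFT implementation below are illustrative):

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def high_low_ratio(x, cutoff):
    # energy above the cutoff bin divided by energy at or below it
    half = dft(x)[: len(x) // 2 + 1]   # non-negative frequencies only
    low = sum(abs(c) ** 2 for c in half[:cutoff])
    high = sum(abs(c) ** 2 for c in half[cutoff:])
    return high / (low + 1e-12)        # epsilon guards pure high-pass signals

smooth = [float(t) for t in range(8)]     # ramp: mostly low-frequency energy
spiky = [(-1.0) ** t for t in range(8)]   # alternating: mostly high-frequency
```

High-frequency collapse corresponds to this ratio shrinking toward zero in the encoded representation; enforcing a lower bound on it keeps the sharp structure needed for fine geological detail.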
Results
SPAMoE was evaluated on ten OpenFWI sub-datasets and demonstrated a substantial improvement over single neural operators and existing learning-based inversion baselines, achieving a 54.1% reduction in average MAE relative to the strongest baseline.
Implications
The proposed SPAMoE framework has significant implications for improving the efficiency and accuracy of Full-Waveform Inversion in geophysical applications, particularly in complex geological settings. Its design may also inspire future research in other domains requiring multi-scale feature handling.
Bias Redistribution in Visual Machine Unlearning: Does Forgetting One Group Harm Another?
Computer Vision
Theory
Interpretability
- Bias redistribution occurs when a model forgets a demographic group, often amplifying bias in other groups.
- The study reveals that forgetting the Young Female group primarily benefits the Old Female group, indicating a gender-dominant structure in CLIP's embedding space.
- Current unlearning methods struggle to achieve perfect forgetting due to the geometric relationships between embeddings.
- A novel redistribution score is introduced to quantify bias redistribution in machine unlearning.
Read more
Bias Redistribution in Visual Machine Unlearning: Does Forgetting One Group Harm Another?
Summary
This paper investigates the phenomenon of bias redistribution in the context of machine unlearning, particularly focusing on visual models like CLIP. Machine unlearning allows models to forget specific training data, which is crucial for compliance with privacy regulations. However, the authors explore whether forgetting a demographic group leads to the neutralization of bias or its redistribution to other groups. Using the CelebA dataset, they analyze three unlearning methods—Prompt Erasure, Prompt Reweighting, and Refusal Vector—across different CLIP model variants. The study finds that unlearning does not eliminate bias but rather redistributes it, predominantly along gender lines rather than age. For instance, when the Young Female group is forgotten, the model's performance shifts significantly to the Old Female group, indicating a gender-dominant structure in the embedding space. The authors also demonstrate that current unlearning methods, particularly projection-based ones, cannot achieve perfect forgetting due to the geometric entanglement of embeddings. This research highlights the need for improved unlearning strategies that consider the geometry of embedding spaces to avoid amplifying bias in retained groups.
Methodology
The authors conducted a systematic empirical study using three zero-shot unlearning methods on CLIP models (ViT-B/32, ViT-L/14, ViT-B/16) applied to the CelebA dataset. They measured shifts in per-group accuracy, demographic parity gaps, and introduced a redistribution score to quantify bias redistribution.
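One plausible form of such a redistribution score, offered purely as a hypothetical sketch (the paper's exact definition may differ, and the accuracy numbers below are invented): of the accuracy mass the forgotten group loses, how much re-appears as gains in retained groups.

```python
def redistribution_score(before, after, forgotten):
    # 0 means the lost accuracy simply vanished (bias neutralized);
    # values near 1 mean it was redistributed onto retained groups
    lost = before[forgotten] - after[forgotten]
    gained = sum(max(0.0, after[g] - before[g])
                 for g in before if g != forgotten)
    return gained / lost if lost > 0 else 0.0

before = {"young_f": 0.80, "old_f": 0.55, "young_m": 0.70, "old_m": 0.60}
after  = {"young_f": 0.10, "old_f": 0.95, "young_m": 0.72, "old_m": 0.58}
score = redistribution_score(before, after, "young_f")
```

In this toy example most of the forgotten group's lost accuracy reappears in the same-gender retained group, mirroring the gender-dominant redistribution the study reports.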
Results
The results indicate that unlearning does not eliminate bias but redistributes it primarily along gender lines. The removal of the Young Female group led to a significant transfer of classification accuracy to the Old Female group. The Refusal Vector method, while reducing redistribution, did not achieve complete forgetting and degraded retained performance.
Implications
These findings have significant implications for the development of fair machine learning systems, particularly in sensitive applications like facial recognition. The research suggests that without addressing the geometric properties of embedding spaces, unlearning methods may inadvertently perpetuate or amplify biases.
PolicyLong: Towards On-Policy Context Extension
NLP
Large Language Models
- PolicyLong introduces an on-policy framework for long-context training, addressing the off-policy gap in traditional methods.
- The framework employs an iterative self-curriculum that adapts to the model's evolving capabilities.
- Both positive contexts and hard negatives are dynamically selected based on the current model's entropy landscape.
- Experiments show significant performance improvements over baseline methods, especially at longer context lengths.
Read more
PolicyLong: Towards On-Policy Context Extension
Summary
The paper introduces PolicyLong, a novel framework aimed at enhancing the training of large language models (LLMs) by addressing the limitations of static long-context data construction. Traditional methods synthesize long-context data using a fixed model, leading to an off-policy gap as the model evolves and its predictive capabilities change. PolicyLong shifts this paradigm to an on-policy approach, where data construction is iteratively refreshed based on the current model's performance. This dynamic process allows the model to develop an implicit self-curriculum, progressively focusing on more challenging long-range dependencies. The methodology involves a multi-stage training process where the model re-evaluates and retrieves data, ensuring that both positive contexts and hard negatives are aligned with its current learning state. Experimental results demonstrate that PolicyLong significantly outperforms existing methods like EntropyLong and NExtLong, particularly with longer context lengths, confirming the effectiveness of on-policy data evolution.
Methodology
PolicyLong employs a multi-stage iterative process where the current model re-executes data screening, including entropy computation, retrieval, and verification, at each stage of training. This allows the model to adaptively select training data that reflects its current capabilities, creating a self-curriculum that evolves as the model improves.
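The iterative refresh can be sketched as a loop in which the current checkpoint re-scores candidate contexts at every stage; the scoring function and data shapes below are illustrative stand-ins, not PolicyLong's actual implementation:

```python
import random

def entropy_gain(model, doc, context):
    # Toy stand-in for the paper's criterion: how much a candidate
    # context would reduce the current model's predictive entropy.
    # The real system queries the LLM itself; here a seeded random
    # score keeps the sketch deterministic within a run.
    rng = random.Random(f"{model}/{doc}/{context}")
    return rng.random()

def build_stage_data(model, corpus, k=2):
    """One on-policy refresh: re-score candidate contexts with the
    *current* model and keep positives and hard negatives (sketch)."""
    data = []
    for doc, candidates in corpus.items():
        scored = sorted(candidates,
                        key=lambda c: entropy_gain(model, doc, c),
                        reverse=True)
        positives = scored[:k]           # most helpful long-range contexts
        hard_negatives = scored[k:2 * k] # distractors near the decision boundary
        data.append((doc, positives, hard_negatives))
    return data

corpus = {"doc0": ["c1", "c2", "c3", "c4"], "doc1": ["c5", "c6", "c7", "c8"]}
stage1 = build_stage_data("model_v1", corpus)
stage2 = build_stage_data("model_v2", corpus)  # refreshed after a model update
```

The point of the loop is that `stage2` is constructed by the updated model, so the training data tracks the model's evolving entropy landscape instead of a frozen snapshot.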
Results
Experiments on the RULER, HELMET, and LongBench-v2 benchmarks with the Qwen2.5-3B model show that PolicyLong consistently outperforms baselines such as EntropyLong and NExtLong. Notably, the gains grow with context length, exemplified by a +2.54 improvement at 128K tokens on RULER.
Implications
The findings suggest that on-policy data construction can significantly enhance the training of large language models, particularly in tasks requiring long-context understanding. This approach may lead to more effective LLMs capable of handling complex dependencies in various applications, including natural language processing and information retrieval.
Is your algorithm unlearning or untraining?
Theory
- Distinction between Unlearning and Untraining is crucial for clarity in research.
- Untraining removes the influence of specific examples, while Unlearning targets the underlying distribution.
- Misunderstanding these terms can lead to inappropriate metrics and hinder progress.
- The paper aims to initiate discussions on technical definitions and overlooked research questions.
Read more
Is your algorithm unlearning or untraining?
Summary
This paper addresses the growing interest in 'machine unlearning', which refers to the ability to delete specific data points or behaviors from a trained model. The authors argue that the term 'unlearning' has become overloaded, leading to confusion in the literature due to the existence of two distinct problem formulations: Unlearning and Untraining. Untraining focuses on reversing the effect of specific examples in a forget set, while Unlearning aims to remove the influence of the entire underlying distribution from which those examples were drawn. The paper highlights the technical definitions of these concepts, maps existing literature to each formulation, and discusses the implications of this distinction for research and practice. By clarifying these terms, the authors hope to enhance understanding and facilitate progress in the field of unlearning, particularly in light of regulatory requirements like the EU's General Data Protection Regulation (GDPR).
Methodology
The authors review existing literature on unlearning and untraining, providing technical definitions and mapping various problem settings to these definitions. They illustrate the differences between the two concepts through examples and discuss their implications for future research.
Results
The paper successfully establishes a clear distinction between Unlearning and Untraining, providing a framework for understanding the different approaches in the literature. This distinction is expected to improve the evaluation of algorithms and foster further research in the area of machine unlearning.
Implications
The findings have significant implications for the development of machine learning models that comply with privacy regulations, as well as for enhancing model safety by removing harmful behaviors. The clarification of these terms may lead to more effective methodologies for data deletion and model adjustment.
The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
Theory
Optimization
- The spectral edge transitions from a gradient-driven to a weight-decay-driven state during training.
- At grokking, gradient and weight decay align along the spectral edge, indicating a phase transition.
- Post-grok, the spectral edge's orientation is crucial for model performance, while displacements along it are not.
- Three universality classes of spectral edges are identified based on their functional content.
Read more
The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
Summary
This paper investigates the spectral edge, defined as the dominant direction of the Gram matrix of parameter updates in neural network training, and its role in phase transitions during learning. The author decomposes the spectral edge into gradient and weight-decay components across two sequence tasks: Dyck-1 balanced parentheses and SCAN compositional generalization. A two-phase lifecycle is identified: initially, the spectral edge is gradient-driven and carries task-relevant information, but at the point of 'grokking,' it transitions to a weight-decay-driven compression mode. This transition is characterized by a significant alignment of gradient and weight decay along the spectral edge, marking a phase transition. The post-grok edge exhibits a paradox where perturbations along it have minimal impact, yet its removal leads to catastrophic performance degradation. The paper introduces three universality classes based on the functional content of the edges and establishes causal mechanisms through various experiments, including grad-WD decomposition and perturbation analysis. The findings replicate across multiple seeds, supporting the central claim of the two-phase lifecycle and the importance of the spectral edge in neural network training dynamics.
Methodology
The study employs a combination of gradient-weight decay decomposition, ablation studies, perturbation analysis, and Hessian curvature measurements to explore the dynamics of the spectral edge. It analyzes two sequence tasks using neural network architectures and computes the Gram matrix of parameter updates to identify the spectral edge and its properties.
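The central object, the dominant direction of the Gram (second-moment) matrix of parameter updates, can be extracted with plain power iteration; the toy update vectors below are illustrative:

```python
def power_iteration(matvec, dim, iters=200):
    """Leading eigenvector of a symmetric PSD operator (sketch)."""
    v = [1.0] * dim
    for _ in range(iters):
        w = matvec(v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def spectral_edge(updates):
    """Dominant direction of sum_t u_t u_t^T over the parameter
    updates u_t (assumed reading of the paper's definition)."""
    dim = len(updates[0])
    def matvec(v):
        out = [0.0] * dim
        for u in updates:
            dot = sum(a * b for a, b in zip(u, v))
            for i in range(dim):
                out[i] += dot * u[i]
        return out
    return power_iteration(matvec, dim)

# Toy updates dominated by the first coordinate direction
updates = [[1.0, 0.1, 0.0], [0.9, -0.1, 0.05], [1.1, 0.0, -0.05]]
edge = spectral_edge(updates)
```

The grad-WD decomposition then amounts to projecting the gradient and weight-decay components of each update onto `edge` and tracking their relative energy over training.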
Results
The analysis reveals a sharp transition in the spectral edge's characteristics at grokking, with gradient energy dropping significantly while weight decay dominates. The post-grok edge is essential for maintaining model performance, with perturbations along it causing negligible changes, while its removal leads to a drastic drop in accuracy. The findings are consistent across multiple experimental seeds.
Implications
Understanding the lifecycle of the spectral edge can inform the design of more effective training strategies for neural networks, particularly in optimizing the balance between gradient learning and weight decay. This could lead to improved performance in various tasks, especially those requiring compositional generalization.
Implicit Regularization and Generalization in Overparameterized Neural Networks
Theory
Optimization
- Overparameterized neural networks can generalize well despite classical predictions of overfitting.
- Implicit regularization through optimization algorithms like SGD influences generalization performance.
- Smaller batch sizes lead to flatter minima and lower test errors.
- Sparse subnetworks can achieve performance comparable to full models, highlighting effective capacity constraints.
Read more
Implicit Regularization and Generalization in Overparameterized Neural Networks
Summary
This paper addresses the paradox of overparameterized neural networks, which often exhibit strong generalization performance despite classical statistical learning theory predicting severe overfitting. The study investigates the mechanisms behind this phenomenon, focusing on implicit regularization and optimization dynamics. Through controlled experiments using stochastic gradient descent (SGD) on datasets like CIFAR-10 and MNIST, the author explores the impact of batch sizes, loss landscape geometry, and theoretical frameworks such as the Neural Tangent Kernel (NTK). The findings reveal that smaller batch sizes lead to lower test errors and flatter minima, suggesting that the interaction between network architecture, optimization algorithms, and loss landscape geometry plays a crucial role in generalization. Additionally, sparse subnetworks retaining only 10% of parameters can achieve performance close to full models, indicating that effective capacity may be constrained during training. The results call for revised learning-theoretic frameworks to better explain generalization in high-dimensional model regimes.
Methodology
The study employs controlled computational experiments using stochastic gradient descent (SGD) across varying batch sizes. It analyzes the geometry of the loss landscape through Hessian eigenvalue estimation and weight perturbation, and explores theoretical perspectives from the Neural Tangent Kernel (NTK) framework. Experiments are conducted on CIFAR-10 and MNIST datasets with multiple random seeds.
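A perturbation-based flatness probe of the kind the study uses can be sketched as follows, with toy quadratic losses standing in for trained networks (an assumption for illustration):

```python
import random

def sharpness(loss, weights, sigma=0.01, samples=100, seed=0):
    """Flatness probe: average loss increase under Gaussian weight
    noise. A common proxy; the paper also estimates Hessian
    eigenvalues directly."""
    rng = random.Random(seed)
    base = loss(weights)
    total = 0.0
    for _ in range(samples):
        noisy = [w + rng.gauss(0, sigma) for w in weights]
        total += loss(noisy) - base
    return total / samples

# Toy minima: flat (curvature 1) vs sharp (curvature 100)
flat  = lambda w: sum(1.0 * x * x for x in w)
sharp = lambda w: sum(100.0 * x * x for x in w)
w0 = [0.0, 0.0]
flat_inc = sharpness(flat, w0)
sharp_inc = sharpness(sharp, w0)  # much larger increase at the sharp minimum
```

Applied to checkpoints trained with different batch sizes, this probe is what distinguishes the flatter minima reached by small-batch SGD from sharper large-batch solutions.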
Results
The experiments demonstrate that generalization in overparameterized models is significantly affected by the interaction of network architecture, optimization methods, and the geometry of the loss landscape. Smaller batch sizes consistently yield lower test errors and flatter minima. Furthermore, sparse subnetworks, retaining only 10% of the original parameters, achieve performance within 1.15 percentage points of the full model when retrained.
Implications
The findings suggest that understanding the mechanisms of implicit regularization and the geometry of loss landscapes can lead to improved training strategies for deep learning models. This could influence the design of neural network architectures and optimization algorithms, ultimately enhancing generalization in various applications.
Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance
Federated Learning
Time Series
Efficient ML
- AeroConv1D model designed specifically for aerospace predictive maintenance.
- INT4 quantization achieves accuracy similar to FP32 while reducing communication costs by 8×.
- Non-IID evaluation reveals the true performance of quantization methods, contrasting with IID assumptions.
- INT2 quantization, while showing lower MAE, leads to significant instability in performance metrics.
Read more
Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance
Summary
This paper explores the challenges of deploying federated learning (FL) for predictive maintenance in aerospace settings, particularly focusing on the trade-off between accuracy and communication efficiency when using symmetric uniform quantization. The study introduces AeroConv1D, a lightweight 1-D convolutional model designed for FPGA inference, and evaluates its performance under varying quantization levels (32, 8, 4, and 2 bits) using the NASA C-MAPSS benchmark. A rigorous multi-seed evaluation reveals that INT4 quantization achieves accuracy comparable to full precision (FP32) while significantly reducing communication costs (8× reduction). The paper highlights the methodological issue of IID client partitioning, which can misrepresent the performance of quantization techniques. The findings indicate that while INT2 quantization shows some improvement in mean absolute error (MAE), it suffers from instability in NASA scores, rendering it impractical. Additionally, FPGA resource projections confirm that INT4 quantization is feasible for deployment on hardware with limited resources, paving the way for efficient FL pipelines in aerospace applications.
Methodology
The study employs a multi-seed evaluation approach (N = 10) to assess the performance of AeroConv1D under different quantization levels on the NASA C-MAPSS dataset. Statistical significance is determined using paired t-tests, and the impact of IID versus Non-IID client partitioning on performance is analyzed.
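Symmetric uniform quantization of a model update can be sketched in a few lines (a generic scheme consistent with the bit-widths studied; the paper's exact scaling choices may differ):

```python
def quantize_dequantize(values, bits):
    """Symmetric uniform quantization: map values to signed integers
    with one shared scale per tensor, then reconstruct."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for INT4, 1 for INT2
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [scale * qi for qi in q]

update = [0.8, -0.31, 0.05, -0.77]
int4 = quantize_dequantize(update, 4)  # 8x fewer bits than FP32 per value
int2 = quantize_dequantize(update, 2)  # coarse: only three levels survive
```

INT2 collapses every value to one of three levels, which is consistent with the instability the paper observes at that bit-width, whereas INT4 keeps the reconstruction close to the original update.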
Results
The results demonstrate that INT4 quantization maintains accuracy comparable to FP32 across different subsets of the dataset, with p-values indicating no statistically significant difference. In contrast, INT2 quantization shows a reduction in MAE but suffers from high instability in NASA scores, making it unsuitable for practical applications. FPGA resource analysis confirms that INT4 can be effectively implemented within the constraints of the Xilinx ZCU102 hardware.
Implications
The findings suggest that careful consideration of quantization methods is crucial for deploying federated learning in real-world aerospace applications. The ability to maintain accuracy while reducing communication overhead can facilitate the use of FL in bandwidth-constrained environments, enhancing predictive maintenance strategies.
A Direct Approach for Handling Contextual Bandits with Latent State Dynamics
Reinforcement Learning
Theory
Optimization
- Introduces a general model for linear contextual bandits with latent-state dynamics, allowing rewards to depend on contexts, actions, and hidden states.
- Achieves stronger high-probability regret bounds compared to previous work, which relied on simplified models.
- Demonstrates that the belief-dependent reward model is a significant simplification and does not capture the complexities of the problem.
- Provides a more direct and efficient methodology for handling contextual bandits with latent state dynamics.
Read more
A Direct Approach for Handling Contextual Bandits with Latent State Dynamics
Summary
This paper revisits the finite-armed linear bandit model introduced by Nelson et al. (2022), which incorporates contexts and rewards governed by a hidden Markov model (HMM). The authors critique the simplifications made by Nelson et al., particularly the reliance on posterior probabilities over hidden states instead of directly modeling the hidden states. They propose a more comprehensive model that captures direct dependencies on hidden states, leading to stronger high-probability regret bounds for an adaptive strategy that estimates HMM parameters online. The authors demonstrate that their approach not only generalizes the previous work but also provides a more efficient treatment of the simpler model as a special case. The paper emphasizes the importance of considering the actual latent states in reward modeling and presents a careful analysis of the stochastic properties of beliefs, resulting in a strategy that balances exploration and exploitation effectively.
Methodology
The authors develop a new framework for linear contextual bandits that incorporates direct dependencies on latent states. They perform a detailed analysis of the stochastic properties of beliefs and propose a staged strategy that balances exploration and exploitation. The methodology includes online estimation of HMM parameters, leading to high-probability regret bounds that are independent of the reward functions.
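The belief over hidden states that the analysis centers on evolves by standard HMM forward filtering; a minimal sketch (not the authors' online estimator, which additionally learns the HMM parameters):

```python
def belief_update(belief, transition, emission, obs):
    """One step of HMM forward filtering over latent states."""
    n = len(belief)
    # Predict: propagate the belief through the transition kernel
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Correct: reweight by the likelihood of the new observation
    unnorm = [predicted[j] * emission[j][obs] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

T = [[0.9, 0.1], [0.2, 0.8]]       # latent-state transition matrix
E = [[0.7, 0.3], [0.1, 0.9]]       # P(observation | state)
b = [0.5, 0.5]
b = belief_update(b, T, E, obs=0)  # observing 0 shifts mass toward state 0
```

The paper's critique is precisely that rewards should depend on the latent state itself, with this filtered belief treated as an estimate rather than the reward's argument.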
Results
The paper presents high-probability regret bounds that are sublinear in a complex model where rewards depend on states, contexts, and actions. The results indicate that the proposed approach outperforms previous methods by avoiding unnecessary complexities and providing a clearer understanding of the underlying dynamics.
Implications
The findings have significant implications for the design of algorithms in contextual bandit settings, particularly in environments where latent states change frequently. The approach can be applied to various applications in adaptive learning systems, recommendation systems, and dynamic decision-making processes.
The Impact of Dimensionality on the Stability of Node Embeddings
Graph Learning
- Dimensionality significantly affects the stability of node embeddings.
- Different embedding methods exhibit varying stability patterns with increased dimensions.
- Higher dimensionality does not guarantee better performance in downstream tasks.
- The study emphasizes the importance of selecting appropriate embedding dimensions.
Read more
The Impact of Dimensionality on the Stability of Node Embeddings
Summary
This paper investigates the influence of embedding dimensionality on the stability and performance of node embeddings generated by various methods. Previous research has shown that node embeddings can vary significantly even with identical training parameters due to randomness in the training process. The authors systematically evaluate five popular node embedding techniques—ASNE, DGI, GraphSAGE, node2vec, and VERSE—across multiple datasets and varying dimensions. They assess stability from both representational and functional perspectives, alongside performance metrics for downstream tasks such as node classification and link prediction. The findings reveal that embedding stability is highly dependent on dimensionality, with different methods exhibiting distinct patterns; for instance, node2vec and ASNE tend to become more stable with higher dimensions, while others do not follow this trend. Importantly, the study highlights that maximum stability does not always correlate with optimal task performance, emphasizing the need for careful selection of embedding dimensions. The authors provide code for reproducibility, contributing to the understanding of trade-offs in graph representation learning.
Methodology
The authors conducted a systematic evaluation of five node embedding methods (ASNE, DGI, GraphSAGE, node2vec, and VERSE) across multiple datasets. They varied the embedding dimensions and assessed stability from both representational and functional perspectives, while also measuring performance on downstream tasks like node classification and link prediction.
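Representational stability between two runs can be probed, for example, by k-nearest-neighbour overlap; this particular measure is an illustrative choice and may differ from the metrics the authors use:

```python
def knn(emb, node, k):
    """Indices of the k nearest neighbours of `node` (squared Euclidean)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(emb[node], emb[j])), j)
                   for j in range(len(emb)) if j != node)
    return {j for _, j in dists[:k]}

def knn_overlap_stability(emb_a, emb_b, k=2):
    """Stability proxy between two training runs: mean Jaccard
    overlap of each node's k-NN set across the runs."""
    n = len(emb_a)
    overlaps = []
    for node in range(n):
        a, b = knn(emb_a, node, k), knn(emb_b, node, k)
        overlaps.append(len(a & b) / len(a | b))
    return sum(overlaps) / n

# Two runs of the same method differing only by small retraining jitter
run1 = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
run2 = [[0.0, 0.1], [0.1, 0.1], [1.0, 0.9], [1.1, 0.9]]
score = knn_overlap_stability(run1, run2)
```

Repeating this comparison across seeds and embedding dimensions is what surfaces the dimension-dependent stability patterns reported for node2vec, ASNE, and the other methods.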
Results
The results indicate that embedding stability varies significantly with dimensionality, with methods like node2vec and ASNE showing increased stability at higher dimensions, while others do not. Furthermore, the study found that maximum stability does not necessarily align with optimal performance in downstream tasks.
Implications
These findings suggest that researchers and practitioners should carefully consider the dimensionality of node embeddings when designing models for graph representation learning. The insights provided can help in balancing stability, performance, and computational efficiency in various applications involving graph data.
Learning Markov Processes as Sum-of-Square Forms for Analytical Belief Propagation
Theory
Efficient ML
Time Series
- Introduces a functional modeling framework using Sum-of-Squares forms for analytical belief propagation in Markov processes.
- Provides a theoretical analysis of the limitations of SoS for conditional density estimation.
- Presents a novel functional form that alleviates restrictions of SoS while preserving theoretical attributes.
- Demonstrates a training method that ensures valid distribution constraints are met.
Read more
Learning Markov Processes as Sum-of-Square Forms for Analytical Belief Propagation
Summary
This paper addresses the challenge of propagating probability density functions (beliefs) through Markov process models, which is often analytically infeasible for continuous state spaces. The authors propose a novel functional modeling framework that utilizes sparse Sum-of-Squares (SoS) forms for valid conditional density estimation, allowing for analytical belief propagation. They analyze the theoretical limitations of using SoS forms and introduce a new functional form that overcomes these restrictions while maintaining desirable theoretical properties. The proposed architecture enables simultaneous learning of basis functions and coefficients, ensuring compliance with normalization and non-negativity constraints. Experimental results demonstrate that the method achieves accuracy comparable to state-of-the-art approaches while significantly reducing memory requirements in low-dimensional spaces and successfully scaling to 12D systems, where existing methods struggle.
Methodology
The authors leverage Sum-of-Squares theory to create a valid conditional density estimator that allows for the joint optimization of coefficients and basis functions. This approach is designed to ensure analytical belief propagation while adhering to normalization and non-negativity constraints.
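The core construction can be illustrated in one dimension: squaring a linear combination of basis functions guarantees non-negativity, and only normalization remains. The numeric normalization below is a simplification; the paper's point is that SoS forms admit analytical treatment:

```python
def sos_density(coeffs, basis):
    """Sum-of-Squares density sketch: p(x) proportional to
    (sum_k c_k phi_k(x))^2, non-negative by construction.
    Normalization is done numerically here over [-3, 3]."""
    def unnorm(x):
        return sum(c * phi(x) for c, phi in zip(coeffs, basis)) ** 2
    xs = [-3 + 6 * i / 1000 for i in range(1001)]
    ys = [unnorm(x) for x in xs]
    z = sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(1000))
    return lambda x: unnorm(x) / z

basis = [lambda x: 1.0, lambda x: x]  # toy polynomial basis functions
p = sos_density([1.0, 0.5], basis)
density_at_zero = p(0.0)
```

Because the squared form is a polynomial in the basis, moments and pushforwards through a Markov kernel stay in closed form, which is what enables analytical belief propagation.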
Results
The proposed method achieves accuracy on par with state-of-the-art techniques while requiring significantly less memory in low-dimensional scenarios. It also scales effectively to 12-dimensional systems, overcoming limitations faced by existing methods that fail beyond 2D.
Implications
The findings suggest that the proposed framework can enhance the predictive capabilities of Markov process models in various applications, particularly in high-dimensional settings where traditional methods struggle. This could lead to improvements in fields such as robotics, time series analysis, and other areas requiring robust probabilistic modeling.
Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
Multimodal
Interpretability
Optimization
- Tree-of-Evidence (ToE) is introduced as a novel algorithm for improving interpretability in multimodal models.
- ToE employs a beam search strategy to identify minimal evidence sets necessary for model predictions.
- The algorithm retains high predictive performance while providing auditable evidence traces.
- ToE adapts its search strategy based on the ambiguity of the data, effectively integrating multiple modalities.
Read more
Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
Summary
The paper introduces Tree-of-Evidence (ToE), an innovative inference-time search algorithm designed to enhance the interpretability of Large Multimodal Models (LMMs) in high-stakes domains like healthcare. Traditional interpretability methods often fail to accurately represent the decision-making processes of these models, especially when dealing with heterogeneous data types such as time-series and text. ToE addresses this by framing interpretability as a discrete optimization problem, utilizing lightweight Evidence Bottlenecks to score groups of data and employing a beam search to identify the minimal evidence set necessary for reproducing model predictions. The authors evaluate ToE across six tasks involving three datasets (MIMIC-IV, eICU, and LEMMA-RCA) and demonstrate that it maintains over 98% of the full-model AUROC while using as few as five evidence units. The results indicate that ToE not only provides auditable evidence traces but also achieves higher decision agreement and lower probability fidelity error compared to existing methods. Qualitative analyses reveal that ToE adapts its search strategy based on the nature of the evidence, effectively balancing the use of vital signs and textual information. This work presents a practical approach for auditing multimodal models, ensuring that predictions can be traced back to specific, verifiable pieces of evidence.
Methodology
The methodology involves training modality-specific classifiers and lightweight Evidence Bottlenecks that score evidence units. During inference, ToE performs a beam search to construct a compact evidence set, optimizing for decision agreement, probability stability, and evidence sparsity. This structured approach allows the model to focus on dynamic evidence while considering the global context.
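The search itself is a standard beam search over growing evidence sets; the additive scorer below is a hypothetical stand-in for the paper's agreement, fidelity, and sparsity objective:

```python
def beam_search_evidence(units, score, beam_width=2, max_size=3, target=0.9):
    """ToE-style sketch: grow evidence sets one unit at a time,
    keep the top `beam_width` partial sets per level, and stop at
    the smallest set whose score reaches `target`."""
    beam = [frozenset()]
    for _ in range(max_size):
        candidates = {s | {u} for s in beam for u in units if u not in s}
        ranked = sorted(candidates, key=score, reverse=True)
        if score(ranked[0]) >= target:
            return ranked[0]  # smallest depth wins -> compact evidence set
        beam = ranked[:beam_width]
    return max(beam, key=score)

# Hypothetical per-unit contributions to decision agreement
weights = {"hr_spike": 0.4, "note_sepsis": 0.35, "bp_drop": 0.1, "lab_lactate": 0.05}
def score(s):
    return sum(weights[u] for u in s)

evidence = beam_search_evidence(list(weights), score, target=0.7)
```

The returned set is the auditable trace: the smallest group of evidence units that reproduces the full model's decision to the required level.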
Results
ToE was evaluated across six tasks and demonstrated the ability to maintain over 98% AUROC with as few as five evidence units. It achieved higher decision agreement and lower probability fidelity error than existing interpretability methods, providing clear and auditable evidence traces.
Implications
The findings suggest that ToE can significantly enhance the interpretability of multimodal models in critical applications such as healthcare, where understanding the rationale behind predictions is essential. This could lead to more trustworthy AI systems capable of providing transparent decision-making processes.
Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
Graph Learning
Optimization
Theory
- Introduction of RIA, a method for OoD generalization under covariate shift.
- Adversarial label invariant data augmentations are used to create diverse training environments.
- The methodology includes an alternating gradient descent-ascent algorithm for optimization.
- Extensive experiments show RIA outperforms existing OoD generalization approaches.
Read more
Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
Summary
This paper addresses the challenge of out-of-distribution (OoD) generalization, particularly under covariate shift, where the input data distribution changes while the underlying concept remains the same. The authors propose a novel method called RIA (Regularization for Invariance with Adversarial training) that employs adversarial label invariant data augmentations to create diverse training environments. This approach is motivated by an analogy to Q-learning and aims to prevent the model from collapsing to an empirical risk minimization (ERM) solution, which is common when training environments are limited. The authors develop an alternating gradient descent-ascent algorithm to optimize the learning process and conduct extensive experiments on graph classification tasks with various synthetic and natural distribution shifts. The results demonstrate that RIA significantly improves accuracy compared to existing OoD generalization methods, showcasing its effectiveness in enhancing generalizability across different environments.
Methodology
The authors propose RIA, which utilizes adversarial training to generate counterfactual environments that are challenging for the model to learn from. This is achieved through adversarial label invariant data augmentations, which help maintain the model's performance across varying environments. An alternating gradient descent-ascent algorithm is employed to optimize the learning process.
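The alternating gradient descent-ascent pattern can be sketched on a toy saddle objective (the actual RIA objective couples a graph learner with adversarial augmentation parameters):

```python
def alternating_gda(grad_x, grad_y, x, y, lr=0.1, steps=200):
    """Alternating GDA: the learner descends on x while the
    adversarial augmenter ascends on y."""
    for _ in range(steps):
        x = x - lr * grad_x(x, y)  # descent step (model parameters)
        y = y + lr * grad_y(x, y)  # ascent step (augmentation parameters)
    return x, y

# Toy saddle objective f(x, y) = x**2 - y**2 with saddle point (0, 0)
gx = lambda x, y: 2 * x   # df/dx
gy = lambda x, y: -2 * y  # df/dy
x, y = alternating_gda(gx, gy, x=1.0, y=1.0)
```

In RIA the ascent player produces label-invariant augmentations that are maximally hard for the current model, which is what prevents collapse to the plain ERM solution.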
Results
The experiments conducted on various synthetic and natural datasets for graph classification reveal that RIA achieves significantly higher accuracy compared to baseline OoD generalization methods, demonstrating its effectiveness in improving generalization under covariate shift.
Implications
The findings suggest that RIA can be a valuable tool for improving machine learning models' robustness to distribution shifts, particularly in applications involving graph data. This has potential implications for fields such as social network analysis, biological network modeling, and any domain where graph structures are prevalent.
SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
Large Language Models
Optimization
Efficient ML
- SAGE addresses the memory bottleneck of the AdamW optimizer in LLM training.
- The optimizer effectively manages the unique challenges posed by embedding layers' sparse, high-variance gradients.
- SAGE combines a Lion-style update with a memory-efficient adaptive scale for improved stability and convergence.
- The proposed method outperforms existing optimizers in terms of perplexity and memory efficiency.
Read more
SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
Summary
The paper introduces SAGE (Sign Adaptive GradiEnt), a novel optimizer designed to address the memory bottleneck associated with the AdamW optimizer in the training of Large Language Models (LLMs). The authors identify a critical issue with existing light-state optimizers, particularly in handling the sparse, high-variance gradients of embedding layers, which leads to a hybrid design that reverts to AdamW, negating memory efficiency gains. SAGE replaces AdamW in this hybrid structure by utilizing a Lion-style update direction combined with a new memory-efficient O(d) adaptive scale that stabilizes high-variance dimensions. This design allows SAGE to achieve better convergence and significantly reduce optimizer state memory. The authors demonstrate that their SAGE-based hybrid optimizer outperforms existing methods, including SinkGD and AdamW, achieving state-of-the-art perplexity on Llama models with up to 1.3 billion parameters while maintaining a lower memory footprint.
Methodology
SAGE employs a hybrid optimization approach that maintains a single O(V d) moment state while replacing the second-moment state of AdamW with a novel O(d) dimension-wise adaptive damper. This damper tracks the mean absolute gradient and is theoretically bounded, allowing for effective stabilization of high-variance gradients in embedding layers.
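A SAGE-like step might look as follows; the precise interaction between the sign update and the damper is an assumption here, since the summary does not reproduce the paper's exact formula:

```python
def sage_step(params, grads, m, d, lr=1e-3, beta=0.9, gamma=0.99, eps=1e-8):
    """Sketch of a SAGE-like update: a Lion-style sign step whose
    magnitude is damped per dimension by a running mean absolute
    gradient (the O(d) state replacing AdamW's second moment)."""
    new_params, new_m, new_d = [], [], []
    for p, g, mi, di in zip(params, grads, m, d):
        mi = beta * mi + (1 - beta) * g           # momentum (O(d) state)
        di = gamma * di + (1 - gamma) * abs(g)    # mean-|g| damper (O(d) state)
        sign = (mi > 0) - (mi < 0)
        step = lr * sign * min(1.0, abs(mi) / (di + eps))  # damp noisy dims
        new_params.append(p - step)
        new_m.append(mi)
        new_d.append(di)
    return new_params, new_m, new_d

params, m, d = [0.5, -0.2], [0.0, 0.0], [0.0, 0.0]
params, m, d = sage_step(params, [0.1, -0.3], m, d)
```

The memory saving comes from the state shapes: two O(d) vectors per layer instead of AdamW's two full-precision moment tensors.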
Results
The SAGE-based hybrid optimizer achieved new state-of-the-art perplexity on Llama models with up to 1.3 billion parameters, outperforming all baseline optimizers, including SinkGD and AdamW, while significantly reducing the memory required for optimizer states.
Implications
The development of SAGE has significant implications for the training of large-scale language models, enabling more efficient use of memory resources and potentially allowing for larger batch sizes and model scaling. This could lead to advancements in the capabilities and applications of LLMs in various NLP tasks.
Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
Optimization
Theory
- Gradient flow preserves conservation laws in L-layer ReLU networks, confining optimization trajectories.
- Discrete gradient descent breaks these conservation laws, leading to a drift characterized by a non-integer exponent α.
- A closed-form spectral crossover formula for drift is derived, explaining the observed behavior across different architectures.
- Cross-entropy loss is shown to induce exponential Hessian spectral compression, independent of training set size.
Read more
Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
Summary
This paper addresses the paradox of why gradient descent effectively finds good solutions in non-convex neural network optimization, despite the NP-hard nature of the problem. The author demonstrates that gradient flow on L-layer ReLU networks preserves L−1 conservation laws, which confine optimization trajectories to lower-dimensional manifolds. However, under discrete gradient descent, these conservation laws break, leading to a drift that scales as η^α, where α varies between 1.1 and 1.6 depending on the network architecture, loss function, and width. The paper introduces a spectral theory to explain this drift, decomposing it into a term η²·S(η), where S(η) has a closed-form spectral crossover formula. The author validates the derived mode coefficients for both linear and ReLU networks and shows that cross-entropy loss induces exponential Hessian spectral compression, with a timescale independent of dataset size. The study identifies two dynamical regimes in optimization, separated by a transition dependent on the width of the network. Overall, the findings provide insights into the mechanisms that allow practical neural networks to navigate complex loss landscapes effectively.
Methodology
The paper employs a theoretical approach, deriving conservation laws for gradient flow in neural networks and analyzing the drift caused by discrete gradient descent. It utilizes spectral analysis to formulate a closed-form expression for the drift and validates the findings through extensive experiments across various network architectures.
Results
The study confirms that conservation laws are maintained under gradient flow but are broken under discrete gradient descent, leading to a drift that scales with the learning rate. The derived spectral crossover formula accurately predicts the drift behavior, and the findings are validated through 23 experiments, demonstrating consistency across different network types and loss functions.
Implications
The insights from this research could inform the design of more effective optimization algorithms for training neural networks, particularly in understanding how to navigate non-convex landscapes. Additionally, the findings may contribute to the development of strategies for improving convergence rates and training stability in deep learning models.
Approximation of the Basset force in the Maxey-Riley-Gatignol equations via universal differential equations
Theory
Optimization
Time Series
- Introduces a neural network-based approximation for the Basset force in MaRGE.
- Transforms complex integro-differential equations into solvable ordinary differential equations.
- Compares FNN and LSTM architectures for modeling the Basset force's history effects.
- Demonstrates the applicability of universal differential equations in fluid dynamics.
Summary
This paper addresses the challenges associated with the Basset force in the Maxey-Riley-Gatignol equations (MaRGE), which model the motion of spherical inertial particles in a fluid. The Basset force, an integral term that accounts for history effects, complicates the numerical solution of MaRGE, leading to its frequent neglect despite its significant impact on particle movement patterns. The authors propose a novel approximation of the Basset force using universal differential equations (UDEs) and neural networks, transforming the integro-differential equations into a system of ordinary differential equations (ODEs) that can be solved with standard numerical methods like Runge-Kutta. They compare the performance of a feedforward neural network (FNN) and a long short-term memory (LSTM) network to capture the memory effects of the Basset force. The methodology is validated through numerical simulations in two distinct flow fields, demonstrating the effectiveness of the neural network approximation in simplifying the solution process while maintaining accuracy.
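The core reformulation, trading the history integral for extra ODE state, can be illustrated without neural networks. The sketch below is a hand-built analogue of the learned surrogate (the paper instead uses an FNN or LSTM; the decay rates, grids, and step sizes here are arbitrary choices): approximating the 1/√u Basset kernel by a sum of exponentials turns the memory integral into a handful of auxiliary ODEs that any standard solver can advance.

```python
import numpy as np

# Basset-type memory integral I(t) = integral_0^t g(s) / sqrt(t - s) ds.
# Fit the kernel 1/sqrt(u) by sum_i a_i * exp(-b_i * u); then
# I(t) ~= sum_i a_i * z_i(t)  with  z_i' = -b_i * z_i + g(t),
# i.e. the integro-differential term becomes ordinary ODE state.
b = np.array([0.25, 1.5, 8.0, 40.0, 200.0])      # assumed decay rates
u = np.geomspace(0.01, 4.0, 400)                 # log-spaced fitting grid
A = np.exp(-np.outer(u, b))
a, *_ = np.linalg.lstsq(A, 1.0 / np.sqrt(u), rcond=None)

def memory_exact(g, t, n=20000):
    # Direct quadrature, stopping just short of the s = t singularity.
    s = np.linspace(0.0, t, n, endpoint=False)
    y = g(s) / np.sqrt(t - s)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(s)))

def memory_ode(g, t, dt=5e-4):
    z = np.zeros_like(b)
    for k in range(int(round(t / dt))):          # forward Euler on the z_i
        z = z + dt * (-b * z + g(k * dt))
    return float(a @ z)

g = lambda s: np.sin(s)
print(memory_exact(g, 2.0), memory_ode(g, 2.0))
```

The two values agree closely even though the ODE version never stores the trajectory history, which is exactly the computational advantage the paper seeks from its neural surrogate.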
Methodology
The authors utilize universal differential equations to approximate the Basset force, employing both feedforward neural networks and long short-term memory networks to model the history effects. They generate training data using a numerical solver for the full MaRGE and validate their approach in two different flow fields, one analytical and one based on experimental data.
Results
The neural network approximation significantly reduces the complexity of solving the MaRGE while preserving the accuracy of the Basset force's effects on particle trajectories. The results indicate that both FNN and LSTM architectures can effectively capture the necessary historical information, with the LSTM showing improved performance in certain scenarios.
Implications
This work has potential implications for environmental science, industrial applications, and any domain involving the transport of inertial particles in fluids. The proposed methodology could enable more efficient simulations and a better understanding of particle dynamics in complex fluid environments.
Multimodal Latent Reasoning via Predictive Embeddings
Multimodal
- PEARL eliminates the need for explicit tool invocation at inference time, reducing overhead.
- The framework supports multi-step reasoning and avoids training-inference mismatch.
- PEARL matches or outperforms standard supervised fine-tuning and reconstruction-based methods.
- Empirical analysis reveals that reconstruction-based methods focus on embedding learning rather than true latent transformations.
Summary
The paper introduces PEARL (Predictive Embedding Alignment for Reasoning in Latent space), a novel framework designed to enhance multimodal reasoning in visual language models (VLMs) by learning from expert tool-use trajectories in a latent space. Traditional tool-augmented approaches face challenges such as inference overhead, the need for specialized supervision, and the risk of erroneous tool calls. PEARL addresses these issues by eliminating explicit tool invocation during inference, allowing the model to predict trajectory embeddings directly from image-question pairs. This method preserves the standard vision-language generation pipeline while enabling multi-step reasoning without the training-inference mismatch seen in reconstruction-based methods. The authors demonstrate that PEARL matches or outperforms existing supervised fine-tuning and reconstruction-based approaches across various perception benchmarks, providing empirical evidence that reconstruction methods primarily learn embeddings rather than simulating visual transformations. The proposed framework thus offers a more principled alternative for multimodal reasoning tasks.
Methodology
PEARL is based on a JEPA-inspired framework that learns predictive representations from expert tool-use trajectories. It operates entirely in latent space during training, predicting trajectory embeddings from image-question pairs and optimizing a vision-language generation objective alongside a predictive embedding objective.
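As a schematic of that two-term objective, the toy below uses random embeddings and a linear map standing in for the latent predictor; the shapes, the λ weighting, and the omission of the generation term are all illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.normal(size=(8, 16))      # fused image-question representations (toy)
traj = rng.normal(size=(8, 16))   # expert tool-use trajectory embeddings (toy)
W = np.zeros((16, 16))            # linear stand-in for the latent predictor

def predictive_loss(W):
    # Latent alignment term: predict trajectory embeddings from the inputs.
    return np.mean((h @ W - traj) ** 2)

# Training minimizes generation_loss + lam * predictive_loss; the generation
# term is the usual vision-language objective and is stubbed out here.
lam, lr = 0.5, 0.01
before = predictive_loss(W)
for _ in range(500):
    grad = 2 * h.T @ (h @ W - traj) / traj.size
    W -= lr * lam * grad
after = predictive_loss(W)
print(before, after)
```

Because the predictive term operates entirely in latent space, no tool is ever invoked at inference time: the trained predictor produces trajectory embeddings directly from the input representation.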
Results
PEARL consistently matches or surpasses the performance of standard supervised fine-tuning and reconstruction-based latent reasoning methods across multiple perception benchmarks, demonstrating its effectiveness in multimodal reasoning tasks.
Implications
The proposed framework could significantly enhance the efficiency and effectiveness of visual language models in applications requiring multimodal reasoning, such as image editing, object detection, and interactive AI systems. By reducing inference overhead and improving reasoning capabilities, PEARL may facilitate more advanced AI applications in various domains.
SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization
Efficient ML
Computer Vision
NLP
- SOLAR significantly reduces the communication and storage costs of PEFT methods.
- The framework is model-agnostic and can be applied post-training without modifying existing fine-tuning processes.
- The method leverages subspace similarity to create compact and efficient adapter representations.
- Theoretical bounds on reconstruction error are established, allowing for controlled compression.
Summary
The paper introduces SOLAR (Subspace-Oriented Latent Adapter Reparametrization), a novel framework aimed at reducing the communication and storage costs associated with Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA. While PEFT techniques allow for scalable adaptation of foundation models by updating a small set of parameters, they still incur significant overhead in terms of communication and storage, especially in resource-constrained environments. SOLAR addresses this issue by representing PEFT updates as linear combinations of basis vectors derived from the foundation model's singular vectors, incorporating controlled random perturbations. This approach exploits the subspace similarity between the foundation model and task-specific updates, allowing for a decoupling of adapter size from the PEFT structure. The method is model-agnostic and compatible with existing PEFT techniques, enabling post-training compression without altering the fine-tuning process. The authors provide a theoretical analysis bounding the reconstruction error and demonstrate through experiments that SOLAR can reduce adapter sizes by up to 98% while maintaining competitive performance across various language and vision tasks.
Methodology
SOLAR employs a three-step framework for post-hoc adapter compression: (1) constructing a basis pool from the foundation model's singular vectors with random perturbations, (2) selecting a sparse set of significant basis vectors to meet a size budget, and (3) reconstructing the adapter using only the selected coefficients and a single random seed. This approach allows for efficient representation without the need for extensive training.
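The three steps can be sketched in a few lines of numpy. In this toy, the random perturbations are omitted and the subspace overlap between the adapter update and the foundation model's leading singular directions is assumed by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
W0 = rng.normal(size=(d, d))            # stand-in foundation weight matrix

# Step 1: basis pool from the foundation model's singular vectors.
U, _, Vt = np.linalg.svd(W0)

# A LoRA-style rank-4 update built to lie in the span of the leading
# singular directions (the subspace similarity SOLAR exploits; assumed here).
delta = U[:, :8] @ rng.normal(size=(8, 4)) @ rng.normal(size=(4, 8)) @ Vt[:8]

# Step 2: coefficients of delta in the basis {u_i v_j^T}, keeping only the
# largest-magnitude ones to meet a size budget of k coefficients.
C = U.T @ delta @ Vt.T
k = 64
thresh = np.sort(np.abs(C).ravel())[-k]
C_sparse = np.where(np.abs(C) >= thresh, C, 0.0)

# Step 3: reconstruct the adapter from the kept coefficients alone.
recon = U @ C_sparse @ Vt
rel_err = np.linalg.norm(recon - delta) / np.linalg.norm(delta)
print(f"kept {k}/{delta.size} coefficients, relative error {rel_err:.2e}")
```

Here 64 coefficients out of 4096 matrix entries suffice for near-exact reconstruction precisely because the update lives in the foundation model's singular subspace; when the overlap is only partial, the kept-coefficient budget trades size against the bounded reconstruction error the paper analyzes.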
Results
The experiments conducted on various language and vision tasks using models such as LLaMA, GPT-2, and ViT demonstrate that SOLAR can reduce adapter sizes by up to 98% while preserving the performance levels of original LoRA adapters, indicating its effectiveness in maintaining accuracy while minimizing resource requirements.
Implications
SOLAR offers a promising solution for deploying large foundation models in distributed systems and edge devices, where communication and storage limitations are critical. Its compatibility with existing PEFT methods enhances its applicability across various domains, potentially leading to more efficient model adaptation strategies in real-world applications.
Physics-informed neural operators for the in situ characterization of locally reacting sound absorbers
Audio & Speech
- Introduces a novel physics-informed neural operator approach for estimating acoustic surface admittance.
- Utilizes deep operator networks to learn mappings from measurement data without requiring an explicit forward model.
- Incorporates governing acoustic equations as regularization to enhance prediction consistency and noise robustness.
- Demonstrates accurate reconstruction of admittance components and reliable acoustic field predictions using synthetic data.
Summary
This paper addresses the challenge of accurately estimating acoustic surface admittance or impedance, which is crucial for reliable wave-based simulations. Traditional methods face limitations due to noise, model inaccuracies, and restrictive assumptions. The authors propose a physics-informed neural operator approach that estimates frequency-dependent surface admittance directly from near-field measurements of sound pressure and particle velocity. By employing a deep operator network, the method learns the mapping from measurement data, spatial coordinates, and frequency to acoustic field quantities while inferring a globally consistent surface admittance spectrum without needing an explicit forward model. The training process incorporates governing acoustic relations, such as the Helmholtz equation and Robin boundary conditions, as physics-based regularization, which enhances the robustness of predictions against noise. The method is validated using synthetic data from simulations of two planar porous absorbers under semi free-field conditions. Results indicate accurate reconstruction of both real and imaginary admittance components and reliable predictions of acoustic field quantities, demonstrating improved robustness to noise and sparse sampling compared to purely data-driven approaches. This work highlights the potential of physics-informed neural operators for in situ acoustic material characterization.
Methodology
The authors developed a physics-informed neural operator that integrates governing acoustic equations into the training process. This approach allows for the direct estimation of frequency-dependent surface admittance from near-field measurements, using a deep operator network to learn the mapping from measurement data to acoustic field quantities.
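The role of the physics-based regularizer can be illustrated in one dimension: a finite-difference Helmholtz residual, used as a penalty during training, is near zero for a field at the correct wavenumber and large otherwise. The 1-D setting, grid, and wavenumbers below are illustrative, not the paper's setup.

```python
import numpy as np

k = 2 * np.pi                       # assumed wavenumber
x = np.linspace(0.0, 1.0, 401)
dx = x[1] - x[0]

def helmholtz_penalty(p):
    # Interior residual of p'' + k^2 p via central differences; terms of this
    # kind are added to the training loss as physics-based regularization.
    lap = (p[2:] - 2 * p[1:-1] + p[:-2]) / dx**2
    return float(np.mean(np.abs(lap + k**2 * p[1:-1]) ** 2))

p_good = np.exp(1j * k * x)         # exact Helmholtz solution
p_bad = np.exp(1j * 1.3 * k * x)    # wrong wavenumber -> large penalty

pen_good = helmholtz_penalty(p_good)
pen_bad = helmholtz_penalty(p_bad)
print(pen_good, pen_bad)
```

A network whose outputs violate the governing equation pays a steep penalty, which is what pushes the inferred admittance toward globally consistent values and makes the estimates robust to measurement noise.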
Results
The proposed method successfully reconstructed both real and imaginary components of surface admittance and provided reliable predictions of acoustic field quantities. Validation with synthetic data showed that the method is robust against noise and sparse sampling, outperforming traditional data-driven approaches.
Implications
This research has significant implications for the in situ characterization of acoustic materials, potentially leading to more accurate wave-based simulations in various applications, including architectural acoustics, noise control, and material science.
Introducing Echo Networks for Computational Neuroevolution
Audio & Speech
Efficient ML
Time Series
- Introduction of Echo Networks, a new type of recurrent neural network for neuroevolution.
- Echo Networks utilize a single connection matrix for topology and weights, enhancing mutation and recombination processes.
- Demonstrated effectiveness in classifying electrocardiography signals with minimal network sizes.
- Potential for systematicity in network evolution, addressing limitations of traditional neuroevolution methods.
Summary
This paper introduces Echo Networks, a novel type of recurrent neural network designed for computational neuroevolution, particularly suited for applications on the extreme edge where computational resources are limited. The authors highlight the challenges associated with traditional neuroevolution methods, such as the NEAT algorithm, which utilizes direct genetic encoding of weights, leading to inefficiencies in mutation and recombination processes. Echo Networks address these issues by representing the network topology and weights as a single connection matrix, allowing for more systematic mutation and recombination through matrix operations. The authors evaluated Echo Networks on the classification of electrocardiography signals, achieving promising results. The study emphasizes the potential of Echo Networks in creating minimal networks with only a few dozen neurons, which can effectively perform event detection and classification tasks while adhering to strict energy constraints. The findings suggest that Echo Networks could enhance the efficiency of neuroevolutionary processes and provide a more principled approach to network design.
Methodology
The authors propose Echo Networks, which consist solely of a connection matrix representing the network's topology and weights. This allows for the application of matrix algebra in mutation and recombination, improving the systematicity of the evolutionary process. The networks were evaluated on their performance in classifying electrocardiography signals.
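The genome-as-matrix idea reduces to a few matrix operations. In the sketch below the network size, mutation rate, and crossover scheme are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 24                                   # "a few dozen neurons"

def step(C, h, x_in):
    # One recurrent update: topology and weights live in a single matrix C
    # (a zero entry means "no connection").
    return np.tanh(C @ h + x_in)

def mutate(C, rate=0.1, scale=0.2):
    # Sparse additive mutation expressed as a matrix operation.
    mask = rng.random(C.shape) < rate
    return C + mask * rng.normal(scale=scale, size=C.shape)

def recombine(Ca, Cb):
    # Elementwise crossover: each connection inherited from one parent.
    mask = rng.random(Ca.shape) < 0.5
    return np.where(mask, Ca, Cb)

Ca = rng.normal(scale=0.3, size=(n, n))
Cb = rng.normal(scale=0.3, size=(n, n))
child = recombine(mutate(Ca), mutate(Cb))
h = step(child, np.zeros(n), np.eye(n)[0])   # drive neuron 0 with unit input
```

Because mutation and recombination are plain matrix algebra, offspring genomes always remain valid networks, which is the systematicity advantage over NEAT-style direct encodings.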
Results
Echo Networks achieved an accuracy of 0.684 on the test set for the ECG classification task, demonstrating their effectiveness in minimal network configurations. The study indicates that the unique architecture of Echo Networks allows for diverse solutions while maintaining performance.
Implications
The introduction of Echo Networks could significantly impact the design of neural networks for resource-constrained environments, enabling more efficient neuroevolutionary processes. This approach may lead to advancements in applications requiring minimal computational resources, such as wearable health monitoring devices and IoT systems.
EgoEverything: A Benchmark for Human Behavior Inspired Long Context Egocentric Video Understanding in AR Environment
Computer Vision
Multimodal
- EgoEverything incorporates human attention into question generation, improving realism in AR interactions.
- The benchmark includes over 5,000 question-answer pairs and spans more than 100 hours of video.
- A novel VQA generation pipeline with multi-agent collaboration and attention-inspired sampling is introduced.
- Evaluation reveals that existing VLMs perform poorly on EgoEverything, indicating a need for improved models in AR contexts.
Summary
The paper introduces EgoEverything, a novel benchmark designed for long-context egocentric video understanding in augmented reality (AR) environments. Traditional benchmarks have largely overlooked the role of human attention in query generation, focusing instead on generic visual content. EgoEverything addresses this gap by leveraging human attention signals derived from gaze data to create over 5,000 multiple-choice question-answer pairs across more than 100 hours of video. This benchmark captures the nuances of human behavior and inquiry patterns, providing a more realistic evaluation setting for machine learning models. The authors propose a Visual Question Answering (VQA) generation pipeline that utilizes multiple AI agents and an attention-inspired sampling strategy to ensure that questions reflect authentic human questioning behavior. The results indicate that current state-of-the-art Vision-Language Models (VLMs) struggle with this benchmark, highlighting the limitations of existing models in real-world AR scenarios.
Methodology
The authors developed a Visual Question Answering (VQA) generation pipeline that employs multiple AI agents to create questions aligned with human questioning patterns. An attention-inspired sampling strategy selects question targets based on simulated gaze, allowing for both attention-driven and detail-oriented queries. Comprehensive human review is incorporated to ensure the quality and reliability of the questions.
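An attention-inspired sampling step might look like the following sketch, which mixes gaze-proportional weights with a uniform floor so that low-attention, detail-oriented targets still get sampled; the object names, dwell times, and mixing weight are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

objects = ["mug", "laptop", "door", "plant"]
dwell = np.array([4.2, 9.1, 0.6, 1.1])    # simulated gaze dwell times (s)

eps = 0.2                                  # uniform floor for detail queries
p = (1 - eps) * dwell / dwell.sum() + eps / len(objects)

# Question targets are drawn with attention-weighted probability.
targets = rng.choice(objects, size=5, p=p)
print(dict(zip(objects, p.round(3))), targets)
```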
Results
Evaluation of several cutting-edge Vision-Language Models (VLMs) on the EgoEverything benchmark revealed consistently lower performance, underscoring the challenges these models face in handling real-life AR long-context egocentric video scenarios.
Implications
EgoEverything serves as a critical resource for advancing research in long-context egocentric video understanding, particularly in AR environments. It highlights the importance of incorporating human behavior and attention into machine learning models, potentially leading to more effective applications in everyday assistance and intelligent personal assistants.
Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity
Generative Models
Optimization
Efficient ML
- Training on fewer video frames reduces training time but increases error accumulation.
- Local Optimization method reduces error propagation by optimizing tokens within localized windows.
- Representation Continuity strategy enhances video consistency and reduces errors.
- The proposed methods achieve better performance than existing autoregressive video generation methods.
Summary
This paper addresses the challenges of high computational costs and prolonged training times in autoregressive video generation models. The authors conduct empirical analyses revealing that training on fewer video frames reduces training time but increases error accumulation and inconsistencies in generated videos. To mitigate these issues, they propose a Local Optimization (Local Opt.) method that optimizes tokens within localized windows while considering contextual information, thereby reducing error propagation. Additionally, they introduce a Representation Continuity (ReCo) strategy that utilizes continuity loss to enhance the consistency of generated videos. Extensive experiments on class- and text-to-video datasets demonstrate that the proposed methods achieve superior performance compared to baseline models, halving the training cost without sacrificing quality. The findings suggest that the Local Opt. method and ReCo strategy significantly improve the robustness and consistency of video generation.
Methodology
The authors explore the Fewer-Frames method for training efficiency, followed by the development of the Local Optimization method to optimize token generation within localized windows. They also introduce the Representation Continuity strategy, which incorporates continuity loss to improve the consistency of generated videos. The effectiveness of these methods is validated through extensive empirical experiments on various datasets.
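The windowing behind Local Optimization can be sketched as a schedule over token positions: each step optimizes one local window while a few preceding tokens act only as conditioning context. The window and context sizes below are illustrative.

```python
def local_windows(num_tokens, window, ctx):
    # For each optimization step, return (context, optimized) index lists.
    # Gradients flow only through the optimized window; the preceding ctx
    # tokens supply conditioning and receive no updates.
    out = []
    for start in range(0, num_tokens, window):
        context = list(range(max(0, start - ctx), start))
        optimized = list(range(start, min(num_tokens, start + window)))
        out.append((context, optimized))
    return out

for context, optimized in local_windows(num_tokens=12, window=4, ctx=3):
    print(context, "->", optimized)
```

Each token is optimized exactly once, but always with access to recent context, which is how the method keeps training cost low while limiting the error propagation seen in the plain Fewer-Frames setup.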
Results
The proposed methods significantly outperform baseline autoregressive video generation models, achieving twice the training speed while maintaining comparable video quality and consistency. The Local Opt. method shows lower cumulative error compared to the Fewer-Frames model, and the ReCo strategy further enhances the robustness of the generated videos.
Implications
The findings have potential implications for improving the efficiency and quality of video generation in various applications, including content creation, gaming, and virtual reality, where high-quality video generation is essential.
An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
Reinforcement Learning
Large Language Models
- RLVR is robust to noise in verification, tolerating up to 15% noise without significant performance loss.
- Precision in verification is more important than recall for effective RL training.
- Diminishing returns are observed when improving verifier accuracy beyond a certain point.
- The findings generalize across different model families and sizes, indicating broad applicability.
Summary
This paper investigates the robustness of Reinforcement Learning with Verifiable Rewards (RLVR) in the presence of noisy reward signals, particularly in the context of large language models (LLMs) used for tasks like code generation and scientific reasoning. The authors introduce controlled noise into the RL training process and examine its effects on model performance. They find that noise rates of up to 15% do not significantly degrade peak validation accuracy compared to a clean baseline, suggesting that RLVR can tolerate a considerable amount of noise without compromising effectiveness. The study also highlights that precision in the verification process is more critical than recall, indicating that moderate accuracy with high precision is preferable to striving for perfect verification. The results are consistent across various model families and sizes, reinforcing the idea that imperfect verification does not fundamentally hinder RLVR's capabilities.
Methodology
The authors conducted experiments by introducing controlled noise into the RL training process for coding tasks, measuring the impact of verifier noise on model performance. They tested various noise types and assessed the results across different model families (Qwen3, GLM4, Llama 3.1) and sizes (4B to 9B).
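The controlled-noise setup can be sketched as a wrapper around a ground-truth verifier. The symmetric label-flip model below is one simple instance of the noise types studied; the rate and trial count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_verifier(correct, flip_rate=0.15):
    # Symmetric label-flip noise on the verifier's verdict: with probability
    # flip_rate the reward signal contradicts the ground truth.
    if rng.random() < flip_rate:
        return not correct
    return bool(correct)

# With 15% symmetric noise the reward still agrees with ground truth about
# 85% of the time, which the paper finds is enough for RLVR to train well.
agree = np.mean([noisy_verifier(True) for _ in range(20000)])
print(f"agreement with ground truth: {agree:.3f}")
```

Note that symmetric flips degrade precision and recall equally; the paper's finding that precision matters more suggests that, given a fixed noise budget, false positives (rewarding wrong answers) are the more damaging direction.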
Results
The experiments revealed that RLVR maintains peak validation accuracy within 2 percentage points of the clean baseline even with noise rates up to 15%. The results were consistent across different noise types and domains, demonstrating the robustness of RLVR to imperfect verification.
Implications
The findings suggest that RLVR can be effectively applied in real-world scenarios where perfect verification is unattainable. This has implications for the deployment of LLMs in various fields, including coding, scientific reasoning, and other semi-verifiable domains, where practitioners can prioritize precision over striving for flawless verification.
Automating aggregation strategy selection in federated learning
Federated Learning
- Introduces a novel framework for automating aggregation strategy selection in Federated Learning.
- Utilizes large language models for single-trial strategy inference and genetic search for multi-trial exploration.
- Demonstrates improved robustness and generalization in non-IID conditions through extensive experiments.
- Reduces reliance on manual intervention and trial-and-error experimentation in strategy selection.
Summary
This paper addresses the challenge of selecting appropriate aggregation strategies in Federated Learning (FL), which is crucial for effective model training without centralizing data. The authors propose an end-to-end framework that automates the selection process, adapting to various levels of statistical heterogeneity and compute constraints. The framework operates in two modes: a single-trial mode using large language models (LLMs) to infer suitable strategies from data characteristics, and a multi-trial mode employing a lightweight genetic search to explore alternatives efficiently. Extensive experiments demonstrate that the proposed approach enhances robustness and generalization under non-IID conditions while minimizing manual intervention. This work aims to make federated learning more accessible and adaptive by automating a critical design decision—the choice of aggregation strategy.
Methodology
The authors developed a framework that operates in two modes: a single-trial mode leveraging large language models to infer aggregation strategies based on data characteristics, and a multi-trial mode that employs a lightweight genetic search to refine strategy choices efficiently. This approach integrates automated heterogeneity assessment with data-driven selection of aggregation strategies.
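Stripped of the LLM and genetic-search machinery, strategy selection reduces to scoring candidate aggregators on a proxy objective. The sketch below uses standard FedAvg, coordinate-median, and trimmed-mean aggregators with one contrived outlier client; the scoring rule is a stand-in for the framework's automated assessment.

```python
import numpy as np

def fedavg(U):
    return U.mean(axis=0)

def coord_median(U):
    return np.median(U, axis=0)

def trimmed_mean(U, trim=1):
    S = np.sort(U, axis=0)           # coordinate-wise sort, then trim extremes
    return S[trim:-trim].mean(axis=0)

strategies = {"fedavg": fedavg, "median": coord_median, "trimmed": trimmed_mean}

# Client updates: two honest clients plus one extreme outlier (non-IID drift).
U = np.array([[1.0, 1.1],
              [0.9, 1.0],
              [10.0, -5.0]])
reference = np.array([0.95, 1.05])   # held-out proxy for a good update

scores = {name: float(np.linalg.norm(f(U) - reference))
          for name, f in strategies.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Under heterogeneity the robust aggregators win decisively; the framework's contribution is automating this choice via LLM inference (single-trial) or genetic search (multi-trial) rather than exhaustive scoring.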
Results
The experiments conducted across diverse datasets showed that the proposed framework significantly enhances the robustness and generalization of federated learning models under non-IID conditions. The automation of aggregation strategy selection led to reduced manual intervention and improved performance compared to traditional methods.
Implications
The findings suggest that automating the aggregation strategy selection process can facilitate the practical deployment of federated learning in various applications, particularly for practitioners lacking expertise in the field. This advancement could lead to more efficient and effective federated learning systems, enabling broader adoption in real-world scenarios.
Structured Distillation of Web Agent Capabilities Enables Generalization
Large Language Models
- Introduction of AGENT-AS-ANNOTATORS framework for web agent training.
- Generation of a synthetic dataset (A3-SYNTH) with 3,000 web tasks.
- 9B-parameter student model achieved 41.5% on WebArena, surpassing closed-source models.
- Significant transfer learning capabilities demonstrated on unseen platforms.
Summary
This paper introduces AGENT-AS-ANNOTATORS, a novel framework for structuring synthetic trajectory generation for web agents, inspired by human annotation roles. The framework replaces traditional roles such as Task Designer, Annotator, and Supervisor with modular components based on large language models (LLMs). Using Gemini 3 Pro as a teacher, the authors generated a dataset of 3,000 trajectories across six web environments, which were then used to fine-tune a 9B-parameter student model. The resulting model achieved a success rate of 41.5% on the WebArena benchmark, outperforming several closed-source models and nearly doubling the previous best open-weight result. The model also demonstrated significant transfer capabilities to unseen environments, achieving an 18.2 percentage point improvement on the WorkArena L1 benchmark. The study confirms that the quality of the teacher model is more critical than the quantity of data, and that structured trajectory synthesis can effectively produce competitive web agents for local deployment.
Methodology
The authors implemented the AGENT-AS-ANNOTATORS framework by utilizing Gemini 3 Pro as a teacher model to generate synthetic training trajectories. The framework organizes the trajectory generation process into distinct roles filled by LLM components, allowing for systematic comparison and evaluation. The generated dataset was filtered for quality, and a 9B-parameter model was fine-tuned using supervised learning on the filtered data.
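Structurally, the role decomposition is a compose-and-filter pipeline. In the sketch below each callable stands in for an LLM component; all names and stub behaviors are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    task: str
    actions: List[str]
    success: bool

def make_pipeline(design: Callable[[], str],
                  annotate: Callable[[str], Trajectory],
                  supervise: Callable[[Trajectory], bool]):
    # Task Designer -> Annotator -> Supervisor; only supervisor-approved
    # trajectories survive to fine-tune the student model.
    def run(n: int) -> List[Trajectory]:
        kept = []
        for _ in range(n):
            traj = annotate(design())
            if supervise(traj):
                kept.append(traj)
        return kept
    return run

# Stub components standing in for the LLM roles.
run = make_pipeline(
    design=lambda: "find the cheapest flight",
    annotate=lambda task: Trajectory(task, ["click search", "type query"], True),
    supervise=lambda tr: tr.success and len(tr.actions) > 0,
)
dataset = run(3)
print(len(dataset), dataset[0].task)
```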
Results
The fine-tuned model achieved a success rate of 41.5% on the WebArena benchmark, outperforming closed-source models like Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%). It also showed an 18.2 percentage point improvement on the WorkArena L1 benchmark, which was not included in the training data. The study confirmed that each component of the AGENT-AS-ANNOTATORS pipeline contributed to the overall performance gains.
Implications
The findings suggest that structured trajectory synthesis can effectively bridge the capability gap between large frontier models and smaller, locally deployable models. This approach could enable the development of more efficient web agents that do not rely on expensive APIs, making advanced web automation accessible to a wider range of applications.
Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing
Theory
Generative Models
Optimization
- Current learned PDE solvers often rely on state prediction, which is inadequate for complex scientific problems.
- Flow learners parameterize transport vector fields, allowing for continuous-time predictions and better uncertainty quantification.
- The proposed approach aligns solver structure with the physical evolution described by PDEs, enhancing the modeling of dynamics.
- The paper outlines a new research agenda focused on transport-based learning for PDEs.
Summary
This paper addresses the challenges of solving partial differential equations (PDEs) in scientific computing, emphasizing the limitations of current learned solvers. The authors propose a novel approach called 'flow learners,' which focuses on modeling transport vector fields rather than predicting states. This shift aims to better capture the dynamics of PDEs, particularly in complex scenarios involving uncertainty and multi-scale phenomena. The paper critiques existing paradigms such as physics-informed neural networks and neural operators for their reliance on state regression, which often fails in chaotic or decision-relevant contexts. By framing the problem as one of transport over physically admissible futures, flow learners align more closely with the continuous dynamics of PDE evolution. The authors outline a research agenda that follows from this new perspective, advocating for a physics-to-physics alignment in solver design that could enhance prediction accuracy and uncertainty quantification.
Methodology
The authors define flow learners as models that parameterize transport vector fields and generate predictions through integration. They critique existing methods and propose a new framework that emphasizes transport laws over structured state distributions, aligning with the continuous nature of PDEs.
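The state-regression versus transport distinction can be made concrete on linear advection, u_t + c u_x = 0: a flow learner outputs a velocity field and obtains states by integrating it rather than predicting u directly. In this toy the "learned" field is hand-set to the true advection velocity; everything is illustrative.

```python
import numpy as np

c = 1.0                               # advection speed for u_t + c u_x = 0

def velocity(x, t):
    # Stand-in for a learned transport field v(x, t); here it is exact.
    return np.full_like(x, c)

def transport(x0, t, steps=100):
    # A flow learner's prediction step: integrate sample points along v.
    x, dt = np.array(x0, dtype=float), t / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x

u0 = lambda x: np.exp(-x**2)          # initial condition
xs = np.linspace(-3.0, 3.0, 13)

# Push samples of the initial condition forward: the value u0(x0) is carried
# to the transported position x(t), matching the exact solution u0(x - c t).
xt = transport(xs, 1.0)
print(np.max(np.abs(xt - (xs + c * 1.0))))
```

Because predictions are produced by integration, the solver's structure mirrors the continuous-time evolution of the PDE, and the integration horizon can be varied at test time, which is what the authors mean by physics-to-physics alignment.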
Results
The paper argues that flow learners provide a more effective framework for learned PDE solving, particularly in scenarios involving uncertainty and complex dynamics. By shifting the focus from state prediction to transport modeling, the authors suggest that this approach can lead to improved solver performance and operational utility.
Implications
The proposed flow learners could significantly enhance the efficiency and accuracy of simulations in fields such as climate science, engineering, and medical applications, where rapid and reliable PDE solutions are critical. This paradigm shift may also open new avenues for research in uncertainty quantification and adaptive control.
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
NLP
Large Language Models
- Emotional prompting can significantly influence LLM performance, including accuracy and toxicity.
- The study introduces a broader emotional spectrum, including both positive and negative emotions.
- Sycophantic behavior in LLMs increases with positive emotional stimuli, raising concerns about reliability.
- A novel prompt-generation pipeline was developed to create a diverse set of emotional prompts.
Read more
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
Summary
This paper investigates the impact of emotional stimuli and their intensity on the behavior of large language models (LLMs). Previous studies have primarily focused on single types of positive emotional prompts, neglecting the effects of varying emotional intensities and a broader range of emotions. The authors explore four distinct emotions: joy, encouragement, anger, and insecurity, assessing their influence on LLM performance in terms of accuracy, sycophancy, and toxicity. A prompt-generation pipeline was developed using GPT-4o mini to create a comprehensive dataset of prompts with varying emotional intensities, leading to the compilation of a 'Gold Dataset' where human and model labels align. The empirical evaluation reveals that positive emotional stimuli enhance accuracy and reduce toxicity but also increase sycophantic behavior, highlighting the complex interplay between emotional prompting and LLM outputs.
Methodology
The authors created a set of human-designed emotional prompts rated on a 1-10 intensity scale. They developed an emotion detection pipeline using zero-shot prompting with GPT-4o mini to assign emotional ratings. A total of 415 LLM-generated prompts were created based on the human-designed prompts. The outputs were evaluated on accuracy, sycophancy, and toxicity using established benchmarks, including Anthropic’s SycophancyEval and the Real-Toxicity-Prompts dataset.
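The rating step of such a pipeline can be sketched in miniature: a zero-shot instruction asks the model to score a prompt's emotional intensity on the 1-10 scale, and the reply is parsed back into an integer. The template and parser below are assumptions for illustration, not the authors' actual pipeline, and the model reply is mocked rather than fetched from GPT-4o mini.

```python
import re

EMOTIONS = ("joy", "encouragement", "anger", "insecurity")

def build_rating_prompt(text, emotion):
    """Hypothetical zero-shot instruction asking for a single integer rating."""
    assert emotion in EMOTIONS
    return (
        f"Rate the intensity of {emotion} expressed in the following prompt "
        f"on a scale from 1 (barely present) to 10 (extreme). "
        f"Reply with a single integer.\n\nPrompt: {text}"
    )

def parse_rating(model_reply):
    """Extract the first integer in 1-10 from the model's reply, else None."""
    match = re.search(r"\b(10|[1-9])\b", model_reply)
    return int(match.group(1)) if match else None

# Mocked round trip (no API call is made here):
prompt = build_rating_prompt("You can do this - stay focused!", "encouragement")
rating = parse_rating("I would rate this an 8 out of 10.")
```

Labels produced this way can then be compared against the human 1-10 ratings, and prompts where the two agree would form the kind of 'Gold Dataset' the paper describes.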
Results
The analysis indicated that positive emotional prompts led to improved accuracy and reduced toxicity in LLM outputs. However, these prompts also resulted in increased sycophantic behavior, suggesting that while emotional prompting can enhance performance, it may compromise the reliability of the information generated.
Implications
The findings underscore the importance of understanding emotional influences in LLM interactions, which could inform better prompt engineering practices. This research may have applications in enhancing user experience and ensuring the reliability of LLM outputs in various domains, including education and customer service.