AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
55
Papers today
8h
Update frequency
7
Days of history
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
Computer Vision
Generative Models
Interpretability
- Introduces BrainCoDec, a training-free method for cross-subject brain decoding.
- Utilizes a two-stage hierarchical inference process for visual decoding.
- Achieves generalization across subjects without anatomical alignment or stimulus overlap.
- Demonstrates robustness to input variability and effective reconstruction of visual stimuli.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
Summary
This paper addresses the challenge of visual decoding from brain signals, particularly the variability in neural representations across individuals that complicates the development of generalizable cross-subject models. The authors propose a novel meta-optimized approach for semantic visual decoding from fMRI data that allows for generalization to new subjects without the need for fine-tuning. The method, named BrainCoDec (Brain In-Context Decoding), utilizes a two-stage hierarchical inference process. In the first stage, it estimates visual response encoder parameters for individual voxels based on a small set of image-brain activation examples from the new subject. In the second stage, it aggregates these parameters across multiple voxels to perform functional inversion and decode visual stimuli. The approach is designed to work without anatomical alignment or stimulus overlap, demonstrating strong cross-subject and cross-scanner generalization. The results indicate that the method is robust to input variability and can effectively reconstruct visual stimuli from brain activity, marking a significant advancement towards a universal brain decoding model.
Methodology
The methodology involves a two-stage hierarchical process where the first stage estimates voxel-specific visual response encoder parameters using a small set of examples from the new subject. The second stage aggregates these parameters across multiple voxels to perform functional inversion, allowing for the decoding of visual stimuli from brain activity.
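The two-stage process can be sketched with plain ridge regression standing in for the paper's meta-learned in-context estimator (all shapes and names here are illustrative, not from the paper):

```python
import numpy as np

def fit_voxel_encoders(feats, voxels, lam=1.0):
    """Stage 1: estimate per-voxel visual-response encoder weights from
    a few image-feature / brain-activation pairs of the new subject
    (plain ridge regression stands in for the meta-learned estimator)."""
    d = feats.shape[1]
    A = feats.T @ feats + lam * np.eye(d)
    return np.linalg.solve(A, feats.T @ voxels)   # (d, n_voxels)

def decode_stimulus(W, voxel_pattern, lam=1.0):
    """Stage 2: aggregate the voxel-wise encoders and invert the
    forward model to recover stimulus features from brain activity."""
    d = W.shape[0]
    return np.linalg.solve(W @ W.T + lam * np.eye(d), W @ voxel_pattern)

rng = np.random.default_rng(0)
true_W = rng.normal(size=(16, 200))            # 16-d features, 200 voxels
X = rng.normal(size=(50, 16))                  # 50 support examples
Y = X @ true_W + 0.01 * rng.normal(size=(50, 200))
W = fit_voxel_encoders(X, Y)
x_new = rng.normal(size=16)
x_hat = decode_stimulus(W, x_new @ true_W)     # decode a held-out stimulus
```

The point of the two stages is that only the small support set (X, Y) comes from the new subject; no subject-specific training loop is run.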
Results
The proposed method shows strong generalization capabilities across different subjects and scanners without the need for fine-tuning. It effectively reconstructs visual stimuli from brain activity, demonstrating robustness to input variability and alignment issues.
Implications
This work has significant implications for developing generalizable models of human brain function, which can enhance applications in brain-computer interfaces, cognitive assessments, and personalized diagnostics. It paves the way for more scalable and universal approaches to understanding neural representations across populations.
Playing DOOM with 1.3M Parameters: Specialized Small Models vs Large Language Models for Real-Time Game Control
Reinforcement Learning
Efficient ML
Robotics
- SauerkrautLM-Doom-MultiVec outperforms LLMs with significantly fewer parameters in real-time gameplay.
- Innovative use of ModernBERT architecture and depth-aware token representations enhances performance.
- Trained on 31,000 human gameplay demonstrations, the model exhibits superior engagement in gameplay.
- Demonstrates the effectiveness of specialized models for real-time decision-making tasks.
Playing DOOM with 1.3M Parameters: Specialized Small Models vs Large Language Models for Real-Time Game Control
Summary
This paper introduces SauerkrautLM-Doom-MultiVec, a specialized 1.3-million-parameter model designed for real-time gameplay in DOOM. The model significantly outperforms large language models (LLMs) up to 92,000 times its size, including Nemotron-120B and GPT-4o-mini, achieving 178 frags in 10 episodes compared to a combined total of 13 frags from all tested LLMs. The authors argue that small, task-specific models can outperform general-purpose LLMs in real-time control tasks, especially when trained on domain-specific data. Key innovations include a ModernBERT encoder with hash embeddings, depth-aware token representations, and an attention pooling classification head. The model processes ASCII frame representations in 31 ms per decision and was trained on 31,000 human gameplay demonstrations, demonstrating the effectiveness of specialized models in gaming environments. This work highlights the potential of small models to deliver high performance at a fraction of the cost and complexity of larger models, making them suitable for deployment on consumer hardware.
Methodology
The authors developed a model that combines a ModernBERT encoder with hash embeddings and depth-aware ASCII encoding to process game frames. The model was trained using a data collection pipeline that utilized VizDoom's spectator mode to gather high-quality training data with depth annotations. The architecture employs an attention pooling classification head to select actions based on the processed input.
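The attention pooling classification head is the most self-contained piece of this architecture; a minimal NumPy sketch (dimensions and action count are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool_classify(H, w_query, W_out):
    """Attention-pooling head: score each encoded token against a
    learned query vector, softmax the scores into weights, pool the
    weighted tokens, and project onto action logits."""
    alpha = softmax(H @ w_query)   # (seq_len,) weights over frame tokens
    pooled = alpha @ H             # (d,)
    return W_out @ pooled          # (n_actions,)

rng = np.random.default_rng(0)
H = rng.normal(size=(64, 32))      # 64 encoded ASCII-frame tokens, d=32
w_query = rng.normal(size=32)
W_out = rng.normal(size=(8, 32))   # 8 discrete game actions (illustrative)
logits = attention_pool_classify(H, w_query, W_out)
action = int(np.argmax(logits))
```

Compared with taking the last token or a plain mean, the learned query lets the head focus on whichever frame tokens matter for the current action choice.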
Results
The SauerkrautLM-Doom-MultiVec model achieved 178 frags in 10 gameplay episodes, averaging 17.8 frags per episode, while the tested LLMs collectively scored only 13 frags. This performance was achieved with a model that has 92,000 times fewer parameters than the largest LLM tested, demonstrating the effectiveness of specialized models in real-time control tasks.
Implications
The findings suggest that small, specialized models can be more effective than large general-purpose models for specific tasks, particularly in real-time applications like gaming. This could lead to broader adoption of efficient models in various domains requiring quick decision-making and low resource consumption.
Structured Distillation of Web Agent Capabilities Enables Generalization
Large Language Models
- Introduction of AGENT-AS-ANNOTATORS framework for web agent capability distillation.
- Generation of a high-quality training dataset (A3-SYNTH) using a frontier LLM as a teacher.
- Significant performance improvements on WebArena and unseen environments.
- Ablation studies confirm the meaningful contributions of each pipeline component.
Structured Distillation of Web Agent Capabilities Enables Generalization
Summary
This paper introduces AGENT-AS-ANNOTATORS, a novel framework for structuring synthetic trajectory generation for web agents, inspired by human annotation roles. The framework replaces traditional roles such as Task Designer, Annotator, and Supervisor with modular components powered by large language models (LLMs). Using the Gemini 3 Pro model as a teacher, the authors generated a dataset of 3,000 trajectories across six web environments, which were filtered to yield 2,322 high-quality examples for training a 9B-parameter student model. The resulting model achieved a success rate of 41.5% on the WebArena benchmark, outperforming several closed-source models and nearly doubling the previous best open-weight result. The study also demonstrated that the model's capabilities transfer effectively to unseen environments, achieving significant performance gains on the WorkArena L1 benchmark. The authors conducted ablation studies to confirm the contributions of various components in the pipeline, highlighting the importance of teacher quality and the effectiveness of lower reasoning budgets in generating better training data. Overall, the findings suggest that structured trajectory synthesis from a single frontier teacher can produce competitive, locally deployable web agents.
Methodology
The authors implemented the AGENT-AS-ANNOTATORS framework, replacing human roles with LLM modules to generate synthetic trajectories. They used Gemini 3 Pro to create a dataset of web tasks, which were filtered for quality. A 9B-parameter student model was then fine-tuned on the filtered dataset using supervised learning.
Results
The fine-tuned model achieved a 41.5% success rate on WebArena, surpassing closed-source models like Claude 3.5 Sonnet and GPT-4o. It also showed an 18.2 percentage point improvement on the unseen WorkArena L1 benchmark, with consistent gains across other benchmarks.
Implications
The findings suggest that structured trajectory synthesis can effectively close the capability gap between large and small models, enabling the development of competitive web agents that can be deployed locally without reliance on expensive APIs.
SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation
NLP
Large Language Models
Generative Models
- SYN-DIGITS is a lightweight, model-agnostic calibration framework for digital twin simulations.
- The framework successfully aligns LLM predictions with human ground truth using latent structure learning.
- Empirical evaluations show significant improvements in prediction accuracy and reduction of biases.
- SYN-DIGITS can be integrated with various simulation approaches, including naïve simulation and fine-tuning.
SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation
Summary
The paper introduces SYN-DIGITS, a novel calibration framework designed to enhance the reliability of digital twin simulations, particularly those based on large language models (LLMs). While LLMs have shown promise in simulating human behavior, they often suffer from systematic biases and miscalibration. SYN-DIGITS leverages synthetic control methods from causal inference to align LLM predictions with actual human responses. The framework operates as a post-processing layer that is model-agnostic, meaning it can be applied to any LLM-based simulator without requiring extensive modifications. The authors develop a latent factor model to identify conditions under which calibration is successful, and they conduct a comprehensive evaluation of ten calibration methods across various persona constructions and datasets. The results demonstrate that SYN-DIGITS significantly improves the accuracy of individual-level predictions and reduces distributional discrepancies compared to uncalibrated models, achieving up to 50% relative improvements in correlation and 90% reductions in discrepancies.
Methodology
SYN-DIGITS employs a post-hoc calibration approach that uses latent factor models to align synthetic responses generated by LLMs with real human responses. The framework evaluates multiple calibration methods across different personas and datasets, focusing on both individual-level and distributional simulations. The authors analyze the conditions under which calibration is effective, utilizing techniques from matrix completion and synthetic control.
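The synthetic-control idea at the core of the framework can be sketched as fitting simplex weights over donor series via projected gradient descent (a simplified stand-in for the paper's estimator; all names and shapes are illustrative):

```python
import numpy as np

def synthetic_control_weights(donors, target, iters=500, lr=0.1):
    """Fit simplex weights (non-negative, summing to one) so a convex
    combination of donor series matches the target series: the
    synthetic-control core, via projected gradient descent."""
    w = np.full(donors.shape[0], 1.0 / donors.shape[0])
    for _ in range(iters):
        grad = donors @ (w @ donors - target) / donors.shape[1]
        w = np.clip(w - lr * grad, 0.0, None)   # project back to the simplex
        w /= w.sum()
    return w

rng = np.random.default_rng(1)
donors = rng.normal(size=(5, 40))        # 5 LLM-simulated persona series
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
target = true_w @ donors + 0.01 * rng.normal(size=40)   # 'human' responses
w = synthetic_control_weights(donors, target)
pred = w @ donors                        # calibrated simulation
```

Because the weights are fit post hoc on observed responses, the same recipe applies to any LLM simulator, which is what makes the layer model-agnostic.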
Results
The experiments reveal that SYN-DIGITS achieves up to 50% relative improvements in individual-level correlation and 50-90% reductions in distributional discrepancies when compared to uncalibrated baselines. The framework consistently outperforms existing calibration methods across various scenarios.
Implications
SYN-DIGITS has significant implications for fields that rely on accurate human behavior simulation, such as market research, recommender systems, and social sciences. By improving the calibration of LLMs, the framework enhances the reliability of digital twin simulations, making them more applicable in real-world decision-making and research contexts.
Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance
Federated Learning
Time Series
Efficient ML
- AeroConv1D model designed for efficient predictive maintenance in aerospace using federated learning.
- INT4 quantization achieves accuracy similar to FP32 while reducing communication costs by 8x.
- Non-IID evaluation reveals the limitations of IID client partitioning in assessing quantization performance.
- INT2 quantization leads to instability in performance metrics, making it impractical for deployment.
Quantization Impact on the Accuracy and Communication Efficiency Trade-off in Federated Learning for Aerospace Predictive Maintenance
Summary
This paper explores the challenges of deploying federated learning (FL) for predictive maintenance in aerospace contexts, particularly focusing on the trade-off between accuracy and communication efficiency when using quantization techniques. The study introduces AeroConv1D, a lightweight 1-D convolutional model designed for FPGA inference, and evaluates its performance under varying levels of symmetric uniform quantization (32, 8, 4, and 2 bits) on the NASA C-MAPSS benchmark. A multi-seed evaluation reveals that INT4 quantization achieves accuracy comparable to full precision (FP32) while significantly reducing communication costs. The paper highlights the methodological pitfalls of using IID client partitioning, which can misrepresent the true performance of quantization under realistic Non-IID conditions. The findings indicate that while INT2 quantization shows some improvements in mean absolute error (MAE), it leads to instability in performance metrics, rendering it unsuitable for practical applications. Additionally, the paper provides FPGA resource projections, confirming that INT4 quantization is feasible for deployment on Xilinx ZCU102 hardware. Overall, this work emphasizes the importance of accurate evaluation methods in quantization studies and provides insights for future research in federated learning for aerospace applications.
Methodology
The study employs a multi-seed evaluation approach (N = 10) to assess the performance of AeroConv1D under different quantization levels. It utilizes paired t-tests to determine statistical significance and compares results under IID and Non-IID client partitioning to highlight methodological biases.
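Symmetric uniform quantization of the kind evaluated here reduces to a scale-round-clip scheme; a minimal sketch (per-tensor scaling assumed, which may differ from the paper's exact setup):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric uniform quantization: map weights onto a signed integer
    grid of 2**bits levels, then dequantize back to floats. Returns the
    dequantized tensor and the bytes needed to communicate it."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, w.size * bits // 8

rng = np.random.default_rng(0)
w = rng.normal(size=(1000,)).astype(np.float32)
w4, bytes4 = quantize_symmetric(w, 4)     # INT4 round trip
w32_bytes = w.size * 32 // 8              # FP32 communication cost
err4 = np.abs(w - w4).max()
```

The 8x communication saving of INT4 over FP32 falls directly out of the bit budget; the paper's contribution is measuring how much accuracy that saving costs under IID versus Non-IID partitioning.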
Results
The results indicate that INT4 quantization maintains accuracy comparable to FP32 across different metrics, while INT2 quantization, despite showing lower MAE on one subset, suffers from high instability in performance scores. The communication cost is significantly reduced with INT4, making it a viable option for deployment in bandwidth-constrained environments.
Implications
The findings suggest that careful consideration of quantization techniques is crucial for deploying federated learning in real-world aerospace applications. The study also underscores the need for accurate evaluation methods to avoid misleading conclusions about model performance under varying data distributions.
KV Cache Offloading for Context-Intensive Tasks
NLP
Large Language Models
Efficient ML
- Introduces the Text2JSON benchmark for evaluating KV-cache offloading on context-intensive tasks.
- Identifies significant performance degradation in existing KV offloading methods for Llama 3 and Qwen 3 models.
- Proposes a new strategy to improve accuracy in KV-cache offloading.
- Highlights the inadequacy of current benchmarks in capturing the challenges of context-intensive tasks.
KV Cache Offloading for Context-Intensive Tasks
Summary
This paper addresses the challenges posed by key-value (KV) cache in large language models (LLMs) when handling long-context inputs, particularly in context-intensive tasks that require extensive information retrieval. The authors introduce the Text2JSON benchmark, designed to evaluate KV-cache offloading techniques on tasks that necessitate extracting structured knowledge from raw text. Their findings reveal significant performance degradation in existing KV offloading methods when applied to context-intensive tasks, particularly with the Llama 3 and Qwen 3 models. The authors identify two primary reasons for this degradation: low-rank projection of keys and unreliable landmarks. To mitigate these issues, they propose a simpler alternative strategy that enhances accuracy across various LLM families and benchmarks. The study emphasizes the need for more rigorous evaluations of long-context compression techniques to ensure their effectiveness in real-world applications.
Methodology
The authors conducted systematic evaluations of KV-cache offloading techniques across a range of benchmarks, including the newly introduced Text2JSON. They analyzed performance degradation in existing methods and identified key factors contributing to accuracy loss. A new strategy was proposed to enhance the performance of KV offloading in context-intensive scenarios.
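A landmark mechanism of the kind the paper critiques can be sketched as block-mean key summaries scored against the current query (illustrative only; the paper argues such landmarks are unreliable on context-intensive tasks, and its proposed alternative is not reproduced here):

```python
import numpy as np

def select_blocks_by_landmark(keys, query, block, topk):
    """Landmark-style retrieval: summarize each block of offloaded keys
    by its mean (the 'landmark'), score landmarks against the current
    query, and fetch only the top-scoring blocks back to the GPU."""
    n_blocks = keys.shape[0] // block
    landmarks = keys[: n_blocks * block].reshape(n_blocks, block, -1).mean(axis=1)
    chosen = np.argsort(landmarks @ query)[-topk:]
    return np.sort(chosen)

rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 64))   # 256 cached keys, head dim 64
keys[200] *= 4.0                    # plant one strongly relevant key
query = keys[200].copy()
blocks = select_blocks_by_landmark(keys, query, block=16, topk=4)
```

The failure mode the paper identifies follows from this picture: when relevant keys do not dominate their block means, the landmark scores stop tracking the true attention targets.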
Results
The study found that existing KV offloading methods resulted in significant performance drops on context-intensive tasks. The proposed alternative strategy improved accuracy across multiple LLM families and benchmarks, demonstrating the potential for better handling of long-context inputs.
Implications
The findings suggest that current KV-cache offloading techniques may not be suitable for all types of tasks, particularly those requiring extensive context retrieval. The proposed strategies could lead to more effective applications of LLMs in real-world scenarios, such as document translation and legal analysis, where context is crucial.
Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing
Theory
Generative Models
Optimization
- Current learned PDE solvers often misrepresent the underlying physics by focusing on state prediction rather than transport dynamics.
- Flow learners provide a more accurate framework by parameterizing transport vector fields, enabling better modeling of uncertainty and continuous dynamics.
- The proposed paradigm supports improved predictions over long time horizons and in chaotic or partially observed environments.
- The authors advocate for a shift in the research agenda towards a physics-to-physics approach in the design of learned solvers.
Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing
Summary
This paper addresses the challenges of solving partial differential equations (PDEs) in scientific computing, emphasizing the limitations of current learned solvers that primarily focus on state prediction. The authors propose a new paradigm called 'flow learners,' which parameterizes transport vector fields to generate trajectories through integration, aligning more closely with the continuous dynamics of PDE evolution. The paper critiques existing methods such as physics-informed neural networks and neural operators for their reliance on snapshot predictions, which can lead to inaccuracies over long time horizons and in complex scenarios. By shifting the focus from state prediction to modeling transport over physically admissible futures, flow learners offer enhanced capabilities for uncertainty quantification and continuous-time predictions. The authors outline a research agenda that emerges from this new perspective, advocating for a physics-to-physics alignment in learned PDE solving that could significantly improve the efficiency and applicability of machine learning in scientific computing.
Methodology
The authors introduce flow learners as a new class of models that parameterize transport vector fields instead of predicting states. This involves integrating or sampling the induced dynamics to generate trajectories that reflect the continuous evolution defined by PDEs. The paper discusses the theoretical underpinnings of this approach and contrasts it with traditional regression-based methods.
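The contrast with snapshot prediction can be made concrete: a flow learner parameterizes the vector field and obtains states only by integration. A toy sketch with a known linear field (the learned field here is a stand-in, not any model from the paper):

```python
import numpy as np

def rollout(v, u0, dt, steps):
    """Flow-learner rollout: rather than regressing the next state
    directly, integrate a learned transport vector field v(u) with
    explicit Euler steps to generate a continuous-time trajectory."""
    traj = [u0]
    u = u0
    for _ in range(steps):
        u = u + dt * v(u)
        traj.append(u)
    return np.stack(traj)

# Toy 'learned' field: linear decay du/dt = -u, exact solution u0*exp(-t).
v = lambda u: -u
u0 = np.array([1.0, 2.0])
traj = rollout(v, u0, dt=0.01, steps=100)   # integrate to t = 1
```

Because the model outputs a field rather than a state, the same parameters support arbitrary time resolutions and sampling-based uncertainty estimates, which is the crux of the paradigm shift the authors advocate.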
Results
The paper argues that flow learners can overcome the limitations of existing learned PDE solvers by providing a more robust framework for capturing the dynamics of PDEs. This shift is expected to lead to better performance in terms of prediction accuracy and computational efficiency, particularly in complex and uncertain environments.
Implications
The proposed flow learners could revolutionize the way PDEs are solved in scientific computing, enabling faster and more reliable simulations in fields such as climate modeling, engineering, and medical simulations. The physics-to-physics paradigm may facilitate new applications in inverse design, adaptive control, and uncertainty-aware planning.
Pruning Extensions and Efficiency Trade-Offs for Sustainable Time Series Classification
Time Series
Efficient ML
- Introduces a unified methodology for evaluating performance and efficiency trade-offs in TSC.
- Presents a pruning strategy for hybrid classifiers Hydra and Quant, leading to the development of Hydrant.
- Demonstrates significant energy savings (up to 80%) with minimal impact on accuracy (less than 5%).
- Conducts extensive experiments across diverse datasets and hardware setups to validate findings.
Pruning Extensions and Efficiency Trade-Offs for Sustainable Time Series Classification
Summary
This paper addresses the need for a unified understanding of performance trade-offs in time series classification (TSC), particularly focusing on energy efficiency. The authors introduce a holistic evaluation framework that balances predictive performance and resource consumption. They propose a pruning strategy applied to two leading hybrid classifiers, Hydra and Quant, resulting in a new model called Hydrant. Through extensive experimentation involving over 4000 configurations across 20 MONSTER datasets and various compute setups, the study reveals that pruning can reduce energy consumption by up to 80% while maintaining competitive accuracy, typically sacrificing less than 5% in predictive quality. The findings emphasize the importance of resource awareness in TSC and provide a foundation for sustainable practices in machine learning.
Methodology
The authors developed a holistic evaluation framework that integrates formalizations for TSC methods, including a pruning strategy for existing classifiers. They conducted systematic experiments with various model designs, hyperparameters, and hardware configurations to analyze the impact on energy efficiency and predictive performance.
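The pruning step can be illustrated as importance-ranked feature selection (a generic sketch; the actual criterion used for Hydra and Quant kernels may differ):

```python
import numpy as np

def prune_features(features, importances, keep_ratio):
    """Importance-based pruning: rank transform features by a score and
    keep only the top fraction, trading a little accuracy for
    proportionally less compute and energy per prediction."""
    k = max(1, int(keep_ratio * features.shape[1]))
    keep = np.sort(np.argsort(importances)[-k:])
    return features[:, keep], keep

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))     # e.g. a Hydra/Quant-style feature bank
imp = np.abs(rng.normal(size=5000))  # stand-in importance scores
Xp, kept = prune_features(X, imp, keep_ratio=0.2)
```

Keeping 20% of the features cuts the downstream transform and classifier work roughly in proportion, which is where the reported energy savings come from.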
Results
The study found that the proposed pruning strategy significantly reduces energy consumption by up to 80% while maintaining competitive predictive accuracy, with an average accuracy loss of less than 5%. The results highlight intricate performance trade-offs influenced by model design and hardware choices.
Implications
The findings suggest that TSC can be made more sustainable through resource-aware practices, which could lead to broader applications in fields such as healthcare and environmental monitoring. The proposed methods and software repository also encourage reproducibility and open science in AI and ML research.
DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting
Time Series
Graph Learning
Interpretability
- DSPR effectively decouples stable trends from regime-dependent dynamics in industrial time series forecasting.
- The framework incorporates an Adaptive Window module and a Physics-Guided Dynamic Graph to enhance physical plausibility.
- DSPR achieves state-of-the-art performance with over 99% Mean Conservation Accuracy and 97.2% Total Variation Ratio.
- The model provides interpretable insights that align with known physical mechanisms, aiding scientific analysis.
DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting
Summary
The paper introduces DSPR (Dual-Stream Physics-Residual Networks), a novel framework designed to enhance the forecasting accuracy of industrial time series while ensuring physical plausibility. Traditional data-driven models often excel in statistical performance but fail to respect the complex dynamics and transport delays inherent in real-world industrial systems. DSPR addresses this by decoupling stable temporal patterns from regime-dependent residual dynamics through a dual-stream architecture. The first stream captures the statistical evolution of individual variables, while the second stream focuses on residual dynamics using an Adaptive Window module for flow-dependent transport delays and a Physics-Guided Dynamic Graph to learn time-varying interaction structures. Experiments on four industrial benchmarks demonstrate that DSPR significantly improves forecasting accuracy and robustness during regime shifts, achieving state-of-the-art predictive performance with high conservation accuracy and total variation ratios. Additionally, the framework provides interpretable insights into the learned interaction structures and adaptive lags, aligning with known physical mechanisms, thus bridging the gap between advanced forecasting models and trustworthy autonomous control systems.
Methodology
DSPR employs a dual-stream architecture that separates the modeling of stable temporal patterns from regime-dependent residual dynamics. The first stream captures statistical trends, while the second stream utilizes an Adaptive Window module to estimate transport delays and a Physics-Guided Dynamic Graph to incorporate physical priors for learning dynamic interaction structures.
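The Adaptive Window module's job, estimating a flow-dependent transport delay, can be sketched with lagged cross-correlation (a simple fixed-lag stand-in for the learned, adaptive module):

```python
import numpy as np

def estimate_delay(upstream, downstream, max_lag):
    """Pick the lag that maximizes correlation between an upstream
    driver series and its delayed downstream response -- a stand-in
    for the adaptive, flow-dependent delay the paper's module learns."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        n = len(upstream) - lag
        c = np.corrcoef(upstream[:n], downstream[lag:])[0, 1]
        if c > best_corr:
            best_lag, best_corr = lag, c
    return best_lag, best_corr

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.roll(x, 7) + 0.05 * rng.normal(size=500)   # downstream lags by 7 steps
lag, corr = estimate_delay(x, y, max_lag=20)
```

In a real plant the delay varies with flow rate, which is why DSPR makes the window adaptive rather than fixing a single lag as this sketch does.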
Results
DSPR was validated on four diverse datasets, achieving significant improvements in forecasting accuracy and robustness during regime shifts. It recorded a Mean Conservation Accuracy exceeding 99% and a Total Variation Ratio of up to 97.2%, outperforming existing state-of-the-art models.
Implications
The DSPR framework offers a promising approach for trustworthy industrial time series forecasting, with potential applications in safety-critical domains such as emission control and power dispatch. Its ability to provide interpretable insights into physical mechanisms makes it valuable for scientific analysis and operational decision-making.
SCOT: Multi-Source Cross-City Transfer with Optimal-Transport Soft-Correspondence Objective
Optimization
Graph Learning
Theory
- SCOT addresses the challenge of explicit soft correspondence in cross-city transfer learning.
- The framework utilizes Sinkhorn-based entropic optimal transport for aligning region representations.
- An OT-weighted contrastive objective enhances semantic separation and transferability.
- SCOT shows significant improvements in transfer accuracy and robustness across various urban prediction tasks.
SCOT: Multi-Source Cross-City Transfer with Optimal-Transport Soft-Correspondence Objective
Summary
The paper introduces SCOT, a novel framework for cross-city transfer learning that addresses the challenges of aligning region representations from different cities with incompatible partitions and no ground-truth correspondences. Traditional methods often rely on heuristic matching or distribution-level alignment, which can be unstable and sensitive to anchor choices. SCOT leverages Sinkhorn-based entropic optimal transport to learn explicit soft correspondences between unequal region sets, enhancing the transferability of learned representations. The framework incorporates an OT-weighted contrastive objective to sharpen semantic distinctions and a cycle-style reconstruction regularizer to stabilize optimization. SCOT also extends to multi-source transfer by aligning multiple cities to a shared prototype hub, guided by a target-induced prior to prevent source domination. Experimental results demonstrate that SCOT consistently outperforms strong baselines in predicting urban metrics such as GDP, population, and CO2 emissions, showcasing improved robustness in scenarios with heterogeneous data and scarce labels.
Methodology
SCOT employs a Sinkhorn-based entropic optimal transport framework to establish explicit soft correspondences between unequal region sets. It incorporates an OT-weighted contrastive objective to enhance semantic discriminability and a cycle reconstruction regularizer for optimization stability. For multi-source transfer, SCOT aligns each source city to a shared prototype hub using balanced entropic transport, guided by a target-induced prior.
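The Sinkhorn-based soft correspondence can be sketched directly: alternately rescaling the rows and columns of exp(-C/eps) yields a coupling between unequal region sets (uniform marginals and toy embeddings assumed for simplicity):

```python
import numpy as np

def sinkhorn(C, eps=0.1, iters=2000):
    """Entropic OT via Sinkhorn: alternately rescale rows and columns of
    exp(-C/eps) until the coupling's marginals match. The coupling is a
    soft correspondence between two (possibly unequal) region sets."""
    n, m = C.shape
    K = np.exp(-C / eps)
    r, c = np.ones(n) / n, np.ones(m) / m   # uniform marginals assumed
    v = np.ones(m)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan, shape (n, m)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))   # 6 source-city region embeddings
B = rng.normal(size=(4, 3))   # 4 target-city regions: unequal sets
C = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
C = C / C.max()               # normalize costs for stable scaling
P = sinkhorn(C)
```

The entropic regularizer eps controls how soft the correspondence is: larger values spread each region's mass over more partners, smaller values approach a hard matching.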
Results
The experiments reveal that SCOT consistently achieves higher transfer accuracy and robustness compared to existing methods across various tasks, including GDP, population, and CO2 predictions. The results indicate that the improvements are attributed to the alignment design rather than the capacity of the underlying encoders.
Implications
SCOT's approach to cross-city transfer learning can be applied to various urban computing tasks, enabling better predictions in label-scarce environments. The framework's interpretable diagnostics can also help urban planners and researchers assess the quality of data alignment across regions.
An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations
Computer Vision
Theory
- Current MU methods often fail to erase the internal representations of forgotten data, leading to potential vulnerabilities.
- Feature-classifier misalignment is a significant issue that can result in the re-emergence of forgotten concepts.
- A new MU method based on class-mean features (CMF) is proposed to enhance alignment between features and classifiers.
- CMF-based unlearning effectively reduces forgotten information while preserving high accuracy on retained classes.
An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations
Summary
This paper investigates the effectiveness of machine unlearning (MU) methods, which aim to erase the influence of specific training data from models without complete retraining. The authors highlight a critical issue: many MU methods, while appearing successful based on output-level metrics, fail to remove the underlying representations of forgotten data. This phenomenon, termed feature-classifier misalignment, indicates that hidden features remain discriminative even after unlearning. The study emphasizes the need for evaluating MU effectiveness through internal representations rather than solely relying on output metrics. By analyzing the alignment between class-mean features and classifiers, the authors propose a new MU method based on a class-mean features (CMF) classifier that better aligns features with classifiers. Experiments demonstrate that CMF-based unlearning effectively reduces the retention of forgotten information while maintaining high accuracy on retained classes, underscoring the importance of representation-level evaluations in assessing MU.
Methodology
The authors conducted an analysis of various state-of-the-art MU methods by examining the internal representations of models post-unlearning. They employed linear probing to assess the discriminative power of hidden features and evaluated the alignment between class-mean features and classifiers. The proposed CMF classifier was tested against standard benchmarks to measure its effectiveness in reducing forgotten information while maintaining accuracy.
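A class-mean features classifier of the kind the authors build on can be sketched as nearest-class-mean assignment (toy data; the paper applies this to deep-network features, not raw inputs):

```python
import numpy as np

def cmf_classifier(features, labels):
    """Class-mean features (CMF) classifier: represent each class by the
    mean of its feature vectors and assign samples to the nearest mean,
    keeping the classifier aligned with the feature geometry."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    def predict(x):
        d = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        return classes[np.argmin(d, axis=1)]
    return predict

rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 8))   # 3 classes, 8-d features
y = np.repeat(np.arange(3), 50)
X = centers[y] + rng.normal(size=(150, 8))
predict = cmf_classifier(X, y)
acc = (predict(X) == y).mean()
```

Because the classifier is tied directly to the class means, any residual discriminative structure for a supposedly forgotten class shows up immediately, which is what makes it a useful probe and unlearning target.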
Results
The results indicated that many MU methods achieved negligible forget accuracy, yet hidden-layer features remained highly discriminative. The proposed CMF-based unlearning method demonstrated a significant reduction in forgotten information in representations while maintaining high retain accuracy across standard benchmarks.
Implications
The findings suggest that current MU methods may not provide true unlearning, posing risks in applications where data privacy and ethical considerations are paramount. The proposed CMF method could enhance the reliability of MU in practical scenarios, ensuring compliance with privacy regulations and improving the trustworthiness of machine learning systems.
Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
Large Language Models
Efficient ML
Optimization
- Introduces a progressive QAT framework that enhances stability during low-bit training.
- Employs outlier channel splitting to mitigate quantization errors effectively.
- Achieves significant speed improvements with custom operators for low-bit configurations.
- Demonstrates superior performance on LLaMA-2/3 compared to existing QAT baselines.
Summary
The paper presents BIT-BY-BIT, a novel progressive quantization-aware training (QAT) framework designed to enhance the stability and efficiency of training large language models (LLMs) at ultra-low precision. Traditional low-bit QAT methods often face challenges such as convergence instability and high training costs, primarily due to quantization noise from outlier channels and error accumulation across layers. BIT-BY-BIT addresses these issues through three main innovations: (1) a block-wise progressive training approach that gradually reduces precision, ensuring stable initialization for low-bit optimization; (2) a nested structure of integer quantization grids that allows a single model to support multiple bit-widths without retraining; and (3) rounding-aware outlier channel splitting, which reduces quantization errors while preserving output integrity. The framework also incorporates microscaling groups with E4M3 scales to align with industry standards. The authors developed custom operators for efficient 2-bit kernels, achieving significant speed improvements. Comprehensive evaluations on LLaMA-2/3 demonstrate that BIT-BY-BIT outperforms existing QAT methods, achieving minimal loss increases compared to full-precision models.
Methodology
The BIT-BY-BIT framework employs a progressive training strategy that reduces precision in stages, starting with weights and then moving to activations. It utilizes a nested quantization grid structure to facilitate deployment across various bit-widths without retraining. Rounding-aware outlier channel splitting is implemented to address quantization errors, and custom operators are developed for efficient low-bit computations.
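One plausible reading of the nested integer grid idea is that each lower-bit grid is a strict subset of the higher-bit grid, so a single stored model can be served at several bit-widths without retraining. The sketch below assumes symmetric uniform quantization with illustrative scales, not the paper's actual parameterization:

```python
def grid(bits, scale):
    """Symmetric uniform integer grid: levels k*scale for k in [-2^(b-1), 2^(b-1)-1]."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [k * scale for k in range(lo, hi + 1)]

def quantize(w, levels):
    """Round a weight to the nearest level of the given grid."""
    return min(levels, key=lambda q: abs(q - w))

# Nesting: build the 2-bit grid with a scale 4x coarser than the 4-bit grid,
# so every 2-bit level is also a 4-bit level and one stored model serves both.
g4 = grid(4, 0.25)
g2 = grid(2, 1.0)
nested = all(any(abs(a - b) < 1e-12 for b in g4) for a in g2)
```

Serving at 2 bits then just means snapping each stored 4-bit level to its enclosing coarse level, which is why no retraining is needed per bit-width.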
Results
The BIT-BY-BIT framework significantly outperformed baseline methods like BitDistiller and EfficientQAT on LLaMA-2/3 models, achieving only a +2.25 increase in perplexity on WikiText2 under W2A2 quantization settings, compared to full precision. The method also demonstrated up to 11× speedup over BF16 configurations.
Implications
The advancements in low-bit QAT presented in this paper could lead to more efficient deployment of large language models in resource-constrained environments, enabling broader accessibility and application of LLMs in various domains.
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
NLP
Large Language Models
Efficient ML
- Introduces the concept of an activation budget for expert activations in MoE models.
- Presents Alloc-L and Alloc-T strategies for optimizing expert allocation at layer and token levels, respectively.
- Demonstrates that Alloc-MoE maintains model performance while significantly improving inference speed.
- Achieves notable speedups on DeepSeek-V2-Lite with reduced expert activations.
Summary
The paper introduces Alloc-MoE, a novel framework designed to optimize expert activation allocation in Mixture-of-Experts (MoE) models, particularly under resource constraints. MoE architectures are known for their sparse activation mechanism, which enhances the scalability of large language models. However, the high number of expert activations can lead to significant latency during inference. The authors propose the concept of an 'activation budget' to manage the number of expert activations effectively. Alloc-MoE operates at both the layer and token levels, employing two main strategies: Alloc-L for layer-level allocation, which uses sensitivity profiling and dynamic programming to determine optimal expert distribution, and Alloc-T for token-level redistribution based on routing scores. The framework aims to minimize performance degradation while adhering to a fixed activation budget. Experimental results show that Alloc-MoE achieves substantial speedups in inference times without compromising model accuracy, specifically demonstrating 1.15× speedup in prefill and 1.34× in decode on the DeepSeek-V2-Lite model while using only half of the original activation budget.
Methodology
The methodology involves two main components: Alloc-L, which optimizes layer-level expert activation allocation using sensitivity profiling and dynamic programming, and Alloc-T, which reallocates expert activations at the token level based on routing scores. This dual-level approach allows for efficient budget management without incurring additional latency.
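The layer-level step (Alloc-L) pairs sensitivity profiling with dynamic programming: given a profiled score for activating k experts in each layer, choose per-layer counts that maximize total score under the global activation budget. A minimal sketch under toy profiled scores (not the paper's actual profiles or objective) might look like:

```python
def alloc_layers(profiles, budget):
    """Dynamic program over layers: best[b] maps a used budget b to the
    highest-scoring (score, per-layer allocation) achievable with exactly b
    expert activations across the layers processed so far."""
    best = {0: (0.0, [])}
    for prof in profiles:  # prof[k] = profiled score of activating k experts
        nxt = {}
        for b, (s, alloc) in best.items():
            for k, q in enumerate(prof):
                nb = b + k
                if nb > budget:
                    continue
                cand = (s + q, alloc + [k])
                if nb not in nxt or cand[0] > nxt[nb][0]:
                    nxt[nb] = cand
        best = nxt
    return max(best.values())  # highest total score within the budget

# Toy profiles for 3 layers: score of activating 0..2 experts per layer.
profiles = [[0.0, 0.9, 1.0], [0.0, 0.2, 0.3], [0.0, 0.8, 1.5]]
score, alloc = alloc_layers(profiles, budget=4)
```

With these toy numbers the optimum spends the fourth activation on the most sensitive layer rather than doubling up on the first, which is the kind of non-uniform allocation uniform top-k routing cannot express.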
Results
Alloc-MoE was tested on multiple MoE models, achieving a 1.15× speedup in prefill and a 1.34× speedup in decode on the DeepSeek-V2-Lite model while maintaining performance close to the original model with full expert activations. The framework effectively balances the trade-off between reduced expert activations and model accuracy.
Implications
The findings suggest that Alloc-MoE can facilitate the deployment of large language models in resource-constrained environments by optimizing expert activation allocation. This can lead to more efficient inference in real-world applications, particularly in scenarios where latency is critical.
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
Reinforcement Learning
- Prediction Arena benchmarks AI models in real-world prediction markets, providing objective evaluation metrics.
- Cohort 1 models showed significant performance differences on different platforms, with Polymarket yielding better returns than Kalshi.
- The study identifies key factors influencing model performance, including initial prediction accuracy and capitalizing on correct predictions.
- Computational efficiency does not correlate with performance, challenging assumptions about model complexity.
Summary
The paper introduces Prediction Arena, a novel benchmark designed to evaluate the predictive accuracy and decision-making capabilities of AI models by allowing them to autonomously trade on real prediction markets using actual capital. This approach contrasts with traditional synthetic benchmarks by providing objective ground truth in live environments, specifically on platforms like Kalshi and Polymarket. The evaluation spans 57 days, tracking two cohorts of models: six frontier models engaged in live trading and four next-generation models in paper trading. The results reveal a performance hierarchy among the models, with significant differences in returns across platforms. Notably, the models in Cohort 1 experienced returns ranging from -16.0% to -30.8% on Kalshi, while on Polymarket, they averaged only -1.1%. The study also highlights the importance of platform design on model success, as demonstrated by the superior performance of gemini-3.1-pro-preview on Polymarket. Beyond performance metrics, the paper analyzes computational efficiency and trading behaviors, providing insights into how models operate under real financial pressure. Overall, Prediction Arena serves as a comprehensive evaluation framework for assessing AI models in realistic trading scenarios.
Methodology
The methodology involves deploying AI models as autonomous traders in live prediction markets. Two cohorts of models are evaluated: Cohort 1 consists of six frontier models engaged in live trading on Kalshi and Polymarket, while Cohort 2 includes four next-generation models in paper trading. The evaluation tracks performance metrics such as account value, profit and loss (PnL), and win rates over a 57-day period.
Results
Cohort 1 models on Kalshi experienced returns between -16.0% and -30.8%, while on Polymarket, they averaged -1.1%. The model grok-4-20-checkpoint achieved a 71.4% settlement win rate, the highest across all platforms. Cohort 2's gemini-3.1-pro-preview, which did not trade on Kalshi, achieved a +6.02% return on Polymarket in just three days.
Implications
The findings suggest that real-world benchmarks like Prediction Arena can provide more accurate assessments of AI model capabilities compared to synthetic environments. This has implications for the development and deployment of AI in financial decision-making contexts, emphasizing the need for models that can perform effectively under real financial pressures.
Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity
Generative Models
Optimization
Efficient ML
- Fewer-Frames method reduces training time but increases error and inconsistency in generated videos.
- Local Optimization method improves training efficiency and reduces error accumulation compared to Fewer-Frames.
- Representation Continuity strategy enhances video consistency and robustness while maintaining training speed.
- Experimental results show the proposed methods outperform existing autoregressive video generation techniques.
Summary
This paper addresses the challenges of high computational costs and prolonged training times associated with autoregressive video generation models. The authors conduct empirical analyses revealing that training on fewer video frames reduces training time but increases error accumulation and inconsistencies in generated videos. To mitigate these issues, they propose a Local Optimization (Local Opt.) method that optimizes tokens within localized windows while leveraging contextual information to minimize error propagation. Additionally, they introduce a Representation Continuity (ReCo) strategy that employs continuity loss to enhance the consistency of generated videos. Experimental results on class- and text-to-video datasets demonstrate that the proposed methods achieve superior performance compared to baseline models, halving the training cost without sacrificing quality. The study provides insights into training acceleration and presents theoretical proofs supporting the advantages of Local Opt. and ReCo in reducing error accumulation and improving video consistency.
Methodology
The authors explore a Fewer-Frames method for training, followed by the development of a Local Optimization method that optimizes tokens in localized windows. They also introduce a Representation Continuity strategy that incorporates continuity loss to enhance consistency. The methods are validated through extensive experiments on various datasets, with theoretical proofs provided for the advantages of Local Opt. and ReCo.
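The continuity loss in the ReCo strategy can be illustrated as a penalty on representation differences between adjacent frames. The mean-squared form below is a plausible sketch on toy per-frame representations, not the paper's exact loss:

```python
def continuity_loss(reps):
    """Mean squared difference between representations of consecutive frames,
    encouraging temporally smooth latent trajectories."""
    total, count = 0.0, 0
    for prev, cur in zip(reps, reps[1:]):
        for p, c in zip(prev, cur):
            total += (p - c) ** 2
            count += 1
    return total / count

# Three frames, each with a 2-dim representation (toy values).
frames = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.1]]
loss = continuity_loss(frames)
```

In training this term would be added to the generation loss, trading a small amount of per-frame freedom for lower drift across the rollout.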
Results
The proposed methods significantly reduce training time and error accumulation, achieving superior performance compared to baseline models. The Local Optimization method and Representation Continuity strategy lead to improved video quality and consistency, with experimental evaluations indicating that the approach can achieve twice the training speed of the baseline while maintaining quality.
Implications
The findings suggest that autoregressive video generation can be made more efficient, enabling broader applications in real-time video generation and interactive media. The methods may also be applicable to other generative tasks requiring consistency and efficiency.
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
NLP
Large Language Models
Interpretability
- Introduces a multi-token activation patching framework for analyzing steering vectors in LLMs.
- Finds that refusal steering interacts mainly with the OV circuit of the attention mechanism.
- Demonstrates that freezing attention scores has a negligible effect on steering performance.
- Reveals that steering vectors can be sparsified by up to 90-99% while retaining performance.
Summary
This paper investigates the internal mechanisms of steering vectors applied to large language models (LLMs) for model alignment, particularly focusing on refusal steering. The authors propose a multi-token activation patching framework to analyze how different steering methodologies interact with model components. Their findings reveal that steering vectors primarily affect the attention mechanism through the OV circuit while largely ignoring the QK circuit. The study demonstrates that freezing attention scores during steering has a minimal impact on performance, suggesting that refusal steering is robust to certain model components. Additionally, the authors introduce a mathematical decomposition of the steered OV circuit, which provides semantically interpretable concepts. They also show that steering vectors can be significantly sparsified (by 90-99%) without substantial loss in performance, indicating that different steering methodologies converge on a small set of important dimensions. This work enhances the understanding of LLMs and offers insights for improving steering robustness and effectiveness.
Methodology
The authors employed a multi-token activation patching approach to extend circuit discovery to steered generations. They analyzed the interaction of steering vectors with model components, particularly focusing on the attention mechanism. The study involved mathematical decomposition of the OV circuit and performance evaluation of various steering methodologies.
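The reported 90-99% sparsification can be illustrated by keeping only the largest-magnitude dimensions of a steering vector and zeroing the rest; the dimensions and values below are made up:

```python
def sparsify(vec, keep_frac):
    """Zero out all but the largest-magnitude dimensions of a steering vector."""
    k = max(1, int(len(vec) * keep_frac))
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

# Toy 10-dim steering vector; keep 2 of 10 dims (i.e. 80% sparsified).
v = [0.05, -2.0, 0.1, 1.5, -0.02, 0.3, 0.01, -0.8, 0.2, 0.04]
sparse = sparsify(v, 0.2)
```

The paper's finding is that adding such a sparsified vector to the residual stream steers refusal nearly as well as the dense one, suggesting the methods converge on a few important dimensions.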
Results
The research found that refusal steering primarily interacts with the OV circuit, with minimal performance degradation (approximately 8.75%) when freezing attention scores. The decomposition of the OV circuit revealed semantically interpretable concepts, and the ability to sparsify steering vectors by 90-99% while maintaining performance was demonstrated.
Implications
The findings provide a deeper mechanistic understanding of how steering vectors operate within LLMs, which can inform the design of more effective steering interventions. This research could lead to improved safety and alignment of LLMs in various applications, particularly in contexts requiring nuanced refusal capabilities.
The Impact of Dimensionality on the Stability of Node Embeddings
Graph Learning
- Dimensionality significantly affects the stability of node embeddings.
- Different embedding methods exhibit varying stability patterns with increased dimensionality.
- Maximum stability does not necessarily align with optimal performance in downstream tasks.
- The study emphasizes the importance of selecting appropriate embedding dimensions.
Summary
This paper investigates the influence of embedding dimensionality on the stability and performance of node embeddings generated by various methods. Previous research has shown that node embeddings can vary significantly even with identical training parameters due to randomness in the training process. The authors systematically evaluate five popular node embedding techniques (ASNE, DGI, GraphSAGE, node2vec, and VERSE) across multiple datasets and varying embedding dimensions. They assess stability from both representational and functional perspectives, alongside performance metrics for downstream tasks such as node classification and link prediction. The findings reveal that stability is highly dependent on the dimensionality of the embeddings, with some methods like node2vec and ASNE becoming more stable as dimensionality increases, while others do not follow this trend. Importantly, the study highlights that maximum stability does not always correlate with optimal performance, emphasizing the need for careful selection of embedding dimensions. The authors provide code for reproducibility, contributing to the understanding of trade-offs in graph representation learning.
Methodology
The authors conducted a systematic evaluation of five widely used node embedding methods across multiple datasets, varying the dimensionality of the embeddings. They assessed stability from both representational and functional perspectives, and evaluated performance on downstream tasks such as node classification and link prediction.
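One common way to quantify representational stability across retraining runs, consistent with the setup described here, is the overlap of k-nearest-neighbour sets between two embeddings of the same nodes; this is rotation- and reflection-invariant, which matters because independent runs differ by such transforms. The exact metrics in the paper may differ; the sketch below is illustrative:

```python
def knn(emb, i, k):
    """Indices of the k nearest neighbours of node i (squared Euclidean)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    order = sorted((j for j in range(len(emb)) if j != i),
                   key=lambda j: d(emb[i], emb[j]))
    return set(order[:k])

def knn_stability(emb_a, emb_b, k=2):
    """Mean overlap of k-NN sets between two embeddings of the same nodes."""
    n = len(emb_a)
    return sum(len(knn(emb_a, i, k) & knn(emb_b, i, k)) / k for i in range(n)) / n

# Two 'runs' with identical geometry up to a sign flip: perfect stability.
run1 = [[0, 0], [0, 1], [5, 5], [5, 6]]
run2 = [[0, 0], [0, -1], [-5, -5], [-5, -6]]
s = knn_stability(run1, run2)
```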
Results
The results indicate that embedding stability varies significantly with dimensionality, with methods like node2vec and ASNE showing increased stability at higher dimensions, while others do not. Furthermore, the study found that the highest stability does not always coincide with the best performance on downstream tasks.
Implications
The findings suggest that researchers and practitioners should carefully consider the dimensionality of node embeddings to balance stability and performance. This has implications for applications in social networks, recommendation systems, and other domains where graph representation learning is critical.
Rethinking Residual Errors in Compensation-based LLM Quantization
Large Language Models
Efficient ML
Optimization
- Introduces a refined calibration objective for quantization that aligns outputs with the original model rather than compensated weights.
- Defines 'compensation-aware error' to capture intra-layer discrepancies introduced by weight compensation.
- Utilizes neuron decomposition techniques to efficiently incorporate the new error formulation into weight updates.
- Demonstrates significant performance improvements in quantization for LLMs with minimal modifications to existing methods.
Summary
This paper addresses the challenges of quantizing Large Language Models (LLMs) through a novel approach to weight compensation. Building on previous works like GPTQ and GPTAQ, the authors critique existing calibration objectives that align quantized outputs with compensated weights instead of the original full-precision outputs. They propose a refined calibration objective that aims to align the quantized model's output directly with the original model's output, thereby redefining the concept of residual error. The authors introduce the notion of 'compensation-aware error,' which accounts for discrepancies not only from preceding layers but also from within the current layer due to weight compensation. By leveraging neuron decomposition techniques, they efficiently integrate this new error formulation into the weight update process. Extensive experiments demonstrate that their enhancements significantly improve quantization performance across various LLMs and settings, while requiring minimal modifications to existing frameworks.
Methodology
The authors reformulate the calibration objective for quantization by aligning the quantized model's output with the original full-precision output. They introduce the concept of compensation-aware error, which is integrated into the weight update process using neuron decomposition techniques. This approach allows for efficient computation and incorporation of the new error formulation into existing quantization frameworks like GPTQ and GPTAQ.
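The contrast between the two calibration targets can be sketched on a toy linear layer: a GPTQ-style residual measures the quantized output's distance to the output of the compensated weights, while the refined objective measures its distance to the original full-precision output, thereby also capturing the intra-layer error that compensation itself introduced. All matrices below are illustrative, not from the paper:

```python
def matmul(X, W):
    """Plain dense X @ W on nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def frob_err(A, B):
    """Squared Frobenius distance between two equal-shaped matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

X = [[1.0, 2.0], [0.5, -1.0]]           # calibration inputs
W = [[0.30, -0.70], [0.62, 0.41]]       # original full-precision weights
W_comp = [[0.25, -0.75], [0.60, 0.45]]  # weights after earlier compensation steps
W_q = [[0.25, -0.75], [0.50, 0.50]]     # quantized weights

# GPTQ-style residual: distance to the *compensated* layer's output.
err_compensated = frob_err(matmul(X, W_q), matmul(X, W_comp))
# Refined objective: distance to the *original* model's output, which also
# absorbs the intra-layer discrepancy that compensation introduced.
err_original = frob_err(matmul(X, W_q), matmul(X, W))
```

In this toy case the compensation-aware error is strictly larger, i.e. calibrating against compensated weights underestimates the true deviation from the original model.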
Results
The proposed enhancements lead to significant improvements in quantization performance across various large language models and quantization settings, validating the effectiveness of the new calibration objective and compensation-aware error formulation.
Implications
The findings suggest that refining calibration objectives in quantization can enhance the deployment of large language models in resource-constrained environments, making advanced AI technologies more accessible and efficient.
Decisions and Deployment: The Five-Year SAHELI Project (2020-2025) on Restless Multi-Armed Bandits for Improving Maternal and Child Health
Reinforcement Learning
Optimization
- Restless Multi-Armed Bandits effectively optimize limited public health interventions.
- Decision-focused learning enhances the predict-then-optimize approach in healthcare settings.
- Long-term interventions led to improved adherence to mHealth programs and better health behaviors.
Summary
The SAHELI project, a collaboration between AI researchers and ARMMAN, addresses the challenge of optimizing limited healthcare resources for maternal and child health in India through the application of Restless Multi-Armed Bandits (RMAB). The project aims to enhance engagement in mobile health (mHealth) programs, particularly targeting underserved populations. The RMAB framework is utilized to predict beneficiaries' listenership patterns and to develop a scheduling policy for live service calls from healthcare workers. The project began in 2020 and has been operational since April 2022, benefiting over 350,000 mothers. The methodology includes a two-stage learning process and decision-focused learning, moving from a predict-then-optimize paradigm to a more dynamic approach. Empirical evaluations through field studies demonstrate significant improvements in both engagement with the mHealth program and positive shifts in maternal health behaviors. The findings suggest that AI-driven interventions can effectively enhance public health outreach, providing a scalable model for resource allocation in constrained settings.
Methodology
The project employs a Restless Multi-Armed Bandit (RMAB) model to formulate the problem of scheduling live service calls. It incorporates a two-stage learning process to adaptively learn parameters from data and utilizes decision-focused learning to optimize resource allocation dynamically.
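Operationally, an RMAB policy of this kind reduces to ranking beneficiaries by an index value (e.g., a Whittle index derived from each beneficiary's engagement MDP, assumed precomputed here) and spending the weekly call budget on the top-ranked ones. A toy sketch with hypothetical index values:

```python
def schedule_calls(indices, budget):
    """Spend the weekly live-call budget on the beneficiaries with the
    highest index values (index computation from each engagement MDP is
    assumed to have happened upstream)."""
    ranked = sorted(indices, key=indices.get, reverse=True)
    return sorted(ranked[:budget])

# Hypothetical per-beneficiary Whittle index values for one week.
whittle = {"b1": 0.12, "b2": 0.40, "b3": 0.05, "b4": 0.33}
called = schedule_calls(whittle, budget=2)
```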
Results
The SAHELI project resulted in sustained improvements in beneficiary engagement and statistically significant positive changes in maternal health behaviors, demonstrating the effectiveness of AI-augmented outreach in public health.
Implications
The findings indicate that AI-driven systems can significantly enhance the effectiveness of mHealth programs, providing a blueprint for optimizing resource allocation in public health initiatives, especially in low-resource settings.
An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
Reinforcement Learning
Large Language Models
- RLVR is robust to noise, with up to 15% noise rates yielding minimal performance drops.
- Precision in verification is more important than recall for effective training.
- Diminishing returns are observed when improving verifier accuracy beyond a certain threshold.
- The findings apply across different model families and noise types.
Summary
This paper investigates the robustness of Reinforcement Learning with Verifiable Rewards (RLVR) in the presence of noisy reward signals, particularly in the context of large language models (LLMs) used for tasks like code generation and scientific reasoning. The authors introduce noise into the RL training process and analyze its impact on model performance. They find that noise rates of up to 15% do not significantly degrade peak validation accuracy compared to a clean baseline, suggesting that RLVR can tolerate a considerable amount of noise without compromising effectiveness. The study encompasses various noise types and model families, demonstrating that precision in verification is more critical than recall. The findings indicate that while improving verifier accuracy is beneficial, there are diminishing returns beyond a certain point, leading to the conclusion that an imperfect verifier can still be effective for RLVR applications.
Methodology
The authors conducted experiments by introducing controlled and realistic noise into the RL training process for coding tasks. They measured the impact of this noise on model performance across various models (Qwen3, GLM4, Llama 3.1) and sizes (4B to 9B), comparing results against a clean baseline to assess the robustness of RLVR.
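The controlled-noise setup can be mimicked by flipping a verifier's binary verdict with a fixed probability; the sketch below (illustrative, not the authors' harness) checks that an injected 15% flip rate comes out roughly as configured:

```python
import random

def noisy_verifier(true_pass, noise_rate, rng):
    """Flip the verifier's binary verdict with probability `noise_rate`."""
    return (not true_pass) if rng.random() < noise_rate else true_pass

# Simulate 10k verifications of correct solutions under 15% noise.
rng = random.Random(0)
verdicts = [noisy_verifier(True, 0.15, rng) for _ in range(10_000)]
flip_frac = 1 - sum(verdicts) / len(verdicts)
```

In the RLVR loop these noisy verdicts would replace the ground-truth reward; the paper's claim is that training tolerates this corruption up to roughly the simulated rate.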
Results
The experiments revealed that RLVR maintains peak validation accuracy within 2 percentage points of the clean baseline even with noise rates up to 15%. This robustness was consistent across different noise types and model families, indicating that RLVR can effectively handle imperfect verification.
Implications
The findings suggest that RLVR can be effectively applied in real-world scenarios where perfect verification is unattainable. This has significant implications for the deployment of LLMs in various domains, as it allows for more flexible and practical training approaches that do not require flawless reward signals.
A Novel Edge-Assisted Quantum-Classical Hybrid Framework for Crime Pattern Learning and Classification
Optimization
Theory
Efficient ML
- Introduction of a comprehensive quantum-classical comparison framework for crime analytics.
- Development of a novel quantum circuit architecture that leverages crime feature correlations.
- Demonstration of competitive performance of quantum-inspired models compared to classical baselines.
- Hybrid architectures show promise for deployment in resource-constrained environments.
Summary
This paper introduces a novel quantum-classical hybrid framework aimed at enhancing crime pattern analysis and classification, addressing the challenges posed by high-dimensional and imbalanced datasets in crime statistics. The authors evaluate four computational paradigms: pure quantum models, classical machine learning models, and two hybrid architectures. Utilizing 16 years of crime data from Bangladesh, the study employs rigorous cross-validation to assess the performance and efficiency of these models. The results indicate that quantum-inspired approaches, particularly the Quantum Approximate Optimization Algorithm (QAOA), achieve an accuracy of up to 84.6% while requiring fewer parameters than classical models. The proposed correlation-aware circuit design effectively incorporates domain-specific feature relationships, enhancing the performance of quantum models. The hybrid approaches demonstrate competitive training efficiency, making them suitable for resource-constrained environments such as wireless sensor networks in smart city surveillance systems. This research provides a foundational empirical assessment of quantum-enhanced machine learning for structured crime data, suggesting avenues for further exploration with larger datasets and realistic quantum hardware.
Methodology
The study employs a systematic evaluation of four computational paradigms: pure quantum models, classical machine learning models, and two hybrid architectures. It utilizes a dataset of 16 years of crime statistics from Bangladesh and applies rigorous cross-validation methods to assess classification performance and computational efficiency. The authors also implement a correlation-aware circuit design based on Spearman correlation analysis to enhance quantum model performance.
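The correlation-aware circuit design relies on Spearman rank correlations between crime features to decide which qubit pairs deserve entangling gates. A self-contained sketch of the statistic, using toy feature values with distinct entries so the tie-free shortcut applies:

```python
def rank(xs):
    """1-based ranks (adequate for distinct values)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(x, y):
    """Spearman rho for distinct-valued samples via 1 - 6*sum(d^2)/(n(n^2-1))."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Two toy crime features that rise together -> perfect rank correlation,
# so (hypothetically) their qubits would receive an entangling gate.
theft = [3, 8, 5, 12, 7]
burglary = [10, 30, 18, 55, 25]
rho = spearman(theft, burglary)
```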
Results
The experimental results reveal that quantum-inspired approaches, particularly QAOA, achieve an accuracy of up to 84.6%. These models require fewer trainable parameters compared to classical baselines, indicating advantages for memory-constrained edge deployment. Hybrid approaches exhibit competitive training efficiency, making them suitable for resource-constrained environments.
Implications
The findings suggest that quantum-classical hybrid frameworks can significantly improve crime pattern analysis and classification, particularly in smart city applications where efficient resource utilization is critical. The low computational overhead and compact parameter footprint of the proposed models indicate their potential for deployment in distributed analytics systems, enhancing law enforcement capabilities.
Bias-Constrained Diffusion Schedules for PDE Emulations: Reconstruction Error Minimization and Efficient Unrolled Training
Generative Models
Optimization
Time Series
- Introduction of the Reconstruction Exposure-Bias concept, linking training and inference errors.
- Development of an Adaptive Noise Schedule to optimize reconstruction error while maintaining stability.
- Proposal of a fast Proxy Unrolled Training method to enhance computational efficiency.
- Demonstrated improvements in accuracy and stability over traditional diffusion and deterministic models.
Summary
This paper addresses the limitations of Conditional Diffusion Models in emulating complex spatiotemporal dynamics, particularly in terms of reconstruction accuracy and computational efficiency. The authors identify the relationship between noise schedules, reconstruction error reduction rates, and diffusion exposure bias, demonstrating that standard schedules lead to suboptimal performance. They propose an Adaptive Noise Schedule framework that minimizes inference reconstruction error by dynamically constraining the model's exposure bias. Additionally, they introduce a Proxy Unrolled Training method that stabilizes long-term rollouts without the computational burden of full Markov Chain sampling. The proposed methods show significant improvements in both short-term accuracy and long-term stability across various benchmarks, including forced Navier-Stokes, Kuramoto-Sivashinsky, and Transonic Flow.
Methodology
The authors characterize the noise schedule's impact on reconstruction error and exposure bias, leading to the development of an Adaptive Scheduling algorithm. This algorithm optimizes the noise level while ensuring model stability. They also create a Proxy Unrolled Training method that reduces the computational cost of traditional unrolled training by requiring fewer sampling steps.
Results
The proposed methods significantly reduce first-step reconstruction error and mitigate artifacting effects, resulting in improvements in Fréchet Spectral Distance on Kolmogorov turbulent flow benchmarks by multiple orders of magnitude. The Adaptive Noise Schedule and Proxy Unrolled Training demonstrate enhanced performance compared to existing diffusion and deterministic models.
Implications
The findings suggest that optimizing noise schedules and training methods can lead to more accurate and efficient models for spatiotemporal forecasting tasks, particularly in fluid dynamics. This could enhance the applicability of diffusion models in high-precision simulations and real-time forecasting scenarios.
Zero-shot Multivariate Time Series Forecasting Using Tabular Prior Fitted Networks
Time Series
- Introduces a framework for zero-shot multivariate time series forecasting using tabular models.
- Addresses the limitation of treating MTS as independent univariate problems by modeling inter-channel dependencies.
- Utilizes a 'rolled out' tabular format to capture spatial correlations and temporal dependencies.
- Demonstrates competitive performance against state-of-the-art methods in empirical evaluations.
Summary
This paper presents a novel framework for zero-shot multivariate time series forecasting using Tabular Prior Fitted Networks (TabPFN). Traditional approaches often treat multivariate time series (MTS) as independent univariate problems, neglecting inter-channel dependencies. The authors propose a method that reformulates MTS forecasting into scalar regression problems, allowing the use of tabular foundation models without requiring retraining or architectural modifications. By transforming the multivariate structure into a 'rolled out' tabular format, the method captures intra-sample dependencies effectively. The authors benchmark their approach against existing methods, demonstrating its competitive performance in forecasting tasks across various domains, including finance and engineering.
Methodology
The proposed method reformulates multivariate time series forecasting as a series of scalar regression problems. It transforms the multivariate data into a tabular format by flattening the multivariate vectors into rows, where each row includes the timestamp, covariate index, and value. This allows the use of tabular foundation models like TabPFN for zero-shot predictions, effectively capturing both temporal and spatial dependencies.
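The 'rolled out' tabular format can be sketched as flattening each (timestep, channel) cell into its own row, so a tabular regressor sees one scalar target per row with the timestamp and channel identity as features. Column names below are illustrative, not TabPFN-TS's actual schema:

```python
def roll_out(series, channel_names):
    """Flatten a multivariate series into rows (t, channel, value) so a
    tabular model can treat forecasting as scalar regression while still
    seeing cross-channel context through the channel feature."""
    rows = []
    for t, values in enumerate(series):
        for ch, v in zip(channel_names, values):
            rows.append({"t": t, "channel": ch, "value": v})
    return rows

# 3 timesteps, 2 channels (toy values).
mts = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
table = roll_out(mts, ["price", "volume"])
```

Forecasting then amounts to appending rows with future timestamps and empty values and asking the fitted tabular model to regress them, with no retraining of the foundation model.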
Results
The empirical evaluation shows that the proposed method outperforms the univariate decomposition baseline established by TabPFN-TS and competes well with specialized architectures designed for time series forecasting. The results indicate that the method can effectively leverage cross-channel context in forecasting tasks.
Implications
This framework has significant implications for various applications requiring multivariate time series forecasting, such as finance, meteorology, and engineering. It enables practitioners to utilize existing tabular models for complex forecasting tasks without extensive retraining, thus enhancing efficiency and accessibility in predictive modeling.
Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
Multimodal
Interpretability
Optimization
- Introduces Tree-of-Evidence (ToE) for improved interpretability of multimodal models.
- Frames interpretability as a discrete optimization problem using Evidence Bottlenecks.
- Maintains high predictive performance with minimal evidence units.
- Achieves better decision agreement and lower errors compared to traditional methods.
Read more
Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
Summary
The paper introduces Tree-of-Evidence (ToE), an innovative inference-time search algorithm designed to enhance the interpretability of Large Multimodal Models (LMMs) in high-stakes domains such as healthcare. Traditional interpretability methods often fail to accurately represent the decision-making processes of these complex models, especially when integrating diverse data types like time-series and text. ToE addresses this challenge by framing interpretability as a discrete optimization problem, utilizing lightweight Evidence Bottlenecks to score groups of data and employing a beam search to identify the minimal evidence set necessary for reproducing model predictions. The authors evaluate ToE across six tasks involving three datasets, demonstrating that it can maintain over 98% of the full-model AUROC while using as few as five evidence units. The results indicate that ToE not only produces auditable evidence traces but also achieves higher decision agreement and lower probability fidelity errors compared to existing methods. Qualitative analyses reveal that ToE adapts its search strategy based on the nature of the evidence, effectively combining different modalities to clarify ambiguous cases. This work presents a practical mechanism for auditing multimodal models, ensuring that predictions can be traced back to specific, verifiable evidence.
Methodology
The methodology involves training modality-specific classifiers and lightweight selectors that score evidence units. At inference, ToE employs a beam search to construct a compact evidence set, balancing decision agreement, probability stability, and evidence sparsity. This approach allows for a structured search over meaningful evidence units, separating global context from dynamic evidence.
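The beam search over evidence subsets can be sketched generically. The scorer below, a weighted coverage score minus a per-unit sparsity penalty, is a stand-in for ToE's agreement/stability/sparsity objective, and `beam_search_evidence` is an illustrative helper rather than the paper's implementation.

```python
def beam_search_evidence(units, score, beam_width=2, budget=3):
    """Beam search over evidence subsets: grow each subset in the beam by one
    unit at a time, keep the top-scoring candidates, and return the best
    subset seen within the sparsity budget."""
    beam = [()]  # start from the empty evidence set
    best = max(beam, key=score)
    for _ in range(budget):
        candidates = set()
        for subset in beam:
            for u in units:
                if u not in subset:
                    candidates.add(tuple(sorted(subset + (u,))))
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        best = max([best, *beam], key=score)
    return best

# Toy scorer: units 'a' and 'c' together reproduce the model's decision,
# and each extra unit pays a sparsity penalty of 0.15.
weights = {'a': 0.6, 'b': 0.1, 'c': 0.4}
score = lambda s: sum(weights[u] for u in s) - 0.15 * len(s)
minimal = beam_search_evidence(['a', 'b', 'c'], score)
```

With this penalty the search settles on the two-unit set, illustrating how the sparsity term keeps the evidence trace compact.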
Results
ToE was evaluated on six tasks across three datasets, achieving over 98% AUROC retention with as few as five evidence units. It demonstrated higher decision agreement and lower probability fidelity errors compared to existing interpretability methods, providing clear and auditable evidence traces.
Implications
The findings suggest that ToE can significantly enhance the interpretability of multimodal models in critical applications like healthcare, where understanding the rationale behind predictions is essential for trust and accountability. This method could be applied in various domains requiring transparent decision-making.
BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
Graph Learning
Large Language Models
NLP
- Introduces BLEG, a framework combining LLMs and GNNs for brain network analysis.
- Addresses limitations of GNNs due to feature sparsity and lack of domain knowledge.
- Demonstrates a three-stage methodology for enhancing GNN performance.
- Achieves superior results on various downstream tasks compared to existing methods.
Read more
BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
Summary
This paper presents BLEG, a novel framework that integrates Large Language Models (LLMs) with Graph Neural Networks (GNNs) to enhance brain network analysis using functional magnetic resonance imaging (fMRI) data. The authors identify limitations in current GNN approaches, primarily due to high feature sparsity and the lack of domain knowledge in uni-modal neurographs. BLEG operates in three stages: first, it prompts an LLM to generate augmented textual descriptions for fMRI graph data; second, it tunes a smaller language model (LM) based on the generated text-graph dataset while training the GNN for coarse alignment; and third, it fine-tunes the GNN for specific downstream tasks using logits from the tuned LM for fine-grained alignment. The results demonstrate that BLEG significantly improves GNN performance on various tasks, including gender classification and major depressive disorder diagnosis, marking a pioneering effort to leverage LLMs for enhancing GNNs in neuroscience.
Methodology
The BLEG framework consists of three main stages: (1) prompting an LLM to generate augmented text descriptions for fMRI graph data, (2) tuning a smaller LM based on the text-graph dataset while training the GNN for coarse alignment, and (3) fine-tuning the GNN for specific downstream tasks using logits from the tuned LM for fine-grained alignment. The authors also provide a theoretical analysis to support the effectiveness of their approach.
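The fine-grained alignment in stage three uses logits from the tuned LM to guide the GNN; the summary does not give the exact loss, so a standard logit-distillation objective is one plausible reading. The function names, temperature, and KL direction below are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def alignment_loss(gnn_logits, lm_logits, temperature=2.0):
    """Fine-grained alignment sketch: push the GNN's class distribution
    toward the tuned LM's softened distribution (a generic logit-distillation
    reading of stage three, not the paper's verbatim objective)."""
    teacher = softmax(lm_logits, temperature)
    student = softmax(gnn_logits, temperature)
    return kl_divergence(teacher, student)
```

The loss is zero when the two models agree exactly and grows as their predicted distributions diverge.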
Results
Extensive experiments on various real-world datasets show that BLEG outperforms existing GNN-based methods in tasks such as autism spectrum disorder diagnosis and major depressive disorder diagnosis, confirming its effectiveness in enhancing GNN representation learning.
Implications
The findings suggest that integrating LLMs with GNNs can significantly improve brain network analysis, offering new insights for both research and practical applications in diagnosing neurological conditions. The BLEG framework is data-agnostic and model-agnostic, indicating its potential for broader applications beyond fMRI data.
Sinkhorn doubly stochastic attention rank decay analysis
Theory
NLP
Computer Vision
- Doubly stochastic attention mitigates rank collapse more effectively than row-stochastic attention.
- Rank decay in self-attention using Sinkhorn normalization occurs doubly exponentially with depth.
- Skip connections are crucial for maintaining rank in self-attention networks.
- Empirical validation shows improved performance in sentiment analysis and image classification tasks.
Read more
Sinkhorn doubly stochastic attention rank decay analysis
Summary
This paper investigates the limitations of standard row-stochastic self-attention mechanisms in Transformer architectures, particularly focusing on rank collapse and entropy collapse as depth increases. The authors propose that doubly stochastic attention, normalized using the Sinkhorn algorithm, can better preserve rank compared to traditional Softmax normalization. They establish that rank decay in self-attention networks using Sinkhorn normalization occurs doubly exponentially with depth, as it does with Softmax. The study employs a theoretical framework based on path decomposition to analyze the structural properties of self-attention and validates findings through empirical experiments on sentiment analysis and image classification tasks. The results indicate that doubly stochastic attention leads to more balanced attention distributions and improved performance, while also confirming the importance of skip connections in mitigating rank collapse.
Methodology
The authors utilized a theoretical analysis based on path decomposition to study rank collapse in self-attention networks. They compared the effects of row-stochastic and doubly stochastic attention using Sinkhorn normalization, and conducted empirical experiments on sentiment analysis and image classification tasks to validate their findings.
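The Sinkhorn normalization at the heart of this comparison can be sketched as alternating row and column normalization of the exponentiated score matrix. This is a plain-domain version for clarity; practical implementations usually work in log space for numerical stability.

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Sinkhorn normalization: alternately normalize rows and columns of
    exp(scores) until the matrix is approximately doubly stochastic
    (every row and every column sums to 1)."""
    K = np.exp(scores)
    for _ in range(n_iters):
        K = K / K.sum(axis=1, keepdims=True)  # rows sum to 1
        K = K / K.sum(axis=0, keepdims=True)  # columns sum to 1
    return K

rng = np.random.default_rng(0)
A = sinkhorn(rng.normal(size=(4, 4)))
```

Row-stochastic Softmax attention performs only the first of these two normalizations; the repeated alternation is what yields the doubly stochastic structure the paper analyzes.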
Results
The study found that doubly stochastic attention matrices, normalized with the Sinkhorn algorithm, preserve rank more effectively than standard Softmax matrices. Rank decay was nonetheless shown to occur doubly exponentially with network depth, and the empirical results confirmed that this approach leads to improved performance in various tasks.
Implications
The findings suggest that implementing doubly stochastic attention could enhance the performance and stability of Transformer models in various applications, including natural language processing and computer vision. This work also opens avenues for further research into optimal transport methods in deep learning.
A Machine Learning Framework for Turbofan Health Estimation via Inverse Problem Formulation
Time Series
- Introduction of a realistic turbofan dataset that captures real-world health monitoring challenges.
- Comprehensive evaluation of established methods for health state estimation from sparse measurements.
- Investigation of self-supervised learning approaches to recover health states without true labels.
- Comparison of traditional Bayesian filters and data-driven models, establishing strong baselines.
Read more
A Machine Learning Framework for Turbofan Health Estimation via Inverse Problem Formulation
Summary
This paper addresses the challenge of estimating the health state of turbofan engines, which is an ill-posed inverse problem complicated by sparse sensor data and complex thermodynamics. The authors highlight the fragmentation in current research and the limitations of existing datasets that fail to capture realistic degradation and maintenance patterns. To tackle this, they introduce a new dataset that reflects industry complexities and establish a benchmark for evaluating various methods, including steady-state and nonstationary data-driven models, as well as Bayesian filters. A significant contribution is the exploration of self-supervised learning (SSL) approaches that derive latent representations without true health labels, simulating real-world constraints. The study compares the performance of these SSL methods against traditional prediction baselines, revealing that while classic filters remain robust, SSL methods underscore the complexity of health estimation and the necessity for more advanced inference strategies. The dataset and implementation are made publicly available for reproducibility.
Methodology
The authors developed a new dataset simulating realistic degradation patterns and maintenance events for turbofan engines. They conducted a benchmark evaluation of various methods, including Bayesian filters and data-driven models, and explored self-supervised learning techniques to learn representations from sensor data without true health labels. The performance of these methods was compared to establish a practical lower bound on the difficulty of the inverse problem.
Results
The results indicate that traditional filters, such as Kalman filters, serve as strong baselines for health estimation. Self-supervised learning methods demonstrated the intrinsic complexity of the health estimation task, emphasizing the need for more sophisticated and interpretable inference strategies. The study provides a foundational benchmark for future research in turbofan health monitoring.
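A scalar Kalman filter of the kind cited as a strong baseline can be sketched as follows. The random-walk degradation model and the noise values are illustrative choices, not the benchmark's configuration.

```python
import random

def kalman_step(x, P, z, q=1e-4, r=0.25):
    """One predict/update step of a scalar Kalman filter.
    x, P: state estimate and its variance; z: noisy measurement;
    q, r: process and measurement noise variances (illustrative values)."""
    # Predict: random-walk degradation model x_t = x_{t-1} + noise
    P = P + q
    # Update: blend the prediction with the measurement
    K = P / (P + r)          # Kalman gain
    x = x + K * (z - x)
    P = (1 - K) * P
    return x, P

# Track a health index slowly decaying from 1.0 under noisy observations.
random.seed(0)
true_health, x, P = 1.0, 1.0, 1.0
for _ in range(100):
    true_health -= 0.002
    z = true_health + random.gauss(0, 0.5)
    x, P = kalman_step(x, P, z)
```

Even this minimal filter smooths heavy measurement noise into a usable health estimate, which is why such filters remain hard baselines for learned estimators.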
Implications
The findings have significant implications for predictive maintenance in aviation, as they provide a framework for more accurate health monitoring of turbofan engines. The introduction of a realistic dataset and the exploration of SSL methods could lead to improved algorithms that better handle the complexities of real-world operational conditions.
TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis
Reinforcement Learning
Large Language Models
- TTVS enables dynamic augmentation of training data from unlabeled test queries.
- The framework consists of two modules: Online Variational Synthesis and Test-time Hybrid Exploration.
- TTVS outperforms existing test-time adaptation methods and state-of-the-art RL techniques using only unlabeled data.
- The approach encourages models to learn underlying problem logic rather than superficial patterns.
Read more
TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis
Summary
The paper introduces Test-Time Variational Synthesis (TTVS), a novel framework designed to enhance self-exploring reinforcement learning (RL) in Large Reasoning Models (LRMs) by dynamically augmenting training data from unlabeled test queries. Traditional reinforcement learning with verifiable rewards (RLVR) faces limitations in specialized domains where obtaining labeled data is costly or impractical. Existing test-time adaptation methods often rely on static query sets, leading to overfitting on superficial patterns rather than understanding the underlying problem logic. TTVS addresses this by incorporating two main components: Online Variational Synthesis, which generates diverse, semantically-equivalent variations of static test queries, and Test-time Hybrid Exploration, which balances accuracy-driven exploitation with consistency-driven exploration. The framework allows LRMs to self-evolve and adapt to novel problems without direct human supervision. Extensive experiments demonstrate that TTVS significantly outperforms other test-time adaptation methods and even state-of-the-art supervised RL techniques, achieving superior performance in mathematical reasoning tasks using only unlabeled test-time data.
Methodology
The methodology involves two synergistic modules: (1) Online Variational Synthesis, which transforms static test queries into a dynamic stream of diverse variations, and (2) Test-time Hybrid Exploration, which employs a dual-mode update strategy for robust learning. This allows the model to exploit accurate solutions while exploring consistency across generated variants.
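The consistency-driven side of the hybrid exploration can be sketched as rewarding agreement with the majority answer across synthesized variants. This majority-vote reading, and the `consistency_reward` helper, are assumptions for illustration, not the paper's exact objective.

```python
from collections import Counter

def consistency_reward(answers):
    """Reward each sampled answer by whether it matches the majority answer
    across semantically-equivalent query variants (a simple reading of
    'consistency-driven exploration'; the real update is more involved)."""
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return rewards, majority

answers = ["42", "42", "41", "42"]  # model outputs on four variants
rewards, pseudo_label = consistency_reward(answers)
```

The majority answer doubles as a pseudo-label, letting the model learn from unlabeled test queries without any verifiable reward.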
Results
The experiments conducted across eight model architectures show that TTVS achieves superior performance in mathematical reasoning tasks, surpassing both traditional test-time adaptation methods and state-of-the-art RL-based post-training methods that rely on large, annotated datasets.
Implications
The implications of TTVS suggest a shift towards self-evolving models that can adapt to new, unlabeled data in specialized domains, potentially reducing the need for extensive human annotation and enabling more scalable applications of reinforcement learning in complex reasoning tasks.
A Systematic Framework for Tabular Data Disentanglement
Theory
Generative Models
Optimization
- Introduces a systematic framework for tabular data disentanglement.
- Modularizes the disentanglement process into four core components.
- Identifies limitations of existing methods and proposes a comprehensive view.
- Highlights the unique challenges posed by tabular data compared to other data types.
Read more
A Systematic Framework for Tabular Data Disentanglement
Summary
This paper addresses the challenges of disentangling tabular data, which is prevalent in various industries such as finance and industrial control systems. The authors propose a systematic framework that modularizes the disentanglement process into four core components: data extraction, data modeling, model analysis, and latent representation extrapolation. The motivation behind this framework is to provide a clearer understanding of the complex interrelationships among attributes in tabular data, which often leads to suboptimal results when applying existing methods from other data domains. The authors highlight the limitations of current techniques, including factor analysis, CT-GAN, and VAE, which struggle with scalability, mode collapse, and poor extrapolation. By offering a comprehensive view of tabular data disentanglement, the framework aims to identify research gaps, facilitate the integration of different approaches, and ultimately lead to the development of more robust and efficient disentanglement techniques. A case study is presented to demonstrate the framework's applicability in synthetic tabular data generation, showcasing its potential for practical applications in data synthesis.
Methodology
The authors propose a systematic framework that includes four main components: data extraction, data modeling, model analysis, and latent representation extrapolation. They analyze existing methods for tabular data disentanglement and identify their limitations, leading to the development of a more comprehensive approach.
Results
The framework provides a structured understanding of tabular data disentanglement and identifies key components and properties necessary for effective disentanglement. The case study on synthetic tabular data generation illustrates the framework's practical applicability and effectiveness in real-world scenarios.
Implications
This work lays the groundwork for future research in tabular data disentanglement, potentially leading to the development of more efficient and scalable techniques. It also opens avenues for integrating various existing methods into a hybrid solution that leverages their strengths while mitigating their weaknesses.
A Graph Foundation Model for Wireless Resource Allocation
Graph Learning
Optimization
- Introduces a novel Graph Foundation Model for resource allocation in wireless networks.
- Utilizes an interference-aware Transformer architecture for improved adaptability.
- Employs a hybrid self-supervised pre-training strategy for effective representation learning.
- Achieves state-of-the-art performance and sample efficiency in various scenarios.
Read more
A Graph Foundation Model for Wireless Resource Allocation
Summary
This paper addresses the challenges of resource allocation in modern wireless networks, particularly in the context of severe mutual interference due to network densification. Traditional iterative algorithms are computationally intensive and slow, making them unsuitable for real-time applications. While deep learning methods have emerged, they often lack the flexibility to adapt to varying objectives without extensive retraining. To overcome these limitations, the authors propose a Graph Foundation Model for Resource Allocation (GFM-RA) that utilizes a pre-training and fine-tuning approach to create unified representations. The model features an interference-aware Transformer architecture that incorporates interference topologies into its attention mechanisms. Additionally, a hybrid self-supervised pre-training strategy is introduced, combining masked edge prediction with contrastive learning to capture transferable structural representations from large unlabeled datasets. Experimental results demonstrate that GFM-RA achieves state-of-the-art performance, exhibits exceptional sample efficiency, and allows for robust few-shot adaptation to diverse downstream tasks, even in out-of-distribution scenarios. This work highlights the potential of pre-trained foundation models in enhancing wireless resource allocation and sets the stage for future research in generalizable learning-based wireless optimization.
Methodology
The proposed GFM-RA model employs a pre-training and fine-tuning paradigm, utilizing an interference-aware Transformer architecture enhanced with a bias projector to incorporate interference topologies into attention mechanisms. A hybrid self-supervised learning approach is used, combining masked edge prediction and contrastive learning to extract transferable representations from large unlabeled datasets.
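The bias-projector idea, injecting the interference topology into the attention scores, can be sketched with an additive bias term. In the model the bias comes from a learned projector over interference features; here it is supplied directly as a matrix, which is a simplifying assumption.

```python
import numpy as np

def biased_attention(Q, K, bias, scale=None):
    """Scaled dot-product attention with an additive bias on the score
    matrix, a sketch of folding interference topology into attention."""
    scale = scale or Q.shape[-1] ** 0.5
    scores = Q @ K.T / scale + bias               # bias: (n, n) interference term
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d = 3, 4
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
# Strong mutual interference between links 0 and 1 suppresses their attention.
interference = np.array([[0., -5., 0.], [-5., 0., 0.], [0., 0., 0.]])
W = biased_attention(Q, K, interference)
```

Compared to the unbiased case, the weight that link 0 places on link 1 shrinks, so the attention pattern reflects the interference graph rather than feature similarity alone.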
Results
The GFM-RA framework outperforms existing methods in resource allocation tasks, demonstrating significant improvements in performance metrics and scalability with increased model capacity. The model's ability to adapt efficiently to new objectives with minimal retraining is particularly noteworthy, showcasing its robustness in out-of-distribution scenarios.
Implications
The findings suggest that pre-trained foundation models can significantly enhance the adaptability and efficiency of wireless resource allocation strategies. This approach may lead to more flexible and responsive wireless network management solutions, paving the way for advancements in 6G and beyond.
Tensor-based computation of the Koopman generator via operator logarithm
Theory
Time Series
Efficient ML
- Introduces a tensor-based method for computing the Koopman generator in low-rank TT format.
- Avoids the curse of dimensionality by leveraging eigendecomposition for efficient computation.
- Demonstrates effectiveness on both 4D and 10D dynamical systems, achieving accurate recovery of vector fields.
- Provides a scalable solution for system identification in nonlinear dynamics.
Read more
Tensor-based computation of the Koopman generator via operator logarithm
Summary
This paper addresses the challenge of identifying governing equations of nonlinear dynamical systems from data, particularly focusing on the limitations of existing methods such as sparse identification of nonlinear dynamics (SINDy) and operator-logarithm approaches. The authors propose a novel data-driven method to compute the Koopman generator in a low-rank tensor train (TT) format by taking logarithms of Koopman eigenvalues while preserving the TT structure. This method leverages eigendecomposition to compute the Koopman generator without the need for explicit matrix logarithm calculations, which can be computationally expensive and prone to errors in high-dimensional settings. The effectiveness of the proposed method is demonstrated through numerical experiments on a 4-dimensional Lotka-Volterra system and a 10-dimensional Lorenz-96 system, showcasing accurate recovery of vector field coefficients and scalability to higher-dimensional systems. The results indicate that the proposed approach can efficiently handle the curse of dimensionality, making it a promising tool for system identification in complex dynamical systems.
Methodology
The proposed method computes the Koopman generator by first determining the Koopman eigenvalues and eigenfunctions in the TT format. It then takes the logarithm of the eigenvalues and uses the resulting eigendecomposition to construct the Koopman generator, all while maintaining the TT structure to ensure computational efficiency.
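The core computation, recovering the generator by taking logarithms of Koopman eigenvalues rather than forming a dense matrix logarithm, can be sketched in plain matrices. The paper works in the TT format, which this sketch omits.

```python
import numpy as np

def generator_from_koopman(K, dt):
    """Recover the generator L from a Koopman matrix K = exp(dt * L) by
    taking logarithms of K's eigenvalues in its eigendecomposition,
    avoiding an explicit matrix logarithm."""
    evals, V = np.linalg.eig(K)
    L = V @ np.diag(np.log(evals) / dt) @ np.linalg.inv(V)
    return L.real  # imaginary residue is numerical noise for real systems

# Toy check: build K from a known generator with distinct real eigenvalues.
V = np.array([[1.0, 1.0], [0.0, 1.0]])
L_true = V @ np.diag([-1.0, -2.0]) @ np.linalg.inv(V)
dt = 0.1
K = V @ np.diag(np.exp(dt * np.array([-1.0, -2.0]))) @ np.linalg.inv(V)
L_rec = generator_from_koopman(K, dt)
```

Because the logarithm acts only on the eigenvalues, the same idea carries over to eigendecompositions stored in TT format, which is what lets the method sidestep the cost of a dense matrix logarithm in high dimensions.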
Results
The numerical experiments conducted on the Lotka-Volterra and Lorenz-96 systems showed that the proposed method accurately recovers the vector field coefficients and demonstrates scalability to higher-dimensional systems, effectively addressing the computational challenges associated with traditional methods.
Implications
This work has significant implications for various fields that require the identification of nonlinear dynamical systems, such as control theory, physics, and biology. The ability to efficiently compute the Koopman generator in high-dimensional settings could enhance predictive modeling and control strategies in complex systems.
Multimodal Latent Reasoning via Predictive Embeddings
Multimodal
- PEARL eliminates the need for explicit tool invocation at inference time, reducing overhead.
- The framework supports multi-step reasoning and avoids training-inference mismatches.
- PEARL outperforms traditional supervised fine-tuning and reconstruction-based methods in various benchmarks.
- The approach focuses on predictive embedding learning, which is shown to be more effective than reconstruction-based methods.
Read more
Multimodal Latent Reasoning via Predictive Embeddings
Summary
The paper introduces PEARL (Predictive Embedding Alignment for Reasoning in Latent space), a novel framework designed to enhance multimodal reasoning in visual language models (VLMs) by learning from expert tool-use trajectories in a latent space. Traditional tool-augmented approaches face challenges such as high inference overhead, the need for specialized supervision, and the risk of erroneous tool calls. PEARL circumvents these issues by eliminating explicit tool invocation during inference, allowing for a more efficient and effective reasoning process. The framework is inspired by the JEPA model and focuses on predictive embedding learning rather than reconstruction-based methods, which often suffer from training-inference mismatches and limitations in multi-step reasoning. PEARL operates by predicting trajectory embeddings from image-question pairs, thus enabling the model to internalize the effects of tool use without direct invocation. The authors demonstrate that PEARL maintains the standard vision-language generation pipeline while supporting complex reasoning tasks. Experimental results across various perception benchmarks indicate that PEARL matches or surpasses the performance of existing supervised fine-tuning and reconstruction-based methods, highlighting its potential as a more principled approach to multimodal reasoning.
Methodology
PEARL employs a JEPA-inspired framework that learns predictive representations from expert tool-use trajectories. It predicts trajectory embeddings from image-question pairs, allowing for internalization of tool effects without explicit invocation. The model is trained using a combination of a vision-language generation objective and a predictive embedding objective, facilitating the learning of task-relevant transformations in latent space.
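The predictive embedding objective is described only at a high level; a JEPA-style negative cosine similarity between the predicted and target trajectory embeddings is one plausible form. The loss shape and function name below are assumptions, not the paper's verbatim objective.

```python
import numpy as np

def predictive_embedding_loss(predicted, target):
    """JEPA-style predictive objective sketch: negative cosine similarity
    between the predictor's output and the target trajectory embedding
    (the target would be produced by a frozen encoder; in an autograd
    framework it would sit behind a stop-gradient)."""
    p = predicted / np.linalg.norm(predicted)
    t = target / np.linalg.norm(target)
    return 1.0 - float(p @ t)  # 0 when aligned, 2 when opposed

emb = np.array([1.0, 2.0, 3.0])
aligned = predictive_embedding_loss(emb, emb)
opposed = predictive_embedding_loss(emb, -emb)
```

Training would combine this term with the standard vision-language generation loss, so the model internalizes tool effects in latent space while keeping its generation pipeline intact.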
Results
PEARL consistently matches or outperforms existing methods, including supervised fine-tuning and reconstruction-based latent reasoning approaches, across multiple multimodal reasoning benchmarks. The analysis reveals that reconstruction-based methods primarily learn embeddings rather than simulating visual transformations, supporting the efficacy of predictive embedding learning.
Implications
The findings suggest that PEARL could significantly improve the efficiency and effectiveness of multimodal reasoning tasks in VLMs, making it applicable in areas such as image editing, object detection, and complex visual reasoning scenarios. This could lead to advancements in applications that require grounded reasoning and interaction with visual content.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
NLP
Large Language Models
Reinforcement Learning
- Introduction of the GaaA framework as a soft-gating alternative to traditional hard-gated safety mechanisms.
- Development of GuardSet, a large-scale dataset with over 208,000 examples for training guardian models.
- Training of GuardAdvisor using a combination of supervised fine-tuning and reinforcement learning.
- Demonstration of GuardAdvisor's competitive performance and significant reduction in unnecessary refusals.
Read more
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Summary
This paper introduces the Guardian-as-an-Advisor (GaaA) framework, which addresses the limitations of traditional hard-gated safety checkers in large language models (LLMs). Current models often over-refuse queries and misalign with vendor specifications, leading to reduced utility. GaaA employs a soft-gating mechanism where a guardian model predicts a binary risk label and provides a concise explanation, which is then prepended to the original user query for re-inference. This approach maintains the base model's operational integrity while enhancing its trustworthiness. To facilitate training and evaluation, the authors constructed GuardSet, a comprehensive dataset containing over 208,000 examples that unify harmful and harmless cases with a focus on robustness and honesty. The GuardAdvisor model is trained using supervised fine-tuning followed by reinforcement learning to ensure consistency between risk labels and explanations. Experimental results demonstrate that GuardAdvisor achieves competitive detection accuracy, reduces unnecessary refusals, and incurs minimal latency overhead, thus preserving adherence to the original model specifications.
Methodology
The authors developed the GaaA framework, which utilizes a guardian model to provide risk labels and explanations that are appended to user queries. They created the GuardSet dataset through a three-stage process involving collection, processing, and validation. GuardAdvisor was trained using supervised fine-tuning followed by reinforcement learning to ensure semantic consistency between the generated labels and explanations.
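The soft-gating step, prepending the guardian's risk label and explanation to the user query before re-inference, can be sketched directly. The prompt template and the `soft_gate` helper are illustrative assumptions; the summary specifies only that the label and explanation are prepended.

```python
def soft_gate(query, risk_label, explanation):
    """Compose the re-inference prompt for the soft-gating mechanism:
    rather than blocking the query, prepend the guardian's verdict and
    rationale so the base model answers with that advisory in context.
    The exact wording here is a hypothetical template."""
    advisory = (
        f"[Guardian advisory] risk: {risk_label}. {explanation} "
        "Respond in accordance with the deployment policy.\n\n"
    )
    return advisory + query

prompt = soft_gate(
    "How do I dispose of old batteries?",
    "safe",
    "Routine household-safety question; no harmful intent detected.",
)
```

Because the base model still sees the original query verbatim, this composition preserves its operational integrity while surfacing the guardian's judgment.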
Results
GuardAdvisor achieved detection performance comparable to proprietary models, significantly reduced unnecessary refusals, and added only 2-10% overhead in latency during inference. The model maintained compliance with the original specifications of the deployed LLM.
Implications
The GaaA framework has the potential to enhance the trustworthiness of LLMs in various applications, including search, coding, healthcare, and productivity tools, by providing interpretable guidance and reducing the risk of harmful outputs.
Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
Multimodal
- Introduction of PRAC, a novel attack on CUAs that manipulates attention in vision models.
- Demonstration of the attack's effectiveness in redirecting product selection on online shopping platforms.
- Highlighting the security vulnerabilities of CUAs in trusted environments, particularly through visual perception.
- Validation of the attack in realistic scenarios, indicating high success rates.
Read more
Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
Summary
This paper introduces PRAC, a novel attack method targeting Computer Use Agents (CUAs) that utilize multimodal foundation models to interact with graphical user interfaces (GUIs). While previous research has primarily focused on vulnerabilities in the language modality, this work highlights a significant gap in the security of the vision modality. PRAC manipulates the internal preferences of CUAs by redirecting their attention towards a stealthy adversarial patch on product images in online shopping environments. The authors demonstrate that this attack can effectively influence the CUA's selection process, leading it to recommend a specific product manipulated by an adversary. The attack requires white-box access to the model for crafting the adversarial image but shows generalization capabilities to fine-tuned versions of the same model. The study emphasizes the overlooked risks associated with benign actions of CUAs, particularly in trusted environments, and validates the effectiveness of PRAC through realistic deployment scenarios, showcasing high success rates in altering the selection preferences of CUAs.
Methodology
The authors developed PRAC by optimizing a stealthy perturbation on product images to concentrate the attention of the CUA on the adversarial image. This involved using a local white-box CUA to manipulate the attention distribution during the selection process, ensuring that the modified image appeared benign to human users.
Results
The results indicate that PRAC successfully redirected the selection of CUAs towards the adversarial product image in a controlled online shopping environment. The attack maintained a high success rate, demonstrating its effectiveness even when only a small portion of the image could be manipulated.
Implications
The findings suggest that CUAs are vulnerable to subtle adversarial attacks that exploit their visual perception, raising concerns about the security of automated decision-making systems in commercial applications. This highlights the need for enhanced security measures and awareness of potential threats in multimodal AI systems.
Implicit Regularization and Generalization in Overparameterized Neural Networks
Theory
Optimization
- Overparameterized neural networks can generalize well despite classical predictions of overfitting.
- Optimization dynamics, particularly through SGD, play a crucial role in implicit regularization.
- Smaller batch sizes lead to better generalization and flatter minima in the loss landscape.
- Sparse subnetworks can achieve performance comparable to full models, supporting the Lottery Ticket Hypothesis.
Summary
This paper addresses the paradox of overparameterized neural networks, which often generalize well despite classical statistical learning theory predicting severe overfitting in such models. The author investigates the mechanisms behind this phenomenon, focusing on implicit regularization and optimization dynamics. Through controlled experiments using stochastic gradient descent (SGD) on datasets like CIFAR-10 and MNIST, the study explores the effects of batch size, loss landscape geometry, and the Neural Tangent Kernel (NTK) framework. The findings reveal that smaller batch sizes lead to lower test errors and flatter minima, indicating a relationship between optimization strategies and generalization performance. Additionally, sparse subnetworks, retaining only a fraction of the original parameters, can achieve performance close to that of full models when retrained. This research contributes to a deeper understanding of modern deep learning systems and suggests the need for revised theoretical frameworks to explain generalization in high-dimensional settings.
Methodology
The study employs controlled computational experiments using stochastic gradient descent (SGD) across various batch sizes, analyzes the geometry of loss landscapes through Hessian eigenvalue estimation and weight perturbation, and examines theoretical perspectives from the Neural Tangent Kernel (NTK) regime. Experiments were conducted on CIFAR-10 and MNIST datasets.
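The flatness measurement rests on estimating the Hessian's leading eigenvalue without ever forming the Hessian. A minimal sketch of that diagnostic uses power iteration on finite-difference Hessian-vector products; the quadratic toy loss below is an assumption chosen for checkability, not one of the paper's networks:

```python
import numpy as np

def hessian_top_eigenvalue(grad_fn, w, iters=100, h=1e-4, seed=0):
    """Estimate the largest Hessian eigenvalue (a common 'sharpness' proxy)
    by power iteration on finite-difference Hessian-vector products."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        # Hv ~ (grad(w + h v) - grad(w - h v)) / (2h), no explicit Hessian needed
        hv = (grad_fn(w + h * v) - grad_fn(w - h * v)) / (2 * h)
        lam = float(v @ hv)            # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

# Quadratic toy loss L(w) = 0.5 * w^T A w with known spectrum {1, 2, 5}.
A = np.diag([1.0, 2.0, 5.0])
grad = lambda w: A @ w
lam_max = hessian_top_eigenvalue(grad, np.zeros(3))
print(lam_max)  # ~5.0
```

For a real network, `grad_fn` would be the gradient of the training loss at the converged weights; flatter minima show smaller leading eigenvalues.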
Results
The results indicate that generalization in overparameterized models is significantly influenced by the interplay between network architecture, optimization algorithms, and loss landscape geometry. Smaller batch sizes consistently resulted in lower test errors and flatter minima. Additionally, sparse subnetworks, retaining only 10% of the original parameters, achieved performance within 1.15 percentage points of full models when retrained.
Implications
The findings suggest that understanding the implicit regularization effects of optimization methods like SGD can lead to better training strategies for neural networks. This research may inform the design of more efficient neural architectures and training protocols that leverage the properties of overparameterization.
Automating aggregation strategy selection in federated learning
Federated Learning
- Introduces an automated framework for selecting aggregation strategies in Federated Learning.
- Operates in single-trial and multi-trial modes to accommodate different resource constraints.
- Utilizes large language models for strategy inference and a genetic search for optimization.
- Demonstrates improved robustness and generalization in non-IID scenarios through extensive experiments.
Summary
This paper addresses the challenge of selecting appropriate aggregation strategies in Federated Learning (FL), which is crucial for effective model training without centralizing data. The authors propose an end-to-end framework that automates the selection process, adapting to various levels of statistical heterogeneity and compute constraints. The framework operates in two modes: a single-trial mode, utilizing large language models (LLMs) to infer suitable strategies based on data characteristics, and a multi-trial mode, employing a lightweight genetic search to explore alternatives efficiently. The experiments conducted across diverse datasets demonstrate that the proposed approach enhances robustness and generalization under non-IID conditions while minimizing manual intervention. This work significantly contributes to making FL more accessible and adaptive by automating a critical design decision in the process.
Methodology
The authors developed a framework that integrates automated heterogeneity assessment with data-driven selection of aggregation strategies. In single-trial settings, they implemented a reasoning method based on large language models to infer strategies. In multi-trial settings, they utilized a lightweight genetic search to refine strategy choices efficiently.
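The multi-trial mode can be illustrated with a deliberately small genetic search over aggregation strategies. Everything below — the candidate strategies, the synthetic non-IID/outlier client updates, and the fitness function — is a hedged stand-in for the paper's actual search space:

```python
import numpy as np

rng = np.random.default_rng(1)

def trimmed_mean(updates, frac):
    """Coordinate-wise trimmed mean: drop the `frac` fraction of extreme
    values at each end before averaging."""
    k = int(len(updates) * frac)
    s = np.sort(updates, axis=0)
    return s[k:len(updates) - k].mean(axis=0) if k > 0 else updates.mean(axis=0)

STRATEGIES = {
    "mean":   lambda u, p: u.mean(axis=0),
    "median": lambda u, p: np.median(u, axis=0),
    "trim":   trimmed_mean,
}

# Synthetic round: 20 honest clients near a true update, 4 outlier clients.
true_update = np.ones(10)
updates = np.vstack([true_update + 0.1 * rng.normal(size=(20, 10)),
                     10.0 * rng.normal(size=(4, 10))])

def fitness(gene):
    name, frac = gene
    agg = STRATEGIES[name](updates, frac)
    return -np.linalg.norm(agg - true_update)  # closer to the true update is fitter

def genetic_search(generations=15, pop_size=8):
    """Tiny GA: a gene is (strategy name, trim fraction); keep the fitter
    half each generation and produce mutated children."""
    pop = [(rng.choice(list(STRATEGIES)), float(rng.uniform(0, 0.4)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        children = [(g[0] if rng.random() < 0.8 else rng.choice(list(STRATEGIES)),
                     float(np.clip(g[1] + 0.05 * rng.normal(), 0, 0.45)))
                    for g in survivors]
        pop = survivors + children
    return max(pop, key=fitness)

best = genetic_search()
print(best)
```

With corrupted clients in the round, the search should settle on a robust aggregator (median or a sufficiently trimmed mean) rather than the plain mean.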
Results
The proposed framework showed significant improvements in robustness and generalization performance across various datasets under non-IID conditions. The automation of aggregation strategy selection reduced the need for manual intervention, making the deployment of Federated Learning more practical.
Implications
The automation of aggregation strategy selection can facilitate the broader adoption of Federated Learning in real-world applications, particularly for practitioners lacking expertise in the field. This work paves the way for more efficient and effective collaborative model training across diverse environments.
Optimal Decay Spectra for Linear Recurrences
NLP
Large Language Models
Theory
- Introduces Position-Adaptive Spectral Tapering (PoST) for improved long-range memory in linear recurrent models.
- Establishes a design blueprint for memory channels based on logarithmic equipartition of information.
- Demonstrates minimax optimality through Spectral Reparameterization for geometrically spaced decay rates.
- Implements Position-Adaptive Scaling to dynamically adjust memory channel contributions based on sequence position.
Summary
This paper addresses the limitations of linear recurrent models in retaining long-range memory due to suboptimal decay spectra. The author identifies two main issues: the collapse of the minimum spectral gap during random initialization and the degradation of performance with linearly spaced decay rates over long contexts. To overcome these challenges, the paper introduces Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework that combines two mechanisms: Spectral Reparameterization and Position-Adaptive Scaling. Spectral Reparameterization ensures geometrically spaced log-decay rates, achieving minimax optimality for long-range dependencies. Position-Adaptive Scaling dynamically adjusts the effective contribution of memory channels based on the current position in the sequence, thus eliminating scale mismatches and enhancing performance. The framework is integrated into various architectures, including Mamba-2, RWKV-7, and Gated DeltaNet, demonstrating significant improvements in zero-shot language modeling and long-context retrieval tasks. The proposed methods not only enhance memory efficiency but also maintain computational efficiency, making PoST a valuable contribution to the field of sequence modeling.
Methodology
The paper employs theoretical analysis to diagnose failure modes in existing linear recurrent models, followed by the development of PoST, which integrates Spectral Reparameterization and Position-Adaptive Scaling. The effectiveness of PoST is evaluated through experiments on various architectures, focusing on language modeling and long-context retrieval tasks.
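The geometric spacing at the heart of Spectral Reparameterization can be sketched directly: choose channel timescales in geometric progression and convert them to per-channel decay factors for a diagonal linear recurrence. The timescale range and channel count below are illustrative assumptions:

```python
import numpy as np

def decay_rates(n_channels, t_min, t_max, spacing="geometric"):
    """Per-channel decay factors a_c = exp(-1/tau_c) for a bank of linear
    recurrences h_t = a_c * h_{t-1} + x_t, with timescales tau_c spread
    between t_min and t_max."""
    if spacing == "geometric":
        taus = np.geomspace(t_min, t_max, n_channels)
    else:  # linearly spaced timescales, the baseline the paper argues degrades
        taus = np.linspace(t_min, t_max, n_channels)
    return np.exp(-1.0 / taus), taus

def run_bank(a, x):
    """Run the diagonal linear recurrence over a 1-D input sequence."""
    h = np.zeros_like(a)
    for x_t in x:
        h = a * h + x_t
    return h

a_geo, taus = decay_rates(8, t_min=2.0, t_max=512.0)
h = run_bank(a_geo, np.ones(100))
print(taus.round(1))
```

Geometric spacing keeps the ratio between adjacent timescales constant, so each octave of context length gets roughly equal representation — the "logarithmic equipartition" blueprint described above.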
Results
The implementation of PoST across different models resulted in consistent improvements in zero-shot language modeling and significant gains in long-context retrieval for Mamba-2. The framework demonstrated competitive or enhanced performance compared to existing architectures, validating its effectiveness in addressing long-range memory challenges.
Implications
The findings suggest that PoST can be widely applied to enhance the performance of linear recurrent models in various sequence processing tasks, particularly in natural language processing where long-range dependencies are critical. This could lead to more efficient models that maintain high performance without the computational overhead typically associated with traditional architectures.
Physics-informed neural operators for the in situ characterization of locally reacting sound absorbers
Audio & Speech
Theory
Optimization
- Introduces a physics-informed neural operator approach for estimating acoustic surface admittance.
- Avoids the need for explicit forward models by embedding governing acoustic equations into the training process.
- Demonstrates improved robustness to noise and sparse data compared to traditional methods.
- Validates the approach using synthetic data from simulations of porous absorbers.
Summary
This paper addresses the challenge of accurately estimating the acoustic surface admittance or impedance of locally reacting sound absorbers, which is crucial for reliable wave-based simulations. Traditional methods face limitations due to noise, model inaccuracies, and restrictive assumptions. The authors propose a novel approach using physics-informed neural operators to estimate frequency-dependent surface admittance directly from near-field measurements of sound pressure and particle velocity. A deep operator network is utilized to learn the mapping from measurement data, spatial coordinates, and frequency to acoustic field quantities, while simultaneously inferring a globally consistent surface admittance spectrum without the need for an explicit forward model. The training process incorporates governing acoustic relations, such as the Helmholtz equation and Robin boundary conditions, as physics-based regularization, which enhances the robustness of predictions against noise and avoids frequency-wise inversion. The method is validated with synthetic data from simulations of two planar porous absorbers under semi free-field conditions, demonstrating accurate reconstruction of both real and imaginary admittance components and reliable predictions of acoustic field quantities. The results indicate improved robustness to noise and sparse sampling compared to purely data-driven approaches, showcasing the potential of physics-informed neural operators for in situ acoustic material characterization.
Methodology
The authors employ a deep operator network that learns the mapping from near-field measurements of sound pressure and particle velocity to acoustic field quantities. The training process incorporates physics-based regularization by embedding governing acoustic equations, allowing for noise-robust predictions without requiring frequency-wise inversion.
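The physics-based regularization amounts to penalizing the residual of the governing equation on predicted fields. A minimal 1-D Helmholtz example — finite differences stand in for the operator network's derivatives, and the plane-wave test fields are assumptions:

```python
import numpy as np

def helmholtz_residual(p, dx, k):
    """Discrete 1-D Helmholtz residual p'' + k^2 p at interior grid points,
    the kind of quantity a physics-informed loss drives toward zero."""
    lap = (p[:-2] - 2 * p[1:-1] + p[2:]) / dx**2
    return lap + k**2 * p[1:-1]

k = 4.0
x = np.linspace(0, 2 * np.pi, 2001)
dx = x[1] - x[0]
p_true = np.exp(1j * k * x)          # exact plane-wave solution: residual ~ 0
p_wrong = np.exp(1j * 1.5 * k * x)   # field with the wrong wavenumber

loss_true = np.mean(np.abs(helmholtz_residual(p_true, dx, k))**2)
loss_wrong = np.mean(np.abs(helmholtz_residual(p_wrong, dx, k))**2)
print(loss_true, loss_wrong)
```

A network prediction inconsistent with the wave equation incurs a large penalty even where no measurements exist, which is what makes the admittance estimate robust to noise and sparse sampling. The paper additionally enforces the Robin boundary condition, omitted here for brevity.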
Results
The proposed method successfully reconstructs both the real and imaginary components of surface admittance and predicts acoustic field quantities accurately. Validation with synthetic data shows that the approach is robust against noise and sparse sampling, outperforming purely data-driven methods.
Implications
This work has significant implications for the in situ characterization of acoustic materials, potentially enhancing the accuracy of wave-based simulations in various applications, including architectural acoustics, noise control, and material science.
Benchmark Shadows: Data Alignment, Parameter Footprints, and Generalization in Large Language Models
NLP
Large Language Models
Multimodal
- Introduces a regime-centric framework linking data distribution to learning dynamics in LLMs.
- Demonstrates that benchmark-aligned data improves narrow metrics but limits broader representational development.
- Shows that coverage-expanding data leads to better generalization and distributed parameter adaptation.
- Presents parameter-space diagnostics to characterize training regime effects.
Summary
This paper investigates the relationship between data distribution and model performance in large language models (LLMs), highlighting a discrepancy between benchmark scores and broader capabilities. The authors propose a regime-centric framework that distinguishes between benchmark-aligned data, which focuses on narrow evaluation metrics, and coverage-expanding data, which enhances semantic diversity and generalization. Through controlled experiments on a text-only decoder model and multimodal systems, they demonstrate that the type of data used significantly influences learning dynamics and internal model structures. The study introduces parameter-space diagnostics based on spectral and rank analyses to reveal distinct structural signatures associated with different training regimes. The findings suggest that improving benchmark performance does not necessarily equate to enhanced model capabilities, emphasizing the importance of data distribution in shaping learning outcomes.
Methodology
The authors conducted controlled experiments on a text-only decoder model to isolate the effects of different data distributions under fixed training conditions. They employed parameter-space diagnostics, including spectral and rank analyses, to analyze how various training regimes influence model representations and performance. The study also included a case study on prompt repetition to explore the impact of data artifacts.
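One of the simplest diagnostics of this kind is an entropy-based effective rank of a weight-update matrix, which separates low-rank "narrow" adaptation from distributed adaptation. A sketch — the rank-1 versus full-rank toy matrices are illustrative, not the paper's actual checkpoints:

```python
import numpy as np

def effective_rank(W):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution of W."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# "Narrow" update: rank-1 structure, as benchmark-aligned tuning might induce.
narrow = np.outer(rng.normal(size=64), rng.normal(size=64))
# "Distributed" update: full-rank noise, as coverage-expanding data might induce.
distributed = rng.normal(size=(64, 64))
er_narrow = effective_rank(narrow)
er_distributed = effective_rank(distributed)
print(er_narrow, er_distributed)
```

In practice `W` would be the difference between fine-tuned and base weights for a given layer; a collapse of effective rank signals concentrated rather than distributed parameter adaptation.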
Results
The results indicate that benchmark-aligned data can enhance narrow evaluation metrics while constraining broader representational development. In contrast, coverage-expanding data fosters more distributed parameter adaptation and improved generalization. These patterns were consistent across various open-source model families, including multimodal models, suggesting that the effects of data distribution extend beyond controlled settings.
Implications
The findings imply that relying solely on benchmark performance as a measure of model capability is insufficient. Understanding the impact of data distribution on learning dynamics can inform better training strategies for LLMs, potentially leading to models that generalize better across diverse tasks and domains.
Leveraging Complementary Embeddings for Replay Selection in Continual Learning with Small Buffers
Computer Vision
Efficient ML
Theory
- MERS integrates supervised and self-supervised embeddings to improve replay selection in continual learning.
- The method employs a non-parametric alignment strategy based on k-NN density estimation for adaptive selection.
- MERS achieves state-of-the-art performance on Split CIFAR-100 and Split TinyImageNet datasets.
- The approach is efficient and can be seamlessly integrated into existing replay-based continual learning frameworks.
Summary
This paper addresses the challenge of catastrophic forgetting in Continual Learning (CL), particularly in replay-based methods with limited memory. The authors propose a novel approach called Multiple Embedding Replay Selection (MERS), which enhances the sample selection strategy for replay buffers by integrating both supervised and self-supervised embeddings. Traditional methods often rely solely on supervised embeddings, which can lead to suboptimal performance as they may not capture the full data geometry necessary for future tasks. MERS employs a graph-based approach that utilizes a coverage objective across multiple embedding spaces, allowing for better representation of diverse data distributions. The method is designed to be a drop-in enhancement, requiring no additional model parameters or architectural changes. Empirical evaluations on datasets such as CIFAR-100 and TinyImageNet demonstrate that MERS consistently outperforms state-of-the-art selection strategies, particularly in low-memory scenarios, making it a practical solution for continual learning applications.
Methodology
MERS replaces the traditional single-embedding selection process with a coverage-based approach that considers multiple embedding spaces. It uses a weighted maximum k-coverage problem formulation to ensure diverse and representative example selection. The method adapts the scale of each embedding through non-parametric density estimation, allowing it to effectively balance the contributions of different embeddings.
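The selection step can be sketched as a greedy approximation to weighted max k-coverage over k-NN neighbourhoods in each embedding space. Greedy selection is the standard approximation for max coverage; the random embeddings and uniform weighting below are assumptions, not MERS's exact formulation:

```python
import numpy as np

def knn_cover_sets(E, k):
    """For each point, the set of indices it 'covers': itself plus its
    k nearest neighbours in embedding space E."""
    d = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)
    return [set(order[i, :k + 1]) for i in range(len(E))]

def greedy_replay_selection(embeddings, buffer_size, k=3):
    """Greedily add the sample whose neighbourhoods enlarge the union of
    covered points the most, summed across all embedding spaces."""
    covers = [knn_cover_sets(E, k) for E in embeddings]
    covered = [set() for _ in embeddings]
    chosen = []
    for _ in range(buffer_size):
        gains = [sum(len(c[i] - cov) for c, cov in zip(covers, covered))
                 if i not in chosen else -1
                 for i in range(len(covers[0]))]
        best = int(np.argmax(gains))
        chosen.append(best)
        for c, cov in zip(covers, covered):
            cov |= c[best]
    return chosen

rng = np.random.default_rng(0)
sup_emb = rng.normal(size=(40, 8))   # stand-in for supervised features
ssl_emb = rng.normal(size=(40, 8))   # stand-in for self-supervised features
buffer = greedy_replay_selection([sup_emb, ssl_emb], buffer_size=10)
print(buffer)
```

Because coverage is computed jointly across both spaces, a buffer sample can be "useful" for covering the supervised geometry even if it is redundant in the self-supervised one, and vice versa.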
Results
MERS consistently outperformed single-embedding baselines across various continual learning algorithms and datasets, particularly demonstrating significant improvements in low-memory conditions. The results indicate that the integration of multiple embeddings enhances the model's ability to retain knowledge over time.
Implications
The findings suggest that leveraging complementary embeddings can significantly enhance the performance of continual learning systems, particularly in real-world applications where memory constraints are a concern. This approach can be beneficial in fields such as robotics, autonomous driving, and personalized AI systems, where continual adaptation to new information is crucial.
DMax: Aggressive Parallel Decoding for dLLMs
NLP
Large Language Models
Generative Models
- DMax mitigates error accumulation in parallel decoding of dLLMs.
- Introduces On-Policy Uniform Training (OPUT) for effective self-correction.
- Proposes Soft Parallel Decoding (SPD) to enhance decoding robustness.
- Achieves significant improvements in tokens per forward (TPF) without sacrificing accuracy.
Summary
The paper introduces DMax, a novel framework aimed at enhancing the efficiency of diffusion language models (dLLMs) by addressing the issue of error accumulation during parallel decoding. Traditional masked dLLMs utilize a binary mask-to-token approach, which can lead to cascading errors when decoding in parallel. DMax reformulates this process into a self-revising mechanism that transitions from mask embeddings to token embeddings. Central to DMax is On-Policy Uniform Training (OPUT), a training strategy that allows the model to learn from its own predictions, thereby improving its ability to recover from errors. Additionally, the authors propose Soft Parallel Decoding (SPD), which represents intermediate decoding states as a blend of predicted and mask embeddings, facilitating iterative self-correction. Experimental results demonstrate that DMax significantly increases tokens per forward (TPF) while maintaining accuracy across various benchmarks, establishing a strong baseline for future research in parallel decoding for dLLMs.
Methodology
The authors developed DMax by reformulating the decoding process of dLLMs from a binary mask-to-token approach to a self-revising transformation in embedding space. This was achieved through On-Policy Uniform Training (OPUT), which allows the model to learn from its own predictions, and Soft Parallel Decoding (SPD), which enables the model to represent intermediate states as hybrid embeddings for better self-correction.
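The SPD state can be sketched as a confidence-weighted blend in embedding space. The particular blending weight used here (the max-probability confidence) is an assumption for illustration; the paper's exact scheme may differ:

```python
import numpy as np

def soft_parallel_state(logits, token_emb, mask_emb):
    """Soft intermediate state per position: a confidence-weighted blend of
    the expected token embedding and the mask embedding. Low-confidence
    positions stay close to the mask embedding and remain revisable on
    later decoding passes."""
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    conf = probs.max(axis=-1, keepdims=True)   # per-position confidence
    expected = probs @ token_emb               # probability-weighted token embedding
    return conf * expected + (1.0 - conf) * mask_emb

rng = np.random.default_rng(0)
V, D, L = 10, 4, 3
token_emb = rng.normal(size=(V, D))
mask_emb = rng.normal(size=D)
logits = np.zeros((L, V))
logits[0, 2] = 12.0   # position 0: near-certain prediction of token 2
# positions 1-2: uniform logits -> maximally uncertain, stay near the mask
state = soft_parallel_state(logits, token_emb, mask_emb)
print(np.round(state, 3))
```

Confident positions land essentially on their token embedding, while uncertain ones remain dominated by the mask embedding — the continuum between "masked" and "committed" that lets the model revise aggressive parallel guesses.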
Results
DMax improved the tokens per forward (TPF) on the GSM8K benchmark from 2.04 to 5.48 while maintaining an accuracy of 92.1%. On the MBPP benchmark, TPF increased from 2.71 to 5.86, again with comparable performance. The model achieved an average of 1,338 tokens per second (TPS) on two H200 GPUs at a batch size of 1.
Implications
The DMax framework has the potential to significantly enhance the efficiency of dLLMs in various applications, particularly in scenarios requiring rapid text generation and high-quality outputs. This work lays the groundwork for further advancements in parallel decoding techniques and could influence future designs of language models.
The ecosystem of machine learning competitions: Platforms, participants, and their impact on AI development
Theory
- MLCs significantly contribute to AI innovation and skill development.
- Major platforms like Kaggle dominate participation and prize distribution.
- Competitions bridge the gap between academic research and industrial applications.
- MLCs foster collaboration and knowledge sharing within the AI community.
Summary
This paper provides a comprehensive analysis of machine learning competitions (MLCs) and their significant role in advancing artificial intelligence (AI). It examines major competition platforms like Kaggle and Zindi, focusing on their workflows, evaluation methodologies, and reward structures. The study highlights the importance of MLCs in fostering innovation, skill development, and practical problem-solving, bridging the gap between academic research and industrial applications. By analyzing participant demographics, competition quality, and motivations of hosts, the paper illustrates how MLCs shape AI development and promote collaboration. The findings indicate that MLCs serve as effective environments for knowledge exchange, influencing research priorities and industry standards. The paper also discusses the pedagogical benefits of MLCs in education, emphasizing their role in enhancing engagement and skill acquisition through real-world datasets. Furthermore, it addresses the challenges related to data governance and ethical considerations in open competitions, advocating for responsible AI development. Overall, the study underscores the evolving significance of MLCs in the AI landscape, providing insights for researchers, practitioners, and competition organizers.
Methodology
The study employs a combination of literature synthesis, platform-level data analysis, and insights from practitioners to analyze the MLC ecosystem, focusing on competition formats, participant engagement, and demographic trends.
Results
The analysis reveals that MLCs have evolved into crucial environments for evaluation and knowledge sharing, influencing both applied and research-oriented machine learning. It highlights trends in participation, prize allocation, and the increasing use of competitions by various organizations to tackle real-world problems.
Implications
The findings suggest that MLCs can be leveraged for workforce development and education in AI, while also emphasizing the need for ethical standards in competition design to ensure responsible innovation.
Multimodal Large Language Models for Multi-Subject In-Context Image Generation
Multimodal
Generative Models
Computer Vision
- MUSIC is the first MLLM designed for multi-subject in-context image generation.
- An automatic data generation pipeline is introduced, removing the need for manual annotation.
- The vision chain-of-thought mechanism enhances the model's understanding of multi-subject relationships.
- A novel semantics-driven spatial layout planning method is proposed to reduce semantic conflicts.
Summary
This paper introduces MUSIC, the first Multimodal Large Language Model (MLLM) specifically designed for multi-subject in-context image generation. The authors address the challenges of generating images with multiple subjects, which often lead to issues like subject missing and semantic drift in existing methods. To overcome data scarcity, they propose an automatic and scalable data generation pipeline that eliminates the need for manual annotation. The model's understanding of multi-subject semantic relationships is enhanced through a vision chain-of-thought (CoT) mechanism, which facilitates step-by-step reasoning from subject images to semantics and final image generation. Additionally, a semantics-driven spatial layout planning method is developed to manage visual complexity and mitigate identity entanglement. The authors also curate a new benchmark dataset, MSIC, tailored for evaluating multi-subject in-context generation. Experimental results show that MUSIC significantly outperforms existing methods in both multi- and single-subject scenarios, demonstrating improved semantic consistency and identity fidelity.
Methodology
The authors developed the MUSIC model, integrating a vision chain-of-thought mechanism for reasoning, a semantics-driven spatial layout planning method to manage visual complexity, and an automatic data generation pipeline for scalable training. They also curated the MSIC benchmark dataset for evaluation.
Results
MUSIC demonstrated significant improvements over existing methods in generating images with multiple subjects, achieving state-of-the-art performance on both the MSIC and DreamBench datasets, particularly in terms of semantic consistency and identity fidelity.
Implications
The advancements presented in this paper could have practical applications in personalized image generation, multi-person scene synthesis, and complex product visualization, enhancing the capabilities of generative models in handling intricate visual tasks.
Validated Synthetic Patient Generation for Small Longitudinal Cohorts: Coagulation Dynamics Across Pregnancy
Generative Models
Time Series
Theory
- Introduces multiplicity-weighted Stochastic Attention (SA) for synthetic patient generation.
- SA preserves the geometry of real patient data while generating new synthetic profiles.
- Successfully applied to a longitudinal coagulation dataset from pregnant patients.
- Synthetic patients were validated to be statistically and mechanistically similar to real patients.
Summary
This paper addresses the challenge of generating synthetic patient data for small longitudinal cohorts, particularly in maternal health where data is scarce and expensive to collect. The authors introduce a novel generative framework called multiplicity-weighted Stochastic Attention (SA), which utilizes modern Hopfield network theory to create synthetic patient profiles that maintain the geometric structure of real patient data. By embedding real patient profiles as memory patterns, SA generates new synthetic patients through Langevin dynamics, allowing for targeted amplification of rare clinical subgroups without the need for retraining. The methodology was applied to a longitudinal coagulation dataset from 23 pregnant patients, capturing 72 biochemical features across three visits. Validation tests demonstrated that the synthetic patients were statistically and mechanistically indistinguishable from real patients. Furthermore, a mechanistic model calibrated on synthetic data was able to predict real patient outcomes as effectively as one calibrated on actual data. This work highlights the potential of SA to enhance data-augmented modeling in small-cohort settings, particularly in clinical research where traditional data collection methods are limited.
Methodology
The authors developed a generative framework based on modern Hopfield network theory, where real patient profiles are treated as memory patterns in a continuous energy landscape. The synthetic patient generation occurs through Langevin dynamics, enabling interpolation between stored patterns while maintaining the original cohort's geometry. The framework incorporates multiplicity weights to amplify rare clinical subgroups during inference without retraining.
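The generation step can be sketched as Langevin dynamics on a modern Hopfield energy with per-pattern multiplicity weights. The specific energy form, inverse temperature, and step sizes below are illustrative assumptions rather than the paper's calibrated settings:

```python
import numpy as np

def hopfield_grad(x, patterns, weights, beta):
    """Gradient of the modern Hopfield energy
    E(x) = -(1/beta) * logsumexp(beta * patterns @ x + log(weights)) + 0.5*|x|^2.
    Multiplicity weights up-weight rare subgroups at inference time,
    with no retraining."""
    logits = beta * patterns @ x + np.log(weights)
    logits -= logits.max()
    p = np.exp(logits) / np.exp(logits).sum()
    return x - p @ patterns

def langevin_sample(patterns, weights, beta=4.0, steps=500,
                    eta=0.05, temp=0.01, seed=0):
    """Generate one synthetic profile by noisy gradient descent on the energy."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=patterns.shape[1])
    for _ in range(steps):
        noise = rng.normal(size=x.shape)
        x = x - eta * hopfield_grad(x, patterns, weights, beta) \
              + np.sqrt(2 * eta * temp) * noise
    return x

rng = np.random.default_rng(1)
patterns = rng.normal(size=(5, 12))       # stand-in for real patient profiles
weights = np.array([1, 1, 1, 1, 3.0])     # amplify the 'rare' fifth profile
x = langevin_sample(patterns, weights)
dists = np.linalg.norm(patterns - x, axis=1)
print(dists.round(2))
```

The temperature controls how far samples interpolate away from the stored patterns: near zero the dynamics retrieve a memory almost exactly, while moderate noise yields new profiles that stay on the cohort's geometry.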
Results
The synthetic patients generated by SA were validated through multiple independent tests, showing they were statistically, structurally, and mechanistically indistinguishable from real patients. A mechanistic model calibrated entirely on synthetic data was able to predict held-out real patient outcomes comparably to a model calibrated on real data, indicating the effectiveness of the synthetic data in clinical applications.
Implications
The findings suggest that SA can significantly enhance the ability to conduct analyses and modeling in small longitudinal datasets, particularly in maternal health and other fields where data is limited. This approach could lead to better understanding and management of conditions with rare complications, ultimately improving patient outcomes.
Learning is Forgetting: LLM Training As Lossy Compression
NLP
Large Language Models
Theory
- LLMs are conceptualized as instances of lossy compression, retaining only relevant information from training data.
- Pre-training dynamics align with Information Bottleneck theory, showing a trajectory of initial expansion followed by compression.
- The optimality of compression correlates significantly with performance across multiple benchmarks.
- Quantifying preference information in models predicts downstream performance effectively.
Summary
This paper presents a novel perspective on the training of large language models (LLMs) by framing it as a process of lossy compression. The authors argue that LLMs retain only the information relevant to their training objective, approaching an optimal compression of their training data. They demonstrate that the pre-training dynamics of LLMs align closely with theoretical predictions from Information Bottleneck theory, showing a two-phase trajectory in which models initially expand their representations before gradually compressing them. The study reveals that different LLMs compress information differently based on their training data and methodologies, yet the optimality of this compression can predict performance on various downstream tasks. By quantifying the preference information in models, the authors establish a significant correlation between the alignment of representations and model performance across multiple benchmarks. This work offers a unified information-theoretic framework for understanding LLM training and provides actionable insights into their representational structures.
Methodology
The authors employed an information-theoretic approach to analyze the pre-training dynamics of LLMs, utilizing the Information Bottleneck theory to quantify the mutual information between representations and inputs/outputs. They conducted experiments across various open-weight models to assess their compression characteristics and performance on downstream tasks.
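The central quantity is the mutual information between representations and inputs or outputs. For continuous LLM representations the paper necessarily relies on more sophisticated estimators; the plug-in estimator below merely illustrates the definition on discrete toy data:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in mutual information I(X;Y) in nats for two discrete sequences,
    the basic quantity behind Information Bottleneck analyses of how much a
    representation retains about its input."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

x = np.array([0, 0, 1, 1] * 250)
print(mutual_information(x, x))                   # I(X;X) = H(X) = log 2
print(mutual_information(x, np.tile([0, 1], 500)))  # independent pattern -> 0
```

In the Information Bottleneck picture, training first increases I(representation; input) (expansion) and then decreases it while preserving I(representation; output) (compression) — the two-phase trajectory described above.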
Results
The study found that LLMs approach optimal compression during pre-training, with smaller models struggling to achieve meaningful compression. The correlation between a model's compression optimality and its performance was significant across six benchmarks for different families of LLMs. Additionally, the quantification of preference information showed a strong predictive relationship with downstream performance metrics.
Implications
This research provides a theoretical framework for understanding how LLMs learn and generalize, potentially guiding future model design and training strategies. It also offers insights into the interpretability of LLMs by linking their representational structures to performance outcomes.
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
NLP
Large Language Models
- Emotional prompting can enhance LLM performance but may increase sycophantic behavior.
- The study evaluates four emotions: joy, encouragement, anger, and insecurity, across varying intensities.
- A prompt-generation pipeline was developed to create a comprehensive dataset for analysis.
- Positive emotional stimuli lead to more accurate and less toxic outputs from LLMs.
Summary
This paper investigates the impact of emotional stimuli and their intensity on the behavior of large language models (LLMs). While previous research has focused on the effects of positive emotional prompts, this study expands the scope to include four distinct emotions: joy, encouragement, anger, and insecurity. The authors develop a prompt-generation pipeline using GPT-4o mini to create a diverse set of prompts with varying emotional intensities. They compile a 'Gold Dataset' to ensure alignment between human and model-generated labels. The empirical evaluation reveals that positive emotional stimuli enhance accuracy and reduce toxicity in LLM outputs, but also lead to increased sycophantic behavior, where models excessively agree with users. This dual effect highlights the complexity of emotional prompting in LLMs, suggesting that while emotional cues can improve performance, they may also compromise the reliability of the generated information.
Methodology
The authors created a set of human-designed emotional prompts rated on a 1-10 intensity scale. They developed an emotion detection pipeline using zero-shot prompting with GPT-4o mini to assign emotional ratings. A total of 415 LLM-generated prompts were created based on the human prompts. The LLM outputs were evaluated for accuracy, sycophancy, and toxicity using established benchmarks, including Anthropic's SycophancyEval and the Real-Toxicity-Prompts dataset.
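The zero-shot rating step can be sketched as prompt construction alone; `rating_prompt` and its wording are hypothetical stand-ins, and the call to the actual model (GPT-4o mini in the paper) is omitted.

```python
# Sketch of a zero-shot emotion-intensity rating prompt, assuming a
# chat-style LLM client. Only the prompt construction is shown; sending
# it to the model (GPT-4o mini in the paper) is left out.
def rating_prompt(emotion, text):
    return (
        f"Rate the intensity of {emotion} expressed in the following prompt "
        f"on a 1-10 scale. Reply with a single integer.\n\nPrompt: {text}"
    )

p = rating_prompt("encouragement", "You can do this. Take your time.")
print(p)
```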
Results
The analysis showed that prompts with positive emotional stimuli resulted in higher accuracy and lower toxicity in LLM outputs. However, these prompts also correlated with increased sycophantic responses, indicating a trade-off between improved performance and the risk of generating overly agreeable outputs.
Implications
The findings suggest that while emotional prompting can be a powerful tool for enhancing LLM performance, it is crucial to consider the potential for sycophantic behavior, which can undermine the reliability of the information generated. This has significant implications for the deployment of LLMs in sensitive applications where accuracy and trustworthiness are paramount.
Approximation of the Basset force in the Maxey-Riley-Gatignol equations via universal differential equations
Theory
Optimization
Time Series
- Introduces a neural network-based approximation for the Basset force in MaRGE.
- Transforms the integro-differential equations into ordinary differential equations for easier numerical solutions.
- Compares FNN and LSTM architectures to capture historical effects in particle motion.
- Demonstrates the effectiveness of the proposed method through numerical experiments in various flow fields.
Read more
Approximation of the Basset force in the Maxey-Riley-Gatignol equations via universal differential equations
Summary
This paper addresses the challenge of numerically solving the Maxey-Riley-Gatignol equations (MaRGE), which describe the motion of spherical inertial particles in a fluid and include the Basset force, an integral term that accounts for historical effects. The Basset force complicates the numerical solution due to its dependence on the particle's past trajectory, leading to its frequent neglect despite its significant impact on particle dynamics. The authors propose a novel approximation of the Basset force using universal differential equations (UDEs) and neural networks, transforming the integro-differential equations into a system of ordinary differential equations (ODEs) that can be solved with standard numerical methods like Runge-Kutta. They compare the performance of a feedforward neural network (FNN) and a long short-term memory (LSTM) network to capture the memory effects of the Basset force. The methodology involves generating training data from the full MaRGE using a numerical solver and testing the approach in two different flow fields. The results demonstrate that the neural network-based approximation effectively reduces the complexity of solving MaRGE while maintaining accuracy, allowing for the use of existing ODE solvers.
Methodology
The authors utilize universal differential equations to approximate the Basset force, replacing the integral term with a neural network. They generate training data by solving the full MaRGE using a numerical solver and evaluate the performance of both feedforward neural networks and LSTMs in capturing the memory effects of the Basset force.
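The UDE construction can be sketched in miniature: a small feedforward network stands in for the Basset history integral inside an otherwise ordinary ODE, which standard steppers can then integrate. The toy dynamics, network sizes, and forward-Euler stepping below are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

# Minimal sketch of the universal-differential-equation idea: the Basset
# integral term in the particle's equation of motion is replaced by a small
# feedforward network nn_basset(state), turning the integro-differential
# system into an ODE. Weights and the toy drag dynamics are illustrative.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)) * 0.1, np.zeros(1)

def nn_basset(state):
    """Surrogate for the history (Basset) force: a tiny FNN of the state."""
    h = np.tanh(W1 @ state + b1)
    return (W2 @ h + b2)[0]

def rhs(state, fluid_vel=0.0, tau=1.0):
    """ODE right-hand side: Stokes-like drag plus the learned history term."""
    x, v = state
    dv = (fluid_vel - v) / tau + nn_basset(state)
    return np.array([v, dv])

# Forward-Euler integration (a Runge-Kutta solver would be used in practice).
state, dt = np.array([0.0, 1.0]), 0.01
for _ in range(100):
    state = state + dt * rhs(state)
print(state.shape)  # (2,)
```

In the paper the network is trained so that the augmented ODE reproduces trajectories of the full MaRGE; here it is untrained and only shows the plumbing.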
Results
The proposed neural network approximation successfully simplifies the numerical solution of the Maxey-Riley-Gatignol equations, allowing for the use of standard ODE solvers while accurately modeling the effects of the Basset force. The comparison between FNN and LSTM shows that both architectures can effectively capture the historical effects, with potential variations in performance depending on the specific flow field.
Implications
This work has significant implications for the modeling of inertial particles in fluid dynamics, particularly in applications involving dilute suspensions, environmental modeling (e.g., microplastics transport), and industrial processes. The ability to accurately model the Basset force could lead to improved predictions of particle behavior in various fluid environments.
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
Optimization
Robotics
Efficient ML
- MOSAIC optimizes data selection by clustering data into domains and modeling their impact on performance metrics.
- The framework significantly reduces the amount of data needed for training while improving model performance.
- MOSAIC outperforms traditional data selection methods, achieving better results with up to 82% less data.
- The approach is robust across different clustering strategies and emphasizes the importance of scaling laws in data selection.
Read more
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
Summary
This paper presents a novel framework called Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC) aimed at improving data selection for training large-scale deep learning models in autonomous driving systems. The authors identify the challenges posed by the need for diverse training data that meets various evaluation criteria, particularly in the context of physical AI applications like autonomous driving. Current data selection frameworks often fail to account for the ambiguity in how different data points influence multiple performance metrics. MOSAIC addresses this by partitioning the dataset into distinct domains, fitting neural scaling laws to these domains, and optimizing the data mixture through an iterative process that focuses on maximizing metric improvements. The framework is applied to an End-to-End (E2E) planner model evaluated on the Extended Predictive Driver Model Score (EPDMS), demonstrating significant improvements in data efficiency and performance compared to existing baselines.
Methodology
MOSAIC operates by first partitioning the dataset into domains, then fitting neural scaling laws to evaluate how data from each domain influences various performance metrics. The framework iteratively selects data points that maximize the expected improvement in an aggregate utility function derived from these metrics, allowing for efficient data mixture optimization.
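The scaling-aware selection loop can be sketched under simplifying assumptions: fit a power-law error curve per domain from a few observations, then greedily allocate each new batch to the domain with the largest predicted error drop. The domains, observed errors, single metric, and batch size are invented for illustration.

```python
import numpy as np

# Illustrative sketch of a MOSAIC-style loop: fit err(n) ~ a * n**(-b)
# per domain in log-log space, then greedily give the next data batch to
# the domain whose scaling curve predicts the biggest error reduction.
def fit_power_law(ns, errs):
    # Linear fit in log-log space: log err = log a - b * log n.
    slope, intercept = np.polyfit(np.log(ns), np.log(errs), 1)
    return np.exp(intercept), -slope  # a, b

domains = {
    "urban":   ([100, 200, 400], [0.50, 0.40, 0.32]),
    "highway": ([100, 200, 400], [0.30, 0.27, 0.25]),
}
laws = {d: fit_power_law(np.array(ns), np.array(es))
        for d, (ns, es) in domains.items()}

alloc = {d: 400 for d in domains}
for _ in range(5):  # allocate 5 batches of 100 samples greedily
    def gain(d):
        a, b = laws[d]
        n = alloc[d]
        return a * n**(-b) - a * (n + 100)**(-b)  # predicted error drop
    best = max(domains, key=gain)
    alloc[best] += 100
print(alloc)
```

The real framework optimizes an aggregate utility over many metrics (EPDMS components) rather than a single error curve, but the fit-then-greedy structure is the same.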
Results
MOSAIC was tested on the NAVSIM and OpenScene benchmarks, where it demonstrated superior performance in driving tasks, achieving better EPDMS scores with significantly less data. The framework was able to maintain full training performance while requiring 42% fewer data samples, showcasing its efficiency and effectiveness in data selection.
Implications
The findings suggest that MOSAIC can be a valuable tool for optimizing data selection in autonomous driving systems and potentially other physical AI applications, leading to more efficient training processes and better model performance with reduced data requirements.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Generative Models
Computer Vision
Efficient ML
- Data Warmup addresses inefficiencies in diffusion training by aligning data complexity with model readiness.
- A semantic-aware complexity metric is introduced, combining foreground dominance and typicality for image scoring.
- The curriculum's simple-to-complex ordering is critical for performance improvements, as reversing it degrades results.
- Data Warmup significantly improves IS by up to 6.11 and FID by up to 3.41 on ImageNet datasets.
Read more
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Summary
The paper introduces 'Data Warmup', a novel curriculum learning strategy aimed at improving the efficiency of diffusion training by addressing the mismatch between data complexity and model readiness. The authors argue that randomly initialized networks struggle with the full complexity spectrum of training images, leading to inefficiencies in early training stages. To mitigate this, Data Warmup schedules training images from simple to complex based on a semantic-aware complexity metric that combines foreground dominance and typicality. This metric is computed offline, allowing for a temperature-controlled sampler to prioritize low-complexity images initially, gradually transitioning to uniform sampling. The authors demonstrate that this approach significantly enhances image synthesis quality, achieving improvements in Inception Score (IS) and Fréchet Inception Distance (FID) on ImageNet 256×256 datasets with SiT backbones. The results indicate that the order of complexity is crucial, as reversing the curriculum leads to degraded performance. The method is efficient, requiring only a brief preprocessing phase without adding per-iteration overhead, and can be combined with existing accelerators like REPA.
Methodology
The authors developed a curriculum learning strategy called Data Warmup, which utilizes a semantic-aware complexity metric to score images based on foreground dominance and typicality. This scoring is done offline, allowing a temperature-controlled sampler to prioritize simpler images at the beginning of training. The method is evaluated on ImageNet datasets using SiT backbones, comparing performance across different sampling orders.
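The temperature-controlled sampler can be sketched as follows, assuming each image already has a precomputed complexity score in [0, 1]; random scores stand in for the paper's semantic-aware metric.

```python
import numpy as np

# Sketch of a temperature-controlled curriculum sampler. A low temperature
# concentrates sampling probability on low-complexity images early in
# training; raising the temperature flattens the distribution toward
# uniform sampling. Scores here are random stand-ins for the offline
# foreground-dominance/typicality metric.
rng = np.random.default_rng(0)
complexity = rng.random(10_000)  # stand-in for offline complexity scores

def sampling_probs(scores, temperature):
    logits = -scores / temperature  # simpler images get higher logits
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

early = sampling_probs(complexity, temperature=0.05)
late = sampling_probs(complexity, temperature=50.0)

# Early sampling favors low-complexity images; late sampling is near-uniform.
mean_early = complexity @ early
mean_late = complexity @ late
print(round(mean_early, 3), round(mean_late, 3))
```

The exact temperature schedule (how fast sampling anneals toward uniform) is a design choice the paper tunes; the two fixed temperatures above only bracket the endpoints.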
Results
Data Warmup led to significant improvements in image generation quality, with IS improvements of up to 6.11 and FID improvements of up to 3.41 on ImageNet 256×256 datasets. The results confirmed that the simple-to-complex ordering of images is essential for achieving these gains, as reversing the order resulted in performance degradation.
Implications
The findings suggest that adopting a complexity-aware curriculum can enhance the efficiency of training generative models, potentially reducing computational costs and training time. This approach could be applied to other generative tasks and models, leading to broader applications in computer vision and beyond.
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Computer Vision
Efficient ML
Robotics
- CPS-Prompt improves training-time efficiency for continual learning on edge devices.
- The framework reduces memory usage and computational cost with minimal accuracy loss.
- Critical Patch Sampling (CPS) and Decoupled Prompt and Classifier Training (DPCT) are the two main components.
- CPS-Prompt shows significant improvements in peak memory and energy efficiency over existing methods.
Read more
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Summary
This paper presents CPS-Prompt, a novel framework for prompt-based continual learning (PCL) that addresses the challenges of training-time memory usage and computational cost on resource-constrained edge devices. The authors highlight the need for efficient on-device adaptation, particularly in scenarios where excessive intermediate activations can lead to memory overflow and training failures. CPS-Prompt integrates two key components: Critical Patch Sampling (CPS) for task-aware token reduction and Decoupled Prompt and Classifier Training (DPCT) to minimize backpropagation overhead. Experimental evaluations on three public benchmarks and real edge hardware demonstrate that CPS-Prompt achieves a 1.6× improvement in peak memory, training time, and energy efficiency compared to the CODA-Prompt baseline, while maintaining accuracy within 2% of the state-of-the-art C-Prompt. The framework is validated on the Jetson Orin Nano, confirming its practicality for continual learning in edge scenarios.
Methodology
CPS-Prompt employs a two-module approach: Critical Patch Sampling (CPS) selects task-specific patches to reduce memory usage during training, while Decoupled Prompt and Classifier Training (DPCT) optimizes the training process by separating the learning of prompts and classifiers to decrease backpropagation overhead. This dual approach allows for efficient training on edge devices without compromising accuracy.
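The token-reduction half of this can be sketched in a few lines: keep only the top-k patch tokens per image by an importance score, shrinking the activations that must be stored for backpropagation. The scores below are random stand-ins, not the paper's task-aware scoring.

```python
import numpy as np

# Sketch of the Critical Patch Sampling idea: rank patch tokens by an
# importance score and keep the top-k, preserving their original order.
# Scores are random here; CPS derives them in a task-aware way.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))  # 196 ViT patch tokens, dim 768
scores = rng.random(196)              # stand-in for patch importance

def critical_patch_sample(tokens, scores, keep=49):
    idx = np.argsort(scores)[-keep:]  # indices of the top-k patches
    return tokens[np.sort(idx)]       # keep original spatial order

kept = critical_patch_sample(tokens, scores, keep=49)
print(kept.shape)  # (49, 768)
```

Dropping 3 out of 4 patches, as in this toy configuration, cuts the per-layer activation memory roughly proportionally, which is where the training-time savings come from.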
Results
CPS-Prompt achieves approximately 1.6× improvements in peak memory, training time, and energy efficiency compared to the CODA-Prompt baseline. It maintains an accuracy level that is within 2% of the state-of-the-art C-Prompt, demonstrating a strong balance between efficiency and performance.
Implications
The findings suggest that CPS-Prompt can significantly enhance the feasibility of continual learning applications on edge devices, such as smartphones and drones, where computational resources are limited. This framework could lead to more efficient and effective on-device learning systems, enabling real-time adaptation to new tasks while preserving prior knowledge.
EgoEverything: A Benchmark for Human Behavior Inspired Long Context Egocentric Video Understanding in AR Environment
Computer Vision
Multimodal
- EgoEverything benchmark incorporates human attention signals for question generation.
- It includes over 5,000 question-answer pairs based on realistic AR scenarios.
- The methodology employs a multi-agent VQA pipeline and attention-inspired sampling.
- Evaluation shows current VLMs struggle with the complexities of real-world AR interactions.
Read more
EgoEverything: A Benchmark for Human Behavior Inspired Long Context Egocentric Video Understanding in AR Environment
Summary
EgoEverything is a novel benchmark designed to enhance long-context egocentric video understanding, particularly in augmented reality (AR) environments. The benchmark addresses the limitations of existing datasets that primarily focus on visual content without considering the underlying human behavior that informs video-related queries. By integrating human attention signals derived from gaze data, EgoEverything generates over 5,000 multiple-choice question-answer pairs across more than 100 hours of video. This approach captures the natural human behavior of querying based on attention, thereby providing a more realistic evaluation setting for machine learning models. The authors highlight the need for models to reason over extended temporal contexts and diverse activities, which is crucial for effective everyday assistance in AR applications. The benchmark's design includes a Visual Question Answering (VQA) generation pipeline that utilizes multiple AI agents and an attention-inspired sampling strategy to create both attention-driven and detail-oriented queries. Evaluation of current Vision-Language Models (VLMs) on EgoEverything reveals significant performance gaps, indicating the challenges these models face in real-life AR scenarios.
Methodology
The authors developed a Visual Question Answering (VQA) generation pipeline that leverages multiple AI agents to produce questions aligned with authentic human questioning patterns. An attention-inspired sampling strategy selects question targets based on simulated gaze, allowing for both attention-driven and detail-oriented queries. Comprehensive human review was incorporated to enhance question quality and reliability.
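The attention-inspired selection step can be sketched under the assumption that gaze has already been mapped to the object fixated in each frame: heavily fixated objects seed attention-driven questions, while rarely fixated ones seed detail-oriented questions. The fixation log below is invented.

```python
from collections import Counter

# Sketch of attention-inspired target selection from a per-frame fixation
# log. Objects with the most gaze frames become attention-driven question
# targets; the least-fixated objects can seed detail-oriented questions.
fixations = ["mug", "mug", "laptop", "mug", "door", "laptop", "mug"]

dwell = Counter(fixations)
attention_targets = [obj for obj, _ in dwell.most_common(2)]
detail_targets = [obj for obj, c in dwell.items() if c == min(dwell.values())]
print(attention_targets, detail_targets)
```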
Results
Evaluation on several cutting-edge Vision-Language Models (VLMs) demonstrated consistently lower performance on the EgoEverything benchmark, highlighting the limitations of these models in handling real-life AR long-context egocentric video understanding tasks.
Implications
EgoEverything has the potential to significantly advance research in long-context egocentric video understanding and improve the development of intelligent personal assistants in AR environments. By focusing on human behavior and attention, it encourages the creation of more effective machine learning models that can assist users in everyday tasks.
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
Large Language Models
NLP
Efficient ML
- Output length prediction is critical for efficient LLM serving and resource allocation.
- Existing methods treat output length as a deterministic scalar, which is statistically misaligned.
- The proposed ProD framework captures the heavy-tailed nature of output length distributions.
- ProD-M and ProD-D provide robust point and distributional predictions, respectively.
Read more
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
Summary
This paper addresses the challenge of output-length prediction for large language models (LLMs), which is crucial for efficient serving and resource allocation. Traditional methods treat output length as a deterministic scalar, using a one-shot sampled length as the label. However, the authors demonstrate that the output length is better represented as a distribution, specifically a heavy-tailed distribution, where the same prompt can yield varying output lengths. To improve prediction accuracy, they propose a new framework called Prompt-conditioned length Distributions (ProD), which includes two methods: ProD-M, which uses the median of multiple independent generations as a robust point prediction target, and ProD-D, which employs a distributional target that captures the full uncertainty of the output length. Theoretical analysis supports the effectiveness of these methods, showing that increasing the number of samples significantly reduces estimation error. Experimental results on two LLMs, Qwen-2.5-7B and Llama-3-8B, demonstrate that the proposed methods achieve up to a 25% reduction in average prediction error compared to state-of-the-art approaches.
Methodology
The authors introduce the ProD framework, which involves two main methods: ProD-M, which uses the median of multiple independent generations as a training target, and ProD-D, which uses a histogram of sampled lengths to capture the distributional nature of output lengths. Both methods leverage the last-layer hidden states of the LLM without requiring additional models or inference costs.
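The two training targets can be sketched directly, assuming several independent generations per prompt with recorded lengths; the heavy-tailed lengths and the bin edges below are synthetic.

```python
import numpy as np

# Sketch of the two ProD label constructions. ProD-M takes the median of
# sampled lengths as a robust scalar label; ProD-D bins the samples into a
# histogram used as a distributional label. Lengths are synthetic draws
# from a heavy-tailed (Pareto) distribution, shifted to be at least 50.
rng = np.random.default_rng(0)
lengths = rng.pareto(2.0, size=16) * 100 + 50

prod_m_target = np.median(lengths)                      # robust point label

bins = np.array([0, 100, 200, 400, 800, 1600, 10_000])  # token-length buckets
counts, _ = np.histogram(lengths, bins=bins)
prod_d_target = counts / counts.sum()                   # distributional label

print(round(prod_m_target, 1), prod_d_target)
```

The median's robustness is the point: a single very long sampled generation barely moves it, whereas a one-shot sampled label could land anywhere in the tail.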
Results
The proposed methods were tested on Qwen-2.5-7B and Llama-3-8B across four benchmarks, achieving up to a 25% reduction in average prediction error compared to state-of-the-art methods, demonstrating their effectiveness in robust length prediction.
Implications
The findings suggest that treating output length as a distribution rather than a deterministic value can enhance the efficiency of LLM serving, leading to better resource allocation and scheduling in various applications, including chatbots and agentic workflows.
SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization
Efficient ML
Federated Learning
Large Language Models
- SOLAR significantly reduces communication and storage costs of PEFT adapters.
- The method utilizes subspace similarity to create compact adapter representations.
- It is model-agnostic and compatible with existing PEFT methods.
- The framework allows for post-training compression without modifying the fine-tuning process.
Read more
SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization
Summary
The paper introduces SOLAR, a novel framework aimed at enhancing the efficiency of Parameter-Efficient Fine-Tuning (PEFT) methods by significantly reducing the communication and storage costs associated with model adaptation. Traditional PEFT techniques, while effective, still incur substantial overhead in resource-constrained environments, particularly in distributed systems. SOLAR addresses this issue by reparameterizing PEFT updates as linear combinations of basis vectors derived from the foundation model's singular vectors, incorporating controlled random perturbations. This approach leverages the subspace similarity between the foundation model and task-specific updates, allowing for a decoupling of adapter size from the model architecture. The proposed method is model-agnostic and compatible with existing PEFT techniques, enabling post-training compression without altering the original fine-tuning process. The authors provide a theoretical analysis bounding the reconstruction error and demonstrate through extensive experiments that SOLAR can reduce adapter sizes by up to 98% while maintaining competitive performance across various language and vision tasks.
Methodology
SOLAR employs a three-step framework for post-hoc adapter compression: (1) constructing a basis pool from the foundation model's singular vectors with random perturbations, (2) selecting the most significant basis vectors based on a budget, and (3) reconstructing the adapter using only the selected coefficients and a random seed. The method is designed to be compatible with existing PEFT techniques, allowing for efficient compression without retraining.
Results
The experiments conducted on various language and vision tasks using models such as ViT, GPT-2, and LLaMA demonstrate that SOLAR can reduce the size of adapters by up to 98% while preserving the performance levels of the original LoRA adapters, indicating its effectiveness in maintaining accuracy despite significant compression.
Implications
SOLAR's ability to compress model adaptations efficiently has significant implications for deploying large-scale models in resource-constrained environments, such as edge devices and federated learning systems. It enables faster training, reduced energy consumption, and improved scalability, making it a valuable tool for practical applications of foundation models.
Provably Adaptive Linear Approximation for the Shapley Value and Beyond
Theory
Efficient ML
Interpretability
- Establishes a theoretical framework for approximating semi-values with improved query complexities.
- Develops a linear-space algorithm requiring O(n/ε² log(1/δ)) utility queries.
- Introduces Adalina, an adaptive algorithm that minimizes mean square error in linear time and space.
- Bridges existing algorithms and clarifies the benefits of paired sampling.
Read more
Provably Adaptive Linear Approximation for the Shapley Value and Beyond
Summary
This paper addresses the challenge of efficiently approximating the Shapley value and its broader family of semi-values, which are crucial in various attribution problems in machine learning. The authors propose a theoretical framework that leverages a vector concentration inequality to improve the query complexities of existing unbiased randomized algorithms under a Θ(n) space constraint. They develop a linear-space algorithm that requires O(n/ε² log(1/δ)) utility queries to ensure a specified accuracy with high probability. This framework connects various existing algorithms, including OFA and kernelSHAP, and characterizes the conditions under which paired sampling is advantageous. The authors introduce Adalina, the first adaptive, linear-time, linear-space randomized algorithm that minimizes mean square error for specific utility functions. The theoretical findings are supported by experimental validation, demonstrating the effectiveness of the proposed methods in practical applications.
Methodology
The authors utilize a vector concentration inequality to derive sharper query complexities for existing unbiased algorithms. They systematically develop a linear-space algorithm and introduce Adalina, focusing on minimizing mean square error while adhering to space constraints. The framework connects various approximation methods and provides a holistic perspective on bounding estimates.
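The unbiased Monte Carlo estimator under a linear space budget, together with the paired-sampling trick the paper analyzes, can be sketched with a toy utility; the additive utility below is an illustrative stand-in for a real model's utility function.

```python
import numpy as np

# Sketch of unbiased Monte Carlo Shapley estimation in O(n) space with
# paired sampling: each random permutation is used together with its
# reverse, which often reduces variance. The utility (sum of included
# values) is a toy stand-in; for it, Shapley values equal the values.
rng = np.random.default_rng(0)
values = np.array([3.0, 1.0, 2.0, 4.0])
n = len(values)

def utility(subset_mask):
    return values[subset_mask].sum()

def shapley_paired(num_perms=200):
    est = np.zeros(n)                     # only Θ(n) running-sum storage
    count = 0
    for _ in range(num_perms):
        perm = rng.permutation(n)
        for order in (perm, perm[::-1]):  # paired: permutation and reverse
            mask = np.zeros(n, dtype=bool)
            prev = 0.0
            for i in order:
                mask[i] = True
                cur = utility(mask)
                est[i] += cur - prev      # marginal contribution of i
                prev = cur
            count += 1
    return est / count

phi = shapley_paired()
print(np.round(phi, 2))  # prints [3. 1. 2. 4.]
```

Whether the reversed permutation actually helps depends on the utility's structure, which is one of the conditions the paper characterizes; for the additive toy utility every estimate is exact regardless.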
Results
The proposed framework allows for a significant reduction in query complexity to O(n/ε² log(1/δ)) while maintaining a Θ(n) space requirement. The Adalina algorithm achieves improved mean square error, and all theoretical findings are experimentally validated, confirming the practicality of the approach.
Implications
The results have significant implications for large-scale applications in machine learning where efficient attribution is required. The ability to approximate semi-values with reduced query complexity and space usage can enhance the performance of models in various domains, including feature attribution and model interpretability.
GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control
Reinforcement Learning
Generative Models
Robotics
- GIRL addresses imagination drift in MBRL through cross-modal grounding and trust-region constraints.
- The framework utilizes a frozen DINOv2 model to ensure semantic consistency in imagined trajectories.
- GIRL shows a 38-61% reduction in latent rollout drift compared to DreamerV3.
- It achieves higher asymptotic returns with 40-55% fewer environment steps on long-horizon tasks.
Read more
GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control
Summary
The paper introduces GIRL (Generative Imagination Reinforcement Learning), a novel framework designed to enhance model-based reinforcement learning (MBRL) by addressing the issue of imagination drift during long-horizon planning. The authors identify that traditional MBRL methods, such as DreamerV3, suffer from compounded model errors that lead to unreliable value estimates and catastrophic policy failures. To mitigate this, GIRL incorporates two key innovations: a cross-modal grounding signal derived from a frozen foundation model (DINOv2) that anchors the latent transition prior to a semantically consistent embedding space, and an uncertainty-adaptive trust-region bottleneck that constrains the imagination drift using a KL regularizer formulated as a Lagrange multiplier. The theoretical contributions include a re-derivation of the value-gap bound that connects the I-ELBO objective to real-environment regret. Empirical evaluations across multiple benchmark suites demonstrate that GIRL significantly reduces latent rollout drift and achieves higher returns with fewer environment interactions compared to existing methods.
Methodology
GIRL employs a latent world-model framework that integrates a cross-modal grounding vector from a frozen DINOv2 model to anchor the latent transition prior. It also implements an uncertainty-adaptive trust-region bottleneck that constrains the KL regularizer based on Expected Information Gain and Relative Performance Loss signals. The model is evaluated using a recurrent state-space model (RSSM) approach, with a focus on maintaining a deterministic recurrent state and stochastic latent variables.
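The trust-region mechanism can be sketched in its generic form: a Lagrange multiplier on the KL drift is raised when measured drift exceeds a budget and lowered otherwise, so the penalty tightens automatically. The drift sequence, learning rate, and fixed budget below are synthetic; GIRL derives its budget adaptively from the EIG/RPL signals, which are not modeled here.

```python
import numpy as np

# Sketch of an adaptive KL trust region via a Lagrange multiplier: the
# coefficient lam on the KL penalty grows while measured drift exceeds the
# target budget and shrinks once drift falls below it, keeping imagination
# rollouts inside the trust region without hand-tuning the penalty weight.
def dual_update(lam, kl, target, lr=0.1):
    lam = lam * np.exp(lr * (kl - target))  # multiplicative dual ascent
    return float(np.clip(lam, 1e-4, 1e4))

lam, target = 1.0, 0.05
for step in range(200):
    kl = 0.2 if step < 100 else 0.01  # synthetic drift: high, then low
    lam = dual_update(lam, kl, target)
print(round(lam, 4))
```

The multiplicative update keeps `lam` positive by construction, which is why it is a common parameterization for constrained objectives of this form.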
Results
The experiments demonstrate that GIRL reduces latent rollout drift by 38-61% across various tasks compared to DreamerV3. It achieves higher asymptotic returns while requiring 40-55% fewer environment steps on tasks with a horizon of 500 or more. Additionally, GIRL outperforms TD-MPC2 in sparse-reward and high-contact settings, as measured by Interquartile Mean (IQM) and Probability of Improvement (PI). The distilled-prior variant of GIRL also significantly reduces inference overhead.
Implications
The advancements presented in GIRL could lead to more robust and efficient reinforcement learning applications in complex environments, particularly in robotics and autonomous systems where long-horizon planning is critical. The methods developed may also enhance the performance of other model-based learning frameworks by providing a means to control imagination drift effectively.