AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
44 papers today · 8-hour update frequency · 7 days of history
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
NLP
Large Language Models
Reinforcement Learning
- Multi-Answer RL allows language models to generate multiple plausible answers with confidence estimates in a single pass.
- The approach addresses the issue of entropy collapse seen in traditional RL training for LMs.
- Empirical results show substantial improvements in answer diversity, coverage, and calibration scores.
- Models trained with Multi-Answer RL are more token-efficient and accurate, particularly in coding tasks.
Summary
This paper addresses a limitation of traditional reinforcement learning (RL) for language models (LMs): standard training drives models toward a single dominant answer, which is inadequate for real-world tasks that require multiple plausible answers or uncertainty estimates. The authors propose a novel method called Multi-Answer RL, which enables LMs to generate a distribution of answers in a single forward pass, thus internalizing the inference-time search process. This approach modifies the RL objective to encourage the generation of diverse candidate answers while providing calibrated confidence estimates for each. The authors empirically validate their method across various benchmarks, including question-answering, medical diagnostics, and coding tasks, demonstrating significant improvements in answer diversity, coverage, and calibration compared to traditional single-answer training methods. The results indicate that Multi-Answer RL is a more efficient and principled alternative to existing inference-time scaling techniques, requiring fewer tokens to generate multiple answers and achieving higher accuracy in coding tasks.
Methodology
The authors introduce Multi-Answer RL, which modifies the traditional RL training objective to optimize for generating a distribution of plausible answers directly. This involves training models to reason over multiple hypotheses and verbalize structured sets of candidate answers, supported by a reward function inspired by proper scoring rules to encourage calibrated distributions.
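The proper-scoring-rule idea behind the reward can be illustrated with a toy Brier-style score over a verbalized answer set. This is a hypothetical stand-in, not the paper's actual objective; the `multi_answer_reward` function and its inputs are invented for illustration:

```python
def multi_answer_reward(candidates, truth):
    """Brier-style reward for a set of (answer, confidence) pairs.

    Each candidate's confidence is scored against whether the answer
    matches the ground truth, so calibrated, diverse answer sets that
    cover the truth score higher than overconfident single guesses.
    """
    score = 0.0
    for answer, conf in candidates:
        outcome = 1.0 if answer == truth else 0.0
        score -= (conf - outcome) ** 2  # negative Brier component
    return score / len(candidates)

# A calibrated set that covers the truth beats an overconfident miss.
good = multi_answer_reward([("42", 0.7), ("41", 0.3)], "42")
bad = multi_answer_reward([("41", 1.0)], "42")
```

Because the Brier score is a proper scoring rule, the reward is maximized only when the verbalized confidences match the true answer probabilities, which is the calibration property the paper targets.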
Results
The empirical validation shows that Multi-Answer RL significantly enhances performance across various benchmarks, achieving over 50% improvement in top-1 accuracy on coding tasks while reducing token usage by more than half. The models also demonstrate better-calibrated uncertainty scores at the set level.
Implications
This research has significant implications for applications requiring nuanced reasoning and uncertainty estimation, such as medical diagnosis, ambiguous question answering, and any domain where multiple correct answers exist. It positions Multi-Answer RL as a viable method for improving the reliability and efficiency of language models in high-stakes environments.
Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT): a metamodel for 3D atmospheric flow in urban environments
Theory
Efficient ML
Graph Learning
- AB-SWIFT is the first transformer-based neural operator specifically designed for local-scale atmospheric flow modeling.
- The model is trained on a new dataset that includes various urban geometries and atmospheric stratifications.
- AB-SWIFT achieves superior accuracy compared to existing transformer and graph neural network models.
- The model's architecture allows for flexible representation of terrain topology and atmospheric conditions.
Summary
The paper introduces the Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT), a novel transformer-based metamodel designed to simulate 3D atmospheric flow in urban environments. Traditional Computational Fluid Dynamics (CFD) methods are often computationally expensive and slow, particularly when high mesh refinement is required for complex urban geometries. AB-SWIFT addresses these challenges by utilizing a branched internal structure that accommodates varying urban layouts and atmospheric conditions. The model is trained on a unique dataset comprising atmospheric simulations across randomized urban geometries and different atmospheric stratifications. The authors demonstrate that AB-SWIFT outperforms existing state-of-the-art transformer and graph-based models in terms of accuracy for predicting atmospheric flow fields. This work represents a significant advancement in the application of deep learning for local-scale atmospheric modeling, providing a faster and more efficient alternative to traditional CFD methods.
Methodology
AB-SWIFT employs a transformer architecture with a branched structure tailored for atmospheric flow simulations. It processes vertical meteorological profiles as inputs, allowing it to model various atmospheric stratifications effectively. The model follows an encode-process-decode scheme, where the encoder transforms physical inputs into latent tokens, the processor operates within the latent space, and the decoder maps the tokens back to physical quantities.
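The encode-process-decode scheme can be sketched generically. The shapes, the single-attention-step processor, and the three-component output below are all toy assumptions, not AB-SWIFT's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encode-process-decode pipeline: the encoder lifts physical point
# features to latent tokens, the processor mixes tokens with one
# self-attention step, and the decoder maps back to physical quantities.
d_in, d_lat, n_tok = 4, 8, 16
W_enc = rng.normal(size=(d_in, d_lat))
W_dec = rng.normal(size=(d_lat, 3))  # e.g. three velocity components

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def model(x):                                 # x: (n_tok, d_in) inputs
    z = x @ W_enc                             # encode -> latent tokens
    attn = softmax(z @ z.T / np.sqrt(d_lat))  # process: attention step
    z = attn @ z
    return z @ W_dec                          # decode -> flow quantities

out = model(rng.normal(size=(n_tok, d_in)))
```

The key property this scheme buys is that the processor never touches physical coordinates directly, so the same latent machinery can serve differently meshed urban geometries.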
Results
The AB-SWIFT model demonstrated the best accuracy in predicting atmospheric flow fields when compared to state-of-the-art transformers and graph-based models. The results indicate that the model can effectively handle the complexities of urban geometries and varying atmospheric conditions, achieving significant improvements over previous methods.
Implications
The development of AB-SWIFT has potential applications in urban planning, pollutant dispersion modeling, and wind farm optimization, providing a faster and more efficient tool for researchers and practitioners in environmental science and engineering.
Neural Network Conversion of Machine Learning Pipelines
Theory
Efficient ML
Optimization
- Explores the conversion of traditional ML pipelines into neural networks using a student-teacher learning approach.
- Focuses on transferring knowledge from random forest classifiers to neural networks.
- Demonstrates that student NNs can match the performance of teacher models with proper hyper-parameter tuning.
- Investigates the use of random forests for hyper-parameter selection in neural networks.
Summary
This paper explores the conversion of traditional machine learning pipelines into neural network (NN) architectures through a student-teacher learning framework. The authors focus on transferring knowledge from a non-neural-based machine learning pipeline, specifically a random forest classifier, to a neural network student. The goal is to achieve a unified inference engine that can optimize various components of the pipeline for multiple machine learning tasks. The study involves experimenting with different NN topologies on 100 OpenML tasks where random forests have previously performed well. The results indicate that with appropriate hyper-parameter selection, the student NN can effectively mimic the performance of the teacher model. Additionally, the authors investigate using random forests to assist in selecting the optimal hyper-parameters for the NN, thereby enhancing the conversion process. The paper highlights the potential benefits of this approach, including improved generalization performance and the ability to adapt to dynamic environments.
Methodology
The authors employed a student-teacher learning framework to transfer knowledge from a random forest classifier (teacher) to a neural network (student). They experimented with various NN architectures across 100 OpenML tasks, focusing on hyper-parameter tuning and the use of random forests to assist in this selection process.
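The soft-target form of student-teacher transfer can be sketched with stand-in models. The closed-form "teacher" below plays the role of the random-forest classifier's `predict_proba`, and the least-squares student stands in for the neural network; none of this is the paper's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

def teacher_proba(X):
    # Stand-in for the teacher's class-1 probabilities (rf.predict_proba).
    logits = X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5])
    return 1.0 / (1.0 + np.exp(-logits))

soft = teacher_proba(X)                       # soft targets, not hard labels

# Linear student with a bias column, fit to the teacher's soft outputs.
Xb = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(Xb, soft, rcond=None)

# Fraction of points where student and teacher agree on the decision.
agreement = float(((Xb @ w > 0.5) == (soft > 0.5)).mean())
```

Regressing on soft probabilities rather than hard labels is the standard distillation trick: the targets carry the teacher's confidence gradations, which is what lets the student mimic its decision boundary.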
Results
The experiments revealed that the student neural networks could effectively replicate the performance of the random forest classifiers on the majority of tasks when the right hyper-parameters were chosen. This indicates that the conversion approach is viable and can lead to successful knowledge transfer from traditional ML models to neural networks.
Implications
The findings suggest that converting traditional machine learning pipelines into neural networks can lead to enhanced performance and adaptability in dynamic environments. This approach may simplify the optimization of complex systems and improve the deployment of machine learning solutions across various tasks.
A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study
Time Series
Interpretability
- Attention mechanisms can effectively enhance interpretability in clinical predictive models.
- Black-box interpreters like KernelSHAP and LIME are not suitable for time-series clinical prediction tasks.
- Many interpretability methods lack reliability and trustworthiness.
- The study provides guidelines for improving interpretability in clinical settings.
Summary
This paper addresses the critical need for interpretability in deep clinical predictive models, particularly in time-series data, where clinical decisions require explicit justification. The authors present a comprehensive benchmark that evaluates various interpretability methods across diverse clinical prediction tasks and model architectures. They investigate whether architectural features, such as attention mechanisms, enhance explainability and whether interpretability methods generalize across different clinical tasks. The study reveals that attention mechanisms, when used correctly, significantly improve interpretability. However, black-box interpreters like KernelSHAP and LIME are found to be computationally infeasible for time-series tasks, and several interpretability approaches are deemed unreliable. The authors provide guidelines for enhancing interpretability in clinical predictive pipelines and offer their implementations through the open-source framework PyHealth, promoting reproducibility and extensibility in future research.
Methodology
The authors developed an interpretability benchmark that evaluates various methods against two main criteria: scalability across patient events and populations, and faithfulness to downstream predictions. They compared attention-based models with non-attention-based models and assessed the performance of different interpretability approaches across multiple clinical prediction tasks.
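The attention-as-interpretation idea the benchmark evaluates can be sketched minimally: a softmax over per-visit scores both pools the sequence and ranks which time steps drove the prediction. The query vector and shapes below are illustrative assumptions, not any model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

visits = rng.normal(size=(6, 8))   # 6 clinical time steps, 8 features each
query = rng.normal(size=8)         # learned query vector (stand-in)

weights = softmax(visits @ query)  # one importance weight per time step
pooled = weights @ visits          # attention-pooled patient representation
top_visit = int(weights.argmax())  # time step the prediction leans on most
```

The faithfulness question the benchmark asks is precisely whether `weights` here actually track the downstream prediction, rather than being an incidental byproduct of pooling.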
Results
The analysis demonstrated that attention mechanisms, when properly utilized, are efficient for model interpretation. It was found that black-box interpreters are computationally infeasible for time-series tasks, and several interpretability methods were identified as unreliable. The study emphasizes the need for robust interpretability frameworks in clinical predictive modeling.
Implications
The findings underscore the importance of model interpretability in clinical settings, which is essential for building trust in AI systems used for decision-making. The guidelines provided can help improve the integration of interpretable models in clinical workflows, ultimately enhancing patient care and compliance with regulatory standards.
CVA: Context-aware Video-text Alignment for Video Temporal Grounding
Computer Vision
Multimodal
- Introduction of Query-aware Context Diversification (QCD) to enhance data augmentation.
- Development of Context-invariant Boundary Discrimination (CBD) loss for improved semantic consistency.
- Design of Context-enhanced Transformer Encoder (CTE) for effective multi-scale temporal context modeling.
- Achieved state-of-the-art performance on VTG benchmarks, notably improving Recall@1 scores.
Summary
The paper presents a novel framework called Context-aware Video-text Alignment (CVA) aimed at improving video temporal grounding (VTG) by achieving robust video-text alignment that is sensitive to temporal dynamics while minimizing the influence of irrelevant background context. The authors introduce three key components: (1) Query-aware Context Diversification (QCD), a data augmentation strategy that ensures only semantically unrelated content is mixed in to prevent false negatives; (2) Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss function that maintains semantic consistency at challenging temporal boundaries; and (3) Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that utilizes windowed self-attention and bidirectional cross-attention to capture multi-scale temporal context. The synergy of these components allows CVA to achieve state-of-the-art performance on major VTG benchmarks, significantly improving Recall@1 scores by approximately 5 points compared to existing methods. This advancement addresses the critical challenge of aligning text queries with dynamic video content, enhancing the effectiveness of video moment retrieval and highlight detection tasks.
Methodology
The CVA framework integrates three innovative components: QCD for data augmentation that selectively mixes semantically unrelated video clips, CBD loss to enforce consistency at temporal boundaries, and CTE, a hierarchical transformer architecture that combines local and global attention mechanisms to capture temporal context effectively.
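A generic InfoNCE-style contrastive loss conveys the shape of the CBD objective: the clip at a temporal boundary is pulled toward the text query while background context clips are pushed away. The exact CBD formulation is not shown here; everything below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(query, positive, negatives, tau=0.1):
    # Standard InfoNCE: cross-entropy of the positive against all candidates.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(query, positive)] +
                      [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

q = rng.normal(size=16)                         # text-query embedding
pos = q + 0.05 * rng.normal(size=16)            # well-aligned boundary clip
negs = [rng.normal(size=16) for _ in range(8)]  # background context clips
loss_aligned = info_nce(q, pos, negs)           # near zero when aligned
```

The "context-invariant" refinement the paper adds is, roughly, that this loss must stay low for the boundary clip regardless of which background context surrounds it.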
Results
CVA demonstrated significant improvements in performance on major VTG benchmarks, achieving a notable increase of approximately 5 points in Recall@1 scores over existing state-of-the-art methods, indicating enhanced robustness in video-text alignment.
Implications
The proposed CVA framework has potential applications in enhancing video content retrieval systems, improving user experience in video platforms, and advancing research in multimodal learning by providing a more effective method for aligning video and text data.
Spatiotemporal System Forecasting with Irregular Time Steps via Masked Autoencoder
Time Series
- Introduction of the Physics-Spatiotemporal Masked Autoencoder (P-STMAE) for forecasting irregular time series.
- Elimination of data imputation while preserving the physical integrity of dynamical systems.
- Significant improvements in prediction accuracy and computational efficiency over traditional methods.
- Demonstrated applicability in real-world scenarios, including ocean temperature forecasting.
Summary
This paper addresses the challenges of predicting high-dimensional dynamical systems with irregular time steps, which often arise from missing data or sparse observations. The authors propose a novel method called the Physics-Spatiotemporal Masked Autoencoder (P-STMAE), which integrates convolutional autoencoders for spatial feature extraction with masked autoencoders optimized for irregular time series. This approach leverages attention mechanisms to reconstruct the entire physical sequence in a single prediction pass, eliminating the need for data imputation while maintaining the physical integrity of the system. The model is evaluated on both simulated datasets and real-world ocean temperature data, demonstrating significant improvements in prediction accuracy, robustness to nonlinearities, and computational efficiency compared to traditional convolutional and recurrent network methods. The findings suggest that P-STMAE can effectively capture complex spatiotemporal patterns without requiring domain-specific knowledge, making it applicable in various fields such as climate modeling, fluid dynamics, ocean forecasting, and environmental monitoring.
Methodology
The proposed method combines convolutional autoencoders for spatial feature extraction with masked autoencoders optimized for irregular time series. It employs attention mechanisms to reconstruct physical sequences directly from irregular observations, avoiding preprocessing steps like data imputation.
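Why attention makes imputation unnecessary can be seen in one dimension: a missing time step can query the observed steps directly by temporal proximity, with no regular grid required. The kernel-attention form below is a toy stand-in, not P-STMAE itself:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

t_obs = np.array([0.0, 0.7, 2.4, 3.1])  # irregular observation times
x_obs = np.sin(t_obs)                   # observed values at those times

def reconstruct(t_query, scale=0.5):
    # Attention weights fall off with squared temporal distance, so the
    # estimate leans on the nearest observations without any imputation.
    w = softmax(-((t_query - t_obs) / scale) ** 2)
    return float(w @ x_obs)

est = reconstruct(2.0)                  # estimate at an unobserved step
```

In the full model the attention operates over learned latent features rather than raw timestamps, but the mechanism is the same: weights are computed only over what was actually observed.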
Results
The P-STMAE model showed significant enhancements in prediction accuracy and robustness against nonlinearities when tested on multiple datasets, including real-world ocean temperature data. It outperformed traditional convolutional and recurrent network approaches in terms of computational efficiency.
Implications
The findings suggest that P-STMAE can be utilized in various applications such as climate modeling, fluid dynamics, ocean forecasting, and environmental monitoring, providing a powerful tool for researchers and practitioners dealing with irregular time series data.
Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models
Time Series
Interpretability
- Developed a transparent, leakage-aware forecasting workflow for PM2.5 prediction.
- Compared three operational time-series models: SARIMAX, Facebook Prophet, and NeuralProphet.
- Demonstrated that lightweight models can achieve competitive accuracy and efficiency.
- Found that online residual correction significantly improves model robustness.
Summary
This study addresses the challenge of accurate short-term PM2.5 forecasting for urban air quality, particularly in Beijing, China. It critiques the reliance on complex, data-intensive models and investigates whether simpler, interpretable forecasting methods can achieve competitive performance. The authors developed a leakage-aware forecasting workflow that integrates chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling under a Perfect Prognosis setting. They evaluated three forecasting families—SARIMAX, Facebook Prophet, and NeuralProphet—under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction. Results indicated that Facebook Prophet outperformed the others in terms of predictive accuracy and computational efficiency, achieving a Mean Absolute Error (MAE) of 37.61 and a Root Mean Square Error (RMSE) of 50.10 during walk-forward refitting. In the frozen-model regime, corrected SARIMAX yielded the lowest overall error (MAE 32.50; RMSE 46.85). The study concludes that lightweight forecasting strategies can effectively balance accuracy, interpretability, and computational efficiency, making them suitable for real-world applications in urban air quality management.
Methodology
The authors implemented a leakage-aware forecasting workflow that included chronological data partitioning, preprocessing, feature selection, and modeling of exogenous drivers. They conducted a systematic benchmark of the three forecasting models under two deployment strategies: walk-forward refitting and frozen forecasting with online residual correction, using a rolling evaluation design to assess performance over time.
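The frozen-model-plus-online-residual-correction regime can be sketched with a deliberately naive forecaster on synthetic data (the real study uses SARIMAX, Prophet, and NeuralProphet on PM2.5 series; everything below is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
# Trending series the frozen model systematically under-predicts.
series = 0.5 * np.arange(200) + rng.normal(0.0, 0.1, size=200)

def frozen_forecast(history):
    return history[-1]  # naive frozen model: repeat the last value

errs_raw, errs_corr, resid = [], [], 0.0
for t in range(100, 200):
    pred = frozen_forecast(series[:t])
    errs_raw.append(abs(series[t] - pred))
    # Online correction: add the previous step's observed residual.
    errs_corr.append(abs(series[t] - (pred + resid)))
    resid = series[t] - pred  # update the running residual estimate

mae_raw = float(np.mean(errs_raw))
mae_corr = float(np.mean(errs_corr))
```

Because the frozen model's bias is persistent, yesterday's residual is a good predictor of today's error, which is the mechanism behind the robustness gain the study reports.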
Results
Facebook Prophet achieved the best performance with an MAE of 37.61 and RMSE of 50.10 under walk-forward refitting. In the frozen-model regime, corrected SARIMAX provided the lowest error metrics (MAE 32.50; RMSE 46.85). NeuralProphet was less accurate and stable across both regimes, and residual correction did not enhance its forecasts. Notably, corrected Facebook Prophet reduced runtime significantly while maintaining accuracy.
Implications
The findings suggest that lightweight and interpretable forecasting models can be effectively utilized for urban air quality management, providing timely and accurate PM2.5 predictions that are crucial for public health interventions and policy planning.
Light Cones For Vision: Simple Causal Priors For Visual Hierarchy
Computer Vision
- Introduction of Worldline Slot Attention for modeling visual hierarchies.
- Demonstration that Euclidean geometry fails to capture hierarchical relationships.
- Lorentzian geometry significantly outperforms hyperbolic embeddings in hierarchical object discovery.
- Worldline binding allows for multi-scale information aggregation across hierarchy levels.
Summary
This paper addresses a fundamental limitation in standard vision models, which treat objects as independent points in Euclidean space and fail to capture hierarchical structures like parts within wholes. The authors introduce 'Worldline Slot Attention', a novel architecture that models objects as persistent trajectories through spacetime, known as worldlines. Each object is represented with multiple slots at different hierarchy levels, sharing the same spatial position but differing in temporal coordinates. The study demonstrates that without geometric structure, specifically Lorentzian geometry, the model performs poorly, achieving an accuracy of only 0.078 in Euclidean space, which is below random chance. In contrast, using Lorentzian worldlines, the model achieves significant improvements in accuracy (0.479–0.661) across three datasets, indicating that visual hierarchies require causal structures rather than tree structures. The findings emphasize the necessity of geometric structure encoding asymmetric causality for hierarchical object discovery, achieved with a lightweight model of only 11K parameters.
Methodology
The authors propose Worldline Slot Attention, which operates in (d+1)-dimensional Lorentzian spacetime. The architecture employs worldline binding, allowing slots at different hierarchy levels to share spatial positions while occupying different temporal coordinates. This enables the model to aggregate information across abstraction levels. The methodology includes the use of Lorentzian geometry to define causal relationships and a scale-adaptive attention mechanism to enhance feature aggregation.
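The basic geometric ingredient, the Minkowski interval and its light-cone ordering, is simple to state (the slot-attention machinery itself is not shown here, and the example points are invented):

```python
def interval(x, y):
    """Minkowski interval with signature (-,+,...,+); points are (t, *spatial)."""
    dt = y[0] - x[0]
    ds2 = sum((yi - xi) ** 2 for xi, yi in zip(x[1:], y[1:]))
    return -dt ** 2 + ds2

def precedes(x, y):
    """True iff y lies in x's future light cone: timelike or lightlike
    separation (non-positive interval) and a later time coordinate."""
    return interval(x, y) <= 0 and y[0] > x[0]

part = (0.0, 1.0, 2.0)   # lower-level slot: (t, spatial coordinates)
whole = (2.0, 1.5, 2.5)  # later-time slot at a nearby spatial position
```

The asymmetry is the point: `precedes(part, whole)` holds but `precedes(whole, part)` does not, giving exactly the directed part-to-whole relation that symmetric Euclidean distances cannot encode.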
Results
The model achieves a dramatic improvement in accuracy when using Lorentzian worldlines, with results ranging from 0.479 to 0.661 across three datasets, compared to 0.078 in Euclidean space. This performance was consistent across over 20 independent runs, validating the effectiveness of the proposed geometric structure.
Implications
The findings suggest that incorporating causal geometric structures into visual models can enhance the understanding of hierarchical relationships in visual data. This could lead to advancements in object-centric learning and improve the performance of models in tasks requiring hierarchical reasoning.
Experiential Reflective Learning for Self-Improving LLM Agents
Large Language Models
NLP
Reinforcement Learning
- ERL enables LLM agents to adapt to new environments by reflecting on past experiences.
- The framework generates reusable heuristics that improve task execution without requiring parameter updates.
- ERL outperforms existing experiential learning methods, achieving a 7.8% increase in success rate on the Gaia2 benchmark.
- Heuristic retrieval is critical for enhancing performance and reliability in task completion.
Summary
This paper introduces Experiential Reflective Learning (ERL), a framework designed to enhance the adaptability of large language model (LLM) agents in specialized environments. Traditional LLM agents often approach new tasks without leveraging past experiences, leading to inefficiencies. ERL addresses this by enabling agents to reflect on their task trajectories and outcomes to generate reusable heuristics. These heuristics capture actionable lessons that can be applied across different tasks. During execution, relevant heuristics are retrieved based on the current task and integrated into the agent's context to guide its actions. The authors demonstrate that ERL significantly improves the success rate and reliability of task completion on the Gaia2 benchmark, outperforming existing methods such as ReAct, ExpeL, and AutoGuide. Systematic ablation studies reveal the importance of selective heuristic retrieval and the advantages of heuristic-based guidance over traditional few-shot prompting methods. Overall, ERL showcases a novel approach to self-improvement in LLM agents, emphasizing the value of experiential learning.
Methodology
The ERL framework consists of two main components: heuristic generation and retrieval-augmented execution. Heuristic generation involves analyzing task outcomes to create structured guidelines that capture successful strategies and failure modes. During task execution, the agent retrieves relevant heuristics from a persistent pool based on the current task's description, ensuring that only the most pertinent advice is provided.
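The retrieval step can be sketched with a toy heuristic pool and a word-overlap score. The pool contents and the scoring rule are invented for illustration; ERL's actual retrieval mechanism is not specified here:

```python
# A persistent pool: task keyword -> heuristic distilled from past runs.
pool = {
    "booking": "confirm dates and prices before paying",
    "search": "prefer primary sources over forum posts",
    "email": "draft first, then re-read before sending",
}

def retrieve(task, k=1):
    """Return the k heuristics whose key/text best overlap the task words."""
    words = set(task.lower().split())
    def score(item):
        key, text = item
        return len(words & (set(text.lower().split()) | {key}))
    ranked = sorted(pool.items(), key=score, reverse=True)
    return [text for _, text in ranked[:k]]

hint = retrieve("search the web for primary sources on grokking")[0]
```

The selectivity matters: the ablations in the paper indicate that injecting only the most pertinent heuristics, rather than the whole pool, is what preserves the performance gain.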
Results
ERL achieved an overall success rate of 56.1% on the Gaia2 benchmark, marking a 7.8% improvement over the ReAct baseline. The framework also demonstrated enhanced task completion reliability and outperformed previous experiential learning methods, such as ExpeL and AutoGuide.
Implications
The findings suggest that integrating experiential learning into LLM agents can significantly enhance their adaptability and performance in diverse environments. This approach could be applied in various domains requiring autonomous decision-making and problem-solving, such as robotics, customer service, and complex system management.
Flow matching on homogeneous spaces
Generative Models
Theory
Efficient ML
- Introduces a framework for flow matching on homogeneous spaces by lifting to Lie groups.
- Eliminates the need for complex geometric computations like geodesics.
- Reformulates flow matching as a Euclidean task on Lie algebras, enhancing computational efficiency.
- Demonstrates the framework's applicability through case studies on specific homogeneous spaces.
Summary
This paper presents a novel framework for extending Flow Matching to homogeneous spaces, specifically quotients of Lie groups. The proposed approach reformulates the flow matching problem as one on the underlying Lie group by lifting data distributions, which simplifies the complex geometry of homogeneous spaces. By working directly on Lie groups, the method reduces the problem to a Euclidean flow matching task on Lie algebras, eliminating the need for premetrics or geodesics, thus providing a simpler and faster intrinsic framework. The author revisits flow matching on Lie groups and introduces a new formulation that enhances conceptual clarity and implementation ease. The framework is validated through two case studies: SL(2, R)/SO(2, R) and SO(3, R)/SO(2, R), demonstrating its effectiveness in handling generative modeling tasks on these manifolds.
Methodology
The methodology involves reformulating the flow matching problem on homogeneous spaces as a flow matching task on the corresponding Lie group. This is achieved by lifting the data distributions to the Lie group and then projecting the results back to the quotient space. The approach utilizes a neural network to learn the vector field that pushes the noise distribution forward to the target distribution, employing a loss function that simplifies the training process.
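Once lifted to the Lie algebra, the objective reduces to ordinary Euclidean conditional flow matching, which fits in a few lines. The three-dimensional algebra and the constant-velocity "model" below are toy assumptions used only to exhibit the regression target:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 3))        # noise samples in the Lie algebra
x1 = rng.normal(size=(64, 3)) + 2.0  # lifted target samples
t = rng.uniform(size=(64, 1))        # interpolation times

x_t = (1 - t) * x0 + t * x1          # straight-line path in the algebra
v_target = x1 - x0                   # flow-matching regression target

# Crude stand-in "model": a single constant velocity fit in closed form.
v_hat = v_target.mean(axis=0)
loss_fit = float(np.mean((v_hat - v_target) ** 2))
loss_zero = float(np.mean(v_target ** 2))  # doing nothing is worse
```

The simplification the paper emphasizes is visible here: the target `x1 - x0` requires no geodesics or premetrics, because all geometry has been pushed into the lift to the group and the projection back to the quotient.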
Results
The proposed framework successfully demonstrates flow matching on two specific homogeneous spaces, SL(2, R)/SO(2, R) and SO(3, R)/SO(2, R). The results indicate that the method is computationally efficient and conceptually clearer than previous approaches, allowing for effective generative modeling without the complexities associated with Riemannian geometry.
Implications
The implications of this work extend to various applications in generative modeling, particularly in scenarios involving complex geometries. The framework could facilitate advancements in machine learning tasks that require efficient handling of data distributions on manifolds, potentially impacting fields such as computer vision and robotics.
Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML
Efficient ML
Time Series
Audio & Speech
- HYPERTINYPW compresses neural networks by generating most pointwise (PW) convolution weights at load time, significantly reducing memory usage.
- The method retains the first PW layer in INT8 format to stabilize performance for morphology-sensitive tasks.
- Achieves a 6.31× reduction in model size while maintaining over 95% accuracy on benchmark ECG datasets.
- Compatible with standard integer operations, ensuring easy integration into existing TinyML frameworks.
Summary
The paper introduces HYPERTINYPW, a novel approach for compressing neural networks specifically designed for deployment on microcontrollers (MCUs) with limited memory. Traditional pointwise (PW) convolutions consume significant memory even after quantization, posing challenges for TinyML applications. HYPERTINYPW addresses this by replacing most stored PW weights with generated weights using a shared micro-MLP that synthesizes PW kernels from compact per-layer codes at load time. This method preserves the first PW layer in INT8 format to maintain performance in morphology-sensitive tasks. The approach not only reduces memory usage significantly—achieving a 6.31× compression rate while retaining over 95% of the macro-F1 score on benchmark ECG datasets—but also ensures compatibility with standard integer operations, thus facilitating deployment on existing TinyML stacks. The paper provides a comprehensive evaluation of the method, including packed-byte accounting, deployment analysis, and latency/energy profiling, demonstrating its effectiveness across various ECG tasks.
Methodology
HYPERTINYPW employs a compression-as-generation strategy where a shared micro-MLP synthesizes pointwise convolution weights from compact per-layer codes at load time. The first PW layer is kept in INT8 format, and the synthesized weights are cached for reuse during inference, which is performed using standard integer operations to ensure compatibility with existing systems.
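The compression-as-generation idea can be sketched with a shared micro-MLP expanding short per-layer codes into PW kernels. All sizes below are toy assumptions (the paper's 6.31× figure is not reproduced here); the point is only that the shared generator amortizes across layers:

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, d_code, d_hid, n_layers = 64, 64, 8, 16, 40

# One micro-MLP shared by every layer; only it and the codes are stored.
W1 = rng.normal(size=(d_code, d_hid)) * 0.1
W2 = rng.normal(size=(d_hid, c_in * c_out)) * 0.1

def generate_pw(code):
    h = np.maximum(code @ W1, 0.0)        # ReLU hidden activation
    return (h @ W2).reshape(c_out, c_in)  # synthesized PW kernel

codes = rng.normal(size=(n_layers, d_code))  # stored per-layer codes
kernels = [generate_pw(c) for c in codes]    # generated once at load time

stored = W1.size + W2.size + codes.size  # parameters actually shipped
dense = n_layers * c_in * c_out          # parameters they replace
ratio = dense / stored                   # roughly 2.5x at these toy sizes
```

Because `W2` is shared, the savings grow with the number of layers generated; in the paper this is combined with keeping the first PW layer in stored INT8 form and caching the synthesized weights for inference.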
Results
The proposed method was validated on three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), achieving a model size of approximately 225 kB, which is 6.31× smaller than a baseline model of 1.4 MB while retaining at least 95% of the macro-F1 score. Under tighter memory budgets of 32–64 kB, HYPERTINYPW maintained balanced detection performance where traditional compact models failed.
Implications
HYPERTINYPW has significant implications for deploying deep learning models in resource-constrained environments, particularly in healthcare applications involving biosignal analytics. The method's ability to reduce memory usage while maintaining performance opens avenues for real-time on-device inference, enhancing privacy and reliability in sensitive applications.
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
NLP
Large Language Models
Optimization
- Analysis of estimator tradeoffs in OPD reveals biases and variance characteristics.
- Identification of three failure modes in sampled-token OPD: imbalanced signals, unreliable guidance, and tokenizer mismatches.
- Introduction of teacher top-K local support matching as a solution to improve OPD.
- Empirical results show enhanced stability and performance in math reasoning tasks.
Summary
This paper revisits on-policy distillation (OPD) for large language models (LLMs), particularly focusing on its application in long-horizon reasoning tasks. The authors identify that the common implementation of OPD, which relies on sampled-token comparisons, suffers from several failure modes that lead to instability in training. They analyze the estimator tradeoffs between token-level and sequence-level objectives, revealing that while token-level OPD is biased, it has a tighter worst-case variance bound. The authors empirically demonstrate three main failure modes: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions from tokenizer mismatches. To address these issues, they propose a new method called teacher top-K local support matching, which utilizes truncated reverse-KL with top-p rollout sampling and special-token masking. This approach aims to provide more stable optimization and improved performance in downstream tasks compared to the traditional sampled-token OPD.
Methodology
The authors conducted a theoretical analysis of the estimator tradeoffs in OPD, comparing token-level and sequence-level objectives. They identified failure modes through empirical studies and proposed a new method, teacher top-K local support matching, which involves truncated reverse-KL with top-p rollout sampling and special-token masking to enhance the stability of the training process.
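The truncated reverse-KL idea in the methodology can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact objective: the renormalization of both distributions over the teacher's top-K support is an assumption, and special-token masking and top-p rollout sampling are omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def topk_reverse_kl(student_logits, teacher_logits, k=5):
    """Reverse KL restricted to the teacher's top-k token support.

    Both distributions are renormalized over the support set, so
    probability mass outside the teacher's top-k is ignored.
    """
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    support = np.argsort(t)[-k:]           # teacher top-k indices
    t_loc = t[support] / t[support].sum()  # renormalize over support
    s_loc = s[support] / s[support].sum()
    # reverse KL: KL(student || teacher) over the local support
    return float(np.sum(s_loc * (np.log(s_loc) - np.log(t_loc))))

rng = np.random.default_rng(0)
teacher = rng.normal(size=32)
student = rng.normal(size=32)
loss = topk_reverse_kl(student, teacher, k=8)
zero = topk_reverse_kl(teacher, teacher, k=8)
```

Restricting the loss to the teacher's local support is what makes the signal denser than a one-sampled-token comparison while keeping the gradient away from tokens the teacher considers implausible.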
Results
The proposed teacher top-K local support matching method demonstrated more stable optimization behavior and improved downstream performance in both single-task math reasoning and multi-task agentic-plus-math training compared to the traditional sampled-token OPD.
Implications
The findings suggest that improving the stability of on-policy distillation methods can lead to better performance in long-horizon reasoning tasks, which is crucial for the effective deployment of large language models in real-world applications.
A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
Optimization
Theory
- Depth requires stabilization for effective grokking.
- Architectural differences between models are largely confounded by optimization and regularization.
- Activation function performance is dependent on the regularization regime.
- Weight decay is a critical parameter for enabling grokking within a narrow range.
Read more
A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
Summary
This paper investigates the phenomenon of grokking, which refers to the delayed transition from memorization to generalization in neural networks, particularly in the context of modular addition tasks. The authors conduct a controlled empirical study to disentangle the effects of depth, architecture, activation functions, and regularization on grokking dynamics. The study reveals that grokking is influenced more by the interactions between optimization stability and regularization than by architectural differences. Key findings include: (1) depth has a non-monotonic effect on grokking, with depth-4 MLPs failing to generalize while depth-8 residual networks succeed, indicating the need for architectural stabilization; (2) the performance gap between Transformers and MLPs diminishes under matched hyperparameters, suggesting that previous differences were confounded by optimization and regularization; (3) the effectiveness of activation functions like GELU versus ReLU is dependent on the regularization regime; and (4) weight decay is identified as a crucial control parameter, with a specific range necessary for successful grokking. Overall, the results challenge architecture-centric views and highlight the importance of optimization and regularization in understanding delayed generalization.
Methodology
The authors performed a systematic empirical study using modular addition tasks (mod 97) to isolate and analyze the effects of various factors such as model depth, architecture, activation functions, and regularization techniques. They employed matched and carefully tuned training regimes across different model configurations to ensure controlled comparisons.
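The modular addition task used in the study is simple to construct; the sketch below builds the full mod-97 dataset and a random train/test split. The 0.5 train fraction is an assumption for illustration, not necessarily the fraction used in the paper.

```python
import numpy as np

P = 97  # modulus of the modular addition task

# Full dataset of (a, b) -> (a + b) mod P pairs
pairs = np.array([(a, b) for a in range(P) for b in range(P)])
labels = (pairs[:, 0] + pairs[:, 1]) % P

# Grokking studies train on a fixed fraction of all pairs and test
# on the held-out remainder; the fraction itself is a key knob.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
split = int(0.5 * len(pairs))
train_idx, test_idx = idx[:split], idx[split:]
```

Because the dataset is exhaustively enumerable (97² = 9409 pairs), train accuracy saturates quickly while test accuracy can lag by many epochs, which is exactly the delayed-generalization window the study measures.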
Results
The study found that depth-4 MLPs consistently failed to grok, while depth-8 residual networks successfully achieved generalization. The gap in performance between Transformers and MLPs was significantly reduced under matched hyperparameters. Activation function advantages were observed to be regime-dependent, and weight decay was identified as the dominant control parameter for grokking, with a specific 'Goldilocks' range necessary for optimal performance.
Implications
The findings provide insights into the mechanisms behind grokking, suggesting that optimization and regularization strategies can be tuned to enhance generalization in neural networks. This has practical implications for model training, particularly in scenarios where delayed generalization is observed.
Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning
Multimodal
Theory
Efficient ML
- Introduces a unified framework for fault-tolerant multimodal representation learning.
- Develops a dual-regularization mechanism to balance sensitivity for anomaly detection and correction.
- Demonstrates improved performance on multimodal datasets compared to existing methods.
- Provides a theoretical analysis of perturbation effects on neural network sensitivity.
Read more
Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning
Summary
This paper addresses the challenges of maintaining reliability in multimodal systems under conditions of partial sensor failures, signal degradation, or inconsistencies across modalities. The authors propose a novel framework for fault-tolerant multimodal representation learning that integrates self-supervised anomaly detection with error correction. The framework is grounded in a theoretical analysis of perturbation propagation, leading to the development of Lipschitz- and Jacobian-based criteria to assess how neural operators respond to localized faults. A two-stage self-supervised training scheme is introduced, which first pre-trains a multimodal convolutional autoencoder on clean data to capture localized anomaly signals in the latent space, followed by the addition of a learnable compute block for correction and contrastive objectives for anomaly identification. The paper also introduces layer-specific Lipschitz modulation and gradient clipping to manage sensitivity across detection and correction modules. Experimental evaluations on multimodal fault datasets demonstrate that the proposed approach significantly enhances both anomaly detection accuracy and reconstruction performance under sensor corruption, effectively bridging the gap between analytical robustness and practical fault-tolerant multimodal learning.
Methodology
The methodology involves a two-stage self-supervised training scheme where a multimodal convolutional autoencoder is first pre-trained on clean data. This is followed by the integration of a learnable compute block that employs dense layers for correction and contrastive objectives for identifying anomalies. The approach utilizes Lipschitz modulation and gradient clipping to control sensitivity across different modules.
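For a linear layer, the Lipschitz constant is the spectral norm of its weight matrix, so layer-specific Lipschitz modulation can be sketched as per-layer spectral rescaling. The per-layer targets below (a looser bound for a sensitive detection layer, a tighter one for a smooth correction layer) are illustrative assumptions, not the paper's values.

```python
import numpy as np

def spectral_norm(W, iters=100):
    """Estimate the largest singular value by power iteration."""
    v = np.random.default_rng(0).normal(size=W.shape[1])
    for _ in range(iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def modulate_lipschitz(W, target):
    """Rescale a weight matrix so its Lipschitz constant (the
    spectral norm, for a linear layer) does not exceed `target`."""
    s = spectral_norm(W)
    return W if s <= target else W * (target / s)

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))
# Assumed per-layer targets: detection stays sensitive to faults,
# correction stays contractive so perturbations are damped.
W_detect = modulate_lipschitz(W, target=5.0)
W_correct = modulate_lipschitz(W, target=0.9)
s_correct = spectral_norm(W_correct)
```

A contractive correction path (norm below 1) guarantees that a bounded input fault cannot be amplified as it propagates, which is the dual-regularization intuition: high sensitivity where anomalies must be detected, damping where they must be repaired.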
Results
The experimental results indicate that the proposed framework outperforms existing fragmented methodologies in terms of both anomaly detection accuracy and reconstruction quality under sensor corruption, validating the effectiveness of the dual-regularization mechanism.
Implications
The findings suggest that this framework can enhance the reliability of multimodal systems in industrial and safety-critical environments, potentially reducing operational inefficiencies and costs associated with unplanned downtimes.
Vision Hopfield Memory Networks
Computer Vision
Multimodal
Interpretability
- V-HMN integrates hierarchical memory mechanisms for improved data efficiency and interpretability.
- The model employs local and global Hopfield modules for associative memory dynamics.
- Iterative refinement updates enhance error correction and representation learning.
- V-HMN demonstrates competitive performance on computer vision benchmarks.
Read more
Vision Hopfield Memory Networks
Summary
The paper introduces the Vision Hopfield Memory Network (V-HMN), a novel brain-inspired foundation model designed to enhance the efficiency and interpretability of vision tasks. Unlike traditional architectures that rely heavily on extensive training data and lack biological plausibility, V-HMN integrates hierarchical memory mechanisms with iterative refinement updates. It employs local and global Hopfield modules to facilitate associative memory dynamics at both the image patch level and the contextual level. This design allows for improved data efficiency by reusing stored patterns and enhances interpretability through memory retrieval that clarifies the relationship between inputs and stored patterns. The authors conducted extensive experiments on public computer vision benchmarks, demonstrating that V-HMN achieves competitive performance compared to existing backbone architectures while offering better interpretability and data efficiency. The findings suggest that V-HMN could serve as a next-generation model for vision tasks and provide a framework for multimodal applications, bridging the gap between brain-inspired computation and large-scale machine learning.
Methodology
The V-HMN architecture incorporates local and global memory paths, utilizing Hopfield-style retrieval for local pattern denoising and global contextual modulation. It employs a predictive-coding-inspired refinement rule for iterative error correction, allowing the network to gradually adjust representations toward memory-predicted patterns.
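The Hopfield-style retrieval used for local pattern denoising can be sketched with the modern (softmax) Hopfield update, in which a query is repeatedly replaced by an attention-weighted mixture of stored patterns. The pattern dimensions and inverse temperature below are illustrative assumptions.

```python
import numpy as np

def hopfield_retrieve(query, memory, beta=8.0, steps=3):
    """Modern Hopfield update: the query converges toward the
    stored pattern it most resembles."""
    q = query.copy()
    for _ in range(steps):
        scores = beta * memory @ q            # similarity to each pattern
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        q = weights @ memory                  # retrieve blended pattern
    return q

rng = np.random.default_rng(0)
memory = rng.choice([-1.0, 1.0], size=(10, 64))   # stored patterns
clean = memory[3]
noisy = clean.copy()
noisy[:8] *= -1                                   # corrupt 8 of 64 bits
restored = hopfield_retrieve(noisy, memory)
err_before = float(np.mean(np.sign(noisy) != np.sign(clean)))
err_after = float(np.mean(np.sign(restored) != np.sign(clean)))
```

The retrieval step doubles as interpretability: the softmax weights directly expose which stored patterns the network used to explain the input.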
Results
V-HMN achieved competitive results on various public computer vision benchmarks, outperforming traditional architectures in terms of interpretability and data efficiency. The model's hierarchical memory structure allowed for effective memory retrieval and improved learning from limited data.
Implications
The V-HMN model has the potential to advance the field of computer vision by providing a more interpretable and data-efficient framework. Its design could also be adapted for multimodal applications, enhancing the integration of vision, text, and audio processing.
Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization
Reinforcement Learning
Large Language Models
Optimization
- Development of a Transformer-GNN architecture for offline RL that improves throughput by 2.4%.
- LLMs require significant task-specific adaptation, with fine-tuning necessary for performance matching historical baselines.
- Iterative preference optimization simulates manager feedback, enabling LLMs to learn and adapt effectively.
- The framework allows for future integration of real manager feedback, enhancing human-AI collaboration.
Read more
Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization
Summary
This paper explores machine learning techniques for optimizing staffing decisions in semi-automated warehouse sortation systems. The authors evaluate two primary approaches: offline reinforcement learning (RL) using a custom Transformer-based Graph Neural Network (GNN) and large language models (LLMs) operating on abstracted state descriptions. The offline RL approach, trained on detailed historical data, achieved a 2.4% improvement in throughput compared to historical baselines, demonstrating its effectiveness in high-volume operations. In contrast, the LLMs, which were fine-tuned with supervised learning and Direct Preference Optimization, showed that while prompting alone was inadequate, the combination of fine-tuning and preference optimization allowed them to match or slightly exceed historical performance in a crafted simulator. The study emphasizes the importance of supporting human decision-makers by integrating AI systems that can adapt to their preferences and operational knowledge. The findings suggest that both methods can enhance operational decision-making in warehouses, with offline RL excelling in task-specific scenarios and LLMs providing human-readable insights.
Methodology
The authors employed a two-pronged approach: (1) Offline reinforcement learning using a Transformer-GNN architecture to model detailed state representations and optimize staffing decisions, and (2) Large language models fine-tuned with supervised learning and Direct Preference Optimization to operate on abstracted, human-readable state descriptions. The effectiveness of these methods was evaluated through simulations that mimicked real-world warehouse operations.
Results
The offline RL approach achieved a 2.4% throughput improvement over historical decision-making baselines in learned simulators. The LLMs, when fine-tuned and optimized for preferences, matched or slightly exceeded historical performance, demonstrating the potential for AI systems to assist human managers effectively.
Implications
The findings suggest that integrating machine learning techniques into warehouse staffing can lead to significant operational efficiencies. The ability to adapt AI systems to human preferences and operational knowledge can enhance decision-making processes, potentially leading to cost savings and improved throughput in logistics operations.
Grokking as a Falsifiable Finite-Size Transition
Theory
- Introduces a structured finite-size scaling approach to analyze grokking in neural networks.
- Defines the group order p of Zp as an extensive variable and spectral head–tail contrast as an order parameter.
- Demonstrates that grokking exhibits transition-like finite-size organization, challenging smooth-crossover interpretations.
- Establishes a diagnostic chain that allows for falsifiable claims regarding grokking.
Read more
Grokking as a Falsifiable Finite-Size Transition
Summary
This paper investigates the phenomenon of grokking, characterized by the delayed onset of generalization in neural networks after initial memorization, through the lens of finite-size scaling (FSS). The authors argue that previous descriptions of grokking as a phase transition lacked falsifiable finite-size inputs. They introduce the group order p of Zp as an extensive variable and a spectral head–tail contrast (HTC) as an order parameter to apply a condensed-matter-style diagnostic chain. Through Binder-like crossing analyses and susceptibility comparisons, the study reveals a shared finite-size boundary and strongly disfavors a smooth-crossover interpretation of grokking. The findings suggest that grokking can be quantitatively tested as a finite-size claim, although the order of the transition remains unresolved. The paper emphasizes the need for a structured approach to understanding grokking, moving beyond mere analogy to a rigorous diagnostic framework.
Methodology
The authors applied finite-size scaling techniques from statistical mechanics to the study of grokking in neural networks. They identified the group order p of Zp as the size variable and the spectral head–tail contrast (HTC) as the order parameter. They utilized Binder crossings and susceptibility comparisons to analyze the behavior of the system across different sizes, aiming to establish a diagnostic protocol that could confirm or reject the phase transition hypothesis.
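The Binder cumulant at the heart of the crossing analysis is a standard statistical-mechanics diagnostic and can be computed directly from samples of an order parameter. The sketch below uses synthetic "ordered" and "disordered" samples to show the two limiting values the crossings interpolate between; it is not the paper's HTC data.

```python
import numpy as np

def binder_cumulant(m):
    """Binder cumulant U = 1 - <m^4> / (3 <m^2>^2) of an
    order-parameter sample; crossings of U versus the control
    parameter across system sizes locate a finite-size transition."""
    m = np.asarray(m, dtype=float)
    m2 = np.mean(m ** 2)
    m4 = np.mean(m ** 4)
    return 1.0 - m4 / (3.0 * m2 ** 2)

rng = np.random.default_rng(0)
# Ordered phase: order parameter pinned at +/-1, so U -> 2/3
ordered = rng.choice([-1.0, 1.0], size=100_000)
# Disordered phase: Gaussian order parameter, so U -> 0
disordered = rng.normal(size=100_000)
u_ord = binder_cumulant(ordered)
u_dis = binder_cumulant(disordered)
```

Because U is dimensionless, curves for different sizes p should intersect near the transition point if one exists; the absence of a common crossing would have favored the smooth-crossover reading the paper disfavors.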
Results
The analysis revealed a shared finite-size boundary through Binder-like crossings, and the susceptibility comparison provided strong evidence against a smooth-crossover interpretation of grokking, with a significant ∆AIC value of 16.8. This supports the notion that grokking can be framed as a quantitative finite-size transition, although the specific order of the transition remains to be determined.
Implications
The findings have implications for understanding the dynamics of learning in neural networks, particularly in the context of generalization and memorization. By framing grokking within a rigorous finite-size scaling framework, this research opens avenues for further exploration of phase transitions in machine learning, potentially leading to improved training strategies and architectures.
A Unified Memory Perspective for Probabilistic Trustworthy AI
Theory
Efficient ML
- Introduces a unified probabilistic memory abstraction for analyzing deterministic and stochastic operations.
- Identifies a scaling mismatch between compute throughput, memory bandwidth, and entropy generation.
- Examines architectural trade-offs between conventional von Neumann systems and emerging probabilistic compute-in-memory approaches.
- Outlines evaluation criteria for memory systems to support probabilistic computation effectively.
Read more
A Unified Memory Perspective for Probabilistic Trustworthy AI
Summary
This paper addresses the growing need for trustworthy AI systems that leverage probabilistic computation to enhance robustness, interpretability, security, and privacy. The authors propose a unified perspective on data access that treats deterministic access as a special case of stochastic sampling, allowing both to be analyzed within a common framework. This perspective reveals that increased stochastic demands can lead to reduced data-access efficiency and potentially drive systems into entropy-limited operation. The authors introduce memory-level evaluation criteria, including unified operation, distribution programmability, efficiency, robustness to hardware non-idealities, and parallel compatibility. They analyze the limitations of conventional architectures and explore emerging probabilistic compute-in-memory (CIM) approaches that integrate sampling with memory access, suggesting pathways for developing scalable hardware for trustworthy AI applications.
Methodology
The authors utilize a theoretical framework to analyze the interaction between probabilistic computation and memory access. They define memory-level evaluation criteria and conduct a comparative analysis of conventional architectures and emerging probabilistic compute-in-memory systems.
Results
The analysis reveals that increasing stochastic demands can lead to inefficiencies in data access and performance bottlenecks due to entropy limitations. The proposed unified memory abstraction facilitates a better understanding of these challenges and highlights the need for hardware that can efficiently support both deterministic and stochastic operations.
Implications
The findings suggest significant implications for the design of future AI hardware, particularly in high-stakes applications such as medical decision-making and autonomous systems, where reliable operation and uncertainty quantification are critical. The proposed memory-level evaluation criteria can guide the development of more efficient and robust hardware architectures for trustworthy AI.
Hessian-informed machine learning interatomic potential towards bridging theory and experiments
Theory
Efficient ML
Optimization
- Introduction of Hi-MLIP for capturing local curvature of PES.
- Development of HINT protocol to reduce Hessian label requirements.
- Significant improvements in transition-state search and Gibbs free energy predictions.
- Accurate treatment of anharmonic hydrides, matching experimental results.
Read more
Hessian-informed machine learning interatomic potential towards bridging theory and experiments
Summary
This paper introduces a novel Hessian-informed Machine Learning Interatomic Potential (Hi-MLIP) designed to accurately capture the local curvature of potential energy surfaces (PES), which is essential for predicting thermodynamic and kinetic properties of complex molecular systems. The authors develop a training protocol called Hessian INformed Training (HINT), which significantly reduces the computational burden associated with obtaining Hessian labels by two to four orders of magnitude. HINT incorporates several strategies, including Hessian pre-training, configuration sampling, curriculum learning, and stochastic projection Hessian loss, to enhance data and training efficiency. The Hi-MLIP framework demonstrates improved performance in transition-state searches and Gibbs free energy calculations, achieving results close to chemical accuracy, particularly in data-scarce scenarios. Additionally, it effectively addresses the treatment of strongly anharmonic hydrides, accurately reproducing phonon renormalization and superconducting critical temperatures in alignment with experimental data. This work establishes a practical approach to integrating Hessian supervision into machine learning interatomic potentials, bridging the gap between computational simulations and experimental observations across various systems.
Methodology
The methodology involves the HINT protocol, which includes Hessian pre-training on low-fidelity datasets, weighted-local density sampling for configuration selection, a heterogeneous labeling approach for fine-tuning, and a stochastic projection loss to efficiently compute Hessian information without constructing full Hessian matrices.
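The stochastic projection idea, matching Hessians only along random probe directions via Hessian-vector products rather than building full Hessians, can be sketched on toy quadratic energies where the Hessian is known. The finite-difference HVP and the specific probe scheme below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hvp(grad_fn, x, v, eps=1e-5):
    """Hessian-vector product via central finite differences of the
    gradient, avoiding construction of the full Hessian."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def stochastic_projection_loss(grad_model, grad_ref, x, n_probes=8, seed=0):
    """Match Hessians along random unit probes:
    mean over v of || H_model v - H_ref v ||^2."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        v = rng.normal(size=x.shape)
        v /= np.linalg.norm(v)
        diff = hvp(grad_model, x, v) - hvp(grad_ref, x, v)
        total += float(diff @ diff)
    return total / n_probes

# Toy quadratic energies E(x) = 0.5 x^T A x, whose gradient is A x
# and whose Hessian is A
A_ref = np.diag([1.0, 2.0, 3.0])
A_model = np.diag([1.0, 2.0, 3.0])
A_bad = np.diag([1.0, 2.0, 9.0])
x0 = np.ones(3)
match = stochastic_projection_loss(lambda x: A_model @ x,
                                   lambda x: A_ref @ x, x0)
mismatch = stochastic_projection_loss(lambda x: A_bad @ x,
                                      lambda x: A_ref @ x, x0)
```

Each probe costs roughly one extra gradient evaluation, which is the mechanism behind the large reduction in Hessian-labeling cost: curvature is supervised in expectation without ever forming an O(N²) matrix.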
Results
The Hi-MLIP framework achieved substantial improvements in computational efficiency, requiring significantly fewer Hessian labels for model convergence. It demonstrated high accuracy in predicting transition states and Gibbs free energy, particularly in scenarios with limited data, and successfully modeled the behavior of strongly anharmonic hydrides, aligning closely with experimental findings.
Implications
The findings suggest that the Hi-MLIP framework can enhance the predictive capabilities of machine learning models in computational chemistry, facilitating more accurate simulations of molecular systems and bridging the gap between theoretical predictions and experimental results. This could have significant implications for materials science, drug discovery, and catalysis.
Can an Actor-Critic Optimization Framework Improve Analog Design Optimization?
Optimization
- Introduces ACOF, integrating actor-critic methodology into analog design optimization.
- Enhances search efficiency by separating the proposal and evaluation roles, improving interpretability.
- Achieves significant performance improvements over existing optimization techniques.
- Maintains compatibility with standard simulation-based design flows.
Read more
Can an Actor-Critic Optimization Framework Improve Analog Design Optimization?
Summary
This paper introduces an Actor-Critic Optimization Framework (ACOF) aimed at enhancing the process of analog design optimization, which is often hindered by the expensive simulation cycles required for even minor adjustments in device parameters. Traditional optimization methods lack the nuanced judgment that experienced designers apply when navigating the vast design space. ACOF addresses this by separating the roles of proposal and evaluation: the actor proposes promising regions of the design space, while the critic reviews these proposals, ensuring they adhere to design constraints and redirecting the search when necessary. This structured approach maintains compatibility with existing simulation workflows while improving the stability, interpretability, and efficiency of the search process. The framework was tested on various circuits, yielding an average improvement of 38.9% in the top-10 figure of merit compared to the best existing methods, alongside a 24.7% reduction in regret, with peak improvements reaching 70.5% in figure of merit and 42.2% in regret for specific circuits. ACOF thus offers a more transparent and effective method for automated analog sizing in complex design environments.
Methodology
The ACOF framework operates in a closed-loop process where the actor proposes candidate search regions, the critic audits these proposals for legality and effectiveness, and a Bayesian optimization method evaluates the performance of designs within the approved regions. This iterative process allows for continuous refinement of search parameters based on feedback from previous rounds.
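The closed loop described above can be sketched on a one-dimensional toy objective. Everything here is a stand-in: the objective replaces a SPICE simulation, the actor is a simple heuristic rather than a learned proposer, and the inner evaluator uses random sampling where the paper uses Bayesian optimization.

```python
import random

def objective(x):
    """Toy figure of merit to maximize (stand-in for a circuit
    simulation); peak at x = 0.3."""
    return -(x - 0.3) ** 2

def actor_propose(history):
    """Actor: propose a search region centered on the best design
    seen so far (heuristic stand-in)."""
    if not history:
        return (0.0, 1.0)
    best_x, _ = max(history, key=lambda h: h[1])
    return (best_x - 0.2, best_x + 0.2)

def critic_audit(region, lo=0.0, hi=1.0):
    """Critic: clip proposals that violate design constraints."""
    return (max(region[0], lo), min(region[1], hi))

random.seed(0)
history = []
for _ in range(10):                       # closed-loop rounds
    region = critic_audit(actor_propose(history))
    for _ in range(5):                    # inner evaluation budget
        x = random.uniform(*region)
        history.append((x, objective(x)))
best_x, best_f = max(history, key=lambda h: h[1])
```

The division of labor is the point: the actor never evaluates and the critic never proposes, so an illegal region is rejected before any simulation budget is spent on it.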
Results
The implementation of ACOF resulted in an average improvement of 38.9% in the top-10 figure of merit across test circuits, with a 24.7% reduction in regret. Individual circuits showed peak gains of 70.5% in figure of merit and 42.2% lower regret, demonstrating the framework's effectiveness in optimizing analog designs.
Implications
The findings suggest that integrating actor-critic frameworks into analog design optimization can significantly streamline the design process, making it more efficient and interpretable. This approach could lead to advancements in automated design tools, reducing the reliance on extensive designer expertise and simulation cycles.
Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation
Generative Models
Large Language Models
NLP
- Introduces a zero-shot, knowledge-guided framework for synthetic psychiatric data generation.
- Utilizes Retrieval-Augmented Generation to ground LLM responses in clinical knowledge.
- Demonstrates competitive performance against state-of-the-art generative models while preserving privacy.
- Shows that clinical retrieval enhances the fidelity of generated data.
Read more
Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation
Summary
This paper addresses the challenge of limited access to real patient data in psychiatric research, which hampers the development of AI systems in healthcare. The authors propose a zero-shot, knowledge-guided framework for generating synthetic psychiatric tabular data using large language models (LLMs) enhanced by Retrieval-Augmented Generation (RAG). By grounding the model in established clinical knowledge from the DSM-5 and ICD-10, the framework generates privacy-preserving synthetic data without relying on real patient records. The authors benchmark their approach against state-of-the-art models like CTGAN and TVAE, which depend on actual data and pose privacy risks. The evaluation focuses on six anxiety-related disorders, revealing that while CTGAN excels in marginal and multivariate structures, the knowledge-augmented LLM performs competitively in pairwise structures and achieves the lowest pairwise error for specific disorders. An ablation study confirms that clinical retrieval significantly enhances data fidelity. Privacy analyses indicate that the LLM-based approach maintains low linkage risks and modest overlaps, making it a viable alternative to traditional data-dependent methods. Overall, this work presents a novel approach to generating high-quality synthetic psychiatric data while ensuring patient privacy.
Methodology
The authors developed a framework that leverages large language models (LLMs) in a zero-shot manner, utilizing Retrieval-Augmented Generation (RAG) to generate synthetic psychiatric data based on established clinical criteria from the DSM-5 and ICD-10. The approach circumvents the need for real patient data, thus avoiding privacy risks associated with traditional generative models.
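The retrieval-then-generate step can be sketched with a toy lexical retriever over criteria snippets. The snippets below are hypothetical paraphrases standing in for DSM-5 text, and word-overlap scoring stands in for the embedding-based retriever a real RAG pipeline would use; the actual LLM call is omitted.

```python
def retrieve(query, corpus, k=1):
    """Toy lexical retriever: rank documents by word overlap with
    the query (stand-in for an embedding retriever)."""
    q = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical snippets standing in for DSM-5 criteria text.
corpus = {
    "social_anxiety": "marked fear of social situations with scrutiny",
    "separation_anxiety": "excessive distress when separated from attachment figures",
    "panic_disorder": "recurrent unexpected panic attacks and worry",
}

query = "generate a synthetic record for fear of social scrutiny situations"
top = retrieve(query, corpus)
# The retrieved criteria are injected into the generation prompt,
# grounding the zero-shot LLM in clinical knowledge.
prompt = (
    "Using the following diagnostic criteria:\n"
    + corpus[top[0]]
    + "\nGenerate one synthetic patient record as JSON."
)
```

Because the prompt is grounded in published criteria rather than patient records, the generator never sees real data, which is the source of the framework's privacy guarantee.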
Results
The evaluation demonstrated that while CTGAN generally outperformed in terms of marginal and multivariate structures, the knowledge-augmented LLM achieved competitive results in pairwise structures and had the lowest pairwise error for separation anxiety and social anxiety disorders. Privacy analyses indicated that the LLM-based model had low average linkage risk and modest overlaps, comparable to CTGAN, while TVAE showed extensive duplication.
Implications
This research has significant implications for accelerating AI research in mental health by providing a method for generating high-fidelity synthetic data without compromising patient privacy. It opens avenues for further exploration of AI applications in healthcare where real data is scarce or sensitive.
Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Optimization
Large Language Models
- Classical HPO methods outperform LLM-based agents in fixed search spaces.
- LLM agents that edit training code can significantly improve optimization outcomes.
- The hybrid method 'Centaur' combines classical optimization with LLM capabilities, achieving superior results.
- Reliability in optimization methods is more critical than exploration breadth.
Read more
Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Summary
This paper investigates the performance of Large Language Models (LLMs) in hyperparameter optimization (HPO) tasks compared to classical optimization algorithms. Using the autoresearch repository, the authors benchmark nine HPO methods, including four classical algorithms (e.g., CMA-ES and TPE) and four LLM-based methods, under a fixed compute budget. The study finds that classical methods consistently outperform LLM-based agents in a constrained search space. However, an LLM agent that edits training code directly shows improved performance, narrowing the gap with classical methods. The authors introduce 'Centaur,' a hybrid approach that combines CMA-ES with LLMs by sharing the optimizer's internal state, which leads to the best results in their experiments. The findings suggest that while LLMs struggle with optimization state tracking, they can be effectively paired with classical optimizers to enhance performance.
Methodology
The authors benchmarked nine HPO methods, including four classical and four LLM-based approaches, using a fixed 24-hour GPU training budget. They automated the extraction of hyperparameters from training scripts to minimize human bias. The performance of each method was evaluated based on their ability to optimize a small language model's hyperparameters.
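The state-sharing idea behind Centaur can be sketched with a simple elitist evolution strategy whose population receives one externally proposed candidate per generation. This is a heavily simplified stand-in: CMA-ES is replaced by a scalar Gaussian search, and the "LLM" is a fixed heuristic that reads the optimizer's (mean, sigma) state.

```python
import random

def loss(x):
    """Toy validation loss over one hyperparameter (e.g. a log
    learning rate); minimum at x = -3."""
    return (x + 3.0) ** 2

def llm_propose(mean, sigma):
    """Stand-in for an LLM reading the optimizer's internal state
    and proposing a candidate; here a fixed heuristic nudge."""
    return mean - 0.5 * sigma

random.seed(0)
mean, sigma = 0.0, 2.0
for _ in range(30):
    # evolution-strategy step: sample a population around the mean
    pop = [random.gauss(mean, sigma) for _ in range(8)]
    # hybrid step: inject the external proposal into the population
    pop.append(llm_propose(mean, sigma))
    best = min(pop, key=loss)
    if loss(best) < loss(mean):   # elitist update of the search mean
        mean = best
    sigma *= 0.95                 # slowly narrow the search
final_loss = loss(mean)
```

The design choice matches the paper's diagnosis: the classical optimizer keeps the state the LLM is bad at tracking, while the LLM contributes only candidate proposals that the optimizer is free to discard.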
Results
Classical methods like CMA-ES and TPE achieved better performance than LLM-based methods within a fixed search space. The LLM agent that edited training code directly performed competitively, while the hybrid Centaur method outperformed all others; with Centaur, even a 0.8B model surpassed a 27B model in effectiveness.
Implications
The findings suggest that while LLMs have limitations in tracking optimization states, their integration with classical optimization techniques can enhance hyperparameter tuning processes. This hybrid approach could lead to more efficient AutoML systems and improve the performance of machine learning models.
How unconstrained machine-learning models learn physical symmetries
Theory
Graph Learning
Efficient ML
- Unconstrained ML models can learn physical symmetries effectively through data augmentation.
- The paper introduces metrics to measure symmetry content and equivariance in model outputs.
- Analysis of symmetry information flow provides insights into model architecture and training.
- Strategically injecting inductive biases can improve model stability and accuracy.
Read more
How unconstrained machine-learning models learn physical symmetries
Summary
This paper investigates how unconstrained machine-learning (ML) models can learn physical symmetries, a critical aspect of modeling in physics. Traditionally, models are designed with strict constraints to ensure symmetry, which can limit their expressivity and computational efficiency. The authors explore the performance of unconstrained models, specifically transformer-based architectures applied to atomistic simulations and particle physics, demonstrating that these models can achieve high accuracy in approximating equivariant behavior through data augmentation. They introduce rigorous metrics to quantify the symmetry content of learned representations and assess the degree to which outputs meet equivariant conditions. By analyzing the flow of symmetry information across model layers and during training, the authors establish a framework for diagnosing spectral failure modes in ML models. Their findings suggest that incorporating minimal inductive biases can enhance stability and accuracy while maintaining the advantages of unconstrained architectures. This work highlights the potential of unconstrained models in achieving physical fidelity in simulations, paving the way for more efficient and scalable approaches in computational physics.
Methodology
The authors developed rigorous metrics to quantify the symmetry content of learned representations in unconstrained transformer-based models. They applied these metrics to two specific architectures: a graph neural network for atomistic simulations and a PointNet-style model for particle physics. The analysis focused on how symmetry information is processed across architectural layers and evolves during training.
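One of the metrics described, the degree to which outputs satisfy the equivariance condition, can be written down concretely. The sketch below measures a relative equivariance error ||f(Rx) − Rf(x)|| / ||f(x)|| averaged over sampled 2D rotations; the specific normalization is an assumption, not necessarily the paper's definition.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def equivariance_error(f, x, n_group=16):
    """Relative equivariance error averaged over sampled rotations:
    mean of ||f(R x) - R f(x)|| / ||f(x)||."""
    errs = []
    for theta in np.linspace(0, 2 * np.pi, n_group, endpoint=False):
        R = rotation(theta)
        errs.append(np.linalg.norm(f(R @ x) - R @ f(x))
                    / np.linalg.norm(f(x)))
    return float(np.mean(errs))

x = np.array([1.0, 2.0])
# f1 is exactly rotation-equivariant: scaling by the rotation-
# invariant norm commutes with R.
f1 = lambda v: v * np.linalg.norm(v)
# f2 breaks equivariance by adding a fixed offset.
f2 = lambda v: v + np.array([1.0, 0.0])
err_equiv = equivariance_error(f1, x)
err_broken = equivariance_error(f2, x)
```

Applied to an unconstrained model trained with rotation augmentation, this metric quantifies how closely the learned function approaches exact equivariance without it ever being enforced architecturally.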
Results
The study found that unconstrained models can approximate equivariant behavior with high accuracy, often outperforming traditional constrained models. The introduced metrics effectively diagnose the symmetry content and equivariance errors in model predictions, revealing that minimal inductive biases can enhance model performance without sacrificing expressivity.
Implications
This research suggests that unconstrained ML models can be a viable alternative to traditional symmetry-enforcing architectures in physical simulations, potentially leading to faster and more accurate models in various domains of computational physics and materials science.
Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring
Time Series
- Introduces U-Balance, a novel approach for rebalancing imbalanced datasets in CPS safety monitoring.
- Utilizes behavioral uncertainty to enhance label rebalancing without generating synthetic samples.
- Demonstrates a significant correlation between behavioral uncertainty and safety outcomes.
- Achieves a notable improvement in F1 score compared to existing methods.
Summary
This paper addresses the critical issue of safety monitoring in Cyber-Physical Systems (CPS), particularly focusing on Unmanned Aerial Vehicles (UAVs). The authors highlight the challenge posed by extreme class imbalance in safety datasets, where unsafe events are rare compared to safe events. Traditional rebalancing techniques like SMOTE and class weighting are inadequate for time-series CPS telemetry, leading to unrealistic synthetic samples or overfitting. The authors propose a novel approach called U-Balance, which utilizes behavioral uncertainty—defined as the degree of doubt in CPS decisions—to enhance label rebalancing. U-Balance first trains a GatedMLP-based uncertainty predictor to generate uncertainty scores for telemetry windows. It then employs an uncertainty-guided label rebalancing (uLNR) mechanism that probabilistically relabels safe windows with high uncertainty as unsafe, enriching the minority class with informative samples. The effectiveness of U-Balance is evaluated on a large-scale UAV dataset with a 46:1 safe-to-unsafe ratio, demonstrating a significant correlation between behavioral uncertainty and safety outcomes. The results show that U-Balance achieves an F1 score of 0.806, outperforming the strongest baseline by 14.3 percentage points while maintaining competitive inference efficiency. This work is the first to leverage behavioral uncertainty for dataset rebalancing in CPS safety monitoring, offering a new perspective on utilizing uncertainty in machine learning.
Methodology
The methodology involves training a GatedMLP-based uncertainty predictor to summarize telemetry data into distributional features and generate uncertainty scores. The uncertainty-guided label rebalancing (uLNR) mechanism is then applied to relabel safe telemetry windows with high uncertainty as unsafe, effectively enriching the minority class without synthesizing new data. A safety predictor is subsequently trained on this rebalanced dataset.
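The uncertainty-guided relabeling step can be sketched in a few lines. The 0.8 threshold and the use of the raw uncertainty score as the flip probability are illustrative assumptions, not values from the paper:

```python
import numpy as np

def ulnr_relabel(labels, uncertainty, threshold=0.8, rng=None):
    """Probabilistically flip high-uncertainty 'safe' labels (0) to 'unsafe' (1).

    A minimal sketch of uncertainty-guided label rebalancing: safe windows
    whose uncertainty score exceeds `threshold` are relabeled as unsafe with
    probability equal to their uncertainty score. No synthetic samples are
    generated; only existing windows are relabeled.
    """
    rng = rng or np.random.default_rng(0)
    labels = labels.copy()
    candidates = (labels == 0) & (uncertainty > threshold)
    flips = rng.random(labels.shape) < uncertainty
    labels[candidates & flips] = 1
    return labels
```

A safety predictor would then be trained on the returned labels instead of the originals.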
Results
U-Balance was evaluated on a UAV dataset with a 46:1 safe-to-unsafe ratio, achieving an F1 score of 0.806, which is a 14.3 percentage point improvement over the strongest baseline. The study confirmed the effectiveness of the uLNR strategy in exploiting uncertainty information, with ablation studies indicating that both the uncertainty predictor and the uLNR mechanism significantly contribute to the overall performance.
Implications
The findings suggest that incorporating behavioral uncertainty into safety monitoring can significantly enhance the predictive capabilities of machine learning models in CPS. This approach could be applied to various CPS applications, improving safety outcomes and potentially reducing risks associated with unsafe behaviors.
GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation
NLP
Graph Learning
Efficient ML
- GraphER enhances retrieval-augmented generation by capturing multiple forms of proximity beyond semantic similarity.
- The method operates independently of knowledge graphs, allowing for seamless integration with existing vector stores.
- GraphER is retriever-agnostic and introduces negligible latency, making it suitable for production environments.
- Experiments show that GraphER significantly improves retrieval performance on complex queries compared to traditional methods.
Summary
The paper introduces GraphER, a novel method designed to enhance retrieval-augmented generation (RAG) systems by addressing the limitations of existing semantic search techniques. Traditional methods often fail to retrieve relevant information when it is distributed across multiple sources, particularly for complex queries. GraphER enriches data objects during offline indexing and employs graph-based reranking at query time, allowing it to capture various forms of proximity beyond mere semantic similarity. Unlike previous approaches that rely on knowledge graphs, GraphER integrates seamlessly with standard vector stores and is retriever-agnostic, introducing minimal latency overhead. The authors demonstrate the effectiveness of GraphER through experiments on multiple retrieval benchmarks, showing improved retrieval completeness and relevance in comparison to baseline methods.
Methodology
GraphER enriches data objects during offline indexing and performs graph-based reranking of candidate objects at query time. This approach captures relationships among documents, allowing for a more comprehensive retrieval process without the need for a knowledge graph.
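A graph-based reranking pass of this general kind can be sketched as follows. The personalized-PageRank-style diffusion and its parameters are assumptions for illustration, not GraphER's exact formulation:

```python
import numpy as np

def graph_rerank(query_sims, cand_sims, alpha=0.15, iters=20):
    """Rerank candidates by diffusing query relevance over a candidate graph.

    `query_sims`: (n,) initial query-candidate similarities.
    `cand_sims`:  (n, n) pairwise candidate similarities (the enrichment graph).
    Scores are propagated with a personalized-PageRank-style update so that a
    candidate close to other relevant candidates is promoted even if its own
    query similarity is modest.
    """
    # Row-normalize the graph into a transition matrix.
    P = cand_sims / cand_sims.sum(axis=1, keepdims=True)
    s = query_sims / query_sims.sum()
    r = s.copy()
    for _ in range(iters):
        r = alpha * s + (1 - alpha) * P.T @ r
    return np.argsort(-r)  # candidate indices in reranked order
```

The diffusion promotes documents whose neighbors are relevant, which is how information spread across multiple sources can surface even when individual chunks match the query weakly.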
Results
The experimental results indicate that GraphER outperforms baseline retrieval methods in terms of completeness and relevance, particularly for complex queries that require integrating information from multiple sources.
Implications
GraphER has the potential to improve the efficiency and effectiveness of retrieval-augmented generation systems in various applications, particularly in scenarios where information is distributed across multiple sources, such as database querying and information retrieval tasks.
On Neural Scaling Laws for Weather Emulation through Continual Training
Time Series
Theory
Efficient ML
- Adoption of a minimalist Swin Transformer architecture for weather forecasting.
- Continual training with cooldown phases improves model performance and scaling behavior.
- Identification of compute-optimal training regimes through IsoFLOP curves.
- Demonstration that neural scaling laws can guide efficient resource allocation in scientific machine learning.
Summary
This paper investigates neural scaling laws in the context of scientific machine learning, specifically for weather forecasting models. The authors utilize a minimalist Swin Transformer architecture and implement continual training with constant learning rates and cooldown phases to analyze scaling behavior. Their findings indicate that models trained in this manner exhibit predictable scaling trends and can outperform models trained with traditional cosine learning rate schedules. The study explores various model and dataset sizes under different compute budgets to construct IsoFLOP curves, identifying compute-optimal training regimes. The results suggest that neural scaling behavior can serve as a diagnostic tool for efficient resource allocation in weather emulation tasks. The authors also emphasize the importance of understanding scaling behavior in scientific applications, as it can disentangle genuine performance improvements from artifacts introduced by complex architectures. The code for their experiments is open-sourced for reproducibility.
Methodology
The authors employed a Swin Transformer architecture for weather forecasting and implemented continual training with constant learning rates and periodic cooldowns. They systematically varied model and dataset sizes under different compute budgets to analyze scaling behavior and construct IsoFLOP curves.
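Constructing one IsoFLOP curve can be sketched as below, using the standard C ≈ 6ND compute approximation and a toy Chinchilla-style loss in place of real training runs; both are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def isoflop_optimum(flops, model_sizes, loss_fn):
    """Locate the compute-optimal model size on one IsoFLOP curve.

    For a fixed budget `flops`, each model size N is paired with the token
    count D = flops / (6 N) (the standard C ~ 6ND approximation), the loss
    is evaluated, and a parabola in log N is fitted to find the minimum.
    `loss_fn` stands in for an actual training run.
    """
    log_n = np.log(model_sizes)
    losses = np.array([loss_fn(n, flops / (6 * n)) for n in model_sizes])
    a, b, c = np.polyfit(log_n, losses, 2)   # fit loss ~ a x^2 + b x + c
    return float(np.exp(-b / (2 * a)))       # vertex of the parabola

def toy_loss(n, d):
    """Toy Chinchilla-style loss: irreducible term plus power laws in N, D."""
    return 2 + 400 / n**0.3 + 2000 / d**0.3
```

Repeating this for several budgets traces out how the compute-optimal model size and data volume scale, which is the diagnostic use the paper describes.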
Results
The study found that models trained with the proposed methodology followed predictable scaling trends and outperformed standard learning rate schedules. The IsoFLOP curves revealed compute-optimal combinations of model size and data volume, highlighting potential performance limits as model scales increase.
Implications
The findings suggest that understanding neural scaling laws can significantly enhance the efficiency of resource allocation in scientific machine learning, particularly in weather forecasting. This work may influence future research directions in SciML and the development of more efficient training methodologies.
Missing-Aware Multimodal Fusion for Unified Microservice Incident Management
Multimodal
- ARMOR is designed to handle missing modalities in multimodal data for microservice incident management.
- The framework utilizes a modality-specific asymmetric encoder to isolate distribution disparities among different data types.
- A missing-aware gated fusion mechanism is employed to prevent cross-modal interference from incomplete inputs.
- Self-supervised learning is leveraged to optimize anomaly detection, failure triage, and root cause localization without requiring extensive fault labels.
Summary
This paper addresses the challenges of automated incident management in microservice architectures, particularly focusing on the issue of missing modalities in multimodal data. The authors propose ARMOR, a self-supervised framework that effectively manages incomplete data by employing a modality-specific asymmetric encoder and a missing-aware gated fusion mechanism. Unlike existing methods that assume complete data, ARMOR is designed to handle scenarios where network fluctuations or agent failures lead to missing metrics, logs, or traces. The framework optimizes three key tasks: anomaly detection (AD), failure triage (FT), and root cause localization (RCL), with AD and RCL not requiring fault labels, while FT relies on minimal failure-type annotations. The experimental results demonstrate that ARMOR not only achieves state-of-the-art performance under ideal conditions but also maintains robust diagnostic accuracy even when faced with significant modality loss, showcasing its effectiveness in real-world applications.
Methodology
The authors developed ARMOR, which includes a modality-specific asymmetric encoder to address distribution disparities among different data types and a missing-aware gated fusion mechanism that uses learnable placeholders and dynamic bias compensation. The framework employs self-supervised auto-regression with mask-guided reconstruction to optimize the tasks of anomaly detection, failure triage, and root cause localization.
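The missing-aware gated fusion idea can be sketched as follows. The gate parameterization and the fixed damping factor for absent modalities are assumptions for illustration, not ARMOR's exact design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(features, present, placeholders, gate_w, gate_b):
    """Missing-aware gated fusion over modality feature vectors.

    `features`:     dict name -> (d,) encoder output, or None if missing.
    `present`:      dict name -> bool availability flag.
    `placeholders`: dict name -> (d,) learnable placeholder for that modality.
    A per-modality gate (sigmoid of a learned affine map of the feature,
    damped by a fixed 0.1 factor when the modality is absent) scales each
    vector before summation, so an incomplete modality cannot dominate the
    fused representation.
    """
    fused = 0.0
    for name, flag in present.items():
        x = features[name] if flag else placeholders[name]
        gate = sigmoid(gate_w[name] @ x + gate_b[name]) * (1.0 if flag else 0.1)
        fused = fused + gate * x
    return fused
```

The key property is that a missing modality still contributes a learned placeholder, but with a suppressed gate, rather than a static zero vector.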
Results
The experimental results indicate that ARMOR achieves state-of-the-art performance in incident management tasks under complete data conditions and demonstrates robust diagnostic accuracy even with significant modality loss, outperforming existing approaches that rely on static placeholders.
Implications
The proposed framework can significantly enhance the reliability and efficiency of incident management in microservice architectures, making it a valuable tool for site reliability engineers (SREs) in real-world applications where data incompleteness is common.
Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback
Large Language Models
Reinforcement Learning
Theory
- Introduction of a new framework for evaluating LLMs' search capabilities.
- Transformers can theoretically represent and approximate distinct search strategies.
- Current LLMs show limited search capabilities compared to traditional algorithms.
- Targeted training can significantly improve LLM performance in search tasks.
Summary
This paper investigates the capability of Large Language Models (LLMs) to approximate search algorithms within a structured problem-solving framework. The authors introduce a novel setting termed 'unknown tree search with bandit feedback', where the search space is represented as a tree and the exploration and exploitation of this space are evaluated through external feedback. The study demonstrates that Transformers possess the theoretical expressiveness to implement various search strategies and can be trained to approximate these strategies effectively. Empirical results indicate that while current LLMs underperform compared to established search algorithms, targeted training focused on search under uncertainty significantly enhances their performance. The findings suggest that with appropriate training, LLMs can generalize to unseen conditions, thereby unlocking their potential as effective problem-solving agents.
Methodology
The authors developed a simplified framework for 'unknown tree search with bandit feedback', allowing for controlled evaluation of LLM behavior. They conducted theoretical analyses and empirical studies to assess the expressiveness of Transformers and their ability to approximate search strategies. The performance of LLMs was compared against established search algorithms, and targeted training was applied to enhance their effectiveness.
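As a reference point for the search strategies the LLMs are compared against, the classic UCB1 selection rule, the building block of MCTS-style methods, looks like this (a textbook sketch, not code from the paper):

```python
import math

def ucb1_select(counts, values, c=math.sqrt(2)):
    """Pick the child/arm with the highest UCB1 score.

    `counts[i]` is how often arm i was pulled, `values[i]` its mean reward.
    Unvisited arms are chosen first; otherwise the bonus c*sqrt(ln N / n_i)
    trades off exploitation of high means against exploration of
    rarely-tried arms. Tree-search methods like MCTS apply this rule at
    every node.
    """
    total = sum(counts)
    for i, n in enumerate(counts):
        if n == 0:           # always try an untested arm first
            return i
    scores = [v + c * math.sqrt(math.log(total) / n)
              for v, n in zip(values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Approximating this kind of count-dependent exploration bonus in a single forward pass is exactly the capability the paper probes.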
Results
The study found that while LLMs like Qwen3-8B performed comparably to basic search strategies, they lagged behind more sophisticated algorithms such as Monte Carlo Tree Search. However, fine-tuning the LLMs on search trajectories led to significant performance improvements, indicating that with the right training, LLMs can become more effective in search tasks.
Implications
The findings suggest that LLMs can be developed into more capable problem-solving agents through targeted training, potentially expanding their applications in complex decision-making scenarios where search and exploration are critical.
Epistemic Compression: The Case for Deliberate Ignorance in High-Stakes AI
Theory
Efficient ML
- High-capacity models often fail in high-stakes environments due to overfitting noise rather than capturing relevant signals.
- Epistemic Compression promotes model robustness by aligning complexity with data stability, rather than simply increasing parameters.
- The Regime Index effectively distinguishes between environments where simplicity or complexity is advantageous.
- The study found a strong correlation between the Regime Index and the most effective modeling strategies in high-stakes domains.
Summary
The paper addresses the limitations of high-capacity foundation models in high-stakes domains such as medicine and finance, where reliability is crucial. It introduces the concept of 'Epistemic Compression,' which posits that robustness in AI models arises from aligning model complexity with the temporal stability of data rather than merely increasing model parameters. The author argues that in unstable environments, high-capacity models often memorize noise instead of capturing meaningful signals, leading to the 'Fidelity Paradox.' To operationalize this idea, a 'Regime Index' is proposed to differentiate between 'Shifting Regime' (unstable, data-poor) and 'Stable Regime' (invariant, data-rich) environments. The findings from an exploratory synthesis of 15 high-stakes domains indicate that the Regime Index accurately predicts the superior modeling strategy in 86.7% of cases. The paper advocates for a paradigm shift in AI development from scaling complexity to embracing principled parsimony, emphasizing that deliberate ignorance can enhance model performance in dynamic contexts.
Methodology
The paper employs a conceptual framework to define and operationalize Epistemic Compression and the Regime Index. It synthesizes empirical data from 15 high-stakes domains to evaluate the effectiveness of different modeling strategies based on the identified regimes.
Results
The Regime Index was concordant with the empirically superior modeling strategy in 86.7% of the examined high-stakes domains, demonstrating the validity of the proposed framework in guiding model complexity decisions.
Implications
The findings suggest that AI practitioners should reconsider the emphasis on scaling model complexity and instead focus on designing architectures that enforce parsimony. This approach could lead to more reliable AI applications in critical fields such as healthcare and finance, where understanding what to ignore is essential for performance.
Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model
Reinforcement Learning
Large Language Models
Efficient ML
- HIVE framework improves efficiency in RL training of LLMs by selecting high-utility prompts.
- The concept of 'learning edge' is introduced, highlighting the dynamic nature of prompt utility during training.
- HIVE uses up to 9.2 million fewer rollouts while matching or exceeding the accuracy of existing methods.
- The methodology combines historical data with real-time entropy measures to optimize prompt selection.
Summary
This paper addresses the challenge of computational overhead in reinforcement learning (RL) for training large language models (LLMs) in reasoning tasks. The authors propose a novel framework called HIVE (History-Informed and online-VErified prompt selection) that focuses on selecting high-utility prompts before the rollout phase, thereby improving efficiency. The study reveals that the utility of prompts is non-uniform and evolves during training, with the most informative prompts located at the 'learning edge'—a balance of intermediate difficulty and high uncertainty. HIVE operates in two stages: it first uses historical reward trajectories for coarse selection and then employs prompt entropy as a real-time proxy to prune low-utility prompts. The framework is evaluated across multiple math reasoning benchmarks and demonstrates significant improvements in rollout efficiency and training speed without sacrificing performance.
Methodology
The methodology involves a dual-stage framework where the first stage utilizes historical reward trajectories for initial prompt selection, and the second stage employs prompt entropy as a real-time measure to refine selections, ensuring that only the most informative prompts are used in the training process.
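The two-stage selection can be sketched as a simple filter. The pass-rate band, entropy floor, and budget below are illustrative values, not HIVE's tuned ones:

```python
def select_prompts(stats, band=(0.2, 0.8), entropy_floor=0.5, k=256):
    """Two-stage prompt filter in the spirit of learning-edge selection.

    `stats` is a list of dicts with a historical `pass_rate` and a current
    `entropy` per prompt. Stage 1 keeps prompts of intermediate difficulty
    (pass rate inside `band`); stage 2 drops low-entropy survivors and
    ranks the rest by entropy, keeping the top `k` for rollout.
    """
    lo, hi = band
    edge = [s for s in stats if lo <= s["pass_rate"] <= hi]
    edge = [s for s in edge if s["entropy"] >= entropy_floor]
    edge.sort(key=lambda s: -s["entropy"])
    return edge[:k]
```

Prompts that are always solved or never solved carry little gradient signal, so filtering them out before the expensive rollout phase is where the savings come from.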
Results
HIVE was tested on six math reasoning benchmarks and demonstrated up to 3.8× speedup in rollout efficiency and 2.2× faster total training time for certain models. The framework also resulted in a reduction of up to 9.2 million rollouts while maintaining or improving reasoning accuracy compared to existing methods.
Implications
The findings suggest that efficient prompt selection can significantly reduce computational costs in RL training of LLMs, making it feasible to scale training processes while maintaining performance. This has potential applications in various reasoning tasks and could influence future research in optimizing RL algorithms for LLMs.
Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Optimization
Theory
- Introduces the first high-probability regret bound for two-point feedback in OCO with strongly convex losses.
- Achieves a regret bound of O(d(log T + log(1/δ))/µ), improving upon previous O(d²) dependencies.
- Utilizes a novel analytical framework that departs from traditional reduction-based methods.
- Matches the minimax optimal bounds for both time horizon and dimension.
Summary
This paper addresses the challenge of Online Convex Optimization (OCO) with two-point bandit feedback in adversarial settings. The author focuses on minimizing a sequence of adversarially generated convex loss functions while only observing their values at two points. Previous works highlighted the difficulty in achieving tight high-probability regret bounds for strongly convex functions under this feedback type. The paper presents the first high-probability regret bound of O(d(log T + log(1/δ))/µ) for µ-strongly convex losses, resolving an open problem noted by Agarwal et al. (2010). The proposed method improves upon existing approaches by utilizing a novel analytical framework that is robust against the variance of zero-order estimators, significantly enhancing the dependency on the dimension d. This work not only matches the information-theoretic lower bounds for OCO but also achieves minimax optimality concerning both the time horizon T and the dimension d.
Methodology
The author develops a new analytical framework to derive high-probability regret bounds for OCO with two-point feedback. The approach involves constructing approximate gradients using two-point evaluations of the loss functions and applying advanced geometric and probabilistic techniques to achieve concentration results that are robust to variance.
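The two-point gradient estimator at the heart of this setting is standard and can be written down directly; the spherical sampling below is the usual construction, not code from the paper:

```python
import numpy as np

def two_point_grad(f, x, delta=1e-3, rng=None):
    """Two-point zero-order gradient estimate used in bandit OCO.

    Samples a uniform direction u on the unit sphere and returns
    (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u.
    Unlike the one-point version, the difference of two evaluations keeps
    the estimator's variance bounded as delta -> 0, which is the property
    the high-probability analysis exploits.
    """
    rng = rng or np.random.default_rng(0)
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return (d / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
```

Averaged over directions, the estimate recovers the gradient of a smoothed version of f, so plugging it into gradient descent yields the bandit algorithm.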
Results
The main result establishes that, with probability at least 1 − δ, the regret of the proposed algorithm satisfies R_T ≤ O(d(log T + log(1/δ))/µ). This bound is minimax optimal with respect to both the time horizon T and the dimension d, significantly improving upon previous results.
Implications
The findings have significant implications for online learning algorithms, particularly in scenarios where only limited feedback is available. The results can enhance the performance of algorithms in various applications, including online recommendation systems, adaptive control, and real-time decision-making in adversarial environments.
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Multimodal
Large Language Models
Reinforcement Learning
- Intern-S1-Pro is the first one-trillion-parameter scientific multimodal foundation model.
- The model integrates advanced agent capabilities for autonomous scientific workflows.
- It has been trained on over 100 specialized tasks across critical scientific fields.
- A group routing mechanism is introduced to enhance training stability and efficiency.
Summary
Intern-S1-Pro is introduced as the first one-trillion-parameter scientific multimodal foundation model, significantly enhancing capabilities in both general and scientific domains. The model excels in reasoning and image-text understanding, while also integrating advanced agent capabilities for autonomous scientific workflows. It has been trained to master over 100 specialized tasks across various scientific fields, including chemistry and life sciences. The development leverages a robust infrastructure, including XTuner and LMDeploy, to ensure efficient Reinforcement Learning (RL) training and precision consistency. The paper argues that a sufficiently large generalist model can outperform specialized models in scientific tasks when trained jointly, challenging the notion that specialized models are always superior. Key innovations include a group routing mechanism to maintain expert load balance and a co-designed training infrastructure that optimizes efficiency at massive scales. The results demonstrate significant performance gains over previous models, validating the effectiveness of the proposed methodologies.
Methodology
The model employs a three-layer design based on the SAGE framework, integrating general and specialized training tasks. It utilizes a group routing mechanism for expert load balancing and a co-designed infrastructure for efficient training and inference. Reinforcement Learning techniques are applied to enhance model capabilities.
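Group routing for expert load balance can be sketched roughly as follows; the group scoring rule and top-k values are assumptions, and the paper's exact mechanism may differ:

```python
import numpy as np

def group_route(logits, n_groups, topk_groups=2, topk_experts=2):
    """Group-limited expert routing for MoE load balance.

    `logits`: (n_experts,) router scores; experts are split into `n_groups`
    equal groups. First the best `topk_groups` groups are kept (scored by
    their max expert logit), then the top `topk_experts` experts are chosen
    inside that restricted pool. Confining each token to a few groups keeps
    per-group load bounded at trillion-parameter scale.
    """
    n = logits.size
    per = n // n_groups
    groups = logits.reshape(n_groups, per)
    keep = np.argsort(-groups.max(axis=1))[:topk_groups]   # best groups
    mask = np.full(n, -np.inf)
    for g in keep:
        mask[g * per:(g + 1) * per] = 0.0                  # unmask kept groups
    masked = logits + mask
    return np.argsort(-masked)[:topk_experts]              # chosen expert ids
```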
Results
Intern-S1-Pro demonstrates superior performance in scientific tasks compared to specialized models, achieving significant gains in reasoning and multimodal understanding. Training efficiency declines by only 20% despite the model scaling to four times the size of its predecessor.
Implications
The advancements in Intern-S1-Pro could accelerate scientific discovery by providing researchers with a powerful tool for tackling complex problems across various scientific domains. Its capabilities may enhance interdisciplinary research and facilitate the integration of diverse scientific knowledge.
Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning
Reinforcement Learning
Robotics
- Introduces FB-MEBE, an online zero-shot RL algorithm for quadrupedal robots.
- Maximizes entropy of behavior distribution to enhance exploration diversity.
- Integrates a regularization critic to ensure policies are physically plausible.
- Demonstrates improved performance in simulated tasks compared to other strategies.
Summary
This paper presents FB-MEBE, an online zero-shot reinforcement learning (RL) algorithm designed for quadrupedal control in real robotic systems. The authors identify that traditional undirected exploration methods yield low-diversity data, which negatively impacts the performance of policies when deployed on hardware. To address this, FB-MEBE combines an unsupervised behavior exploration strategy with a regularization critic that maximizes the entropy of the behavior distribution, promoting diverse exploration. This approach not only enhances the exploration process but also aligns the learned policies with physically plausible behaviors, enabling their direct deployment on hardware without additional fine-tuning. The authors empirically demonstrate that FB-MEBE outperforms existing exploration strategies across various simulated downstream tasks, showcasing its effectiveness in achieving natural and efficient robotic behaviors. This work represents a significant advancement in zero-shot RL, particularly for legged robots, by eliminating the need for external datasets during training.
Methodology
FB-MEBE employs an unsupervised behavior exploration strategy that maximizes the entropy of the achieved behavior distribution. It utilizes the Forward-Backward (FB) algorithm to guide the agent's exploration and incorporates a regularization critic to favor physically plausible behaviors, ensuring that the policies learned are suitable for real-world applications.
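One standard nonparametric way to turn "maximize the entropy of the behavior distribution" into a trainable signal is a k-nearest-neighbor bonus; whether FB-MEBE uses exactly this estimator is an assumption:

```python
import numpy as np

def knn_entropy_bonus(behaviors, k=3):
    """Particle-based entropy bonus over achieved behavior embeddings.

    For each behavior vector, the intrinsic reward is the log distance to
    its k-th nearest neighbor in the batch: spread-out behaviors (high
    entropy) earn larger bonuses, so maximizing the bonus pushes the agent
    toward diverse exploration of the behavior space.
    """
    diffs = behaviors[:, None, :] - behaviors[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)        # ignore self-distances
    kth = np.sort(dists, axis=1)[:, k - 1]
    return np.log(kth + 1e-8)
```

The regularization critic described above would then trade this diversity bonus off against a penalty for physically implausible motions.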
Results
The empirical results indicate that FB-MEBE significantly improves downstream performance in various simulated tasks compared to other exploration strategies. The learned policies exhibit smooth and natural behaviors, allowing for seamless deployment on real robotic hardware without the need for further fine-tuning.
Implications
The findings suggest that FB-MEBE could facilitate the development of more adaptable and efficient robotic systems capable of performing a wide range of tasks without extensive pre-training on diverse datasets. This has potential applications in robotics, particularly in environments where data collection is challenging.
Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem
Reinforcement Learning
Optimization
- Introduces a novel application of Decision Transformers for the Traveling Salesman Problem.
- Integrates a Pointer Network to effectively handle variable action spaces in node selection.
- Employs expectile regression for improved Return-to-Go predictions, enhancing solution quality.
- Demonstrates that offline RL can surpass traditional heuristics in generating optimal solutions.
Summary
This paper addresses the challenges of solving combinatorial optimization problems, specifically the Traveling Salesman Problem (TSP), using Neural Combinatorial Optimization (NCO). Traditional NCO methods rely on online reinforcement learning (RL), which complicates real-world deployment and underutilizes existing heuristic knowledge. The authors propose a novel approach that leverages the offline RL framework of Decision Transformers (DT) to learn from datasets of heuristic solutions. Their method integrates a Pointer Network to manage the variable action space of node selection and employs expectile regression for optimistic Return-to-Go (RTG) conditioning, crucial for instances with varying optimal values. The experimental results demonstrate that their approach consistently produces higher-quality tours than the classical heuristics it was trained on, showcasing the potential of offline RL to enhance performance in combinatorial optimization tasks.
Methodology
The authors reformulate the TSP as a sequence modeling task using Decision Transformers, where trajectories are modeled with observations, actions, and predicted returns. They incorporate a Pointer Network to adapt the action selection mechanism for the TSP's unique requirements, allowing the model to 'point' to nodes rather than selecting from a fixed action set. Expectile regression is utilized to predict optimistic returns, addressing the variability in optimal solutions across different instances.
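The expectile-regression objective used for optimistic return prediction can be written down directly; tau = 0.9 is an illustrative choice, not necessarily the paper's value:

```python
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss for optimistic return-to-go prediction.

    With tau > 0.5, under-predictions (target above pred) are penalized more
    heavily than over-predictions, so the fitted value drifts toward the
    upper tail of achievable returns -- an optimistic RTG estimate that
    conditions the Decision Transformer on better-than-average outcomes.
    """
    u = target - pred
    weight = np.where(u > 0, tau, 1 - tau)
    return float(np.mean(weight * u**2))
```

At tau = 0.5 this reduces to ordinary mean squared error; raising tau biases the predicted RTG upward, which matters because optimal tour lengths vary across TSP instances.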
Results
The proposed method consistently outperformed four classical heuristics in generating higher-quality tours for the TSP. The integration of the Pointer Network and the use of expectile regression for RTG conditioning significantly contributed to these improvements, indicating that the offline RL framework can effectively leverage existing heuristic knowledge.
Implications
This research suggests that offline RL frameworks like Decision Transformers can be powerful tools for solving complex combinatorial optimization problems, potentially leading to innovative solutions in various industries such as logistics and manufacturing. It highlights the importance of utilizing existing domain knowledge to enhance the performance of machine learning models.
SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning
Generative Models
Graph Learning
Optimization
- SIGMA addresses trajectory divergence in ChemLMs by enforcing latent isotropy through dense trajectory alignment.
- The Structure-Invariant Contrastive Loss maximizes mutual information between equivalent generation paths, decoupling chemical semantics from syntactic variations.
- IsoBeam dynamically prunes redundant search paths during inference, reallocating resources to explore structurally distinct molecular scaffolds.
- Empirical results show that SIGMA outperforms strong baselines in terms of sample efficiency and structural diversity.
Summary
The paper introduces SIGMA, a novel framework designed to address the challenges of trajectory divergence in chemical language models (ChemLMs) that arise from the linearization of molecular graphs into string representations. This divergence leads to manifold fragmentation, where structurally equivalent molecular graphs are misrepresented in latent space due to their different linearization paths. SIGMA employs a token-level contrastive learning approach that aligns the latent representations of prefixes sharing identical suffixes, thereby enforcing geometric consistency without altering the linear representation. Additionally, the authors propose Isomorphic Beam Search (IsoBeam), an inference method that prunes isomorphic redundancies during the generation process, enhancing computational efficiency. Empirical evaluations demonstrate that SIGMA significantly improves sample efficiency and structural diversity in molecular generation tasks compared to existing methods, effectively bridging the gap between sequence scalability and graph fidelity.
Methodology
The methodology involves a token-level contrastive learning framework that aligns latent representations of molecular prefixes sharing identical suffixes. This is complemented by the IsoBeam algorithm, which prunes isomorphic paths during inference to enhance efficiency.
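The suffix-aligned contrastive objective is, at its core, InfoNCE applied to paired prefix representations; a minimal sketch, assuming the pairing of prefixes by shared suffix is done upstream:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over prefix representations, as in suffix-aligned contrast.

    `anchors[i]` and `positives[i]` are latent vectors of two different
    linearization prefixes that share the same suffix (i.e., denote the
    same partial molecule); all other rows in the batch act as negatives.
    Minimizing this loss pulls equivalent prefixes together in latent
    space, counteracting manifold fragmentation.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

The loss is near zero when each anchor matches its own positive and large when equivalent prefixes are mapped to distant points.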
Results
The empirical evaluations indicate that SIGMA significantly improves the sample efficiency and structural diversity of generated molecules compared to traditional ChemLMs, effectively mitigating the issues caused by trajectory divergence.
Implications
The proposed SIGMA framework has the potential to enhance drug discovery processes by improving the generation of novel drug candidates and accurately predicting molecular properties, thus facilitating more efficient exploration of chemical space.
Amplified Patch-Level Differential Privacy for Free via Random Cropping
Computer Vision
Theory
Efficient ML
- Random cropping can amplify differential privacy in machine learning models without additional computational cost.
- A new patch-level neighboring relation is introduced, allowing for a more tailored approach to privacy in vision data.
- The study provides a theoretical framework for understanding the privacy amplification effects of random cropping.
- Empirical results demonstrate improved privacy-utility trade-offs in segmentation tasks using standard architectures.
Read more
Amplified Patch-Level Differential Privacy for Free via Random Cropping
Summary
This paper explores the intersection of random cropping, a common data augmentation technique in computer vision, and differential privacy (DP) in machine learning. The authors propose that random cropping can serve as an implicit mechanism to enhance privacy by probabilistically excluding sensitive content, such as faces or license plates, from model inputs. They introduce a novel patch-level neighboring relation tailored for vision data, allowing for a more refined understanding of privacy risks. The study derives tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping, demonstrating that this additional randomness amplifies privacy without altering the model architecture or training procedures. Empirical validation on segmentation tasks shows that this approach improves the privacy-utility trade-off, suggesting that leveraging existing sources of randomness can yield stronger privacy guarantees in machine learning applications.
Methodology
The authors formalize random cropping as a privacy amplification mechanism and introduce a patch-level neighboring relation specific to vision data. They analyze the resulting privacy guarantees theoretically and validate their findings empirically using segmentation models on relevant datasets.
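Two ingredients of the analysis can be sketched directly: the probability that a given patch survives a uniformly random crop, and the classical amplification-by-subsampling bound. The paper derives tighter, patch-level bounds for DP-SGD; the formulas below are only the standard building blocks, shown for intuition:

```python
import math

def crop_inclusion_prob(H, W, c, i, j):
    """Probability that pixel (i, j) falls inside a uniformly random c x c
    crop of an H x W image, with the crop's top-left corner sampled
    uniformly over all valid positions."""
    total = (H - c + 1) * (W - c + 1)
    rows = min(i, H - c) - max(0, i - c + 1) + 1  # top rows whose crop covers i
    cols = min(j, W - c) - max(0, j - c + 1) + 1  # left cols whose crop covers j
    return (rows * cols) / total

def subsampled_epsilon(eps, q):
    """Classical amplification by subsampling: an eps-DP mechanism applied to
    a q-probability subsample satisfies log(1 + q*(e^eps - 1))-DP."""
    return math.log(1.0 + q * (math.exp(eps) - 1.0))
```

Border patches have lower inclusion probability than central ones, so their effective privacy is amplified more, which is the intuition behind conditioning guarantees on patch location.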
Results
The results indicate that random cropping significantly enhances the privacy-utility trade-off in differentially private training scenarios, with empirical validation showing improved performance across multiple segmentation architectures and datasets.
Implications
This work suggests that existing data augmentation techniques can be leveraged to enhance privacy in machine learning models, potentially leading to more robust privacy-preserving applications in computer vision and beyond.
Social Hippocampus Memory Learning
Federated Learning
- SoHip introduces a memory-centric approach to social machine learning, focusing on memory sharing for collaboration.
- The framework preserves privacy by keeping raw data and local model parameters on-device.
- Theoretical guarantees on convergence and privacy are provided, enhancing the framework's reliability.
- Experimental results show SoHip outperforms existing heterogeneous federated learning methods by up to 8.78% in accuracy.
Read more
Social Hippocampus Memory Learning
Summary
The paper introduces SoHip (Social Hippocampus Memory Learning), a novel framework for social machine learning (SML) that emphasizes collaboration among heterogeneous agents through memory sharing instead of model or data sharing. This approach addresses privacy concerns and computational overhead associated with traditional federated learning (FL) methods, which often require sharing model parameters or intermediate representations. SoHip operates by allowing agents to extract short-term memory from their local models, consolidate it into long-term memory using a hippocampus-inspired mechanism, and then fuse it with collectively aggregated long-term memory to enhance local predictions. The framework ensures that raw data and local models remain on-device, thus preserving privacy. The authors provide theoretical analyses on the convergence and privacy preservation properties of SoHip and validate its effectiveness through experiments on benchmark datasets, demonstrating significant performance improvements over existing methods.
Methodology
SoHip employs a four-step process: (1) agents extract representations to form individual short-term memories, (2) these are consolidated into long-term memories via a hippocampus-inspired module, (3) individual long-term memories are fused with collective long-term memory from a central server, and (4) updated long-term memories are shared with the server for collective aggregation, ensuring only abstracted memory is exchanged.
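Steps (2) and (3) can be sketched as simple vector updates. The paper's hippocampus-inspired module is more elaborate; the exponential-moving-average rule and the fusion weight here are hypothetical stand-ins used only to illustrate the data flow:

```python
def consolidate(long_term, short_term, alpha=0.9):
    """Consolidate short-term into long-term memory, sketched as an
    exponential moving average (alpha is a hypothetical retention rate)."""
    return [alpha * l + (1.0 - alpha) * s for l, s in zip(long_term, short_term)]

def fuse(local_memory, collective_memory, w=0.5):
    """Fuse an agent's long-term memory with the server-aggregated collective
    memory; only these abstracted vectors ever leave the device."""
    return [w * a + (1.0 - w) * b for a, b in zip(local_memory, collective_memory)]
```

Because raw data and model parameters never appear in either update, the exchanged quantities are limited to abstracted memory vectors, which is the privacy argument of the framework.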
Results
The experiments conducted on two benchmark datasets against seven baseline methods demonstrate that SoHip consistently achieves superior performance, with accuracy improvements of up to 8.78%.
Implications
SoHip has potential applications in privacy-sensitive domains such as healthcare and finance, where collaborative learning is essential but data privacy is a major concern. The framework could facilitate more effective and secure federated learning systems.
Causal-INSIGHT: Probing Temporal Models to Extract Causal Structure
Time Series
Interpretability
Graph Learning
- Causal-INSIGHT provides a model-agnostic approach to interpret temporal predictors by analyzing their responses to input clamping.
- The framework constructs directed temporal influence signals to reveal dependencies used by predictors for predictions.
- Qbic, a new graph selection criterion, balances predictive accuracy and structural complexity without needing ground-truth labels.
- Causal-INSIGHT shows competitive structural accuracy and improves temporal delay localization across diverse models.
Read more
Causal-INSIGHT: Probing Temporal Models to Extract Causal Structure
Summary
Causal-INSIGHT is a novel, model-agnostic framework designed for interpreting temporal models by extracting causal structures from multivariate time series data. The framework addresses the challenge of understanding directed temporal interactions in complex dynamical systems, which is crucial for high-stakes applications like healthcare and finance. Instead of inferring causal relationships from the data-generating process, Causal-INSIGHT analyzes the responses of a fixed, pre-trained temporal predictor to systematic input clamping during inference. This approach allows for the construction of directed temporal influence signals that reflect the dependencies leveraged by the predictor for making predictions. A key component of the framework is Qbic, a sparsity-aware graph selection criterion that balances predictive fidelity with structural complexity, enabling the extraction of causal graphs without requiring ground-truth labels. The authors demonstrate that Causal-INSIGHT generalizes across various backbone architectures and significantly enhances temporal delay localization, showcasing its effectiveness through experiments on synthetic, simulated, and realistic benchmarks.
Methodology
Causal-INSIGHT employs a post-hoc interpretability approach that involves probing a pre-trained temporal predictor through input clamping. The responses to these clamped inputs are analyzed to construct influence signals, which are then used to derive directed temporal graphs using the Qbic scoring criterion.
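The clamping probe can be sketched as follows. This is a minimal single-value version; the paper's influence signals are directed and temporal, and graph selection uses the Qbic criterion rather than the raw scores shown here:

```python
def influence_score(predict, x, channel, clamp_value=0.0):
    """Score how much a fixed predictor relies on one input channel by
    clamping that channel and measuring the shift in the prediction."""
    baseline = predict(x)
    clamped = list(x)
    clamped[channel] = clamp_value
    return abs(predict(clamped) - baseline)

def influence_signals(predict, x):
    """One influence score per input channel; selecting edges from these
    signals (via Qbic in the paper) yields a directed dependency structure."""
    return [influence_score(predict, x, ch) for ch in range(len(x))]
```

A channel the predictor ignores scores zero, while a channel it depends on scores proportionally to its effect, so the signals reflect dependencies the model actually uses rather than properties of the data-generating process.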
Results
Experiments reveal that Causal-INSIGHT maintains competitive structural accuracy and significantly improves the localization of temporal delays when applied to existing predictors. The framework demonstrates its versatility across different backbone architectures, confirming its model-agnostic nature.
Implications
The ability to extract causal structures from temporal models enhances interpretability in critical domains, allowing practitioners to understand the temporal dependencies that influence model predictions. This could lead to better decision-making in fields such as healthcare, finance, and climate science, where understanding model behavior is essential.
Local learning for stable backpropagation-free neural network training towards physical learning
Optimization
Efficient ML
Theory
- FFzero enables stable neural network training without backpropagation.
- The framework combines local learning and directional-derivative optimization.
- Demonstrated effectiveness across multilayer perceptrons and convolutional networks.
- Provides a viable approach for in-situ physical learning using simulated photonic networks.
Read more
Local learning for stable backpropagation-free neural network training towards physical learning
Summary
This paper introduces FFzero, a novel forward-only learning framework designed to enable stable neural network training without relying on backpropagation or automatic differentiation. The authors argue that traditional deep learning methods face significant environmental and physical limitations, prompting the need for alternative training paradigms, particularly in the context of physical neural networks. FFzero employs a combination of layer-wise local learning, prototype-based representations, and directional-derivative-based optimization, allowing for effective training through forward evaluations alone. The framework is shown to generalize across various architectures, including multilayer perceptrons and convolutional neural networks, and is validated using a simulated photonic neural network. The results indicate that FFzero can facilitate backpropagation-free in-situ physical learning, addressing the challenges associated with gradient computation in physical systems.
Methodology
The authors developed FFzero, which integrates layer-wise local learning and prototype-based representations with directional-derivative optimization. This approach allows for gradient estimation through forward evaluations, avoiding the need for backpropagation. The framework was tested on multilayer perceptrons and convolutional neural networks, particularly using a simulated photonic neural network to validate its effectiveness.
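The directional-derivative step can be sketched with a finite-difference probe along a random direction, using only two forward evaluations per update. The learning rate, perturbation size, and direction distribution below are illustrative choices, not the paper's:

```python
import random

def forward_only_step(loss, theta, lr=0.05, eps=1e-4, rng=None):
    """One backpropagation-free update: probe the loss along a random
    direction v, estimate the directional derivative by finite differences,
    and step against it."""
    rng = rng or random.Random()
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    base = loss(theta)
    probe = loss([t + eps * vi for t, vi in zip(theta, v)])
    d = (probe - base) / eps  # directional derivative of the loss along v
    return [t - lr * d * vi for t, vi in zip(theta, v)]
```

Because the update needs only forward evaluations of the loss, it can in principle be executed on hardware, such as a photonic network, where gradients cannot be computed analytically.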
Results
FFzero achieved stable training across the tested neural network architectures without any backward pass. The results indicate that layer-wise local learning is particularly effective under forward-only optimization, precisely the regime in which backpropagation-based gradient computation is infeasible.
Implications
The findings suggest that FFzero could pave the way for more sustainable and efficient neural network training methods, particularly in physical systems where traditional digital computing approaches are not feasible. This has potential applications in neuromorphic computing and other areas requiring in-situ learning capabilities.
Physics-Informed Neural Network Digital Twin for Dynamic Tray-Wise Modeling of Distillation Columns under Transient Operating Conditions
Optimization
Theory
Time Series
- Introduction of a PINN framework for modeling distillation columns under transient conditions.
- Integration of thermodynamic constraints into the neural network's loss function for physical consistency.
- Demonstrated superior performance compared to traditional data-driven models.
- Development of a comprehensive transient dataset for training and evaluation.
Read more
Physics-Informed Neural Network Digital Twin for Dynamic Tray-Wise Modeling of Distillation Columns under Transient Operating Conditions
Summary
This paper presents a novel framework that integrates Physics-Informed Neural Networks (PINNs) with digital twin technology for the dynamic, tray-wise modeling of binary distillation columns under transient operating conditions. The proposed model incorporates fundamental thermodynamic constraints, such as vapor-liquid equilibrium (VLE) and tray-level mass and energy balances, directly into the neural network's loss function. This approach ensures that the model adheres to physical laws while maintaining computational efficiency. The framework is trained on a high-fidelity synthetic dataset generated from Aspen HYSYS, which includes 961 timestamped measurements over 8 hours of operation. An adaptive loss-weighting scheme is employed to balance data fidelity and physics consistency during training. The results demonstrate that the PINN model significantly outperforms traditional data-driven methods, achieving a root mean square error (RMSE) of 0.00143 for HX mole fraction prediction, which is a 44.6% improvement over the best baseline model. The digital twin effectively captures the dynamics of the distillation column, including responses to feed fluctuations and pressure transients, establishing its potential for real-time monitoring and control in industrial applications.
Methodology
The methodology involves the development of a PINN architecture that incorporates thermodynamic constraints as residual terms in the loss function. The model is trained using a synthetic dataset generated from Aspen HYSYS, with an adaptive loss-weighting scheme that stages the emphasis on physical constraints versus data fidelity throughout the training process.
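The composite objective can be sketched as a weighted sum of a data-fidelity term and a physics-residual term. The actual residuals (VLE, tray-level mass and energy balances) and the adaptive weighting schedule are specific to the paper; the scalar form below only illustrates the structure:

```python
def pinn_loss(pred, target, residuals, w_data=1.0, w_phys=1.0):
    """Composite PINN objective: mean-squared data misfit plus mean-squared
    physics residuals, with weights w_data and w_phys that the paper's
    adaptive scheme re-balances over the course of training."""
    data_term = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    phys_term = sum(r ** 2 for r in residuals) / len(residuals)
    return w_data * data_term + w_phys * phys_term
```

Driving the residual term toward zero is what keeps the network's predictions consistent with the thermodynamic constraints even between measured data points.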
Results
The proposed PINN model achieved an RMSE of 0.00143 for HX mole fraction prediction, with an R² value of 0.9887, representing a 44.6% reduction in error compared to the best-performing data-only baseline. The model successfully captured the dynamic behavior of the distillation column under various transient conditions.
Implications
The findings suggest that the PINN digital twin framework can serve as a robust tool for real-time monitoring, control, and optimization of distillation processes, potentially reducing energy consumption and improving operational efficiency in industrial applications.
Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Large Language Models
Reinforcement Learning
- Introduces a framework for training LLMs on multi-step tool orchestration using real API responses.
- Develops a graduated reward system that enhances learning by providing feedback on partial correctness.
- Demonstrates substantial improvements in model performance on ComplexFuncBench.
- Confirms the necessity of both atomic validity and orchestration rewards through ablation studies.
Read more
Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
Summary
This paper addresses the challenges of training large language models (LLMs) for multi-step tool orchestration, where models must invoke multiple dependent APIs in the correct order while managing intermediate outputs. The authors identify two main obstacles: the reliance on simple function calls in existing environments and the inadequacy of binary rewards that do not provide feedback on partial correctness. To overcome these issues, the authors propose a novel framework that includes a reinforcement learning (RL) environment utilizing a large-scale cache of real API responses, allowing for efficient data synthesis of valid orchestration traces. Additionally, they introduce a graduated reward system that breaks down correctness into atomic validity and orchestration, providing more granular feedback. The framework is empirically validated on ComplexFuncBench, demonstrating significant improvements in turn accuracy. Ablation studies confirm the necessity of both reward components for optimal performance, highlighting the importance of fine-grained feedback in training models for complex tasks.
Methodology
The authors construct a deterministic RL training environment that leverages a cache of over 100,000 real API responses to ensure consistent dependency chains. They define workflow templates to guide data synthesis and implement a graduated reward system that evaluates both individual function call correctness and the sequencing of API calls.
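The graduated reward can be sketched as partial credit over two components: whether each call is individually valid, and whether calls appear in the expected order. The weighting and the string-matching simplification below are assumptions for illustration; the paper evaluates real API calls against cached responses:

```python
def graduated_reward(calls, expected, w_valid=0.5, w_order=0.5):
    """Partial-credit reward: fraction of calls that are valid members of the
    expected set (atomic validity) plus fraction of calls in the expected
    position (orchestration). A simplified sketch of the two components."""
    if not expected:
        return 0.0
    valid = sum(1 for c in calls if c in expected) / len(expected)
    in_order = sum(1 for i, c in enumerate(calls)
                   if i < len(expected) and c == expected[i]) / len(expected)
    return w_valid * valid + w_order * in_order
```

Unlike a binary reward, a trace with correct calls in the wrong order still earns the validity component, which gives the RL signal a gradient toward full correctness.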
Results
The proposed framework shows significant improvements in turn accuracy on ComplexFuncBench, with ablation studies indicating that both components of the graduated reward system are essential for effective learning.
Implications
This work has potential applications in enhancing the capabilities of LLMs in real-world scenarios requiring complex API interactions, such as automated customer service, data retrieval systems, and multi-step workflows in software applications.
GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
Large Language Models
Efficient ML
Optimization
- GlowQ uses group-shared low-rank approximation to enhance quantized LLM efficiency.
- The method reduces computational and memory overhead by caching shared factors for input-sharing groups.
- GlowQ-S variant further optimizes performance by selectively applying corrections.
- Empirical results show significant improvements in latency, throughput, and accuracy over strong baselines.
Read more
GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
Summary
The paper introduces GlowQ, a novel approach for enhancing the efficiency of quantized large language models (LLMs) through group-shared low-rank approximation. Traditional quantization techniques often lead to accuracy degradation, particularly when using low-bit representations. Existing low-rank correction methods tend to increase latency and memory overhead by applying error-correction modules to every layer. GlowQ addresses these limitations by caching a single shared right factor for groups of layers that share input, thus reducing the computational burden and memory usage. The method selectively restores only those groups or layers that provide the most significant accuracy improvements. The authors also present GlowQ-S, a selective variant that further optimizes performance. Empirical evaluations demonstrate that GlowQ reduces time-to-first-byte (TTFB) by 5.6% and increases throughput by 9.6% on average, while also improving downstream accuracy and perplexity on benchmarks like WikiText-2. The selective model, GlowQ-S, achieves even greater reductions in latency and increases in throughput, maintaining accuracy within a narrow margin.
Methodology
GlowQ employs a group-shared low-rank approximation strategy, where a single shared right factor is cached for groups of layers that share input. It computes high-precision projections once per group and reuses them across modules, minimizing redundant calculations. The method incorporates a covariance-aligned objective to ensure that the shared right subspace aligns with frequently used input directions. A selective restore policy is implemented to activate only the most beneficial groups or layers during deployment.
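The group-shared factorization can be sketched with toy matrices: each layer's output is its quantized weight applied to the input, plus a low-rank correction U_i applied to a shared projection V x that is computed once per input-sharing group. Matrix shapes and names here are illustrative:

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def corrected_forward(Wq_list, U_list, V_shared, x):
    """Apply each quantized layer plus its low-rank correction, computing the
    shared right-factor projection V_shared @ x only once for the group and
    reusing it across every layer's correction term."""
    z = matvec(V_shared, x)  # cached once per input-sharing group
    return [[wy + uy for wy, uy in zip(matvec(Wq, x), matvec(U, z))]
            for Wq, U in zip(Wq_list, U_list)]
```

Sharing `z` across the group is what removes the per-layer projection cost that makes conventional low-rank correction expensive at inference time.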
Results
GlowQ reduces time-to-first-byte (TTFB) by 5.6% and increases throughput by 9.6% on average compared to strong baselines. It also lowers perplexity on WikiText-2 by 0.17% and improves downstream accuracy by 0.42 percentage points. The selective variant, GlowQ-S, achieves a 23.4% reduction in TTFB and a 37.4% increase in throughput while maintaining accuracy within 0.2 percentage points.
Implications
The findings suggest that GlowQ can significantly enhance the deployment efficiency of quantized LLMs, making them more practical for real-world applications where latency and resource constraints are critical. This approach could be particularly beneficial in scenarios requiring rapid inference and high throughput, such as conversational AI and real-time language processing.
SEVerA: Verified Synthesis of Self-Evolving Agents
Large Language Models
Generative Models
Theory
- Introduces a formal framework for synthesizing self-evolving agents with safety guarantees.
- Combines hard formal specifications with soft performance objectives in agent synthesis.
- Utilizes Formally Guarded Generative Models (FGGM) to ensure outputs meet specified contracts.
- Achieves zero constraint violations across multiple evaluation tasks.
Read more
SEVerA: Verified Synthesis of Self-Evolving Agents
Summary
The paper introduces SEVerA (Self-Evolving Verified Agents), a framework designed to synthesize self-evolving agents while ensuring formal correctness and safety. Traditional self-evolving frameworks lack formal guarantees, raising concerns about the reliability of synthesized programs executed on unseen inputs. SEVerA addresses this by formulating agentic code generation as a constrained learning problem that combines hard formal specifications with soft objectives for task utility. The authors propose Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify formal output contracts for generative model calls using first-order logic. This ensures that outputs from generative models meet specified contracts for any input and parameter setting. The SEVerA framework operates in three stages: Search, where candidate programs are synthesized; Verification, where correctness is proven against constraints; and Learning, where gradient-based optimization is applied to enhance task performance while maintaining correctness. The framework is evaluated on various tasks, demonstrating zero constraint violations and improved performance over existing methods.
Methodology
The methodology involves a three-stage framework: (1) Search for synthesizing candidate programs with FGGM calls, (2) Verification to prove correctness against hard constraints, and (3) Learning through scalable gradient-based optimization to improve task performance while ensuring formal correctness.
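The contract idea can be illustrated with a runtime-checking guard around a generative call. Note this is a deliberate simplification: the paper's FGGMs carry formally verified contracts that hold for any input and parameter setting, whereas the sketch below merely filters candidates at run time:

```python
def guarded_generate(generate, contract, max_tries=5):
    """Run a generative call under an output contract: accept only candidates
    satisfying the predicate, retrying up to max_tries times."""
    for _ in range(max_tries):
        candidate = generate()
        if contract(candidate):
            return candidate
    raise ValueError("no candidate satisfied the contract")
```

In SEVerA the contract would be a first-order-logic specification attached to the FGGM call by the planner LLM, and satisfaction is proven in the Verification stage rather than checked per sample.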
Results
SEVerA was evaluated on tasks such as constrained symbolic regression and policy-compliant agentic tool use, achieving zero constraint violations and outperforming both unconstrained and state-of-the-art baselines in task performance.
Implications
The findings suggest that integrating formal behavioral constraints in self-evolving agent synthesis can significantly enhance reliability and performance, making it applicable in critical domains where safety is paramount, such as automated programming and AI-driven decision-making.
How Class Ontology and Data Scale Affect Audio Transfer Learning
Audio & Speech
- Transfer learning in audio tasks is significantly influenced by the similarity between pre-training and downstream tasks.
- Increasing the number of samples and classes in pre-training data generally improves performance but is not as impactful as task similarity.
- The study provides a set of pre-trained model states on various AudioSet subsets for further research.
- Findings challenge the assumption that larger, more diverse datasets are always optimal for pre-training in audio tasks.
Read more
How Class Ontology and Data Scale Affect Audio Transfer Learning
Summary
This paper investigates the factors influencing the effectiveness of transfer learning (TL) in audio processing tasks, specifically focusing on how class ontology and data scale affect model performance. The authors conduct a comprehensive study using various pre-trained model states derived from subsets of AudioSet, a large labeled audio dataset. They fine-tune these models on three distinct computer audition tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. The findings reveal that while increasing the number of samples and classes in the pre-training data generally enhances transfer learning performance, the similarity between the pre-training data and the downstream tasks plays a more critical role. This suggests that models benefit more from pre-training on data that closely aligns with the target tasks, challenging the assumption that larger and more diverse datasets are always superior for pre-training. The study aims to clarify the conditions under which transfer learning is most effective in the audio domain, providing valuable insights for future research and model development.
Methodology
The authors pre-train various model states on ontology-based subsets of AudioSet and fine-tune them on three specific computer audition tasks. They analyze the impact of task similarity, sample size, and class diversity on transfer learning performance through rigorous experimentation.
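One crude way to quantify the task-similarity factor the study highlights is set overlap between pre-training and downstream class ontologies. This Jaccard proxy is an illustrative assumption on our part, not a metric from the paper:

```python
def label_overlap(pretrain_classes, downstream_classes):
    """Jaccard overlap between two class sets: a crude, illustrative proxy
    for pre-training/downstream task similarity."""
    a, b = set(pretrain_classes), set(downstream_classes)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

Under such a proxy, a pre-training subset whose ontology overlaps heavily with the downstream labels would be predicted to transfer better than a larger but less related subset, mirroring the study's main finding.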
Results
The study demonstrates that while larger datasets with more classes can enhance transfer learning, the similarity of the pre-training data to the downstream tasks is the most significant factor affecting model performance. This finding emphasizes the importance of task alignment in audio transfer learning.
Implications
The results have implications for the design of audio models and datasets, suggesting that researchers should prioritize task similarity over sheer data volume when pre-training models for specific audio applications. This could lead to more efficient use of resources and improved model performance in practical applications.