AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48
Papers today
8h
Update frequency
7
Days of history
A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
Optimization
Theory
- Depth requires stabilization for effective grokking, with depth-4 MLPs failing while depth-8 residual networks succeed.
- The gap between Transformers and MLPs diminishes under matched hyperparameters, indicating confounding factors in previous studies.
- Activation function performance is dependent on the regularization regime, with GELU being advantageous only under certain conditions.
- Weight decay is the dominant factor influencing grokking, with a narrow optimal range necessary for generalization.
Summary
This paper investigates the phenomenon of grokking in neural networks, specifically focusing on the transition from memorization to generalization. The authors conduct a controlled empirical study using modular addition (mod 97) to systematically disentangle the effects of depth, architecture, activation functions, and regularization. The study reveals that grokking dynamics are primarily influenced by the interactions between optimization stability and regularization rather than architecture alone. Key findings include the non-monotonic effect of depth on grokking, where depth-4 MLPs fail to generalize while depth-8 residual networks succeed, indicating the need for architectural stabilization. Additionally, the differences between Transformers and MLPs diminish under matched hyperparameters, suggesting that prior findings were confounded by optimization and regularization choices. The study also highlights that activation functions have regime-dependent effects, with GELU outperforming ReLU only when regularization allows for memorization. Lastly, weight decay emerges as a critical control parameter, with a narrow range of optimal regularization strength necessary for grokking to occur. Overall, this research provides a unified empirical understanding of grokking as an interaction-driven phenomenon, challenging architecture-centric views and clarifying the roles of optimization and regularization in delayed generalization.
Methodology
The authors conducted a systematic empirical study using modular addition tasks, carefully controlling and matching training regimes across different neural network architectures, including MLPs and Transformers. They analyzed the effects of depth, architecture, activation functions, and regularization on the grokking phenomenon.
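The modular-addition task at the core of the study is simple enough to reproduce. A minimal sketch of the dataset construction follows; the 50% train/test split fraction and the seed are illustrative assumptions, not values taken from the paper:

```python
import random

# Build the full modular-addition dataset (a + b) mod p; p = 97 matches
# the paper's setup.
p = 97
pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

# Grokking experiments train on a fixed fraction of all pairs and test on
# the held-out remainder; the 50% split here is an assumption.
random.seed(0)
random.shuffle(pairs)
split = len(pairs) // 2
train, test = pairs[:split], pairs[split:]
```

Sweeping weight decay over such a split is how the narrow optimal regularization range reported above would typically be located.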
Results
The study found that grokking dynamics are primarily influenced by optimization stability and regularization rather than architecture. Depth-4 MLPs consistently failed to grok, while depth-8 residual networks succeeded. The differences between Transformers and MLPs were largely attributed to optimization and regularization confounds. Activation function effects were shown to be regime-dependent, and weight decay was identified as a critical control parameter with a narrow optimal range for grokking.
Implications
The findings challenge traditional architecture-centric interpretations of neural network performance and provide insights into how optimization and regularization can be leveraged to enhance generalization in deep learning models. This has implications for designing more effective training regimes and understanding the underlying mechanisms of generalization in neural networks.
Longitudinal Digital Phenotyping for Early Cognitive-Motor Screening
Time Series
Multimodal
- Introduces an AI-driven framework for continuous monitoring of cognitive-motor development in children.
- Identifies three distinct developmental profiles based on tablet interaction data.
- Demonstrates high stability in low-performance profiles, indicating potential for persistent deficits.
- Utilizes unsupervised learning techniques to uncover natural patterns of cognitive growth.
Summary
This paper presents a novel AI-driven framework for longitudinal digital phenotyping aimed at early cognitive-motor screening in children aged 18 months to 8 years. Traditional assessment methods for cognitive development are often static and subjective, which can hinder timely interventions. The authors leverage digital devices to collect continuous, objective data through tablet-based interactions, focusing on six cognitive-motor tasks such as fine motor control and reaction time. By employing dimensionality reduction techniques (t-SNE) and unsupervised clustering (K-Means++), the study identifies three distinct developmental profiles: low, medium, and high performance. The findings indicate that children in the low-performance cluster exhibit high stability over time, suggesting that early deficits are likely to persist without intervention. In contrast, higher-performance clusters show greater variability, potentially linked to engagement factors. This research validates the use of unsupervised learning on touchscreen interaction data to reveal diverse developmental trajectories, providing a scalable framework for early screening tools and personalized interventions in pediatric care.
Methodology
The study utilized a longitudinal dataset of tablet-based interaction data from children, applying t-SNE for dimensionality reduction and K-Means++ for clustering to identify cognitive-motor performance profiles. The analysis focused on performance metrics derived from six cognitive-motor tasks over multiple academic years.
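The clustering step can be sketched without library dependencies. Below is a toy K-Means++ seeding routine for 2-D embedding points; the study presumably applies a full library implementation to its t-SNE output, so treat this purely as an illustration of the seeding idea:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """K-Means++ seeding: pick the first centre uniformly at random, then
    pick each subsequent centre with probability proportional to its
    squared distance from the nearest centre chosen so far."""
    rng = random.Random(seed)
    centres = [rng.choice(points)]
    while len(centres) < k:
        # squared distance of every point to its nearest current centre
        d2 = [min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centres)
              for x, y in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(pt)
                break
    return centres
```

Points already chosen get zero weight, so far-apart clusters (like the three performance profiles) tend to each receive a seed.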
Results
The analysis revealed three cognitive-motor performance profiles: low, medium, and high. The low-performance cluster showed over 90% retention in early years, indicating stability in deficits, while higher-performance clusters exhibited more variability. This suggests that early identification of low performers could be crucial for timely interventions.
Implications
The findings suggest that digital phenotyping can enhance early detection of cognitive-motor development issues, leading to timely interventions. The framework could be adapted for personalized educational strategies and monitoring in pediatric healthcare, ultimately improving developmental outcomes for children.
Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT): a metamodel for 3D atmospheric flow in urban environments
Theory
Graph Learning
Efficient ML
- AB-SWIFT is the first transformer-based neural operator specifically designed for local-scale atmospheric flow modeling.
- The model is trained on a new dataset that includes various urban geometries and atmospheric stratification conditions.
- AB-SWIFT achieves superior accuracy compared to existing transformer and graph neural network models.
- The model's internal branched structure allows for effective representation of complex urban environments.
Summary
The paper introduces the Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT), a novel transformer-based model designed to efficiently simulate 3D atmospheric flow in urban environments. Traditional Computational Fluid Dynamics (CFD) methods are often slow and costly, particularly when high mesh refinement is required for accurate simulations. AB-SWIFT addresses the challenges posed by complex urban geometries and varying atmospheric conditions by employing a branched internal structure that allows for flexible modeling of terrain and obstacles. The model is trained on a unique dataset that includes atmospheric simulations across diverse urban geometries and atmospheric stratifications, enabling it to capture a wide range of flow behaviors. The authors demonstrate that AB-SWIFT outperforms existing state-of-the-art transformer and graph-based models in terms of accuracy for predicting atmospheric flow fields. The code and dataset used for training are made publicly available, promoting further research and application in urban atmospheric modeling.
Methodology
The AB-SWIFT model utilizes a transformer architecture with a branched internal structure, designed to handle the complexities of urban atmospheric flow. It incorporates vertical meteorological profiles as inputs to account for different atmospheric stratification conditions. The model is trained on a specially curated dataset of atmospheric simulations that represent various urban geometries and flow conditions.
Results
AB-SWIFT demonstrates the best accuracy in predicting atmospheric flow fields when compared to state-of-the-art transformer and graph-based models. The model effectively captures the influence of urban geometry and atmospheric stratification on flow behavior, showcasing its robustness and adaptability in various scenarios.
Implications
The development of AB-SWIFT has significant implications for urban planning, environmental monitoring, and the design of wind farms. By providing a fast and accurate surrogate model for atmospheric flow, it can aid in real-time decision-making processes related to pollutant dispersion, renewable energy optimization, and urban development strategies.
Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
NLP
Large Language Models
Reinforcement Learning
- Introduces a framework for training LLMs in multi-step tool orchestration using real API responses.
- Develops a graduated reward system that provides fine-grained feedback on correctness.
- Demonstrates substantial improvements in model performance on ComplexFuncBench.
- Identifies the limitations of existing training environments and reward structures in RL-based tool use.
Summary
This paper addresses the challenges of training large language models (LLMs) for multi-step tool orchestration, where models must invoke multiple dependent APIs in a correct sequence while managing intermediate outputs. The authors identify two major obstacles: the reliance on simple per-turn function calls in existing environments and the inadequacy of binary rewards that fail to provide feedback for partial correctness. To tackle these issues, the authors propose a novel framework that includes a reinforcement learning (RL) environment supported by a large-scale cache of real API responses, allowing for the synthesis of valid multi-step orchestration traces with controllable complexity. Additionally, they introduce a graduated reward design that breaks down correctness into atomic validity and orchestration, offering more granular feedback. Their empirical evaluations on the ComplexFuncBench demonstrate significant improvements in turn accuracy, with ablation studies confirming the necessity of both reward components for effective learning.
Methodology
The authors formulated multi-step tool orchestration as a sequential decision-making problem and constructed a deterministic RL training environment using a cache of over 100,000 real API responses. They developed a constrained data synthesis pipeline to ensure consistent dependency chains and implemented a graduated reward system that decomposes correctness into atomic validity and orchestration. The framework was empirically validated through training on synthetic data and in-domain evaluation queries.
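The graduated-reward idea can be illustrated with a toy scoring function. The particular decomposition below (per-call validity plus longest correct prefix, with equal weights) is our assumption for illustration, not the paper's exact formula:

```python
def graduated_reward(calls, expected, w_valid=0.5, w_orch=0.5):
    """Toy graduated reward: partial credit for per-call validity plus
    partial credit for matching the expected orchestration order, instead
    of an all-or-nothing binary reward.
    `calls` and `expected` are lists of API names; weights are assumptions."""
    if not expected:
        return 0.0
    # Atomic validity: fraction of emitted calls that belong to the
    # expected set at all (a stand-in for schema/argument checks).
    valid = sum(c in expected for c in calls) / max(len(calls), 1)
    # Orchestration: length of the longest prefix in the expected order.
    prefix = 0
    for c, e in zip(calls, expected):
        if c != e:
            break
        prefix += 1
    orch = prefix / len(expected)
    return w_valid * valid + w_orch * orch
```

A trace that gets the first call right but the second wrong earns 0.5 here rather than 0, which is the kind of gradient signal a binary reward withholds.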
Results
The proposed framework led to significant improvements in turn accuracy on ComplexFuncBench, with ablation studies indicating that both components of the graduated reward system are critical for achieving high performance. The model demonstrated enhanced capabilities in managing multi-step orchestration tasks compared to existing state-of-the-art approaches.
Implications
This research has potential applications in real-world scenarios requiring complex API interactions, such as automated customer service systems, data processing pipelines, and any domain where sequential decision-making with dependencies is crucial. The findings could inform future developments in training methodologies for LLMs and other AI systems.
Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation
Generative Models
Large Language Models
NLP
- Introduces a zero-shot, knowledge-guided framework for synthetic psychiatric data generation.
- Utilizes large language models (LLMs) with Retrieval-Augmented Generation to avoid reliance on real patient data.
- Demonstrates competitive performance in generating high-fidelity synthetic data while preserving privacy.
- Shows that clinical retrieval enhances the fidelity of generated datasets.
Summary
This paper addresses the challenge of limited access to real patient data in psychiatric research, which hampers the development of AI systems in healthcare. The authors propose a novel zero-shot, knowledge-guided framework for generating synthetic psychiatric tabular data using large language models (LLMs) enhanced by Retrieval-Augmented Generation (RAG). By leveraging established clinical knowledge bases such as the DSM-5 and ICD-10, the framework generates privacy-preserving synthetic data without relying on real patient records. The study evaluates the generated datasets against state-of-the-art models like CTGAN and TVAE, which depend on real data. Results indicate that while CTGAN excels in preserving marginal and multivariate structures, the knowledge-augmented LLM demonstrates competitive performance in pairwise structure and achieves lower pairwise error for specific anxiety disorders. An ablation study confirms that clinical retrieval significantly enhances fidelity. Privacy analyses reveal that the LLM-based approach maintains low linkage risk and modest overlaps, making it a promising alternative to traditional data-dependent methods. Overall, this research offers a viable pathway for advancing AI in mental health while safeguarding patient privacy.
Methodology
The authors developed a zero-shot framework that employs large language models (LLMs) guided by clinical knowledge bases (DSM-5 and ICD-10) to generate synthetic psychiatric data. The approach utilizes Retrieval-Augmented Generation (RAG) to ground the model's outputs in clinical criteria, enabling the generation of plausible assessment data without exposure to real patient records. The generated data was evaluated against existing models (CTGAN and TVAE) in terms of fidelity and privacy.
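The retrieve-then-generate flow can be sketched as follows. The lexical retriever and prompt template are stand-ins invented for illustration; the paper's framework presumably uses stronger retrieval over DSM-5/ICD-10 text:

```python
def retrieve(query, knowledge_base, top_k=2):
    """Toy lexical retriever: rank knowledge-base entries by word overlap
    with the query. A stand-in for the real embedding-based retrieval
    over clinical criteria."""
    q = set(query.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: -len(q & set(doc.lower().split())))
    return scored[:top_k]

def build_prompt(query, knowledge_base):
    """Ground the generation request in retrieved clinical criteria."""
    context = "\n".join(retrieve(query, knowledge_base))
    return (f"Clinical criteria:\n{context}\n\n"
            f"Generate one synthetic assessment row for: {query}")
```

The grounding step is what "knowledge-guided" refers to: the LLM sees criteria text, never real patient records.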
Results
The evaluation revealed that CTGAN generally outperformed in terms of marginal and multivariate structure, while the knowledge-augmented LLM achieved competitive results in pairwise structure and exhibited the lowest pairwise error for separation and social anxiety disorders. The ablation study confirmed that incorporating clinical retrieval improved the fidelity of the generated data. Privacy analysis indicated that the LLM-based approach maintained low average linkage risk and modest overlaps, comparable to CTGAN, while TVAE showed extensive duplication.
Implications
This research has significant implications for advancing AI applications in mental health by providing a method for generating high-quality synthetic data without compromising patient privacy. It opens avenues for further research in synthetic data generation and could facilitate the development of AI tools in healthcare settings where real data access is restricted.
Not a fragment, but the whole: Map-based evaluation of data-driven Fire Danger Index models
Time Series
- Traditional evaluation metrics for wildfire prediction models often neglect the impact of false positives.
- The proposed evaluation framework aligns model performance assessment with real-world operational needs.
- Ensemble machine learning models both improve fire detection accuracy and reduce false alarms.
- The study highlights the importance of incorporating a comprehensive set of fire predictors beyond meteorological variables.
Summary
This paper addresses the limitations of traditional evaluation metrics for machine learning models predicting wildfire occurrences, specifically focusing on the Fire Danger Index (FDI). The authors argue that standard metrics often overlook the critical aspect of false positives, which can lead to inefficiencies in operational wildfire management. They propose a novel evaluation framework that aligns more closely with real-world decision-making processes, emphasizing the importance of accurately predicting fire activity while minimizing false alarms. The study employs advanced machine learning architectures, including Convolutional Neural Networks (CNN) and ConvLSTM, to capture complex spatiotemporal relationships in fire predictors. By systematically assessing model performance, the authors demonstrate that an ensemble of machine learning models significantly enhances fire identification accuracy and reduces false positive rates. The findings suggest that improving early warning systems for wildfires is essential in the context of climate change, where the frequency and intensity of wildfires are increasing. The proposed evaluation method is positioned as a crucial step before integrating machine learning models into existing firefighting operational frameworks.
Methodology
The authors utilize CNN and ConvLSTM architectures to forecast the Fire Danger Index. The CNN captures complex nonlinear relationships in heterogeneous data, while the ConvLSTM integrates spatial and temporal dynamics, allowing for better prediction of fire occurrences based on varying time scales of fire predictors. The evaluation framework proposed systematically compares model performance, particularly focusing on false positives.
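The map-based evaluation idea, scoring the whole prediction grid so that false alarms count, can be sketched with two standard verification scores. The 0.5 threshold and the particular score pair are illustrative choices, not the paper's:

```python
def map_scores(pred, obs, threshold=0.5):
    """Map-based evaluation sketch: compare a gridded danger-index
    forecast against observed fire cells over the WHOLE map, so that
    false alarms are counted alongside hits."""
    tp = fp = fn = tn = 0
    for p_row, o_row in zip(pred, obs):
        for p, o in zip(p_row, o_row):
            fire = p >= threshold
            if fire and o:
                tp += 1
            elif fire and not o:
                fp += 1
            elif not fire and o:
                fn += 1
            else:
                tn += 1
    pod = tp / (tp + fn) if tp + fn else 0.0  # probability of detection
    far = fp / (tp + fp) if tp + fp else 0.0  # false-alarm ratio
    return pod, far
```

A model evaluated only on detected fires can score perfectly while flooding operators with alarms; the `far` term is what a whole-map evaluation adds.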
Results
The study finds that the ensemble of machine learning models significantly enhances the accuracy of fire identification and reduces the rate of false positives compared to traditional methods. The novel evaluation framework provides a more realistic assessment of model performance in operational contexts, highlighting the importance of minimizing false alarms.
Implications
The findings have significant implications for wildfire management and early warning systems, suggesting that improved evaluation methods can lead to more effective resource allocation and decision-making in firefighting operations. This research could inform the development of more reliable predictive models that better serve operational needs in the context of increasing wildfire risks due to climate change.
Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Optimization
Large Language Models
- Classical HPO methods outperform LLM-based methods in fixed search spaces.
- LLM agents that edit training code can significantly reduce the performance gap with classical methods.
- The Centaur hybrid method, which shares CMA-ES's internal state with an LLM, achieves the best results.
- Reliability in optimization methods is more critical than exploration breadth.
Summary
This paper investigates the performance of large language models (LLMs) in hyperparameter optimization (HPO) compared to classical algorithms using the autoresearch framework. The authors benchmark nine HPO methods, including four classical algorithms (among them CMA-ES and TPE) and four LLM-based methods, under a fixed compute budget. The study finds that classical HPO methods consistently outperform LLM-based agents when operating within a fixed hyperparameter search space. However, an LLM agent that can directly edit training code significantly narrows this performance gap. The authors introduce 'Centaur,' a hybrid approach that combines CMA-ES with LLMs by sharing the optimizer's internal state, which leads to superior results. The findings suggest that while small and mid-sized LLMs struggle with optimization state tracking, a hybrid approach can leverage the strengths of both classical and LLM methods. The paper concludes that a less powerful LLM can be effective when paired with a robust classical optimizer, challenging the assumption that larger models are always better for optimization tasks.
Methodology
The authors benchmarked nine HPO methods, including four classical and four LLM-based approaches, on the autoresearch task. They utilized a fixed 24-hour GPU training budget and automated the extraction of hyperparameters from the training script to minimize human bias. The performance of each method was evaluated based on the best validation bits-per-byte achieved across trials.
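The Centaur idea of exposing the optimizer's internal state to an advisor can be sketched with a simple (1+1)-ES standing in for CMA-ES and a callback standing in for the LLM. Everything below is an illustrative simplification of the hybrid, not the paper's implementation:

```python
import random

def centaur_es(f, x0, sigma0, advisor=None, steps=200, seed=0):
    """Centaur-style loop sketch: a (1+1)-ES stands in for CMA-ES.
    Each step, the optimizer's internal state (current mean, step size,
    best score) is shown to an `advisor` callback, which may return an
    alternative candidate; the better of the two proposals is used.
    The advisor stands in for the paper's LLM; the ES is an assumption."""
    rng = random.Random(seed)
    x, sigma, fx = x0, sigma0, f(x0)
    for _ in range(steps):
        cand = x + sigma * rng.gauss(0, 1)
        if advisor is not None:
            alt = advisor({"mean": x, "sigma": sigma, "best_f": fx})
            if alt is not None and f(alt) < f(cand):
                cand = alt
        fc = f(cand)
        if fc < fx:          # accept improvements only
            x, fx = cand, fc
            sigma *= 1.1     # 1/5th-rule-style step-size adaptation
        else:
            sigma *= 0.9
    return x, fx
```

The key design point mirrored here is that the advisor sees the optimizer's state rather than just a history of trials, which is what distinguishes Centaur from a plain LLM-suggests-trials loop.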
Results
The results indicated that classical methods like CMA-ES and TPE found better hyperparameter configurations faster than LLM-based methods. However, the LLM agent that directly edited training code performed competitively. The Centaur hybrid method outperformed all others, demonstrating that sharing the internal state of CMA-ES with an LLM can yield superior optimization results.
Implications
The findings suggest that integrating LLMs with classical optimization methods can enhance hyperparameter tuning processes, potentially leading to more efficient AutoML systems. This hybrid approach may pave the way for future research in optimizing machine learning models, especially in scenarios where computational resources are limited.
SEVerA: Verified Synthesis of Self-Evolving Agents
Large Language Models
Generative Models
Theory
- Introduces a formal framework for synthesizing self-evolving agents with safety guarantees.
- Combines hard formal specifications with soft performance objectives in agent synthesis.
- Utilizes Formally Guarded Generative Models (FGGM) to enforce formal output contracts.
- Achieves zero constraint violations across multiple evaluation tasks.
Summary
This paper introduces SEVerA (Self-Evolving Verified Agents), a framework designed to synthesize self-evolving agents with formal guarantees of safety and correctness. The authors address the limitations of existing self-evolving frameworks that lack formal verification, which raises reliability and security concerns when agents are executed autonomously on unseen inputs. SEVerA formulates agentic code generation as a constrained learning problem, integrating hard formal specifications with soft objectives for task utility. The key innovation is the introduction of Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify formal output contracts using first-order logic for each generative model call. This ensures that outputs meet specified contracts regardless of input or parameter settings. The SEVerA framework consists of three stages: Search, where candidate programs are synthesized; Verification, where correctness is proven against hard constraints; and Learning, where gradient-based optimization is applied to enhance task performance while maintaining formal correctness. The framework is evaluated on various tasks, demonstrating zero constraint violations and improved performance compared to unconstrained and state-of-the-art baselines. The results indicate that formal behavioral constraints not only ensure correctness but also guide the synthesis process towards higher-quality agents.
Methodology
The methodology involves a three-stage framework: Search for synthesizing candidate programs, Verification for proving correctness against hard constraints, and Learning for optimizing task performance using gradient-based methods. Formally Guarded Generative Models (FGGM) are employed to specify formal output contracts for generative model calls.
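The contract-guarding idea behind FGGM can be illustrated with a small wrapper. Rejection sampling with a known-safe fallback is our assumed enforcement strategy for the sketch; the paper's actual mechanism may differ:

```python
def guarded_call(generate, contract, max_tries=5, fallback=None):
    """Formally-guarded generation sketch: `contract` is a predicate
    standing in for a first-order-logic output contract. Outputs are
    regenerated until the contract holds; if it never does, a known-safe
    fallback is returned, so callers only ever see contract-satisfying
    outputs regardless of input or model behaviour."""
    for attempt in range(max_tries):
        out = generate(attempt)
        if contract(out):
            return out
    return fallback
```

The invariant this wrapper maintains, that no contract-violating output escapes, is what makes downstream verification of the synthesized agent tractable.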
Results
SEVerA was evaluated on tasks such as constrained symbolic regression and policy-compliant agentic tool use, achieving zero constraint violations while improving task performance over both unconstrained and state-of-the-art baselines.
Implications
The framework has significant implications for the development of reliable and secure autonomous agents, particularly in applications requiring formal correctness, such as program repair, scientific discovery, and automated decision-making systems.
SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning
Generative Models
Graph Learning
Optimization
- SIGMA addresses trajectory divergence in ChemLMs by enforcing latent isotropy through dense trajectory alignment.
- The Structure-Invariant Contrastive Loss maximizes mutual information between equivalent generation paths, decoupling chemical semantics from syntactic variations.
- IsoBeam eliminates isomorphic redundancy during inference, saving computation and improving exploration of structurally distinct molecular scaffolds.
- Empirical results show that SIGMA outperforms strong baselines in sample efficiency and structural diversity.
Summary
The paper introduces SIGMA, a novel framework designed to address the challenges of trajectory divergence in Chemical Language Models (ChemLMs) caused by the inherent modality mismatch between 1D string representations and 2D/3D molecular graphs. Traditional autoregressive models treat different linearizations of the same molecular graph as distinct sequences, leading to manifold fragmentation and inefficiencies in molecular generation. SIGMA employs a token-level contrastive learning approach to enforce geometric invariance, aligning latent representations of structurally equivalent partial graphs. This method allows for a unified trajectory in latent space, enhancing sample efficiency and structural diversity. Additionally, the authors propose Isomorphic Beam Search (IsoBeam), a decoding strategy that prunes redundant paths during inference, further optimizing the generation process. Empirical evaluations demonstrate that SIGMA significantly improves upon existing models in terms of sample efficiency and structural fidelity, bridging the gap between sequence scalability and graph fidelity.
Methodology
The authors propose a token-level contrastive learning framework that aligns latent representations of equivalent molecular prefixes. This is complemented by the IsoBeam algorithm, which dynamically prunes isomorphic paths during inference to reduce redundancy and improve exploration.
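The IsoBeam idea can be sketched as beam search with canonical-form deduplication. The `canonical` function here is a user-supplied stand-in for molecular graph canonicalization of a partial molecule:

```python
def iso_beam(expand, canonical, start, width, steps):
    """IsoBeam-style sketch: ordinary beam search, except that partial
    sequences which canonicalize to the same structure are merged, so
    the beam is not wasted on isomorphic duplicates. `expand(seq)`
    yields (token, logprob) continuations; `canonical(seq)` maps a
    sequence to a hashable key identifying its structure class."""
    beam = [(0.0, start)]
    for _ in range(steps):
        seen = {}
        for score, seq in beam:
            for tok, logp in expand(seq):
                cand = (score + logp, seq + [tok])
                key = canonical(cand[1])
                # keep only the best-scoring representative per class
                if key not in seen or cand[0] > seen[key][0]:
                    seen[key] = cand
        beam = sorted(seen.values(), key=lambda c: -c[0])[:width]
    return beam
```

With an order-insensitive canonical form, paths spelling the same structure in different token orders collapse into one beam slot, freeing the rest for genuinely distinct scaffolds.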
Results
Empirical evaluations on standard benchmarks indicate that SIGMA significantly enhances sample efficiency and structural diversity compared to existing models, effectively bridging the gap between sequence-based and graph-based molecular generation.
Implications
The proposed SIGMA framework has the potential to improve drug discovery processes by enabling more efficient and diverse molecular generation, thereby facilitating the design of novel drug candidates and enhancing the predictive modeling of molecular properties.
Contrastive Learning Boosts Deterministic and Generative Models for Weather Data
Time Series
Generative Models
Multimodal
- Contrastive learning effectively generates robust low-dimensional embeddings from high-dimensional weather data.
- The proposed SPARTA method improves performance over traditional autoencoders in downstream tasks.
- Incorporating domain-specific knowledge through graph neural networks enhances the contrastive learning approach.
- The methodology addresses the challenges of data sparsity and multimodality in weather datasets.
Summary
This paper addresses the challenges of high-dimensional and multimodal weather data, which complicates tasks such as forecasting and extreme-weather detection. The author proposes a novel approach using contrastive learning to generate low-dimensional embeddings from unlabelled weather data, specifically focusing on the ERA5 dataset. The study introduces SPARse-data augmented conTRAstive spatiotemporal embeddings (SPARTA), which aligns sparse samples with complete ones through a contrastive loss term. The methodology includes a temporally aware batch sampling strategy and a cycle-consistency loss to enhance the latent space structure. Additionally, a graph neural network fusion technique is proposed to incorporate domain-specific physical knowledge into the contrastive learning process. The results demonstrate that contrastive learning significantly outperforms traditional autoencoder methods across various downstream tasks, indicating its effectiveness as a compression technique for sparse geoscience data.
Methodology
The study employs contrastive learning to create SPARTA embeddings from the ERA5 dataset, utilizing a contrastive loss term to align sparse and complete samples. It introduces a temporally aware batch sampling strategy and a cycle-consistency loss to refine the latent space structure. Additionally, a graph neural network fusion technique is used to integrate physical knowledge into the model.
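The alignment term can be illustrated with a generic InfoNCE-style loss over (sparse, complete) embedding pairs. This is the standard contrastive loss, not SPARTA's exact formulation, and the temperature of 0.1 is an assumption:

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss on paired embeddings: each
    sparse-sample embedding (anchor) is pulled toward the embedding of
    its complete counterpart (positive) and pushed away from the other
    positives in the batch. Inputs are sequences of equal-length
    vectors, assumed pre-normalised."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]  # -log softmax of the true pair
    return loss / len(anchors)
```

The loss is near zero when each sparse sample sits closest to its own complete counterpart, which is exactly the alignment the contrastive term rewards.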
Results
The experiments show that the proposed contrastive learning method outperforms autoencoders in various downstream tasks, demonstrating improved performance in handling sparse weather data and generating more structured embeddings.
Implications
The findings suggest that contrastive learning can be a powerful tool for weather data analysis, potentially leading to better forecasting models and improved understanding of climate phenomena. This approach could be applied to other domains with similar data challenges.
Gap Safe Screening Rules for Fast Training of Robust Support Vector Machines under Feature Noise
Optimization
Efficient ML
Theory
- Introduction of safe sample screening rules for R-SVMs to reduce training complexity.
- First application of safe screening techniques to worst-case robust models in supervised learning.
- Methodology based on Lagrangian duality rather than Fenchel-Rockafellar duality.
- Experimental results show significant reduction in training time with preserved accuracy.
Summary
This paper introduces a novel approach to enhance the training efficiency of Robust Support Vector Machines (R-SVMs) by developing safe sample screening rules that effectively reduce computational complexity without compromising the optimal solution. The authors highlight the challenges posed by feature noise in R-SVMs, which typically require more intensive computational resources due to their worst-case robust formulation. The proposed screening rules identify training samples whose uncertainty sets are guaranteed to lie entirely on one side of the margin hyperplane, allowing for a reduction in the problem size and acceleration of the optimization process. This study is the first to apply safe screening techniques to worst-case robust models in supervised machine learning, marking a significant advancement in the field. The methodology is grounded in Lagrangian duality, diverging from the commonly used Fenchel-Rockafellar duality, and leads to the establishment of both an ideal and a practical screening rule adapted to the robust setting. Experimental results demonstrate that the proposed method significantly decreases training time while maintaining classification accuracy, showcasing its potential for practical applications in scenarios with feature noise.
Methodology
The authors reformulate the R-SVM as a second-order cone program and derive its dual using Lagrangian duality. They establish KKT conditions and develop screening rules based on a safe region that is compatible with the robust formulation, adapting GAP-based safe regions to the context of R-SVMs.
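The screening logic can be sketched in a few lines. This is a toy version of the generic gap-safe idea for a plain hinge-loss SVM; the radius formula and the kink test below are illustrative assumptions, not the paper's Lagrangian-dual robust rule.

```python
import numpy as np

def gap_safe_screen(X, y, w, dual_gap, lam):
    """Toy gap-safe screening for a hinge-loss SVM (illustrative only).

    A sample is safely discardable when its margin provably stays on
    one side of the hinge kink for every primal solution inside a ball
    around the current iterate, whose radius shrinks with the duality
    gap.
    """
    r = np.sqrt(2.0 * dual_gap / lam)      # safe-ball radius (assumed form)
    margins = y * (X @ w)                  # current margins y_i <x_i, w>
    norms = np.linalg.norm(X, axis=1)
    # sample i is screenable if the whole interval
    # [margin_i - r*||x_i||, margin_i + r*||x_i||] avoids the kink at 1
    lower, upper = margins - r * norms, margins + r * norms
    return (lower > 1.0) | (upper < 1.0)   # True -> safe to discard
```

In a solver loop the mask would be recomputed as the duality gap shrinks, so more samples become discardable as the iterate approaches optimality.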
Results
The proposed screening rules lead to a substantial decrease in training time for R-SVMs while ensuring that classification accuracy remains intact. The experimental validation confirms the effectiveness of the approach in handling feature noise.
Implications
This work has significant implications for the efficient training of robust machine learning models, particularly in applications where feature noise is prevalent. The safe screening techniques can be integrated into existing R-SVM frameworks to enhance their scalability and usability in real-world scenarios.
Local learning for stable backpropagation-free neural network training towards physical learning
Efficient ML
Theory
Optimization
- Introduction of FFzero, a backpropagation-free learning framework.
- Utilizes layer-wise local learning and directional-derivative optimization.
- Demonstrates effectiveness in multilayer perceptron and convolutional networks.
- Provides a viable path for in-situ physical learning using simulated photonic networks.
Summary
This paper addresses the limitations of traditional backpropagation methods in training neural networks, particularly in the context of physical learning systems. The authors introduce FFzero, a novel forward-only learning framework that enables stable neural network training without relying on backpropagation or automatic differentiation. FFzero employs layer-wise local learning, prototype-based representations, and directional-derivative-based optimization, allowing for effective training through forward evaluations. The framework is demonstrated to generalize across multilayer perceptron and convolutional neural networks for both classification and regression tasks. A simulated photonic neural network serves as a case study, illustrating that FFzero can facilitate backpropagation-free in-situ physical learning, thus providing a promising alternative to conventional digital training methods that are increasingly constrained by physical and environmental limitations.
Methodology
The authors developed FFzero, which combines local learning techniques with prototype-based representations and directional-derivative optimization. This approach allows for training neural networks through forward evaluations only, avoiding the need for backpropagation. The framework was tested on multilayer perceptron and convolutional neural networks, with a focus on both classification and regression tasks.
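The directional-derivative idea can be illustrated with a forward-only update on a toy objective. The sketch below assumes one random probe direction per step and a central-difference slope estimate, which captures the general flavor rather than FFzero's exact rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_only_step(loss_fn, theta, lr=0.1, eps=1e-4):
    """One backprop-free update (illustrative, not FFzero's exact rule):
    probe the loss along a random unit direction with two forward
    passes and move against the estimated slope."""
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    # central-difference estimate of the directional derivative
    d = (loss_fn(theta + eps * v) - loss_fn(theta - eps * v)) / (2 * eps)
    return theta - lr * d * v

# toy quadratic objective with its minimum at the origin
loss = lambda t: float(np.sum(t ** 2))
theta = np.array([3.0, -2.0])
for _ in range(500):
    theta = forward_only_step(loss, theta)
```

Because every quantity comes from forward evaluations only, the same loop could in principle drive a physical system whose gradient is inaccessible, which is the motivation for the photonic case study.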
Results
The results indicate that FFzero enables stable training of neural networks without backpropagation, and succeeds in scenarios where backpropagation-based training fails. The simulated photonic neural network demonstrated the feasibility of in-situ physical learning, showcasing the potential of FFzero in practical applications.
Implications
The findings suggest that FFzero could lead to more sustainable and efficient training methods for neural networks, particularly in physical systems where traditional digital computing methods are limited. This could pave the way for advancements in neuromorphic computing and other analog computing paradigms, reducing the environmental impact of AI training.
DyMRL: Dynamic Multispace Representation Learning for Multimodal Event Forecasting in Knowledge Graph
Multimodal
Graph Learning
Time Series
- DyMRL integrates time-sensitive structural features from multiple geometric spaces for deep representation learning.
- The approach incorporates dual fusion-evolution attention mechanisms for dynamic multimodal feature fusion.
- Extensive experiments show that DyMRL outperforms existing dynamic and static methods in event forecasting.
- The framework reflects human-like cognitive processes in associative thinking and logical reasoning.
Summary
The paper introduces DyMRL, a novel approach for dynamic multispace representation learning aimed at enhancing multimodal event forecasting within knowledge graphs. Traditional methods have primarily focused on static settings, neglecting the dynamic nature of multimodal knowledge acquisition and fusion. DyMRL addresses two critical issues: the learning of time-sensitive information across different modalities and the evolving fusion of multimodal features. To tackle the first issue, DyMRL employs a relational message-passing framework that integrates time-specific structural features from various geometric spaces (Euclidean, hyperbolic, and complex) to capture deep relational features. This approach reflects human cognitive processes such as associative thinking and logical reasoning. For the second issue, DyMRL utilizes dual fusion-evolution attention mechanisms that dynamically adjust the emphasis on different modalities over time, allowing for a more nuanced understanding of their contributions to future events. The effectiveness of DyMRL is validated through extensive experiments on four multimodal temporal knowledge graph benchmarks, demonstrating superior performance compared to state-of-the-art dynamic unimodal and static multimodal methods.
Methodology
DyMRL employs a relational message-passing framework to learn deep representations from time-specific structural features across Euclidean, hyperbolic, and complex spaces. It also integrates dual fusion-evolution attention mechanisms to dynamically adjust the learning emphasis on different modalities over time.
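The per-time-step fusion can be pictured as attention over modality embeddings. The form below is an assumed simplification for illustration, not DyMRL's actual fusion-evolution architecture: each modality is scored against a time-dependent query and the embeddings are combined by a softmax-weighted sum.

```python
import numpy as np

def fuse_modalities(mod_feats, query):
    """Toy modality fusion (assumed form, not DyMRL's architecture):
    score each modality embedding against a query and return the
    softmax weights together with the weighted sum."""
    scores = np.array([f @ query for f in mod_feats])
    w = np.exp(scores - scores.max())      # numerically stable softmax
    w /= w.sum()
    fused = sum(wi * fi for wi, fi in zip(w, mod_feats))
    return w, fused
```

Letting the query evolve with time is what allows the emphasis on each modality to shift as events unfold.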
Results
DyMRL demonstrated significant improvements in event forecasting accuracy on four multimodal temporal knowledge graph benchmarks, outperforming both state-of-the-art dynamic unimodal and static multimodal baseline methods.
Implications
The proposed DyMRL framework has potential applications in various domains that require accurate event forecasting based on multimodal data, such as urban management, recommendation systems, and other real-world scenarios where knowledge graphs are utilized.
Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models
Time Series
Interpretability
Efficient ML
- Lightweight forecasting models can achieve competitive accuracy in PM2.5 prediction.
- Facebook Prophet gave the best accuracy and execution time under walk-forward refitting.
- Online residual correction improved the robustness of SARIMAX forecasts, though not NeuralProphet's.
- The study emphasizes the importance of interpretability and computational efficiency in air quality forecasting.
Summary
This study addresses the challenge of accurate short-term PM2.5 forecasting for urban air quality, particularly in Beijing, China. It investigates whether lightweight and interpretable forecasting methods can compete with more complex models. The authors developed a leakage-aware forecasting workflow that integrates chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling under a Perfect Prognosis setting. Three forecasting models were compared: SARIMAX, Facebook Prophet, and NeuralProphet, evaluated under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction. Results indicated that Facebook Prophet outperformed the others in predictive accuracy and computational efficiency, achieving a mean absolute error (MAE) of 37.61 and a root mean square error (RMSE) of 50.10 under walk-forward refitting. In the frozen-model regime, corrected SARIMAX yielded the lowest overall error (MAE 32.50; RMSE 46.85). The findings suggest that lightweight forecasting strategies can provide a practical balance between accuracy, interpretability, and computational efficiency, making them suitable for real-world applications in urban air quality management.
Methodology
The authors employed a leakage-aware forecasting workflow that included chronological data partitioning, preprocessing, feature selection, and modeling with exogenous drivers. They compared three forecasting families (SARIMAX, Facebook Prophet, NeuralProphet) under two regimes: walk-forward refitting and frozen forecasting with online residual correction, using a rolling multi-week evaluation design.
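The walk-forward protocol can be sketched generically; `fit` and `predict` below stand in for any of the three model families, and the weekly block size is a parameter rather than the paper's exact setting.

```python
import numpy as np

def walk_forward(series, fit, predict, train_weeks, horizon):
    """Sketch of walk-forward evaluation: refit on all data seen so
    far, forecast the next block, slide forward, and pool the errors."""
    preds, actuals = [], []
    t = train_weeks
    while t + horizon <= len(series):
        model = fit(series[:t])                 # refit on full history
        preds.extend(predict(model, horizon))   # forecast next block
        actuals.extend(series[t:t + horizon])
        t += horizon                            # slide the window
    preds, actuals = np.array(preds), np.array(actuals)
    mae = np.mean(np.abs(preds - actuals))
    rmse = np.sqrt(np.mean((preds - actuals) ** 2))
    return mae, rmse
```

The "frozen" regime differs only in fitting once before the loop and applying an online residual correction instead of refitting.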
Results
Facebook Prophet achieved the best performance under walk-forward refitting with an MAE of 37.61 and RMSE of 50.10. In the frozen-model regime, corrected SARIMAX yielded the lowest error (MAE 32.50; RMSE 46.85). NeuralProphet was less accurate and stable across both regimes, and residual correction did not enhance its forecasts.
Implications
The findings suggest that lightweight and interpretable forecasting models can be effectively utilized for urban air quality management, providing timely predictions that are crucial for public health and policy planning. This approach can facilitate the deployment of forecasting systems in smart cities, enhancing adaptive traffic and health management.
From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents
Theory
- Introduces a formal framework for evaluating Deep Research Agents using category theory.
- Develops a benchmark with 296 questions to rigorously test DRA capabilities.
- Finds that even the best state-of-the-art model achieves only 19.9% average accuracy, revealing significant evaluation challenges.
- Identifies a dichotomy in AI capabilities: strengths in dynamic reasoning but weaknesses in multi-hop synthesis.
Summary
This paper addresses the evaluation challenges faced by Deep Research Agents (DRAs), which are increasingly utilized for complex information synthesis. The authors argue that existing empirical benchmarks are insufficient as they do not rigorously model agent behavior or test their capabilities in long-horizon synthesis and ambiguity resolution. To overcome this limitation, the authors propose a formal framework based on category theory, modeling DRA behavior as a composition of structure-preserving maps (functors). They introduce a novel benchmark consisting of 296 questions designed to stress-test DRAs across four axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. The evaluation of 11 leading models reveals a persistently low baseline performance, with the best model achieving only 19.9% average accuracy. The findings indicate a significant gap in current AI capabilities, where advanced models excel in dynamic topological reordering and ontological verification but struggle with multi-hop structural synthesis. This highlights a reliance on brittle heuristics rather than a comprehensive understanding of complex structural information, suggesting that while DRAs can integrate search and reasoning, mastering complex structural tasks remains a significant challenge.
Methodology
The authors formalize DRA behavior using category theory, modeling the research workflow as a composition of functors. They construct a benchmark with 296 bilingual questions that stress-test agents along four axes, ensuring rigorous evaluation through a human-verified pipeline.
Results
The evaluation of 11 leading models shows a low baseline performance, with the best model achieving only 19.9% accuracy. The results reveal that while models can perform well in certain areas, they generally fail in multi-hop structural synthesis, indicating a reliance on heuristics.
Implications
The findings suggest that while current DRAs can integrate search and reasoning effectively, there is a pressing need for advancements in their ability to handle complex structural tasks. This work lays the groundwork for future research in developing more robust evaluation frameworks and improving DRA capabilities.
Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Optimization
Theory
- Introduces the first high-probability regret bound for two-point feedback in OCO.
- Achieves a minimax optimal regret bound of O(d(log T + log(1/δ))/µ) for strongly convex losses.
- Improves the dimension dependency from O(d²) to O(d), enhancing efficiency.
- Develops a novel analytical framework that is robust to variance in gradient estimators.
Summary
This paper addresses the challenge of Online Convex Optimization (OCO) with two-point bandit feedback in adversarial settings, where a player aims to minimize a sequence of convex loss functions while only observing their values at two points. Previous work established expectation bounds but struggled to achieve tight high-probability regret bounds for strongly convex functions due to the heavy-tailed nature of bandit gradient estimators. The author resolves this issue by providing the first high-probability regret bound of O(d(log T + log(1/δ))/µ) for µ-strongly convex losses. This result is minimax optimal concerning both the time horizon T and the dimension d. The methodology departs from conventional approaches by introducing a novel analytical framework that enhances robustness against the variance of zero-order estimators, significantly improving the dependency on dimension from O(d²) to O(d). The techniques used extend those from previous works to accommodate high-confidence concentration, matching the information-theoretic lower bounds for this setting.
Methodology
The paper employs a novel analytical framework that departs from traditional reduction-based methods. It utilizes two-point feedback to construct approximate gradient estimators and applies advanced geometric and probabilistic techniques to derive high-probability regret bounds. The proposed algorithm involves selecting actions based on randomized points and updating them using the constructed gradients.
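The two-point estimator at the heart of such algorithms has a standard form: query the loss at x ± δu for a random unit direction u and rescale by d/(2δ). The constants below follow the classic construction; the paper's projection and step-size details are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def two_point_gradient(f, x, delta):
    """Standard two-point zero-order gradient estimator: two function
    values along a random unit direction u, rescaled so that the
    estimate is unbiased for linear f (E[u u^T] = I/d)."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return (d / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
```

The heavy tails discussed in the summary come from the d/(2δ) scaling: single estimates can be large even when their average is accurate, which is why high-probability bounds are harder than expectation bounds.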
Results
The main result establishes that with high probability (at least 1 - δ), the regret for the proposed algorithm satisfies RT ≤ O(d(log T + log(1/δ))/µ). This bound is shown to be minimax optimal with respect to both the time horizon T and the dimension d, resolving a long-standing open problem in the field.
Implications
The findings have significant implications for the design of efficient online learning algorithms in adversarial environments, particularly in applications where only limited feedback is available. This work can enhance performance in various domains that rely on online convex optimization, such as machine learning, finance, and operations research.
Neural Network Conversion of Machine Learning Pipelines
Theory
Efficient ML
Optimization
- Introduces a novel approach to convert traditional ML pipelines into neural networks using student-teacher learning.
- Demonstrates that neural networks can effectively mimic the performance of random forest classifiers across multiple tasks.
- Explores the use of random forests for hyper-parameter selection in neural networks.
- Highlights the benefits of unified inference engines for joint optimization of ML components.
Summary
This paper explores the conversion of traditional machine learning pipelines into neural network (NN) architectures using a student-teacher learning approach. The authors focus on transferring knowledge from a non-neural machine learning model, specifically a random forest classifier, to a neural network student. The goal is to enable joint optimization of various components within the pipeline and create a unified inference engine for multiple machine learning tasks. The authors conducted experiments on 100 OpenML tasks where random forests were previously effective. They found that with appropriate hyper-parameter selection, the student NN could replicate the performance of the teacher model in most cases. Additionally, the study investigates the use of random forests to aid in selecting the optimal hyper-parameters for the neural network, highlighting the potential for improved performance through this conversion process.
Methodology
The authors employed a student-teacher learning framework where a neural network (student) learns from a random forest classifier (teacher). They experimented with various neural network topologies and hyper-parameters across 100 OpenML tasks, using the teacher model to generate labels for additional training data for the student model.
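A minimal version of the student-teacher transfer can be sketched with a softmax student fit to a teacher's soft labels. The paper's student is a deeper network and its teacher a random forest, so treat this as an illustration of the training signal, not the reported pipeline.

```python
import numpy as np

def distill(teacher_probs, X, n_classes, lr=0.5, steps=2000):
    """Illustrative student-teacher transfer: fit a linear softmax
    student to the teacher's class probabilities by cross-entropy
    gradient descent. teacher_probs plays the role of the labels the
    teacher model generates for the student's training data."""
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(steps):
        z = X @ W
        p = np.exp(z - z.max(axis=1, keepdims=True))   # stable softmax
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - teacher_probs) / len(X)   # CE gradient step
    return W
```

In the paper's setting the teacher also labels extra unlabeled inputs, enlarging the student's training set beyond the original data.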
Results
The results indicated that the student neural networks could successfully mimic the performance of the random forest teacher in the majority of the tasks when the right hyper-parameters were chosen. The study also confirmed that the random forest could assist in hyper-parameter selection for the neural networks, enhancing their performance.
Implications
This research has significant implications for the development of more efficient machine learning systems, particularly in scenarios where traditional models can be seamlessly integrated into neural network architectures. It suggests that leveraging existing models can lead to improved performance and adaptability in various applications.
How unconstrained machine-learning models learn physical symmetries
Theory
Graph Learning
Efficient ML
- Unconstrained ML models can learn physical symmetries effectively through data augmentation.
- The paper introduces metrics to measure the symmetry content of learned representations.
- Analysis of symmetry processing across model layers provides insights into model performance.
- Strategic injection of inductive biases can improve model stability and accuracy.
Summary
This paper explores how unconstrained machine-learning (ML) models can learn physical symmetries, which are crucial in many areas of physics. Traditionally, models are designed with strict constraints to ensure they adhere to these symmetries, but this can limit their expressivity and computational efficiency. The authors introduce rigorous metrics to quantify the symmetry content of learned representations in unconstrained models, specifically focusing on two transformer-based architectures applied to atomistic simulations and particle physics. The study reveals that these models can achieve high accuracy in approximating equivariant behavior through data augmentation, despite not being explicitly designed to enforce symmetry. By analyzing how symmetry information is processed across model layers, the authors establish a framework for diagnosing spectral failure modes in ML models. They demonstrate that by strategically injecting minimal inductive biases, one can enhance both stability and accuracy while maintaining the advantages of unconstrained architectures. This work highlights the potential of unconstrained models in achieving physical fidelity in simulations without the computational burden of strict symmetry enforcement.
Methodology
The authors developed metrics to quantify equivariance errors and the symmetry content of internal features in ML models. They applied these metrics to two transformer-based architectures, analyzing how symmetry information is processed during training and across architectural layers.
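One concrete way to quantify equivariance error, in the spirit of (though not necessarily identical to) the paper's metrics, is to compare f(Rx) with R f(x) over sampled group elements; here R ranges over 2D rotations.

```python
import numpy as np

def rot(a):
    """2D rotation matrix for angle a."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])

def equivariance_error(f, x, n_angles=32):
    """Average relative deviation between f(R x) and R f(x) over
    sampled rotations R: zero for an exactly equivariant map, larger
    the more the learned map breaks the symmetry (assumed metric)."""
    errs = []
    for a in np.linspace(0, 2 * np.pi, n_angles, endpoint=False):
        R = rot(a)
        errs.append(np.linalg.norm(f(R @ x) - R @ f(x))
                    / (np.linalg.norm(f(x)) + 1e-12))
    return float(np.mean(errs))
```

Applied to intermediate layer outputs rather than the final prediction, the same quantity tracks how symmetry content develops across a model's depth.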
Results
The study found that unconstrained models could approximate equivariant behavior with high accuracy, demonstrating that the errors due to approximate symmetry are often negligible compared to baseline model accuracy. The proposed metrics effectively diagnosed symmetry learning and identified potential improvements in model design.
Implications
This research suggests that unconstrained ML models can be a viable alternative to traditional models in physical simulations, offering enhanced expressivity and efficiency. The findings can inform the design of future ML architectures that balance performance with adherence to physical principles.
Grokking as a Falsifiable Finite-Size Transition
Theory
- Introduces a framework for testing grokking as a finite-size transition using statistical mechanics principles.
- Identifies the group order p of Zp as an extensive variable and spectral head–tail contrast as an order parameter.
- Demonstrates that grokking exhibits a shared finite-size boundary, challenging the smooth-crossover interpretation.
- Applies a rigorous diagnostic protocol that distinguishes genuine transitions from mere fitting exercises.
Summary
This paper investigates the phenomenon of grokking in neural networks, characterized by a delayed onset of generalization following initial memorization. The authors argue that previous descriptions of grokking as a phase transition lack falsifiable finite-size predictions. They introduce a framework using the group order p of Zp as an extensive variable and a spectral head–tail contrast as an order parameter. Applying a finite-size scaling (FSS) diagnostic protocol, they perform Binder-cumulant crossing and susceptibility analyses of the transition behavior. Their findings reveal a shared finite-size boundary that strongly disfavors a smooth-crossover interpretation, indicating that grokking can be quantitatively tested as a finite-size claim. However, the exact order of the transition remains unresolved. The study emphasizes the need for a structured approach to understanding grokking through the lens of statistical mechanics, providing a foundation for future research in this area.
Methodology
The authors employed a finite-size scaling (FSS) diagnostic protocol, identifying an extensive variable (the group order p of Zp) and an order parameter (the spectral head–tail contrast). They then performed Binder-cumulant crossing and susceptibility analyses of the transition behavior of neural networks trained on modular arithmetic tasks.
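The Binder cumulant driving such crossing analyses has a standard definition, U = 1 - <m^4> / (3 <m^2>^2); the paper applies it to its spectral head–tail contrast as the order parameter m.

```python
import numpy as np

def binder_cumulant(m):
    """Binder cumulant of order-parameter samples m. Curves of U
    versus the control parameter for different system sizes cross near
    a genuine transition; U -> 0 for Gaussian (disordered) samples and
    U -> 2/3 for a sharply two-valued (ordered) distribution."""
    m = np.asarray(m, dtype=float)
    m2, m4 = np.mean(m ** 2), np.mean(m ** 4)
    return 1.0 - m4 / (3.0 * m2 ** 2)
```

The diagnostic power comes from the two limits: whether measured curves cross or merely drift distinguishes a transition from a smooth crossover.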
Results
The analysis revealed evidence supporting a transition-like finite-size organization in grokking, with strong disfavor towards a smooth-crossover interpretation. The findings indicate a shared finite-size boundary across different system sizes, although the specific transition order remains undetermined.
Implications
The study provides a structured approach to understanding grokking, which could lead to improved insights into generalization in neural networks. It opens avenues for future research to explore the nature of transitions in learning systems and the application of statistical mechanics in machine learning.
An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks
Interpretability
- Introduction of a high-performance meta-ensemble framework for crop classification.
- Integration of Explainable AI methods to enhance transparency and interpretability.
- Identification of key soil and climate features impacting crop suitability, consistent with agronomic knowledge.
- Achieved 98.80% accuracy, precision, recall, and F1-score, outperforming individual models.
Summary
This paper addresses the challenges faced in agriculture due to climate change and resource depletion by proposing an explainable ensemble learning framework for crop classification. The framework integrates optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks to enhance crop suitability predictions based on soil characteristics and climatic conditions. Utilizing a dataset of 3,867 instances with 29 features from the Ethiopian Agricultural Transformation Agency and NASA, the authors employed various preprocessing techniques, including label encoding, outlier removal, normalization, and SMOTE for class balancing. A comparative analysis of several machine learning models, including Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, and a novel Relative Error Support Vector Machine, was conducted with hyperparameter tuning through Grid Search and cross-validation. The proposed 'Final Ensemble' meta-ensemble design achieved an impressive accuracy of 98.80%, surpassing individual models like K-Nearest Neighbors, which had an accuracy of 95.56%. The integration of Explainable AI methods, such as SHAP and permutation importance, provided insights into critical features influencing crop classification, thereby bridging the gap between complex machine learning models and actionable agricultural decision-making.
Methodology
The methodology involves a novel meta-ensemble framework that combines Feature Pyramid Networks, Deep Networks, Self-Attention mechanisms, and Residual Networks. Preprocessing techniques such as label encoding, outlier removal, normalization, and SMOTE were utilized to prepare the dataset. Various machine learning models were compared, and hyperparameter tuning was performed using Grid Search and cross-validation to optimize performance.
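The final combination step of a stacked ensemble can be sketched as a weighted blend of base-model class probabilities. The meta-weights below stand in for whatever meta-learner the paper's 'Final Ensemble' actually trains; this is an illustration of stacking, not the reported architecture.

```python
import numpy as np

def stack_predict(base_probs, meta_w):
    """Toy stacked ensemble: blend per-model class-probability arrays
    with meta-weights and predict the argmax class.

    base_probs: shape (n_models, n_samples, n_classes)
    meta_w:     shape (n_models,)
    """
    combined = np.tensordot(meta_w, base_probs, axes=1)  # (n_samples, n_classes)
    return combined.argmax(axis=1)
```

In practice the meta-weights (or a richer meta-model) would be fit on held-out predictions of the base models to avoid leaking training labels.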
Results
The 'Final Ensemble' achieved an accuracy of 98.80%, with corresponding precision, recall, and F1-score also reaching 98.80%. This performance significantly outperformed individual models, with K-Nearest Neighbors achieving 95.56% accuracy. Explainable AI methods highlighted critical features such as soil pH, nitrogen, and zinc, providing actionable insights for crop classification.
Implications
The findings of this study have significant implications for sustainable agriculture by providing farmers and policymakers with interpretable, data-driven recommendations for crop selection. The framework fosters trust in AI-powered solutions and enhances the adoption of machine learning technologies in agricultural practices.
Physics-Informed Neural Network Digital Twin for Dynamic Tray-Wise Modeling of Distillation Columns under Transient Operating Conditions
Optimization
Theory
Time Series
- Introduction of a PINN framework that embeds thermodynamic constraints for distillation modeling.
- Development of a sigmoid-scheduled adaptive loss-weighting strategy for training.
- Creation of a comprehensive synthetic dataset for evaluating the model's performance.
- Demonstration of superior performance compared to traditional data-driven models.
Summary
This paper presents a novel Physics-Informed Neural Network (PINN) digital twin framework aimed at enhancing the dynamic, tray-wise modeling of binary distillation columns under transient operating conditions. By integrating fundamental thermodynamic constraints directly into the neural network's loss function, the proposed model ensures compliance with vapor-liquid equilibrium (VLE) principles, tray-level mass and energy balances, and the McCabe-Thiele methodology. The model is trained on a high-fidelity synthetic dataset generated from Aspen HYSYS, comprising 961 timestamped measurements over 8 hours of operation with 16 sensor streams. An adaptive loss-weighting scheme is employed to balance data fidelity and physics consistency during training. The results show that the PINN outperforms five data-driven baselines, achieving a root mean square error (RMSE) of 0.00143 for HX mole fraction prediction, which is a 44.6% improvement over the best data-only baseline, while adhering to thermodynamic constraints. The digital twin effectively captures transient dynamics, including feed tray responses and variations in reflux ratios and pressure, establishing its potential for real-time soft sensing, model-predictive control, and anomaly detection in industrial distillation processes.
Methodology
The methodology involves the development of a PINN architecture that incorporates thermodynamic constraints into the loss function. The model is trained using a synthetic dataset generated from Aspen HYSYS, with an adaptive loss-weighting scheme that stages the emphasis on physical constraints versus data fidelity throughout the training process.
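The sigmoid-scheduled weighting can be sketched as a smooth ramp on the physics term. The steepness and midpoint below are assumed values for illustration, not the paper's settings.

```python
import numpy as np

def physics_weight(epoch, total_epochs, w_max=1.0, k=10.0, midpoint=0.5):
    """Sigmoid-scheduled weight on the physics residuals (assumed
    steepness k and midpoint): near zero early so data fidelity
    dominates, then smoothly ramping to w_max as training proceeds."""
    t = epoch / total_epochs
    return w_max / (1.0 + np.exp(-k * (t - midpoint)))

def total_loss(data_loss, physics_loss, epoch, total_epochs):
    # composite objective: data term plus scheduled physics term
    return data_loss + physics_weight(epoch, total_epochs) * physics_loss
```

Phasing in the thermodynamic residuals this way lets the network first fit the sensor data before the VLE and balance constraints start to dominate the gradient.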
Results
The proposed PINN achieved an RMSE of 0.00143 for HX mole fraction prediction, with an R² value of 0.9887, representing a 44.6% reduction in error compared to the best-performing data-only baseline. The model accurately predicts tray-wise temperature and composition profiles under transient conditions, demonstrating its effectiveness in capturing column dynamics.
Implications
The findings suggest that the PINN digital twin can significantly enhance the monitoring, control, and optimization of distillation processes in industrial settings. Its ability to provide real-time insights and predictions could lead to improved operational efficiency and reduced energy consumption in distillation operations.
Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning
Multimodal
Theory
Efficient ML
- Introduces a unified framework for fault-tolerant multimodal representation learning.
- Develops a dual-regularization mechanism to balance sensitivity for anomaly detection and correction.
- Demonstrates improved performance on multimodal fault datasets compared to existing methods.
- Integrates theoretical insights on perturbation effects into practical applications.
Summary
This paper addresses the challenges of maintaining reliability in multimodal systems under conditions of partial sensor failures, signal degradation, or cross-modal inconsistencies. The authors propose a unified framework for fault-tolerant multimodal representation learning that integrates self-supervised anomaly detection and error correction. The framework is grounded in a theoretical analysis of perturbation propagation, leading to the development of Lipschitz- and Jacobian-based criteria to assess the impact of localized faults on neural operators. A two-stage self-supervised training scheme is introduced, which involves pre-training a multimodal convolutional autoencoder on clean data to retain localized anomaly signals in the latent space, followed by the addition of a learnable compute block for correction and contrastive objectives for anomaly identification. The authors also introduce layer-specific Lipschitz modulation and gradient clipping to manage sensitivity across detection and correction modules. Experimental results on multimodal fault datasets indicate that the proposed approach enhances both anomaly detection accuracy and reconstruction quality under sensor corruption, effectively bridging the gap between analytical robustness and practical fault-tolerant multimodal learning.
Methodology
The methodology involves a two-stage self-supervised training process: first, a multimodal convolutional autoencoder is pre-trained on clean data to preserve anomaly signals, and second, a learnable compute block is added for correction and contrastive learning. The framework employs Lipschitz modulation and gradient clipping to control sensitivity across different modules.
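Layer-specific Lipschitz control is commonly implemented by capping each weight matrix's spectral norm, which upper-bounds the layer's Lipschitz constant. The rescaling below is an assumed, simple realization of that idea, not the paper's exact modulation scheme.

```python
import numpy as np

def modulate_lipschitz(W, target):
    """Rescale a weight matrix so its spectral norm (largest singular
    value, an upper bound on the layer's Lipschitz constant) does not
    exceed a per-layer target; leave it untouched otherwise."""
    s = np.linalg.norm(W, 2)   # ord=2 on a matrix = spectral norm
    return W if s <= target else W * (target / s)
```

Giving detection layers a looser target than correction layers is one way to realize the dual-regularization balance the paper describes: sensitivity where anomalies must be preserved, contraction where perturbations must be damped.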
Results
The experimental evaluation shows that the proposed framework significantly improves anomaly detection accuracy and reconstruction performance in the presence of sensor corruption, outperforming existing fragmented methodologies.
Implications
The findings suggest that the proposed framework can enhance the reliability of multimodal systems in industrial and safety-critical environments, potentially leading to reduced downtime and operational inefficiencies. This approach could be applied in various sectors where multimodal data integration is crucial for fault tolerance.
Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning
Reinforcement Learning
Robotics
- Introduction of FB-MEBE, an online zero-shot RL algorithm for quadrupedal robots.
- Maximizes entropy of behavior distribution to enhance exploration and policy diversity.
- Combines unsupervised exploration with a regularization critic for physically plausible behaviors.
- Demonstrates improved performance in simulated tasks compared to other exploration strategies.
Read more
Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning
Summary
This paper addresses the challenge of zero-shot reinforcement learning (RL) in robotic systems, particularly focusing on quadrupedal control. The authors propose FB-MEBE, an online zero-shot RL algorithm that enhances behavior exploration by maximizing the entropy of the behavior distribution. Traditional methods often suffer from low-diversity data due to ineffective exploration strategies, leading to suboptimal performance in real-world applications. FB-MEBE combines an unsupervised behavior exploration strategy with a regularization critic to promote exploration and ensure that the learned policies are both effective and physically plausible. The empirical results demonstrate that FB-MEBE outperforms existing exploration strategies in various simulated tasks and allows for the direct deployment of learned policies onto real hardware without the need for additional fine-tuning. This work represents a significant advancement in the application of zero-shot RL to real robotic systems, showcasing the potential for efficient learning and deployment in complex environments.
Methodology
The authors developed FB-MEBE, which utilizes the Forward-Backward (FB) algorithm to facilitate online zero-shot RL. The method incorporates an unsupervised behavior exploration strategy that maximizes the entropy of the behavior distribution, guiding the agent towards less frequently visited behaviors. Additionally, a regularization critic is employed to ensure that the learned policies align with natural and physically plausible behaviors, thus maintaining zero-shot capabilities while being deployable in real-world scenarios.
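FB-MEBE's exact entropy objective is not spelled out in the summary; as an illustrative sketch only (names and the nearest-neighbor entropy proxy are assumptions, not the paper's method), maximum-entropy behavior selection can be approximated by sampling candidate behavior vectors and keeping the one farthest from previously visited behaviors, a standard particle-based entropy estimate:

```python
import numpy as np

def knn_novelty(z, visited, k=3):
    """Particle-based entropy proxy: mean distance from z to its k nearest
    previously visited behavior vectors (larger = rarer = higher entropy gain)."""
    d = np.linalg.norm(visited - z, axis=1)
    return float(np.sort(d)[:k].mean())

def pick_exploratory_behavior(visited, rng, n_candidates=64, dim=2, k=3):
    """Sample candidate behavior vectors and keep the one whose inclusion
    most increases the spread (entropy) of the behavior distribution."""
    cands = rng.normal(size=(n_candidates, dim))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)  # unit sphere, as in FB-style methods
    scores = [knn_novelty(z, visited, k) for z in cands]
    return cands[int(np.argmax(scores))]
```

In the paper, a regularization critic would additionally score each candidate so that exploration stays within physically plausible behaviors; that term is omitted here.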
Results
FB-MEBE was empirically validated in various simulated downstream tasks, showing significant improvements in performance over traditional exploration strategies. The policies generated by FB-MEBE exhibited smooth and natural behaviors, allowing for seamless deployment on real robotic hardware without the need for additional fine-tuning.
Implications
The findings suggest that FB-MEBE could enhance the development of general-purpose robotic agents capable of adapting to a wide range of tasks without extensive prior knowledge or data collection. This approach could streamline the deployment of RL in robotics, reducing the reliance on large, diverse datasets and enabling more efficient learning in real-world applications.
Process-Aware AI for Rainfall-Runoff Modeling: A Mass-Conserving Neural Framework with Hydrological Process Constraints
Interpretability
Time Series
- Embedding hydrological process constraints enhances interpretability in rainfall-runoff predictions.
- Vertical drainage significantly improves model performance in arid and snow-dominated basins.
- Process-aware AI models can achieve deep-learning predictive skill while retaining physical interpretability.
Read more
Process-Aware AI for Rainfall-Runoff Modeling: A Mass-Conserving Neural Framework with Hydrological Process Constraints
Summary
This paper presents a novel approach to rainfall-runoff modeling by integrating hydrological process constraints into a mass-conserving neural framework known as the Mass-Conserving Perceptron (MCP). The authors argue that while machine learning models can achieve high predictive accuracy, they often lack physical interpretability. The MCP framework allows for the enforcement of conservation principles while learning hydrological relationships from data. The study systematically introduces various hydrological processes, such as bounded soil storage and vertical drainage, into the MCP model to enhance its predictive skill and interpretability. The models were evaluated across 15 catchments in diverse hydroclimatic regions of the continental United States, focusing on daily streamflow predictions. Results indicate that incorporating physical processes generally improves model performance, with specific enhancements observed in arid and snow-dominated regions. The findings suggest that the best-performing MCP configurations can achieve predictive skills comparable to advanced deep learning models while maintaining interpretability, thus providing a promising direction for future hydrological modeling.
Methodology
The study employs a Mass-Conserving Perceptron (MCP) framework, progressively integrating hydrological processes such as soil storage, conductivity, porosity, infiltration capacity, surface ponding, and vertical drainage. The models were evaluated using daily streamflow data from 15 catchments across various hydroclimatic regions in the U.S.
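The MCP's exact cell equations are not reproduced in the summary; the following toy bucket model (parameter names and split fractions are illustrative assumptions) shows the core idea of a mass-conserving cell with bounded storage, vertical drainage, and runoff, where inflow is exactly accounted for by storage change plus outflows:

```python
def mcp_step(storage, precip, s_max=100.0, k_drain=0.05, k_runoff=0.3):
    """One step of a toy mass-conserving cell: inflow is split among storage
    change, vertical drainage, and runoff, so mass is conserved exactly."""
    storage = storage + precip                # add inflow
    overflow = max(0.0, storage - s_max)      # bounded soil storage (surface ponding spills)
    storage -= overflow
    drainage = k_drain * storage              # vertical drainage
    runoff = k_runoff * storage + overflow    # fast runoff plus spill
    storage -= drainage + k_runoff * storage
    return storage, runoff, drainage
```

By construction, `new_storage + runoff + drainage == old_storage + precip` at every step; in the MCP the split fractions would be learned gating functions rather than fixed constants.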
Results
The results demonstrate that augmenting the MCP's internal physical structure generally leads to improved predictive performance. Specifically, vertical drainage enhances skill in arid and snow-dominated regions but may reduce performance in rainfall-dominated areas. The best MCP configurations approach the predictive skill of Long Short-Term Memory (LSTM) models while maintaining explicit physical interpretability.
Implications
This research indicates that integrating physical constraints into AI models can bridge the gap between high predictive accuracy and interpretability in hydrological modeling. Such models could be beneficial for applications in flood forecasting, drought assessment, and water resource management.
Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback
Large Language Models
Reinforcement Learning
Theory
- Introduces a new framework for evaluating LLMs' search capabilities using external feedback.
- Demonstrates that Transformers can represent and approximate various search strategies.
- Shows that targeted training can significantly improve LLM performance in search tasks.
- Highlights the limitations of current LLMs compared to traditional search algorithms.
Read more
Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback
Summary
This paper explores the potential of Large Language Models (LLMs) to approximate search algorithms within a structured search space represented as a tree. The authors introduce a novel framework termed 'unknown tree search with bandit feedback,' which allows for controlled evaluation of LLMs' search capabilities by providing external expansions and feedback. The study demonstrates that Transformers can theoretically represent various search strategies and can be trained to imitate these strategies effectively. Empirical results indicate that while current LLMs underperform compared to established search algorithms, targeted training focused on search under uncertainty can significantly enhance their performance. The findings suggest that LLMs have the potential to improve problem-solving capabilities when appropriately trained, bridging the gap between LLMs and specialized search algorithms.
Methodology
The authors developed a simplified framework for 'unknown tree search with bandit feedback,' where tree expansions and feedback are externally specified. They conducted theoretical analyses to show the expressiveness of Transformers and performed empirical studies to evaluate LLMs' ability to approximate search strategies and generalize to unseen conditions.
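The precise environment definition is not given in the summary; a minimal toy version of unknown tree search with bandit feedback (all details below, including the noise scales and value drift, are illustrative assumptions) contrasts the uniform and greedy baselines mentioned in the results:

```python
import random

def tree_search(strategy, depth=6, branching=2, n_expansions=40, seed=0):
    """Toy 'unknown tree search with bandit feedback': children are revealed
    only on expansion, and node values are observed through bandit-style noise.
    `strategy` is 'uniform' (random frontier node) or 'greedy' (best observed)."""
    rng = random.Random(seed)
    frontier = [(0.5, 0)]                       # (true value, depth) of the root
    best = 0.0
    for _ in range(n_expansions):
        if not frontier:
            break
        obs = [(v + rng.gauss(0, 0.1), i) for i, (v, _) in enumerate(frontier)]
        idx = max(obs)[1] if strategy == "greedy" else rng.randrange(len(frontier))
        v, d = frontier.pop(idx)
        best = max(best, v)
        if d < depth:                           # expand: children drift from the parent value
            frontier += [(min(1.0, max(0.0, v + rng.gauss(0, 0.15))), d + 1)
                         for _ in range(branching)]
    return best
```

An LLM-based searcher would replace the `strategy` rule with a model that reads the expansion history and picks the next frontier node; the paper's framework externalizes expansions and feedback exactly so that this choice is the only thing being evaluated.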
Results
The study found that while Transformers can be trained to imitate search strategies, existing LLMs still lag behind traditional algorithms like uniform and greedy sampling in terms of performance. However, fine-tuning LLMs on search trajectories significantly improved their effectiveness, demonstrating the potential for enhanced problem-solving capabilities.
Implications
The findings suggest that LLMs could be developed into more effective problem-solving agents by integrating search capabilities directly into their training processes. This could lead to advancements in applications requiring complex decision-making and exploration in uncertain environments.
Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization
Reinforcement Learning
Large Language Models
Optimization
- Development of a Transformer-GNN architecture for offline RL that improves throughput by 2.4%.
- LLMs require substantial task-specific adaptation; prompting alone is inadequate.
- Supervised fine-tuning and preference optimization enable LLMs to match historical performance.
- Iterative feedback loops can facilitate human-AI collaboration in decision-making.
Read more
Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization
Summary
This paper explores machine learning techniques for optimizing staffing decisions in semi-automated warehouse sortation systems. The authors evaluate two main approaches: offline reinforcement learning (RL) using a custom Transformer-based Graph Neural Network (GNN) and large language models (LLMs) that operate on abstracted, human-readable state descriptions. The offline RL approach achieves a 2.4% improvement in throughput over historical baselines by leveraging detailed historical data and modeling interactions between system components. In contrast, the LLMs, which are more aligned with the operational reasoning of warehouse managers, require significant task-specific adaptation. The study finds that while simple prompting is insufficient, supervised fine-tuning combined with Direct Preference Optimization allows LLMs to match or slightly exceed historical performance. The authors also introduce an iterative feedback loop that simulates manager preferences, laying the groundwork for future human-AI collaborative learning. Overall, the findings indicate that both methods can effectively support operational decision-making in warehouse staffing, highlighting the importance of aligning AI systems with human expertise.
Methodology
The authors employ two main methodologies: (1) Offline reinforcement learning using a Transformer-GNN architecture to model detailed state representations and optimize staffing decisions based on historical data, and (2) Large language models that process abstracted, human-readable state descriptions, evaluated through various prompting and fine-tuning strategies, including Direct Preference Optimization to simulate manager feedback.
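The Direct Preference Optimization step uses the standard DPO objective (Rafailov et al.); a single-pair sketch, where `logp_w`/`logp_l` are the policy's log-probabilities of the manager-preferred and rejected staffing plans and `ref_logp_*` come from a frozen reference model, looks like:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: pushes the policy to prefer
    the chosen plan over the rejected one, relative to the reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

The iterative feedback loop in the paper would generate these (chosen, rejected) pairs from simulated manager preferences and re-run this optimization; the variable names here are illustrative.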
Results
The offline RL approach achieved a 2.4% throughput improvement over historical decision-making baselines in learned simulators. The LLMs, when fine-tuned and optimized for preferences, matched or slightly exceeded historical performance, demonstrating the effectiveness of both methods in optimizing staffing decisions.
Implications
The findings suggest that integrating AI systems into warehouse staffing can lead to significant operational efficiencies. The ability to simulate manager feedback allows for continuous improvement and adaptation of AI systems, potentially enhancing decision-making processes in logistics and other operational domains.
How Class Ontology and Data Scale Affect Audio Transfer Learning
Audio & Speech
- Transfer learning performance in audio tasks is significantly influenced by the similarity between pre-training and downstream tasks.
- Increasing the number of samples and classes in pre-training data positively impacts transfer learning, but not as much as task similarity.
- The study provides a comprehensive analysis of various subsets of AudioSet for pre-training DNNs.
- Findings challenge the assumption that larger datasets with broader ontologies are always the best choice for pre-training in audio tasks.
Read more
How Class Ontology and Data Scale Affect Audio Transfer Learning
Summary
This paper investigates the effects of class ontology and data scale on audio transfer learning (TL), focusing on the performance of deep neural networks (DNNs) when pre-trained on various subsets of AudioSet and fine-tuned on three specific computer audition tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. The authors highlight the importance of understanding the factors that contribute to effective TL, particularly in the audio domain, which has been less explored compared to computer vision. The study reveals that while increasing the number of samples and classes in pre-training data generally enhances TL performance, the similarity between pre-training and downstream tasks is a more significant factor. This finding challenges the assumption that larger and more diverse datasets are always optimal for pre-training, suggesting that task similarity may lead to better feature learning. The authors provide a set of pre-trained model states and analyze how task similarity and data characteristics impact TL outcomes, contributing valuable insights to the field of audio machine learning.
Methodology
The authors conducted a rigorous study by pre-training various model states on ontology-based subsets of AudioSet and fine-tuning them on three distinct computer audition tasks. They systematically varied the number of samples and classes in the pre-training data and analyzed the impact of these factors on transfer learning performance.
Results
The results indicate that while increasing the number of samples and classes in the pre-training dataset generally improves transfer learning outcomes, the similarity between the pre-training tasks and the downstream tasks is the most critical factor affecting model performance. This suggests that models can learn more relevant features when the tasks are closely aligned.
Implications
The findings have significant implications for the development of audio models, suggesting that researchers should prioritize task similarity over merely increasing dataset size and diversity when pre-training. This could lead to more efficient and effective audio recognition systems.
Flow matching on homogeneous spaces
Generative Models
Theory
Efficient ML
- Introduces a framework for flow matching on homogeneous spaces by lifting to Lie groups.
- Avoids complex geometry by simplifying the problem to Euclidean flow matching on Lie algebras.
- Eliminates the need for premetrics or geodesics, making the approach simpler and faster.
- Demonstrates the framework's effectiveness through case studies on specific homogeneous spaces.
Read more
Flow matching on homogeneous spaces
Summary
This paper presents a novel framework for extending Flow Matching to homogeneous spaces, specifically quotients of Lie groups. The proposed method reformulates the flow matching problem as one on the underlying Lie group by lifting data distributions, thereby simplifying the complex geometry of homogeneous spaces. This approach allows the problem to be reduced to a Euclidean flow matching task on Lie algebras, eliminating the need for premetrics or geodesics, which are often computationally intensive. The author revisits existing flow matching techniques on Lie groups and introduces a more conceptual and implementable formulation. The framework is demonstrated through two case studies: SL(2, R)/SO(2, R) and SO(3, R)/SO(2, R), showcasing its applicability in generative modeling on these specific homogeneous spaces.
Methodology
The methodology involves reformulating flow matching as a task on Lie groups by lifting data distributions, which allows for a reduction to Euclidean flow matching on Lie algebras. The loss function is adapted to accommodate this new framework, facilitating the learning of vector fields that push noise distributions to target distributions without the need for ODE simulations.
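Once the problem is lifted to the Lie algebra, the loss is ordinary Euclidean conditional flow matching on vector-space coordinates. A minimal sketch (the straight-line probability path is the standard choice; the paper's adapted loss may differ in details):

```python
import numpy as np

def cfm_loss(model, x0, x1, t):
    """Euclidean conditional flow matching: along the straight path
    x_t = (1 - t) * x0 + t * x1, the target velocity is x1 - x0, and the
    model v(x_t, t) is regressed onto it. On a Lie algebra these are just
    coordinates in a vector space, so the same loss applies directly."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = model(xt, t)
    return float(np.mean((pred - target) ** 2))
```

No premetric, geodesic, or ODE simulation appears anywhere in this objective, which is the computational simplification the paper emphasizes.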
Results
The proposed framework successfully demonstrates flow matching on two homogeneous spaces, SL(2, R)/SO(2, R) and SO(3, R)/SO(2, R). The results indicate that the method is computationally efficient and conceptually clearer than previous approaches, particularly in avoiding the complexities associated with Riemannian geometry.
Implications
This work has potential implications for generative modeling in various applications where data is structured on homogeneous spaces, such as in information geometry and manifold learning. The simplification of flow matching could lead to more efficient algorithms in generative models, enhancing their scalability and applicability.
Epistemic Compression: The Case for Deliberate Ignorance in High-Stakes AI
Theory
- Foundation models struggle in high-stakes environments due to the Fidelity Paradox, where they memorize noise instead of capturing useful signals.
- Epistemic Compression emphasizes the importance of model architecture in enforcing parsimony to enhance robustness.
- The Regime Index effectively differentiates between Shifting and Stable Regimes, guiding model complexity decisions.
- High-capacity models can lead to overfitting in unstable environments, necessitating a focus on simpler, more robust models.
Read more
Epistemic Compression: The Case for Deliberate Ignorance in High-Stakes AI
Summary
This paper addresses the limitations of foundation models in high-stakes environments such as medicine and finance, where they often fail due to their tendency to memorize noise rather than capture meaningful signals. The author introduces the concept of 'Epistemic Compression,' which posits that robustness in AI models arises from aligning model complexity with the 'shelf life' of the data, rather than merely increasing model parameters. This principle is operationalized through a Regime Index that distinguishes between Shifting Regimes (unstable, data-poor environments where simplicity is advantageous) and Stable Regimes (invariant, data-rich environments where complexity can be beneficial). The findings suggest that high-capacity models are prone to overfitting in unstable contexts, and a shift towards principled parsimony is necessary for effective AI deployment in high-stakes scenarios. The paper synthesizes insights from 15 high-stakes domains, demonstrating that the proposed Regime Index aligns with the most effective modeling strategies in 86.7% of cases, advocating for a paradigm shift in AI model design.
Methodology
The paper introduces the Regime Index to categorize environments into Shifting and Stable Regimes, and it synthesizes empirical data from 15 high-stakes domains to validate the effectiveness of Epistemic Compression in enhancing model robustness.
Results
The application of the Regime Index was concordant with the empirically superior modeling strategy in 86.7% of the examined cases, indicating that aligning model complexity with data stability significantly improves performance in high-stakes AI applications.
Implications
The findings suggest that AI practitioners should prioritize model simplicity and robustness over sheer complexity, particularly in high-stakes fields like medicine and finance, where the cost of failure is high. This could lead to more reliable AI systems that are better equipped to handle real-world variability.
CVA: Context-aware Video-text Alignment for Video Temporal Grounding
Computer Vision
Multimodal
- Introduction of Query-aware Context Diversification (QCD) to enhance data augmentation.
- Development of Context-invariant Boundary Discrimination (CBD) loss for improved semantic consistency.
- Design of Context-enhanced Transformer Encoder (CTE) for capturing multi-scale temporal context.
- Achieves state-of-the-art performance on major Video Temporal Grounding benchmarks.
Read more
CVA: Context-aware Video-text Alignment for Video Temporal Grounding
Summary
The paper presents Context-aware Video-text Alignment (CVA), a framework designed to improve video temporal grounding by enhancing video-text alignment while mitigating the influence of irrelevant background context. The authors introduce three main components: Query-aware Context Diversification (QCD), a data augmentation strategy that mixes in only semantically unrelated content, thus preventing false negatives; Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that maintains semantic consistency at challenging temporal boundaries; and Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that uses windowed self-attention and bidirectional cross-attention to capture multi-scale temporal context. Together, these components allow CVA to achieve state-of-the-art performance on benchmarks such as QVHighlights and Charades-STA, with significant gains in Recall@1 that demonstrate its effectiveness at video-text alignment.
Methodology
The methodology involves a combination of innovative data augmentation techniques and a novel transformer architecture. QCD ensures that only semantically unrelated clips are used for training, while CBD loss focuses on maintaining semantic consistency at temporal boundaries. CTE employs a hierarchical structure that integrates windowed self-attention and bidirectional cross-attention to effectively model the temporal dynamics of video content.
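The exact form of the CBD loss is not given in the summary; contrastive objectives of this kind are typically InfoNCE-style, pulling a boundary clip's embedding toward the query while pushing away unrelated context clips. A generic sketch (not the paper's loss, just the standard template it likely instantiates):

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.07):
    """Generic InfoNCE contrastive loss: pull the boundary clip embedding
    toward the query embedding and push context (negative) clips away."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(query, positive)] + [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(p[0]))
```

QCD's role upstream of a loss like this would be to guarantee that the negatives are genuinely semantically unrelated, avoiding the false-negative problem the summary describes.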
Results
CVA achieves state-of-the-art performance on major benchmarks for Video Moment Retrieval and Highlight Detection, with a notable improvement of around 5 points in Recall@1 scores compared to existing methods, indicating enhanced precision in video-text alignment.
Implications
The findings suggest that CVA can significantly improve user experience in video content retrieval by providing more accurate temporal grounding, which is crucial for applications in video search engines, content recommendation systems, and multimedia information retrieval.
Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem
Reinforcement Learning
Optimization
- Introduces a novel Decision Transformer framework for the Traveling Salesman Problem (TSP).
- Demonstrates that conditioning on appropriate Return-to-Go (RTG) is crucial for surpassing heuristic performance.
- Employs expectile regression to improve the quality of solutions in combinatorial optimization.
- Shows that offline RL can effectively leverage heuristic datasets to generate superior solutions.
Read more
Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem
Summary
This paper addresses the challenges of solving combinatorial optimization problems, specifically the Traveling Salesman Problem (TSP), using a novel approach that leverages offline reinforcement learning (RL) through Decision Transformers (DT). Traditional methods often rely on online RL, which can be resource-intensive and inefficient for real-world applications. The authors propose a framework that utilizes datasets of heuristic solutions to train a model that not only imitates these heuristics but also aims to outperform them. Key innovations include the integration of a Pointer Network to manage the variable action space of node selection and the use of expectile regression for optimistic Return-to-Go (RTG) predictions. The experiments demonstrate that the proposed method consistently yields higher-quality tours than the classical heuristics it was trained on, showcasing the potential of offline RL to enhance combinatorial optimization solutions by effectively utilizing existing domain knowledge.
Methodology
The authors adapt the Decision Transformer framework to the TSP by modeling trajectories that include observations, actions, and RTG. They redefine the state representation to include node embeddings and utilize a Pointer Network to select nodes based on attention mechanisms, allowing for a more flexible action selection process. Expectile regression is employed to predict optimistic RTG values, enhancing the model's ability to generate high-quality tours.
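Expectile regression uses a standard asymmetric squared loss; for a single prediction it can be written as follows (with tau above 0.5, under-predicting the Return-to-Go costs more than over-predicting, which is what makes the RTG estimate optimistic):

```python
def expectile_loss(pred, target, tau=0.9):
    """Asymmetric expectile loss: with tau > 0.5, under-predictions of the
    Return-to-Go are penalized more, yielding optimistic RTG estimates."""
    u = target - pred
    weight = tau if u > 0 else 1.0 - tau
    return weight * u * u
```

The specific value tau = 0.9 here is illustrative; the paper's chosen expectile level is not stated in the summary.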
Results
The proposed method consistently outperformed four classical heuristics in generating TSP solutions, demonstrating its effectiveness in utilizing offline RL to synthesize and exceed existing heuristic strategies. The results indicate that the integration of Pointer Networks and expectile regression significantly contributes to the quality of the solutions produced.
Implications
This research suggests that offline RL frameworks can be powerful tools for solving complex combinatorial optimization problems, potentially transforming how industries approach NP-hard problems like the TSP. By leveraging existing heuristic knowledge, the proposed method could lead to more efficient and effective optimization strategies in logistics, manufacturing, and other fields.
A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study
Time Series
Interpretability
- Attention mechanisms can effectively enhance interpretability in clinical predictive models.
- Black-box interpreters like KernelSHAP and LIME are not suitable for time-series clinical prediction tasks.
- Many interpretability approaches lack reliability and cannot be trusted for clinical applications.
- The study provides guidelines for improving interpretability in clinical predictive workflows.
Read more
A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study
Summary
This paper addresses the critical need for interpretability in deep clinical predictive models, particularly in time-series data, where clinical decisions require explicit justification. The authors present a comprehensive benchmark that evaluates various interpretability methods across different clinical prediction tasks and model architectures. They explore whether architectural features, such as attention mechanisms, enhance explainability and whether interpretability approaches generalize across tasks. The study reveals that attention mechanisms, when used correctly, significantly improve interpretability. In contrast, black-box interpreters like KernelSHAP and LIME are found to be computationally infeasible for time-series tasks, and several interpretability methods are deemed unreliable. The authors provide guidelines for enhancing interpretability in clinical predictive pipelines and offer their implementations through PyHealth, an open-source framework aimed at supporting reproducibility and extensibility in research.
Methodology
The authors developed an interpretability benchmark that evaluates various interpretability methods based on their scalability across patient events and populations, as well as their faithfulness to model predictions. They compared attention-based models with non-attention-based models and assessed the performance of different interpretability approaches across diverse clinical prediction tasks.
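The "attention as interpretation" idea the benchmark evaluates can be sketched with simple attention pooling over a patient's event sequence, where the softmax weights double as per-event importance scores (shapes and names here are illustrative, not the benchmark's API):

```python
import numpy as np

def attention_importance(h, w):
    """Attention pooling over a patient's event sequence: the softmax scores
    both weight the pooled representation and serve as per-event importance.
    h: (T, D) event embeddings; w: (D,) learned attention vector."""
    scores = h @ w                                  # (T,) relevance per event
    scores -= scores.max()                          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights sum to 1
    pooled = alpha @ h                              # (D,) weighted patient summary
    return pooled, alpha
```

Because `alpha` is computed inside the forward pass, reading it off is essentially free, which is why attention-based interpretation scales where post-hoc black-box interpreters like KernelSHAP and LIME (which require many perturbed forward passes per explanation) do not.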
Results
The analysis showed that attention mechanisms, when properly utilized, are efficient for interpreting model predictions. It was also found that black-box interpreters are computationally infeasible for time-series tasks, and several interpretability methods were unreliable. The study emphasizes the need for robust interpretability methods in clinical settings.
Implications
The findings highlight the importance of model interpretability in clinical decision-making, suggesting that attention mechanisms should be prioritized in model design. The guidelines provided can help improve the trustworthiness of clinical predictive models, ultimately aiding in their deployment in real-world healthcare settings.
The Order Is The Message
Theory
Efficient ML
Interpretability
- Example ordering in neural network training is a significant information channel, not just a nuisance variable.
- Counterfactual gradient decomposition reveals that ordering contributes approximately 85% to the cumulative gradient norm.
- Consistent ordering can enhance learning efficiency, achieving high accuracy with significantly less training data compared to IID shuffling.
- The study highlights the importance of temporal structure in learning, which is often overlooked in traditional training methodologies.
Read more
The Order Is The Message
Summary
This paper investigates the role of example ordering in neural network training, challenging the conventional IID (independent and identically distributed) assumption that treats ordering as a nuisance variable. The author demonstrates that the order of training examples acts as an information channel, significantly influencing the training process. Through counterfactual gradient decomposition, it is shown that approximately 85% of the cumulative gradient norm during training epochs is dependent on the ordering strategy. The study reveals that while IID shuffling leads to incoherent contributions from the ordering channel, consistent ordering can constructively interfere, enhancing feature acquisition. In a controlled experiment on modular arithmetic, two fixed-ordering strategies achieved 99.5% test accuracy with minimal training data, while the IID baseline performed poorly. The findings suggest that the sequential structure of training examples carries critical information that is not present in individual examples, raising important considerations for training efficiency and safety in machine learning.
Methodology
The paper employs counterfactual gradient decomposition to analyze the impact of example ordering on the training process. It conducts controlled experiments to compare the performance of various ordering strategies, including IID shuffling and fixed-ordering approaches, on a modular arithmetic task.
Results
The experiments demonstrated that two fixed-ordering strategies achieved 99.5% test accuracy after 487 and 659 epochs, respectively, using only 0.3% of the input space. In contrast, the IID baseline achieved only 0.30% accuracy after 5,000 epochs. An adversarial ordering strategy completely suppressed learning, highlighting the critical role of ordering in model performance.
Implications
The findings suggest that reconsidering the IID assumption in training could lead to more efficient learning strategies and improved model performance. Additionally, the active ordering channel raises concerns about safety and alignment in machine learning, necessitating further exploration of its implications.
Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML
Efficient ML
Time Series
Audio & Speech
- HYPERTINYPW replaces stored PW weights with generated weights to reduce memory usage.
- The method maintains compatibility with standard integer operators, ensuring efficient inference.
- Validation on ECG benchmarks shows significant compression without sacrificing performance.
- The approach enforces a shared latent basis across layers, reducing redundancy.
Read more
Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML
Summary
The paper introduces HYPERTINYPW, a novel approach to compressing neural networks for deployment on microcontrollers (MCUs) by replacing most stored pointwise (PW) weights with generated weights. This method utilizes a shared micro-MLP to synthesize PW kernels from tiny per-layer codes at load time, while retaining the first PW layer in INT8 format to stabilize early mixing. The approach aims to address the significant memory constraints faced by MCUs, where traditional compression techniques often fail to reduce the memory footprint of PW layers effectively. HYPERTINYPW maintains compatibility with existing integer-only inference frameworks, ensuring that the synthesis process incurs minimal overhead. The authors validate their method on three ECG benchmark datasets, demonstrating that HYPERTINYPW achieves a substantial reduction in model size—compressing a 1.4 MB CNN to approximately 225 kB—while retaining over 95% of the original model's performance. This work highlights the potential of compression-as-generation strategies in resource-constrained environments, paving the way for more efficient deployment of machine learning models in TinyML applications.
Methodology
The methodology involves a compression-as-generation approach where a shared micro-MLP synthesizes most PW kernels from compact per-layer codes at load time. The first PW layer is kept in INT8 format to anchor early morphology-sensitive mixing. The generated weights are cached for reuse during inference, which is performed using standard integer operations to ensure compatibility with existing TinyML frameworks.
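The compression-as-generation idea above can be sketched in a few lines: a single shared micro-MLP maps a tiny per-layer code to a full PW kernel once at load time, and the generated kernels are cached so inference never re-runs the generator. The network sizes, weight values, and function names below are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: a shared micro-MLP synthesizes PW kernels from tiny
# per-layer codes at load time; kernels are cached for reuse at inference.
import math

def micro_mlp(code, w1, w2):
    """Shared generator: code -> hidden (tanh) -> flattened PW kernel."""
    hidden = [math.tanh(sum(c * w for c, w in zip(code, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def generate_and_cache(layer_codes, w1, w2):
    # Done once at model-load time; inference afterwards only reads the cache.
    return {name: micro_mlp(code, w1, w2) for name, code in layer_codes.items()}

# Toy shapes: code dim 2, hidden dim 3, kernel with 4 weights.
w1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.7]]          # 3 hidden units x 2 inputs
w2 = [[0.2, 0.1, -0.1], [0.0, 0.3, 0.2],
      [0.4, -0.2, 0.1], [-0.1, 0.1, 0.3]]            # 4 outputs x 3 hidden
codes = {"pw2": [1.0, 0.0], "pw3": [0.0, 1.0]}       # tiny per-layer codes
cache = generate_and_cache(codes, w1, w2)
```

The memory saving comes from storing only the codes plus one shared generator instead of every PW kernel; the first PW layer would stay as stored INT8 weights, as the summary notes.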
Results
HYPERTINYPW compresses a baseline CNN model from approximately 1.4 MB to around 225 kB, achieving a 6.31× reduction in size (84.15% fewer bytes) while retaining at least 95% of the macro-F1 score on the tested ECG datasets. The method sustains balanced detection performance under tight memory budgets of 32–64 kB, where traditional compact models tend to degrade.
Implications
The findings suggest that HYPERTINYPW can significantly enhance the deployment of deep learning models in resource-constrained environments, such as wearable devices and embedded systems. The compression-as-generation strategy could be applied to various domains beyond ECG, including on-device speech recognition and other biosignal processing tasks, indicating a broader applicability in TinyML.
A Unified Memory Perspective for Probabilistic Trustworthy AI
Theory
Efficient ML
- Introduces a unified probabilistic memory abstraction for analyzing deterministic and stochastic operations.
- Identifies a scaling mismatch between compute throughput, memory bandwidth, and entropy generation, leading to 'entropy wall' issues.
- Examines architectural trade-offs between conventional von Neumann systems and emerging probabilistic compute-in-memory approaches.
- Defines memory-level evaluation criteria to assess the effectiveness of memory systems in probabilistic computation.
Read more
A Unified Memory Perspective for Probabilistic Trustworthy AI
Summary
This paper presents a unified perspective on the interaction between probabilistic computation and memory access in trustworthy AI systems. As AI applications increasingly rely on probabilistic methods for robustness, interpretability, and security, the demand for stochastic sampling has grown, shifting performance bottlenecks from arithmetic units to memory systems. The authors propose a framework that treats deterministic data access as a limiting case of stochastic sampling, allowing for a common analysis of both modes. They introduce memory-level evaluation criteria, including unified operation, distribution programmability, efficiency, robustness to hardware non-idealities, and parallel compatibility. The paper highlights the limitations of conventional architectures and explores emerging probabilistic compute-in-memory (CIM) approaches that integrate sampling with memory access, paving the way for scalable hardware solutions for trustworthy AI.
Methodology
The authors utilize a theoretical framework to analyze the interplay between probabilistic computation and memory access. They define evaluation criteria for memory systems and conduct a comparative analysis of conventional architectures and emerging probabilistic compute-in-memory systems.
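The unifying abstraction, in which a deterministic read is the limiting case of stochastic sampling, can be illustrated with a toy memory cell that stores a distribution and samples from it on every read; a point-mass distribution recovers an ordinary deterministic read. The class and method names are illustrative assumptions.

```python
# Hypothetical sketch: a memory cell holds a distribution; read() draws a
# sample. A point-mass distribution makes read() deterministic, so both
# access modes share one path, as in the paper's unified abstraction.
import random

class ProbabilisticCell:
    def __init__(self, values, probs, seed=0):
        self.values, self.probs = values, probs
        self.rng = random.Random(seed)

    def read(self):
        return self.rng.choices(self.values, weights=self.probs, k=1)[0]

deterministic = ProbabilisticCell([7], [1.0])        # point mass: plain read
stochastic = ProbabilisticCell([0, 1], [0.5, 0.5])   # Bernoulli sampler
det_reads = [deterministic.read() for _ in range(5)]
sto_reads = [stochastic.read() for _ in range(100)]
```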
Results
The analysis reveals that increasing stochastic demand can lead systems to operate in entropy-limited regimes, highlighting the need for improved memory architectures that can efficiently support both data access and stochastic sampling. The proposed criteria provide a basis for evaluating and designing future memory systems tailored for probabilistic workloads.
Implications
The findings suggest that as AI systems become more reliant on probabilistic methods, there is a critical need for hardware innovations that can effectively manage the demands of stochastic computation. This could lead to advancements in trustworthy AI applications across various domains, including healthcare, autonomous systems, and security.
Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring
Time Series
- Introduces U-Balance, a novel approach for rebalancing imbalanced datasets in CPS safety monitoring.
- Utilizes behavioral uncertainty as a key signal correlated with safety outcomes.
- Demonstrates significant improvement in safety prediction performance over traditional methods.
- Achieves a notable F1 score of 0.806 on a challenging UAV dataset.
Read more
Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring
Summary
This paper addresses the challenge of safety monitoring in Cyber-Physical Systems (CPS), particularly in the context of Unmanned Aerial Vehicles (UAVs), where unsafe events are rare, leading to significant class imbalance in datasets. Traditional rebalancing techniques like SMOTE and class weighting have proven ineffective for time-series CPS telemetry, often resulting in unrealistic synthetic samples or overfitting. The authors propose a novel approach called U-Balance, which utilizes behavioral uncertainty—defined as the degree of doubt in CPS decisions—to enhance label rebalancing. U-Balance first trains a GatedMLP-based uncertainty predictor that summarizes telemetry data into distributional features and outputs an uncertainty score. It then employs an uncertainty-guided label rebalancing (uLNR) mechanism that probabilistically relabels safe-labeled telemetry windows with high uncertainty as unsafe, enriching the minority class without generating new data. The effectiveness of U-Balance is evaluated on a UAV benchmark with a 46:1 safe-to-unsafe ratio, demonstrating a significant correlation between behavioral uncertainty and safety outcomes. U-Balance achieves an F1 score of 0.806, outperforming the strongest baseline by 14.3 percentage points while maintaining competitive inference efficiency. The study highlights the importance of leveraging behavioral uncertainty for dataset rebalancing in CPS safety monitoring, marking a significant advancement in the field.
Methodology
The methodology involves training a GatedMLP-based uncertainty predictor to extract distributional kinematic features from telemetry data and generate uncertainty scores. The uncertainty-guided label rebalancing (uLNR) mechanism is then applied to relabel safe telemetry windows with high uncertainty as unsafe, effectively enriching the minority class prior to training the safety predictor.
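The uLNR step described above can be sketched directly: safe-labeled windows whose predicted behavioral uncertainty exceeds a threshold are probabilistically relabeled as unsafe, enriching the minority class without synthesizing any new samples. The threshold and flip probability below are assumptions for illustration, not the paper's values.

```python
# Hypothetical sketch of uncertainty-guided label rebalancing (uLNR):
# high-uncertainty safe windows are probabilistically flipped to unsafe.
import random

def ulnr_relabel(labels, uncertainty, threshold=0.8, flip_prob=0.9, seed=0):
    rng = random.Random(seed)
    relabeled = []
    for y, u in zip(labels, uncertainty):
        # Only safe-labeled (0) windows with high uncertainty are candidates.
        if y == 0 and u >= threshold and rng.random() < flip_prob:
            relabeled.append(1)            # relabel as unsafe
        else:
            relabeled.append(y)
    return relabeled

labels      = [0, 0, 0, 0, 1]              # imbalanced: one true unsafe window
uncertainty = [0.1, 0.95, 0.9, 0.2, 0.99]  # scores from the uncertainty predictor
new_labels = ulnr_relabel(labels, uncertainty)
```

Note that no telemetry is generated or modified; only labels change, which is what distinguishes this from SMOTE-style oversampling.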
Results
U-Balance was evaluated on a UAV dataset with a 46:1 safe-to-unsafe ratio, achieving an F1 score of 0.806, which is a 14.3 percentage point improvement over the strongest baseline. The results confirm a moderate but significant correlation between behavioral uncertainty and safety, validating the effectiveness of the uLNR strategy.
Implications
The findings suggest that incorporating behavioral uncertainty into safety monitoring can significantly enhance the predictive performance of machine learning models in CPS. This approach could be applied to various CPS applications, improving safety measures in dynamic and unpredictable environments.
Causal-INSIGHT: Probing Temporal Models to Extract Causal Structure
Time Series
Interpretability
Graph Learning
- Causal-INSIGHT is a model-agnostic framework for extracting causal structures from temporal predictors.
- The framework utilizes input clamping to analyze model responses and construct directed temporal influence signals.
- Qbic, a new graph selection criterion, balances predictive accuracy and structural complexity.
- Causal-INSIGHT shows competitive performance across various architectures and improves temporal delay localization.
Read more
Causal-INSIGHT: Probing Temporal Models to Extract Causal Structure
Summary
Causal-INSIGHT is a novel, model-agnostic framework designed to extract causal structures from trained temporal predictors in multivariate time series data. The framework addresses the challenge of interpreting complex dynamical systems by analyzing how a fixed, pre-trained model responds to systematic input clamping during inference. This approach allows for the construction of directed temporal influence signals that reflect the dependencies utilized by the predictor for making predictions. A key innovation of Causal-INSIGHT is the introduction of Qbic, a sparsity-aware graph selection criterion that balances predictive fidelity with structural complexity without requiring ground-truth graph labels. The framework operates purely at inference time, making it applicable across various architectures without necessitating modifications to the models themselves. Experimental results demonstrate that Causal-INSIGHT generalizes well across different backbone architectures, maintains competitive structural accuracy, and significantly improves temporal delay localization when applied to existing predictors. This work highlights the importance of interpretability in high-stakes applications where understanding model predictions is crucial.
Methodology
Causal-INSIGHT employs a two-stage process: first, it probes a pre-trained temporal predictor using input clamping to observe how changes in input affect predictions. Second, it constructs causal influence signals from these observations, which are then used to create directed temporal graphs using the Qbic scoring method, balancing predictive utility and sparsity.
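The clamping probe in stage one can be sketched as follows: clamp one input channel of a frozen predictor to a baseline value, measure how each output channel's prediction shifts, and threshold the resulting influence matrix into a directed graph. The toy linear predictor stands in for any trained temporal model; it and the threshold are assumptions for illustration.

```python
# Hypothetical sketch of the input-clamping probe behind Causal-INSIGHT.
def predictor(window):
    # Toy frozen model over 3 channels: y0 depends on x1, y1 on x1, y2 on x0.
    x0, x1, _ = (w[-1] for w in window)
    return [2.0 * x1, 0.5 * x1 + 1.0, -1.0 * x0]

def influence_matrix(f, window, clamp_value=0.0):
    base = f(window)
    scores = []
    for i in range(len(window)):            # clamp source channel i
        clamped = [list(w) for w in window]
        clamped[i] = [clamp_value] * len(clamped[i])
        out = f(clamped)
        scores.append([abs(o - b) for o, b in zip(out, base)])
    return scores                           # scores[i][j]: influence of i on j

def to_graph(scores, tau=1e-6):
    return {(i, j) for i, row in enumerate(scores)
            for j, s in enumerate(row) if s > tau}

window = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # 3 channels, 2 time steps
edges = to_graph(influence_matrix(predictor, window))
```

The probe recovers exactly the dependencies the predictor uses: clamping channel 2 changes nothing, so it gains no outgoing edges, while channels 0 and 1 produce directed edges to the outputs they drive.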
Results
Experiments indicate that Causal-INSIGHT effectively generalizes across different temporal prediction architectures, achieving competitive structural accuracy and significantly enhancing the localization of temporal delays in predictions compared to existing methods.
Implications
The framework has significant implications for fields such as neuroscience, healthcare, and finance, where understanding the causal relationships and temporal dependencies in data is critical for decision-making and model validation.
Spatiotemporal System Forecasting with Irregular Time Steps via Masked Autoencoder
Time Series
- Introduces the Physics-Spatiotemporal Masked Autoencoder (P-STMAE) for forecasting irregular time series.
- Integrates convolutional autoencoders with masked autoencoders to handle missing data without imputation.
- Achieves significant improvements in prediction accuracy and computational efficiency over traditional methods.
- Demonstrates robustness to nonlinearities in high-dimensional dynamical systems.
Read more
Spatiotemporal System Forecasting with Irregular Time Steps via Masked Autoencoder
Summary
This paper addresses the challenge of predicting high-dimensional dynamical systems with irregular time steps, which often arise from missing data or sparse observations. The authors propose a novel approach called the Physics-Spatiotemporal Masked Autoencoder (P-STMAE), which combines convolutional autoencoders for spatial feature extraction with masked autoencoders optimized for irregular time series. This method utilizes attention mechanisms to reconstruct the entire physical sequence in a single prediction pass, eliminating the need for data imputation while maintaining the physical integrity of the system. The model is evaluated on various simulated datasets and real-world ocean temperature data, demonstrating significant improvements in prediction accuracy, robustness to nonlinearities, and computational efficiency compared to traditional convolutional and recurrent network methods. The P-STMAE shows promise for capturing complex spatiotemporal patterns without requiring domain-specific knowledge, making it applicable in fields such as climate modeling, fluid dynamics, ocean forecasting, and environmental monitoring.
Methodology
The P-STMAE employs convolutional autoencoders for spatial feature extraction and masked autoencoders optimized for irregular time series. It leverages attention mechanisms to reconstruct sequences in a single pass, avoiding preprocessing steps like data imputation.
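The key preprocessing move, handling irregular steps without imputation, can be sketched as a masking step: observations are laid onto a regular grid and missing steps become masked tokens rather than interpolated values, ready for an attention model to reconstruct in one pass. The grid parameters and function name are assumptions for illustration.

```python
# Hypothetical sketch: irregular observations become a masked regular grid;
# missing steps are mask tokens, not imputed values.
def to_masked_grid(times, values, t0, dt, n_steps, mask_token=None):
    grid = [mask_token] * n_steps
    observed = [False] * n_steps
    for t, v in zip(times, values):
        idx = round((t - t0) / dt)
        if 0 <= idx < n_steps:
            grid[idx] = v
            observed[idx] = True
    return grid, observed

# Observations at t = 0.0, 0.2, 0.5 on a 6-step grid with dt = 0.1:
grid, observed = to_masked_grid([0.0, 0.2, 0.5], [1.0, 2.0, 3.0],
                                t0=0.0, dt=0.1, n_steps=6)
```

An attention-based masked autoencoder can then attend only to the observed positions and predict the masked ones, which is how imputation-free single-pass reconstruction becomes possible.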
Results
The proposed method outperforms traditional convolutional and recurrent network approaches in terms of prediction accuracy, robustness to nonlinearities, and computational efficiency, as evidenced by evaluations on simulated datasets and real-world ocean temperature data.
Implications
The P-STMAE has potential applications in various domains that require accurate forecasting of spatiotemporal systems, such as climate modeling, fluid dynamics, and environmental monitoring, without the need for extensive preprocessing or domain-specific knowledge.
GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
Large Language Models
Efficient ML
Optimization
- GlowQ utilizes a group-shared low-rank approximation to enhance quantized LLMs.
- The method reduces latency and memory overhead by caching a single shared right factor per input-sharing group.
- GlowQ-S, a selective variant, further optimizes performance by applying corrections only where beneficial.
- Empirical results show significant improvements in efficiency and accuracy over strong baselines.
Read more
GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
Summary
GlowQ introduces a novel approach to enhance the efficiency of quantized large language models (LLMs) by employing a group-shared low-rank approximation. Traditional quantization methods often lead to accuracy degradation, particularly at low-bit representations. Existing low-rank correction techniques restore all layers and add error-correction modules to every decoder block, which increases latency and memory usage. GlowQ addresses these limitations by caching a single shared right factor for input-sharing groups and selectively restoring only the groups or layers that provide the most significant accuracy improvements. This method reduces both parameter and memory overhead while maintaining the expressivity of layer-specific corrections. Additionally, a selective variant, GlowQ-S, further optimizes performance by applying the cached shared module only where it is most beneficial. Empirical evaluations demonstrate that GlowQ reduces time-to-first-byte (TTFB) by 5.6% and increases throughput by 9.6% on average, while also improving downstream accuracy and reducing perplexity on the WikiText-2 benchmark. The selective model GlowQ-S achieves even greater efficiency gains, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within a narrow margin.
Methodology
GlowQ employs a group-sharing strategy for low-rank approximation, where a single shared right factor is computed for input-sharing groups. It uses a covariance-aligned objective to ensure that the shared factor aligns with frequently visited directions in the data. The method incorporates a QR-reduced randomized SVD routine for efficient computation and introduces a selective restore policy to activate only the most impactful groups or layers during inference.
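The group-sharing idea can be sketched with a plain truncated SVD (standing in for the paper's QR-reduced randomized SVD and covariance-aligned objective): layers in an input-sharing group stack their quantization errors, one SVD of the stack yields a single shared right factor R, and each layer keeps only a small private left factor so that its error is approximated as L_i @ R. The rank, shapes, and rounding scheme below are assumptions for illustration.

```python
# Hypothetical sketch of a group-shared low-rank quantization correction.
import numpy as np

rng = np.random.default_rng(0)

def shared_low_rank_correction(errors, rank):
    stacked = np.vstack(errors)                 # (sum_rows, d_in)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    r_shared = vt[:rank]                        # one shared right factor (rank, d_in)
    lefts = [e @ r_shared.T for e in errors]    # small per-layer left factors
    return lefts, r_shared

# Two layers in one input-sharing group; error = weight - dequantized weight.
w = [rng.standard_normal((4, 8)) for _ in range(2)]
q = [np.round(wi * 4) / 4 for wi in w]          # crude low-bit rounding stand-in
errors = [wi - qi for wi, qi in zip(w, q)]
lefts, r_shared = shared_low_rank_correction(errors, rank=2)
corrected = [qi + li @ r_shared for qi, li in zip(q, lefts)]
```

Because R's rows are orthonormal, adding each layer's projected correction can only shrink (never grow) the residual error, while the memory cost is one shared R plus tiny per-layer left factors instead of a full low-rank pair per layer.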
Results
GlowQ achieves a 5.6% reduction in time-to-first-byte (TTFB) and a 9.6% increase in throughput on average, along with a 0.17% reduction in perplexity on WikiText-2 and a 0.42 percentage point increase in downstream accuracy. The selective variant, GlowQ-S, further reduces TTFB by 23.4% and increases throughput by 37.4%, maintaining accuracy within 0.2 percentage points.
Implications
The GlowQ framework can significantly enhance the deployment of large language models in resource-constrained environments by improving efficiency without sacrificing accuracy. This has potential applications in real-time language processing tasks, mobile AI applications, and other scenarios where computational resources are limited.
A CDF-First Framework for Free-Form Density Estimation
Generative Models
Theory
- Introduces a CDF-first framework that reframes density estimation as learning a valid CDF, minimizing inductive bias.
- Extends the framework to multivariate outputs using an autoregressive decomposition with SMM-based conditional CDFs.
- Demonstrates superior performance in capturing multi-modality, skewness, and topological complexity compared to existing methods.
Read more
A CDF-First Framework for Free-Form Density Estimation
Summary
This paper introduces a novel CDF-first framework for conditional density estimation (CDE) that addresses the challenges of free-form density estimation, which involves capturing complex distributions that may be multi-modal, asymmetric, or topologically intricate. Traditional methods often estimate the probability density function (PDF) directly, which can lead to instability and poor performance due to the ill-posed nature of PDF estimation. The proposed framework instead focuses on estimating the cumulative distribution function (CDF), a more stable target that allows for the recovery of the PDF through differentiation. By employing Smooth Min-Max (SMM) networks to parameterize the CDF, the authors ensure that the resulting PDFs are valid and can accurately represent complex distributional shapes. For multivariate outputs, an autoregressive decomposition is utilized, allowing the model to maintain structural fidelity while simplifying the estimation process to a series of univariate CDF tasks. Experimental results demonstrate that the CDF-first approach outperforms existing state-of-the-art density estimators across various univariate and multivariate tasks, effectively capturing the nuances of complex distributions.
Methodology
The authors propose a CDF-first framework that estimates the cumulative distribution function (CDF) using Smooth Min-Max (SMM) networks, ensuring valid probability density functions (PDFs) through differentiation. For multivariate outputs, they utilize an autoregressive decomposition to model conditional CDFs, simplifying the estimation process while preserving the joint distribution.
Results
The proposed CDF-first framework outperformed state-of-the-art density estimators on various benchmarks, effectively capturing complex distributional characteristics such as multi-modality and topological intricacies. The method demonstrated improved accuracy in recovering ground-truth distributions with sharp boundaries and topological holes.
Implications
This framework has significant implications for applications requiring accurate uncertainty quantification and risk assessment in decision-making processes, particularly in fields where understanding complex distributions is crucial.
Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model
Reinforcement Learning
Large Language Models
Efficient ML
- HIVE framework improves prompt selection efficiency in RL training for LLMs.
- Identifies the 'learning edge' where the most informative prompts reside.
- Utilizes historical data and real-time entropy to select high-utility prompts.
- Demonstrates significant reductions in computational overhead during training.
Read more
Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model
Summary
This paper addresses the challenge of computational overhead in reinforcement learning (RL) for fine-tuning large language models (LLMs) in reasoning tasks. The authors propose a novel framework called HIVE (History-Informed and online-VErified prompt selection) that efficiently selects high-utility prompts before the rollout phase, thus reducing unnecessary computational costs. The study reveals that sample utility is non-uniform and evolves during training, with the most informative prompts located at the 'learning edge'—a balance of intermediate difficulty and high uncertainty. HIVE operates in two stages: it first uses historical reward trajectories for coarse selection, followed by real-time pruning of stale prompts based on their entropy. The framework is evaluated across multiple math reasoning benchmarks, demonstrating significant improvements in rollout efficiency and training speed without compromising performance. HIVE achieves up to 9.2 million fewer rollouts and maintains or exceeds the accuracy of existing methods like Dynamic Sampling and GRESO.
Methodology
The authors conducted an empirical analysis of training dynamics to understand prompt utility and its evolution. HIVE employs a two-stage process: first, it selects prompts based on historical reward trajectories; second, it verifies prompt utility in real-time using entropy as a proxy to prune ineffective prompts.
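The two-stage selection can be sketched as a filter over historical 0/1 rollout rewards: the coarse stage keeps prompts whose success rate sits at the "learning edge" (neither solved nor hopeless), and an entropy check prunes stale prompts whose outcomes have become near-deterministic. The thresholds are assumptions for illustration, and real-time entropy would come from the current policy rather than history alone.

```python
# Hypothetical sketch of learning-edge prompt selection a la HIVE.
import math

def bernoulli_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_prompts(history, low=0.2, high=0.8, min_entropy=0.5):
    # history: prompt -> list of past 0/1 rewards across rollouts.
    selected = []
    for prompt, rewards in history.items():
        p = sum(rewards) / len(rewards)
        if low <= p <= high and bernoulli_entropy(p) >= min_entropy:
            selected.append(prompt)
    return selected

history = {
    "easy":   [1, 1, 1, 1],    # already solved: skip, saves rollouts
    "hard":   [0, 0, 0, 0],    # currently hopeless: skip
    "edge_a": [1, 0, 1, 0],    # intermediate difficulty, high uncertainty
    "edge_b": [1, 1, 1, 0],    # borderline but still informative
}
chosen = select_prompts(history)
```

Skipping "easy" and "hard" before rollout is where the computational savings come from: no tokens are generated for prompts with near-zero learning signal.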
Results
HIVE outperforms baseline methods such as Dynamic Sampling and GRESO, achieving up to 3.8× speedup in rollout and 2.2× faster total training time. It also reduces the number of rollouts by up to 9.2 million while maintaining or exceeding reasoning accuracy.
Implications
The proposed HIVE framework can significantly enhance the efficiency of RL training for large language models, making it a valuable approach for applications requiring complex reasoning capabilities. This could lead to faster training cycles and reduced resource consumption in deploying LLMs.
Can an Actor-Critic Optimization Framework Improve Analog Design Optimization?
Optimization
- Introduces an Actor-Critic Optimization Framework (ACOF) for analog design optimization.
- Separates proposal and evaluation roles to enhance search efficiency and interpretability.
- Achieves significant improvements in design metrics over existing optimization methods.
- Maintains compatibility with standard simulation workflows.
Read more
Can an Actor-Critic Optimization Framework Improve Analog Design Optimization?
Summary
This paper introduces an Actor-Critic Optimization Framework (ACOF) aimed at enhancing analog design optimization, which is often hampered by the need for extensive simulation cycles and the complexity of navigating a vast design space. Traditional optimization methods lack the nuanced judgment that human designers apply when exploring design options. ACOF addresses this by separating the roles of proposal and evaluation in the optimization process. The 'actor' proposes promising regions of the design space, while the 'critic' evaluates these proposals, ensuring they adhere to design constraints and redirecting the search when necessary. This structured approach allows for a more deliberate, stable, and interpretable search process, compatible with existing simulation workflows. The framework was tested on various circuits, yielding an average improvement of 38.9% in the top-10 figure of merit (FoM) compared to the best existing methods, alongside a 24.7% reduction in regret, with peak improvements of 70.5% in FoM and 42.2% lower regret for individual circuits. ACOF combines iterative reasoning with simulation-driven search, paving the way for more automated and efficient analog sizing in complex design environments.
Methodology
The ACOF framework operates in a closed-loop optimization process where the actor proposes candidate search regions, the critic audits these proposals for legality and performance, and a Bayesian optimization method selects candidates for simulation. This iterative process allows for adjustments based on feedback from previous rounds, enhancing the search direction and efficiency.
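One iteration of that loop can be sketched with stand-ins: the actor proposes boxes in the design space, the critic rejects boxes that violate bounds (a proxy for real design constraints), and a toy surrogate score picks the candidate to simulate in place of Bayesian optimization. All functions and values below are illustrative assumptions, not the paper's models.

```python
# Hypothetical sketch of one actor-critic optimization iteration (ACOF-style).
def actor_propose(centers, width):
    # Actor: propose candidate boxes around promising design points.
    return [{"lo": [c - width for c in ctr], "hi": [c + width for c in ctr]}
            for ctr in centers]

def critic_audit(regions, bounds):
    # Critic: keep only regions that respect the designer's legal bounds.
    lo_b, hi_b = bounds
    return [r for r in regions
            if all(lo >= lo_b and hi <= hi_b
                   for lo, hi in zip(r["lo"], r["hi"]))]

def pick_candidate(regions, surrogate):
    # Stand-in for Bayesian optimization: score each surviving region center.
    centers = [[(l + h) / 2 for l, h in zip(r["lo"], r["hi"])] for r in regions]
    return max(centers, key=surrogate)

bounds = (0.0, 10.0)
proposals = actor_propose([[2.0, 2.0], [5.0, 5.0], [9.5, 9.5]], width=1.0)
legal = critic_audit(proposals, bounds)                 # third box exceeds bounds
best = pick_candidate(legal, surrogate=lambda x: -(x[0] - 5) ** 2 - x[1])
```

The separation matters: the critic can redirect the search (here by vetoing the out-of-bounds box) before any simulation budget is spent, which is the interpretability and efficiency claim of the framework.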
Results
The ACOF framework demonstrated an average improvement of 38.9% in the top-10 figure of merit (FoM) and a 24.7% reduction in regret across test circuits. Peak gains included a 70.5% improvement in FoM and a 42.2% reduction in regret for specific circuits, showcasing the framework's effectiveness in optimizing analog designs.
Implications
The findings suggest that integrating actor-critic methodologies into analog design optimization can significantly streamline the design process, making it more efficient and interpretable. This approach could lead to advancements in automated design tools, reducing the reliance on extensive designer expertise and simulation cycles.
Missing-Aware Multimodal Fusion for Unified Microservice Incident Management
Multimodal
- ARMOR effectively addresses the issue of missing modalities in multimodal data for incident management.
- The framework employs a self-supervised approach, eliminating the need for extensive fault labels.
- It features a modality-specific asymmetric encoder and a missing-aware gated fusion mechanism.
- ARMOR demonstrates state-of-the-art performance in anomaly detection, failure triage, and root cause localization.
Read more
Missing-Aware Multimodal Fusion for Unified Microservice Incident Management
Summary
The paper presents ARMOR, a self-supervised framework designed to enhance automated incident management in microservice architectures by addressing the challenges posed by missing modalities in multimodal data. Traditional frameworks assume complete data, which is often unrealistic due to network fluctuations and agent failures. ARMOR introduces a modality-specific asymmetric encoder to isolate disparities among different data types (metrics, logs, and traces) and employs a missing-aware gated fusion mechanism that utilizes learnable placeholders and dynamic bias compensation to mitigate cross-modal interference. The framework jointly optimizes three critical tasks: anomaly detection (AD), failure triage (FT), and root cause localization (RCL), with AD and RCL requiring no fault labels and FT relying solely on failure-type annotations. Extensive experiments demonstrate that ARMOR achieves state-of-the-art performance under complete data conditions and retains robust diagnostic accuracy even with significant modality loss, showcasing its potential for practical deployment in real-world scenarios.
Methodology
ARMOR utilizes a self-supervised learning approach with a modality-specific asymmetric encoder to handle the disparities among different data types. It incorporates a missing-aware gated fusion mechanism that uses learnable placeholders and dynamic bias compensation to effectively manage incomplete inputs. The framework optimizes anomaly detection, failure triage, and root cause localization tasks simultaneously, leveraging mask-guided reconstruction techniques.
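The missing-aware gated fusion step can be sketched as follows: absent modalities are replaced by placeholder vectors (learnable in the real model, zeros here), their gates are closed, and the remaining gates are renormalized so the present modalities compensate for the loss. The vectors and gate values are toy stand-ins, not ARMOR's learned parameters.

```python
# Hypothetical sketch of missing-aware gated fusion over three modalities.
def gated_fusion(features, placeholders, gates):
    # features: modality -> vector or None (missing); gates: modality -> weight.
    present = {m for m, f in features.items() if f is not None}
    filled = {m: (f if f is not None else placeholders[m])
              for m, f in features.items()}
    # Close gates of missing modalities; renormalize over present ones.
    total = sum(gates[m] for m in present)
    adj = {m: (gates[m] / total if m in present else 0.0) for m in features}
    dim = len(next(iter(filled.values())))
    return [sum(adj[m] * filled[m][i] for m in features) for i in range(dim)]

placeholders = {"metrics": [0.0, 0.0], "logs": [0.0, 0.0], "traces": [0.0, 0.0]}
gates = {"metrics": 0.5, "logs": 0.3, "traces": 0.2}
full = gated_fusion({"metrics": [1.0, 0.0], "logs": [0.0, 1.0],
                     "traces": [1.0, 1.0]}, placeholders, gates)
partial = gated_fusion({"metrics": [1.0, 0.0], "logs": [0.0, 1.0],
                        "traces": None}, placeholders, gates)
```

The placeholder keeps tensor shapes valid for downstream encoders even when a modality drops out, while the renormalized gates prevent the missing channel from diluting the fused representation.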
Results
The experimental results indicate that ARMOR outperforms existing methods in terms of diagnostic accuracy and robustness, particularly in scenarios with missing modalities. It achieves state-of-the-art performance under complete data conditions and demonstrates resilience against significant data loss, confirming its effectiveness for real-world applications in microservice incident management.
Implications
The development of ARMOR has significant implications for improving the reliability of microservice architectures by enabling more effective incident management. Its self-supervised approach reduces the dependency on labeled data, making it easier to deploy in diverse environments. This framework can enhance the operational efficiency of site reliability engineers (SREs) by streamlining the incident management process and reducing downtime.
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
NLP
Large Language Models
Reinforcement Learning
- Multi-Answer RL enables language models to generate multiple plausible answers simultaneously.
- The approach improves diversity and coverage of responses while providing calibrated uncertainty estimates.
- Empirical results show over 50% improvement in accuracy on coding tasks with reduced token usage.
- The method is applicable to various domains, including medical diagnosis and ambiguous question answering.
Read more
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
Summary
This paper addresses the limitations of traditional language models (LMs) that typically generate a single dominant answer to queries, which is insufficient for real-world tasks requiring multiple plausible responses. The authors propose a novel reinforcement learning (RL) approach called Multi-Answer RL, which trains LMs to generate distributions over multiple candidate answers in a single forward pass. This method allows for better representation of uncertainty and diversity in responses, particularly in complex domains such as medical diagnosis and coding. By modifying the RL objective to encourage the generation of multiple hypotheses, the authors demonstrate that their approach significantly improves answer diversity, coverage, and calibration scores compared to standard single-answer training methods. The empirical results show that Multi-Answer RL not only enhances accuracy but also reduces the number of tokens needed to generate multiple answers, making it a more efficient alternative to existing inference-time sampling techniques.
Methodology
The authors introduce Multi-Answer RL, which modifies the reinforcement learning objective to optimize for the generation of multiple candidate answers directly. This involves training models to reason over multiple hypotheses within a single generation and to verbalize structured sets of answers, supported by a reward function inspired by proper scoring rules to incentivize calibrated distributions.
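A reward built from a proper scoring rule can be sketched with the Brier score: the model verbalizes a distribution over candidate answers in one pass, and the reward is the negative squared error of that distribution against the verified answer. A proper rule like this is maximized in expectation only by reporting calibrated probabilities; the exact reward used in the paper may differ, so treat this as an illustrative instance.

```python
# Hypothetical sketch: negative Brier score as a calibration-incentivizing
# reward over a verbalized answer distribution.
def brier_reward(predicted, truth):
    # predicted: answer -> probability (should sum to 1); truth: answer string.
    support = set(predicted) | {truth}
    return -sum((predicted.get(a, 0.0) - (1.0 if a == truth else 0.0)) ** 2
                for a in support)

confident_right = brier_reward({"42": 0.9, "41": 0.1}, truth="42")
hedged          = brier_reward({"42": 0.5, "41": 0.5}, truth="42")
confident_wrong = brier_reward({"41": 0.9, "42": 0.1}, truth="42")
```

The ordering of the three rewards shows the incentive structure: justified confidence beats hedging, and hedging beats confident error, so the model is pushed toward honest uncertainty rather than a single dominant mode.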
Results
The proposed Multi-Answer RL approach demonstrated substantial improvements in various benchmarks, including increased answer diversity, better coverage, and improved calibration scores. On coding tasks, the approach achieved over 50% higher top-1 accuracy while reducing token usage by more than half compared to traditional single-answer trained models.
Implications
The findings suggest that Multi-Answer RL can enhance the performance of language models in high-stakes applications where multiple correct answers are possible, such as in medical diagnostics and complex decision-making scenarios. This approach could lead to more reliable and interpretable AI systems capable of handling uncertainty and ambiguity in real-world tasks.
Energy-Efficient Hierarchical Federated Anomaly Detection for the Internet of Underwater Things via Selective Cooperative Aggregation
Federated Learning
Efficient ML
Time Series
- Proposes a three-tier hierarchical federated learning framework for anomaly detection in IoUT.
- Introduces feasibility-aware sensor-to-fog associations and compressed model-update transmissions.
- Demonstrates that selective cooperative aggregation reduces energy consumption significantly while maintaining detection accuracy.
- Evaluates the framework using a physics-grounded model to assess communication energy and network participation.
Read more
Energy-Efficient Hierarchical Federated Anomaly Detection for the Internet of Underwater Things via Selective Cooperative Aggregation
Summary
This paper addresses the challenges of anomaly detection in the Internet of Underwater Things (IoUT), where traditional flat federated learning (FL) struggles due to low-bandwidth and energy-intensive acoustic communication. The authors propose a novel energy-efficient hierarchical federated learning framework that incorporates feasibility-aware sensor-to-fog associations, compressed model-update transmissions, and selective cooperative aggregation among fog nodes. This three-tier architecture minimizes long-range communications by clustering sensors and activating fog-to-fog exchanges only when beneficial. The framework is evaluated using a physics-grounded underwater acoustic model, which assesses detection quality, communication energy, and network participation. The results indicate that while only 48% of sensors can directly reach the surface gateway in a large deployment, the hierarchical approach maintains full participation through feasible fog paths. Selective cooperation achieves detection accuracy comparable to continuous inter-fog exchanges while reducing energy consumption by 31-33%, and compressed uploads decrease total energy usage by 71-95%. The findings demonstrate that low-overhead hierarchical methods can compete effectively in detection quality, offering practical design guidance for underwater deployments under severe communication constraints.
Methodology
The authors developed a three-tier hierarchical federated learning framework that includes feasibility-aware sensor-to-fog associations, compressed model-update transmissions, and selective cooperative aggregation among fog nodes. They utilized a physics-grounded underwater acoustic model to evaluate the system's performance in terms of detection quality, communication energy, and effective network participation.
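The selective cooperation step can be sketched as a conditional second round of federated averaging: each fog node averages its sensors' updates locally, and the energy-intensive fog-to-fog exchange fires only when the fog models have drifted apart by more than a threshold. The divergence measure and threshold are assumptions for illustration.

```python
# Hypothetical sketch of selective cooperative aggregation across fog nodes.
def fedavg(updates):
    n = len(updates)
    return [sum(u[i] for u in updates) / n for i in range(len(updates[0]))]

def divergence(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def selective_aggregate(fog_updates, tau):
    fog_models = [fedavg(us) for us in fog_updates]      # cheap local round
    drift = max(divergence(a, b) for a in fog_models for b in fog_models)
    cooperated = drift > tau
    if cooperated:                 # energy-intensive inter-fog exchange
        merged = fedavg(fog_models)
        fog_models = [merged for _ in fog_models]
    return fog_models, cooperated

# Two fog clusters, each aggregating two sensors' model vectors.
fogs = [[[1.0, 2.0], [1.2, 2.2]], [[0.9, 1.9], [1.1, 2.1]]]
models_close, coop_close = selective_aggregate(fogs, tau=0.5)
models_far, coop_far = selective_aggregate(
    [[[1.0, 2.0], [1.2, 2.2]], [[3.0, 4.0], [3.2, 4.2]]], tau=0.5)
```

When the clusters agree, no acoustic fog-to-fog traffic is sent at all, which is where the reported 31-33% energy saving over always-on exchange comes from.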
Results
The hierarchical framework enabled full sensor participation in a deployment where only 48% of sensors could reach the surface gateway directly. Selective cooperation cut energy consumption by 31-33% relative to always-on inter-fog exchanges, and compressed uploads reduced total energy usage by 71-95% in sensitivity tests. The hierarchical methods maintained detection quality competitive with flat federated learning.
Implications
The proposed framework provides a practical approach for deploying anomaly detection systems in underwater environments, enabling efficient communication and energy usage. This has implications for various applications such as ocean observation, environmental monitoring, and autonomous underwater operations.
Amplified Patch-Level Differential Privacy for Free via Random Cropping
Computer Vision
Theory
Efficient ML
- Random cropping can enhance differential privacy by probabilistically excluding sensitive content from model inputs.
- A new patch-level neighboring relation is introduced to better align privacy definitions with the structure of vision data.
- The method does not require changes to existing training algorithms or additional computational resources.
- Empirical results demonstrate improved privacy-utility trade-offs across multiple segmentation architectures and datasets.
Read more
Amplified Patch-Level Differential Privacy for Free via Random Cropping
Summary
This paper explores the intersection of random cropping, a common data augmentation technique in computer vision, and differential privacy (DP) in machine learning. The authors argue that random cropping can probabilistically exclude sensitive content from images, such as faces or license plates, thus introducing an additional source of randomness in the training of differentially private models. They formalize this concept by introducing a patch-level neighboring relation specifically for vision data, which allows for a more nuanced understanding of privacy in images. The paper derives tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping, demonstrating that this method can enhance privacy guarantees without altering the model architecture or training process. The authors validate their approach empirically using segmentation architectures on various datasets, showing that patch-level amplification improves the privacy-utility trade-off significantly. This work highlights the potential of leveraging existing sources of randomness in machine learning to achieve stronger privacy protections.
Methodology
The authors formalize random cropping as a privacy amplification mechanism and introduce a patch-level neighboring relation for vision data. They analyze the resulting privacy guarantees of DP-SGD when combined with random cropping, deriving tight theoretical privacy bounds and quantifying the patch inclusion probability.
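The amplification argument hinges on the probability that a sensitive patch is even included in a random crop. A minimal 1-D sketch of that inclusion probability, assuming uniform integer crop offsets (the paper's exact formulation may differ; for 2-D images the per-axis probabilities would multiply):

```python
def patch_inclusion_prob(n, c, p, s):
    """Probability, over a uniformly random crop start, that the patch
    [p, p+s) lies entirely inside a crop of length c taken from a
    length-n axis. Illustrative 1-D version of a patch inclusion
    probability; not the paper's derivation."""
    if c > n or s > c or p + s > n:
        return 0.0
    lo = max(0, p + s - c)   # earliest crop start that still covers the patch
    hi = min(p, n - c)       # latest crop start that still covers the patch
    count = max(0, hi - lo + 1)
    return count / (n - c + 1)
```

For example, with a length-10 axis and crop length 5, a 2-pixel patch at the edge (p=0) is included in only 1 of 6 crop positions, while the same patch at p=4 is included in 4 of 6. A patch that is only sometimes present contributes correspondingly less to the privacy loss, which is what the tighter DP-SGD bounds exploit.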
Results
The empirical validation shows that random cropping significantly improves the privacy-utility trade-off in semantic segmentation tasks using architectures like DeepLabV3+ and PSPNet on datasets such as Cityscapes and A2D2.
Implications
This research suggests that existing data augmentation techniques can be leveraged to enhance privacy in machine learning models, potentially leading to broader applications in privacy-sensitive domains such as healthcare and finance where image data is prevalent.
How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models
NLP
Large Language Models
Interpretability
- Rare features survive pruning better than frequent features, indicating implicit feature selection.
- Wanda pruning preserves feature structure up to 3.7 times better than magnitude pruning.
- Pre-trained Sparse Autoencoders remain effective on Wanda-pruned models up to 50% sparsity.
- Geometric feature survival does not correlate with causal importance in model outputs.
Read more
How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models
Summary
This paper presents a systematic study of the effects of weight pruning on the internal feature representations of language models, utilizing Sparse Autoencoders (SAEs) for interpretability. The authors investigate how unstructured pruning reshapes feature geometry across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B) and two pruning methods (magnitude and Wanda) at various sparsity levels (0-60%). The study addresses five key research questions related to feature stability, survival, transferability, fragility, and causal relevance. A notable finding is that rare features with low firing rates are more resilient to pruning than frequent features, suggesting that pruning acts as an implicit feature selection mechanism. Additionally, the Wanda pruning method is shown to preserve feature structure significantly better than magnitude pruning. The results highlight the complex relationship between geometric feature survival and causal importance, with implications for the interpretability of pruned models.
Methodology
The authors employed Sparse Autoencoders to analyze the internal activations of language models before and after weight pruning. They conducted experiments across three model families and two pruning methods, measuring feature survival and stability under various sparsity levels. The study involved 22 experimental runs with multiple seeds to ensure robustness of findings.
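For context on the two pruning criteria being compared, here is a brief NumPy sketch of their scoring rules as commonly described: magnitude pruning ranks weights by |w| alone, while Wanda scales |w| by the L2 norm of the corresponding input activations and prunes the lowest-scoring weights within each output row. The array shapes and the global-vs-per-row granularity are simplifying assumptions, not this paper's implementation.

```python
import numpy as np

def magnitude_mask(W, sparsity):
    """Keep weights whose |w| clears a global threshold (magnitude pruning)."""
    k = int(W.size * sparsity)
    thresh = np.sort(np.abs(W), axis=None)[k]
    return np.abs(W) >= thresh

def wanda_mask(W, X, sparsity):
    """Wanda-style scoring: |w_ij| * ||x_j||_2, where x_j collects the
    activations seen by input feature j; prune per output row."""
    scores = np.abs(W) * np.linalg.norm(X, axis=0)   # (out, in) * (in,)
    k = int(W.shape[1] * sparsity)
    mask = np.ones_like(W, dtype=bool)
    idx = np.argsort(scores, axis=1)[:, :k]          # lowest scores per row
    np.put_along_axis(mask, idx, False, axis=1)
    return mask
```

Because Wanda weighs each connection by how strongly its input actually fires, it can spare small weights attached to high-activation features, which is one plausible reason it preserves SAE feature structure better than raw magnitude pruning.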
Results
The study found that rare SAE features survived pruning significantly better than frequent ones, with Spearman correlations indicating a strong negative relationship between firing rate and survival rate. Wanda pruning outperformed magnitude pruning in preserving feature structure. Although seed-to-seed stability was low, the degradation patterns were consistent across experiments. Importantly, the geometric survival of a feature did not predict its causal relevance to model outputs.
Implications
The findings suggest that practitioners can leverage pruning techniques while maintaining interpretability in language models. Understanding the dynamics of feature survival post-pruning can inform model deployment strategies and enhance the interpretability of compressed models.
Learning Mesh-Free Discrete Differential Operators with Self-Supervised Graph Neural Networks
Graph Learning
Theory
Efficient ML
- Introduction of a self-supervised graph neural network framework for learning mesh-free differential operators.
- Operators are learned based on polynomial consistency constraints, enhancing accuracy while maintaining computational efficiency.
- Demonstrated improved accuracy over traditional SPH methods and favorable trade-offs in computational cost.
- Framework is resolution-agnostic and applicable across various particle configurations and governing equations.
Read more
Learning Mesh-Free Discrete Differential Operators with Self-Supervised Graph Neural Networks
Summary
This paper presents a novel framework for learning mesh-free discrete differential operators using a self-supervised graph neural network, termed Neural Mesh-Free Differential Operator (NeMDO). Traditional mesh-free numerical methods often face a trade-off between computational efficiency and accuracy, particularly in complex geometries. The authors propose a method that learns local operator weights based on polynomial moment constraints derived from truncated Taylor expansions, allowing for the construction of operators that are robust to irregular geometries. The NeMDO framework is resolution-agnostic and can be reused across different particle configurations and governing equations. The authors validate their approach through numerical analysis diagnostics, demonstrating improved accuracy over Smoothed Particle Hydrodynamics (SPH) and a favorable accuracy-cost trade-off compared to high-order consistent mesh-free methods. Additionally, the applicability of the learned operators is showcased by solving the weakly compressible Navier–Stokes equations, indicating the potential of this framework for broader applications in computational fluid dynamics and other areas involving partial differential equations.
Methodology
The authors developed a graph neural network framework that learns discrete mesh-free differential operators by mapping local stencil positions to operator weights. The learning process is guided by polynomial moment constraints derived from Taylor expansions, ensuring that the learned operators maintain polynomial consistency. The framework is validated through various numerical analysis techniques, including convergence studies and stability analyses.
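The polynomial moment constraints here are the classical consistency conditions of mesh-free generalized finite differences: stencil weights must reproduce the target derivative exactly on all monomials up to a chosen order. A 1-D NumPy sketch of that constraint system, solved directly (our reading is that NeMDO instead trains a GNN to predict such weights; this is not the paper's code):

```python
import math
import numpy as np

def consistent_operator_weights(x0, neighbors, deriv_order=2, poly_order=2):
    """Solve for stencil weights w_i such that sum_i w_i * p(x_i) equals
    d^k p / dx^k evaluated at x0 for every polynomial p up to poly_order.
    Row a of the moment matrix holds (x_i - x0)^a / a!, so the right-hand
    side is 1 at the target derivative order and 0 elsewhere."""
    dx = np.asarray(neighbors, dtype=float) - x0
    A = np.stack([dx ** a / math.factorial(a) for a in range(poly_order + 1)])
    b = np.zeros(poly_order + 1)
    b[deriv_order] = 1.0
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w
```

On the symmetric stencil {-1, 0, 1} with second-order consistency this recovers the familiar central-difference weights [1, -2, 1]; on irregular particle clouds the same constraints still apply, which is what lets a learned operator remain polynomially consistent away from structured meshes.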
Results
The NeMDO framework showed significant improvements in accuracy compared to traditional SPH methods and provided a better accuracy-cost trade-off relative to high-order consistent mesh-free methods. The learned operators were successfully applied to solve the weakly compressible Navier–Stokes equations, demonstrating their practical utility in fluid dynamics.
Implications
The proposed framework has the potential to revolutionize numerical methods for solving partial differential equations, particularly in complex geometries where traditional mesh-based methods struggle. It opens avenues for more efficient simulations in various fields, including computational fluid dynamics, material science, and engineering applications.