AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48
Papers today
8h
Update frequency
7
Days of history
Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy
Theory
Efficient ML
Interpretability
- Introduces a physics-informed transfer learning framework for emission control in MSWI systems.
- Demonstrates the importance of considering physical constraints and operational heterogeneity in modeling.
- Achieves high predictive accuracy for emissions across multiple incineration plants.
- Shows that adaptation occurs through structured re-weighting of operating regimes rather than complete model re-learning.
Read more
Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy
Summary
This paper addresses the challenge of controlling carbon emissions and multiple air pollutants from municipal solid waste incineration (MSWI) systems, which are essential for sustainable urban waste management. The authors propose a novel physics-informed transfer learning framework that utilizes a mixture-of-experts model to capture the complex interactions between carbon emissions and pollutants across different incineration plants. By considering physical constraints, operational heterogeneity, and the coupling of carbon and pollutant emissions, the framework enables the development of transferable models that can generalize across various facilities. The model was validated using data from 13 MSWI plants, demonstrating strong predictive performance for both pollutant-specific emissions and a carbon-pollutant synergistic index (CPSI). The results indicate that the model maintains high accuracy even when applied to new sites, thereby supporting scalable emission control strategies. This research highlights the potential for integrating physics-informed approaches with machine learning to enhance the robustness and transferability of emission prediction models in heterogeneous systems.
Methodology
The authors developed a physics-informed transfer learning framework that integrates a mixture-of-experts model. This model employs regime-dependent expert routing combined with conservation-based regularization and a carbon-pollutant synergistic index for risk evaluation. The framework was trained and validated using data from 13 municipal solid waste incineration plants, focusing on capturing both pollutant-specific emissions and system-level risks.
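A minimal PyTorch sketch of the regime-routing idea: experts model emissions within operating regimes, a gate soft-weights them per sample, and transfer to a new plant re-weights regimes by fine-tuning only the gate. The architecture, layer sizes, and freezing strategy are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MoEEmissionModel(nn.Module):
    def __init__(self, n_features: int, n_experts: int = 4):
        super().__init__()
        # Each expert models emissions within one operating regime.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))
            for _ in range(n_experts)
        ])
        # The gate assigns soft regime weights to each sample.
        self.gate = nn.Linear(n_features, n_experts)

    def forward(self, x):                                        # x: (B, F)
        weights = torch.softmax(self.gate(x), dim=-1)            # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], -1)  # (B, 1, E)
        return (outputs * weights.unsqueeze(1)).sum(-1)          # (B, 1)

model = MoEEmissionModel(n_features=16)
# "Structured re-weighting" transfer: keep regime experts fixed and
# fine-tune only the gate on the target plant's data.
for p in model.experts.parameters():
    p.requires_grad = False
```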
Results
The model achieved source-domain average pollutant R2 values ranging from 0.668 to 0.904 and CPSI R2 values from 0.666 to 0.970. After transferring the model from a reference facility to 12 target plants, the average pollutant R2 remained between 0.661 and 0.842, with CPSI retaining comparable transferability (R2 = 0.610–0.841). The expert-utilization patterns indicated that the model adapts through structured re-weighting of operating regimes.
Implications
This framework offers a significant advancement in emission prediction and control for municipal solid waste incineration systems, enabling operators to implement scalable and effective emission control strategies across heterogeneous facilities. It also paves the way for future research in integrating physics-informed machine learning with complex industrial processes.
reward-lens: A Mechanistic Interpretability Library for Reward Models
Reinforcement Learning
Interpretability
Large Language Models
- Introduction of 'reward-lens', a toolkit for mechanistic interpretability of reward models.
- Unifies various interpretability techniques under a common framework based on the reward head's weight vector.
- Includes five theory-grounded extensions to enhance interpretability tools.
- Empirical validation shows that linear attribution fails to predict causal importance in reward models.
Read more
reward-lens: A Mechanistic Interpretability Library for Reward Models
Summary
The paper introduces 'reward-lens', an open-source library designed to enhance mechanistic interpretability for reward models used in reinforcement learning from human feedback (RLHF). Traditional interpretability tools have primarily focused on generative language models, but reward models, which output scalar values representing human preferences, require a different approach. The reward-lens library is built around the observation that the weight vector of the reward head serves as a natural axis for interpretability questions. It includes various tools such as the Reward Lens, component attribution, and contrastive activation patching, along with five extensions based on recent alignment theory results. The framework was validated on two production reward models, revealing that linear attribution does not effectively predict causal importance, indicating a need for better observational and causal analysis methods. The library aims to unify interpretability techniques and provide a systematic approach to understanding reward models, which are critical for ensuring the safety and alignment of AI systems.
Methodology
The reward-lens library organizes interpretability tools around the weight vector of the reward head, allowing for projections and decompositions relevant to reward models. It includes components like the Reward Lens, contrastive activation patching, and various extensions that operationalize recent alignment theory results. The library was validated on two reward models using approximately 695 preference pairs from RewardBench.
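To illustrate the central observation — the reward head's weight vector as the natural axis for interpretability — here is a hypothetical "reward lens" that projects intermediate hidden states onto that vector. Shapes and names are assumptions; the actual reward-lens API may differ.

```python
import torch

def reward_lens(hidden_states: torch.Tensor, w_head: torch.Tensor,
                b_head: float = 0.0) -> torch.Tensor:
    """hidden_states: (layers, seq, d_model); w_head: (d_model,).
    Returns a per-layer, per-token scalar 'reward reading'."""
    return hidden_states @ w_head + b_head  # (layers, seq)

# Toy usage with random tensors standing in for a real reward model's states:
h = torch.randn(24, 128, 4096)   # 24 layers, 128 tokens, d_model = 4096
w = torch.randn(4096)            # reward head weight vector
readings = reward_lens(h, w)
print(readings.shape)            # torch.Size([24, 128])
```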
Results
The empirical findings indicate that linear attribution does not effectively predict causal importance, with Spearman correlation coefficients of -0.256 for Skywork and -0.027 for ArmoRM. The authors present this negative result as a strength of the framework, since surfacing it highlights the need for improved observational and causal analysis methods.
Implications
The development of reward-lens has significant implications for the interpretability of reward models in AI, particularly in ensuring that these models align with human preferences. By providing a systematic approach to understanding reward models, it can help improve the safety and effectiveness of AI systems trained using RLHF.
Measuring the Sensitivity of Classification Models with the Error Sensitivity Profile
Theory
- Introduction of the Error Sensitivity Profile (ESP) for assessing model sensitivity to data errors.
- Development of the Dirtify tool suite to facilitate error injection and analysis.
- Extensive evaluation across 14 classification models reveals complex relationships between data errors and model performance.
- ESP allows for prioritization of data-cleaning efforts based on specific error types and features.
Read more
Measuring the Sensitivity of Classification Models with the Error Sensitivity Profile
Summary
This paper introduces the Error Sensitivity Profile (ESP), a novel metric designed to quantify the sensitivity of machine learning model performance to errors in training data. The ESP allows practitioners to assess how different types of errors in one or multiple features impact model performance, enabling targeted data-cleaning efforts. The author developed a suite of tools called Dirtify, which includes a Python library named PuckTrick for injecting specific error types into datasets. The paper presents an extensive experimental study using 14 classification models on two widely used datasets, demonstrating that performance degradation is not always predictable from simple correlations with the target variable. The ESP provides a multi-dimensional profile that captures the relationship between error severity and performance degradation, offering insights that can guide feature-level data cleaning decisions. This work contributes to the understanding of data quality in machine learning by providing a flexible and comprehensive approach to sensitivity analysis.
Methodology
The methodology involves defining the Error Sensitivity Profile (ESP) as a tuple that includes Error Performance Correlation (EPC), Area under Curve Error-Performance (AEPC), and the behavior of the model under varying error levels. The study utilizes a suite of tools (Dirtify and PuckTrick) to inject controlled errors into datasets and evaluate the performance of various classification models under these conditions.
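A toy error-sensitivity sweep in the spirit of the ESP, using scikit-learn: inject one error type into one feature at increasing severity, record the performance curve, then summarize it with the correlation (EPC) and area-under-curve (AEPC) statistics named above. The injection function is a stand-in; PuckTrick's real API is not shown.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def inject_missing(X, feature, severity, rng):
    """Stand-in injector: zero out a `severity` fraction of one feature."""
    X = X.copy()
    X[rng.random(len(X)) < severity, feature] = 0.0
    return X

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

severities = np.linspace(0.0, 0.9, 10)
scores = []
for s in severities:
    clf = RandomForestClassifier(random_state=0)
    clf.fit(inject_missing(X_tr, feature=0, severity=s, rng=rng), y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))
scores = np.array(scores)

epc = np.corrcoef(severities, scores)[0, 1]                          # EPC
aepc = np.sum((scores[:-1] + scores[1:]) / 2 * np.diff(severities))  # AEPC
print(f"EPC={epc:.3f}  AEPC={aepc:.3f}")
```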
Results
The experimental results indicate that the ESP effectively captures the nuanced relationship between data quality and model performance. The analysis shows that performance degradation is not solely dependent on simple correlations, highlighting the need for a more sophisticated approach to understanding data errors. The Dirtify suite demonstrated its utility in supporting this research by enabling systematic error injection and performance evaluation.
Implications
The findings suggest that the ESP can significantly enhance the process of data cleaning in machine learning by providing insights into which errors most adversely affect model performance. This can lead to more efficient data preparation strategies and improved model robustness. The tools developed can be applied in various domains where data quality is critical to machine learning success.
Shearlet Neural Operators for Anisotropic-Shock-Dominated and Multi-scale parametric partial differential equations
Theory
Efficient ML
- Introduction of Shearlet Neural Operator (SNO) to enhance neural operator architectures for PDEs.
- SNO replaces Fourier transforms with shearlet representations, improving handling of anisotropic features.
- Demonstrated significant accuracy improvements over Fourier Neural Operators across multiple PDE benchmarks.
- SNO integrates shearlet transforms into the neural operator pipeline for end-to-end training.
Read more
Shearlet Neural Operators for Anisotropic-Shock-Dominated and Multi-scale parametric partial differential equations
Summary
This paper introduces the Shearlet Neural Operator (SNO), a novel architecture designed to improve the predictive accuracy of neural operators for solving parametric partial differential equations (PDEs) that exhibit anisotropic and shock-dominated characteristics. Traditional Fourier Neural Operators (FNOs) struggle with these types of PDEs due to their reliance on global Fourier representations, which can be inefficient for capturing sharp gradients and localized discontinuities. The SNO architecture replaces the Fourier transform with a shearlet-based representation, which provides a directional, multiscale, and spatially localized approach to approximating anisotropic features. This allows for better handling of edges, fronts, and shocks in PDE solutions. The authors demonstrate the effectiveness of SNO across seven benchmark PDE families, showing significant improvements in predictive accuracy and feature fidelity compared to FNOs, especially in anisotropic and discontinuity-dominated scenarios. The paper highlights the integration of shearlet transforms into the neural operator framework, enabling end-to-end training and efficient spectral computation while addressing the limitations of existing methods.
Methodology
The authors developed the Shearlet Neural Operator (SNO) by embedding shearlet-domain mixing directly into the neural operator architecture. This involved integrating a differentiable shearlet transform into the operator learning framework, allowing for efficient training and representation of anisotropic features in PDE solutions.
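The core substitution is architectural: transform to a coefficient domain, mix with learnable weights, transform back. The sketch below shows that generic pattern with a 1D FFT standing in for the shearlet transform (which requires a dedicated library), so it is closer to an FNO layer than to SNO itself.

```python
import torch
import torch.nn as nn

class SpectralMix1d(nn.Module):
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes
        # Learnable complex weights on the lowest `modes` coefficients.
        self.weight = nn.Parameter(
            torch.randn(channels, channels, modes, dtype=torch.cfloat) * 0.02
        )

    def forward(self, x):                       # x: (batch, channels, n)
        coeffs = torch.fft.rfft(x)              # to coefficient domain
        out = torch.zeros_like(coeffs)
        out[..., :self.modes] = torch.einsum(
            "bim,iom->bom", coeffs[..., :self.modes], self.weight
        )
        return torch.fft.irfft(out, n=x.size(-1))  # back to physical domain

layer = SpectralMix1d(channels=8, modes=16)
print(layer(torch.randn(4, 8, 64)).shape)       # torch.Size([4, 8, 64])
```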
Results
The SNO architecture consistently outperformed Fourier Neural Operators across seven benchmark PDE families, particularly in scenarios characterized by strong anisotropy and discontinuities. The results indicated improved predictive accuracy and feature fidelity, demonstrating the effectiveness of shearlet-based representations in capturing complex structures in PDE solutions.
Implications
The introduction of SNO has potential implications for various applications in computational science, particularly in fields requiring accurate modeling of transport, diffusion, and wave propagation phenomena. The ability to efficiently approximate solutions to complex PDEs can enhance real-time decision-making and optimization in engineering and scientific contexts.
Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
NLP
Large Language Models
Efficient ML
- Introduces High Entropy Phases (HEPs) as a stable measure of model uncertainty during inference.
- Defines the Entropy Centroid as a weighted average of HEP positions to guide response selection.
- Proposes the Lowest Centroid method for selecting responses based on intrinsic rewards derived from model uncertainty.
- Demonstrates consistent performance improvements across various tasks and model sizes.
Read more
Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
Summary
This paper addresses the challenge of scaling test-time computation for large language models (LLMs) by proposing a novel intrinsic reward mechanism based on entropy centroids. Traditional methods for selecting the best response from multiple candidates often rely on external reward models, which can be computationally expensive and introduce noise. The authors introduce the concept of High Entropy Phases (HEPs), which are segments of the inference process characterized by high token entropy, indicating uncertainty. By analyzing these phases, the authors define the Entropy Centroid as the weighted average position of all HEPs along the inference trajectory. The central hypothesis is that a lower entropy centroid correlates with higher response quality, as it suggests early exploration followed by confident generation. The proposed Lowest Centroid method selects the response with the lowest entropy centroid among candidates. Experiments across various tasks, including mathematics and code generation, demonstrate that this method consistently outperforms existing baselines, particularly as model size increases, indicating its effectiveness in improving response quality without the overhead of external reward models.
Methodology
The authors formalize the concept of High Entropy Phases (HEPs) to represent segments of high uncertainty during inference. They calculate the Entropy Centroid by treating the number of tokens in each HEP as mass and the midpoint of the HEP as its position, computing a weighted average along the trajectory. The Lowest Centroid method is then employed to select the best response based on the lowest entropy centroid among multiple candidates.
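The selection rule is simple enough to sketch directly: find contiguous high-entropy phases, weight each phase's midpoint by its token count, and pick the candidate with the lowest centroid. The entropy threshold and the handling of responses with no HEPs are assumptions.

```python
import numpy as np

def entropy_centroid(token_entropies, threshold=2.0):
    """token_entropies: per-token entropy along one response trajectory."""
    high = np.asarray(token_entropies) > threshold
    centroids, masses = [], []
    i = 0
    while i < len(high):
        if high[i]:
            j = i
            while j < len(high) and high[j]:
                j += 1
            masses.append(j - i)               # HEP mass = token count
            centroids.append((i + j - 1) / 2)  # HEP position = midpoint
            i = j
        else:
            i += 1
    if not masses:
        return 0.0  # fully confident trajectory; this edge case is an assumption
    return float(np.average(centroids, weights=masses))

# Lowest Centroid selection among candidate responses (stand-in entropies):
candidates = [np.random.rand(200) * 4 for _ in range(8)]
best = min(range(len(candidates)), key=lambda k: entropy_centroid(candidates[k]))
print("selected response index:", best)
```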
Results
Experiments show that the Lowest Centroid method outperforms existing selection baselines across various tasks, including mathematics, code generation, and logical reasoning. The performance gains are particularly pronounced as the model size increases, indicating that the method scales effectively with larger models.
Implications
This work suggests a new direction for improving the efficiency and effectiveness of response selection in large language models by leveraging intrinsic signals from the model's own generation process. It could lead to more robust applications in areas requiring high-quality outputs from LLMs, such as automated reasoning and coding tasks.
Investigation into In-Context Learning Capabilities of Transformers
Theory
Efficient ML
- Transformers can perform in-context learning effectively for unseen tasks using example input-output pairs.
- The study identifies critical factors affecting in-context test accuracy, including input dimension and the number of examples.
- Benign overfitting allows models to generalize well despite memorizing noisy labels under certain conditions.
- The research provides an empirical framework for understanding the scaling behavior of in-context classification.
Read more
Investigation into In-Context Learning Capabilities of Transformers
Summary
This paper explores the in-context learning (ICL) capabilities of transformer models, focusing on their performance in Gaussian-mixture binary classification tasks. While previous theoretical work has identified conditions for linear classification in-context, the authors aim to empirically characterize the scaling behavior of transformers in this context. They systematically analyze how in-context test accuracy is influenced by three key factors: input dimension, the number of in-context examples, and the number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier, the study isolates geometric conditions that allow models to infer task structure from context. The authors also investigate the phenomenon of benign overfitting, where models can memorize noisy in-context labels while maintaining strong generalization on clean test data. Through extensive experimentation across various parameter settings, the paper identifies regions where benign overfitting occurs and characterizes its dependence on data geometry and training exposure. The findings provide a comprehensive empirical map of the scaling behavior in in-context classification, emphasizing the importance of dimensionality, signal strength, and contextual information in determining the success or failure of in-context learning.
Methodology
The authors conducted a systematic empirical study using a controlled synthetic setup to analyze in-context learning in transformers. They employed a linear in-context classifier formulation and varied parameters such as input dimension, sequence length, task diversity, and signal-to-noise ratios to isolate the geometric conditions for successful inference. Extensive sweeps across these parameters were performed to characterize the emergence of benign overfitting.
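A sketch of what one synthetic task might look like: a Gaussian-mixture binary classification problem serialized as in-context (x, y) pairs plus a query. Dimensions, SNR scaling, and the symmetric-means construction are illustrative assumptions.

```python
import numpy as np

def sample_icl_task(d=16, n_context=32, snr=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    mu = rng.normal(size=d)
    mu *= snr / np.linalg.norm(mu)            # class mean, scaled to the SNR
    labels = rng.integers(0, 2, size=n_context + 1)
    signs = 2 * labels - 1                    # {0, 1} -> {-1, +1}
    x = signs[:, None] * mu + rng.normal(size=(n_context + 1, d))
    # Context pairs plus one held-out query whose label the model must infer:
    return (x[:-1], labels[:-1]), (x[-1], labels[-1])

(ctx_x, ctx_y), (qry_x, qry_y) = sample_icl_task()
print(ctx_x.shape, qry_x.shape, int(qry_y))   # (32, 16) (16,) 0 or 1
```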
Results
The study found that in-context learning success is significantly influenced by the dimensionality of the input, the number of in-context examples, and the diversity of pre-training tasks. The authors identified specific parameter regions where benign overfitting occurs, allowing models to achieve high generalization performance despite the presence of noise in training data. The results provide a detailed empirical map of the conditions under which transformers excel in in-context classification tasks.
Implications
The findings have significant implications for optimizing transformer training processes, potentially reducing the need for extensive pre-training by leveraging in-context learning. Understanding benign overfitting can help in designing models that generalize better in real-world applications, thus saving computational resources and time in model training.
Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
Reinforcement Learning
- Latent dynamics models can exhibit attractor behavior, biasing transitions towards well-represented regions of latent space.
- This attractor behavior can obscure discrepancies between latent and true environment dynamics, undermining the reliability of epistemic uncertainty estimates.
- Latent rollouts systematically overestimate predicted rewards due to the bias towards high-reward regions.
- The findings highlight the inadequacy of directly transferring epistemic uncertainty quantification methods from physical to latent dynamics models.
Read more
Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
Summary
This paper investigates the limitations of epistemic uncertainty quantification in latent dynamics models, particularly within the context of Model-Based Reinforcement Learning (MBRL). The authors focus on the Recurrent State Space Model (RSSM) used in the Dreamer family, highlighting that while epistemic uncertainty quantification is well-established for physical dynamics models, its application to latent dynamics models has not been thoroughly examined. The study empirically demonstrates that latent transitions in these models exhibit an attractor behavior, where they are biased towards well-represented regions of latent space. This bias can mask discrepancies between the latent and true environment dynamics, leading to unreliable epistemic uncertainty estimates. Furthermore, the authors find that these attractor regions often coincide with high-reward states, resulting in a systematic overestimation of predicted rewards. The findings underscore the need for a more critical evaluation of epistemic uncertainty estimation methods in latent dynamics models, as they may lead to overconfidence in model predictions and suboptimal reinforcement learning updates.
Methodology
The authors conducted empirical experiments using Recurrent State Space Models (RSSM) to analyze the behavior of latent transitions and their implications for epistemic uncertainty quantification. They examined how these models respond to out-of-distribution states and assessed the reliability of uncertainty estimates in comparison to true environment dynamics.
Results
The study found that latent transitions in RSSM models exhibit an attractor behavior, leading to biased predictions that do not align with actual environment dynamics. This behavior results in unreliable epistemic uncertainty estimates and a systematic overestimation of rewards, particularly in high-reward regions of latent space.
Implications
The findings suggest that reliance on epistemic uncertainty quantification in latent dynamics models can lead to overconfident decision-making in reinforcement learning applications. This has implications for the design and evaluation of reinforcement learning algorithms, particularly in environments where accurate uncertainty estimation is critical for effective exploration and exploitation.
Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors
NLP
Large Language Models
Optimization
- SWIFT reframes workflow design from iterative search to amortized synthesis, significantly reducing computational costs.
- The framework distills reusable structural priors from past workflows, enhancing efficiency and performance.
- SWIFT outperforms existing search-based methods across multiple benchmarks and generalizes well to unseen tasks.
- The study reveals that workflow demonstrations primarily transfer topological structures rather than specific operator names.
Read more
Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors
Summary
The paper addresses the inefficiencies of automated agentic workflow design, which traditionally relies on per-task iterative search methods. These methods are computationally expensive and do not leverage structural knowledge across tasks. The authors propose a new framework called SWIFT (Synthesizing Workflows via Few-shot Transfer), which amortizes workflow design by utilizing reusable structural priors. SWIFT distills compositional heuristics and output-interface contracts from previous search trajectories, allowing it to synthesize complete, executable workflows for new tasks in a single generation pass without iterative search. The framework demonstrates significant improvements in performance and efficiency, outperforming state-of-the-art methods while reducing optimization costs by three orders of magnitude. Additionally, SWIFT shows strong generalization capabilities across various benchmarks and successfully transfers knowledge across different foundation models.
Methodology
SWIFT operates in two phases: an offline phase where it performs contrastive trajectory distillation to extract heuristics and output contracts from prior workflows, and an online phase where it synthesizes workflows for new tasks based on these priors and examples from other tasks, all in a single generation pass without iterative search.
Results
SWIFT outperformed the state-of-the-art search-based method across five benchmarks and demonstrated generalization to four additional unseen benchmarks. It also successfully transferred knowledge from one foundation model (GPT-4o-mini) to three others (Grok, Qwen, Gemma), while maintaining over 93% performance even when operator names were randomized.
Implications
The findings suggest that automated workflow design can be made significantly more efficient by leveraging structural knowledge across tasks, which could lead to broader applications in areas requiring complex reasoning and generation tasks, such as automated programming, data analysis, and AI-driven decision-making.
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
Reinforcement Learning
- Introduces a dual-source uncertainty-aware reward framework to mitigate reward hacking.
- Employs a confidence-adjusted Reliability Filter to balance exploitation and caution in action selection.
- Achieves a 93.7% reduction in reward-hacking behavior across various environments.
- Demonstrates robustness to supervisory noise up to 30%, while maintaining statistical significance.
Read more
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
Summary
This paper addresses the challenges of reward hacking and misalignment in reinforcement learning (RL) systems, which often rely on scalar reward functions that assume precise evaluations of outcomes. The author introduces the Uncertainty-Aware Reward Discounting (UARD) framework, which explicitly incorporates both epistemic uncertainty in value estimation and uncertainty in human preferences. By modeling these uncertainties, the framework employs a confidence-adjusted Reliability Filter that modulates action selection, promoting a balance between exploitation and caution. Empirical evaluations across various discrete grid configurations and continuous control environments demonstrate that UARD significantly reduces exploitative behaviors associated with reward ambiguity, achieving a 93.7% reduction in reward-hacking incidents. The results indicate that UARD not only enhances training stability but also maintains robustness against supervisory noise, albeit with a trade-off in peak observed rewards compared to unconstrained baselines. This work positions uncertainty as a critical component of the reward signal, offering a principled approach to developing more reliable and aligned RL systems.
Methodology
The UARD framework integrates model epistemic uncertainty and human preference uncertainty into the decision-making process. It uses ensemble disagreement to capture model uncertainty and variability in reward annotations to assess preference uncertainty. The Reliability Filter adjusts the reward signal based on these uncertainties, encouraging cautious behavior in the presence of ambiguous rewards.
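A hedged sketch of the dual-source discounting idea: shrink the reward toward zero as ensemble disagreement and annotation variance grow. The functional form and constants are assumptions, not the paper's exact Reliability Filter.

```python
import numpy as np

def discounted_reward(ensemble_rewards, annotation_rewards,
                      k_epistemic=1.0, k_preference=1.0):
    r = np.mean(ensemble_rewards)
    u_epi = np.std(ensemble_rewards)     # disagreement across value ensemble
    u_pref = np.std(annotation_rewards)  # variability across human labels
    confidence = 1.0 / (1.0 + k_epistemic * u_epi + k_preference * u_pref)
    return confidence * r                # high uncertainty -> heavy discount

print(discounted_reward([1.0, 1.1, 0.9], [1.0, 1.0, 1.0]))    # trusted reward
print(discounted_reward([2.0, -1.0, 3.0], [2.0, 0.0, -2.0]))  # discounted hard
```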
Results
The empirical results show that UARD significantly reduces exploitative behaviors, with trap visitation frequencies dropping by 93.7% across multiple configurations. The approach also demonstrates statistical significance (p < 0.05) and resilience against supervisory noise, outperforming standard RL baselines and other uncertainty-aware methods.
Implications
This research has significant implications for the deployment of RL systems in safety-critical applications, as it provides a framework to ensure that agents behave reliably and align with human intentions, even in the presence of uncertain and noisy reward signals.
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
Large Language Models
Efficient ML
- CoQuant proposes a joint weight-activation subspace projection method for mixed-precision quantization.
- The method addresses the limitations of existing quantization techniques that rely solely on activation statistics.
- CoQuant demonstrates superior performance in perplexity and zero-shot reasoning tasks compared to strong PTQ baselines.
- The approach provides a principled framework for low-bit quantization in Large Language Models.
Read more
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
Summary
The paper introduces CoQuant, a novel method for post-training quantization (PTQ) aimed at improving the efficiency of Large Language Models (LLMs) through mixed-precision quantization. Traditional methods often rely solely on activation statistics to construct high-precision subspaces, neglecting the joint influence of weight and activation quantization noise on output perturbation. CoQuant addresses this limitation by jointly modeling the quantization effects of both weights and activations, leading to a closed-form weighted PCA solution that optimally selects the high-precision subspace. Extensive experiments conducted on Llama-3.2 and Qwen2.5 demonstrate that CoQuant consistently outperforms existing PTQ baselines, achieving better perplexity and zero-shot reasoning accuracy. The findings suggest that a joint approach to weight and activation subspace modeling is a promising direction for low-bit quantization in LLMs, providing a more effective strategy for maintaining model performance while reducing inference costs.
Methodology
CoQuant formulates a closed-form weighted PCA solution that balances activation and weight covariances to select the optimal high-precision subspace. The method derives a unified objective for joint weight-activation quantization based on the expected output error of linear layers, allowing for a comprehensive modeling of quantization effects.
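The closed-form flavor of the method can be sketched as a weighted PCA over a combined weight-activation second-moment matrix, keeping the top-k eigenvectors as the high-precision subspace. The balance term `alpha` and the exact combination are assumptions; CoQuant's derived weighting may differ.

```python
import numpy as np

def joint_subspace(X, W, k, alpha=0.5):
    """X: (n_samples, d_in) calibration activations; W: (d_out, d_in) weights.
    Returns an orthonormal basis (d_in, k) for the high-precision subspace."""
    cov_act = X.T @ X / len(X)      # activation second moment
    cov_wgt = W.T @ W / len(W)      # weight second moment
    M = (1 - alpha) * cov_act + alpha * cov_wgt
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]

X = np.random.randn(512, 64)
W = np.random.randn(128, 64)
U = joint_subspace(X, W, k=8)
print(U.shape, np.allclose(U.T @ U, np.eye(8)))  # (64, 8) True
```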
Results
Experimental results indicate that CoQuant achieves the best accuracy-efficiency trade-off across various model scales, yielding lower perplexity and improved zero-shot performance in comparison to existing PTQ methods. This demonstrates the effectiveness of joint modeling in reducing quantization error.
Implications
The findings suggest that joint weight-activation modeling can significantly enhance the performance of quantized LLMs, making it a valuable approach for deploying large models in resource-constrained environments. This could lead to broader applications of LLMs in real-time systems where efficiency is critical.
A Multi-Dataset Benchmark of Multiple Instance Learning for 3D Neuroimage Classification
Computer Vision
Efficient ML
- Mean pooling MIL outperforms or matches advanced MIL and 3D CNN methods on multiple datasets.
- Attention-based MIL methods do not provide significant gains in performance compared to simple mean pooling.
- The study highlights the efficiency of mean pooling MIL, being 25 times faster to train than complex alternatives.
- A semi-synthetic dataset analysis reveals limitations in current MIL approaches, indicating potential for future improvements.
Read more
A Multi-Dataset Benchmark of Multiple Instance Learning for 3D Neuroimage Classification
Summary
This paper presents a comprehensive evaluation of multiple instance learning (MIL) techniques for classifying 3D neuroimages, specifically focusing on CT and MRI scans. The authors compare various architectures, including simple MIL, attention-based MIL, 3D convolutional neural networks (CNNs), and 3D vision transformers (ViTs) across three CT and four MRI datasets, including two large datasets with over 10,000 scans. The study aims to provide insights for practitioners with limited resources on effective neural network choices for 3D neuroimaging tasks. The findings reveal that a straightforward mean pooling MIL approach can match or outperform more complex methods, including attention-based models, on several tasks while being significantly faster to train. The authors also analyze the performance of learned attention mechanisms and propose that existing MIL methods may not leverage the data effectively, suggesting areas for future research and innovation in MIL design.
Methodology
The authors conducted a systematic comparison of different MIL architectures, including simple mean pooling, attention-based methods, and traditional 3D CNNs, across multiple datasets. They evaluated the performance based on AUROC metrics and analyzed the quality of learned attention using instance-level labels from the RSNA CT dataset. Additionally, a semi-synthetic dataset was created to better understand the limitations of existing MIL methods.
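Mean-pooling MIL is simple enough to show in full: embed each 2D slice of a scan, average the embeddings, classify the pooled vector. The encoder and sizes below are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class MeanPoolMIL(nn.Module):
    def __init__(self, d_embed=128, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(   # stand-in per-slice feature extractor
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, d_embed)
        )
        self.head = nn.Linear(d_embed, n_classes)

    def forward(self, bag):              # bag: (n_slices, 1, H, W)
        z = self.encoder(bag)            # (n_slices, d_embed)
        return self.head(z.mean(dim=0))  # pool instances, then classify

scan = torch.randn(64, 1, 96, 96)        # one 3D scan as 64 axial slices
print(MeanPoolMIL()(scan).shape)         # torch.Size([2])
```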
Results
The results indicated that mean pooling MIL consistently matched or outperformed attention-based methods and 3D CNNs on four out of six moderate-sized tasks and remained competitive on larger datasets. The analysis of learned attention showed that no attention-based method outperformed a simple Gaussian baseline, and the semi-synthetic dataset analysis revealed that current MIL methods scored significantly lower than the best possible classifier.
Implications
The findings suggest that practitioners should consider using mean pooling MIL as a strong baseline for 3D neuroimage classification tasks. The study also highlights the need for further research into improving MIL methodologies, particularly in leveraging instance interactions and enhancing attention mechanisms.
Optimization-Free Topological Sort for Causal Discovery via the Schur Complement of Score Jacobians
Graph Learning
Theory
Efficient ML
- Introduction of Score-Schur Topological Sort (SSTS) for causal discovery.
- Decoupling of representation learning from structural optimization to improve scalability.
- Exact algebraic mapping of acyclicity in linear Gaussian models using Schur complements.
- Development of Block-SSTS to address non-linear systems and reduce structural error.
Read more
Optimization-Free Topological Sort for Causal Discovery via the Schur Complement of Score Jacobians
Summary
This paper addresses the challenges of continuous causal discovery, which often relies on non-convex optimization methods that can lead to local optima and scalability issues in high-dimensional settings. The authors propose a novel approach called Score-Schur Topological Sort (SSTS), which decouples the process of representation learning from structural optimization. By utilizing the Schur complement of the Score-Jacobian Information Matrix (SJIM), SSTS allows for the extraction of topological order directly from generative models without the need for constrained optimization. The authors demonstrate that in linear Gaussian Additive Noise Models (ANMs), the causal hierarchy can be accurately represented through algebraic operations, achieving a computational complexity of O(d^3). For non-linear systems, they introduce Block-SSTS to manage structural errors and enhance scalability, allowing for causal analysis on graphs with up to 1000 variables. The findings suggest that by circumventing the non-convex optimization bottleneck, the fidelity of causal discovery is primarily limited by the finite-sample estimation variance of the score geometry.
Methodology
The methodology involves two main stages: first, a generative model is trained to estimate the score function of the data, and second, the topological order is extracted using algebraic operations based on the Schur complement of the Score-Jacobian Information Matrix. This approach eliminates the need for iterative model retraining or constrained optimization, thus streamlining the causal discovery process.
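For intuition, here is a toy version of the linear-Gaussian case: with equal noise variances the score Jacobian equals minus the precision matrix, a sink node has the smallest diagonal entry, and marginalizing a sink out is exactly a Schur complement. Repeating this recovers a topological order with no optimization loop. The equal-variance assumption and this specific sink criterion are simplifications of the paper's SJIM construction.

```python
import numpy as np

def topological_sort_schur(precision):
    theta = precision.copy()
    remaining = list(range(len(theta)))
    order = []
    while remaining:
        k = int(np.argmin(np.diag(theta)))  # sink has minimal Theta_ii
        order.append(remaining.pop(k))
        keep = [i for i in range(len(theta)) if i != k]
        # Schur complement = precision of the marginal over remaining nodes.
        theta = (theta[np.ix_(keep, keep)]
                 - np.outer(theta[keep, k], theta[k, keep]) / theta[k, k])
    return order[::-1]                      # sinks are found first

# Toy SEM: x0 -> x1 -> x2 with unit noise; Theta = (I - B)^T (I - B).
B = np.array([[0.0, 0.0, 0.0], [0.8, 0.0, 0.0], [0.0, 0.5, 0.0]])
theta = (np.eye(3) - B).T @ (np.eye(3) - B)
print(topological_sort_schur(theta))        # [0, 1, 2]
```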
Results
The SSTS algorithm was empirically tested on non-linear graphs with up to 1000 variables, demonstrating that it effectively bypasses the non-convex optimization challenges typically faced in causal discovery. The results indicate that the structural fidelity of the discovered causal relationships is primarily influenced by the estimation variance of the score geometry rather than optimization constraints.
Implications
The proposed framework has significant implications for scalable causal discovery in high-dimensional data settings, potentially benefiting fields such as epidemiology, economics, and any domain where understanding causal relationships from observational data is crucial. By reframing the problem as a statistical estimation challenge, it opens new avenues for research and application in causal inference.
PPG-Based Affect Recognition with Long-Range Deep Models: A Measurement-Driven Comparison of CNN, Transformer, and Mamba Architectures
Time Series
- Comparison of CNN, CNN–LSTM, Transformer, and Mamba architectures for PPG-based affect recognition.
- Transformers and Mamba models show comparable performance to CNNs but do not consistently outperform them.
- CNNs provide the highest accuracy with the smallest model size, making them the most effective overall.
- Transformers achieve better F1 scores for specific emotional states like arousal and relaxation.
Read more
PPG-Based Affect Recognition with Long-Range Deep Models: A Measurement-Driven Comparison of CNN, Transformer, and Mamba Architectures
Summary
This paper investigates the effectiveness of various deep learning architectures for affect recognition using wrist-based photoplethysmography (PPG) signals. The authors compare Convolutional Neural Networks (CNNs), CNN–LSTM hybrids, Transformers, and Mamba models in classifying emotional states such as arousal, valence, and relaxation. The study addresses the challenges posed by small and noisy datasets in wearable affective computing. Using a subject-independent 5-fold cross-validation protocol, all models underwent identical preprocessing and training procedures. The results indicate that while Transformers and Mamba models perform comparably to CNNs, they do not consistently outperform them across all tasks. CNNs demonstrated the highest accuracy and smallest model size, making them the most effective overall. However, Transformers showed a better balance of F1 scores for specific emotional states like arousal and relaxation. This research provides the first evaluation of Transformer and Mamba architectures for PPG-based affect recognition and offers practical guidance for model selection in wearable affective monitoring systems.
Methodology
The authors conducted a measurement-driven comparison of four deep learning architectures (CNN, CNN–LSTM hybrid, Transformers, and Mamba) using the WARM-VR dataset. They employed a subject-independent 5-fold cross-validation protocol with identical preprocessing, segmentation, and training pipelines to evaluate model performance in classifying emotional states from wrist-based PPG signals.
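Subject-independent cross-validation means no wearer's segments appear in both train and test folds; scikit-learn's GroupKFold expresses this directly. The arrays below are random stand-ins for segmented PPG windows.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.randn(1000, 256)             # 1000 PPG segments, 256 samples each
y = np.random.randint(0, 2, 1000)          # binary affect label (e.g. arousal)
subjects = np.random.randint(0, 25, 1000)  # segment -> wearer id

for fold, (tr, te) in enumerate(GroupKFold(n_splits=5).split(X, y, subjects)):
    assert not set(subjects[tr]) & set(subjects[te])  # no subject leakage
    print(f"fold {fold}: {len(tr)} train / {len(te)} test segments")
```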
Results
The study found that CNNs outperformed other architectures in terms of accuracy and model size. Transformers and Mamba models achieved performance levels comparable to CNNs but did not consistently exceed CNN performance across all tasks. Transformers provided a better balance of F1 scores for arousal and relaxation states.
Implications
The findings suggest that while advanced architectures like Transformers and Mamba can be beneficial, traditional CNNs remain a strong choice for PPG-based affect recognition, particularly in real-world applications where dataset limitations exist. This research can guide the development of more effective wearable affective monitoring systems.
VAE-Inf: A statistically interpretable generative paradigm for imbalanced classification
Generative Models
Theory
Interpretability
- Introduces VAE-Inf, a two-stage framework for imbalanced classification.
- First stage involves training a VAE on majority-class data to establish a reference distribution.
- Second stage fine-tunes the model with minority samples using a distribution-aware loss.
- Provides a statistically interpretable inference strategy with controlled error rates.
Read more
VAE-Inf: A statistically interpretable generative paradigm for imbalanced classification
Summary
The paper addresses the challenge of imbalanced classification, where the minority class is underrepresented, leading to unstable decision boundaries in conventional models. The authors propose a two-stage framework called VAE-Inf, which combines deep representation learning with statistically interpretable hypothesis testing. In the first stage, a Variational Autoencoder (VAE) is trained solely on majority-class data to learn a reference distribution, resulting in a global Gaussian model. In the second stage, the encoder is fine-tuned using limited minority samples with a novel distribution-aware loss that enhances class separability. This approach allows for a hypothesis testing interpretation of the inference process, ensuring finite-sample control of Type-I error without restrictive assumptions. The framework demonstrates competitive performance on various real-world benchmarks, highlighting its effectiveness in improving minority class detection even with minimal labeled data.
Methodology
The methodology consists of two main stages: (1) Training a Variational Autoencoder (VAE) on majority-class data to capture the underlying distribution and create a global Gaussian reference model. (2) Fine-tuning the encoder with limited minority samples using a distribution-aware loss function that promotes class separability while maintaining the majority class structure.
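A hedged sketch of the inference step: score a sample's encoded distance to the majority-class latent reference and flag it as minority when the score exceeds a threshold calibrated on held-out majority data. The empirical-quantile calibration shown (a conformal-style construction) is one way to get finite-sample Type-I control; the paper's exact test statistic may differ.

```python
import numpy as np

def calibrate_threshold(majority_latents, alpha=0.05):
    scores = np.linalg.norm(majority_latents, axis=1)  # distance to N(0, I)
    # Conformal-style order statistic for finite-sample Type-I control:
    k = int(np.ceil((1 - alpha) * (len(scores) + 1))) - 1
    return np.sort(scores)[min(k, len(scores) - 1)]

rng = np.random.default_rng(0)
z_majority = rng.normal(size=(500, 8))         # stand-in encoder outputs
z_query = rng.normal(loc=2.5, size=(10, 8))    # shifted: minority-like

tau = calibrate_threshold(z_majority, alpha=0.05)
flags = np.linalg.norm(z_query, axis=1) > tau
print(f"threshold={tau:.2f}, flagged {flags.sum()}/10 as minority")
```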
Results
The VAE-Inf framework shows significant improvements in detecting minority samples compared to traditional methods, achieving competitive performance across diverse real-world benchmarks. The results indicate that even a small amount of labeled minority data can enhance detection capabilities, demonstrating the framework's robustness in extreme class imbalance scenarios.
Implications
The proposed framework has potential applications in various fields where imbalanced classification is critical, such as medical diagnosis, fraud detection, and anomaly detection. Its ability to effectively utilize limited minority data while maintaining statistical rigor could lead to better decision-making in high-stakes environments.
Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing
Computer Vision
Robotics
- Introduction of a unified framework for runtime monitoring in safety-critical ML applications.
- Categorization of monitoring approaches into ODD, OOD, and OMS types.
- Demonstration of the framework's application through runway detection in aviation.
- Establishment of common safety-oriented metrics for evaluating monitoring methods.
Read more
Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing
Summary
This paper addresses the critical need for runtime monitoring in safety-critical machine learning (ML) applications, particularly in aviation. The authors identify the fragmentation in current monitoring approaches across different research communities and propose a unified framework called SwMF. This framework categorizes runtime monitoring into three types: Operational Design Domain (ODD) monitoring, which ensures compliance with expected operating conditions; Out-of-Distribution (OOD) monitoring, which rejects inputs that deviate from training data; and Out-of-Model-Scope (OMS) monitoring, which detects anomalous model behavior based on internal states or outputs. The authors validate their framework through an experimental study focused on runway detection during landing, demonstrating how the unified approach facilitates the design and evaluation of monitoring activities. The paper emphasizes the importance of a common understanding and terminology in the field to enhance safety in ML applications.
Methodology
The authors developed the SwMF framework to categorize and unify various runtime monitoring approaches. They conducted an experimental study on a vision-based landing task, specifically runway detection, to illustrate the practical application of their framework. The study involved designing monitoring activities and evaluating them using safety-oriented metrics.
Results
The experimental results demonstrated the effectiveness of the SwMF framework in organizing and implementing monitoring strategies for the runway detection task. The categorization allowed for a clearer evaluation of different monitoring methods, highlighting their complementary roles in ensuring safety in ML applications.
Implications
The proposed framework has significant implications for the integration of ML technologies in safety-critical domains, such as aviation and autonomous driving. By providing a structured approach to runtime monitoring, it enhances the reliability and safety of ML systems, potentially influencing regulatory standards and practices in these industries.
Cheeger–Hodge Contrastive Learning for Structurally Robust Graph Representation Learning
Graph Learning
- Introduction of Cheeger–Hodge joint signature for robust graph representation learning.
- CHCL framework aligns graph embeddings with a perturbation-stable structural consistency target.
- Demonstrated effectiveness through extensive experiments on standard benchmarks.
- Improves upon traditional GCL methods by reducing reliance on augmentation heuristics.
Read more
Cheeger–Hodge Contrastive Learning for Structurally Robust Graph Representation Learning
Summary
This paper introduces Cheeger–Hodge Contrastive Learning (CHCL), a novel framework for unsupervised graph representation learning that enhances robustness against structural perturbations. Traditional Graph Contrastive Learning (GCL) methods often rely on data augmentation to define invariances, which can be fragile under local structural changes. CHCL addresses this by proposing a perturbation-stable Cheeger–Hodge joint signature that combines a Cheeger-inspired connectivity signature derived from algebraic connectivity (λ2) with the low-frequency spectrum of the 1-Hodge Laplacian. This joint signature captures both global connectivity and higher-order structural information, providing a more stable target for contrastive learning. The framework aligns encoder representations with this signature across augmented views, leading to graph embeddings that are resilient to local structural changes. Extensive experiments demonstrate that CHCL consistently outperforms existing methods across various benchmarks, showcasing improved performance, robustness, and generalization capabilities.
Methodology
The CHCL framework integrates a Cheeger-inspired connectivity signature with the low-frequency spectrum of the 1-Hodge Laplacian to form a unified structural signature. This signature serves as a consistency target in the contrastive learning objective, allowing the model to learn robust representations that are less sensitive to local perturbations. The methodology involves aligning encoder representations across augmented views based on this joint signature.
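The Cheeger-inspired half of the signature rests on the algebraic connectivity λ2 — the second-smallest Laplacian eigenvalue, which Cheeger-type inequalities tie to how cheaply a graph can be cut. A minimal computation follows (the 1-Hodge Laplacian half needs simplicial machinery and is omitted):

```python
import numpy as np
import networkx as nx

G = nx.barbell_graph(10, 1)               # two cliques joined by a bottleneck
L = nx.laplacian_matrix(G).toarray().astype(float)
lam2 = np.sort(np.linalg.eigvalsh(L))[1]  # second-smallest eigenvalue
print(f"algebraic connectivity lambda2 = {lam2:.4f}")  # small: easy to cut
```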
Results
The experiments conducted across multiple graph domains and learning settings indicate that CHCL consistently outperforms strong baseline methods in terms of performance, robustness, and generalization. The results highlight the effectiveness of the proposed Cheeger–Hodge joint signature in enhancing the stability and quality of learned graph representations.
Implications
The findings suggest that incorporating graph-intrinsic structural targets can significantly improve the robustness of graph representation learning methods. This has potential applications in various domains, including social network analysis, molecular property prediction, and other areas where graph structures are prevalent and may be subject to perturbations.
Multiple Additive Neural Networks for Structured and Unstructured Data
Multimodal
Theory
Efficient ML
- MANN replaces decision trees with shallow neural networks in the Gradient Boosting framework.
- The approach integrates Capsule Neural Networks for feature extraction in structured data and CNNs for unstructured data.
- MANN incorporates continuous learning mechanisms to adapt to new data and combat overfitting.
- Empirical results show MANN's superior accuracy compared to traditional boosting methods like XGB.
Read more
Multiple Additive Neural Networks for Structured and Unstructured Data
Summary
This paper introduces Multiple Additive Neural Networks (MANN), an innovative enhancement to the traditional Gradient Boosting framework that employs shallow neural networks as base learners instead of decision trees. MANN utilizes Convolutional Neural Networks (CNNs) and Capsule Neural Networks to effectively handle both structured and unstructured data, such as images and audio. The methodology emphasizes continuous learning, allowing the model to adapt to new data dynamically while employing heuristics to mitigate overfitting. Empirical studies demonstrate that MANN outperforms traditional methods like Extreme Gradient Boosting (XGB) in terms of accuracy across various benchmark datasets. The research highlights MANN's superior precision and generalizability, making it a versatile tool for complex learning environments and diverse data types. The paper also discusses the architecture's ability to simplify hyperparameter tuning and enhance model robustness, ultimately contributing to the ongoing discourse on overfitting and adaptivity in machine learning.
Methodology
The MANN algorithm builds upon the Gradient Boosting framework by using shallow neural networks as base learners. It employs Capsule Neural Networks for structured data and CNNs for unstructured data, integrating heuristics to prevent overfitting. The architecture supports continuous learning through two modalities: modifying existing neural networks and utilizing residuals to fit new models.
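The core substitution — shallow neural networks fitting residuals inside a boosting loop — can be sketched in a few lines, with scikit-learn's MLPRegressor standing in for the paper's base learners; the capsule/CNN feature extractors are omitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

learners, lr = [], 0.3
pred = np.full(len(y), y.mean())          # initial constant model
for step in range(10):
    residual = y - pred                   # fit each shallow net to residuals
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=step)
    net.fit(X, residual)
    learners.append(net)
    pred += lr * net.predict(X)
    print(f"round {step}: MSE={np.mean((y - pred) ** 2):.1f}")

def predict(X_new):
    return y.mean() + lr * sum(n.predict(X_new) for n in learners)
```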
Results
MANN demonstrated superior accuracy over traditional methods such as XGBoost across multiple benchmark datasets, showcasing its effectiveness in both structured and unstructured data contexts. The algorithm's design also facilitated easier management and reduced sensitivity to hyperparameter settings.
Implications
MANN's ability to handle diverse data types and its continuous learning capabilities make it suitable for real-world applications where data is constantly evolving. Its robustness against overfitting and ease of use could lead to broader adoption in various machine learning tasks.
KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment
Computer Vision
- KAYRA utilizes a microservice architecture for karyotyping, enhancing deployment flexibility.
- The system integrates multiple machine learning models for improved segmentation and classification accuracy.
- Clinical evaluation shows KAYRA outperforms existing commercial karyotyping systems in key metrics.
- The architecture supports both cloud and on-premise deployments, addressing patient-data residency concerns.
Read more
KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment
Summary
The paper presents KAYRA, an innovative end-to-end karyotyping system designed to function within the constraints of clinical cytogenetic laboratories. KAYRA employs a containerized microservice architecture that integrates a multi-model machine learning stack, including EfficientNet-B5 + U-Net for semantic segmentation, Mask R-CNN for instance detection, and ResNet-18 for classification. This architecture allows for flexible deployment options, supporting both cloud and on-premise installations, which is crucial for compliance with patient-data residency requirements. A pilot clinical evaluation was conducted using 459 chromosomes from 10 metaphase spreads, comparing KAYRA's performance against two commercial karyotyping systems. KAYRA achieved superior segmentation accuracy (98.91% vs. 78.21% and 40.52%) and classification accuracy (89.1% vs. 86.9% and 54.5%), while its rotation accuracy of 89.76% fell between the two reference systems (94.55% and 78.43%). KAYRA's architecture not only enhances performance metrics but also integrates a human-in-the-loop expert-review workflow essential for diagnostic practices. The findings suggest that a multi-model cytogenetic AI service can be effectively packaged as a microservice, achieving strong empirical performance while adhering to clinical deployment constraints.
Methodology
KAYRA's methodology involves a microservice architecture that decomposes karyotyping tasks into specialized containerized services. It employs a cascaded ROI-narrowing strategy, starting with Otsu thresholding for initial cropping, followed by U-Net for semantic refinement, and finally applying Mask R-CNN for instance segmentation. The system is validated through a pilot study comparing its performance against commercial systems.
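The first stage of the cascade, Otsu thresholding for the initial crop, is a few lines with scikit-image; the synthetic image below stands in for a metaphase spread, and the learned U-Net and Mask R-CNN stages are omitted.

```python
import numpy as np
from skimage.filters import threshold_otsu

img = np.zeros((512, 512))
img[180:330, 200:350] = np.random.rand(150, 150) + 0.5  # bright foreground

mask = img > threshold_otsu(img)   # global Otsu binarization
rows, cols = np.where(mask)
crop = img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
print(crop.shape)  # tight bounding box handed to U-Net, then Mask R-CNN
```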
Results
In a clinical evaluation involving 459 chromosomes, KAYRA achieved a segmentation accuracy of 98.91%, classification accuracy of 89.1%, and rotation accuracy of 89.76%. Segmentation and classification accuracy surpassed both reference systems with statistical significance, while rotation accuracy fell between them.
Implications
KAYRA's architecture and performance suggest significant potential for improving karyotyping processes in clinical settings, reducing the time and expertise required for chromosome analysis. Its dual deployment capability makes it suitable for various clinical environments, potentially enhancing diagnostic workflows and patient outcomes.
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment
Theory
- Positive gradient alignment between trait and distillation gradients persists throughout multi-step training.
- Removing the trait-aligned component of the distillation gradient effectively stops trait acquisition.
- Liminal training reduces alignment but does not prevent trait acquisition, indicating limitations in current mitigation methods.
- The study provides empirical evidence supporting the causal relationship between gradient alignment and subliminal learning.
Read more
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment
Summary
This paper investigates the phenomenon of subliminal learning in the context of knowledge distillation (KD) using the MNIST dataset. The authors focus on how a student model can unintentionally acquire traits from a teacher model, even when distilling only on auxiliary (non-class) logits. They challenge the single-step gradient-descent assumption of existing subliminal learning theory by empirically demonstrating that gradient alignment between the trait and distillation gradients persists throughout multi-step training. The study employs the MNIST MLP auxiliary logit distillation experiment to show that this sustained alignment causally contributes to the acquisition of teacher traits. Additionally, the authors evaluate a mitigation method called liminal training, which aims to reduce alignment but fails to suppress trait acquisition, suggesting that current mitigation strategies may not be effective when first-order effects dominate. Overall, the findings highlight the complexities of subliminal learning and the limitations of existing mitigation techniques in multi-step training scenarios.
Methodology
The authors conducted experiments using the MNIST MLP classifier in an auxiliary logit distillation setting, where both the teacher and student models shared identical initializations. They monitored gradient alignment during training and applied an ablation technique to remove the trait-aligned component of the distillation gradient to assess its impact on trait acquisition.
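The ablation is a vector operation: measure the distillation gradient's alignment with the trait gradient, then subtract the trait-aligned component before updating. Flattened random vectors stand in for real per-parameter gradients.

```python
import torch

def remove_aligned_component(g_distill: torch.Tensor, g_trait: torch.Tensor):
    unit = g_trait / g_trait.norm()
    alignment = torch.dot(g_distill, unit)    # signed magnitude along trait
    g_ablated = g_distill - alignment * unit  # orthogonal remainder
    cosine = alignment / g_distill.norm()
    return g_ablated, cosine.item()

g_d = torch.randn(10_000)
g_t = torch.randn(10_000)
g_abl, cos = remove_aligned_component(g_d, g_t)
print(f"cosine={cos:+.4f}, residual dot={torch.dot(g_abl, g_t):+.2e}")  # ~0
```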
Results
The experiments revealed that the average final test accuracy of the student model was 55.28% with a consistent positive gradient alignment throughout training. When the trait-aligned component was removed, the accuracy dropped to 10.14%, indicating that alignment plays a crucial role in trait acquisition. Liminal training was shown to reduce alignment but did not effectively suppress trait acquisition.
Implications
The findings suggest that subliminal learning poses a risk in knowledge distillation, particularly when models inherit misaligned traits from teachers. This has implications for the deployment of distilled models in real-world applications, highlighting the need for improved mitigation strategies to prevent unintended trait acquisition.
Negative Ontology of True Target for Machine Learning: Towards Evaluation and Learning under Democratic Supervision
Theory
- Challenges the traditional assumption of the objective existence of the true target in ML.
- Introduces the concept of Democratic Supervision, promoting a participatory approach to supervision.
- Defines Multiple Inaccurate True Targets (MIATTs) as a practical application of Democratic Supervision.
- Develops the EL-MIATTs framework for evaluation and learning in ML.
Read more
Negative Ontology of True Target for Machine Learning: Towards Evaluation and Learning under Democratic Supervision
Summary
This paper explores the philosophical implications of the true target (TT) concept in machine learning (ML) by adopting a negative ontology perspective, which posits that the TT does not objectively exist in the real world. The author critiques mainstream ML paradigms that assume the objective existence of TT and introduces the concept of Democratic Supervision, which emphasizes collaborative and participatory approaches to supervision in ML. The paper presents Multiple Inaccurate True Targets (MIATTs) as a practical realization of Democratic Supervision, leading to the development of the evaluation and learning with MIATTs (EL-MIATTs) framework. This framework is designed to address the challenges posed by ambiguous and subjective TTs in real-world applications. A real-world case study demonstrates the framework's effectiveness in supporting education and professional development, highlighting its potential to reshape evaluation and learning paradigms in ML.
Methodology
The paper employs a philosophical analysis of the true target assumptions in existing ML paradigms, leading to the formulation of Democratic Supervision and MIATTs. It further develops the EL-MIATTs framework, which includes principles for generating and assessing MIATTs, as well as a logical assessment formulation for evaluation.
Results
The EL-MIATTs framework was successfully applied in a real-world scenario, demonstrating its potential to enhance evaluation and learning processes in ML by accommodating the complexities of ambiguous true targets.
Implications
The findings suggest a shift in how ML paradigms can be designed and evaluated, promoting inclusivity and flexibility in data construction and supervision. This approach could lead to more robust ML models that better reflect the realities of data acquisition and annotation.
Hankel and Toeplitz Rank-1 Decomposition of Arbitrary Matrices with Applications to Signal Direction-of-Arrival Estimation
Time Series
Optimization
Robotics
- Development of efficient algorithms for Hankel and Toeplitz rank-1 matrix approximations.
- Estimation methods are shown to be maximum-likelihood optimal under specific noise conditions.
- Robustness against outliers is achieved through the use of L1-norm formulations.
- Extensive validation through simulations and real-world data demonstrates practical applicability.
Read more
Hankel and Toeplitz Rank-1 Decomposition of Arbitrary Matrices with Applications to Signal Direction-of-Arrival Estimation
Summary
This paper addresses the computation of optimal rank-1 Hankel and Toeplitz approximations of arbitrary matrices under L2 and L1-norm errors, which are critical in various engineering applications, particularly in signal Direction-of-Arrival (DoA) estimation for autonomous systems. The authors propose efficient algorithms for structured matrix decomposition, ensuring that the resulting estimators are maximum-likelihood optimal under white Gaussian and Laplace noise conditions, respectively. The study emphasizes the importance of structured low-rank approximation (SLRA) in modeling dynamic systems and introduces robust estimators that are validated through extensive simulations and real-world data experiments. The findings indicate that the proposed methods significantly enhance the accuracy and reliability of DoA estimation in practical scenarios, especially when dealing with noisy measurements and outliers.
Methodology
The authors utilize structured low-rank approximation techniques to derive rank-1 Hankel and Toeplitz matrix decompositions. They analyze the optimization problem under both L2 and L1 norms, employing algorithms such as Cadzow's method for alternating projections. The estimators are derived analytically and validated through simulations and real-world experiments.
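The L2 variant builds on Cadzow's alternating projections, which the summary names above. A generic sketch of the two projections, not the paper's optimized estimator:

```python
import numpy as np

def hankel_project(M: np.ndarray) -> np.ndarray:
    """Project onto Hankel matrices by averaging each anti-diagonal."""
    m, n = M.shape
    H = np.empty_like(M, dtype=float)
    for s in range(m + n - 1):  # anti-diagonal index: i + j = s
        idx = [(i, s - i) for i in range(max(0, s - n + 1), min(m, s + 1))]
        avg = np.mean([M[i, j] for i, j in idx])
        for i, j in idx:
            H[i, j] = avg
    return H

def rank1_project(M: np.ndarray) -> np.ndarray:
    """Project onto rank-1 matrices via the leading singular triplet."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

def cadzow_rank1_hankel(M: np.ndarray, iters: int = 200) -> np.ndarray:
    """Alternate the two projections, Cadzow-style, from an arbitrary matrix."""
    X = M.astype(float)
    for _ in range(iters):
        X = hankel_project(rank1_project(X))
    return X
```

Each pass projects onto the rank-1 matrices and then restores Hankel structure by anti-diagonal averaging; the L1 variant would replace these least-squares steps with outlier-robust counterparts.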
Results
The proposed estimators under L2 and L1 norms were validated to be maximum-likelihood optimal for their respective noise models. The simulation studies and real-world experiments confirmed the effectiveness of the methods in accurately estimating signal DoA, even in the presence of noise and outliers.
Implications
The findings have significant implications for the design of robust signal processing systems in autonomous applications, enhancing the reliability of DoA estimation in noisy environments. The methodologies can be extended to other areas requiring structured low-rank approximations.
Mini-Batch Class Composition Bias in Link Prediction
Graph Learning
- GNNs trained for link prediction may learn trivial heuristics based on mini-batch composition rather than meaningful graph features.
- Randomizing the class distribution in mini-batches improves alignment with node classification features, albeit at the cost of link prediction performance.
- The study challenges the assumption that link prediction models can generalize representations learned from node classification tasks.
Read more
Mini-Batch Class Composition Bias in Link Prediction
Summary
This paper investigates the biases introduced by mini-batch class composition in link prediction tasks using Graph Neural Networks (GNNs). The authors argue that while GNNs can learn transferable representations across graphs for node classification, this intuition does not hold for link prediction. They demonstrate that popular link prediction models can exploit a mini-batch dependent heuristic, facilitated by batch normalization layers, to predict edges without learning complex node class features. This leads to an overestimation of the models' ability to generalize across tasks. To address this issue, the authors propose randomizing the fraction of positive and negative edges in mini-batches, which, although resulting in decreased link prediction performance, enhances the alignment of learned representations with features relevant to node classification. Their findings suggest that current training practices may hinder the transferability of learned representations in link prediction tasks and highlight the need for improved training regimes to better capture the underlying properties of graphs.
Methodology
The authors analyze common link prediction models and their training procedures, focusing on the impact of mini-batch class composition. They implement a randomized mini-batch training regime to mitigate the identified bias and evaluate the effects on model performance and representation alignment.
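A minimal sketch of the randomized regime, assuming pre-sampled arrays of positive and negative edges with enough of each kind; all names are illustrative.

```python
import numpy as np

def sample_link_batch(pos_edges, neg_edges, batch_size, rng):
    """Draw one mini-batch whose positive fraction is itself random.

    Standard training fixes the positive:negative ratio (often 1:1), which
    lets a batch-normalized model infer labels from batch composition alone.
    Re-drawing the fraction for every batch removes that shortcut signal.
    """
    frac_pos = rng.uniform(0.0, 1.0)
    n_pos = int(round(frac_pos * batch_size))
    pos = pos_edges[rng.choice(len(pos_edges), n_pos, replace=False)]
    neg = neg_edges[rng.choice(len(neg_edges), batch_size - n_pos, replace=False)]
    edges = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(n_pos), np.zeros(batch_size - n_pos)])
    perm = rng.permutation(batch_size)
    return edges[perm], labels[perm]
```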
Results
The results indicate that standard training procedures for link prediction can lead to biased learning, where models fail to capture relevant node class features. By randomizing mini-batch compositions, the authors observe improved alignment with node classification features, suggesting a more accurate representation of the graph's properties, despite a decline in link prediction accuracy.
Implications
The findings imply that current methodologies in link prediction may need reevaluation to ensure that models learn meaningful representations that can be generalized across tasks. This could influence future research directions in GNN training practices and the development of more robust link prediction models.
Momentum-Conserving Graph Neural Networks for Deformable Objects
Graph Learning
Robotics
- Introduction of MomentumGNN, a GNN architecture that conserves momentum.
- Utilizes per-edge impulses for predicting bending and stretching forces.
- Employs a layer-by-layer update mechanism for vertex positions.
- Trained using a physics-based loss function in an unsupervised manner.
Read more
Momentum-Conserving Graph Neural Networks for Deformable Objects
Summary
This paper introduces MomentumGNN, a novel graph neural network (GNN) architecture specifically designed to accurately model the dynamics of deformable materials while conserving momentum. Traditional GNNs struggle with predicting the temporal evolution of physical quantities like linear and angular momentum, often leading to non-physical behaviors. MomentumGNN addresses these issues by predicting per-edge stretching and bending impulses, ensuring the conservation of momentum by design. The architecture modifies existing GNN frameworks, particularly MeshGraphNets, by implementing a layer-by-layer approach that sequentially updates vertex positions using momentum-conserving impulses. This method is trained in an unsupervised manner using a physics-based loss function, demonstrating superior performance over existing baselines in scenarios where momentum conservation is critical. The proposed architecture not only enhances the physical fidelity of simulations but also broadens the applicability of GNNs in various fields such as computer graphics, soft robotics, and engineering.
Methodology
The authors propose a GNN architecture that predicts momentum-conserving impulses rather than unconstrained nodal accelerations. The model is built upon the MeshGraphNets framework, replacing per-vertex decoders with per-edge decoders to ensure physical accuracy. A layer-by-layer architecture is employed, allowing for sequential updates of vertex positions based on predicted impulses. The training is conducted in an unsupervised manner using a physics-based loss function to enforce momentum conservation.
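The conservation argument is easy to verify in code: applying each predicted impulse as an equal-and-opposite pair leaves total linear momentum unchanged no matter what the network outputs (angular momentum additionally requires impulses directed along the edge, which this toy sketch does not enforce):

```python
import numpy as np

def apply_edge_impulses(velocities, masses, edges, impulses):
    """Apply one predicted impulse per edge as an equal-and-opposite pair.

    Because each impulse j is added to one endpoint and subtracted from the
    other, the total linear momentum sum(m_i * v_i) is unchanged by design,
    regardless of what the network predicts.
    """
    v = velocities.copy()
    for (a, b), j in zip(edges, impulses):
        v[a] += j / masses[a]
        v[b] -= j / masses[b]
    return v

rng = np.random.default_rng(0)
m = rng.uniform(1.0, 2.0, size=4)
v = rng.normal(size=(4, 3))
edges = [(0, 1), (1, 2), (2, 3)]
j = rng.normal(size=(3, 3))  # stand-ins for GNN edge outputs
v_new = apply_edge_impulses(v, m, edges, j)
print(np.allclose((m[:, None] * v).sum(0), (m[:, None] * v_new).sum(0)))  # True
```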
Results
MomentumGNN demonstrates improved accuracy in simulating the dynamics of deformable objects, particularly in scenarios involving free motion and collisions. The architecture effectively preserves linear and angular momentum, leading to more realistic simulations compared to traditional GNN approaches. The results indicate that MomentumGNN outperforms existing baselines in various common scenarios where momentum conservation is essential.
Implications
The development of MomentumGNN has significant implications for fields requiring accurate simulations of deformable materials, such as computer graphics, soft robotics, and engineering applications. By ensuring momentum conservation, the model enhances the realism and reliability of simulations, making it a valuable tool for both academic research and practical applications in industries like gaming, animation, and robotics.
Knowledge Distillation Must Account for What It Loses
Theory
- Knowledge distillation should account for lost capabilities, not just retained scores.
- Current evaluation methods often overlook critical off-metric capabilities.
- A taxonomy of off-metric losses is proposed to better understand distillation impacts.
- Scenario-specific preservation targets and a Distillation Loss Statement are introduced.
Read more
Knowledge Distillation Must Account for What It Loses
Summary
This position paper emphasizes the necessity for knowledge distillation to consider the capabilities that may be lost during the process of transferring knowledge from a teacher model to a student model. The author argues that current evaluations of distilled models often focus solely on retained task scores, neglecting other critical capabilities such as uncertainty, boundary behavior, and process reliability that are essential for reliable deployment. The paper identifies a retention assumption in existing evaluations, which falsely equates matching a primary metric with preserving all relevant teacher capabilities. To address this, the author proposes a taxonomy of measurable off-metric losses and introduces scenario-specific preservation targets alongside a 'Distillation Loss Statement' to transparently report what capabilities were preserved or lost. The goal is to shift the focus from merely retaining scores to ensuring accountable distillation that acknowledges and justifies any losses in capabilities.
Methodology
The author synthesizes existing literature to create a taxonomy of off-metric losses and proposes a structured reporting framework (Distillation Loss Statement) to evaluate distillation outcomes beyond primary metrics.
Results
The paper highlights that many distillation studies fail to measure or report on the loss of important capabilities, leading to a misunderstanding of the effectiveness of distilled models. The proposed framework aims to improve transparency and accountability in distillation practices.
Implications
This work has significant implications for the deployment of distilled models in real-world applications, emphasizing the need for comprehensive evaluations that consider reliability and safety, which are often overlooked in traditional metrics.
GCA-BULF: A Bottom-Up Framework for Short-Term Load Forecasting Using Grouped Critical Appliances
Time Series
- GCA-BULF is the first bottom-up STLF framework that selects critical appliances and incorporates appliance correlations.
- The framework reduces forecasting error by 20.85%-57.88% compared to top-down methods and by 33.03%-92.48% compared to existing bottom-up methods.
- The Critical Appliance Filtering module effectively identifies a minimal set of appliances that significantly influence total load trends.
- The Related Appliance Grouping module clusters appliances based on their usage correlations, enhancing group-level forecasting.
Read more
GCA-BULF: A Bottom-Up Framework for Short-Term Load Forecasting Using Grouped Critical Appliances
Summary
The paper presents GCA-BULF, a novel bottom-up framework for short-term load forecasting (STLF) that focuses on grouped critical appliances. Traditional forecasting methods, particularly top-down approaches, struggle to accurately predict electricity consumption due to the complex patterns of mixed appliance loads. GCA-BULF addresses this issue by integrating appliance-level data while minimizing the need for exhaustive monitoring of all appliances. The framework consists of three main components: (1) a Critical Appliance Filtering module that identifies and ranks appliances based on their impact on total load, (2) a Related Appliance Grouping module that clusters these critical appliances based on their spatial and temporal correlations, and (3) a Collaborative Load Forecasting module that refines total load predictions by combining forecasts from the grouped appliances. The authors evaluate GCA-BULF on residential and office building datasets, demonstrating significant improvements in forecasting accuracy over existing methods. This framework not only enhances prediction reliability but also supports energy management strategies in the context of time-of-use pricing.
Methodology
GCA-BULF employs a three-module approach: (1) Critical Appliance Filtering to select impactful appliances based on consumption metrics, (2) Related Appliance Grouping to cluster appliances using spatial and temporal correlation measures, and (3) Collaborative Load Forecasting to combine forecasts from grouped appliances for improved total load predictions.
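A hedged sketch of the grouping step, assuming appliance load profiles as rows of a matrix; the paper's exact correlation measure and clustering algorithm may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_appliances(loads: np.ndarray, n_groups: int) -> np.ndarray:
    """Cluster appliances whose load profiles are temporally correlated.

    loads: array of shape (n_appliances, n_timesteps). Pairwise correlation
    is turned into a distance (1 - r) and fed to average-linkage
    hierarchical clustering; returns a group label per appliance.
    """
    r = np.corrcoef(loads)            # pairwise usage correlation
    dist = 1.0 - r
    iu = np.triu_indices_from(dist, k=1)
    Z = linkage(dist[iu], method="average")  # condensed distances expected
    return fcluster(Z, t=n_groups, criterion="maxclust")
```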
Results
Experimental results indicate that GCA-BULF achieves a reduction in forecasting errors of 20.85%-57.88% compared to top-down methods and 33.03%-92.48% compared to existing bottom-up methods across residential and office building datasets.
Implications
The GCA-BULF framework has significant implications for energy management systems, particularly in optimizing load forecasting under time-of-use pricing schemes. It enables more efficient energy consumption strategies, contributing to grid stability and cost reduction for consumers.
NeuroPlastic: A Plasticity-Modulated Optimizer for Biologically Inspired Learning Dynamics
Optimization
- NeuroPlastic introduces a plasticity-modulated optimizer that combines gradient updates with additional signals inspired by biological learning mechanisms.
- The optimizer features a stabilization mechanism to regulate update magnitudes, ensuring stable optimization dynamics across different learning rates.
- Empirical evaluations show significant performance improvements over traditional gradient-only methods, particularly in data-scarce and challenging tasks.
- NeuroPlastic remains competitive without requiring retuning, making it a practical alternative for standard deep learning applications.
Read more
NeuroPlastic: A Plasticity-Modulated Optimizer for Biologically Inspired Learning Dynamics
Summary
The paper introduces NeuroPlastic, a novel optimizer that enhances traditional gradient-based optimization methods by incorporating principles of biological synaptic plasticity. Unlike conventional optimizers that rely solely on local gradient statistics, NeuroPlastic employs a multi-signal modulation mechanism that dynamically scales gradient updates based on three interacting components: gradient magnitude, an exponential moving average of gradient activity, and a memory-like term derived from Adam's moment estimates. This approach aims to improve learning efficiency, particularly in scenarios where learning signals are weak or noisy. The authors empirically validate NeuroPlastic across various image classification benchmarks, demonstrating consistent performance improvements over standard optimizers like SGD and Adam, especially in challenging tasks and limited data conditions. The findings suggest that integrating multi-signal plasticity principles into optimization can enhance the adaptability and effectiveness of deep learning models.
Methodology
NeuroPlastic employs a modulation coefficient derived from three normalized signals: gradient magnitude, an exponential moving average of gradient activity, and a memory term based on Adam's moment estimates. This coefficient scales the gradient updates, allowing for a more nuanced learning signal that adapts to the context of the training data.
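The paper's exact formula is not reproduced in this summary, so the following is an assumed illustration of how three normalized signals could jointly scale an Adam-style update; all constants and names are hypothetical.

```python
import numpy as np

def modulated_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative plasticity-modulated update (formula is assumed).

    The update scale is modulated by three normalized signals: instantaneous
    gradient magnitude, an EMA of gradient activity, and an Adam-style
    second-moment 'memory' term; the coefficient is squashed for stability.
    """
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * grad            # Adam first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2       # Adam second moment
    state["act"] = 0.99 * state["act"] + 0.01 * np.abs(grad)  # activity EMA
    g_mag = np.abs(grad) / (np.abs(grad).mean() + eps)
    act = state["act"] / (state["act"].mean() + eps)
    mem = np.sqrt(state["v"]) / (np.sqrt(state["v"]).mean() + eps)
    coeff = np.tanh(g_mag * act * mem)                        # bounded modulation
    return param - lr * coeff * state["m"] / (np.sqrt(state["v"]) + eps)

rng = np.random.default_rng(0)
w, g = rng.normal(size=100), rng.normal(size=100)
state = {"m": 0.0, "v": 0.0, "act": 0.0}  # broadcastable initial state
w = modulated_step(w, g, state)
```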
Results
NeuroPlastic consistently outperformed gradient-only ablation methods across various benchmarks, with notable improvements on the Fashion-MNIST dataset and stable performance on CIFAR-10 with ResNet-18 without the need for retuning.
Implications
The findings suggest that incorporating multi-signal plasticity into optimization strategies can enhance the performance of deep learning models, particularly in scenarios with limited or noisy data. This approach could lead to more robust and adaptable learning algorithms in various applications.
Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduction of Self-Alignment for Safety (SAS) for offline safe RL.
- Utilization of Lyapunov stability as an occupancy-measure criterion for safety.
- Transformer-based architecture allows for hierarchical RL interpretation.
- SAS enables safe test-time adaptation without retraining.
Read more
Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
Summary
The paper addresses the challenge of ensuring safety in offline reinforcement learning (RL) agents when deployed in real-world environments, where discrepancies between training datasets and actual conditions can lead to unsafe behaviors. The authors introduce a novel framework called Self-Alignment for Safety (SAS), which allows for test-time adaptation without the need for retraining. SAS employs a transformer-based architecture that utilizes self-alignment, where the pretrained agent generates multiple imagined trajectories and selects those that meet the Lyapunov stability condition. These selected trajectories are then used as in-context prompts to realign the agent's behavior towards safety. The framework effectively transforms Lyapunov-guided imagination into control-invariant prompts, facilitating a hierarchical interpretation of RL that resembles Bayesian inference over latent skills. The authors demonstrate the effectiveness of SAS through experiments on Safety Gymnasium and MuJoCo benchmarks, showing that it significantly reduces costs and failures while maintaining or improving overall performance.
Methodology
The SAS framework leverages a transformer architecture to generate imagined trajectories during test time. It identifies safe segments based on the Lyapunov condition and uses these segments as prompts to guide the agent's behavior. The approach integrates concepts from hierarchical RL and Bayesian inference to facilitate in-context learning without parameter updates.
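A minimal sketch of the trajectory-selection step, substituting a simple pointwise Lyapunov decrease check for the paper's occupancy-measure criterion; V and the tolerance are placeholders.

```python
import numpy as np

def select_safe_trajectories(trajectories, V, tol=0.0):
    """Keep imagined trajectories whose Lyapunov value is non-increasing.

    trajectories: list of state arrays of shape (T, state_dim).
    V: callable mapping a state to a scalar Lyapunov value.
    A trajectory passes if V never increases by more than tol per step;
    the surviving trajectories become in-context prompts.
    """
    safe = []
    for traj in trajectories:
        vals = np.array([V(s) for s in traj])
        if np.all(np.diff(vals) <= tol):
            safe.append(traj)
    return safe
```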
Results
SAS outperformed existing safe RL baselines in empirical tests, reducing cost and failure rates by up to a factor of two while maintaining or improving return across various benchmarks.
Implications
The proposed SAS framework has significant implications for deploying RL agents in safety-critical applications, such as robotics and autonomous systems, where ensuring safe behavior in unpredictable environments is paramount. It offers a scalable solution for adapting pretrained models to real-world conditions without extensive retraining.
Laplace-Bridged Randomized Smoothing for Fast Certified Robustness
Computer Vision
Efficient ML
Theory
- LBS eliminates the need for noise-augmented training, preserving clean accuracy.
- The method significantly reduces the computational cost of certification, making it feasible for edge devices.
- LBS achieves up to 494× speedup compared to traditional RS methods on devices like NVIDIA Jetson Orin Nano and Raspberry Pi 4.
- Theoretical foundations of LBS are established, ensuring the validity of the certification process.
Read more
Laplace-Bridged Randomized Smoothing for Fast Certified Robustness
Summary
This paper introduces Laplace-Bridged Smoothing (LBS), a novel approach to Randomized Smoothing (RS) that addresses two significant limitations of traditional RS: the reliance on noise-augmented training and the computational expense of certification. LBS reformulates RS by utilizing a low-dimensional probability space, allowing for efficient certification without the need for extensive noisy forward passes. The method analytically propagates input noise through a locally linearized feature extractor, approximating the noisy logit distribution and deriving a tractable Dirichlet surrogate for the smoothed predictive distribution. The authors demonstrate that LBS achieves stronger certified robustness on benchmark datasets CIFAR-10 and ImageNet while reducing the per-sample certification cost by nearly an order of magnitude. Furthermore, LBS shows remarkable speedups on resource-constrained devices, enabling practical certified deployment in real-world applications such as edge computing and robotics.
Methodology
The authors propose LBS as an analytic reformulation of RS, which replaces high-dimensional Monte Carlo sampling with low-dimensional Dirichlet sampling. This is achieved through a Gaussian–Dirichlet bridge that allows for efficient statistical certification within an ℓ2 radius. The method involves estimating the top-class probability using a one-sided Beta quantile applied to the Dirichlet parameters derived from linearized Gaussian logit statistics.
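The certification step can be sketched with standard randomized-smoothing machinery; in LBS the n votes would come from cheap low-dimensional Dirichlet draws rather than noisy forward passes, which is where the savings arise. A sketch under those assumptions:

```python
from scipy.stats import beta, norm

def certified_radius(k: int, n: int, sigma: float, alpha: float = 0.001) -> float:
    """L2 certified radius from a one-sided lower confidence bound.

    k of n samples voted for the top class; the Clopper-Pearson one-sided
    Beta quantile lower-bounds the top-class probability p_A, and the
    standard randomized-smoothing radius is sigma * Phi^{-1}(p_A).
    """
    p_lower = beta.ppf(alpha, k, n - k + 1)  # one-sided lower bound on p_A
    if p_lower <= 0.5:
        return 0.0                           # abstain: cannot certify
    return sigma * norm.ppf(p_lower)

print(certified_radius(k=990, n=1000, sigma=0.5))
```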
Results
LBS outperforms traditional RS in terms of certified robustness on CIFAR-10 and ImageNet, achieving a reduction in certification cost by nearly an order of magnitude. On resource-constrained devices, LBS demonstrates speedups of up to 494×, facilitating practical deployment of certified defenses in real-time applications.
Implications
The advancements presented in this paper have significant implications for the deployment of certified defenses in safety-critical applications, such as autonomous vehicles and robotics, where real-time decision-making is essential. The ability to certify robustness efficiently on edge devices opens new avenues for secure AI implementations in various domains.
Compute Aligned Training: Optimizing for Test Time Inference
NLP
Large Language Models
Reinforcement Learning
- Introduction of Compute Aligned Training (CAT) to align training objectives with test-time inference strategies.
- Derivation of new loss functions that improve performance during test-time scaling for LLMs.
- Empirical validation of CAT across multiple test-time strategies, showing substantial performance improvements.
- Unified framework that generalizes existing methods and addresses misalignment issues in training and inference.
Read more
Compute Aligned Training: Optimizing for Test Time Inference
Summary
This paper introduces Compute Aligned Training (CAT), a novel framework designed to optimize training objectives in alignment with test-time inference strategies, particularly for Large Language Models (LLMs). The authors argue that traditional training methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), focus on maximizing the likelihood of individual outputs, which creates a misalignment with test-time procedures that often rely on aggregated outputs. CAT conceptualizes inference strategies as operators that transform the base policy of the model, allowing for the derivation of new loss functions that enhance performance during test-time scaling. The authors empirically validate CAT across various test-time strategies, demonstrating significant improvements in performance compared to standard training methods. This work addresses a critical gap in aligning training and inference processes, providing a unified theoretical framework that can be applied across different modalities and inference strategies, thus enhancing the effectiveness of models during deployment.
Methodology
The methodology involves conceptualizing inference strategies as operators on the base policy of the model, allowing for the derivation of new loss functions that maximize performance when these strategies are applied. The authors perform gradient descent to minimize loss with respect to the transformed distribution, effectively aligning training with deployment environments without increasing training-time compute costs.
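A worked illustration of the misalignment for Pass@N (not the paper's exact loss): the aggregated objective 1 - (1 - p)^N implies a per-example gradient weight N(1 - p)^(N-1), which up-weights problems the model rarely solves.

```python
import numpy as np

def pass_at_n(p: np.ndarray, n: int) -> np.ndarray:
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p) ** n

def pass_at_n_weight(p: np.ndarray, n: int) -> np.ndarray:
    """Gradient of Pass@N w.r.t. per-sample success probability p.

    Relative to an objective that treats every example equally, Pass@N
    up-weights examples the model rarely solves and down-weights those it
    already solves reliably, since extra probability mass on an easy
    example barely changes the aggregated success rate.
    """
    return n * (1.0 - p) ** (n - 1)

p = np.array([0.05, 0.5, 0.95])
print(pass_at_n(p, 8))         # aggregated test-time success
print(pass_at_n_weight(p, 8))  # training weight: largest for the hard example
```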
Results
The empirical results demonstrate that CAT significantly enhances test-time scaling performance compared to standard training methods. The framework was validated across various test-time strategies, including Pass@N and Majority Vote, showing improved outcomes in both SFT and RL settings.
Implications
The implications of this work are substantial for the deployment of machine learning models, particularly in scenarios where test-time performance is critical. By aligning training with inference strategies, models can be more effective in real-world applications, leading to better utilization of computational resources and improved outcomes in tasks requiring high-quality outputs.
Heterogeneous Variational Inference for Markov Degradation Hazard Models: Discretized Mixture with Interpretable Clusters
Time Series
Interpretability
Efficient ML
- Introduces an 8-state discretization method that significantly improves the detection of degradation events.
- Develops a comprehensive feature engineering strategy that integrates various data types for better model performance.
- Establishes practical interpretability rules for model selection to prevent overfitting and ensure meaningful clusters.
- Demonstrates that ADVI outperforms traditional MCMC methods in terms of speed and stability for finite mixture models.
Read more
Heterogeneous Variational Inference for Markov Degradation Hazard Models: Discretized Mixture with Interpretable Clusters
Summary
This paper addresses the challenges in infrastructure asset management by proposing a novel framework for analyzing heterogeneous degradation patterns in industrial equipment, specifically focusing on pumps. Traditional survival analysis methods often assume homogeneous hazard rates, which fail to capture the significant variability in degradation speeds observed in real-world data. The proposed approach utilizes Bayesian finite mixture models to identify discrete risk clusters, overcoming critical challenges such as insufficient degradation signals, unstable cluster identification, and the computational inefficiency of Markov Chain Monte Carlo (MCMC) methods. The framework incorporates an 8-state global percentile discretization to enhance degradation event detection, a comprehensive feature engineering strategy that combines statistical trends, continuous health indicators, and text embeddings, and interpretable model selection rules to ensure meaningful clusters. Additionally, it employs Automatic Differentiation Variational Inference (ADVI) for efficient and stable estimation, significantly reducing computation time compared to traditional methods. The methodology is validated on a dataset of 280 industrial pumps with over 104,000 inspection records, demonstrating the effectiveness of the proposed framework in identifying risk clusters and improving computational efficiency.
Methodology
The proposed framework combines an 8-state global percentile discretization to enhance degradation signal detection, a 30-dimensional feature engineering approach integrating statistical, continuous, and text data, and interpretable model selection rules that enforce minimum cluster share and separation. It utilizes Automatic Differentiation Variational Inference (ADVI) with full-rank covariance for efficient and stable posterior estimation.
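The discretization step is straightforward to sketch; the equal 12.5-percentile spacing below is an assumption that simply yields eight equally occupied global states.

```python
import numpy as np

def discretize_8_state(values: np.ndarray) -> np.ndarray:
    """Map a continuous health indicator to 8 states via *global* percentiles.

    Cut points are computed once over all assets pooled together, so a
    state transition means the same physical severity on every pump.
    Returns integer states 0 (best) .. 7 (worst).
    """
    cuts = np.percentile(values, np.arange(12.5, 100, 12.5))  # 7 interior cuts
    return np.digitize(values, cuts)

rng = np.random.default_rng(0)
health = rng.gamma(2.0, 1.0, size=10_000)  # stand-in for inspection scores
states = discretize_8_state(health)
print(np.bincount(states))                 # ~equal occupancy across 8 states
```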
Results
The framework was validated on 280 industrial pumps, revealing that ADVI provides nearly identical estimates to the No-U-Turn Sampler (NUTS) with a 15-fold speedup. The finite mixture models identified two clusters, with 72.9% classified as low-risk and 27.1% as high-risk, while NUTS faced convergence issues and label switching.
Implications
The findings suggest that the proposed framework can significantly enhance maintenance decision-making in industrial asset management by providing interpretable risk classifications and improving computational efficiency, thus enabling more timely and effective maintenance strategies.
Categorical Optimization with Bayesian Anchored Latent Trust Regions for Structural Design under High-Dimensional Uncertainty
Optimization
- COBALT effectively tackles high-dimensional categorical optimization under uncertainty with costly evaluations.
- The framework locks latent catalog instances as discrete physical anchors to maintain design integrity.
- Additive SAAS-GP is used to model sparse effects in the presence of heteroscedastic noise.
- The trust-region graph acquisition method allows for valid design selection without rounding errors.
Read more
Categorical Optimization with Bayesian Anchored Latent Trust Regions for Structural Design under High-Dimensional Uncertainty
Summary
This paper presents COBALT, a novel framework for categorical optimization under uncertainty, specifically targeting high-dimensional categorical design problems in structural engineering. Traditional optimization methods often struggle with the discrete nature of categorical variables, particularly when these variables must be selected from a finite catalog of instances. COBALT addresses this challenge by embedding the physical catalog into a low-dimensional latent space while locking the mapped instances as discrete anchors. The framework employs a random tree decomposition to facilitate efficient modeling of high-dimensional categorical variables and utilizes an additive SAAS-GP (Sparse Additive Gaussian Process) to handle heteroscedastic noise in Monte Carlo Finite Element Analysis (MC-FEA) observations. A trust-region graph acquisition strategy is implemented to select the next admissible design configuration without the need for rounding off, thus preserving physical admissibility throughout the optimization process. The effectiveness of COBALT is demonstrated through robust design optimization of complex bar structures, focusing on key performance metrics such as structural weight, strain energy, and local buckling performance. The results indicate that COBALT significantly enhances the efficiency of robust categorical structural optimization by ensuring valid designs are evaluated during the active learning loop.
Methodology
COBALT employs a combination of latent space embedding, random tree decomposition, and additive Gaussian process modeling to optimize categorical design variables under uncertainty. It locks discrete instances as anchors in a low-dimensional space and utilizes a trust-region approach for acquisition of new designs, ensuring that only valid configurations are considered.
Results
The application of COBALT to structural design optimization yielded improved efficiency and accuracy in evaluating robust performance metrics. The framework successfully maintained physical admissibility of designs throughout the optimization process, demonstrating its effectiveness in handling high-dimensional categorical variables under uncertainty.
Implications
COBALT has significant implications for structural design optimization in engineering, particularly in scenarios where design variables are categorical and subject to uncertainty. Its ability to efficiently navigate high-dimensional spaces while preserving physical integrity can enhance the design process in various applications, including automotive and aerospace engineering.
Semi-supervised learning with max-margin graph cuts
Graph Learning
Theory
Optimization
- Introduction of a max-margin graph cuts algorithm for semi-supervised learning.
- Theoretical proof of a generalization error bound for the proposed method.
- Demonstrated superior performance compared to existing methods like manifold regularization of SVMs.
- Stability improvements for harmonic function solutions with soft labeling constraints.
Read more
Semi-supervised learning with max-margin graph cuts
Summary
This paper introduces a novel algorithm for semi-supervised learning that utilizes max-margin graph cuts to enhance learning from both labeled and unlabeled data. The proposed method computes graph cuts that maximize the margin with respect to labels derived from a harmonic function solution. The authors provide a theoretical foundation by proving a bound on the generalization error of their approach. The algorithm is evaluated against existing methods, particularly manifold regularization of support vector machines (S3VMs), on synthetic datasets and three datasets from the UCI ML repository. The results demonstrate that the proposed method often outperforms S3VMs, especially for linear and cubic decision boundaries. Additionally, the paper highlights the limitations of manifold regularization in certain scenarios and offers a stable approach to harmonic function solutions with soft labeling constraints.
Methodology
The methodology involves two main steps: first, computing a regularized harmonic function solution on a data adjacency graph, and second, learning a max-margin discriminator based on the labels inferred from this solution. The optimization problem is framed to minimize hinge loss while ensuring the solution adheres to the regularized harmonic function constraints.
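A minimal sketch of the two steps under simplifying assumptions: a dense Laplacian, and scikit-learn's linear SVM standing in for the paper's max-margin learner.

```python
import numpy as np
from sklearn.svm import LinearSVC

def harmonic_then_svm(W, X, y_l, labeled_idx, gamma=1e-3):
    """Two-step sketch: regularized harmonic solution, then max-margin fit.

    W: symmetric adjacency matrix (n x n); X: features (n x d);
    y_l: +/-1 labels for labeled_idx. Step 1 solves the regularized
    harmonic system on the graph Laplacian; step 2 trains a max-margin
    discriminator on the inferred hard labels.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    u = np.setdiff1d(np.arange(n), labeled_idx)  # unlabeled indices
    f = np.zeros(n)
    f[labeled_idx] = y_l
    A = L[np.ix_(u, u)] + gamma * np.eye(len(u))  # regularization for stability
    f[u] = np.linalg.solve(A, -L[np.ix_(u, labeled_idx)] @ y_l)
    clf = LinearSVC().fit(X, np.sign(f + 1e-12))  # margin w.r.t. inferred labels
    return clf, f
```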
Results
The proposed algorithm was tested on synthetic datasets and three UCI ML datasets, showing improved performance over manifold regularization of SVMs in most cases. The results indicate that the method is particularly effective for linear and cubic decision boundaries, validating the theoretical claims regarding generalization error.
Implications
The findings suggest that the max-margin graph cuts approach can be a robust alternative for semi-supervised learning tasks, particularly in scenarios where labeled data is scarce. This method could be applied in various domains where semi-supervised learning is relevant, such as image classification, text categorization, and other applications involving large amounts of unlabeled data.
Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
Multimodal
Efficient ML
Audio & Speech
- Introduces native audio support alongside text, images, and video.
- Achieves significant accuracy improvements over previous models.
- Incorporates multimodal token-reduction techniques for efficiency.
- Demonstrates leading results in various multimodal tasks.
Read more
Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
Summary
The paper presents Nemotron 3 Nano Omni, a new multimodal model from NVIDIA that integrates audio inputs with text, images, and video for enhanced performance in various tasks. This model builds on the Nemotron 3 Nano 30B-A3B backbone and introduces significant improvements over its predecessor, Nemotron Nano V2 VL. Key advancements include native audio support, improved reasoning capabilities, and innovative multimodal token-reduction techniques that enhance inference efficiency. The model achieves leading results in document understanding, audio-video comprehension, and agentic computer use, while also demonstrating higher throughput and lower latency compared to similar models. The authors employ a multi-stage training strategy to address challenges in modality alignment and training stability, ultimately releasing model checkpoints and portions of the training data to promote further research.
Methodology
The model utilizes a mixture-of-experts architecture and incorporates modality-specific encoders for audio and vision. A multi-stage training strategy is employed to progressively introduce new modalities and scale context length, mitigating issues like catastrophic forgetting and ensuring stable cross-modal alignment.
Results
Nemotron 3 Nano Omni outperforms its predecessor and ranks highly on several benchmarks, achieving 3× higher throughput at the same interactivity target and 9× higher output token throughput per GPU. It is recognized as the most cost-efficient open video understanding model on MediaPerf.
Implications
The advancements in multimodal processing and efficiency could facilitate the development of more capable AI systems for real-world applications, including document understanding, interactive agents, and enhanced audio-visual comprehension.
PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
NLP
Large Language Models
Reinforcement Learning
- PAINT combines overlap-adaptive solution masking with sparse teacher energy interpolation for enhanced reasoning.
- The method provides a contextual re-scoring view that identifies key bottlenecks in reasoning tasks.
- Empirical results show consistent improvements over existing self-distillation methods and GRPO.
- PAINT achieves better rollout-token efficiency with shorter training sequences.
Read more
PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
Summary
The paper introduces PAINT, a novel training method aimed at enhancing the reasoning capabilities of large language models (LLMs) through a contextual re-scoring approach. The authors argue that effective reasoning requires supervision that aligns with the model's test-time states and provides informative, token-level guidance. PAINT addresses the limitations of existing methods, such as reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT), by combining on-policy rollouts with dense token-level supervision without relying on a stronger teacher model. The method employs a masking strategy based on rollout-reference overlap to determine how much solution context to reveal, and it utilizes sparse teacher energy interpolation to calibrate targets for specific token positions. The empirical evaluation demonstrates that PAINT consistently outperforms prior on-policy self-distillation baselines across multiple competition-level math benchmarks and model scales, achieving significant improvements in reasoning performance while maintaining rollout efficiency.
Methodology
PAINT employs a two-pronged approach: it masks the verified solution based on the overlap with the student's rollout to control context exposure, and it applies sparse energy interpolation on selected token positions where there is an entropy mismatch between the student and the privileged scorer. This allows for a more nuanced training process that balances guidance and diversity in reasoning.
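The masking rule itself is not reproduced in this summary; the sketch below shows one schematic way overlap could control context exposure, with the direction and schedule treated as assumptions.

```python
def adaptive_context(reference, rollout, floor=0.0):
    """Overlap-adaptive masking, schematically: the less the student's rollout
    overlaps the verified reference solution, the more of that solution is
    revealed as context (the paper's actual schedule is not reproduced here)."""
    ref_set, roll_set = set(reference), set(rollout)
    overlap = len(ref_set & roll_set) / max(len(ref_set), 1)
    reveal_frac = max(floor, 1.0 - overlap)
    n_reveal = int(reveal_frac * len(reference))
    return reference[:n_reveal]

prefix = adaptive_context(["step1", "step2", "step3", "step4"], ["step1"])
print(prefix)  # 3 of 4 reference tokens revealed for this low-overlap rollout
```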
Results
PAINT demonstrated improvements in macro Avg@12 scores across three model scales (Qwen3-8B, 4B, and 1.7B), with gains ranging from 0.8 to 2.1 points over the prior self-distillation baseline. It also matched or exceeded the performance of GRPO while using a more efficient single-rollout training method.
Implications
The findings suggest that PAINT could be a valuable method for enhancing the reasoning capabilities of LLMs in various applications, particularly in domains requiring complex problem-solving, such as mathematics and science. The approach may also inform future research on adaptive training strategies for LLMs.
Knowledge-Data Dually Driven Paradigm for Accurate Landslide Susceptibility Prediction under Data-Scarce Conditions Using Geomorphic Priors and Tabular Foundation Model
Theory
Efficient ML
- Introduces a novel paradigm for landslide susceptibility prediction that integrates geomorphic prior knowledge with limited data.
- Demonstrates comparable predictive accuracy to traditional methods using only 30% of available data in a data-rich region.
- Validates the approach in a data-scarce environment, confirming its applicability in complex geological settings.
- Utilizes a foundation model (TabPFN) tailored for small datasets to enhance prediction reliability.
Read more
Knowledge-Data Dually Driven Paradigm for Accurate Landslide Susceptibility Prediction under Data-Scarce Conditions Using Geomorphic Priors and Tabular Foundation Model
Summary
This paper addresses the challenge of landslide susceptibility prediction in data-scarce conditions, which is critical for geohazard risk assessment. Traditional data-driven methods require extensive landslide inventories and conditioning factors, which are often unavailable in mountainous and plateau regions. The authors propose a novel knowledge-data dually driven paradigm that integrates geomorphic prior knowledge with limited landslide data to enhance predictive accuracy. The methodology involves estimating geomorphic priors using a morphometric model, which are then combined with conventional predictive factors to create a prior-constrained dataset. A foundation model designed for small data, specifically TabPFN, is employed for predictions. The proposed paradigm was validated in a data-rich region in central Italy, achieving comparable accuracy to conventional methods while using only 30% of the available data. It was further tested in the Qilian Permafrost Region of the Tibetan Plateau, demonstrating reliable predictions even under severe data constraints. This approach effectively combines physical knowledge with data-driven inference, providing a robust solution for landslide susceptibility mapping in challenging environments.
Methodology
The proposed paradigm integrates geomorphic prior knowledge with scarce landslide data. It employs a morphometric model to estimate geomorphic priors, which are combined with conventional predictive factors to form a prior-constrained dataset. A foundation model, TabPFN, is then used for making predictions, ensuring robust inference even with limited data.
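Prediction with TabPFN is a single in-context fit with no gradient updates. The sketch below assumes the open-source tabpfn package and uses synthetic stand-ins for the prior-constrained features:

```python
import numpy as np
from tabpfn import TabPFNClassifier  # assumes the `tabpfn` package is installed

# X_train rows: conventional factors (slope, relief, lithology, ...) plus the
# geomorphic prior as an extra column; y_train: landslide / no-landslide.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10))  # small, prior-constrained training set
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(50, 10))

clf = TabPFNClassifier()               # pretrained; no task-specific tuning
clf.fit(X_train, y_train)              # in-context: no gradient steps
susceptibility = clf.predict_proba(X_test)[:, 1]
```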
Results
The paradigm achieved comparable predictive accuracy to conventional data-driven methods in a data-rich region while using only 30% of the available data. In a data-scarce environment, it also provided reliable landslide susceptibility predictions, confirming its effectiveness and applicability.
Implications
This research has significant implications for geohazard risk assessment and mitigation, particularly in regions where data is scarce. The proposed paradigm can enhance the reliability of landslide susceptibility mapping, aiding in land-use planning and disaster management efforts.
DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
Reinforcement Learning
Large Language Models
Efficient ML
- DORA introduces a scalable asynchronous training paradigm for RL in LLMs.
- The system maintains multiple policy versions to enhance rollout efficiency.
- DORA achieves significant improvements in training throughput without compromising convergence.
- The centralized load-balancing orchestrator optimizes resource allocation dynamically.
Read more
DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
Summary
The paper presents DORA (Dynamic ORchestration for Asynchronous Rollout), a novel system designed to enhance the efficiency of reinforcement learning (RL) in training large language models (LLMs). The authors identify that the rollout phase, which constitutes 50-80% of the total training time, is hindered by skewed generation, particularly due to long-tailed trajectories that are essential for model performance. Traditional synchronous training methods exacerbate this issue by causing idle resources during the longest output generation. DORA addresses these challenges through a unique asynchronous paradigm that allows for overlapping generation and training while maintaining crucial algorithmic constraints such as intra-trajectory policy consistency, data integrity, and bounded staleness. The system employs multi-version streaming training, enabling multiple policy versions to operate concurrently, thus eliminating generation bottlenecks without sacrificing convergence. A centralized load-balancing orchestrator dynamically manages resources and request migrations to optimize performance. Experimental results demonstrate that DORA significantly improves efficiency, achieving up to 2.12x increase in end-to-end throughput and 8.2x in the rollout stage compared to synchronous training. In large-scale applications, DORA accelerates rollout by up to 6.2x, leading to competitive performance in reasoning benchmarks with the resulting models.
Methodology
DORA employs a multi-version streaming training approach that allows simultaneous rollout of multiple policy versions. It ensures intra-trajectory policy consistency, preserves data integrity, and maintains bounded staleness through a centralized load-balancing orchestrator that dynamically reallocates resources and manages request migrations.
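One of the constraints is simple to illustrate: bounded staleness means rejecting rollouts generated by a policy version too far behind the trainer. A schematic check (illustrative names, not DORA's API):

```python
def admit(traj_policy_version: int, trainer_version: int, max_staleness: int = 2) -> bool:
    """Bounded staleness: only train on rollouts from sufficiently recent policies.

    Intra-trajectory policy consistency is a separate constraint: every token
    of a trajectory comes from the same policy version, so a weight update
    never lands mid-trajectory.
    """
    return trainer_version - traj_policy_version <= max_staleness

# A rollout generated at version 10 is still usable at trainer version 12,
# but would be discarded at version 13 when max_staleness is 2.
print(admit(10, 12), admit(10, 13))  # True False
```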
Results
DORA achieves up to 2.12x improvement in end-to-end throughput and 8.2x in the rollout stage compared to synchronous training. In large-scale industrial applications, it accelerates the rollout stage by up to 6.2x, resulting in competitive performance on complex reasoning benchmarks.
Implications
The DORA system has significant implications for the training of large language models, particularly in improving efficiency and performance in RL applications. Its design can be applied to various industrial scenarios, enhancing the scalability and effectiveness of model training.
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
Reinforcement Learning
Large Language Models
Efficient ML
- Speculative decoding is introduced as a method to accelerate RL rollouts without altering the output distribution of the target model.
- The integration of speculative decoding into the NeMo-RL framework supports both synchronous and asynchronous execution.
- The proposed method achieves a 1.8× throughput improvement in rollout generation for 8B-scale models.
- Combining speculative decoding with asynchronous RL can potentially lead to a 2.5× speedup in end-to-end training at larger model scales.
Read more
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
Summary
This paper addresses the bottleneck in reinforcement learning (RL) post-training of frontier language models, specifically focusing on the inefficiencies associated with autoregressive rollout generation. The authors adopt speculative decoding as a lossless acceleration primitive for RL rollouts, one that preserves the output distribution of the target model. By integrating speculative decoding into the NeMo-RL framework with a vLLM backend, the authors enable both synchronous and asynchronous pipelines for RL rollouts. The study demonstrates that speculative decoding can significantly enhance rollout throughput, achieving a 1.8× improvement in an 8B-scale reasoning workload under synchronous RL. Furthermore, projections indicate that combining speculative decoding with asynchronous RL could yield up to a 2.5× end-to-end training speedup at a 235B scale. The paper also explores the operational choices that influence speedup and characterizes the design space for speculative decoding in RL post-training, providing insights into how various factors affect performance at deployment scale.
Methodology
The authors implemented speculative decoding within the NeMo-RL framework, utilizing a vLLM backend. They conducted experiments on 8B-scale reasoning workloads under synchronous RL, analyzing various operational parameters such as draft initialization, draft length, and online adaptation. The study also employed a high-fidelity performance simulator to project the potential speedup when combining speculative decoding with asynchronous RL.
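The core primitive, independent of the NeMo-RL integration, is the standard accept/reject rule that makes speculative decoding lossless. A minimal per-round sketch, omitting the bonus token usually appended when every draft token is accepted:

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_tokens, rng):
    """Accept/reject draft tokens so the output matches the target exactly.

    p_target, q_draft: per-position next-token distributions, shape (k, V),
    already evaluated by the target and draft models on the draft prefix.
    Returns the accepted tokens plus one corrected token (standard scheme).
    """
    out = []
    for i, x in enumerate(draft_tokens):
        if rng.uniform() < min(1.0, p_target[i, x] / q_draft[i, x]):
            out.append(x)                      # accept draft token
        else:
            residual = np.maximum(p_target[i] - q_draft[i], 0.0)
            residual /= residual.sum()         # renormalized leftover mass
            out.append(rng.choice(len(residual), p=residual))
            break                              # resample and end this round
    return out
```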
Results
The integration of speculative decoding resulted in a 1.8× improvement in rollout throughput for 8B-scale models under synchronous RL. Projections using a performance simulator indicated that the combination of speculative decoding with asynchronous RL could achieve up to a 2.5× speedup in end-to-end training at 235B scale.
Implications
The findings suggest that speculative decoding can significantly enhance the efficiency of RL post-training processes, making it a valuable technique for improving the performance of large language models in reasoning tasks. This could lead to faster training cycles and more effective deployment of RL models in various applications.
On the Trainability of Masked Diffusion Language Models via Blockwise Locality
NLP
Large Language Models
Generative Models
- Standard random-masking MDMs exhibit instability and high variance in training for certain tasks.
- Proposed models Jigsaw and Scatter incorporate left-to-right locality to improve trainability and stability.
- Jigsaw matches AR-LLM stability on linear regression while maintaining performance on Sudoku.
- Scatter retains diffusion's advantages in planning tasks like path-finding.
Read more
On the Trainability of Masked Diffusion Language Models via Blockwise Locality
Summary
This paper investigates the trainability of Masked Diffusion Language Models (MDMs) in comparison to Autoregressive Large Language Models (AR-LLMs). The authors identify that while MDMs show promise, they often exhibit instability during optimization, particularly in tasks requiring structured generation. The study evaluates MDMs on three tasks: in-context linear regression, graph path-finding, and Sudoku solving. The findings reveal that standard random-masking MDMs struggle with linear regression and exhibit high variance in training dynamics for path-finding, yet outperform AR-LLMs in Sudoku. To address these challenges, the authors propose two locality-aware blockwise models, Jigsaw and Scatter, which introduce left-to-right inductive bias while maintaining iterative refinement. Empirical results demonstrate that Jigsaw achieves stability comparable to AR-LLMs in linear regression and performs well in Sudoku, while Scatter retains the planning advantages of diffusion models in path-finding. The study concludes that traditional random-masking MDMs may not be optimal for ordered generation tasks, suggesting the need for models that go beyond random masking.
Methodology
The authors systematically evaluate MDMs and their block-structured variants on three controlled tasks. They propose two new models, Jigsaw and Scatter, which enforce autoregressive locality within blocks while allowing for iterative refinement across blocks. The performance of these models is compared against standard MDMs and AR-LLMs using empirical experiments that measure training dynamics and compute efficiency.
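The blockwise locality idea can be sketched as a masking schedule; the scheme below is illustrative of left-to-right block structure, not the exact Jigsaw or Scatter procedure.

```python
import numpy as np

def blockwise_mask(T, block_size, active_block, mask_rate, rng):
    """Return a boolean mask for one denoising step with left-to-right
    block locality: earlier blocks are fully visible, the active block is
    partially masked (iterative refinement), later blocks are fully masked."""
    mask = np.zeros(T, dtype=bool)
    start = active_block * block_size
    end = min(start + block_size, T)
    mask[end:] = True                                   # future blocks hidden
    mask[start:end] = rng.uniform(size=end - start) < mask_rate
    return mask

rng = np.random.default_rng(0)
print(blockwise_mask(12, 4, active_block=1, mask_rate=0.5, rng=rng).astype(int))
```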
Results
The experiments show that Jigsaw improves stability and reduces variance in linear regression tasks, while also performing well in Sudoku. Scatter maintains the planning advantages of diffusion models in path-finding tasks. Overall, the proposed models demonstrate a better trade-off between trainability and stability compared to standard MDMs.
Implications
The findings suggest that incorporating locality-aware structures in diffusion models can enhance their applicability in tasks requiring ordered generation. This could lead to more effective models for natural language processing and other structured generation tasks.
Learning with Embedded Linear Equality Constraints via Variational Bayesian Inference
Theory
Optimization
- Introduces a Bayesian framework for embedding linear equality constraints in BNNs.
- Utilizes variational inference to provide uncertainty quantification while enforcing constraints.
- Demonstrates improved performance on a single particle battery model compared to standard BNNs.
- Treats constraint tolerance as a learnable random variable, enhancing model flexibility.
Read more
Learning with Embedded Linear Equality Constraints via Variational Bayesian Inference
Summary
This paper addresses the challenge of integrating known physical constraints into machine learning models while providing meaningful uncertainty estimates. The authors propose a Bayesian framework that embeds linear equality constraints directly into Bayesian neural networks (BNNs) using variational inference. This approach allows for the characterization of predictive uncertainty over model parameters and the enforcement of constraints, which is crucial in scientific and engineering applications. The method is evaluated on a single particle battery model, demonstrating its ability to produce reduced credible intervals and fewer constraint violations compared to standard BNNs. The framework treats constraint tolerance as a random variable, enabling the model to learn the strictness of constraints from the data, thus bridging structured probabilistic modeling with modern deep learning techniques.
Methodology
The authors develop a probabilistic framework that incorporates linear equality constraints into BNNs. They utilize variational inference to derive a posterior predictive distribution that satisfies constraints while maintaining calibrated uncertainty. The method involves conditioning on the constraint residuals and allows for a learnable tolerance level, which adjusts the strictness of constraint enforcement based on the data.
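The conditioning primitive has a closed form for a Gaussian over parameters; in the paper it operates inside variational inference over a BNN, with the tolerance playing the role of the learnable constraint strictness. A sketch of that primitive:

```python
import numpy as np

def condition_on_constraint(mu, Sigma, A, b, tol2=0.0):
    """Condition a Gaussian N(mu, Sigma) on the linear constraint A w = b.

    With variance tol2 on the constraint residual (the learnable tolerance),
    this is exact Gaussian conditioning: the returned moments satisfy the
    constraint in expectation, and strictly when tol2 = 0.
    """
    S = A @ Sigma @ A.T + tol2 * np.eye(A.shape[0])
    K = Sigma @ A.T @ np.linalg.inv(S)        # Kalman-style gain
    mu_c = mu + K @ (b - A @ mu)
    Sigma_c = Sigma - K @ A @ Sigma
    return mu_c, Sigma_c

mu, Sigma = np.zeros(3), np.eye(3)
A, b = np.array([[1.0, 1.0, 1.0]]), np.array([2.0])  # enforce w1+w2+w3 = 2
mu_c, Sigma_c = condition_on_constraint(mu, Sigma, A, b)
print(A @ mu_c)                               # [2.]: constraint satisfied
```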
Results
The proposed method shows a significant reduction in credible intervals and constraint violations when applied to the single particle battery model, outperforming traditional BNNs. The results indicate that the framework effectively balances the enforcement of physical constraints with the need for uncertainty quantification.
Implications
This work has potential applications in fields requiring predictive modeling under physical constraints, such as battery modeling, fluid dynamics, and other engineering domains. The integration of uncertainty quantification with structured knowledge can enhance decision-making processes in scientific research and engineering design.
Comparative Study of Bending Analysis using Physics-Informed Neural Networks and Numerical Dynamic Deflection in Perforated nanobeam
Theory
Optimization
- Introduction of a novel Physics-Informed Functional Link Constrained Framework (DFL-TFC) for analyzing perforated nanobeams.
- Establishment of a relationship between static and dynamic deflections in perforated nanobeams.
- Demonstration of computational efficiency and accuracy without the need for complex neural network architectures.
- Utilization of the Theory of Functional Connections to embed differential equation constraints effectively.
Read more
Comparative Study of Bending Analysis using Physics-Informed Neural Networks and Numerical Dynamic Deflection in Perforated nanobeam
Summary
This paper presents a comparative study of the bending behavior of perforated nanobeams subjected to sinusoidal loading, using a Physics-Informed Functional Link Constrained Framework with Domain Mapping (DFL-TFC). The authors aim to establish the relationship between static bending response and dynamic deflection for various perforation cases. The static bending response is computed using the FL-TFC method, while the dynamic deflection is determined through the Galerkin method. The study employs the Theory of Functional Connections (TFC) to incorporate governing differential equation constraints into a constrained expression that satisfies all initial and boundary conditions. The results indicate that the proposed FL-TFC method is computationally efficient and accurate, eliminating the need for complex deep neural network architectures. The relationship between static and dynamic deflections is analyzed, revealing that the ratio of dynamic to static deflection remains constant across the domain for fixed values of filling ratio, number of rows of holes, and non-local parameter.
Methodology
The methodology involves the use of the FL-TFC with Domain Mapping to compute static bending responses, while dynamic deflections are calculated using the Galerkin method. The approach integrates the Theory of Functional Connections to ensure that the governing differential equations are satisfied, and the free function is represented through a Functional Link Neural Network (FLNN). The optimization is achieved by minimizing the mean square residual of the differential equations.
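The TFC idea is that boundary conditions hold identically, by construction, for any free function, so optimization only has to drive down the differential-equation residual. A two-point Dirichlet sketch (the beam problem adds moment conditions at both ends):

```python
import numpy as np

def constrained_expression(g, x):
    """TFC constrained expression enforcing y(0) = y(1) = 0 for any free g.

    y(x) = g(x) - (1 - x) * g(0) - x * g(1). Whatever the free function g
    (in the paper, the functional-link network), the boundary conditions
    hold exactly, leaving only the ODE residual to minimize.
    """
    return g(x) - (1.0 - x) * g(0.0) - x * g(1.0)

g = lambda x: np.sin(3.0 * x) + 0.7  # arbitrary free function
x = np.linspace(0.0, 1.0, 5)
y = constrained_expression(g, x)
print(y[0], y[-1])                   # both 0.0 by construction
```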
Results
The study finds that the proposed FL-TFC method provides accurate results for the bending analysis of perforated nanobeams, with the ratio of dynamic to static deflection remaining constant across various configurations. The method demonstrates superior performance compared to standard Physics-Informed Neural Networks (PINNs) in terms of efficiency and adherence to boundary conditions.
Implications
The findings have significant implications for the design and analysis of nanostructures in engineering applications, particularly in optimizing the mechanical properties of perforated beams for various structural applications. The methodology can be extended to other types of structural analyses involving complex geometries and loading conditions.
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
NLP
Large Language Models
Reinforcement Learning
- Introduction of SeqCond Attention (SCA) as a novel sequence operator for language models.
- Demonstration of SCA's expressiveness, capable of retrieving individual tokens and replicating softmax attention outputs.
- Hybrid architecture comprising 16 SCA layers and 8 transformer layers for efficient reasoning.
- Innovative training strategies, including gradient-balanced GRPO and scored self-distillation, leading to improved accuracy.
Read more
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
Summary
The paper introduces Nautile-370M, a 371-million-parameter language model designed for efficient reasoning while adhering to strict parameter and inference budgets. The model employs a hybrid architecture that alternates between two SeqCond Attention (SCA) layers and one transformer layer, aiming to combine the efficiency of structured sequential models with the expressive routing capabilities of attention mechanisms. The SCA layer is based on a linear-time spectral sequence operator that computes a compressed summary of the input sequence, allowing for efficient state updates during inference. The model was trained on a single Cloud TPU v4-64 pod and further refined using reinforcement learning on an NVIDIA DGX Spark. The authors demonstrate that the SCA readout mechanism can accurately retrieve individual tokens and replicate outputs of softmax attention, establishing its expressiveness. The training pipeline involved a diverse dataset, and the authors propose two innovations to enhance reinforcement learning performance: a gradient-balanced GRPO variant and a scored self-distillation stage, which significantly improved reasoning accuracy on benchmark tasks.
Methodology
Nautile-370M employs a hybrid architecture with alternating SCA and transformer layers. The SCA layer computes a compressed summary of the input sequence using a characteristic function and its gradient, allowing for efficient token retrieval. The model was trained on a large dataset and refined through reinforcement learning, incorporating novel techniques to stabilize training and enhance reasoning capabilities.
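The internals of the SCA operator are not detailed in this summary, but the 2:1 interleaving pattern is. A skeleton of that stacking, with SCALayer as a placeholder for the actual spectral operator (only the alternation ratio is taken from the paper; dimensions are illustrative):

```python
# Sketch of the 2:1 layer alternation described above (16 SCA + 8 attention
# layers). The SCA internals are not specified in this summary, so SCALayer
# is a placeholder; only the interleaving pattern comes from the paper.
import torch.nn as nn

class SCALayer(nn.Module):            # placeholder for SeqCond Attention
    def __init__(self, d):
        super().__init__()
        self.mix = nn.Linear(d, d)    # stand-in for the spectral operator
    def forward(self, x):
        return x + self.mix(x)

def build_nautile_stack(d_model=1024, n_heads=16, n_blocks=8):
    layers = []
    for _ in range(n_blocks):         # 8 blocks of (SCA, SCA, transformer)
        layers += [SCALayer(d_model), SCALayer(d_model),
                   nn.TransformerEncoderLayer(d_model, n_heads,
                                              batch_first=True)]
    return nn.Sequential(*layers)     # 16 SCA + 8 attention layers total
```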
Results
The model's architecture and training methods resulted in a significant improvement in reasoning accuracy on the GSM8K benchmark, increasing from 28.0% to 33.4%. The SCA mechanism proved to be as expressive as traditional self-attention, enabling effective token retrieval and state tracking.
Implications
Nautile-370M's architecture and training innovations could lead to more efficient language models that maintain high reasoning capabilities while operating within strict resource constraints. This model may be particularly useful in applications requiring efficient processing of long-context data.
A Multimodal and Explainable Machine Learning Approach to Diagnosing Multi-Class Ejection Fraction from Electrocardiograms
Multimodal
Time Series
Interpretability
- Developed a multimodal machine learning framework for LVEF classification using ECG and EHR data.
- Achieved high AUROC scores across multiple LVEF categories, outperforming unimodal models.
- Utilized SHAP attributions to enhance model explainability and identify key features influencing predictions.
- Demonstrated the model's temporal generalizability, indicating robustness over time.
Read more
A Multimodal and Explainable Machine Learning Approach to Diagnosing Multi-Class Ejection Fraction from Electrocardiograms
Summary
This paper presents a novel multimodal machine learning framework designed to classify left ventricular ejection fraction (LVEF) from 12-lead electrocardiograms (ECGs) and electronic health record (EHR) data. Traditional LVEF assessment relies on echocardiography, which is often inaccessible in primary care settings. The authors developed a model that integrates engineered ECG time-series features with structured EHR variables to classify LVEF into four categories: normal (≥50%), mildly reduced (40–50%), moderately reduced (30–40%), and severely reduced (<30%). The model was trained using retrospective data from Hartford HealthCare, encompassing 36,784 ECG-echocardiogram pairs from 30,952 outpatients, and evaluated for temporal generalizability on 19,966 ECGs from a subsequent period. The multimodal model demonstrated superior performance compared to unimodal baselines, achieving one-vs-rest AUROCs of 0.95 for severe, 0.92 for moderate, 0.82 for mild, and 0.91 for normal LVEF. Additionally, the study employed SHAP attributions to enhance model explainability by identifying influential ECG and EHR features. This research highlights the potential of ECG-based multimodal LVEF stratification as a practical tool for screening and triage in resource-limited settings.
Methodology
The authors employed an XGBoost-based tabular learning framework that integrates engineered features from 12-lead ECG time series with structured EHR data. The model was trained on a large dataset of ECG-echocardiogram pairs and evaluated using AUROC metrics for performance assessment. SHAP attributions were used to analyze feature importance and support model explainability.
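Since this is a standard tabular pipeline, a brief sketch may help: concatenated ECG and EHR features feed a multi-class gradient-boosted model, and SHAP tree attributions explain its predictions. The feature values below are synthetic stand-ins and the hyperparameters are illustrative, not the authors':

```python
# Sketch of the tabular pipeline described above: engineered ECG features
# concatenated with EHR variables, a 4-class XGBoost classifier, and SHAP
# attributions for explainability. Data and settings are placeholders.
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # stand-in for ECG + EHR features
y = rng.integers(0, 4, size=1000)        # 4 LVEF classes: normal..severe

model = xgb.XGBClassifier(objective="multi:softprob", n_estimators=200,
                          max_depth=5, eval_metric="mlogloss")
model.fit(X, y)

explainer = shap.TreeExplainer(model)    # per-feature, per-class attributions
shap_values = explainer.shap_values(X[:100])
```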
Results
The multimodal model achieved one-vs-rest AUROCs of 0.95 for severe LVEF reduction, 0.92 for moderate reduction, 0.82 for mild reduction, and 0.91 for normal LVEF. The model outperformed both ECG-only and EHR-only baselines, particularly in the moderate LVEF category, where it showed an improvement from 0.88 to 0.92 AUROC. The study also reported variability in performance metrics due to class imbalance, emphasizing AUROC as the primary performance summary.
Implications
This research supports the use of ECG-based multimodal approaches for LVEF stratification, potentially transforming heart failure risk assessment and triage in clinical settings, especially where echocardiography is not readily available. The findings could lead to improved patient outcomes through timely identification and management of heart failure risks.
Transformer Approximations from ReLUs
Theory
Efficient ML
Generative Models
- Introduction of a translation theorem for ReLU to softmax Transformer approximations.
- Establishment of constructive transformer-native universal approximation results.
- Derivation of economical Transformer constructions for specific approximation targets.
- Improved resource bounds for Transformers compared to traditional universal approximation methods.
Read more
Transformer Approximations from ReLUs
Summary
This paper presents a systematic approach to translating ReLU approximation results into the context of softmax attention mechanisms used in Transformers. The authors introduce a translation theorem that allows for the construction of explicit softmax Transformers with accuracy matching that of existing ReLU approximators. This approach not only provides target-specific resource bounds but also addresses the limitations of universal approximation statements that often yield loose bounds in practical scenarios. The authors demonstrate the applicability of their theorem through various common approximation targets, including multiplication, reciprocal computation, and min/max operations. The results indicate that the proposed method yields more economical resource requirements compared to traditional universal approximation methods, thereby enhancing the efficiency of Transformer models in practical applications. The paper emphasizes the importance of developing a target-specific approximation theory for Transformers, paralleling the advancements made in ReLU networks, which have established a mature framework for specific function approximations.
Methodology
The authors develop a translation theorem that constructs softmax attention Transformers based on existing ReLU approximators. This involves analyzing the resource requirements and accuracy of the resulting Transformer models, ensuring that they match the performance of their ReLU counterparts. The methodology includes case studies on specific functions to illustrate the practical implications of the theorem.
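For flavor, the kind of target-specific ReLU construction being translated is classical: multiplication reduces to squaring via a polarization identity, and squaring admits an efficient ReLU approximation (a Yarotsky-style result from the ReLU literature, not specific to this paper):

```latex
% Multiplication via squares (polarization identity), the standard route
% in ReLU approximation theory:
xy \;=\; \tfrac{1}{4}\bigl((x+y)^2 - (x-y)^2\bigr),
\qquad t^2 \text{ approximable to error } \varepsilon
\text{ by a ReLU network of size } O(\log(1/\varepsilon)).
```

The translation theorem then ports such a construction, at matching accuracy, into an explicit softmax-attention block.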
Results
The paper proves a translation theorem that allows the construction of softmax Transformers with the same accuracy as ReLU approximators. It also presents new economical constructions for approximating rational functions and simulating composite ReLU modules, demonstrating improved size and depth bounds for Transformers on important function classes.
Implications
The findings of this paper have significant implications for the design and efficiency of Transformer models in various applications, particularly in generative models and large language models. By providing a framework for target-specific approximations, the research could lead to more efficient architectures that require fewer resources while maintaining high accuracy.
A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
Reinforcement Learning
Graph Learning
- Communication enhances coordination among agents in MARL.
- GNNs provide a robust framework for modeling communication in multi-agent systems.
- The paper identifies a lack of classification frameworks for GNN-based communication methods in MARL.
- A generalized GNN-based communication process is proposed to clarify existing approaches.
Read more
A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
Summary
This paper surveys the integration of communication mechanisms in Multi-Agent Reinforcement Learning (MARL) using Graph Neural Networks (GNNs). It highlights the importance of communication for agents to coordinate actions and achieve shared objectives by exchanging information. The authors note a lack of structured frameworks to classify existing MARL approaches that utilize GNNs for communication. They propose a generalized GNN-based communication process to clarify and make accessible the underlying concepts of these methods. The survey analyzes twelve major recent GNN-based communication methods, comparing their approaches, extracting trends, and identifying limitations. The paper also discusses how realistic communication constraints are addressed in state-of-the-art methods and concludes with future research directions in the field.
Methodology
The authors conducted a comprehensive survey of recent literature on MARL with GNN-based communication. They analyzed various methods, compared their effectiveness, and proposed a generalized communication process based on GNNs. The survey also included a discussion on how communication constraints are integrated into these methods.
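To make the surveyed abstraction concrete, here is a minimal sketch of one GNN communication round in the generic form the survey describes: agents are nodes, communication links are edges, and a message-passing step mixes neighbors' hidden states before each agent's policy acts. Module choices here are illustrative, not drawn from any single surveyed method:

```python
# Sketch of a generic GNN-communication round: encode per-agent messages,
# mean-aggregate them over the communication graph, then integrate the
# aggregate into each agent's hidden state with a recurrent update.
import torch
import torch.nn as nn

class CommRound(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.msg = nn.Linear(d, d)        # encode outgoing messages
        self.upd = nn.GRUCell(d, d)       # integrate aggregated messages

    def forward(self, h, adj):
        # h: (n_agents, d) hidden states; adj: (n_agents, n_agents) 0/1 links.
        m = self.msg(h)                   # per-agent message
        agg = adj @ m / adj.sum(1, keepdim=True).clamp(min=1)  # neighbor mean
        return self.upd(agg, h)           # updated agent states

h = torch.randn(4, 32)
adj = torch.ones(4, 4)                    # fully connected team of 4 agents
h = CommRound(32)(h, adj)
```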
Results
The survey revealed significant diversity in GNN-based communication methods within MARL, highlighting trends and common challenges. The proposed generalized communication process aims to provide clarity and accessibility to the underlying concepts of these approaches.
Implications
The findings of this survey can guide future research in MARL by providing a structured understanding of GNN-based communication. It may lead to the development of more effective communication strategies among agents, enhancing their ability to work collaboratively in complex environments.
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
Large Language Models
Efficient ML
NLP
- Introduces a shared KV cache pool for concurrent inference agents, reducing memory overhead.
- Achieves a stable 2.91x compression ratio across different configurations.
- Demonstrates significant memory savings, reducing KV cache memory from 19.8 GB to 0.45 GB.
- Shows that perplexity degradation is minimal and improves with longer context lengths.
Read more
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
Summary
The paper introduces PolyKV, a novel system designed to optimize memory usage during multi-agent inference of large language models (LLMs) by implementing a shared, asymmetrically-compressed key-value (KV) cache pool. Traditional methods allocate separate KV caches for each agent, leading to significant memory overhead. PolyKV addresses this by allowing multiple agents to share a single compressed KV cache, which is created once and injected into each agent's context. The compression technique is asymmetric, using int8 quantization for keys to maintain softmax stability and TurboQuant MSE for values, which employs a Fast Walsh-Hadamard Transform (FWHT) followed by 3-bit Lloyd-Max quantization. The authors evaluate PolyKV across various model scales and context lengths, demonstrating a consistent compression ratio of 2.91x and significant memory savings, particularly with 15 concurrent agents. The results indicate minimal degradation in performance metrics, such as perplexity and BERTScore, suggesting that the method effectively balances memory efficiency with model performance. This work represents a significant advancement in the integration of shared memory and compression techniques for multi-agent LLM inference.
Methodology
PolyKV employs a shared-pool architecture where a single compressed KV state is computed and injected into multiple agents' contexts. The keys are quantized using int8, while the values are compressed using TurboQuant MSE, which combines FWHT and 3-bit Lloyd-Max quantization. The system is evaluated across different model scales and context lengths, measuring performance in terms of memory usage and model accuracy.
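A rough sketch of the asymmetric scheme may help: keys get plain int8 quantization, while values are rotated by an FWHT before low-bit quantization. Lloyd-Max fits optimal non-uniform levels; a uniform 3-bit quantizer stands in for it below to keep the sketch short, and shapes and scales are illustrative:

```python
# Sketch of the asymmetric KV compression described above: int8 keys, and
# values passed through a Fast Walsh-Hadamard Transform before low-bit
# quantization. A uniform quantizer replaces Lloyd-Max for brevity.
import numpy as np

def fwht(x):
    # Iterative FWHT along the last axis (length must be a power of two).
    x = x.copy()
    h = 1
    while h < x.shape[-1]:
        for i in range(0, x.shape[-1], 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(x.shape[-1])       # orthonormal scaling

def quantize_keys_int8(k):
    scale = np.abs(k).max() / 127.0
    return np.round(k / scale).astype(np.int8), scale

def quantize_values_3bit(v):
    z = fwht(v)                           # rotate into a flatter distribution
    scale = np.abs(z).max() / 3.0         # symmetric 3-bit grid: levels -3..3
    return np.clip(np.round(z / scale), -3, 3).astype(np.int8), scale
```

Dequantization reverses the steps (multiply by the scale, then apply the FWHT again, since the orthonormal transform is its own inverse).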
Results
The evaluation shows that PolyKV achieves a 97.7% reduction in KV cache memory usage with 15 agents sharing a 4K-token context, while limiting perplexity degradation to +0.57%. The perplexity delta improves with longer context lengths, even turning negative (-0.26%) at 1,851 tokens of coherent context. The mean BERTScore F1 remains high at 0.928, indicating semantic equivalence between compressed and baseline outputs.
Implications
PolyKV has the potential to significantly enhance the efficiency of multi-agent LLM inference systems, making them more scalable and accessible for applications requiring concurrent processing of large contexts. This could lead to advancements in real-time applications of LLMs, such as conversational agents and collaborative AI systems.
Diverse Image Priors for Black-box Data-free Knowledge Distillation
Computer Vision
Efficient ML
Theory
- Introduces Diverse Image Priors as a new class of synthetic images for knowledge distillation.
- Utilizes a primer student for contrastive optimization, enhancing the quality of distillation signals.
- Demonstrates that data diversity is crucial for effective knowledge transfer in black-box scenarios.
- Achieves state-of-the-art performance across multiple benchmarks in black-box data-free KD.
Read more
Diverse Image Priors for Black-box Data-free Knowledge Distillation
Summary
This paper addresses the challenge of knowledge distillation (KD) in black-box, data-free scenarios where access to the teacher model's internals and training data is restricted. The authors propose a framework called Diverse Image Priors Knowledge Distillation (DIP-KD) that operates in three phases: (1) synthesis of diverse image priors that capture a wide range of visual patterns and semantics; (2) contrastive learning to sharpen the distinction between synthetic samples; and (3) distillation using a primer student that extracts soft probabilistic signals from the teacher's top-1 predictions. The proposed method overcomes limitations of previous approaches whose synthetic data often lacked diversity and therefore yielded weak distillation signals. Evaluation of DIP-KD across 12 benchmarks demonstrates state-of-the-art performance, highlighting the importance of data diversity for knowledge acquisition under restricted access.
Methodology
The methodology consists of a three-phase pipeline: (1) Synthesis of diverse image priors to create synthetic samples that reflect a broad range of visual characteristics; (2) Contrastive learning to ensure these samples are distinct and informative; (3) Knowledge distillation using both hard labels from the teacher and soft probabilistic labels from the primer student to optimize the learning process.
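The phase-3 signal combines two supervision sources, which is easy to state in code: a cross-entropy term on the black-box teacher's top-1 labels plus a KL term toward the primer student's soft probabilities. A minimal sketch, with temperature and weighting as hypothetical values rather than the paper's:

```python
# Sketch of the phase-3 distillation signal described above: the black-box
# teacher exposes only top-1 hard labels, while a "primer" student supplies
# soft probabilities. Weighting alpha and temperature T are placeholders.
import torch
import torch.nn.functional as F

def dip_kd_loss(student_logits, teacher_top1, primer_logits, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, teacher_top1)          # teacher top-1
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(primer_logits / T, dim=-1),
                    reduction="batchmean") * T * T                # primer softs
    return alpha * hard + (1 - alpha) * soft
```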
Results
The proposed DIP-KD framework outperforms existing methods in black-box data-free knowledge distillation, achieving state-of-the-art results across 12 different benchmarks. Ablation studies confirm that the diversity of synthetic data significantly enhances the effectiveness of knowledge acquisition.
Implications
The findings suggest that DIP-KD can be effectively applied in decentralized AI systems where data privacy is a concern, enabling efficient knowledge transfer from complex models to lightweight student models without requiring access to original training data.
A Randomized PDE Energy driven Iterative Framework for Efficient and Stable PDE Solutions
Efficient ML
Theory
Optimization
- Introduces a PDE energy-driven framework that avoids traditional matrix-based solvers.
- Utilizes physically constrained diffusion iterations for solving PDEs.
- Demonstrates stable convergence and accurate resolution of sharp gradients.
- Achieves competitive accuracy compared to analytical solutions.
Read more
A Randomized PDE Energy driven Iterative Framework for Efficient and Stable PDE Solutions
Summary
This paper presents a novel framework for solving partial differential equations (PDEs) that is both efficient and stable, addressing the limitations of traditional numerical solvers and data-driven methods. The proposed PDE energy-driven framework utilizes physically constrained diffusion iterations, eliminating the need for matrix-based finite element assembly or extensive neural network training. The method evolves random initial fields through implicit iterations combined with Gaussian smoothing, ensuring boundary conditions are strictly enforced at each step. The framework is tested on one-dimensional Poisson, Heat, and viscous Burgers equations, demonstrating stable convergence to unique physical solutions from random initializations. The results show accurate resolution of sharp gradients and controlled Mean Squared Error (MSE) across various discretization parameters. Comparisons with analytical solutions reveal that the framework achieves competitive accuracy and stability, offering a fast and flexible alternative to traditional numerical solvers. This approach has significant potential for scalable PDE solutions in both research and engineering applications.
Methodology
The methodology involves evolving arbitrary random initial fields through PDE energy-driven implicit iterations, combined with Gaussian smoothing. Boundary conditions are strictly enforced at each iteration, allowing for the solution of various PDEs without reliance on classical discretization methods.
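The exact implicit update is not reproduced in this summary, but the general shape of such an iteration is easy to illustrate on the 1D Poisson problem: evolve a random field by damped residual steps interleaved with Gaussian smoothing, re-imposing the boundary values at every step. A minimal sketch under those assumptions:

```python
# Sketch of the iteration style described above for -u'' = f on [0,1] with
# u(0)=u(1)=0: a random initial field, Jacobi-style residual updates,
# occasional Gaussian smoothing, and boundary conditions enforced each step.
# The paper's exact implicit update is not given here; this is the shape.
import numpy as np
from scipy.ndimage import gaussian_filter1d

n = 201
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
f = np.pi**2 * np.sin(np.pi * x)              # exact solution: sin(pi x)

u = np.random.default_rng(0).normal(size=n)   # random initial field
for it in range(40000):
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / h**2
    u = u + 0.5 * h**2 * (lap + f)            # Jacobi-style residual step
    if it % 100 == 0:
        u = gaussian_filter1d(u, sigma=1.0)   # mild smoothing of the iterate
    u[0] = u[-1] = 0.0                        # strictly enforce the BCs

print("MSE vs analytical:", np.mean((u - np.sin(np.pi * x))**2))
```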
Results
The framework successfully converges to unique physical solutions from random initializations, with stable performance across different discretization parameters. Numerical experiments on Poisson, Heat, and viscous Burgers equations show accurate resolution of sharp gradients and controlled MSE, with results comparable to analytical solutions.
Implications
The proposed framework provides a new approach for solving PDEs that could enhance computational efficiency and stability in scientific and engineering applications. Its flexibility and speed make it suitable for real-time predictive modeling and complex simulations in various fields.
Asynchronous Federated Unlearning with Invariance Calibration for Medical Imaging
Federated Learning
- Introduces an asynchronous paradigm for Federated Unlearning in medical imaging.
- Addresses the latency issues of synchronous unlearning methods.
- Implements a server-side invariance calibration mechanism to ensure genuine data erasure.
- Achieves unlearning efficacy comparable to retraining while maintaining high fidelity.
Read more
Asynchronous Federated Unlearning with Invariance Calibration for Medical Imaging
Summary
This paper introduces Asynchronous Federated Unlearning with Invariance Calibration (AFU-IC), a novel framework designed to address the challenges of Federated Unlearning (FU) in medical imaging. Traditional FU methods often rely on synchronous coordination, which can lead to significant delays due to device heterogeneity and the blocking of operations for other clients while waiting for a target client to complete data erasure. Additionally, existing methods may only suppress the influence of erased data temporarily, allowing it to resurface in future training. AFU-IC decouples the erasure process from the global training workflow, enabling asynchronous unlearning without interrupting ongoing training. A key feature of AFU-IC is the server-side invariance calibration mechanism, which ensures that the model does not relearn erased data by enforcing a structural regularization that minimizes the KL-divergence between clean and triggered predictions. Extensive experiments on three medical benchmarks demonstrate that AFU-IC achieves unlearning efficacy and model fidelity comparable to traditional retraining methods while significantly reducing wall-clock latency, making it suitable for cross-silo medical environments.
Methodology
The AFU-IC framework allows target clients to perform unlearning asynchronously, decoupling the erasure process from global training. It employs a server-side invariance calibration mechanism that minimizes KL-divergence between clean and triggered predictions, ensuring that erased data features are permanently removed from the model's decision logic.
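The calibration objective itself is compact: push the model's predictions on triggered inputs (those carrying traces of erased data) toward its predictions on clean counterparts by minimizing their KL divergence. A minimal sketch of that term, with the paper-specific trigger construction omitted:

```python
# Sketch of the server-side invariance calibration described above: minimize
# KL(p_clean || p_triggered) so that the trigger no longer changes the
# model's decisions. How triggered inputs are built is paper-specific.
import torch
import torch.nn.functional as F

def invariance_calibration_loss(model, clean_x, triggered_x):
    p_clean = F.softmax(model(clean_x).detach(), dim=-1)   # reference dist.
    log_p_trig = F.log_softmax(model(triggered_x), dim=-1)
    # Zero exactly when triggered and clean predictions coincide.
    return F.kl_div(log_p_trig, p_clean, reduction="batchmean")
```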
Results
The experiments conducted on three medical datasets show that AFU-IC achieves unlearning efficacy and model fidelity similar to gold-standard retraining methods, while also significantly improving wall-clock efficiency. The framework ensures deep erasure that resists model reverting during subsequent training phases.
Implications
AFU-IC has significant implications for medical institutions that require compliance with data protection regulations, such as GDPR, by providing a reliable method for data erasure in federated learning settings. This framework can enhance collaborative medical image analysis while ensuring patient data privacy.