AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
- 68 papers today
- Updates every 8 hours
- 7 days of history
Multimodal Graph-based Classification of Esophageal Motility Disorders
Multimodal
Graph Learning
- Proposes a multimodal ML approach combining HRIM data with patient-specific information.
- Uses graph-based modeling to represent HRIM data, enhancing the analysis of esophageal motility.
- Demonstrates improved classification accuracy over traditional methods and vision-based classifiers.
- Highlights the importance of integrating multiple data modalities for better diagnostic outcomes.
Multimodal Graph-based Classification of Esophageal Motility Disorders
Summary
This paper addresses the challenges in diagnosing esophageal motility disorders, particularly dysphagia, by proposing a multimodal machine learning approach that combines high-resolution impedance manometry (HRIM) data with patient-specific information. The authors collected data from 104 patients and represented HRIM recordings as spatio-temporal graphs, where nodes correspond to pressure values and edges represent spatial adjacency and impedance dynamics. A graph neural network (GNN) was employed to learn meaningful representations from these graphs, which were then fused with patient embeddings for multi-category classification of swallow events. The study demonstrated that integrating patient-specific information significantly improved classification accuracy compared to models relying solely on HRIM features. Additionally, the graph-based approach outperformed vision-based classifier baselines, highlighting the importance of multimodal data integration in enhancing diagnostic precision. The findings suggest that this method could lead to more accurate and personalized assessments of esophageal motility disorders, although further validation with larger datasets is necessary to confirm these results.
Methodology
The study utilized HRIM recordings and patient data from 104 patients, transforming HRIM data into spatio-temporal graphs for analysis. A graph neural network was applied to learn representations from these graphs, which were then combined with patient embeddings for classification tasks. Ablation studies were conducted to evaluate the impact of different features and modeling approaches.
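To make the fusion step concrete, the minimal sketch below (not the authors' implementation) runs a toy message-passing encoder over node features, mean-pools a graph embedding, and concatenates it with a patient embedding before classification; the dimensions, adjacency construction, and five-class head are placeholder choices.

```python
import torch
import torch.nn as nn

class ToyGraphEncoder(nn.Module):
    """One round of neighbourhood averaging followed by a linear map."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim), adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        x = (adj @ x) / deg                # mean aggregation over neighbours
        return torch.relu(self.lin(x))

class MultimodalClassifier(nn.Module):
    def __init__(self, node_dim=4, patient_dim=8, hid=32, n_classes=5):
        super().__init__()
        self.gnn = ToyGraphEncoder(node_dim, hid)
        self.patient_mlp = nn.Sequential(nn.Linear(patient_dim, hid), nn.ReLU())
        self.head = nn.Linear(2 * hid, n_classes)

    def forward(self, x, adj, patient):
        g = self.gnn(x, adj).mean(dim=0)   # graph-level embedding via mean pooling
        p = self.patient_mlp(patient)      # patient-specific embedding
        return self.head(torch.cat([g, p], dim=-1))

# Toy example: 6 sensor nodes with pressure/impedance features, one patient vector.
x = torch.randn(6, 4)
adj = (torch.rand(6, 6) > 0.5).float()
adj = (((adj + adj.T) > 0).float() + torch.eye(6)).clamp(max=1.0)
patient = torch.randn(8)
logits = MultimodalClassifier()(x, adj, patient)
print(logits.shape)  # torch.Size([5])
```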
Results
The multimodal approach showed significant improvements in classification accuracy across all categories compared to models that used only HRIM-derived features. The graph-based modeling also provided advantages over vision-based classifier baselines, indicating the effectiveness of the proposed method.
Implications
The findings suggest that integrating patient-level data with graph-based representations of HRIM signals can lead to more accurate classifications of esophageal motility disorders, potentially improving clinical decision-making and patient outcomes. This approach could pave the way for personalized medicine in the diagnosis and treatment of dysphagia and related disorders.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
Efficient ML
Large Language Models
Theory
- Introduces HodgeCover, a learning-free expert-selection method based on higher-order topological structures.
- Identifies the harmonic kernel of the simplicial Laplacian as the mathematical signature of obstructions to expert merging.
- Demonstrates that HodgeCover can achieve significant expert reduction while maintaining or improving model accuracy.
- Presents a hybrid approach (HodgeCover+Wanda) that combines expert selection with weight pruning for enhanced compression.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
Summary
This paper introduces HodgeCover, a novel learning-free compression method for Sparse Mixture-of-Experts (MoE) layers that addresses the limitations of existing expert selection techniques. Traditional compressors often rely on pairwise compatibility scores, which fail to account for the structural complexities of expert mergeability, particularly when three experts form an irreducible cycle. The authors mathematically characterize this obstruction using the harmonic kernel of the simplicial Laplacian on a 2-complex representing the experts and their merge barriers. HodgeCover utilizes this harmonic kernel to create a selection objective that greedily covers critical edges and triangles, effectively identifying experts that can be merged without loss of performance. Additionally, a hybrid variant, HodgeCover+Wanda, combines this method with weight pruning to enhance compression further. The results demonstrate that HodgeCover achieves state-of-the-art performance in expert reduction while maintaining accuracy across various MoE architectures, significantly outperforming existing methods. This work highlights the importance of higher-order topological structures in optimizing the performance of sparse models.
Methodology
The authors construct a 2-dimensional simplicial complex representing the experts and their merge barriers, applying the simplicial Laplacian to perform Hodge decomposition. This decomposition isolates the harmonic component, which is then used to guide the selection of experts through a greedy coverage strategy. The hybrid method integrates this selection with unstructured weight pruning to optimize the overall compression process.
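The central object is the harmonic kernel of the 1-Hodge Laplacian. The toy sketch below (illustrative only, not the paper's pipeline) builds that Laplacian from the boundary matrices of a tiny 2-complex and shows how an unfilled triangle, an irreducible cycle, contributes a harmonic mode while a filled one does not.

```python
import numpy as np

# Toy 2-complex: 3 vertices and 3 oriented edges forming a triangle; the triangle
# face can be filled in or left as a hole. B1 maps edges -> vertices, B2 maps
# triangles -> edges.
B1 = np.array([[-1, -1,  0],   # vertex 0
               [ 1,  0, -1],   # vertex 1
               [ 0,  1,  1]])  # vertex 2; edges: (0,1), (0,2), (1,2)

def harmonic_basis(B1, B2):
    """Null space of the 1-Hodge Laplacian L1 = B1^T B1 + B2 B2^T."""
    L1 = B1.T @ B1 + B2 @ B2.T
    w, V = np.linalg.eigh(L1)
    return V[:, np.isclose(w, 0.0, atol=1e-9)]

# Unfilled triangle: the cycle has no 2-cell bounding it -> one harmonic mode.
B2_hole = np.zeros((3, 0))
print(harmonic_basis(B1, B2_hole).shape)    # (3, 1): an irreducible 1-cycle

# Filled triangle: the 2-cell kills the cycle -> the harmonic space is empty.
B2_filled = np.array([[1], [-1], [1]])
print(harmonic_basis(B1, B2_filled).shape)  # (3, 0): no obstruction
```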
Results
HodgeCover+Wanda achieves a 66% reduction in the number of experts across three different MoE architectures, outperforming the state-of-the-art method STUN+Wanda by up to 12.6 percentage points in downstream accuracy. An ablation study indicates that omitting the HodgeCover step results in a significant accuracy drop of 5.74 percentage points.
Implications
The findings suggest that incorporating higher-order topological insights can lead to more effective compression strategies for large-scale models, particularly in scenarios where maintaining performance while reducing complexity is critical. This approach could be beneficial in deploying efficient models in resource-constrained environments.
What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
Time Series
Multimodal
- Introduction of counterfactual time series forecasting with textual conditions.
- Development of the TADIFF model that utilizes a text-attribution mechanism.
- Creation of a comprehensive evaluation framework for factual and counterfactual settings.
- Implementation of counterfactual data augmentation to improve model adaptability.
What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
Summary
This paper addresses the limitations of traditional time series forecasting methods that rely solely on historical data and factual future conditions. The authors introduce the novel task of counterfactual time series forecasting with textual conditions, which allows for more flexible and condition-aware predictions. They propose a multimodal diffusion model called TADIFF, which incorporates a text-attribution mechanism to distinguish between intrinsic historical features and extrinsic textual conditions. This model aims to improve forecasting accuracy under complex and stochastic scenarios. The authors also present a comprehensive evaluation framework that assesses forecasting performance in both factual and counterfactual settings, even in the absence of ground truth data. By synthesizing diverse counterfactual training samples, TADIFF enhances adaptability to various future conditions. The paper highlights the importance of considering counterfactual scenarios in forecasting, particularly in real-world applications where future events can significantly influence outcomes.
Methodology
The authors propose the TADIFF model, which employs a text-attribution mechanism to separate intrinsic historical patterns from extrinsic textual conditions. This model is trained using counterfactual data augmentation to synthesize diverse training samples, enhancing its adaptability to various future scenarios. A novel semantic metric is introduced for evaluating forecasting consistency with both historical sequences and textual conditions.
Results
The TADIFF model demonstrates improved forecasting accuracy in counterfactual scenarios compared to traditional methods. The evaluation framework allows for systematic benchmarking, showing that the model can effectively adapt to complex future conditions and provide reliable forecasts even without ground truth data.
Implications
The findings suggest that incorporating counterfactual scenarios into time series forecasting can significantly enhance predictive performance in various domains, such as finance, healthcare, and traffic management. This approach may lead to more robust decision-making processes by better anticipating the impact of future events.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
NLP
Large Language Models
Interpretability
- Cosine similarity is an unreliable metric for assessing layer relevance in LLMs.
- The correlation between cosine similarity and performance degradation is often weak or moderate.
- A proposed alternative metric based on actual accuracy drop offers a more accurate assessment of layer importance.
- Empirical results show that significant performance can be maintained even after removing a substantial number of layers.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
Summary
This paper critiques the prevalent use of cosine similarity as a metric for assessing layer relevance in large language models (LLMs). The authors argue that cosine similarity often fails to accurately reflect the performance degradation caused by the removal of layers, leading to misleading interpretations of a model's internal mechanisms. Through theoretical analysis, they demonstrate that a layer can have a low cosine similarity score while still being crucial for model performance. Empirical evidence across various LLMs supports their claim, revealing a weak correlation between cosine similarity and actual performance degradation in over 90% of cases studied. The authors propose a more robust metric: the actual drop in model accuracy resulting from layer removal, despite its computational cost. This approach provides a clearer understanding of layer importance and informs better pruning strategies for model optimization. Their findings emphasize the need for a shift away from cosine similarity in evaluating layer relevance, which has significant implications for the development of interpretable LLMs.
Methodology
The authors conducted a theoretical analysis to demonstrate the limitations of cosine similarity and performed empirical evaluations on various LLMs to compare the correlation between cosine similarity scores and actual performance degradation. They introduced a new metric based on the accuracy drop from layer removal and replicated previous studies using this metric to assess layer relevance.
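A minimal sketch of the proposed metric, using a toy residual network as a stand-in for an LLM: each layer is skipped in turn and scored by the resulting accuracy drop. The model, data, and class count are placeholders.

```python
import torch
import torch.nn as nn

class ToyDeepNet(nn.Module):
    """Stand-in for an LLM: a stack of residual blocks that can be skipped."""
    def __init__(self, dim=16, depth=6, n_classes=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)
        )
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x, skip=None):
        for i, block in enumerate(self.layers):
            if i == skip:
                continue                   # "remove" layer i: the residual stream passes through untouched
            x = x + block(x)
        return self.head(x)

@torch.no_grad()
def accuracy(model, x, y, skip=None):
    return (model(x, skip=skip).argmax(-1) == y).float().mean().item()

def relevance_by_accuracy_drop(model, x, y):
    """Score each layer by how much accuracy is lost when that layer is removed."""
    baseline = accuracy(model, x, y)
    return [baseline - accuracy(model, x, y, skip=i) for i in range(len(model.layers))]

x, y = torch.randn(64, 16), torch.randint(0, 3, (64,))
print(relevance_by_accuracy_drop(ToyDeepNet(), x, y))
```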
Results
The study found that removing layers deemed irrelevant by cosine similarity often led to significant performance drops, with one layer causing a 66% accuracy reduction. In contrast, using the proposed accuracy drop metric revealed that over 75% accuracy could be maintained even after removing 22% of the layers, challenging previous assumptions about layer necessity in reasoning tasks.
Implications
The findings suggest that relying on cosine similarity can mislead interpretations of LLMs, and adopting the proposed accuracy drop metric could lead to more effective model pruning and optimization strategies. This has implications for the development of more interpretable and efficient LLM architectures.
Proposal and study of statistical features for string similarity computation and classification
NLP
- Introduction of COM and RLM features for string similarity computation.
- Features are language-agnostic and purely statistical.
- COM and RLM outperform traditional statistical measures in synthetic experiments.
- RLM features achieve the best results in a real text plagiarism dataset.
Proposal and study of statistical features for string similarity computation and classification
Summary
This paper presents novel statistical features derived from visual computing techniques, specifically co-occurrence matrix (COM) and run-length matrix (RLM), for computing string similarity across various contexts, including words, phrases, and texts. Unlike traditional methods, these features are language-agnostic and focus purely on statistical properties. The authors evaluate these features against established statistical measures such as longest common subsequence, mutual information, and edit distances. Through synthetic experiments, the COM and RLM features demonstrate superior performance, achieving statistical significance over competing methods in 75% of the cases tested (p-value < 0.001). In practical applications, particularly in a text plagiarism dataset, RLM features yielded the best results, indicating their effectiveness in real-world scenarios. The findings suggest that these statistical features can enhance string similarity computations and classifications in diverse applications, including text mining and optical character recognition.
Methodology
The authors adapted features from visual computing, specifically COM and RLM, for string similarity computation. They conducted synthetic experiments to compare these features against traditional statistical measures, evaluating performance based on statistical significance and accuracy in a real-world text plagiarism dataset.
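As a rough illustration (not the paper's exact feature definitions), the sketch below adapts co-occurrence and run-length matrices to character strings and compares two strings through a small statistical feature vector; the chosen statistics, offset, and similarity function are placeholder examples.

```python
import numpy as np

def cooccurrence_matrix(s, offset=1):
    """Counts of character pairs (s[i], s[i+offset]), normalised to probabilities."""
    alphabet = sorted(set(s))
    idx = {c: i for i, c in enumerate(alphabet)}
    M = np.zeros((len(alphabet), len(alphabet)))
    for a, b in zip(s, s[offset:]):
        M[idx[a], idx[b]] += 1
    return M / max(M.sum(), 1)

def run_lengths(s):
    """Run lengths of repeated characters, e.g. 'aab' -> [2, 1]."""
    lengths, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        lengths.append(j - i)
        i = j
    return np.array(lengths, dtype=float)

def texture_features(s):
    """Small illustrative feature vector: COM energy/entropy plus run statistics."""
    M = cooccurrence_matrix(s)
    p = M[M > 0]
    lengths = run_lengths(s)
    return np.array([
        (M ** 2).sum(),                  # energy of the co-occurrence matrix
        -(p * np.log2(p)).sum(),         # entropy of the co-occurrence matrix
        lengths.mean(),                  # mean run length
        (lengths ** 2).mean(),           # second moment of run lengths
    ])

def similarity(s1, s2):
    f1, f2 = texture_features(s1), texture_features(s2)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))

print(similarity("the cat sat on the mat", "the cat sat on a mat"))
print(similarity("the cat sat on the mat", "zxqvv zz qq vvv"))
```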
Results
The COM and RLM features outperformed traditional statistical measures in synthetic experiments, with RLM features achieving the highest accuracy in a real text plagiarism dataset. The statistical significance of the results was confirmed with p-values less than 0.001 in most comparisons.
Implications
The proposed statistical features can significantly improve string similarity computations in various applications, including text mining, optical character recognition, and plagiarism detection, offering a robust solution that is not influenced by language-specific characteristics.
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
NLP
Large Language Models
Generative Models
- Introduction of NodeSynth, a methodology for generating socially aligned synthetic data for AI evaluation.
- NodeSynth significantly outperforms human-authored benchmarks in eliciting model failures.
- The methodology utilizes a fine-tuned taxonomy generator (TaG) grounded in real-world evidence.
- Ablation studies confirm the importance of granular taxonomic depth in identifying model vulnerabilities.
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
Summary
NodeSynth introduces a novel methodology for generating socially relevant synthetic queries aimed at evaluating AI models, particularly in sensitive domains. The authors highlight the limitations of existing synthetic data generation methods, which often fail to capture the sociotechnical nuances necessary for effective model evaluation. NodeSynth employs a fine-tuned taxonomy generator (TaG) that is grounded in real-world evidence to create queries that reflect complex societal contexts. The methodology is evaluated against four mainstream large language models (LLMs), revealing that NodeSynth-generated queries lead to failure rates up to five times higher than those generated from human-authored benchmarks. The paper emphasizes the importance of granular taxonomic expansion in driving these failure rates and identifies deficiencies in existing guard models. The authors provide an open-source research prototype and datasets to facilitate scalable model evaluation and targeted safety interventions, thereby contributing to the field of AI safety and evaluation.
Methodology
NodeSynth employs a systematic approach that includes a multi-scale taxonomy generation process, evidence grounding, and automated annotation to create synthetic queries. The fine-tuned taxonomy generator (TaG) is used to ensure that the generated data reflects real-world complexities and sensitive attributes, allowing for a more nuanced evaluation of AI models.
Results
The evaluation of NodeSynth against four mainstream LLMs demonstrated that the synthetic queries generated led to failure rates that were up to five times higher than those generated from human-authored data. The ablation studies indicated that the depth of the taxonomic expansion was a significant factor in driving these higher failure rates, exposing shortcomings in existing guard models.
Implications
NodeSynth has the potential to enhance the evaluation frameworks for AI models, particularly in sensitive domains such as health and education. By providing a method to generate contextually relevant synthetic data, it can help identify and mitigate biases and vulnerabilities in AI systems, ultimately contributing to safer AI deployment.
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
Efficient ML
- Introduces a framework for replacing Layer Normalization with RMSNorm in DNNs.
- Defines 'foldable LNs' and develops a graph-based detection algorithm.
- Achieves 2% to 12% inference-time acceleration without changing model predictions.
- Maintains competitive performance compared to standard Layer Normalization in practical training settings.
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
Summary
This paper addresses the computational inefficiencies associated with Layer Normalization (LN) in deep neural networks (DNNs) by proposing a framework that allows for the replacement of LN with RMSNorm without compromising model performance. While LN provides benefits in stabilizing training, its per-sample centering and scaling operations introduce significant overhead during inference. RMSNorm, which eliminates the centering operation, offers a more efficient alternative but may lose some advantages of centering. The authors introduce a method to determine when LN can be replaced by RMSNorm by folding the centering operation into upstream linear layers using a column-centered constraint (CCC) and column-based weight centering (CBWC). They define 'foldable LNs' and develop a graph-based detection algorithm to identify these layers in arbitrary DNNs. Their analysis reveals that many LNs in popular architectures are foldable, enabling inference-time conversion that results in 2% to 12% acceleration without altering model predictions. Furthermore, experiments demonstrate that even when the exact conditions for folding are not met, the proposed method remains competitive with standard LN while enhancing efficiency, particularly in long-sequence tasks.
Methodology
The authors propose a framework that characterizes when the centering operation of LN can be removed by enforcing zero-mean outputs through CCC and CBWC. They extend their analysis to general DNNs, defining foldable LNs and utilizing a graph-based approach to detect these layers. An algorithm is presented to identify the necessary upstream layers for applying CBWC.
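The identity behind the folding is easy to verify numerically: if the upstream linear layer's weight columns and bias are re-centered so its outputs have zero mean across features, LayerNorm's centering becomes a no-op and the normalization reduces to RMSNorm. The sketch below checks exactly that on a random layer; it is a simplified illustration of CBWC, not the full detection algorithm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 8, 16
lin = nn.Linear(d_in, d_out)
ln = nn.LayerNorm(d_out, elementwise_affine=False)

def rms_norm(y, eps=1e-5):
    return y / torch.sqrt(y.pow(2).mean(dim=-1, keepdim=True) + eps)

# Column-based weight centering: subtract each weight column's mean over output
# features, and centre the bias, so the folded layer's outputs are zero-mean.
folded = nn.Linear(d_in, d_out)
with torch.no_grad():
    folded.weight.copy_(lin.weight - lin.weight.mean(dim=0, keepdim=True))
    folded.bias.copy_(lin.bias - lin.bias.mean())

x = torch.randn(4, d_in)
out_ln = ln(lin(x))                 # original path: Linear -> LayerNorm
out_rms = rms_norm(folded(x))       # folded path:   centred Linear -> RMSNorm
print(torch.allclose(out_ln, out_rms, atol=1e-6))  # True
```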
Results
The analysis shows that many LNs in widely used architectures are foldable, allowing for significant inference-time acceleration of 2% to 12%. Experiments across various tasks indicate that the proposed CBWC+RMSNorm method performs comparably to vanilla LN, particularly in scenarios where exact folding conditions are not fully satisfied.
Implications
This work has significant implications for the design of efficient neural network architectures, particularly in applications requiring real-time inference, such as NLP and computer vision. By enabling the use of RMSNorm in place of LN, the proposed method can lead to faster model inference times while maintaining accuracy.
Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning
Reinforcement Learning
Efficient ML
Theory
- Introduces probabilistic HD-CB, a low-precision variant of HD-CB for resource-constrained devices.
- Replaces deterministic accumulation with a probabilistic update rule to enhance decision-making efficiency.
- Demonstrates improved performance over binarized HD-CB while maintaining low precision.
- Addresses the overflow issue in low-precision components without the need for periodic binarization.
Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning
Summary
This paper addresses the challenges of deploying contextual bandit (CB) algorithms on resource-constrained devices, where standard linear CB methods are often impractical due to their high computational and memory costs. The authors introduce a novel approach called probabilistic HD-CB, which enhances the existing hyperdimensional computing-based CB method (HD-CB) by replacing deterministic accumulation with a probabilistic update rule. This new method updates only a random subset of vector components at each step, allowing for low-precision components while preventing overflow and reducing update costs. The authors demonstrate that probabilistic HD-CB outperforms the previously proposed binarized HD-CB method at equal precision levels and approaches the performance of the original HD-CB with significantly lower precision requirements. The findings suggest that probabilistic HD-CB is a promising solution for implementing CB algorithms on devices with strict resource constraints, making it suitable for various applications in edge computing and adaptive services.
Methodology
The authors developed probabilistic HD-CB by implementing a probabilistic update mechanism that randomly selects a subset of vector components to update at each decision step. This method is inspired by low-precision neural networks and incorporates time-decaying update probabilities to control the learning rate, allowing for efficient memory usage and preventing overflow in low-precision components.
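A minimal sketch of the update rule as described, with a placeholder hyperdimensional encoding and a toy reward model: each step updates only a randomly selected subset of the chosen arm's low-precision components, with a time-decaying probability, so the values stay within a 3-bit signed range.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_arms = 1024, 4
INT_MIN, INT_MAX = -4, 3                      # 3-bit signed components

arm_models = np.zeros((n_arms, D), dtype=np.int8)   # low-precision per-arm hypervectors

def encode(context):
    """Placeholder bipolar hyperdimensional encoding of the context."""
    return np.sign(rng.standard_normal(D) + context.mean()).astype(np.int8)

def probabilistic_update(model, context_hv, reward, t, p0=0.5):
    """Update a random subset of components; the update probability decays over time."""
    p_t = p0 / np.sqrt(t)
    mask = rng.random(D) < p_t
    delta = np.where(reward > 0, context_hv, -context_hv)
    model[mask] = np.clip(model[mask] + delta[mask], INT_MIN, INT_MAX)

for t in range(1, 201):
    context = rng.standard_normal(8)
    hv = encode(context)
    scores = arm_models @ hv.astype(np.int32)        # similarity-based arm selection
    arm = int(np.argmax(scores))
    reward = float(rng.random() < 0.5 + 0.1 * (arm == 0))  # toy environment
    probabilistic_update(arm_models[arm], hv, reward, t)
```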
Results
Probabilistic HD-CB consistently outperformed the binarized HD-CB approach in off-policy evaluations on standardized synthetic CB benchmarks, achieving performance levels comparable to the original HD-CB with as few as 3 bits per component. This indicates a significant reduction in resource requirements while maintaining decision quality.
Implications
The findings suggest that probabilistic HD-CB can facilitate the deployment of contextual bandit algorithms on resource-constrained devices, enabling real-time decision-making in various applications such as online advertising, personalized recommendations, and adaptive resource management in edge computing environments.
Discovery of Hidden Miscalibration Regimes
Large Language Models
NLP
Interpretability
- Introduces the concept of hidden miscalibration regimes that are not detectable through traditional calibration methods.
- Defines an input-dependent miscalibration field to measure calibration error across the input space.
- Demonstrates the prevalence of calibration heterogeneity in large language models across various datasets.
- Provides a diagnostic framework that supports local confidence corrections, enhancing model reliability.
Discovery of Hidden Miscalibration Regimes
Summary
This paper addresses the issue of model calibration in machine learning, particularly focusing on the limitations of traditional calibration evaluation methods that rely solely on confidence scores. The authors argue that such methods can obscure significant calibration failures by treating all inputs with the same confidence as exchangeable, leading to a lack of insight into the model's performance across different input types. To tackle this problem, the authors propose a novel framework for discovering hidden miscalibration regimes without requiring predefined data slices. They introduce the concept of a miscalibration field, which captures the signed calibration error across the input space. By learning a calibration-aware representation of the input space, the framework enables the identification of regions where models are systematically overconfident or underconfident. The authors validate their approach through synthetic experiments and a large-scale study involving four real-world benchmarks and twelve large language models (LLMs). The findings reveal that input-dependent calibration heterogeneity is common, and the discovered miscalibration fields can inform targeted local confidence corrections, improving model reliability in areas where traditional methods fall short.
Methodology
The authors develop a diagnostic framework that learns a representation of the input space, allowing for the estimation of a miscalibration field. This field captures local calibration errors by averaging residuals in a learned geometry, facilitating the identification of coherent regions of overconfidence and underconfidence without relying on predefined data slices.
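A minimal sketch of the estimator's core idea, assuming the calibration-aware embedding is already available (random features stand in for it here): signed residuals (confidence minus correctness) are averaged over nearest neighbours to estimate a local miscalibration field, which can then drive a local confidence correction.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Placeholder data: an embedding per example, the model's confidence, and correctness.
Z = rng.standard_normal((2000, 16))                  # calibration-aware representation (assumed given)
conf = np.clip(0.7 + 0.2 * np.tanh(Z[:, 0]) + 0.05 * rng.standard_normal(2000), 0.01, 0.99)
correct = (rng.random(2000) < np.clip(conf - 0.3 * (Z[:, 0] > 1), 0, 1)).astype(float)
# In the region Z[:, 0] > 1 the model is systematically overconfident.

def miscalibration_field(Z, conf, correct, k=50):
    """Signed local calibration error: average of (confidence - correctness) over k neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(Z)
    _, idx = nn.kneighbors(Z)
    residual = conf - correct
    return residual[idx].mean(axis=1)

field = miscalibration_field(Z, conf, correct)
corrected = np.clip(conf - field, 0.01, 0.99)        # simple local confidence correction
print("overconfident region, mean field:", field[Z[:, 0] > 1].mean())
print("elsewhere, mean field:          ", field[Z[:, 0] <= 1].mean())
```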
Results
The framework successfully identifies hidden miscalibration structures in controlled synthetic experiments and reveals significant calibration heterogeneity in real-world LLM benchmarks. The results indicate that some model-dataset pairs exhibit pronounced regions of miscalibration, and the learned fields effectively reduce calibration error in these areas, outperforming traditional confidence-based methods.
Implications
The findings suggest that understanding and addressing hidden miscalibration regimes can lead to more reliable machine learning models, particularly in complex domains like natural language processing. The proposed framework can be applied to improve model calibration in various applications, enhancing decision-making processes that rely on model predictions.
Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise
Theory
Optimization
- First population risk bounds for KANs trained with mini-batch SGD and correlated noise.
- Establishes bounds for both non-private and differentially private settings.
- Introduces a novel analysis framework for correlated-noise DP training in non-convex regimes.
- Demonstrates that correlated noise can improve the privacy-utility tradeoff compared to independent noise.
Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise
Summary
This paper presents the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained using mini-batch Stochastic Gradient Descent (SGD) with gradient clipping, addressing both non-private and differentially private (DP) settings with correlated noise. The authors highlight that traditional analyses have been limited to full-batch gradient descent and independent noise, which do not reflect practical training scenarios. The study introduces a novel analysis framework that accounts for the challenges posed by temporal correlations in noise, which can improve the privacy-utility tradeoff. The results extend existing KAN theory by providing sharper bounds and covering various special cases, including non-private mini-batch SGD and independent-noise DP-SGD. The paper's contributions include establishing population risk bounds for two-layer KANs, demonstrating that correlated noise can enhance performance, and providing a new analytical route for DP training in non-convex settings. This work is significant as it fills a gap in the literature regarding population risk guarantees for non-convex neural networks under practical training conditions.
Methodology
The authors develop a new analysis route that incorporates an auxiliary unprojected dynamics and a shifted iterate to manage the challenges posed by correlated noise in the optimization process. They utilize high-probability bootstrapping to certify projection inactivity, allowing for a more accurate assessment of population risk bounds in the context of mini-batch SGD and DP training.
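For orientation only, the sketch below shows one simple way clipped mini-batch SGD can be combined with temporally correlated (here AR(1)) Gaussian noise. The correlation scheme, constants, and per-batch (rather than per-example) clipping are illustrative, and no privacy accounting is attempted, so this should not be read as the paper's mechanism.

```python
import torch

def clip_grad(g, clip_norm):
    """Clip the batch gradient to a fixed l2 norm (real DP-SGD clips per example)."""
    scale = torch.clamp(clip_norm / (g.norm() + 1e-12), max=1.0)
    return g * scale

def dp_sgd_correlated(params_shape, grad_fn, steps=100, lr=0.1,
                      clip_norm=1.0, sigma=0.5, rho=0.6):
    """Clipped SGD where the injected Gaussian noise is AR(1)-correlated across steps."""
    w = torch.zeros(params_shape)
    noise = torch.zeros(params_shape)
    for _ in range(steps):
        g = clip_grad(grad_fn(w), clip_norm)
        # Correlated noise: each step reuses part of the previous step's noise.
        noise = rho * noise + (1 - rho ** 2) ** 0.5 * sigma * clip_norm * torch.randn(params_shape)
        w = w - lr * (g + noise)
    return w

# Toy objective: least squares on random data (a stand-in for the KAN training loss).
X, y = torch.randn(256, 10), torch.randn(256)
def grad_fn(w):
    idx = torch.randint(0, 256, (32,))               # mini-batch sampling
    xb, yb = X[idx], y[idx]
    return 2 * xb.T @ (xb @ w - yb) / len(idx)

print(dp_sgd_correlated((10,), grad_fn).norm())
```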
Results
The paper establishes that the population risk bounds for two-layer KANs trained by clipped mini-batch SGD are valid under specific width regimes. In the DP setting, the results provide the first population risk bounds for correlated-noise DP training in non-convex settings, matching the convex DP-SCO lower bound up to logarithmic factors. The findings also confirm that the bounds for KANs encompass various training scenarios, including non-private and independent-noise cases.
Implications
The results have significant implications for the training of neural networks in sensitive applications, such as healthcare and finance, where privacy is crucial. The improved understanding of population risk in the context of correlated noise can lead to better model performance and privacy guarantees in practical machine learning deployments.
Reliability-Gated Source Anchoring for Continual Test-Time Adaptation
Computer Vision
Theory
Optimization
- Identification of 'blind anchoring' as a systematic failure in CTTA methods when relying on unreliable sources.
- Introduction of RMEMSAFE, which gates source-coupled loss terms using a runtime reliability signal derived from the source model's predictive entropy.
- Demonstration of an analytical graceful-decay property, ensuring performance stability as source reliability decreases.
- Empirical validation showing RMEMSAFE outperforms existing methods across multiple benchmarks and degradation scenarios.
Reliability-Gated Source Anchoring for Continual Test-Time Adaptation
Summary
This paper addresses the challenges of continual test-time adaptation (CTTA) in machine learning, particularly the issue of 'blind anchoring' where models continue to rely on a deteriorating source reference during adaptation. The authors introduce RMEMSAFE, a reliability-gated extension of the ROID method, which utilizes the normalized predictive entropy of the frozen source to modulate the strength of the source anchor dynamically. This approach allows the model to adapt more safely by reducing reliance on the source when its reliability is low. The paper demonstrates that RMEMSAFE achieves superior performance on the CCC benchmark, particularly in scenarios where the source accuracy is significantly degraded. The method not only improves the adaptation process but also ensures a graceful decay in performance when the source quality declines, thereby addressing a critical failure mode in existing CTTA methods.
Methodology
The authors propose RMEMSAFE, which modifies the ROID optimization framework by introducing a reliability gate that adjusts the strength of source-coupled terms based on the normalized predictive entropy of the source. This gate is combined with auxiliary stabilizers like marginal calibration and confidence-scaled learning rates to maintain performance when the source is unreliable.
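A simplified sketch of the gating idea: the frozen source model's normalized predictive entropy on the current batch yields a reliability score in [0, 1] that scales the source-anchor term of the adaptation loss. The specific loss terms below are placeholders, not ROID's full objective.

```python
import torch
import torch.nn.functional as F

def reliability_gate(source_logits):
    """1 minus the normalised predictive entropy of the frozen source (1 = fully reliable)."""
    probs = source_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    max_entropy = torch.log(torch.tensor(float(source_logits.shape[-1])))
    return (1.0 - entropy / max_entropy).clamp(0.0, 1.0)

def gated_adaptation_loss(student_logits, source_logits):
    """Entropy-minimisation adaptation term plus a reliability-gated source anchor."""
    probs = student_logits.softmax(dim=-1)
    adapt = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    anchor = F.kl_div(student_logits.log_softmax(-1), source_logits.softmax(-1),
                      reduction="batchmean")
    g = reliability_gate(source_logits)
    return adapt + g * anchor, g

student = torch.randn(32, 10, requires_grad=True)
source = torch.randn(32, 10)                      # frozen source predictions on the same batch
loss, gate = gated_adaptation_loss(student, source)
print(float(loss), float(gate))
```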
Results
RMEMSAFE achieves the lowest error on 8 out of 9 matched-split continual-corruption benchmark cells, improving mean error rates by 1.05 percentage points on ResNet-50 and 0.48 percentage points on ViT-B/16 compared to the best prior method. Additionally, it exhibits a shallower harm slope under controlled source degradation, indicating a more graceful performance decline.
Implications
The findings suggest that incorporating reliability signals into adaptation processes can significantly enhance model robustness in dynamic environments. This approach could be applied in various real-world applications where models must adapt to changing data distributions without labeled feedback.
Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning
Theory
- Imbalanced forgetting is a systematic issue in rehearsal-based CIL, leading to unequal class retention.
- Three last-layer coefficients were developed to quantify gradient-level interference affecting class performance.
- Self-induced interference is identified as the most significant predictor of class forgetting.
- The study provides a mechanistic understanding of how rehearsal impacts class retention in CIL.
Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning
Summary
This paper investigates the phenomenon of imbalanced forgetting in rehearsal-based class-incremental learning (CIL), where neural networks tend to forget certain classes more than others despite balanced rehearsal allocation. The authors identify that this imbalanced forgetting is systematic and significant, posing challenges in real-world applications where all classes are equally important. They introduce three last-layer coefficients that capture different sources of gradient-level interference affecting class performance during incremental learning. These coefficients are shown to reliably predict the extent of forgetting for past classes. The study highlights that self-induced interference is the strongest predictor of forgetting, and controlled experiments support the relationship between this interference and new-class interference. The findings provide insights into mitigating imbalanced forgetting by addressing disparities in interference sources across classes.
Methodology
The authors conducted comprehensive benchmarks across various CIL scenarios to observe patterns of forgetting. They derived an upper bound, termed the Class-Wise R-SGD lemma, to identify gradient-level factors influencing class performance. From this analysis, they constructed three coefficients representing different sources of interference, focusing on last-layer gradients for computational efficiency.
Results
The experiments revealed that imbalanced forgetting is a robust phenomenon, with significant disparities in forgetting rates among classes. The last-layer coefficients effectively predicted class performance outcomes, with the self-induced interference coefficient emerging as the strongest predictor. Controlled experiments confirmed the influence of new-class interference on forgetting rates.
Implications
The insights from this study can inform the development of more effective rehearsal strategies in CIL, potentially leading to improved performance in applications where class retention is critical. Understanding the sources of interference can guide future research in continual learning and help design algorithms that mitigate forgetting.
Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks
Time Series
- Introduces a deep learning model for downscaling LST from coarse geostationary observations to high-resolution (1 km) fields.
- Achieves high accuracy in LST forecasting with low RMSE and bias errors.
- Demonstrates the applicability of the model across major European cities.
- Provides a framework for intraday LST nowcasting, enhancing urban climate studies.
Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks
Summary
This paper addresses the challenge of accurately forecasting Land Surface Temperature (LST) in urban areas by combining high spatial and temporal resolution satellite data. The authors propose a two-step approach: first, a U-Net model is developed to downscale LST fields from geostationary satellites (SEVIRI/MSG) to a higher resolution (1 km) using collocated data from polar orbiting satellites (Terra/Aqua MODIS). This model is trained on LST data from major European cities with populations over 1 million, achieving an RMSE of 1.92 °C and a mean bias error of 0.01 °C on the test set. The second step involves a nowcasting model based on ConvLSTM architecture, which forecasts LST fields for lead times of 15 to 75 minutes. This model outperforms traditional benchmarks, yielding RMSEs between 0.57 and 1.15 °C and biases ranging from -0.1 to 0.14 °C. The validation against independent MODIS overpasses confirms the robustness of the forecasts. This research represents a significant advancement in the field of urban climate monitoring, providing a practical tool for operational satellite-based LST monitoring.
Methodology
The study employs a U-Net architecture for downscaling LST fields from SEVIRI/MSG to MODIS resolution and a ConvLSTM model for nowcasting LST fields. The models are trained on extensive datasets from large European cities, focusing on both spatial and temporal resolution improvements.
Results
The downscaling model achieved an RMSE of 1.92 °C and a mean bias error of 0.01 °C. The nowcasting model outperformed benchmarks with RMSEs of 0.57 to 1.15 °C for lead times of 15 to 75 minutes, demonstrating effective short-term forecasting capabilities.
Implications
The proposed models can significantly enhance urban climate monitoring, allowing for better management of urban heat islands and related phenomena. The high-resolution, high-frequency LST data can be utilized in various applications, including energy demand forecasting, ecosystem monitoring, and climate change studies.
Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction
Optimization
Time Series
- Introduction of DBS-Adam, a novel optimiser for deep learning.
- DBS-Adam dynamically adjusts learning rates based on batch difficulty.
- Integration with Bi-LSTM networks improves prediction of injury severity.
- Significant performance improvements over traditional optimisers.
Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction
Summary
This paper introduces the Dynamic Batch-Sensitive Adam (DBS-Adam) optimiser, designed to enhance the training of deep learning models on imbalanced and sequential datasets, particularly in the context of predicting vehicular accident injury severity. Traditional optimisers often struggle with class imbalance and noisy data, which are prevalent in road traffic accident records. DBS-Adam addresses these challenges by dynamically adjusting the learning rate based on a batch difficulty score derived from exponential moving averages of gradient norms and batch loss. The authors integrate DBS-Adam with Bi-Directional Long Short-Term Memory (Bi-LSTM) networks to predict injury severity, employing SMOTE-ENN resampling and Focal Loss to tackle class imbalance. The performance of DBS-Adam is rigorously evaluated against state-of-the-art optimisers such as AMSGrad, AdamW, and AdaBound across multiple experimental configurations. The results demonstrate that DBS-Adam significantly improves model performance, achieving a test accuracy of 95.22%, precision of 96.11%, recall of 95.28%, F1-score of 95.39%, and a test loss of 0.0086. This study highlights the potential of DBS-Adam for real-time accident severity classification, which can inform emergency response strategies and enhance road safety interventions.
Methodology
The study employs a novel optimiser, DBS-Adam, which adjusts the learning rate based on a difficulty score derived from the gradients and losses of batches. This optimiser is integrated with Bi-Directional LSTM networks for predicting injury severity from vehicular accident data. The study also uses SMOTE-ENN for resampling to address class imbalance and Focal Loss to enhance model training.
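A simplified training-loop sketch of the mechanism as described: exponential moving averages of the gradient norm and batch loss define a per-batch difficulty score that rescales Adam's learning rate before each step. The score formula, clamping range, and toy model are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 3))
base_lr = 1e-3
opt = torch.optim.Adam(model.parameters(), lr=base_lr)
loss_fn = nn.CrossEntropyLoss()

ema_grad, ema_loss, beta = None, None, 0.9

for step in range(200):
    x, y = torch.randn(64, 20), torch.randint(0, 3, (64,))
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    ema_grad = grad_norm if ema_grad is None else beta * ema_grad + (1 - beta) * grad_norm
    ema_loss = loss.detach() if ema_loss is None else beta * ema_loss + (1 - beta) * loss.detach()

    # Difficulty score: how much harder this batch looks than the running averages.
    difficulty = 0.5 * (grad_norm / (ema_grad + 1e-8) + loss.detach() / (ema_loss + 1e-8))
    for group in opt.param_groups:
        group["lr"] = float(base_lr / difficulty.clamp(min=0.5, max=2.0))
    opt.step()
```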
Results
DBS-Adam outperformed traditional optimisers, achieving a test accuracy of 95.22%, precision of 96.11%, recall of 95.28%, F1-score of 95.39%, and a test loss of 0.0086. The results were statistically significant, with a p-value of 0.020, indicating a robust improvement in model performance.
Implications
The findings suggest that DBS-Adam can significantly enhance the training of deep learning models on imbalanced datasets, making it particularly useful for applications in real-time accident severity classification, which can improve emergency response and resource allocation in traffic incidents.
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability
Large Language Models
NLP
Efficient ML
- Task-aware pruning significantly improves performance on OOD data but not on ID data.
- OOD inputs induce a mismatch in representation geometry compared to ID inputs.
- Certain layers can amplify distortions in representation, affecting model performance based on input distribution.
- Pruning layers that amplify mismatches can realign OOD representations with the adapted geometry.
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability
Summary
This paper investigates the phenomenon of task-aware layer pruning, specifically through the TALE method, which has been shown to enhance model performance on out-of-distribution (OOD) data while not benefiting in-distribution (ID) data. The authors provide a geometric interpretation of this effect, demonstrating that OOD inputs distort the representation geometry learned from ID inputs. By pruning layers that amplify this distortion, task-aware pruning realigns OOD representations with the adapted geometry, leading to improved performance. The study employs controlled polynomial regression tasks and large language models (LLMs) to establish that the benefits of pruning are specific to OOD scenarios. The authors also present causal evidence through distribution shifts and residual-scaling interventions, confirming that certain layers can act as amplifiers of distortion, thus affecting performance based on the input distribution. The findings suggest that task-aware pruning is not only task-specific but also distribution-dependent, providing insights into the mechanisms that underlie the effectiveness of pruning in enhancing OOD model capabilities.
Methodology
The authors analyze the effects of task-aware pruning using the TALE method in a controlled setting that distinguishes between ID and OOD data. They employ geometric statistics to assess changes in representation spaces and conduct causal experiments involving distribution shifts and residual-scaling interventions to validate their findings across different model scales.
Results
The results indicate that task-aware pruning leads to consistent improvements in OOD performance across various model scales, while showing no benefits for ID data. The analysis reveals that OOD inputs create a significant distortion in the representation geometry, which can be mitigated through strategic layer pruning.
Implications
The findings suggest that task-aware pruning can be an effective strategy for enhancing the robustness of models to distribution shifts, which is particularly relevant in real-world applications where data may not always conform to training distributions. This could lead to more resilient AI systems in various domains, especially in NLP tasks.
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
Large Language Models
NLP
Optimization
- EVOLVEMEM allows for self-evolution of both memory content and retrieval mechanisms.
- The architecture employs a closed-loop diagnosis system powered by LLMs to optimize retrieval configurations.
- The AutoResearch process enables the system to autonomously improve its performance without manual tuning.
- EVOLVEMEM shows significant performance improvements over strong baselines on multiple benchmarks.
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
Summary
The paper introduces EVOLVEMEM, a self-evolving memory architecture designed for large language model (LLM) agents that require long-term memory across multiple sessions. Traditional memory systems maintain a fixed retrieval infrastructure, which becomes suboptimal as the complexity of stored memories increases. EVOLVEMEM addresses this issue by allowing both the stored knowledge and the retrieval mechanisms to co-evolve. The architecture features a structured action space optimized by an LLM-powered diagnosis module that analyzes failure logs, identifies root causes, and proposes configuration adjustments. This closed-loop self-evolution process, termed AutoResearch, enables the system to autonomously conduct iterative research cycles on its architecture, leading to the discovery of effective retrieval strategies. The results demonstrate that EVOLVEMEM significantly outperforms existing baselines on benchmark datasets, showcasing its ability to adaptively enhance retrieval strategies over time.
Methodology
The methodology involves a four-step evolution loop: EVALUATE, DIAGNOSE, PROPOSE, and GUARD. The LLM-powered diagnosis module evaluates per-question failure logs to diagnose issues and propose targeted adjustments to the retrieval configuration. A guarded meta-analyzer applies these adjustments while ensuring safeguards against regression and stagnation.
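The closed loop is easiest to see as a skeleton. In the sketch below, `evaluate`, `diagnose`, and `propose` are hypothetical rule-based stand-ins for the benchmark harness and the LLM-powered diagnosis and proposal modules, and the configuration fields are invented; only the EVALUATE-DIAGNOSE-PROPOSE-GUARD control flow reflects the paper.

```python
import random

def evaluate(config):
    """Hypothetical benchmark: higher top_k and reranking help, up to a point."""
    random.seed(str(config))
    return min(config["top_k"], 8) / 10 + 0.2 * config["rerank"] + random.uniform(0, 0.05)

def diagnose(config, score):
    """Stand-in for the LLM diagnosis module: name the most likely bottleneck."""
    return "retrieval_recall" if config["top_k"] < 8 else "ranking_noise"

def propose(config, issue):
    """Stand-in for the LLM proposal module: a targeted configuration adjustment."""
    new = dict(config)
    if issue == "retrieval_recall":
        new["top_k"] += 2
    else:
        new["rerank"] = True
    return new

def auto_research(config, rounds=6, min_gain=1e-3):
    best, best_score = config, evaluate(config)      # EVALUATE the starting configuration
    for _ in range(rounds):
        issue = diagnose(best, best_score)           # DIAGNOSE from evaluation feedback
        candidate = propose(best, issue)             # PROPOSE a configuration change
        score = evaluate(candidate)                  # EVALUATE the candidate
        if score - best_score > min_gain:            # GUARD against regression and stagnation
            best, best_score = candidate, score
    return best, best_score

print(auto_research({"top_k": 2, "rerank": False}))
```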
Results
On the LoCoMo benchmark, EVOLVEMEM outperforms the strongest baseline by 25.7% relative and achieves a 78.0% improvement over a minimal baseline. On MemBench, it exceeds the strongest baseline by 18.9% relative. The configurations evolved by EVOLVEMEM demonstrate positive transfer across benchmarks, indicating their robustness and generalizability.
Implications
The findings suggest that adaptive memory systems can significantly enhance the performance of LLM agents in various applications, such as personal assistants, coding agents, and customer service systems. The ability to autonomously evolve retrieval strategies could lead to more efficient and effective memory management in AI systems.
Learning with Shallow Neural Networks on Cluster-Structured Features
Theory
Efficient ML
Optimization
- Introduces a model for learning with shallow neural networks on clustered, correlated features.
- Demonstrates that sample complexity can be independent of input dimension in high SNR regimes.
- Proposes a layerwise gradient descent method that leverages correlations among input features.
- Empirical tests support theoretical claims using synthetic and real-world data.
Learning with Shallow Neural Networks on Cluster-Structured Features
Summary
This paper investigates the learning dynamics of shallow neural networks when applied to data with cluster-structured features, emphasizing the importance of correlations among input features. The authors propose a model where the target function is influenced by a limited number of latent Boolean variables, while the input features are grouped into clusters that correlate with these latent variables. The study reveals that under certain conditions, particularly a high signal-to-noise ratio (SNR), the sample complexity of learning can become independent of the input dimension, scaling instead with the number of hidden variables. This contrasts with traditional models that assume unstructured input distributions, where sample complexity typically depends on the ambient dimension. The authors empirically validate their theoretical findings using both synthetic and real datasets, demonstrating that gradient descent can effectively exploit the underlying correlations in the data without requiring explicit latent representations.
Methodology
The authors develop a tractable model that incorporates clustered, correlated features and a limited number of latent Boolean variables. They analyze the sample complexity of learning using a layerwise gradient descent approach on a two-layer fully connected neural network, under an identifiability assumption. The model is tested empirically on both synthetic and real datasets to validate the theoretical findings.
Results
The study finds that in high SNR conditions, the sample complexity of learning with shallow networks is governed by the number of latent variables rather than the input dimension, leading to a significant reduction in the required sample size for effective learning. Empirical results corroborate the theoretical predictions, showing that the proposed model can successfully learn from data with strong feature correlations.
Implications
The findings suggest that shallow neural networks can be effectively utilized in scenarios with high-dimensional, redundant data, such as genomics and image processing, where latent structures are present. This could lead to more efficient learning algorithms that require fewer samples, making them applicable in fields where data collection is expensive or limited.
Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence
Multimodal
Computer Vision
Efficient ML
- Mini-JEPAs achieve high accuracy in predicting environmental variables specific to their satellite sensors.
- The fleet of Mini-JEPAs demonstrates distinct embedding manifold geometries, reflecting the physics of their respective sensors.
- A routing agent effectively selects the appropriate Mini-JEPA for specialized hydrologic questions, enhancing retrieval performance.
- Mini-JEPAs provide a cost-effective alternative to large-scale foundation models for hydrologic intelligence applications.
Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence
Summary
This paper introduces a fleet of small, sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models, termed Mini-JEPAs, designed to enhance hydrologic intelligence. Unlike generalist models like Google AlphaEarth, which may compromise on specialized hydrologic signals, Mini-JEPAs are tailored to specific satellite sensors, allowing for more accurate environmental variable predictions. The study pretrains five Mini-JEPAs, each with 22 million parameters, utilizing a shared Vision Transformer backbone and a consistent training recipe. The models are trained on various satellite data sources, including Sentinel-1, Sentinel-2, and MODIS, achieving high cross-validated R2 scores for environmental variables such as elevation, temperature, and precipitation. The paper also discusses the distinct geometric structures of the embedding manifolds produced by each Mini-JEPA and their operationalization through a routing agent that selects the appropriate sensor for specific queries. The results indicate that the Mini-JEPAs can outperform a single generalist model on specialized tasks, suggesting a viable alternative for institutions with limited computational resources.
Methodology
The study employs a Joint Embedding Predictive Architecture (JEPA) framework to train five Mini-JEPAs, each specialized for different satellite sensor data. The models share a common Vision Transformer backbone and are trained on a curated dataset of satellite images. A routing agent is utilized to select the appropriate Mini-JEPA based on the specific environmental query.
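As a toy illustration of the routing step (the paper's agent is not described at this level of detail), the sketch below routes a hydrologic query to a sensor-specialized model by bag-of-words overlap with hand-written sensor profiles; the profiles and the three-sensor list are purely illustrative.

```python
SENSOR_PROFILES = {
    "sentinel-1": "radar backscatter soil moisture flooding surface water roughness",
    "sentinel-2": "optical reflectance vegetation land cover turbidity chlorophyll",
    "modis":      "thermal land surface temperature snow cover evapotranspiration",
}

def route(query, profiles=SENSOR_PROFILES):
    """Pick the sensor-specialised Mini-JEPA whose profile shares the most words with the query."""
    q = set(query.lower().split())
    overlap = {name: len(q & set(desc.split())) for name, desc in profiles.items()}
    return max(overlap, key=overlap.get)

print(route("Which basins show anomalous land surface temperature this week"))   # modis
print(route("Map flooding extent from radar backscatter after the storm"))       # sentinel-1
```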
Results
The Mini-JEPAs achieved high R2 scores (0.97 for elevation and temperature, 0.81 for precipitation) and demonstrated distinct geometric structures in their embedding manifolds. The routing agent successfully matched sensor modalities to queries with a perfect hit rate, and the dual retrieval system outperformed the AlphaEarth model on specialized questions, indicating the effectiveness of the Mini-JEPA fleet.
Implications
The findings suggest that locally-trained, sensor-specialized models can significantly enhance hydrologic intelligence systems, making them more accessible and efficient for research groups and institutions lacking extensive computational resources. This approach could lead to improved environmental monitoring and decision-making in hydrology and related fields.
SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies
Time Series
- SeesawNet addresses the challenge of balancing common and instance-specific dependencies in non-stationary time series forecasting.
- The architecture utilizes Adaptive Stationary–Nonstationary Attention (ASNA) for dynamic dependency modeling.
- Incorporates specialized layers for temporal and cross-channel dependency learning.
- Demonstrates superior performance compared to state-of-the-art forecasting methods on real-world datasets.
SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies
Summary
SeesawNet introduces a novel approach to non-stationary multivariate time series forecasting by addressing the limitations of instance normalization (IN), which often oversmooths instance-specific information critical for capturing temporal and cross-channel heterogeneity. The proposed architecture employs Adaptive Stationary–Nonstationary Attention (ASNA) to dynamically balance the modeling of common and instance-specific dependencies. ASNA utilizes two attention branches to extract common dependencies from normalized sequences and specific dependencies from raw sequences, fusing them through an instance-adaptive gating mechanism. Additionally, SeesawNet incorporates a Patch Dependency Learning Layer and a Channel Relationship Learning Layer to effectively model temporal and cross-channel dependencies. Extensive experiments on various real-world benchmarks demonstrate that SeesawNet consistently outperforms existing state-of-the-art methods, showcasing its effectiveness in handling non-stationary time series data.
Methodology
The methodology involves the development of SeesawNet, which integrates Adaptive Stationary–Nonstationary Attention (ASNA) to capture both common and instance-specific dependencies. The model alternates between dedicated temporal and channel relationship modeling, utilizing a gating mechanism for adaptive fusion of dependencies. The architecture also includes Patch Dependency Learning and Channel Relationship Learning layers to enhance the modeling of temporal and cross-channel interactions.
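A rough sketch of the dual-branch idea, with simplified placeholder components: one attention branch sees the instance-normalized sequence (common patterns), the other sees the raw sequence (instance-specific patterns), and a gate computed from per-instance statistics fuses the two.

```python
import torch
import torch.nn as nn

class ToyASNA(nn.Module):
    """Two attention branches (normalised vs. raw input) fused by an instance-level gate."""
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.attn_common = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_specific = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.std(dim=1, keepdim=True) + 1e-5
        x_norm = (x - mu) / sigma                  # instance-normalised view (common patterns)
        common, _ = self.attn_common(x_norm, x_norm, x_norm)
        specific, _ = self.attn_specific(x, x, x)  # raw view (instance-specific patterns)
        stats = torch.cat([mu, sigma], dim=-1)     # per-instance statistics drive the gate
        g = self.gate(stats)                       # (batch, 1, d_model)
        return g * common + (1 - g) * specific

out = ToyASNA()(torch.randn(8, 96, 32))
print(out.shape)  # torch.Size([8, 96, 32])
```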
Results
SeesawNet was evaluated on multiple real-world benchmarks, showing consistent improvements in forecasting accuracy over existing state-of-the-art methods. The results indicate that the model effectively captures both common patterns and instance-specific trends, leading to better generalization in non-stationary environments.
Implications
The findings suggest that SeesawNet can be applied to various domains requiring accurate time series forecasting, such as traffic management, power load prediction, and financial forecasting. Its ability to adaptively model dependencies can enhance predictive performance in scenarios characterized by non-stationarity.
Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions
NLP
Efficient ML
Interpretability
- Introduces a two-stage knowledge distillation framework for classifying student misconceptions.
- Addresses challenges of data scarcity, annotation noise, and model deployment paradox.
- Utilizes cognitive uncertainty to identify critical samples for improved model training.
- Achieves significant performance improvements over existing large models with a lightweight approach.
Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions
Summary
This paper addresses the critical challenge of accurately identifying student misconceptions in educational contexts, which is essential for personalized learning. The authors identify three main challenges: data scarcity with long-tail distributions, fuzzy boundaries between error categories leading to high annotation noise, and the deployment paradox where large models fail to capture unconventional reasoning due to pretraining biases. To overcome these challenges, the authors propose a two-stage knowledge distillation framework. The first stage involves standard distillation to transfer task capabilities, while the second stage employs a dual-layer marginal selection mechanism based on cognitive uncertainty to identify critical samples. This mechanism distinguishes between different types of errors and adapts the loss functions to balance contributions from hard and soft labels. The experimental results demonstrate that their approach, using only 10.30% of filtered samples, achieves a MAP@3 score of 0.9585 on the MAP-Charting dataset and 84.38% accuracy on middle school algebra misconception benchmarks, significantly outperforming state-of-the-art models. The proposed method not only enhances classification accuracy but also addresses practical deployment concerns in educational technology.
Methodology
The authors propose a two-stage knowledge distillation framework. The first stage involves standard distillation to transfer task capabilities from a teacher model to a student model. The second stage introduces a dual-layer marginal selection mechanism that leverages cognitive uncertainty to identify critical samples, including Near-miss and Hard-hard samples. A difficulty-adaptive loss function is employed to balance the contributions of hard and soft labels, enabling the student model to learn nuanced inter-class relationships and distinguish ambiguous error types.
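The difficulty-adaptive weighting can be sketched as a per-sample blend of hard-label cross-entropy and soft-label distillation, here weighted by the teacher's normalized entropy as a stand-in difficulty signal; the paper's actual selection mechanism and loss form differ.

```python
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student_logits, teacher_logits, labels, T=2.0):
    """Per-sample blend of hard CE and soft KL, weighted by teacher uncertainty."""
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="none").sum(dim=-1) * T * T

    # Teacher entropy, normalised to [0, 1], as a per-sample difficulty proxy.
    p = F.softmax(teacher_logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)
    w = entropy / torch.log(torch.tensor(float(teacher_logits.shape[-1])))

    # Ambiguous (high-entropy) samples lean on soft targets; clear ones on hard labels.
    return ((1 - w) * ce + w * kl).mean()

student = torch.randn(16, 10, requires_grad=True)
teacher = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
print(adaptive_distillation_loss(student, teacher, labels))
```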
Results
The proposed method achieves a MAP@3 score of 0.9585 on the MAP-Charting dataset, representing a 17.8% improvement. Additionally, it reaches 84.38% accuracy on cross-topic tests of middle school algebra misconceptions using a 4B parameter model, outperforming state-of-the-art large language models (67.73%) and standard fine-tuned 72B models (81.25%).
Implications
This research has significant implications for personalized education, as it provides a framework for accurately identifying and addressing student misconceptions. The lightweight model can be deployed on edge devices, making it suitable for real-time applications in educational technology while ensuring privacy protection.
Active Learners as Efficient PRP Rerankers
NLP
Large Language Models
Efficient ML
- Active learning can improve the efficiency of PRP reranking by adaptively selecting comparisons.
- A randomized-direction oracle reduces the cost of pairwise comparisons by halving the number of calls needed.
- Active rankers significantly outperform traditional sorting algorithms in terms of NDCG@10 under call constraints.
- The proposed methods maintain robustness against noise and position bias in LLM judgments.
Read more
Active Learners as Efficient PRP Rerankers
Summary
This paper addresses the challenges of Pairwise Ranking Prompting (PRP) in the context of reranking for Retrieval-Augmented Generation (RAG) systems. Traditional PRP methods rely on classical sorting algorithms that assume transitive comparisons, which do not align well with the noisy and sometimes intransitive nature of judgments from large language models (LLMs). The authors propose reframing PRP reranking as an active learning problem, where the goal is to adaptively select which pairwise comparisons to query in order to maximize the quality of the top-K results within a budget. They introduce a noise-robust framework that includes a randomized-direction oracle, which allows for a single LLM call per pair, effectively converting systematic position bias into zero-mean noise. The study evaluates the performance of active rankers against traditional PRP rerankers and demonstrates significant improvements in ranking quality while reducing the number of calls needed. The findings suggest that active learning strategies can effectively enhance the efficiency and effectiveness of LLM-based reranking systems.
Methodology
The authors frame PRP reranking as an active learning problem, utilizing a randomized-direction oracle to gather pairwise preferences from LLMs. They adaptively select which pairs to query based on their potential impact on the top-K ranking quality, focusing on uncertain comparisons. The performance of active rankers is compared to traditional sorting methods using metrics like NDCG@10 across various datasets.
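To make the randomized-direction oracle concrete, the sketch below issues one judge call per pair with the presentation order chosen at random, so any systematic preference for the first-shown passage averages out; `llm_prefers_first` is a hypothetical judge function, not an API from the paper.

```python
import random

def randomized_direction_oracle(query, doc_a, doc_b, llm_prefers_first):
    """Single-call pairwise preference with a randomized presentation order.

    `llm_prefers_first(query, first, second)` is a hypothetical judge that
    returns True if the first-presented passage is preferred. Randomizing
    which document appears first turns systematic position bias into
    zero-mean noise while using one LLM call per pair instead of two.
    """
    if random.random() < 0.5:
        return llm_prefers_first(query, doc_a, doc_b)       # doc_a shown first
    return not llm_prefers_first(query, doc_b, doc_a)       # doc_b shown first
```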
Results
The experiments show that the active ranking algorithm, particularly the Mohajer scheduler, outperforms the best sorting baseline by 9.7 NDCG@10 points at a fixed budget of 300 calls. Additionally, the randomized-direction oracle enhances the performance of both active and traditional rankers, with active rankers achieving comparable NDCG@10 scores to QuickSort while using up to 7 times fewer calls.
Implications
The findings suggest that integrating active learning into LLM-based reranking systems can lead to more efficient use of computational resources, reducing costs and improving the quality of results in applications such as information retrieval and recommendation systems.
Strategic PAC Learnability via Geometric Definability
Theory
- Strategic behavior can significantly impact the learnability of hypothesis classes.
- The authors provide a counterexample showing that learnability is not preserved under strategic behavior in simple cases.
- Introducing geometric definability allows for the preservation of learnability and manageable sample complexity.
- The framework accommodates a variety of cost functions and hypothesis classes commonly used in machine learning.
Read more
Strategic PAC Learnability via Geometric Definability
Summary
This paper investigates the concept of strategic classification, where individuals can alter their features at a cost to influence a classifier's decision. The authors explore how the sample complexity of the induced strategic hypothesis class is affected by the complexities of the underlying hypothesis class and the cost structure governing feature manipulations. They demonstrate that previous assumptions about learnability under strategic behavior are not universally valid, presenting a counterexample where a hypothesis class with VC dimension 1 leads to an induced class with infinite VC dimension. To address this, the authors introduce a geometric definability assumption, allowing both the hypothesis class and cost-induced neighborhood relations to be described using first-order formulas over the reals. This framework captures a wide range of natural classes and cost functions. The authors prove that under this geometric structure, learnability is preserved, and the sample complexity is manageable, depending on the complexity of the defining formulas. This work highlights the necessity of geometric structure in maintaining learnability in strategic settings.
Methodology
The authors utilize a theoretical approach, constructing counterexamples and proving results based on first-order definability over the reals. They analyze the implications of strategic behavior on sample complexity and learnability through geometric structures.
Results
The paper establishes that strategic behavior can lead to non-learnable hypothesis classes, even in simple scenarios. However, by imposing a geometric definability structure, the authors prove that learnability can be preserved, with sample complexity linked to the complexity of the defining formulas.
Implications
The results suggest that strategic classification systems need to account for geometric structures to ensure robust learnability. This has practical implications for designing classifiers in various fields where individuals can manipulate their features to influence outcomes.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
Reinforcement Learning
Robotics
Theory
- Introduction of R2R2, a regularization method that reduces representation-level instability in SPL.
- Theoretical analysis reveals the conflict between zero-centering and SPL's spectral properties.
- Integration of SPL into the SimbaV2 architecture, creating SimbaV2-SPL, which achieves state-of-the-art performance.
- R2R2 improves TD7 performance by approximately 22% at high UTD ratios.
Read more
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
Summary
The paper addresses the challenge of overfitting in reinforcement learning (RL) when reusing data intensively, particularly in data-scarce environments like robotics. The authors introduce a novel regularization method called Robust Representation via Redundancy Reduction (R2R2) that targets representation-level instability in Self-Predictive Learning (SPL). They identify a conflict between standard zero-centering techniques and the spectral properties of SPL, proposing a non-centered objective that preserves essential global dynamics information. R2R2 is validated on SPL-native algorithms such as TD7 and is shown to improve performance significantly, particularly at high Update-to-Data (UTD) ratios. Additionally, the authors enhance the state-of-the-art SimbaV2 architecture by integrating a tailored SPL module, resulting in SimbaV2-SPL, which sets a new benchmark in continuous control tasks. The experiments demonstrate that R2R2 effectively mitigates overfitting and enhances the performance of both TD7 and SimbaV2-SPL, confirming its compatibility with existing value-centric strategies.
Methodology
The authors propose R2R2 as a regularization method that employs redundancy reduction principles without zero-centering, thus preserving critical global dynamics information. They validate R2R2 on TD7 and extend SimbaV2 by integrating a tailored SPL module, termed SimbaV2-SPL, to evaluate performance improvements across various continuous control tasks.
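The exact R2R2 objective is not reproduced in this summary; as an illustration of the general idea, the sketch below applies a redundancy-reduction penalty to an uncentered cross-correlation matrix between predicted and target embeddings, skipping the mean subtraction that the paper argues conflicts with SPL's spectral properties. Names and the off-diagonal weight are assumptions.

```python
import torch

def redundancy_reduction_loss(z_pred, z_target, off_diag_weight=5e-3, eps=1e-8):
    """Redundancy-reduction penalty on an *uncentered* cross-correlation matrix.

    z_pred:   predicted next-state embeddings, shape (batch, dim)
    z_target: target embeddings from the self-predictive branch, shape (batch, dim)
    Omitting the usual zero-centering is the point illustrated here: the
    uncentered second moment retains the global component of the dynamics.
    """
    # Scale-normalize each feature dimension without subtracting its mean.
    z_pred = z_pred / (z_pred.norm(dim=0, keepdim=True) + eps)
    z_target = z_target / (z_target.norm(dim=0, keepdim=True) + eps)
    corr = z_pred.T @ z_target                               # (dim, dim), uncentered
    on_diag = (torch.diagonal(corr) - 1.0).pow(2).sum()      # align matching dims
    off_diag = (corr - torch.diag(torch.diagonal(corr))).pow(2).sum()  # decorrelate the rest
    return on_diag + off_diag_weight * off_diag
```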
Results
R2R2 significantly mitigates overfitting, improving TD7's performance by approximately 22% at a UTD ratio of 20. The enhanced architecture, SimbaV2-SPL, establishes a new state-of-the-art performance in continuous control benchmarks, with further gains observed when R2R2 is applied.
Implications
The findings suggest that R2R2 can enhance the efficiency and effectiveness of reinforcement learning algorithms in data-scarce environments, particularly in robotics. The integration of SPL into existing architectures may lead to further advancements in sample efficiency and performance in continuous control tasks.
MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse
Theory
Computer Vision
- Identification of a structural asymmetry in class-wise Mahalanobis distances between ID and OOD samples.
- Theoretical grounding of the observation in Neural Collapse geometry, linking variance to OOD detection.
- Introduction of MahaVar, an effective post-hoc OOD detector that incorporates class-wise distance variance.
- MahaVar achieves state-of-the-art performance on CIFAR-100 and ImageNet benchmarks.
Read more
MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse
Summary
This paper addresses the critical issue of out-of-distribution (OOD) detection in deep neural networks, which is essential for ensuring reliability in safety-critical applications. The authors present a novel empirical observation that class-wise Mahalanobis distances for in-distribution (ID) samples exhibit a sharp minimum structure, characterized by a small distance to the nearest class and large distances to other classes, leading to high variance. In contrast, OOD samples show a less pronounced minimum structure and lower variance. The authors provide a theoretical analysis based on Neural Collapse geometry, explaining why ID samples have higher class-wise distance variance. This observation motivates the development of MahaVar, a post-hoc OOD detector that enhances the Mahalanobis distance with a class-wise distance variance term. Extensive experiments following the OpenOOD v1.5 benchmark demonstrate that MahaVar outperforms existing Mahalanobis-based methods on CIFAR-100 and ImageNet, achieving state-of-the-art performance in both AUROC and FPR@95 metrics.
Methodology
The authors propose MahaVar, which augments the standard Mahalanobis distance with a class-wise distance variance term. This method leverages the geometric properties of the feature space, particularly under the Neural Collapse phenomenon, to distinguish between ID and OOD samples effectively.
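A minimal sketch of the scoring idea, assuming class means and a shared covariance estimated on ID training features: the score rewards a small nearest-class distance and a large spread of distances across classes. The combination rule and `var_weight` are illustrative assumptions, not necessarily the paper's exact detector.

```python
import numpy as np

def mahavar_score(feature, class_means, shared_cov_inv, var_weight=1.0):
    """Illustrative OOD score combining the minimum class-wise Mahalanobis
    distance with the variance of distances across classes
    (higher score = more ID-like).
    """
    diffs = class_means - feature                              # (num_classes, dim)
    # Squared Mahalanobis distance to each class mean.
    dists = np.einsum("cd,de,ce->c", diffs, shared_cov_inv, diffs)
    # ID samples: sharp minimum (small nearest distance) and high variance across classes.
    return -dists.min() + var_weight * dists.var()
```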
Results
MahaVar consistently outperformed all Mahalanobis-based baselines across three datasets (CIFAR-10, CIFAR-100, and ImageNet) in both AUROC and FPR@95 metrics, achieving state-of-the-art results on CIFAR-100 and ImageNet. Notably, it demonstrated the best average performance across various backbone architectures on ImageNet.
Implications
The findings suggest that incorporating class-wise distance variance can significantly enhance OOD detection methods, making them more reliable for deployment in safety-critical applications such as autonomous driving and medical imaging.
A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning
NLP
Large Language Models
Theory
- Introduces synthetic languages with hierarchical structures for precise analysis of context and reasoning in autoregressive generation.
- Derives explicit asymptotic predictions for distributional statistics in two broadcast process settings.
- Establishes a lower bound on context length for faithful sampling and demonstrates an exponential improvement using reasoning models.
- Empirical results validate theoretical predictions, showing the relationship between context size and model performance.
Read more
A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning
Summary
This paper introduces a family of synthetic languages characterized by a hierarchical structure generated through a broadcast process on trees. The authors analyze the impact of context length and reasoning in autoregressive generation using an exact k-gram ansatz, which serves as a substitute for traditional transformers. They derive asymptotic predictions for the distributional statistics of sequences produced by trained models in two settings: the Ising broadcast process and the coloring broadcast process. The findings reveal that the variance of generated sequences scales log-linearly with context depth, and the kurtosis approaches that of a Gaussian distribution, indicating deviations from the true language for sublinear contexts. Furthermore, the authors establish a lower bound on the context length required for accurate sampling of sequences, contrasting this with an autoregressive reasoning model that can sample exactly from the true language using significantly less memory. Empirical validation with transformers confirms the theoretical predictions across various context sizes, demonstrating the effectiveness of the proposed model in understanding the trade-offs between context length and reasoning in language modeling.
Methodology
The authors utilize a k-gram ansatz to analyze autoregressive processes, replacing traditional transformers. They derive theoretical results based on a broadcast process on trees, focusing on two instantiations: the Ising broadcast process and the coloring broadcast process. Empirical validation is conducted using transformers trained on the synthetic languages to confirm theoretical predictions.
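For intuition about the synthetic languages, the sketch below generates one sequence from an Ising-type broadcast process on a complete binary tree: a ±1 root symbol is copied down the tree, each child flipping independently with some probability, and the leaves read left to right form the sequence. Depth and flip probability are illustrative parameters, not values from the paper.

```python
import numpy as np

def ising_broadcast_sequence(depth, flip_prob, rng=None):
    """Generate one leaf sequence of length 2**depth from a +/-1 broadcast
    process on a complete binary tree (an illustrative construction)."""
    rng = np.random.default_rng() if rng is None else rng
    level = np.array([rng.choice([-1, 1])])          # root symbol
    for _ in range(depth):
        # Each node broadcasts to two children; each child flips independently.
        children = np.repeat(level, 2)
        flips = rng.random(children.size) < flip_prob
        level = np.where(flips, -children, children)
    return level                                     # leaves, left to right

# Example: a depth-10 tree yields 1024 hierarchically correlated symbols.
seq = ising_broadcast_sequence(depth=10, flip_prob=0.1)
```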
Results
The study finds that the variance of generated sequences scales log-linearly with context depth, while kurtosis converges to that of a Gaussian distribution. A lower bound of Ω(n) on context length is established for accurate sampling, while an autoregressive reasoning model with Θ(log n) memory can sample exactly from the true language, demonstrating significant efficiency gains.
Implications
The findings suggest that understanding the interplay between context length and reasoning can lead to more efficient language models, potentially reducing computational costs and improving performance in tasks requiring long-range dependencies.
The Rate-Distortion-Polysemanticity Tradeoff in SAEs
Theory
Interpretability
- Introduction of the Rate-Distortion-Polysemanticity tradeoff in Sparse Autoencoders.
- Theoretical and empirical evidence that enforcing monosemanticity increases rate and distortion.
- Polysemanticity is determined by the co-occurrence patterns of features in the training data.
- Development of necessary conditions for evaluating polysemanticity measures in real-world applications.
Read more
The Rate-Distortion-Polysemanticity Tradeoff in SAEs
Summary
This paper investigates the tradeoff between rate, distortion, and polysemanticity in Sparse Autoencoders (SAEs). While SAEs are effective in reconstructing inputs with minimal distortion and using few features, they often produce polysemantic representations, which complicates interpretability. The authors introduce the Rate-Distortion-Polysemanticity (RDP) tradeoff, demonstrating that enforcing monosemanticity—where each feature corresponds to a single concept—leads to increased rate and distortion. Through theoretical and empirical analysis, the paper shows that the degree of polysemanticity in optimal SAEs is influenced by the training data distribution, particularly the co-occurrence of features. The authors establish necessary conditions for polysemanticity measures in real-world scenarios and benchmark existing metrics on SAEs trained on Large Language Models. The findings suggest that polysemanticity is fundamentally a data issue that must be addressed in the design and optimization of SAEs.
Methodology
The authors employ a combination of theoretical modeling and empirical analysis to explore the RDP tradeoff. They construct a toy model to illustrate the relationship between feature co-occurrence and polysemanticity, and derive conditions for valid polysemanticity metrics. They also benchmark existing metrics on SAEs trained on Large Language Models to evaluate their effectiveness.
Results
The study confirms that restricting polysemanticity in SAEs leads to a tradeoff where rate and distortion increase. The toy model demonstrates that the optimal level of polysemanticity is predictable based on the training data's feature co-occurrence. The evaluation of existing polysemanticity metrics indicates that many widely-used measures do not adequately capture polysemanticity, with simpler metrics often outperforming more complex ones.
Implications
The findings highlight the importance of considering polysemanticity in the design of SAEs, suggesting that architectures and optimization strategies should account for the co-occurrence of concepts in training data to enhance interpretability. This work provides a framework for developing better metrics for evaluating interpretability in machine learning models.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
NLP
Large Language Models
Efficient ML
- Introduces Orthrus, a dual-architecture framework that combines autoregressive and diffusion models.
- Achieves up to 7.8× speedup in token generation while maintaining exact predictive fidelity.
- Utilizes a shared Key-Value cache to eliminate redundant memory usage.
- Incorporates a consensus mechanism for lossless inference.
Read more
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Summary
Orthrus introduces a novel dual-architecture framework that combines the high fidelity of autoregressive Large Language Models (LLMs) with the parallel token generation capabilities of diffusion models. Traditional autoregressive models face inefficiencies during the decoding phase due to their sequential nature, leading to high inference latency. On the other hand, diffusion models can generate tokens in parallel but often suffer from performance degradation and high training costs. Orthrus addresses these challenges by integrating a lightweight, trainable diffusion module alongside a frozen autoregressive model, allowing both components to share the same high-fidelity Key-Value (KV) cache. This design enables lossless inference while achieving significant speedups—up to 7.8 times faster—without incurring substantial memory overhead. The framework maintains the exact predictive distribution of the autoregressive model through a consensus mechanism that validates the outputs of the diffusion head against the autoregressive head, ensuring high-quality token generation. Orthrus is lightweight, requiring only a small fraction of the model parameters to be fine-tuned, making it a practical solution for enhancing the efficiency of existing LLMs.
Methodology
Orthrus integrates a frozen autoregressive model with a lightweight diffusion module. During the pre-filling stage, the autoregressive model constructs a high-fidelity Key-Value cache, which is then utilized by the diffusion head for parallel token generation. The framework employs a two-head consensus mechanism to ensure that the generated tokens match the autoregressive model's predictive distribution, achieving lossless inference.
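The consensus idea can be pictured as a draft-and-verify loop; the schematic below accepts the longest drafted prefix that the autoregressive head would also have produced under greedy decoding. `draft_parallel` and `ar_next_token` are hypothetical callables standing in for the two heads, and this is a simplified illustration rather than the paper's exact mechanism.

```python
def consensus_decode_step(prefix, draft_parallel, ar_next_token, block_size=8):
    """Schematic consensus step: the diffusion head drafts `block_size` tokens
    in parallel, and the autoregressive head verifies them, accepting the
    longest prefix on which both heads agree (exact under greedy decoding).
    """
    drafted = draft_parallel(prefix, block_size)     # k tokens proposed at once
    accepted = []
    for token in drafted:
        ar_token = ar_next_token(prefix + accepted)  # reuses the shared KV cache
        if token != ar_token:
            accepted.append(ar_token)                # fall back to the AR prediction
            break
        accepted.append(token)
    return prefix + accepted
```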
Results
The Orthrus framework demonstrates significant improvements in inference speed, achieving up to 7.8 times faster token generation compared to traditional autoregressive methods. It also maintains the exact predictive distribution of the base autoregressive model, ensuring high-quality outputs.
Implications
Orthrus has the potential to enhance the efficiency of existing large language models, making them more suitable for real-time applications where speed is critical. Its lightweight design allows for easy integration into current systems, promoting broader adoption of efficient token generation techniques in natural language processing tasks.
Support Before Frequency in Discrete Diffusion
NLP
Large Language Models
Generative Models
- DLMs first learn the structure of valid sequences (support) before refining the probabilities of these sequences (frequency).
- The reverse edit probabilities can be decomposed into support and frequency components, influenced by the corruption mechanism.
- Uniform diffusion shows a trichotomy of edits, while absorbing diffusion focuses on validity-improving moves.
- Experiments demonstrate that support localization emerges before frequency ranking in DLMs.
Read more
Support Before Frequency in Discrete Diffusion
Summary
This paper investigates the learning dynamics of discrete diffusion models (DLMs) in language modeling, focusing on how these models prioritize learning data support before data frequencies. The authors introduce the 'Support-before-Frequency Hypothesis,' positing that DLMs first identify where valid sequences exist (support) and subsequently refine the probabilities of these sequences (frequency). Through a small-noise analysis of the reverse denoising process, they demonstrate that the reverse edit probabilities can be decomposed into two components: a leading scale indicating movement towards data support and a finer coefficient representing frequency information. The study reveals that uniform diffusion processes exhibit a trichotomy in edits, while absorbing diffusion primarily focuses on validity-improving moves. Experimental results confirm that support localization occurs earlier than frequency ranking in DLMs, supporting the hypothesis that discrete diffusion models learn data support before data frequencies.
Methodology
The authors conducted a theoretical analysis of the reverse denoising process in discrete diffusion models, focusing on the small-noise regime. They derived key theorems regarding the structure of reverse edit probabilities and designed experiments to test their predictions using both synthetic and real data, particularly with a masked language diffusion model.
Results
The results indicated that support localization in DLMs occurs significantly earlier than frequency ranking, validating the Support-before-Frequency Hypothesis. Additionally, the experiments showed a clear distinction between the effects of uniform and absorbing diffusion mechanisms on the learning process.
Implications
These findings suggest that improving the design of DLMs could involve focusing on enhancing support identification mechanisms, which may lead to more efficient training and better performance in language modeling tasks. Understanding the hierarchy of learning in DLMs could also inform future research on model architectures and training strategies.
Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment
Time Series
- Development of a unified AI model for FHR monitoring that addresses traditional method limitations.
- High sensitivity and specificity in detecting critical FHR changes, improving clinical decision-making.
- Utilization of a large dataset for training and validation, enhancing model robustness.
- Introduction of the IOL approach for more accurate categorical analysis of FHR data.
Read more
Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment
Summary
This paper presents a novel AI-based model for fetal heart rate (FHR) monitoring, addressing the limitations of traditional methods that often suffer from noise interference and subjective assessments. The proposed FHrCTG model is designed to effectively reconstruct FHR signals and analyze their variability. It was pre-trained on a substantial dataset of 558,412 unlabeled data points and fine-tuned with 7,266 expert-reviewed entries. The model employs the Intersection Overlapping Labels (IOL) approach to convert FHR analysis into categorical judgments, enhancing the accuracy of detecting critical FHR decelerations and accelerations. Testing results indicate that the model achieves high sensitivity (89.13% for decelerations) and specificity (92.04% for accelerations), alongside impressive AUC scores for periodicity and amplitude variation verification. This unified model not only improves the precision of FHR monitoring but also has the potential to significantly enhance clinical outcomes by providing timely and accurate assessments of fetal health.
Methodology
The FHrCTG model was developed using a two-step training process: pre-training on a large dataset of unlabeled FHR data followed by fine-tuning with expert-reviewed entries. The model incorporates the IOL approach to transform continuous FHR data into categorical assessments, facilitating improved detection of FHR decelerations and accelerations.
Results
The model demonstrated a sensitivity of 89.13% and specificity of 87.78% for detecting critical FHR decelerations, and a sensitivity of 62.5% and specificity of 92.04% for accelerations. AUC scores of 0.7214 and 0.9643 were achieved for verifying FHR periodicity and amplitude variation, respectively.
Implications
The FHrCTG model has significant implications for clinical practice in obstetrics, potentially leading to better fetal monitoring and timely interventions. By improving the accuracy of FHR assessments, it may reduce the risk of adverse outcomes for fetuses and enhance overall maternal-fetal health care.
bde: A Python Package for Bayesian Deep Ensembles via MILE
Optimization
Theory
Efficient ML
- bde provides a user-friendly implementation of Bayesian Deep Ensembles using MILE.
- The package integrates seamlessly with scikit-learn, enhancing accessibility for practitioners.
- It offers robust uncertainty quantification metrics, crucial for modern machine learning applications.
- Benchmarks show bde's competitive performance in predictive accuracy and uncertainty estimation.
Read more
bde: A Python Package for Bayesian Deep Ensembles via MILE
Summary
The paper introduces 'bde', a Python package designed to facilitate Bayesian Deep Learning (BDL) through a user-friendly interface, leveraging the performance of JAX and blackjax. Targeting tabular supervised learning tasks, bde implements Bayesian Deep Ensembles (BDEs) using a two-stage inference process based on Microcanonical Langevin Ensembles (MILE). The first stage optimizes independent instances of a configurable feed-forward neural network, while the second phase employs Microcanonical Langevin Monte Carlo for sampling. This approach allows for the generation of an ensemble of models that approximates the posterior distribution. The package is designed for efficient parallelization across various hardware setups and provides tools for uncertainty quantification (UQ), enabling users to obtain credible intervals and other metrics without deep knowledge of MCMC methods. The authors benchmark bde against other models, demonstrating its competitive performance in predictive accuracy and UQ metrics, particularly in distributional regression tasks. The package aims to bridge the gap between high-performance MCMC research and practical applications in data science, ensuring ease of use and reproducibility.
Methodology
The bde package employs a two-stage inference process for Bayesian Deep Ensembles. Initially, it optimizes multiple independent neural network instances using regularized empirical risk minimization. Subsequently, it transitions to a sampling phase utilizing Microcanonical Langevin Monte Carlo, enhanced for Bayesian Neural Networks, to explore local posterior structures and generate an ensemble of models.
Results
Benchmarks on the airfoil and bikesharing datasets indicate that bde achieves competitive performance, with RMSE values of 0.1215 and 0.2261 respectively, alongside strong uncertainty quantification metrics. The results demonstrate that bde outperforms traditional models and provides calibrated credible intervals, confirming its efficacy in Bayesian inference tasks.
Implications
The bde package has significant implications for researchers and practitioners in machine learning, particularly in fields requiring reliable uncertainty quantification. Its ease of use and integration with existing workflows can enhance the adoption of Bayesian methods in practical applications, promoting better decision-making based on model uncertainty.
Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines
Large Language Models
Efficient ML
- Introduces a comprehensive energy accounting framework for distillation pipelines.
- Highlights the significant teacher-side energy costs often ignored in previous studies.
- Provides empirical measurements and comparisons of energy use across different distillation methods.
- Establishes design rules for selecting distillation methods based on energy and budget constraints.
Read more
Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines
Summary
This paper addresses the growing concerns regarding the energy consumption and environmental impact of large language models (LLMs) by proposing a comprehensive energy accounting framework for distillation pipelines. The authors highlight that while distillation is often viewed as a method to create more efficient models, previous studies have overlooked the full energy costs associated with teacher-side workloads such as data generation and evaluation. The proposed framework allows for detailed stage-wise tracking of GPU power consumption, enabling a more accurate assessment of energy use across different distillation methods. The authors conduct experiments comparing classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy-quality Pareto frontiers to reveal the hidden costs of these methods. They derive practical design rules for selecting distillation methods under energy constraints and release an open-source measurement harness to standardize energy accounting in distillation research. The findings emphasize the importance of considering the entire distillation pipeline when evaluating efficiency, ultimately guiding researchers and practitioners in making informed decisions about resource allocation in AI development.
Methodology
The authors developed an energy accounting framework that delineates the stages of distillation pipelines and measures energy consumption at each stage using NVML-based GPU telemetry. They benchmarked various distillation methods under controlled conditions, logging empirical energy use and constructing energy-quality Pareto frontiers to analyze trade-offs.
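Stage-wise measurements of this kind can be approximated with NVML telemetry; the sketch below samples GPU power draw while a pipeline stage runs and integrates it into joules using the real `pynvml` bindings. The sampling interval and integration scheme are illustrative, not the paper's released harness.

```python
import time
import threading
import pynvml

def measure_stage_energy(stage_fn, device_index=0, interval_s=0.1):
    """Run `stage_fn()` while sampling GPU power via NVML; return (result, joules).

    Energy is approximated as mean sampled power times elapsed wall-clock time,
    which is a rough estimate rather than a calibrated measurement.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    start = time.time()
    result = stage_fn()                      # e.g. teacher data generation or student training
    elapsed = time.time() - start
    stop.set()
    thread.join()
    pynvml.nvmlShutdown()
    joules = (sum(samples) / len(samples)) * elapsed if samples else 0.0
    return result, joules
```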
Results
The study found that teacher-side costs can rival or exceed the energy used for student training, challenging the assumption that smaller models are inherently more efficient. The constructed Pareto frontiers revealed specific conditions under which distillation methods are more energy-efficient compared to alternatives, providing insights into optimal hyperparameter settings.
Implications
The findings of this research have significant implications for the development of more sustainable AI systems. By providing a framework for energy accounting, the study encourages researchers to consider the environmental impact of their models and make informed decisions regarding resource allocation in AI development.
Architecture-Aware Explanation Auditing for Industrial Visual Inspection
Computer Vision
Interpretability
- Introduces an architecture-aware explanation audit protocol based on the native-readout hypothesis.
- Demonstrates that explanation methods' faithfulness is influenced by their structural alignment with the model's decision mechanism.
- Finds that ViT-Tiny + Attention Rollout, despite lower accuracy, provides more faithful explanations than other models.
- Highlights the importance of co-designing explanation pathways with model architectures.
Read more
Architecture-Aware Explanation Auditing for Industrial Visual Inspection
Summary
This paper addresses the limitations of current explanation methods used in industrial visual inspection systems, particularly focusing on the discrepancies between visually plausible heatmap explanations and the actual decision-driving image regions. The authors introduce an architecture-aware explanation audit protocol based on the native-readout hypothesis, which posits that the faithfulness of explanation methods is constrained by their structural alignment with the model's decision-making mechanisms. The study utilizes the WM-811K wafer map dataset, comprising 172,950 labeled images across nine classes, to evaluate various models including ViT-Tiny with Attention Rollout, Swin-Tiny, ResNet18+CBAM, and DenseNet121 with Grad-CAM. The results reveal that despite lower classification accuracy, ViT-Tiny + Attention Rollout achieves a Deletion AUC of 0.211, significantly lower than the 0.432–0.525 range of the other models. The findings indicate that the architecture's readout structure, rather than its family type, is crucial for explanation effectiveness. Additionally, a model-agnostic method (RISE) demonstrated a consistent compression of faithfulness across model families, suggesting that the choice of explainer is pivotal. The paper concludes with actionable guidelines for co-designing explanation pathways with model architectures and emphasizes the need for quantitative metrics to accompany deployed heatmaps.
Methodology
The authors operationalize the native-readout hypothesis through a series of experiments on the WM-811K dataset, comparing various deep learning models and their associated explanation methods. They employ a three-seed zero-fill perturbation protocol to assess the faithfulness of explanations, measuring Deletion AUC and Insertion AUC as key metrics. The study also includes a sensitivity analysis and a boundary-condition study on pretrained models from the MVTec AD dataset.
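The deletion metric itself is compact: pixels are zero-filled in order of decreasing attribution, the class probability is re-recorded after each step, and the area under the resulting curve is reported (lower is better). The sketch below assumes a single-channel image and a `model` returning softmax probabilities; both are illustrative stand-ins, not the paper's exact protocol.

```python
import numpy as np

def deletion_auc(model, image, heatmap, target_class, steps=50):
    """Zero-fill pixels in order of decreasing attribution and integrate the
    class-probability curve (lower AUC = more faithful explanation).

    Assumes `image` and `heatmap` have shape (H, W) and `model(batch)` returns
    probabilities of shape (batch, num_classes).
    """
    order = np.argsort(heatmap.ravel())[::-1]       # most-attributed pixels first
    per_step = max(1, order.size // steps)
    perturbed = image.copy().ravel()
    probs = [model(perturbed.reshape(1, *image.shape))[0, target_class]]
    for i in range(steps):
        perturbed[order[i * per_step:(i + 1) * per_step]] = 0.0   # zero-fill next block
        probs.append(model(perturbed.reshape(1, *image.shape))[0, target_class])
    # Normalize the deleted fraction to [0, 1] and integrate with the trapezoid rule.
    xs = np.linspace(0.0, 1.0, len(probs))
    return np.trapz(probs, xs)
```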
Results
The study reveals that ViT-Tiny + Attention Rollout achieves a Deletion AUC of 0.211 (lower indicates a more faithful explanation), while Swin-Tiny, ResNet18+CBAM, and DenseNet121 with Grad-CAM achieve AUCs between 0.432 and 0.525. RISE, a model-agnostic explainer, compresses the faithfulness differences across model families to approximately 0.1, indicating that the architecture's representations significantly influence explanation effectiveness. The findings also suggest that faithfulness rankings depend on the specific model, explainer, and perturbation operator used.
Implications
The findings have significant implications for the design and deployment of explainable AI systems in industrial settings, particularly in enhancing trust and reliability in AI-driven decision-making processes. The study advocates for the integration of explanation pathways with model architectures to improve the interpretability of AI systems, which is crucial for human validation in critical applications such as quality inspection.
Peng's Q(λ) for Conservative Value Estimation in Offline Reinforcement Learning
Reinforcement Learning
Theory
Optimization
- CPQL is the first model-free offline reinforcement learning algorithm to apply a multi-step operator to conservative value estimation.
- The algorithm effectively mitigates over-pessimistic value estimation without requiring additional models or networks.
- Theoretical analyses guarantee that CPQL's learned policy performs at least as well as the behavior policy.
- Extensive experiments show CPQL consistently outperforms existing offline single-step algorithms.
Read more
Peng's Q(λ) for Conservative Value Estimation in Offline Reinforcement Learning
Summary
This paper introduces Conservative Peng’s Q(λ) (CPQL), a novel model-free offline multi-step reinforcement learning algorithm aimed at improving value estimation in offline settings. CPQL adapts the Peng’s Q(λ) operator for conservative value estimation, diverging from traditional Bellman operators. The authors assert that this is the first instance of a multi-step operator being utilized for conservative value estimation in offline reinforcement learning, demonstrating both theoretical and empirical effectiveness. The fixed point of the PQL operator is shown to be closer to the value function of the behavior policy, which helps in mitigating over-pessimistic value estimations and achieving performance that is at least equal to that of the behavior policy. The paper presents rigorous theoretical analyses confirming that CPQL reduces the sub-optimality gap compared to existing methods. Extensive experiments on the D4RL benchmark reveal that CPQL significantly outperforms existing offline single-step baselines, while also facilitating a smoother transition from offline to online learning by pre-training Q-functions that enhance online performance.
Methodology
The authors propose CPQL by adapting the Peng’s Q(λ) operator to leverage entire offline trajectories for value estimation, avoiding the need for importance sampling and additional model estimations. The algorithm is theoretically analyzed to ensure it achieves better performance than the behavior policy and reduces over-pessimism in value estimates.
Results
Numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step reinforcement learning algorithms, achieving near-optimal performance guarantees and addressing the limitations of previous conservative approaches.
Implications
The findings suggest that CPQL could be applied in various offline reinforcement learning scenarios, particularly where conservative value estimation is crucial. Additionally, the method's ability to facilitate smoother transitions from offline to online learning could enhance practical applications in real-world environments.
Uncertainty-Aware Prediction of Lung Tumor Growth from Sparse Longitudinal CT Data via Bayesian Physics-Informed Neural Networks
Time Series
- Introduces a Bayesian physics-informed framework for tumor growth prediction under sparse CT data.
- Combines mechanistic Gompertz constraints with probabilistic inference for improved prediction accuracy.
- Utilizes a two-stage inference procedure for stable posterior inference and efficient sampling.
- Demonstrates the model's capability to provide calibrated uncertainty estimates alongside predictions.
Read more
Uncertainty-Aware Prediction of Lung Tumor Growth from Sparse Longitudinal CT Data via Bayesian Physics-Informed Neural Networks
Summary
This paper addresses the challenge of predicting lung tumor growth using sparse and irregular longitudinal CT data, which often suffers from measurement variability. The authors propose a novel Bayesian physics-informed neural network (PINN) framework that integrates Gompertz growth dynamics with low-dimensional Bayesian inference in the log-volume domain. The framework employs a two-stage inference strategy that combines maximum a posteriori (MAP) estimation and Hamiltonian Monte Carlo (HMC) sampling to estimate posterior predictive distributions and quantify uncertainty. The model was evaluated using longitudinal data from the National Lung Screening Trial involving 30 patients. The results demonstrate that the proposed method effectively captures heterogeneous tumor growth patterns while providing calibrated uncertainty estimates, which are crucial for clinical decision-making. The model achieved a cohort-level log-space RMSE of approximately 0.20 and maintained well-calibrated 95% credible interval coverage across the patient cohort. These findings indicate that the Bayesian physics-informed modeling approach is promising for uncertainty-aware tumor growth assessment, particularly in scenarios with limited longitudinal follow-up scans.
Methodology
The authors developed a Bayesian physics-informed neural network that integrates Gompertz growth dynamics with Bayesian inference. The methodology includes a two-stage inference strategy involving MAP initialization followed by HMC sampling to estimate posterior distributions and quantify uncertainty in tumor growth predictions.
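The Gompertz constraint has a closed form in the log-volume domain, which is what keeps the Bayesian treatment low-dimensional; the sketch below writes that solution and a Gaussian log-likelihood over observed log-volumes. Parameter names and the likelihood form are generic illustrations, not the paper's exact prior or network parameterization.

```python
import numpy as np

def gompertz_log_volume(t, log_v0, log_K, alpha, t0=0.0):
    """Closed-form Gompertz growth in the log-volume domain:
    log V(t) = log K + (log V0 - log K) * exp(-alpha * (t - t0))."""
    return log_K + (log_v0 - log_K) * np.exp(-alpha * (t - t0))

def log_likelihood(params, t_obs, log_v_obs):
    """Gaussian log-likelihood of observed log-volumes under the Gompertz curve.
    params = (log_v0, log_K, alpha, sigma); an illustrative parameterization."""
    log_v0, log_K, alpha, sigma = params
    resid = log_v_obs - gompertz_log_volume(t_obs, log_v0, log_K, alpha)
    return -0.5 * np.sum(resid**2 / sigma**2 + np.log(2 * np.pi * sigma**2))
```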
Results
The proposed framework successfully captured heterogeneous tumor growth patterns and achieved a cohort-level log-space RMSE of approximately 0.20. It also provided well-calibrated 95% credible interval coverage across the 30 patients, indicating reliable uncertainty quantification.
Implications
The findings suggest that Bayesian physics-informed modeling can significantly enhance uncertainty-aware tumor growth assessment, which is critical for clinical decision-making, especially when only limited longitudinal follow-up scans are available.
WriteSAE: Sparse Autoencoders for Recurrent State
NLP
Large Language Models
Theory
- WriteSAE is the first sparse autoencoder that effectively addresses matrix cache write operations in recurrent language models.
- The method allows for closed-form predictions of logit shifts, achieving high accuracy (R² = 0.98).
- Substitution of learned rank-1 atoms consistently outperforms traditional matched-norm ablation tests.
- WriteSAE demonstrates significant improvements in performance metrics, including a 3× lift in midrank target-in-continuation tasks.
Read more
WriteSAE: Sparse Autoencoders for Recurrent State
Summary
This paper introduces WriteSAE, a novel sparse autoencoder designed to enhance the performance of state-space and hybrid recurrent language models by addressing the limitations of existing sparse autoencoders (SAEs) that primarily read residual streams. WriteSAE innovatively decomposes and edits the matrix cache write operations of recurrent models like Gated DeltaNet, Mamba-2, and RWKV-7, which utilize rank-1 updates that cannot be replaced by vector atoms. The proposed method factors each decoder atom into a native write shape, allowing for a closed-form expression for per-token logit shifts and training under a matched Frobenius norm. The results demonstrate that atom substitution significantly outperforms matched-norm ablation in 92.4% of cases across 4,851 firings, achieving an R² of 0.98 for the closed-form predictions. Furthermore, WriteSAE shows sustained improvements in target-in-continuation tasks, marking a significant advancement in the behavioral installation at the matrix-recurrent write site.
Methodology
The methodology involves the development of a sparse autoencoder that factors decoder atoms into rank-1 outer products, allowing for targeted cache-slot substitutions. The model is trained using a matched Frobenius norm, and a three-factor closed form is derived to predict logit shifts based on observable quantities from forward passes. The performance of WriteSAE is evaluated through various tests, including cache-slot substitution and matched substitution tests across different architectures.
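The structural change relative to a vector-atom SAE is that each decoder atom is a rank-1 matrix, matching the native shape of a matrix-cache write; the sketch below shows only that factorization, with dimension names and initialization chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class Rank1AtomDecoder(nn.Module):
    """Decode a sparse code into a matrix-shaped cache update as a sum of
    rank-1 atoms u_i v_i^T (an illustrative stand-in for a write-site decoder)."""

    def __init__(self, num_atoms, key_dim, value_dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(num_atoms, key_dim) * 0.02)
        self.v = nn.Parameter(torch.randn(num_atoms, value_dim) * 0.02)

    def forward(self, codes):
        # codes: (batch, num_atoms) sparse activations.
        # Returns: (batch, key_dim, value_dim) reconstructed write update.
        atoms = torch.einsum("ak,av->akv", self.u, self.v)   # rank-1 outer products
        return torch.einsum("ba,akv->bkv", codes, atoms)
```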
Results
The results indicate that WriteSAE's atom substitution outperforms matched-norm ablation in 92.4% of the tested cases. The closed-form predictions for logit shifts achieved a median R² of 0.98 across 200 atom-by-ε cells. Additionally, the method demonstrated a significant increase in performance for target-in-continuation tasks, achieving a 100% success rate under greedy decoding for midrank targets.
Implications
The findings suggest that WriteSAE can enhance the performance of recurrent language models by improving how they manage matrix cache writes. This could lead to more efficient training and better performance in various natural language processing tasks, particularly in applications requiring real-time updates and adaptations.
Tight Sample Complexity Bounds for Entropic Best Policy Identification
Reinforcement Learning
Theory
- Introduces a new lower bound for best policy identification in risk-sensitive reinforcement learning.
- Develops the Entropic-BPI algorithm that achieves optimal sample complexity.
- Improves concentration bounds for exponential utilities, enhancing exploration strategies.
- Demonstrates that the maximal achievable reward Gmax is a better metric for sample complexity than the horizon H.
Read more
Tight Sample Complexity Bounds for Entropic Best Policy Identification
Summary
This paper addresses the problem of best-policy identification in finite-horizon risk-sensitive reinforcement learning, specifically under the entropic risk measure. Previous research highlighted a significant gap between the lower and upper bounds on the sample complexity required to identify an approximately optimal policy, with lower bounds scaling as Ω(e^{|β|H}) and upper bounds achieving O(e^{2|β|H}). The authors identify that the extra exponential factor in the upper bound arises from loose concentration control for exponential utilities. To bridge this gap, they propose a forward-model based algorithm that incorporates KL-based exploration bonuses tailored to the entropic criterion. The authors introduce two main innovations: sharper concentration bounds derived from the smoothness properties of exponential utility and a new stopping rule that optimally exploits this tightness. The paper presents a new lower bound for the best policy identification problem, expressed in terms of the maximal achievable reward Gmax, which is argued to be more suitable for the entropic risk measure. The proposed algorithm, Entropic-BPI, achieves optimal sample complexity, matching the lower bound with only logarithmic factors and an exponential dependence on Gmax, thus eliminating the additional exponential factor seen in prior work.
Methodology
The authors utilize a forward-model based approach, adapting KL-based exploration bonuses to the entropic risk measure. They derive sharper concentration bounds and propose a new stopping rule to optimize sample complexity in identifying the best policy.
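For reference, the entropic risk of a return G is U_β(G) = (1/β) log E[exp(β G)]; the sketch below estimates it from sampled returns with a log-sum-exp for numerical stability. This is a generic estimator for the risk measure, not the paper's algorithm.

```python
import numpy as np
from scipy.special import logsumexp

def entropic_risk(returns, beta):
    """Monte Carlo estimate of (1/beta) * log E[exp(beta * G)] from sampled
    returns, computed with log-sum-exp for numerical stability."""
    returns = np.asarray(returns, dtype=float)
    return (logsumexp(beta * returns) - np.log(returns.size)) / beta

# Under this convention, beta < 0 gives risk-averse evaluation, beta > 0
# risk-seeking, and the measure approaches the expected return as beta -> 0.
```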
Results
The paper successfully closes the gap between lower and upper bounds on sample complexity for entropic best policy identification, demonstrating that the proposed algorithm achieves optimal sample complexity with improved bounds based on Gmax.
Implications
The findings have significant implications for risk-sensitive decision-making in various fields, including finance and robotics, where understanding and managing downside risk is crucial. The proposed methods can enhance the efficiency of reinforcement learning algorithms in uncertain environments.
Multi-Quantile Regression for Extreme Precipitation Downscaling
Time Series
Generative Models
Theory
- Q-SRDRN significantly improves detection rates of extreme precipitation events compared to traditional methods.
- The use of pinball loss allows for better handling of heavy-tail distributions in precipitation data.
- Data augmentation through cVAE is beneficial when aligned with the model architecture and regional characteristics.
- The architecture shows strong performance across diverse climatic conditions, indicating its robustness.
Read more
Multi-Quantile Regression for Extreme Precipitation Downscaling
Summary
This paper addresses the limitations of deep super-resolution networks in predicting extreme precipitation events, which are critical for flood risk assessment. The authors introduce Q-SRDRN, a multi-quantile super-resolution network that utilizes pinball loss to improve the detection of heavy-tail precipitation events. They identify that traditional data augmentation methods fail due to the averaging effect of intensity-weighted MAE loss, which dilutes the predictive power for extreme events. The proposed architecture includes IncrementBound to enforce monotonicity and separate output heads for different quantiles, allowing for better specialization of convolutional filters. The methodology is validated across three distinct U.S. climates: Florida, California, and a Texas substate, demonstrating significant improvements in detection rates for extreme precipitation events compared to deterministic baselines. The findings suggest that multi-quantile regression can effectively capture extreme events, and when paired with appropriate data augmentation, it can enhance model performance without introducing bias.
Methodology
The authors developed Q-SRDRN, a multi-quantile super-resolution network that employs pinball loss for training. The architecture includes IncrementBound for monotonicity and separate convolutional heads for different quantiles, allowing for specialized learning. The model was validated using precipitation data from Florida, California, and Texas, with a focus on extreme events.
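The pinball (quantile) loss that drives the multi-quantile heads has a simple closed form; the sketch below computes it jointly over several quantile levels. The specific levels shown are examples, not the paper's configuration.

```python
import torch

def pinball_loss(pred, target, quantiles=(0.5, 0.9, 0.99)):
    """Multi-quantile pinball (quantile) loss.

    pred:      (batch, num_quantiles) predicted precipitation per quantile
    target:    (batch,) observed precipitation
    quantiles: quantile levels in (0, 1), one per prediction head
    """
    q = torch.as_tensor(quantiles, dtype=pred.dtype, device=pred.device)
    diff = target.unsqueeze(-1) - pred                   # (batch, num_quantiles)
    loss = torch.maximum(q * diff, (q - 1.0) * diff)     # asymmetric penalty per quantile
    return loss.mean()
```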
Results
The Q-SRDRN model achieved an 18-fold increase in detection rates for extreme precipitation events in Florida, detecting 1,598 out of 2,111 events at 200 mm/day compared to only 88 by the deterministic baseline. In California, the model reached near-perfect detection rates for extreme events, while in Texas, it detected 8,776 out of 10,720 events at the same threshold. The median channel's performance improved significantly with the introduction of cVAE-generated samples.
Implications
The findings suggest that multi-quantile regression can enhance predictive modeling for extreme weather events, which is crucial for flood risk management and infrastructure planning. The methodology can be applied to other regions and types of extreme weather, potentially improving climate resilience strategies.
WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
Robotics
Generative Models
Reinforcement Learning
- WarmPrior replaces the standard Gaussian source with a temporally grounded prior, improving robotic manipulation success rates.
- The method includes two variants: WP-Past and WP-Preview, which leverage recent action history for better performance.
- WarmPrior enhances sample efficiency and final performance in prior-space reinforcement learning.
- Empirical results show significant improvements over traditional methods, especially in complex tasks.
Read more
WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
Summary
The paper introduces WarmPrior, a novel approach to enhance generative policies for visuomotor robotic control by replacing the conventional isotropic Gaussian source distribution with a temporally grounded prior based on recent action history. This method aims to improve the success rates of robotic manipulation tasks by creating straighter probability paths, which are akin to optimal-transport couplings in Rectified Flow. WarmPrior is implemented in two variants: WP-Past, which uses the last executed action as the prior mean, and WP-Preview, which predicts future actions based on the model's previous forecasts. The authors demonstrate that this minimal intervention leads to significant improvements in sample efficiency and performance in prior-space reinforcement learning. Empirical results show that WarmPrior consistently outperforms the standard Gaussian baseline across various robotic tasks, particularly under low inference budgets and challenging scenarios, thereby highlighting the importance of the source distribution in generative robot control.
Methodology
The authors propose WarmPrior, which modifies the source distribution in flow-matching policies by using a temporally grounded prior based on recent actions. Two variants are introduced: WP-Past, which anchors the prior on the last action, and WP-Preview, which predicts future actions. The approach maintains the existing network architecture while adjusting the source distribution to improve the efficiency of the transport process.
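The intervention is confined to where the flow's source samples come from; the sketch below contrasts the standard Gaussian source with a WP-Past-style prior centered on the last executed action. The Gaussian form of the temporal prior and the `sigma` value are illustrative assumptions.

```python
import torch

def sample_source(batch_size, action_dim, last_action=None, sigma=0.1):
    """Draw the flow-matching source sample: a standard Gaussian baseline versus
    a prior anchored on the last executed action (WP-Past-style).

    last_action: tensor of shape (action_dim,) or None for the baseline source.
    """
    if last_action is None:
        return torch.randn(batch_size, action_dim)            # standard isotropic source
    noise = sigma * torch.randn(batch_size, action_dim)
    return last_action.unsqueeze(0) + noise                   # temporally grounded source
```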
Results
WarmPrior consistently outperformed the standard isotropic Gaussian source distribution in various robotic manipulation tasks, demonstrating higher success rates, particularly in scenarios with limited inference budgets and more complex tasks. The improvements were attributed to shorter and straighter probability paths, leading to more efficient learning and execution.
Implications
The findings suggest that modifying the source distribution in generative policies can significantly enhance robotic control performance. This approach could be applied to various robotic manipulation tasks, potentially leading to more efficient and effective learning algorithms in robotics.
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
Reinforcement Learning
Large Language Models
NLP
- Introduction of Reward-Decorrelated Policy Optimization (RDPO) for stabilizing multi-objective reinforcement learning.
- Utilization of Magnitude-Aware Quantile Normalization and Mahalanobis whitening to address reward heterogeneity and correlation.
- Demonstrated improvements in model performance on instruction following and writing quality through RDPO.
- Introduction of Effective Information Efficiency (ηeff) as a metric for assessing mixed-reward aggregation quality.
Read more
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
Summary
This paper addresses the challenges of multi-objective and mixed-reward reinforcement learning environments, where heterogeneous reward distributions and correlated reward dimensions can destabilize the training process. The authors propose a novel method called Reward-Decorrelated Policy Optimization (RDPO), which aims to enhance the stability and effectiveness of reward processing in such complex settings. RDPO employs a two-step approach: first, it utilizes Magnitude-Aware Quantile Normalization to stabilize advantage allocation across various reward types (binary, fractional, and continuous). Second, it applies Mahalanobis whitening to reduce correlation redundancy among reward dimensions before aggregation. The effectiveness of RDPO is demonstrated through post-training experiments on the LongCat-Flash model, showing improvements in instruction following, writing quality, and robustness to challenging prompts, while maintaining competitive performance in reasoning and coding tasks. The paper also introduces a diagnostic measure, Effective Information Efficiency (ηeff), to evaluate the quality of mixed-reward aggregation, highlighting the importance of balancing weights across reward dimensions and minimizing redundant variance.
Methodology
The methodology involves a two-step reward processing pipeline: (1) Magnitude-Aware Quantile Normalization for stabilizing advantage allocation across diverse reward types, and (2) Mahalanobis whitening to mitigate correlation redundancy among reward dimensions prior to aggregation. The paper also introduces Effective Information Efficiency (ηeff) as a diagnostic measure for evaluating the effectiveness of mixed-reward aggregation.
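A minimal sketch of the whitening step, assuming each training sample carries a vector of per-dimension rewards: the batch covariance is inverted to the -1/2 power so the aggregated reward dimensions become approximately uncorrelated with unit variance. The jitter term and ZCA-style formulation are illustrative choices.

```python
import numpy as np

def mahalanobis_whiten(rewards, eps=1e-6):
    """Whiten a batch of reward vectors so the dimensions are (approximately)
    decorrelated with unit variance before aggregation.

    rewards: array of shape (batch, num_reward_dims).
    """
    centered = rewards - rewards.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(rewards.shape[0] - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    whitening = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # Cov^{-1/2}
    return centered @ whitening
```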
Results
The application of RDPO in post-training experiments on LongCat-Flash resulted in enhanced instruction following, improved writing quality, and increased robustness to difficult prompts. The method also demonstrated competitive performance in reasoning and coding evaluations, indicating its effectiveness in handling complex multi-objective tasks.
Implications
The findings suggest that RDPO can significantly improve the training stability and performance of reinforcement learning models in environments with mixed-reward signals. This has potential applications in various domains requiring multi-task learning and complex reward structures, such as natural language processing and robotics.
TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes
Theory
Optimization
Efficient ML
- TILBench evaluates over 40 imbalanced learning algorithms across 57 datasets, resulting in extensive empirical insights.
- The effectiveness of imbalanced learning methods varies significantly based on dataset characteristics such as sample size and imbalance severity.
- No single method is universally superior; practical recommendations for method selection are provided based on data properties.
- The benchmark assesses not only predictive performance but also computational scalability and efficiency.
Read more
TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes
Summary
Imbalanced learning is a significant challenge in tabular data applications, where minority instances are often critical yet underrepresented. This paper introduces TILBench, a comprehensive benchmark designed to systematically evaluate over 40 imbalanced learning algorithms across 57 diverse tabular datasets. The benchmark encompasses more than 200,000 controlled experiments, providing insights into the predictive performance, robustness, and computational scalability of various methods under different data characteristics. The findings reveal that no single method consistently outperforms others across all scenarios; instead, the effectiveness of imbalanced learning techniques is highly dependent on dataset characteristics and computational constraints. The authors offer practical recommendations for selecting appropriate methods based on specific data properties and system limitations, emphasizing the need for a regime-aware approach to imbalanced learning.
Methodology
The authors established TILBench as a large-scale empirical benchmark that evaluates imbalanced learning methods through a unified framework. They conducted controlled experiments across diverse datasets, analyzing performance, behavior under varying data characteristics, and computational efficiency. The methods were categorized into three families: data-level, algorithm-level, and ensemble-based approaches.
Results
The benchmark revealed that no single imbalanced learning method consistently outperformed others across all datasets. Instead, performance varied significantly based on dataset characteristics, such as imbalance severity and feature dimensionality. The study also highlighted the computational scalability of different methods, providing insights into how performance and efficiency change with increasing data complexity.
Implications
The findings from TILBench can guide practitioners in selecting appropriate imbalanced learning methods tailored to specific real-world applications, such as fraud detection and medical diagnosis. The regime-aware recommendations can help optimize predictive performance while considering computational constraints.
Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings
Multimodal
Time Series
Generative Models
- Introduces a novel approach to analyze session-wide trajectories in pediatric PSG embeddings.
- Applies persistent homology to characterize topological features of sleep data.
- Demonstrates that augmenting embeddings with clinical EHR data improves predictive performance.
- Shows significant improvements in AUPRC for various sleep event detection tasks.
Read more
Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings
Summary
This paper investigates the latent structure of multimodal embeddings derived from pediatric polysomnography (PSG) data, focusing on the session-wide diagnostic information contained in sequences of 30-second PSG epochs. The authors utilize a multimodal masked autoencoder to generate embeddings and augment them with various features, including PHATE-derived coordinates, persistent homology summaries, and electronic health records (EHR). The study aims to understand how these embeddings can reflect disease burden and improve the detection of sleep events such as apnea, hypopnea, desaturation, and EEG arousal. The authors employ simple linear and multi-layer perceptron (MLP) models for interpretability, demonstrating that geometric, topological, and clinical features provide complementary gains in predictive performance. The results indicate that more expressive late-fusion models outperform simpler models, with notable improvements in area under the precision-recall curve (AUPRC) across various sleep event detection tasks. The study highlights the importance of latent geometry and topology in enhancing model calibration and robustness, particularly in the context of pediatric sleep disorders.
Methodology
The authors employed a multimodal masked autoencoder to generate embeddings from pediatric PSG data. They augmented these embeddings with PHATE-derived coordinates, persistent homology summaries, and EHR data. Predictive models, including linear and MLP models, were used to evaluate the impact of these features on the detection of sleep events.
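The exact topological summaries are not spelled out in the abstract; a minimal sketch of the general idea is to reduce each persistence diagram (a set of birth/death pairs) to a few scalar statistics and concatenate them with the pooled session embedding. The shapes, diagrams, and feature choices below are illustrative only.

```python
import numpy as np

def persistence_summaries(diagram):
    """Simple scalar summaries of one persistence diagram.

    diagram: (n, 2) array of (birth, death) pairs for a single homology dimension.
    """
    lifetimes = diagram[:, 1] - diagram[:, 0]
    return np.array([
        lifetimes.sum(),                                  # total persistence
        lifetimes.max(initial=0),                         # most persistent feature
        lifetimes.mean() if len(lifetimes) else 0.0,      # mean lifetime
        float(len(lifetimes)),                            # number of features
    ])

# Hypothetical session: a pooled PSG embedding plus H0 and H1 diagrams.
session_embedding = np.random.randn(128)
h0 = np.array([[0.0, 0.9], [0.0, 0.4]])
h1 = np.array([[0.2, 0.5]])
augmented = np.concatenate([session_embedding,
                            persistence_summaries(h0),
                            persistence_summaries(h1)])
```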
Results
The study found that augmenting the embeddings with geometric and topological features, as well as EHR data, led to improved predictive performance. AUPRC scores improved significantly for desaturation (0.26 to 0.34), EEG arousal (0.31 to 0.48), hypopnea (0.09 to 0.22), and apnea (0.05 to 0.14). The full fusion model also exhibited the best calibration across all tasks.
Implications
The findings suggest that understanding the latent geometry and topology of sleep embeddings can enhance the interpretability and performance of generative models in sleep medicine, potentially leading to better diagnostic tools for pediatric sleep disorders.
Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic
Reinforcement Learning
Federated Learning
Robotics
- Introduction of a federated actor-critic framework that supports personalized policy training.
- Establishment of finite-time convergence rates for critic error and policy gradient norms.
- Development of a new perturbation analysis to handle complexities in heterogeneous environments.
- Experimental validation showing improved performance over traditional federated learning methods.
Read more
Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic
Summary
This paper presents a novel federated actor-critic framework that allows agents to collaboratively train personalized policies while sharing a common linear subspace representation. The authors address the challenge of environmental heterogeneity in reinforcement learning by enabling agents to maintain personalized local policy components. The proposed method, termed pFedAC, employs a single-timescale update scheme under Markovian sampling, a setting that is practical yet analytically challenging. The authors establish finite-time convergence results, demonstrating that the critic error and policy gradient norm converge to zero at specific rates, indicating linear speedup with respect to the number of agents. The paper also introduces a new perturbation analysis for the projected subspace updates and QR decomposition steps, addressing the complexities introduced by heterogeneous Markovian noise and policy updates. Experimental results on a federated Hopper-v5 benchmark show that the personalized approach outperforms existing methods like Single PPO and FedAvg PPO, and that the learned shared representation facilitates faster adaptation in downstream tasks.
Methodology
The authors propose a personalized federated actor-critic algorithm (pFedAC) that allows agents to collaboratively estimate a shared subspace while updating their local critic heads and personalized policies. The framework operates under a single-timescale update scheme and employs Markovian sampling to analyze convergence and performance.
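A rough sketch of the shared-subspace aggregation (not the paper's algorithm) is to average the agents' local subspace estimates on the server and re-orthonormalize the result with a QR decomposition, which is the kind of projected update whose perturbation the authors analyze. Dimensions and the sign-fixing convention below are illustrative.

```python
import numpy as np

def aggregate_subspace(local_subspaces):
    """Average agents' local estimates of the shared representation,
    then re-orthonormalize the columns with a QR decomposition.

    local_subspaces: list of (d, k) matrices, one per agent.
    Returns an orthonormal (d, k) matrix spanning the aggregated subspace.
    """
    averaged = np.mean(local_subspaces, axis=0)
    q, r = np.linalg.qr(averaged)
    # Fix the sign ambiguity of QR so the output columns are deterministic.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs

# Example: 4 agents, ambient dimension 16, shared subspace of rank 3.
rng = np.random.default_rng(1)
locals_ = [np.linalg.qr(rng.normal(size=(16, 3)))[0] + 0.01 * rng.normal(size=(16, 3))
           for _ in range(4)]
shared = aggregate_subspace(locals_)
print(np.allclose(shared.T @ shared, np.eye(3)))   # columns are orthonormal
```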
Results
The paper demonstrates that the critic error converges to zero at a rate of Õ(1/((1 − γ)^4 √(TK))) and the policy gradient norm at a rate of Õ(1/((1 − γ)^6 √(TK))), indicating linear speedup with respect to the number of agents K. Experimental results on the federated Hopper-v5 benchmark show that the proposed method outperforms Single PPO and FedAvg PPO.
Implications
The findings suggest that personalized policy training in federated reinforcement learning can significantly enhance performance in heterogeneous environments, making it applicable to complex AI systems where collaboration and individual adaptation are crucial.
LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning
Theory
Efficient ML
- LoMETab introduces a rank-r multiplicative implicit ensemble framework for tabular MLPs.
- The model allows for member-specific deviations from a shared weight, enhancing diversity control.
- Empirical results indicate that LoMETab sustains higher predictive diversity compared to traditional methods.
- The framework provides practical trade-offs among rank, ensemble size, and initialization scale.
Read more
LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning
Summary
The paper introduces LoMETab, a novel framework for tabular deep learning that generalizes multiplicative implicit ensembles beyond the traditional rank-1 structure. As performance gains in tabular learning plateau, understanding and controlling the mechanisms that enhance simple neural tabular models becomes crucial. LoMETab employs a rank-r generalization of the BatchEnsemble and TabM architectures, allowing for member-specific low-rank factors that enhance diversity and predictive performance. The authors demonstrate that this approach not only expands the hypothesis class of existing methods but also provides practical axes for controlling ensemble diversity through the adapter rank and initialization scale. Empirical results show that LoMETab achieves higher predictive diversity and maintains competitive performance across various datasets, indicating its potential as a flexible and powerful tool for tabular data applications.
Methodology
LoMETab utilizes a rank-r generalization of multiplicative implicit ensembles by parameterizing member weights with low-rank factors. Each member's weight is defined as W_k = W ⊙ (1 + A_k B_k^⊤), where W is shared and (A_k, B_k) are member-specific low-rank matrices. This structure allows for enhanced diversity control through the adapter rank and initialization scale, which are empirically validated across various datasets.
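The parameterization above translates directly into a small ensemble linear layer. The sketch below follows the stated formula W_k = W ⊙ (1 + A_k B_k^⊤) but is illustrative: the layer name, initialization scale, and per-member input layout are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LowRankMultiplicativeEnsembleLinear(nn.Module):
    """Linear layer whose K implicit-ensemble members share W but apply
    member-specific multiplicative low-rank adapters: W_k = W * (1 + A_k B_k^T)."""
    def __init__(self, in_features, out_features, num_members=4, rank=2, init_scale=0.01):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_members, out_features))
        self.A = nn.Parameter(init_scale * torch.randn(num_members, out_features, rank))
        self.B = nn.Parameter(init_scale * torch.randn(num_members, in_features, rank))

    def forward(self, x):
        # x: (num_members, batch, in_features) -- one input stream per member.
        adapters = 1.0 + torch.einsum("kor,kir->koi", self.A, self.B)   # (K, out, in)
        weights = self.weight.unsqueeze(0) * adapters                   # W_k = W ⊙ (1 + A_k B_k^T)
        return torch.einsum("kbi,koi->kbo", x, weights) + self.bias.unsqueeze(1)

# Usage: 4 members, batch of 8 rows with 10 features each.
layer = LowRankMultiplicativeEnsembleLinear(10, 16, num_members=4, rank=2)
out = layer(torch.randn(4, 8, 10))   # -> (4, 8, 16)
```

Setting rank=1 recovers a BatchEnsemble-style member, which is why the adapter rank acts as an explicit knob on ensemble diversity.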
Results
The results demonstrate that LoMETab achieves higher pairwise KL divergence compared to an additive low-rank ablation, indicating greater predictive diversity. The framework's performance varies significantly across different configurations of adapter rank and initialization scale, showcasing its adaptability and effectiveness in maintaining competitive performance against established tabular models.
Implications
LoMETab's ability to control ensemble diversity and enhance predictive performance suggests its applicability in various real-world tabular data scenarios, such as finance, healthcare, and logistics. The framework's flexibility in tuning parameters can lead to improved model performance tailored to specific datasets, making it a valuable tool for practitioners in the field.
Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model
Theory
- Introduces a minimal binary model to study shortcut features and OOD failure.
- Demonstrates that training-side observations can indicate potential cross-family failures.
- Establishes that positive training shortcut correlation and shortcut-rule transitions are distinct phenomena.
- Shows that the same training solution can yield different outcomes depending on the held-out family.
Read more
Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model
Summary
This paper investigates the relationship between shortcut features and out-of-distribution (OOD) failure in a minimal binary model. The author introduces a model with one invariant coordinate and one family-dependent shortcut coordinate, aiming to clarify how training correlation, learned shortcut use, and test-time failure interact. The study reveals that positive average shortcut correlation can lead to a transition towards shortcut reliance during training, but ridge regularization can maintain an invariant-dominated classifier, preventing deterministic OOD failure. However, when the invariant coordinate is noisy, the model shows that the transition to shortcut reliance can occur if the training shortcut signal surpasses the invariant signal. The consequences of this transition vary depending on the held-out family, indicating that weaker shortcut correlation can result in positive excess risk, while sign-flipped families can lead to above-chance error. The findings emphasize the distinction between shortcut attraction, shortcut-rule transition, and cross-family OOD failure, providing a clearer understanding of these phenomena in machine learning.
Methodology
The author employs a closed-form binary model with two observed coordinates: an invariant signal and a family-dependent shortcut. The model analyzes the effects of training shortcut correlation on classifier behavior, using deterministic and noisy regimes to derive conditions for shortcut transitions and OOD failure. Theoretical results are supported by synthetic checks to validate the model's predictions.
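A toy version of this setup is easy to reproduce: draw a noisy invariant coordinate and a family-dependent shortcut coordinate, fit a ridge solution in closed form, and watch the learned weight shift toward the shortcut as its training signal grows past the invariant one. The generative parameters below are illustrative, not the paper's.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit of a linear score w for labels in {-1, +1}."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, lam = 5000, 1.0
y = rng.choice([-1.0, 1.0], size=n)
invariant = 1.0 * y + rng.normal(scale=1.0, size=n)        # noisy invariant coordinate
for shortcut_strength in [0.5, 1.0, 2.0]:
    shortcut = shortcut_strength * y + rng.normal(scale=1.0, size=n)  # family-dependent coordinate
    X = np.column_stack([invariant, shortcut])
    w = ridge_fit(X, y, lam)
    print(f"shortcut signal {shortcut_strength:.1f}: w_invariant={w[0]:.2f}, w_shortcut={w[1]:.2f}")
```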
Results
The study finds that in a deterministic setting, ridge regularization prevents deterministic OOD failure despite positive shortcut correlation. In a noisy regime, a transition to shortcut reliance occurs when the training shortcut signal exceeds the invariant signal, leading to varying outcomes based on the held-out family. The results indicate that positive training shortcut correlation does not guarantee robustness, as it can lead to different levels of risk depending on the test family's characteristics.
Implications
The findings have significant implications for understanding shortcut learning and OOD failure in machine learning models. By clarifying the distinctions between shortcut attraction, transitions, and failures, the research can inform the design of more robust training methodologies and diagnostics for identifying potential OOD issues.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
NLP
Large Language Models
Theory
- Introduces a novel Dynamical Mean Field Theory (DMFT) for analyzing MoE training dynamics.
- Identifies limitations of the Maximal Update Parameterization (µP) in achieving stable learning-rate transfer.
- Proposes the Maximally Scale-Stable Parameterization (MSSP) to enhance stability and performance across scaling regimes.
- Empirical results demonstrate that MSSP outperforms µP in terms of learning-rate transfer and monotonic improvement.
Read more
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
Summary
This paper addresses the scaling of Mixture-of-Experts (MoE) architectures, which are increasingly used in large language models. The authors identify a gap in the understanding of how hyperparameters should scale with various dimensions such as network width, expert width, number of experts, sparsity, and depth to ensure stability and optimal performance. They analyze three co-scaling regimes and develop a novel Dynamical Mean Field Theory (DMFT) to describe the training dynamics of MoEs. The study reveals that the existing Maximal Update Parameterization (µP) does not guarantee reliable learning-rate transfer or monotonic improvement with scale. To address this, the authors propose a Maximally Scale-Stable Parameterization (MSSP) that enhances stability across all scaling regimes. Empirical validation shows that MSSP successfully recovers learning-rate transfer and improves performance predictably with scale. The findings provide a comprehensive scaling prescription for MoE architectures, contributing to more stable and efficient training of large models.
Methodology
The authors analyze three co-scaling regimes for MoEs using a novel Dynamical Mean Field Theory (DMFT) to derive parameterizations that satisfy the µP desiderata for SGD and Adam. They investigate the scale-dependence of µP and propose the MSSP, which includes specific corrections for each regime. The methodology involves both theoretical derivation and empirical validation through experiments on MLP and Transformer MoEs.
Results
The study finds that the MSSP achieves reliable learning-rate transfer and monotonic improvement with scale across all three co-scaling regimes, contrasting with the µP which fails to do so. The proposed MSSP provides a more stable and predictable framework for scaling MoE architectures, leading to improved performance in large-scale models.
Implications
The findings have significant implications for the design and training of large language models and other applications using MoE architectures. By providing a robust scaling prescription, the research aids in optimizing hyperparameter tuning and enhancing model performance, which is crucial as models continue to grow in size and complexity.
Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks
Efficient ML
Theory
Time Series
- Introduces SeAl-KD, a selective knowledge distillation framework for SNNs.
- Highlights the mismatch between intermediate and final predictions in SNNs.
- Proposes Error-aware Logit Alignment (ELA) and Selective Temporal Alignment (STA) for improved supervision.
- Demonstrates significant performance improvements on various datasets.
Read more
Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks
Summary
This paper addresses the performance gap between Spiking Neural Networks (SNNs) and traditional Artificial Neural Networks (ANNs) by proposing a novel knowledge distillation approach called Selective Alignment Knowledge Distillation (SeAl-KD). Traditional knowledge distillation methods enforce uniform alignment across all timesteps, which does not account for the varying importance of timesteps in SNNs. The authors argue that not all intermediate predictions are equally informative and that erroneous timesteps should not be forced to align uniformly with teacher signals. Instead, SeAl-KD selectively aligns class-level and temporal knowledge by refining the logits at erroneous timesteps and reweighting the alignment based on confidence and inter-timestep similarity. The paper presents extensive experiments on both static image datasets and neuromorphic event-based datasets, demonstrating that SeAl-KD consistently improves performance over existing distillation methods, thereby preserving richer temporal distributions and enhancing the training of SNNs.
Methodology
The proposed SeAl-KD framework consists of two main components: Error-aware Logit Alignment (ELA), which refines the class evidence for erroneous timesteps, and Selective Temporal Alignment (STA), which emphasizes reliable source timesteps during the alignment process. The methodology involves analyzing the confidence and similarity of predictions across timesteps to selectively guide the distillation process.
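The precise ELA and STA formulations are not reproduced in the summary, but the selective idea can be sketched as a per-timestep distillation loss whose weights downweight unreliable timesteps. In the illustrative PyTorch snippet below, the weight is simply the student's confidence in the correct class at each timestep; the criteria in the paper additionally use inter-timestep similarity and refine the teacher signal at erroneous timesteps.

```python
import torch
import torch.nn.functional as F

def selective_timestep_distillation(student_logits, teacher_logits, labels, tau=2.0):
    """Per-timestep KD loss where unreliable timesteps receive less weight.

    student_logits, teacher_logits: (T, B, C) logits over T timesteps.
    labels: (B,) ground-truth classes.
    """
    T, B, C = student_logits.shape
    probs = student_logits.softmax(dim=-1)                                        # (T, B, C)
    conf = probs.gather(-1, labels.view(1, B, 1).expand(T, B, 1)).squeeze(-1)     # (T, B)
    weights = conf / conf.sum(dim=0, keepdim=True).clamp_min(1e-8)                # normalize over timesteps

    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="none").sum(dim=-1)                                   # (T, B)
    return (weights * kd).sum(dim=0).mean() * tau * tau

# Example with 4 timesteps, batch 8, 10 classes.
s = torch.randn(4, 8, 10, requires_grad=True)
t = torch.randn(4, 8, 10)
y = torch.randint(0, 10, (8,))
loss = selective_timestep_distillation(s, t, y)
loss.backward()
```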
Results
The experiments conducted on static image datasets and neuromorphic event-based datasets show that SeAl-KD leads to consistent performance improvements compared to existing distillation methods. The results indicate that the selective alignment approach effectively preserves the temporal dynamics of SNNs and enhances their overall accuracy.
Implications
The findings suggest that selective knowledge distillation can significantly enhance the training efficiency and performance of SNNs, making them more competitive with ANNs. This has potential applications in energy-efficient computing and neuromorphic hardware deployment, where SNNs can be utilized for tasks requiring real-time processing and low power consumption.
Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations
Efficient ML
Generative Models
Theory
- Di-BiLPS effectively addresses both forward and inverse PDE problems under extreme data sparsity.
- The framework utilizes a combination of variational autoencoders, latent diffusion models, and contrastive learning.
- It achieves state-of-the-art performance with significantly reduced computational costs.
- The proposed denoising algorithm integrates physical constraints for improved inference.
Read more
Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations
Summary
The paper introduces Di-BiLPS, a novel neural framework designed to address the challenges of solving partial differential equations (PDEs) under extremely sparse observational data. Traditional numerical solvers and existing neural approaches struggle with high-resolution inference and accuracy when data is limited. Di-BiLPS combines a variational autoencoder for dimensionality reduction, a latent diffusion module for uncertainty modeling, and contrastive learning for representation alignment. This framework operates in a compressed latent space, enhancing computational efficiency and flexibility in input-output mapping. A key innovation is the PDE-informed denoising algorithm, which utilizes a variance-preserving diffusion process to improve inference efficiency. Extensive experiments across five PDE benchmark datasets demonstrate that Di-BiLPS consistently outperforms state-of-the-art methods in both accuracy and computational cost, even with as little as 3% input data. Furthermore, it supports zero-shot super-resolution, allowing predictions over continuous spatial-temporal domains without retraining.
Methodology
Di-BiLPS employs a three-component architecture: (1) a contrastive learning module for aligning representations between sparse and full observations, (2) a pre-trained variational autoencoder to compress inputs into a latent space, and (3) a latent diffusion model that facilitates bidirectional inference for PDE solutions. The framework also includes a PDE-informed denoising algorithm based on a variance-preserving diffusion process.
Results
The experiments conducted on five PDE benchmark datasets reveal that Di-BiLPS consistently outperforms existing methods in terms of prediction accuracy and computational efficiency, even with extremely sparse inputs. The framework demonstrates the ability to generalize to unseen spatial resolutions without the need for retraining.
Implications
The advancements presented in Di-BiLPS could significantly enhance the modeling of complex physical and natural phenomena in various fields, including engineering, physics, and environmental science, where data is often sparse. The ability to perform zero-shot super-resolution may also open new avenues for real-time applications and simulations.
Scaling Laws for Mixture Pretraining Under Data Constraints
NLP
Large Language Models
Optimization
- Mixture training allows for higher repetition of target data compared to single-source training.
- Optimal repetition rates for target data range from 15 to 20 times, depending on various factors.
- A new scaling law is introduced that predicts target-domain loss based on mixture configurations.
- Empirical findings demonstrate that larger models can extract more from limited data despite faster overfitting.
Read more
Scaling Laws for Mixture Pretraining Under Data Constraints
Summary
This paper investigates the trade-off in mixture pretraining of language models when faced with limited target data, such as low-resource languages or specialized domains. The authors conduct over 2,000 training runs across various model sizes and data types to explore how the mixture of scarce target data with abundant generic data affects model performance. They find that while too little target data underexposes the model, excessive repetition of target data leads to overfitting. The study reveals that mixture training can tolerate higher repetition rates than single-source training, with optimal repetitions ranging from 15 to 20 times, depending on the target data size and compute budget. The authors introduce a repetition-aware mixture scaling law that predicts target-domain loss based on target data size, mixture ratio, and model size, providing practical recommendations for effective pretraining under data constraints. This work contributes to the understanding of how to optimally mix constrained and abundant data sources in language model training.
Methodology
The authors conducted a systematic empirical study involving over 2,000 training runs across different model sizes (from 101M to 805M parameters) and various target data types, including multilingual and domain-specific datasets. They analyzed the effects of data repetition on model performance and developed a scaling law that incorporates the diminishing returns of repeated tokens in mixture training.
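The fitted coefficients of the paper's law are not given in the summary, but the qualitative shape can be illustrated with a Chinchilla-style loss in which repeated target tokens contribute with diminishing returns. Every constant in the sketch below is invented for illustration; it is not the paper's fitted scaling law.

```python
import numpy as np

def effective_target_tokens(unique_tokens, repetitions, half_life=5.0):
    """Toy diminishing-returns model: each additional pass over the target
    data contributes less than the previous one (geometric decay)."""
    decay = np.exp(-np.arange(repetitions) / half_life)
    return unique_tokens * decay.sum()

def target_loss(model_params, unique_tokens, repetitions,
                E=1.7, A=400.0, B=1200.0, alpha=0.34, beta=0.28):
    """Chinchilla-style loss form with the data term using effective tokens.
    All constants here are illustrative placeholders."""
    d_eff = effective_target_tokens(unique_tokens, repetitions)
    return E + A / model_params ** alpha + B / d_eff ** beta

# Loss flattens as repetitions grow: more passes help, but with shrinking returns.
for r in [1, 5, 15, 30]:
    print(r, round(target_loss(3e8, 1e8, r), 4))
```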
Results
The study found that repetition is a significant factor in target-domain performance, with optimal repetition rates allowing for effective learning from limited data. The introduced scaling law accurately predicts target-domain loss and provides insights into the optimal mixture configurations for pretraining, demonstrating that higher repetition is feasible without performance degradation when abundant generic data is present.
Implications
The findings have important implications for training language models in scenarios with limited target data, such as low-resource languages or specialized domains. The proposed scaling law and recommendations can guide practitioners in optimizing their pretraining strategies, potentially improving model performance in underrepresented areas.
RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems
Theory
Interpretability
- RISED Framework introduces a five-dimension evaluation for clinical AI systems.
- Framework identifies critical deployment risks not captured by traditional metrics.
- Validation across multiple cohorts shows varying failure patterns, supporting construct validity.
- Equity dimension highlights the need for independent measures of clinical need.
Read more
RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems
Summary
The paper introduces the RISED Framework, a comprehensive five-dimension pre-deployment evaluation approach for clinical AI decision-support systems. Traditional evaluation metrics often fail to capture critical deployment-phase failures such as input reliability, subgroup equity, threshold sensitivity, and operational feasibility. The RISED Framework encompasses five dimensions: Reliability, Inclusivity, Sensitivity, Equity, and Deployability, each defined by formal sub-criteria and pass/fail thresholds. The framework employs bias-corrected accelerated bootstrap confidence intervals to assess each dimension, allowing for a quantitative verdict on the model's readiness for clinical deployment. The author demonstrates that even classifiers meeting conventional high-discrimination benchmarks can fail in critical areas, highlighting the need for a more nuanced evaluation approach. The framework was validated across synthetic and real-world cohorts, revealing varying failure patterns across datasets, thus providing preliminary evidence of its construct validity. Additionally, the Equity dimension is reframed to address proxy-dependence issues, emphasizing the importance of independent measures of clinical need. RISED is made available as an open-source Python package, facilitating the transition from in-silico validation to clinical evaluation.
Methodology
The RISED Framework operationalizes five evaluation dimensions through measurable sub-criteria, employing bootstrap 95% confidence intervals to derive PASS, FAIL, and INCONCLUSIVE verdicts. The framework was validated using a synthetic cohort and three real-world datasets spanning 35 years of clinical data.
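A simplified version of the verdict logic can be sketched with a plain percentile bootstrap (the framework itself uses bias-corrected accelerated intervals): a dimension passes only if the entire 95% interval clears its threshold. The metric, threshold, and data below are illustrative.

```python
import numpy as np

def bootstrap_verdict(y_true, y_score, metric, threshold, n_boot=2000, seed=0):
    """PASS if the whole 95% CI clears the threshold, FAIL if it lies entirely
    below it, INCONCLUSIVE if the CI straddles it."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample cases with replacement
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    if lo >= threshold:
        return "PASS", (lo, hi)
    if hi < threshold:
        return "FAIL", (lo, hi)
    return "INCONCLUSIVE", (lo, hi)

# Example: sensitivity (recall) at a fixed operating point, required to exceed 0.85.
recall = lambda yt, yp: ((yp >= 0.5) & (yt == 1)).sum() / max((yt == 1).sum(), 1)
y = np.random.default_rng(1).integers(0, 2, size=500)
p = np.clip(y * 0.7 + np.random.default_rng(2).normal(0.2, 0.25, size=500), 0, 1)
print(bootstrap_verdict(y, p, recall, threshold=0.85))
```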
Results
The evaluation revealed that two dimensions failed and one was statistically inconclusive despite achieving an AUROC of 0.961. The framework demonstrated that reliability and sensitivity can be compromised even in high-performing models, underscoring the necessity for comprehensive pre-deployment assessments.
Implications
The RISED Framework provides a structured approach for evaluating clinical AI systems before deployment, potentially improving patient safety and care quality by identifying risks that traditional metrics overlook. Its open-source nature allows for widespread adoption and adaptation in clinical settings.
Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations
Computer Vision
Robotics
Theory
- Introduction of a semantic basis as a minimal reusable interface for monitoring ptSTL fragments.
- Development of a rolling prediction monitor that updates predicate values online, improving learning efficiency.
- Demonstration of compositional conformal certification that allows simultaneous certification of multiple formulas.
- Empirical validation on real-world data showing effectiveness and tightness of certified bounds.
Read more
Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations
Summary
This paper addresses the challenge of certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations, particularly under conditions of partial observability. The authors propose a reusable monitoring framework that allows for the certification of various safety specifications without the need for retraining the model for each new specification. They introduce a semantic basis, which is a vector of atom robustness scores, as a minimal prediction target that enables the evaluation of any formula in a specified fragment. The paper also presents a rolling prediction monitor that updates predicate values online, which simplifies the learning process and enhances prediction accuracy. Empirical validation on real-world Waymo driving data demonstrates that both proposed monitors satisfy conformal coverage guarantees, with the rolling monitor achieving tighter bounds at short horizons and the semantic-basis monitor performing better at longer horizons.
Methodology
The authors utilize a combination of semantic latent representations and conformal prediction to create a runtime monitoring framework. They prove the minimality of the semantic basis for a class of monotone, 1-Lipschitz reusable interfaces and develop a rolling prediction monitor that predicts current predicate values while reconstructing temporal history. The methodology includes empirical testing on a pedestrian-crossroad benchmark and real-world Waymo driving data.
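The conformal step can be sketched as a one-sided split-conformal correction on predicted robustness scores: the calibration quantile of the over-estimation error is subtracted from new predictions to obtain a certified lower bound. This is a generic split-conformal recipe under an exchangeability assumption, not the paper's exact monitor.

```python
import numpy as np

def calibrate_lower_bound(pred_cal, true_cal, alpha=0.1):
    """Split-conformal correction so that pred - q lower-bounds the true
    robustness with probability >= 1 - alpha on exchangeable data."""
    scores = pred_cal - true_cal                       # over-estimation amounts
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))            # requires k <= n (enough calibration data)
    return np.sort(scores)[k - 1]

# Calibration split: predicted vs. ground-truth robustness of each atom.
rng = np.random.default_rng(0)
true_rob = rng.normal(size=1000)
pred_rob = true_rob + rng.normal(scale=0.3, size=1000)
q = calibrate_lower_bound(pred_rob, true_rob, alpha=0.1)

# Certified lower bound for a new prediction:
new_pred = 0.42
print("certified robustness >=", new_pred - q)
```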
Results
The semantic basis monitor provides a robust framework for evaluating safety specifications with a single calibration pass, achieving up to four times tighter bounds at long horizons compared to the rolling monitor. The rolling prediction monitor demonstrates higher accuracy and tighter bounds at shorter horizons, validating the effectiveness of both approaches in practical scenarios.
Implications
The proposed methods have significant implications for the development of autonomous systems, particularly in enhancing the reliability and adaptability of runtime monitoring frameworks. This work enables the certification of safety conditions in dynamic environments, potentially improving the safety and performance of autonomous vehicles and other systems operating under varying specifications.
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
Multimodal
Time Series
Optimization
- GHGbench is the first open dataset and benchmark for joint evaluation of company and building-level carbon emissions.
- Building emissions are structurally more difficult to predict compared to company emissions due to additional influencing factors.
- The in-distribution to out-of-distribution performance gap is larger than within-model variations.
- Multimodal remote-sensing embeddings significantly improve prediction accuracy in challenging scenarios.
Read more
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
Summary
GHGbench introduces a comprehensive open dataset and benchmark for predicting greenhouse gas emissions at both company and building levels. The dataset comprises over 32,000 company-year records from more than 12,000 firms, including Scope 1, 2, and 3 emissions disclosures, alongside financial and sectoral signals. Additionally, the building track harmonizes 491,591 building-year records from 13 sources across 26 metropolitan areas, integrating climate covariates and multimodal remote-sensing embeddings. The benchmark establishes canonical task splits for in-distribution and cross-region/city transfer tasks, as well as short-horizon forecasting. Various models, including gradient-boosted trees, tabular foundation models, MLPs, FT-Transformers, and multimodal fusion techniques, were evaluated using multi-seed paired-bootstrap tests. Key findings reveal that building emissions are more complex to predict than company emissions, the gap between in-distribution and out-of-distribution performance is significant, and multimodal embeddings enhance prediction accuracy where traditional tabular methods fail. GHGbench also identifies systematic failure modes, such as catastrophic city transfer, indicating areas for future model improvement. The dataset and evaluation framework aim to facilitate reproducible research and advance carbon emission prediction methodologies.
Methodology
The methodology involves creating a unified dataset from fragmented sources, normalizing identifiers and units, and enriching data with financial and sectoral signals. The benchmark includes a multi-task evaluation suite with canonical splits for various prediction tasks, employing models like gradient-boosted trees, tabular foundation models, and multimodal approaches, all assessed through rigorous statistical testing.
Results
The results indicate that building emissions are harder to predict than company emissions, with a notable performance gap between in-distribution and out-of-distribution settings. A tabular foundation model achieved significant improvements over traditional tuned trees in multi-city building-emission tasks. Additionally, the use of multimodal embeddings provided measurable gains in cross-city transfer scenarios.
Implications
The findings from GHGbench can inform climate policy, finance, and urban operations by providing a robust framework for carbon emission prediction. The dataset can be utilized for developing more accurate predictive models, enhancing transparency in emissions reporting, and guiding strategic decisions towards achieving net-zero emissions.
Language-Induced Priors for Domain Adaptation
NLP
Large Language Models
Reinforcement Learning
- Introduction of Language-Induced Prior (LIP) for source relevance in domain adaptation.
- Integration of LIP into a Bayesian hierarchical model using Expectation-Maximization (EM) for improved performance.
- Theoretical guarantees validate the effectiveness of the proposed framework.
- Empirical results demonstrate superior performance in various tasks, especially under data scarcity.
Read more
Language-Induced Priors for Domain Adaptation
Summary
This paper addresses the challenges of domain adaptation (DA) in cold-start scenarios, where target data is scarce, leading to difficulties in distinguishing relevant source domains from irrelevant ones. The authors propose a novel probabilistic framework that utilizes expert textual descriptions of the target domain to create a Language-Induced Prior (LIP). This LIP is derived from a pretrained Large Language Model (LLM) and is integrated into an Expectation-Maximization (EM) algorithm to enhance source relevance identification. The framework is compatible with any parametric model with available likelihoods, allowing it to guide source selection when target signals are weak and refine these choices as more data becomes available. Theoretical guarantees are provided, demonstrating that the estimator can approximate an oracle cold-start mean squared error under a correct prior, while maintaining consistency regardless of LIP quality. Empirical validation across three tasks—Gaussian estimation, C-MAPSS dataset analysis, and a deep reinforcement learning task using MuJoCo hopper—shows that the LIP-aided EM outperforms traditional methods, particularly when target data is limited.
Methodology
The authors developed a probabilistic framework that leverages expert textual descriptions to create a Language-Induced Prior (LIP). This LIP is incorporated into an Expectation-Maximization (EM) algorithm, which operates within a Bayesian hierarchical model to identify relevant source domains. The framework is designed to refine source selection as more target data becomes available, ensuring adaptability in cold-start scenarios.
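While the paper's hierarchical model is not reproduced in the abstract, the role of the LIP can be sketched as a prior inside a standard E-step: posterior source-relevance weights are proportional to the language-induced prior times the likelihood of the available target data under each source model. All quantities below are illustrative placeholders.

```python
import numpy as np

def e_step_source_relevance(log_likelihoods, lip_prior):
    """E-step sketch: posterior relevance of each source domain given target data.

    log_likelihoods: (num_sources,) summed log-likelihood of the target data
        under each source-specific model.
    lip_prior: (num_sources,) prior relevance weights induced from the
        language description (assumed already normalized).
    """
    log_post = np.log(lip_prior) + log_likelihoods
    log_post -= log_post.max()                 # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Toy example: 3 candidate sources, with the LLM-induced prior favoring source 0.
lip_prior = np.array([0.6, 0.3, 0.1])
log_lik = np.array([-120.0, -118.5, -140.0])   # from only a few target samples
print(e_step_source_relevance(log_lik, lip_prior))
```

With scarce target data the prior dominates the responsibilities; as more target samples arrive, the likelihood term overrides a misleading prior, which matches the consistency property described above.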
Results
The proposed LIP-aided EM framework was validated through three distinct tasks: Gaussian estimation, analysis of the C-MAPSS dataset, and a reinforcement learning task with the MuJoCo hopper. In all cases, the framework demonstrated superior performance compared to traditional domain adaptation methods, particularly when the amount of target data was limited.
Implications
The findings suggest that incorporating expert knowledge through textual descriptions can significantly enhance domain adaptation processes, particularly in scenarios where data is scarce. This approach could be applied in various fields such as healthcare, robotics, and industrial applications, where contextual information is often available but underutilized.
DeepTokenEEG: Enhancing Mild Cognitive Impairment and Alzheimer's Classification via Tokenized EEG Features
Time Series
Efficient ML
- Introduction of DeepTokenEEG, a lightweight model for EEG-based AD classification.
- Utilization of tokenization to enhance feature extraction from EEG signals.
- Achieved 100% accuracy on specific frequency bands, surpassing existing methods.
- Constructed a large-scale dataset for comprehensive benchmarking.
Read more
DeepTokenEEG: Enhancing Mild Cognitive Impairment and Alzheimer's Classification via Tokenized EEG Features
Summary
The paper presents DeepTokenEEG, a novel lightweight deep learning model designed for the classification of Alzheimer's disease (AD) and other neurological conditions using electroencephalogram (EEG) signals. The authors highlight the importance of early detection of AD for improving patient outcomes and address the limitations of traditional diagnostic methods, which are often subjective and time-consuming. DeepTokenEEG employs a spatial and temporal tokenizer to effectively capture AD-related biomarkers in both the temporal and frequency domains, achieving high accuracy with only 0.29 million parameters. The model was trained on a combined dataset of 274 subjects, including 180 AD cases and 94 healthy controls, and demonstrated a maximum accuracy of 100% on specific frequency bands, outperforming state-of-the-art methods by 1.41-15.35%. The study emphasizes the potential of DeepTokenEEG for early detection and screening of AD, making it suitable for deployment in clinical settings due to its compact size and efficiency.
Methodology
The DeepTokenEEG model utilizes a tokenization approach to transform EEG signals into tokens, allowing the model to learn temporal order and long-range dependencies. The model was trained on a dataset comprising various neurological conditions and healthy controls, focusing on capturing AD-related biomarkers effectively.
Results
DeepTokenEEG achieved a maximum accuracy of 100% on specific frequency bands, representing a significant improvement of 1.41-15.35% over existing state-of-the-art methods on the same dataset. The model's lightweight nature, with only 0.29 million parameters, indicates its suitability for edge deployment.
Implications
The findings suggest that DeepTokenEEG could serve as a valuable tool for the early detection and screening of Alzheimer's disease, potentially improving patient outcomes through timely intervention. Its compact design makes it feasible for use in various clinical settings, enhancing accessibility to AD diagnosis.
Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication
Reinforcement Learning
Generative Models
Optimization
- Introduces a joint semantic-physical layer framework for communication systems.
- Develops a learned semantic-aware M-QAM constellation that prioritizes task-relevant symbols.
- Proposes novel metrics (SSV and SPP) to evaluate the protection of semantically important information.
- Demonstrates significant improvements in semantic quality and compression ratios compared to traditional methods.
Read more
Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication
Summary
This paper presents a novel framework for semantic communication that integrates semantic importance into the physical layer constellation design. Traditional communication systems often treat all symbols equally, leading to vulnerabilities in transmitting task-critical information. The authors propose a joint semantic-physical layer framework that includes a vector quantized-variational autoencoder (VQ-VAE) for extracting discrete latent concepts, a semantic criticality indicator (SCI) for scoring the relevance of these concepts, and a deep reinforcement learning (RL) agent for dynamically selecting transmission subsets based on channel conditions. The framework introduces a learned semantic-aware M-QAM constellation that assigns symbol positions based on co-occurrence statistics and SCI scores, moving away from standard uniform spacing and Gray coding. The paper also introduces two novel metrics: semantic symbol vulnerability (SSV) and semantic protection probability (SPP), which quantify the exposure of critical symbols to decoding errors. The authors demonstrate that traditional Gray-coded constellations are suboptimal for non-uniform semantic importance scenarios. Simulation results show that the proposed method achieves near 100% SPP across various modulation orders, significantly outperforming standard constellations in terms of semantic quality and compression efficiency.
Methodology
The authors employ a combination of vector quantized-variational autoencoders (VQ-VAE) for concept extraction, a semantic criticality indicator (SCI) for scoring importance, and a deep reinforcement learning (RL) agent for adaptive transmission selection. The constellation design is optimized using a SCI-weighted loss function, allowing for a direct mapping of semantic concepts to physical symbols.
Results
The proposed constellation design achieves near 100% semantic protection probability (SPP) across modulation orders from 4-QAM to 1024-QAM, compared to 50% for standard Gray-coded constellations. The method also achieves a 21:1 compression ratio with a semantic quality score above 0.9, demonstrating effectiveness across various datasets including MNIST, Fashion-MNIST, and FSDD.
Implications
This work has significant implications for the design of future wireless communication systems, particularly in scenarios where semantic content is more critical than mere bit-level fidelity. It opens avenues for more efficient communication in IoT applications and other domains where task-oriented transmission is essential.
A Systematic Evaluation of Imbalance Handling Methods in Biomedical Binary Classification
Optimization
Multimodal
- IHMs have varying effectiveness based on model complexity and data modality.
- ROS and RW consistently improve performance in complex models.
- RUS and SMOTE generally degrade performance and are not recommended.
- Direct F1-score optimization is beneficial mainly for unstructured data.
Read more
A Systematic Evaluation of Imbalance Handling Methods in Biomedical Binary Classification
Summary
This study systematically evaluates the impact of various imbalance handling methods (IHMs) on predictive performance in biomedical binary classification tasks. The authors investigate five IHMs: random undersampling (RUS), random oversampling (ROS), SMOTE, re-weighting (RW), and direct F1-score optimization (DMO), comparing them against a raw training (RAW) baseline across three biomedical datasets (MIMIC-III, ADE-Corpus-V2, and MURA) representing tabular, text, and image data modalities. The evaluation employs a range of models from classical logistic regression to advanced deep learning architectures like BiLSTM, BERT, DenseNet, and DINOv2. Results indicate that simpler models do not benefit significantly from IHMs, while more complex models show marked improvements, particularly with ROS and RW. DMO is effective mainly for unstructured data. The study concludes that the effectiveness of IHMs is contingent on model complexity and data modality, providing valuable insights for practitioners in selecting appropriate IHMs for diverse biomedical applications.
Methodology
The authors evaluated five IHMs across three biomedical datasets using various models of differing complexity. They compared the performance of these methods against a RAW baseline, utilizing logistic regression, random forest, and deep learning models for tabular, text, and image data.
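The data-level and re-weighting baselines have standard open-source equivalents, so a rough sense of the compared methods can be given with scikit-learn and imbalanced-learn. This is illustrative usage on synthetic data, not the authors' pipeline; resampling should be applied to the training split only.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.05).astype(int)      # roughly 5% positives

# Data-level methods: resample the training split only.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_sm,  y_sm  = SMOTE(random_state=0).fit_resample(X, y)

# Algorithm-level re-weighting: keep the data, reweight the loss instead.
clf_rw = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```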
Results
The study found that simpler models like logistic regression did not show significant improvements with IHMs compared to the RAW baseline. In contrast, complex models benefited from ROS and RW, while RUS and SMOTE negatively impacted performance. DMO was particularly useful for unstructured text and image data.
Implications
The findings suggest that practitioners in biomedical fields should carefully select IHMs based on the complexity of their models and the nature of their data. This research aids in optimizing predictive performance in imbalanced classification scenarios, which is crucial for clinical decision-making.
EMO: Frustratingly Easy Progressive Training of Extendable MoE
Large Language Models
Efficient ML
- EMO allows for progressive expansion of the expert pool during training, improving efficiency.
- The framework is based on a sparsity scaling law that optimizes token allocation across training stages.
- EMO matches or exceeds the performance of fixed-expert models while reducing training time and costs.
- The approach leverages the principle that MoE capacity should grow with data availability.
Read more
EMO: Frustratingly Easy Progressive Training of Extendable MoE
Summary
The paper introduces EMO, a progressive training framework for Sparse Mixture-of-Experts (MoE) models that addresses the inefficiencies associated with training large expert pools from the outset. The authors argue that the traditional approach of allocating a large number of experts at the beginning of training leads to increased memory and communication costs, which can hinder training efficiency. EMO proposes a method to incrementally expand the expert pool as training progresses, treating MoE capacity as expandable memory. This approach is grounded in a sparsity scaling law that helps determine optimal token budgets for each stage of training, allowing for efficient utilization of compute resources. The authors validate EMO through large-scale experiments, demonstrating that it achieves comparable performance to fixed-expert setups while significantly improving wall-clock efficiency and reducing GPU costs.
Methodology
The authors developed EMO by starting with a smaller dense model and progressively expanding it into a larger MoE model through multiple stages. They conducted scaling-law experiments to determine the optimal allocation of tokens for each stage, ensuring that the model effectively utilizes its capacity as training data increases. This involved calibrating the expert count and adjusting the training schedule based on the predicted performance at each stage.
Results
In experiments, EMO transitioned a 1.1B dense model into a 9.6B MoE model with 128 experts over five stages, achieving a final pretraining loss of 1.017, which is competitive with a fixed-expert baseline of 0.994. Additionally, EMO saved 10% in GPU hours compared to the fixed-expert setup, demonstrating its efficiency. Downstream evaluations across various benchmarks showed that EMO outperformed a fixed-expert model with 64 experts while remaining comparable to the 128-expert baseline.
Implications
The EMO framework has significant implications for the training of large-scale MoE models, providing a more efficient method to leverage expert capacity as data scales. This could lead to advancements in various applications requiring large language models, optimizing resource usage and potentially enabling the development of even larger models without proportional increases in training costs.
Spectral Energy Centroid: a Metric for Improving Performance and Analyzing Spectral Bias in Implicit Neural Representations
Computer Vision
Generative Models
Theory
- Introduces the Spectral Energy Centroid (SEC) as a metric for analyzing spectral bias in INRs.
- Proposes a data-driven hyperparameter selection strategy (SEC-Conf) that outperforms existing methods.
- Demonstrates that SEC serves as a reliable proxy for signal complexity and reconstruction quality.
- Reveals the significant impact of model depth on spectral bias and INR performance.
Read more
Spectral Energy Centroid: a Metric for Improving Performance and Analyzing Spectral Bias in Implicit Neural Representations
Summary
This paper addresses the challenges associated with Implicit Neural Representations (INRs) in modeling continuous signals, particularly focusing on the low-frequency bias inherent in multilayer perceptrons (MLPs). The authors introduce the Spectral Energy Centroid (SEC) metric, which quantifies the frequency characteristics of target images and the spectral bias of INR models. They demonstrate that SEC can be utilized effectively for hyperparameter selection, serving as a reliable proxy for signal complexity and enabling the alignment of spectral biases across different INR architectures. The study reveals that existing methods, such as FreSh, do not adequately account for the influence of model depth on performance, leading to suboptimal results. By employing SEC, the authors propose a data-driven strategy (SEC-Conf) that outperforms traditional heuristics and adapts well to varying model depths. The findings indicate a strong correlation between SEC and reconstruction quality, highlighting the importance of spectral bias in INR performance. Overall, the paper contributes to a deeper understanding of the relationship between frequency content and INR capabilities, providing practical tools for improving INR performance across diverse applications.
Methodology
The authors utilize the Spectral Energy Centroid (SEC) metric to analyze the frequency characteristics of target images and the spectral bias of INR models. They conduct experiments to validate the effectiveness of SEC in hyperparameter selection and performance alignment across various INR architectures, comparing it against existing methods like FreSh.
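The summary does not give the exact SEC formula; one natural instantiation is the energy-weighted mean radial frequency of an image's power spectrum, sketched below. The normalization and the toy images are assumptions for illustration, and the paper's definition may differ in detail.

```python
import numpy as np

def spectral_energy_centroid(image):
    """Energy-weighted mean radial frequency of an image's power spectrum."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)    # radial frequency grid
    return float((radius * spectrum).sum() / spectrum.sum())

# A smooth (low-frequency-heavy) image yields a lower centroid than broadband noise.
rng = np.random.default_rng(0)
smooth = rng.normal(size=(64, 64)).cumsum(axis=0).cumsum(axis=1)
noise = rng.normal(size=(64, 64))
print(spectral_energy_centroid(smooth), spectral_energy_centroid(noise))
```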
Results
The study shows that the SEC metric is a versatile tool for INR analysis, leading to improved hyperparameter selection (SEC-Conf) that is robust to model depth. The results indicate a strong correlation between SEC values and the quality of signal reconstruction, confirming its utility as a proxy for signal complexity. Additionally, the authors demonstrate that aligning spectral biases can enhance the performance of older models to match that of newer architectures.
Implications
The findings have significant implications for the design and training of INRs in various applications, including scene modeling, robotics, and generative tasks. By providing a systematic approach to hyperparameter tuning and spectral bias alignment, this research can enhance the fidelity and efficiency of neural representations in practical scenarios.
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Interpretability
Time Series
Multimodal
- Introduces a unified framework for interpreting EEG transformers using Sparse Autoencoders.
- Proposes a clinical semanticity taxonomy to audit encoder representations.
- Develops a selectivity metric for evaluating model interventions and their effects.
- Demonstrates the ability to translate latent manipulations into interpretable physiological features.
Read more
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Summary
This paper addresses the challenge of mechanistic interpretability in EEG foundation models, which, despite their state-of-the-art clinical performance, lack transparency in their internal computations. The authors propose a novel framework utilizing TopK Sparse Autoencoders (SAEs) to extract interpretable features from three distinct EEG transformer architectures: SleepFM, REVE, and LaBraM. By grounding these features in a clinical taxonomy that includes factors such as abnormality, age, sex, and medication, the study benchmarks the monosemanticity and entanglement of the features across architectures. The authors introduce a robust hyperparameter selection process that is applicable across all models, and a new metric for quantifying steering selectivity, revealing critical representational failures and operational regimes of the models. The framework also includes a spectral decoder that translates latent manipulations into interpretable frequency signatures, providing insights into the physiological relevance of model predictions. This work not only enhances the interpretability of EEG models but also establishes a foundation for clinical trust in their applications.
Methodology
The methodology involves applying TopK Sparse Autoencoders to extract sparse feature dictionaries from the embeddings of three EEG transformer architectures. The authors utilize a single hyperparameter selection procedure that is robust across architectures, and they employ concept steering to quantify steering selectivity. Additionally, a spectral decoder is used to map interventions back to the amplitude spectrum for physiological interpretation.
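A minimal TopK sparse autoencoder over pooled EEG embeddings looks roughly like the PyTorch module below: only the k largest encoder pre-activations per example are kept before decoding. Dimensions, the ReLU on the kept values, and the reconstruction loss are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Minimal TopK sparse autoencoder: keep the k largest pre-activations
    per example, zero the rest, then reconstruct the input embedding."""
    def __init__(self, d_model, d_dict, k):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        pre = self.encoder(x)                               # (batch, d_dict)
        topk = torch.topk(pre, self.k, dim=-1)
        codes = torch.zeros_like(pre).scatter(-1, topk.indices, torch.relu(topk.values))
        recon = self.decoder(codes)
        return recon, codes

# Usage: dictionary 8x wider than the embedding, 32 active features per embedding.
sae = TopKSparseAutoencoder(d_model=256, d_dict=2048, k=32)
emb = torch.randn(16, 256)                                  # e.g. pooled EEG epoch embeddings
recon, codes = sae(emb)
loss = ((recon - emb) ** 2).mean()
```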
Results
The study successfully demonstrates that the proposed framework can effectively extract interpretable features from EEG models, revealing critical insights into model behavior and representational failures. The authors identify three operational regimes of the models and provide evidence of clinical entanglements, such as age-pathology confounding. The spectral decoder translates latent manipulations into interpretable frequency signatures, validating the physiological relevance of the model's predictions.
Implications
The findings have significant implications for the clinical application of EEG foundation models, enhancing their interpretability and trustworthiness. By providing a clearer understanding of how these models operate, the research paves the way for more reliable use in clinical settings, potentially improving patient outcomes in areas such as sleep staging and pathology detection.
Modeling Heterophily in Multiplex Graphs: An Adaptive Approach for Node Classification
Graph Learning
- HAAM explicitly models both homophilic and heterophilic interactions in multiplex graphs.
- The use of dimension-specific compatibility matrices allows for tailored representation learning.
- Product-composed Chebyshev filters enable the model to capture non-linear interactions effectively.
- The framework improves node classification performance compared to existing methods.
Read more
Modeling Heterophily in Multiplex Graphs: An Adaptive Approach for Node Classification
Summary
This paper addresses the limitations of existing multiplex graph models that primarily assume homophily, where connected nodes share similar attributes or classes. The authors introduce HAAM (Heterophily-Aware Adaptive Multiplex model), a novel framework designed for node classification in multiplex graphs that accommodates both homophilic and heterophilic interactions. HAAM employs dimension-specific compatibility matrices to capture varying levels of homophily and heterophily across different graph dimensions. A significant innovation of HAAM is the use of a product of trainable low-pass and high-pass Chebyshev filters to effectively model both smooth and abrupt changes in graph signals. This allows the model to adaptively adjust to the heterophilic characteristics of each dimension. The training process utilizes a proximal-gradient optimization method to refine label predictions while promoting sparsity in the consensus predictions. The experimental results demonstrate that HAAM outperforms state-of-the-art methods in node classification tasks on both synthetic and real-world datasets, showcasing its ability to effectively capture the complex interplay of interactions in multiplex graphs.
Methodology
The authors propose HAAM, which utilizes learnable compatibility matrices to model varying degrees of homophily and heterophily across dimensions. The model incorporates a product of low-pass and high-pass Chebyshev filters to capture different frequency components of graph signals. Training is conducted using a dual loss function that includes cross-entropy loss and a divergence minimization term, optimized through a proximal-gradient method.
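One reading of the product-composed filters is sketched below: node features are filtered with a low-pass and a high-pass Chebyshev polynomial of the normalized Laplacian, and the two responses are multiplied elementwise. The coefficients, toy graph, and eigenvalue rescaling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def chebyshev_basis(L_norm, X, order):
    """Chebyshev polynomials T_0(L~)X ... T_order(L~)X of the rescaled Laplacian."""
    n = L_norm.shape[0]
    L_tilde = L_norm - np.eye(n)            # rescale eigenvalues from [0, 2] to [-1, 1]
    Tx = [X, L_tilde @ X]
    for _ in range(2, order + 1):
        Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])
    return Tx

def product_filter(L_norm, X, low_coeffs, high_coeffs):
    """Apply a low-pass and a high-pass Chebyshev filter, then take the
    elementwise product of the two filtered signals."""
    Tx = chebyshev_basis(L_norm, X, max(len(low_coeffs), len(high_coeffs)) - 1)
    low = sum(c * t for c, t in zip(low_coeffs, Tx))
    high = sum(c * t for c, t in zip(high_coeffs, Tx))
    return low * high

# Toy graph: a 4-node path, its symmetric normalized Laplacian, 2-dim node features.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_norm = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt
X = np.random.default_rng(0).normal(size=(4, 2))
out = product_filter(L_norm, X, low_coeffs=[1.0, -0.5], high_coeffs=[0.0, 0.5])
```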
Results
Extensive experiments on synthetic and real-world datasets indicate that HAAM significantly improves node classification performance compared to state-of-the-art methods, effectively capturing the complexities of multiplex graphs with both homophilic and heterophilic interactions.
Implications
HAAM's approach can be applied to various domains such as social networks, biological systems, and recommendation systems, where understanding the interplay of different types of interactions is crucial for accurate predictions and insights.
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
NLP
Large Language Models
Efficient ML
- QAOD introduces a geometric approach to hallucination detection by decoupling question and answer representations.
- The framework utilizes Fisher scoring for efficient selection of informative layers and neurons.
- QAOD achieves superior performance in both in-domain and cross-domain settings with a single inference pass.
- The joint probing strategy enhances in-domain discriminability, while the orthogonal-only probe excels in OOD scenarios.
Read more
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
Summary
This paper addresses the challenge of hallucination detection in large language models (LLMs), which often produce factually incorrect or fabricated content. Traditional methods for detecting hallucinations either rely on black-box consistency checks that require multiple inferences, leading to high computational costs, or on single-pass white-box probes that may struggle with domain shifts. The authors propose a novel framework called QAOD (Question-Answer Orthogonal Decomposition) that efficiently detects hallucinations by projecting the answer representation away from the question-aligned direction, thus isolating the question-orthogonal component that is less sensitive to domain variations. This approach includes a layer and neuron selection mechanism based on Fisher scoring to identify the most informative signals. The framework features two probing strategies: a joint probe that combines the orthogonal component with question context for improved in-domain performance and an orthogonal-only probe that maintains domain-agnostic factuality signals for better out-of-domain (OOD) generalization. The results demonstrate that QAOD's joint probe achieves the highest in-domain AUROC across various model-dataset pairs, while the orthogonal-only probe significantly outperforms existing white-box methods, achieving up to a 21% improvement on the BioASQ dataset at minimal generation cost.
Methodology
QAOD employs a two-branch framework: an offline branch for identifying informative layers and neurons during training, and an online branch for single-pass hallucination detection during testing. It utilizes geometric decomposition to separate the question-aligned component from answer representations, enhancing robustness against domain shifts. Fisher-based scoring is used to select the most discriminative features without iterative optimization.
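A rough sketch of the two core operations, assuming question and answer hidden states have already been pooled into vectors from a chosen layer; the function names and the binary-label Fisher score below are illustrative choices, not the paper's code.

```python
import numpy as np

def orthogonal_decompose(q, a, eps=1e-8):
    """Split the answer representation `a` into its component along the
    question direction `q` and the question-orthogonal residual that the
    QAOD probes operate on."""
    q_hat = q / (np.linalg.norm(q) + eps)
    aligned = (a @ q_hat) * q_hat        # question-aligned component
    return aligned, a - aligned          # (aligned, orthogonal)

def fisher_score(feats, labels, eps=1e-8):
    """Per-dimension Fisher score (squared class-mean gap over pooled
    within-class variance) for ranking layers/neurons by how well they
    separate hallucinated from faithful answers (binary labels)."""
    pos, neg = feats[labels == 1], feats[labels == 0]
    return (pos.mean(0) - neg.mean(0)) ** 2 / (pos.var(0) + neg.var(0) + eps)
```

The joint probe would then be a linear classifier over the question vector concatenated with the orthogonal residual, while the orthogonal-only probe uses the residual alone for out-of-domain robustness.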
Results
QAOD's joint probe achieved the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe surpassed the best white-box baseline by up to 21% on the BioASQ dataset. Both probes operate efficiently in a single forward pass, maintaining low computational costs.
Implications
The proposed QAOD framework has significant implications for the deployment of LLMs in high-stakes applications, such as healthcare and legal services, where the accuracy of generated content is critical. By improving hallucination detection, QAOD enhances the reliability and safety of LLMs in real-world scenarios.
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
NLP
Large Language Models
Multimodal
- Introduces a history-free approach to gradient orthogonalization for continual learning.
- Decouples task adaptation from regularization to enhance model performance.
- Achieves state-of-the-art results on the UCIT benchmark, outperforming previous methods.
- Addresses privacy and storage concerns associated with rehearsal-based methods.
Read more
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Summary
The paper introduces Octopus, a novel framework for continual learning in multimodal large language models (MLLMs) that addresses the challenges of catastrophic forgetting without relying on historical data. Existing methods, including architecture-based, rehearsal-based, and regularization-based approaches, face limitations such as computational overhead, privacy concerns, and insufficient mitigation of parameter interference. Octopus employs a two-stage continual learning strategy based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient orthogonality while utilizing only past weights and current task data. This decoupling of task adaptation from regularization allows for a better balance between model plasticity and stability. Experimental results on the UCIT benchmark demonstrate that Octopus achieves state-of-the-art performance, surpassing previous methods by significant margins, thus effectively preserving previously learned knowledge while integrating new tasks.
Methodology
The Octopus framework utilizes a two-stage finetuning strategy that incorporates History-Free Gradient Orthogonalization (HiFGO). This method enforces gradient orthogonality by leveraging past model weights and current task data, avoiding the need for historical task data. The two-stage approach allows for effective regularization while maintaining task-specific adaptations.
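As a loose, generic illustration of history-free gradient surgery (not the exact HiFGO update; the two-stage schedule and its regularization terms are not reproduced here), one can project the current-task gradient away from the dominant directions of the accumulated past weight change:

```python
import torch

def orthogonalize_gradient(grad, past_delta, rank=16):
    """Remove from `grad` the directions spanned by the top-`rank` left singular
    vectors of `past_delta` (e.g. W_previous - W_initial), so the new update
    interferes less with what earlier tasks wrote into the weights. Only past
    weights and the current gradient are needed; no stored task data."""
    U, _, _ = torch.linalg.svd(past_delta, full_matrices=False)
    U_k = U[:, :rank]
    return grad - U_k @ (U_k.T @ grad)
```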
Results
Octopus achieved state-of-the-art performance on the UCIT benchmark, surpassing previous methods by 2.14% in average performance and 6.82% in last-task performance metrics, indicating its effectiveness in mitigating catastrophic forgetting while learning new tasks.
Implications
The proposed framework has significant implications for the development of more efficient and privacy-preserving continual learning systems in MLLMs, enabling them to adapt to new tasks without compromising previously acquired knowledge. This could enhance applications in various domains requiring incremental learning capabilities.
A Unified Geometric Framework for Weighted Contrastive Learning
Theory
- Weighted InfoNCE objectives can be viewed as Distance Geometry Problems, linking the weighting scheme to target geometry.
- SupCon and Soft SupCon collapse class samples to prototypes differently under class imbalance, affecting inter-class similarities.
- y-Aware CL struggles to reach its entropic optimum due to inconsistencies between label-space geometry and latent-space similarity.
- The framework offers practical guidance for designing contrastive learning objectives by aligning geometry in weightings and embeddings.
Read more
A Unified Geometric Framework for Weighted Contrastive Learning
Summary
This paper presents a unified geometric framework for understanding weighted contrastive learning (CL) by interpreting weighted InfoNCE objectives as Distance Geometry Problems (DGP). The authors demonstrate that the weighting scheme in contrastive learning defines a target geometry that the learned representations aim to realize. They provide exact characterizations of optimal embeddings for various supervised and weakly supervised objectives, revealing how class imbalance affects the geometry of learned representations. Specifically, they show that SupCon collapses samples within each class to a prototype but is sensitive to class sizes, while Soft SupCon maintains a regular simplex geometry regardless of imbalance. In continuous-label scenarios, the framework highlights that y-Aware CL often fails to achieve its optimal configuration unless labels are on a hypersphere, indicating a mismatch between Euclidean label weights and spherical latent similarity. The authors introduce metrics for evaluating convergence to predicted optima and validate their theoretical findings through experiments, providing a principled approach for designing contrastive objectives.
Methodology
The authors analyze weighted contrastive learning through a geometric lens, framing the learning of representations as a problem of finding geometric realizations of target structures defined by weighting schemes. They derive theoretical results regarding the behavior of different contrastive objectives and introduce new metrics for evaluating convergence and optimality.
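The unifying object in this analysis is a weighted InfoNCE loss in which a pairwise weight matrix encodes the target geometry. A minimal PyTorch sketch, assuming L2-normalized embeddings and a nonnegative weight matrix with zero diagonal (SupCon, Soft SupCon, and y-Aware CL differ only in how that matrix is built):

```python
import torch
import torch.nn.functional as F

def weighted_infonce(z, w, temperature=0.1):
    """z: (n, d) L2-normalized embeddings; w: (n, n) nonnegative pair weights
    with zero diagonal. Each row of w is normalized so the loss is a weighted
    cross-entropy over the similarity softmax."""
    sim = (z @ z.T) / temperature
    sim.fill_diagonal_(float('-inf'))                  # drop self-similarity
    log_prob = F.log_softmax(sim, dim=1)
    w_norm = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return -(w_norm * log_prob).sum(dim=1).mean()
```

In the paper's framing, the choice of w fixes the distance geometry that optimal embeddings must realize, which is why class imbalance reshapes SupCon's optima but leaves Soft SupCon's simplex structure intact.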
Results
The study reveals that the choice of weighting scheme significantly influences the realizability and uniqueness of contrastive learning embeddings. SupCon leads to class-size-dependent geometries, while Soft SupCon maintains a consistent structure. In continuous-label settings, y-Aware CL is shown to be generally inconsistent, while geometrically consistent formulations like X-CLR yield unique optimal embeddings.
Implications
This framework provides insights into the design of contrastive learning objectives, emphasizing the importance of matching the geometry induced by weights with the embedding space. It can guide future research in improving representation learning techniques and understanding the underlying geometry of learned embeddings.
A Unified Three-Stage Machine Learning Framework for Diabetes Detection, Subtype Discrimination, and Cognitive-Metabolic Hypothesis Testing
Interpretability
- Introduces a three-stage framework for diabetes detection and subtype discrimination.
- Achieves high performance metrics with SVM-RBF and Logistic Regression on diabetes prediction.
- Utilizes unsupervised K-Means clustering to identify diabetes subtypes without ground-truth labels.
- Demonstrates a significant association between glycaemic control and cognitive function.
Read more
A Unified Three-Stage Machine Learning Framework for Diabetes Detection, Subtype Discrimination, and Cognitive-Metabolic Hypothesis Testing
Summary
This paper presents a novel three-stage machine learning framework aimed at improving diabetes detection, subtype discrimination, and exploring cognitive-metabolic associations. The authors identify significant gaps in existing machine learning approaches to diabetes prediction, particularly the lack of subtype discrimination and comprehensive evaluation metrics. In Stage 1, five supervised classifiers, including SVM-RBF and Logistic Regression, are benchmarked on the NCSU Diabetes Dataset, achieving a maximum ROC-AUC of 0.825. Stage 2 employs silhouette-validated K-Means clustering to identify diabetes subtypes without relying on ground-truth labels. In Stage 3, the authors conduct a statistical analysis using the Ohio Longitudinal Cognitive Dataset, revealing a significant positive correlation between glycaemic control and cognitive function. The framework emphasizes reproducibility, with all code and methodologies made publicly available, enhancing the potential for clinical application and decision support.
Methodology
The methodology consists of three stages: (1) Benchmarking five supervised classifiers and a stacking ensemble on the NCSU Diabetes Dataset using stratified five-fold cross-validation, (2) Applying silhouette-validated K-Means clustering to identify diabetes subtypes, and (3) Conducting statistical analysis on the Ohio Longitudinal Cognitive Dataset to test the association between glycaemic control and cognitive function.
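Stage 2 can be sketched with standard scikit-learn components; the feature matrix, candidate cluster range, and seed below are illustrative rather than the study's exact configuration:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_validated_kmeans(X, k_range=range(2, 9), seed=0):
    """Standardize features, fit K-Means for each candidate k, and keep the
    clustering with the highest silhouette score, as a stand-in for the
    study's silhouette-validated subtype discovery."""
    X_std = StandardScaler().fit_transform(X)
    best = (None, -1.0, None)                        # (k, score, labels)
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_std)
        score = silhouette_score(X_std, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best
```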
Results
The study reports that SVM-RBF and Logistic Regression achieved the highest ROC-AUC of 0.825, while Random Forest had the highest accuracy of 0.762. The K-Means clustering identified clinically plausible diabetes subtypes, and the statistical analysis revealed a significant positive correlation (ρₛ = 0.208, p = 5.29 × 10⁻⁵) between glycaemic control and cognitive function.
Implications
The findings suggest that the proposed framework can enhance diabetes detection and management by providing subtype-specific insights and cognitive risk assessments, which could be integrated into clinical decision support systems.
Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning
NLP
Large Language Models
Efficient ML
- Slice is a new initialization method for LoRA adapters that mitigates catastrophic forgetting in continual learning.
- The method uses gradient surgery to align current task objectives with previously learned knowledge.
- Slice outperforms existing methods (vanilla LoRA, LoRA-GA, LoRAM) in terms of stability and performance metrics.
- The paper introduces adversarial task sequences to better evaluate the performance of continual learning methods.
Read more
Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning
Summary
The paper addresses the challenge of catastrophic forgetting in continual learning (CL) when using Low-Rank Adaptation (LoRA) for fine-tuning large language models (LLMs). The authors introduce a novel method called Slice, which employs gradient surgery to initialize LoRA adapters in a way that minimizes interference with previously learned tasks. Slice accumulates gradients from both the current task and a replay buffer of prior tasks, reconciles them through a projection operator, and uses truncated Singular Value Decomposition (SVD) to set the adapter weights. The method is evaluated against existing initialization techniques on the TRACE benchmark and adversarial Super-NI task sequences, demonstrating that Slice significantly improves stability and reduces forgetting while maintaining general performance. The findings indicate that the initialization of adapters plays a crucial role in balancing the trade-off between stability and plasticity in continual learning scenarios.
Methodology
The authors propose Slice, which initializes LoRA adapters by accumulating gradients from the current task and a replay buffer of past tasks. They reconcile these gradients using a projection operator and apply truncated SVD to derive the adapter weights. The method is tested on the TRACE benchmark and adversarial Super-NI sequences, comparing performance against baseline initialization methods.
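A compressed sketch of that pipeline for a single weight matrix, assuming one accumulated current-task gradient and one replay-buffer gradient are available; the PCGrad-style conflict projection and the square-root split of singular values are simplifications of the paper's reconciliation operator and scaling:

```python
import torch

def slice_style_init(g_task, g_replay, rank):
    """Reconcile the current-task gradient with the replay gradient, then factor
    the result with truncated SVD to initialize a LoRA pair (A, B) with
    delta_W ≈ B @ A."""
    g_t, g_r = g_task.flatten(), g_replay.flatten()
    dot = torch.dot(g_t, g_r)
    if dot < 0:                                     # remove the conflicting component
        g_t = g_t - (dot / (g_r.norm() ** 2 + 1e-12)) * g_r
    g = g_t.view_as(g_task)
    U, S, Vh = torch.linalg.svd(g, full_matrices=False)
    B = U[:, :rank] * S[:rank].sqrt()               # (d_out, rank)
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]    # (rank, d_in)
    return A, B
```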
Results
Slice consistently achieves better stability-plasticity trade-offs than baseline methods, improving Average Performance, Final Performance, and Forgetting metrics while preserving General Performance and In-Context Performance across both standard and adversarial continual learning sequences.

Implications
The proposed method has significant implications for the deployment of LLMs in dynamic environments where continual adaptation is necessary. By effectively addressing catastrophic forgetting, Slice can enhance the performance of models in real-world applications that require ongoing learning from new data.
Bayesian Model Merging
Optimization
Efficient ML
Computer Vision
- BMM leverages strong anchor models to improve the merging process.
- The framework employs bi-level optimization for effective hyperparameter tuning.
- A data-free variant of BMM allows for regression without auxiliary data.
- BMM shows significant performance improvements over existing model merging techniques.
Read more
Bayesian Model Merging
Summary
The paper introduces Bayesian Model Merging (BMM), a novel framework for combining multiple task-specific expert models into a single model without the need for joint retraining. This approach addresses two significant limitations of existing model merging techniques: the underutilization of strong anchor models and the reliance on a shared hyperparameter setting across different modules. BMM employs a bi-level optimization strategy, where the inner level formulates model merging as an activation-based Bayesian regression using a strong prior from an anchor model, resulting in an efficient closed-form solution. The outer level utilizes Bayesian optimization to globally search for module-specific hyperparameters based on a small validation set. Additionally, the authors demonstrate a crucial alignment between activation statistics and task vectors, allowing for a data-free variant of BMM that estimates the Gram matrix for regression without auxiliary data. Extensive experiments across various benchmarks in vision and language show that BMM consistently outperforms existing plug-and-play anchor baselines, achieving near-optimal performance with a single merged model.
Methodology
BMM is structured as a bi-level optimization framework. The inner optimization formulates model merging as an activation-based Bayesian regression, utilizing a strong prior from an anchor model to derive a closed-form solution. The outer optimization employs Bayesian optimization to search for module-specific hyperparameters, accommodating the heterogeneity of different modules in the network.
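For a single linear layer, that inner closed form is essentially a ridge-style regression toward the anchor. A sketch under stated assumptions (per-expert input Gram matrices are available, either from calibration activations or from the data-free estimate; the function and argument names are illustrative):

```python
import numpy as np

def bayesian_merge_layer(expert_weights, anchor, grams, lam=1.0):
    """Solve for the single weight W that best reproduces each expert's
    activations, with a Gaussian prior centred on the anchor weights:
        W = (sum_i G_i + lam*I)^(-1) (sum_i G_i W_i + lam*W_anchor),
    where G_i = X_i^T X_i. `lam` is the kind of module-specific hyperparameter
    the outer Bayesian-optimization loop would tune on a small validation set."""
    d_in = anchor.shape[0]
    lhs = lam * np.eye(d_in)
    rhs = lam * anchor
    for W_i, G_i in zip(expert_weights, grams):
        lhs += G_i
        rhs += G_i @ W_i
    return np.linalg.solve(lhs, rhs)                # merged W, shape (d_in, d_out)
```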
Results
BMM was tested on extensive benchmarks, including up to 20-task merging in vision and 5-task merging in language. On the ViT-L/14 benchmark for 8-task merging, BMM achieved a performance of 95.1%, closely matching the average performance of eight task-specific experts (95.8%). BMM consistently outperformed all plug-and-play anchor baselines, with relative gains of up to 27% on weaker anchors.
Implications
The proposed BMM framework has significant implications for efficient model deployment in scenarios with limited data access or computational resources. It provides a practical solution for integrating multiple expert models, reducing operational overhead while maintaining high performance across various tasks.
Rethinking Molecular OOD Generalization via Target-Aware Source Selection
Optimization
Graph Learning
Reinforcement Learning
- Introduction of SCOPE-BENCH, a rigorous OOD evaluation benchmark that mitigates evaluation biases.
- Development of POMA, a policy-guided framework that enhances knowledge transfer and reduces negative transfer.
- Demonstration of significant performance degradation of existing models under stricter OOD conditions.
- Achievement of up to an 11.2% reduction in mean absolute error across diverse architectures.
Read more
Rethinking Molecular OOD Generalization via Target-Aware Source Selection
Summary
This paper addresses the challenge of robust prediction of molecular properties under extreme out-of-distribution (OOD) scenarios, which is critical for AI-driven drug discovery. The authors critique existing scaffold-splitting protocols for failing to prevent semantic overlap, leading to models that overestimate their extrapolation capabilities. They propose a new benchmark, SCOPE-BENCH, which evaluates OOD performance based on cluster-level partitioning in a physicochemical descriptor space, thus eliminating hidden structural interpolation. Additionally, they introduce a novel framework called Policy Optimization for Multi-source Adaptation (POMA), which formulates knowledge transfer as a retrieve–compose–adapt pipeline. POMA identifies labeled source scaffolds that are structurally close to the unlabeled target and uses a reinforcement learning policy to select the optimal source subset from a large candidate pool. The framework also employs dual-scale domain adaptation to align both macroscopic topological and microscopic pharmacophore scales. Experimental results demonstrate that state-of-the-art 3D molecular models experience significant degradation in performance under the new benchmark, while POMA achieves a notable reduction in mean absolute error across various backbone architectures, validating its effectiveness in enhancing OOD generalization.
Methodology
The authors propose SCOPE-BENCH for OOD evaluation based on physicochemical clustering, and POMA, which utilizes a reinforcement learning policy to select optimal source domains for adaptation. The dual-scale domain adaptation approach aligns both macro and micro features independently to preserve chemical semantics.
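The benchmark's core idea, holding out entire descriptor-space clusters rather than individual scaffolds, can be sketched as follows; the descriptor matrix, cluster count, and held-out fraction are illustrative, and the exact SCOPE-BENCH protocol may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_level_ood_split(descriptors, n_clusters=10, held_out=2, seed=0):
    """Cluster molecules in a standardized physicochemical descriptor space and
    hold out whole clusters as the test set, so no test molecule has a close
    neighbour in training (preventing hidden structural interpolation)."""
    Z = StandardScaler().fit_transform(descriptors)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    rng = np.random.default_rng(seed)
    test_clusters = rng.choice(n_clusters, size=held_out, replace=False)
    test_mask = np.isin(labels, test_clusters)
    return np.where(~test_mask)[0], np.where(test_mask)[0]   # train, test indices
```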
Results
The evaluation shows that existing models degrade by up to 8.0× with a mean degradation of 5.9× under SCOPE-BENCH. POMA achieves an 11.2% reduction in mean absolute error with an average relative improvement of 6.2% across various backbone architectures.
Implications
The findings suggest that careful selection of source domains is crucial for improving OOD generalization in molecular property prediction, which could significantly impact AI-driven drug discovery processes.
A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models
Large Language Models
Efficient ML
Optimization
- Introduces Scaled Outer Product (SOP) for efficient post-training quantization of LLMs.
- Achieves near-lossless fidelity at 4.5–6 bits per weight with lower reconstruction error than conventional methods.
- Utilizes a hardware-efficient LUT output format to enhance performance and reduce costs.
- Employs a flexible, per-layer optimization approach tailored to individual model characteristics.
Read more
A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models
Summary
The paper presents the Scaled Outer Product (SOP), a novel post-training quantization methodology tailored for large language models (LLMs). SOP aims to achieve near-lossless fidelity at a quantization level of 4.5–6 bits per weight, specifically optimized for hardware that utilizes per-layer lookup table (LUT) decoding. The methodology integrates several innovative techniques, including a per-layer search for optimal fixed and dynamic codebook pairs, signed per-block scales, and activation-weighted cosine similarity to enhance performance. It also introduces a hardware-efficient LUT output format (HIF) to reduce energy consumption and cost. The evaluation across six open model families demonstrates that the recommended FP6 operating point achieves lower weight reconstruction error compared to the conventional FP8 baseline while also reducing storage costs. The methodology is flexible, adapting to the unique weight and activation distributions of different models and budgets, and is therefore well placed to keep pace with the evolving practice of low-precision training in LLMs.
Methodology
The SOP methodology employs a combination of flexible block scaling, activation-weighted cosine similarity, per-input-channel importance weights, and a compact alphabet of fixed and adaptive codebooks. It conducts a per-layer search for optimal codebook pairs and incorporates post-quantization corrections, including outlier extraction and sparse residual adjustments. The approach is designed to adapt to the specific weight and activation statistics of each model, allowing for tailored quantization strategies.
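One plausible reading of the activation-weighted cosine criterion is to score each per-layer codebook candidate by how well the quantized weights preserve the layer's outputs on calibration activations; the sketch below reflects that interpretation, not the paper's exact per-channel weighting:

```python
import numpy as np

def activation_weighted_cosine(W, W_q, X, eps=1e-12):
    """Cosine similarity between the original and quantized layer outputs on
    calibration activations X (rows are samples), averaged over samples. A
    per-layer codebook search would keep the candidate maximizing this score."""
    Y, Y_q = X @ W.T, X @ W_q.T
    num = (Y * Y_q).sum(axis=1)
    den = np.linalg.norm(Y, axis=1) * np.linalg.norm(Y_q, axis=1) + eps
    return float((num / den).mean())
```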
Results
The evaluation of SOP across six open model families indicates that the FP6 operating point (E2M3sUE4M4, 6.5 bpw) results in lower weight reconstruction error compared to the conventional FP8 baseline (E4M3, 8.0 bpw), while also achieving a 1.5 bpw reduction in storage cost. This demonstrates the effectiveness of the SOP methodology in maintaining model fidelity while optimizing resource usage.
Implications
The SOP methodology has significant implications for the deployment of large language models in resource-constrained environments, enabling efficient quantization without substantial loss of performance. It can facilitate the use of LLMs in edge devices and applications where computational resources are limited, thereby broadening the accessibility and applicability of advanced AI technologies.
Fast Rates for Inverse Reinforcement Learning
Reinforcement Learning
Theory
Robotics
- Establishes equivalence between MLE and Min-Max-IRL at population and empirical levels.
- Proves fast convergence rates of O(n⁻¹) for trajectory-level KL divergence and parameter estimation.
- Extends reward identifiability results to general Borel spaces.
- Derives novel results on the derivatives of the soft-optimal value function with respect to reward parameters.
Read more
Fast Rates for Inverse Reinforcement Learning
Summary
This paper presents significant advancements in the field of inverse reinforcement learning (IRL), particularly focusing on entropy-regularized min-max IRL (Min-Max-IRL) within finite-horizon Markov Decision Processes (MDPs) characterized by Borel state and action spaces. The authors establish a structural equivalence between maximum likelihood estimation (MLE) and Min-Max-IRL at both the population and empirical levels, notably under deterministic dynamics. They demonstrate that the Min-Max-IRL loss exhibits pseudo-self-concordance, leading to fast statistical rates of convergence for both trajectory-level Kullback-Leibler (KL) divergence and parameter estimation, achieving a decay rate of O(n⁻¹) with respect to the number of expert trajectories, n. This is a notable improvement over existing IRL methods that typically yield slower rates. Furthermore, the paper extends reward identifiability results to general Borel spaces and provides novel insights into the derivatives of the soft-optimal value function concerning reward parameters. The findings suggest that Min-Max-IRL can effectively recover rewards without requiring exploration assumptions, even in misspecified settings, thus broadening the applicability of IRL in various sequential decision-making problems.
Methodology
The authors utilize theoretical analysis to establish structural and statistical results for Min-Max-IRL. They leverage concepts such as pseudo-self-concordance of the loss function and derive results using convex optimization techniques. The paper also involves extending existing results on reward identifiability and analyzing the derivatives of the optimal value function.
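The entropy-regularized objects under study have a simple tabular analogue. Below is a sketch of the soft (log-sum-exp) backward recursion and the trajectory log-likelihood that the MLE formulation maximizes, written for a finite state-action table rather than the Borel spaces treated in the paper; the names and the tabular setting are illustrative:

```python
import numpy as np

def soft_optimal_policies(R, P, H):
    """Finite-horizon soft value iteration: Q_h(s,a) = R(s,a) + E_P[V_{h+1}(s')],
    V_h(s) = log sum_a exp Q_h(s,a), pi_h(a|s) = exp(Q_h(s,a) - V_h(s)).
    R: (S, A) reward table, P: (S, A, S) transition kernel, H: horizon."""
    policies, V_next = [], np.zeros(R.shape[0])
    for _ in range(H):
        Q = R + P @ V_next                      # (S, A)
        V_next = np.log(np.exp(Q).sum(axis=1))  # soft value (log-sum-exp over actions)
        policies.append(np.exp(Q - V_next[:, None]))
    policies.reverse()                          # index 0 = first time step
    return policies

def trajectory_nll(policies, trajectory):
    """Negative log-likelihood of an expert trajectory [(h, s, a), ...] under the
    soft-optimal policy; minimizing this over reward parameters is the MLE
    objective shown to coincide with entropy-regularized min-max IRL."""
    return -sum(np.log(policies[h][s, a]) for h, s, a in trajectory)
```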
Results
The main results include the equivalence of MLE and Min-Max-IRL, the demonstration of fast rates of convergence for KL divergence and parameter estimation, and the extension of reward identifiability to Borel spaces. The authors also provide novel insights into the derivatives of the soft-optimal value function, marking a significant contribution to the theoretical understanding of IRL.
Implications
The findings have significant implications for the design and implementation of IRL algorithms in practical applications such as robotics, where inferring reward functions from expert demonstrations is crucial. The fast convergence rates and the ability to operate under misspecification without exploration assumptions enhance the robustness and efficiency of IRL methods.