AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
24 papers today · updated every 8 hours · 7 days of history
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
Multimodal
Time Series
Optimization
- GHGbench is the first open dataset and benchmark for joint evaluation of company and building-level carbon emissions.
- Building emissions are structurally harder to predict than company emissions due to additional influencing factors.
- The in-distribution to out-of-distribution performance gap is larger than within-model variations.
- Multimodal remote-sensing embeddings significantly improve prediction accuracy in challenging scenarios.
Summary
GHGbench introduces a comprehensive open dataset and benchmark for predicting greenhouse gas emissions at both company and building levels. The dataset comprises over 32,000 company-year records from more than 12,000 firms, including Scope 1, 2, and 3 emissions disclosures, alongside financial and sectoral signals. Additionally, the building track harmonizes 491,591 building-year records from 13 sources across 26 metropolitan areas, integrating climate covariates and multimodal remote-sensing embeddings. The benchmark establishes canonical task splits for in-distribution and cross-region/city transfer tasks, as well as short-horizon forecasting. Various models, including gradient-boosted trees, tabular foundation models, MLPs, FT-Transformers, and multimodal fusion techniques, were evaluated using multi-seed paired-bootstrap tests. Key findings reveal that building emissions are more complex to predict than company emissions, the gap between in-distribution and out-of-distribution performance is significant, and multimodal embeddings enhance prediction accuracy where traditional tabular methods fail. GHGbench also identifies systematic failure modes, such as catastrophic city transfer, indicating areas for future model improvement. The dataset and evaluation framework aim to facilitate reproducible research and advance carbon emission prediction methodologies.
Methodology
The methodology involves creating a unified dataset from fragmented sources, normalizing identifiers and units, and enriching data with financial and sectoral signals. The benchmark includes a multi-task evaluation suite with canonical splits for various prediction tasks, employing models like gradient-boosted trees, tabular foundation models, and multimodal approaches, all assessed through rigorous statistical testing.
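The paired-bootstrap protocol itself is only summarized above; the sketch below shows the general idea of a paired bootstrap over per-sample errors of two models on a shared test set. All data and names are illustrative, not taken from GHGbench.

```python
import numpy as np

def paired_bootstrap(errors_a, errors_b, n_boot=10_000, seed=0):
    """Paired bootstrap over per-sample errors of two models on the same test set.

    Returns the fraction of bootstrap resamples in which model A has lower
    mean error than model B (a one-sided 'win rate')."""
    rng = np.random.default_rng(seed)
    errors_a, errors_b = np.asarray(errors_a), np.asarray(errors_b)
    n = len(errors_a)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample test points with replacement
        if errors_a[idx].mean() < errors_b[idx].mean():
            wins += 1
    return wins / n_boot

# Toy usage: absolute errors of two models on the same 500 test points.
rng = np.random.default_rng(1)
err_tree = np.abs(rng.normal(0.0, 1.0, 500))
err_tabfm = np.abs(rng.normal(0.0, 0.9, 500))
print(f"P(tabular foundation model beats trees) ≈ {paired_bootstrap(err_tabfm, err_tree):.3f}")
```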
Results
The results indicate that building emissions are harder to predict than company emissions, with a notable performance gap between in-distribution and out-of-distribution settings. A tabular foundation model achieved significant improvements over traditional tuned trees in multi-city building-emission tasks. Additionally, the use of multimodal embeddings provided measurable gains in cross-city transfer scenarios.
Implications
The findings from GHGbench can inform climate policy, finance, and urban operations by providing a robust framework for carbon emission prediction. The dataset can be utilized for developing more accurate predictive models, enhancing transparency in emissions reporting, and guiding strategic decisions towards achieving net-zero emissions.
Bayesian Model Merging
Optimization
Efficient ML
Computer Vision
- BMM leverages strong anchor models to improve the merging process.
- The framework employs bi-level optimization for effective hyperparameter tuning.
- A data-free variant of BMM allows for regression without auxiliary data.
- BMM shows significant performance improvements over existing model merging techniques.
Summary
The paper introduces Bayesian Model Merging (BMM), a novel framework for combining multiple task-specific expert models into a single model without the need for joint retraining. This approach addresses two significant limitations of existing model merging techniques: the underutilization of strong anchor models and the reliance on a shared hyperparameter setting across different modules. BMM employs a bi-level optimization strategy, where the inner level formulates model merging as an activation-based Bayesian regression using a strong prior from an anchor model, resulting in an efficient closed-form solution. The outer level utilizes Bayesian optimization to globally search for module-specific hyperparameters based on a small validation set. Additionally, the authors demonstrate a crucial alignment between activation statistics and task vectors, allowing for a data-free variant of BMM that estimates the Gram matrix for regression without auxiliary data. Extensive experiments across various benchmarks in vision and language show that BMM consistently outperforms existing plug-and-play anchor baselines, achieving near-optimal performance with a single merged model.
Methodology
BMM is structured as a bi-level optimization framework. The inner optimization formulates model merging as an activation-based Bayesian regression, utilizing a strong prior from an anchor model to derive a closed-form solution. The outer optimization employs Bayesian optimization to search for module-specific hyperparameters, accommodating the heterogeneity of different modules in the network.
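The exact inner-level formulation is not reproduced here; the following is a minimal sketch of one plausible reading of "activation-based Bayesian regression with a strong anchor prior," where the anchor weights act as a Gaussian prior mean and the merged weights have a closed-form ridge-style solution. The function name, the single shared lambda (which BMM would instead tune per module via the outer Bayesian optimization), and the toy shapes are all assumptions.

```python
import numpy as np

def merge_module(W_experts, X_acts, W_anchor, lam=1.0):
    """Closed-form, anchor-regularized merge of one linear module (sketch).

    Solves  min_W  sum_t ||X_t W - X_t W_t||_F^2 + lam * ||W - W_anchor||_F^2,
    i.e. activation-based regression with the anchor as a Gaussian prior mean."""
    d = W_anchor.shape[0]
    gram = lam * np.eye(d)        # accumulated X_t^T X_t plus prior precision
    rhs = lam * W_anchor          # accumulated X_t^T X_t W_t plus prior term
    for W_t, X_t in zip(W_experts, X_acts):
        G_t = X_t.T @ X_t
        gram += G_t
        rhs += G_t @ W_t
    return np.linalg.solve(gram, rhs)

# Toy usage: three experts over a 16-dim module, 64 activation samples each.
rng = np.random.default_rng(0)
W_anchor = rng.normal(size=(16, 16))
experts = [W_anchor + 0.1 * rng.normal(size=(16, 16)) for _ in range(3)]
acts = [rng.normal(size=(64, 16)) for _ in range(3)]
W_merged = merge_module(experts, acts, W_anchor, lam=10.0)
print(np.linalg.norm(W_merged - W_anchor))
```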
Results
BMM was tested on extensive benchmarks, including up to 20-task merging in vision and 5-task merging in language. On the ViT-L/14 benchmark for 8-task merging, BMM achieved a performance of 95.1%, closely matching the average performance of eight task-specific experts (95.8%). BMM consistently outperformed all plug-and-play anchor baselines, with relative gains of up to 27% on weaker anchors.
Implications
The proposed BMM framework has significant implications for efficient model deployment in scenarios with limited data access or computational resources. It provides a practical solution for integrating multiple expert models, reducing operational overhead while maintaining high performance across various tasks.
Spectral Energy Centroid: a Metric for Improving Performance and Analyzing Spectral Bias in Implicit Neural Representations
Computer Vision
Generative Models
Theory
- Introduces the Spectral Energy Centroid (SEC) as a metric for analyzing spectral bias in INRs.
- Proposes a data-driven hyperparameter selection strategy (SEC-Conf) that outperforms existing methods.
- Demonstrates that SEC serves as a reliable proxy for signal complexity and reconstruction quality.
- Reveals the significant impact of model depth on spectral bias and INR performance.
Summary
This paper addresses the challenges associated with Implicit Neural Representations (INRs) in modeling continuous signals, particularly focusing on the low-frequency bias inherent in multilayer perceptrons (MLPs). The authors introduce the Spectral Energy Centroid (SEC) metric, which quantifies the frequency characteristics of target images and the spectral bias of INR models. They demonstrate that SEC can be utilized effectively for hyperparameter selection, serving as a reliable proxy for signal complexity and enabling the alignment of spectral biases across different INR architectures. The study reveals that existing methods, such as FreSh, do not adequately account for the influence of model depth on performance, leading to suboptimal results. By employing SEC, the authors propose a data-driven strategy (SEC-Conf) that outperforms traditional heuristics and adapts well to varying model depths. The findings indicate a strong correlation between SEC and reconstruction quality, highlighting the importance of spectral bias in INR performance. Overall, the paper contributes to a deeper understanding of the relationship between frequency content and INR capabilities, providing practical tools for improving INR performance across diverse applications.
Methodology
The authors utilize the Spectral Energy Centroid (SEC) metric to analyze the frequency characteristics of target images and the spectral bias of INR models. They conduct experiments to validate the effectiveness of SEC in hyperparameter selection and performance alignment across various INR architectures, comparing it against existing methods like FreSh.
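The paper's exact normalization of SEC is not given here; as a minimal sketch, an energy-weighted radial-frequency centroid computed from the 2-D FFT illustrates the kind of quantity involved. The function name and normalization choices below are assumptions.

```python
import numpy as np

def spectral_energy_centroid(image):
    """Energy-weighted mean radial frequency of a 2-D signal (sketch of SEC)."""
    F = np.fft.fftshift(np.fft.fft2(image))
    energy = np.abs(F) ** 2
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)   # radial frequency per bin
    return (radius * energy).sum() / energy.sum()

# Toy usage: a low-frequency gradient vs. a high-frequency checkerboard.
y, x = np.mgrid[0:64, 0:64]
smooth = x / 64.0
checker = ((x + y) % 2).astype(float)
print(spectral_energy_centroid(smooth), spectral_energy_centroid(checker))
```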
Results
The study shows that the SEC metric is a versatile tool for INR analysis, leading to improved hyperparameter selection (SEC-Conf) that is robust to model depth. The results indicate a strong correlation between SEC values and the quality of signal reconstruction, confirming its utility as a proxy for signal complexity. Additionally, the authors demonstrate that aligning spectral biases can enhance the performance of older models to match that of newer architectures.
Implications
The findings have significant implications for the design and training of INRs in various applications, including scene modeling, robotics, and generative tasks. By providing a systematic approach to hyperparameter tuning and spectral bias alignment, this research can enhance the fidelity and efficiency of neural representations in practical scenarios.
Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning
NLP
Large Language Models
Efficient ML
- Slice is a new initialization method for LoRA adapters that mitigates catastrophic forgetting in continual learning.
- The method uses gradient surgery to align current task objectives with previously learned knowledge.
- Slice outperforms existing methods (vanilla LoRA, LoRA-GA, LoRAM) in terms of stability and performance metrics.
- The paper introduces adversarial task sequences to better evaluate the performance of continual learning methods.
Summary
The paper addresses the challenge of catastrophic forgetting in continual learning (CL) when using Low-Rank Adaptation (LoRA) for fine-tuning large language models (LLMs). The authors introduce a novel method called Slice, which employs gradient surgery to initialize LoRA adapters in a way that minimizes interference with previously learned tasks. Slice accumulates gradients from both the current task and a replay buffer of prior tasks, reconciles them through a projection operator, and uses truncated Singular Value Decomposition (SVD) to set the adapter weights. The method is evaluated against existing initialization techniques on the TRACE benchmark and adversarial Super-NI task sequences, demonstrating that Slice significantly improves stability and reduces forgetting while maintaining general performance. The findings indicate that the initialization of adapters plays a crucial role in balancing the trade-off between stability and plasticity in continual learning scenarios.
Methodology
The authors propose Slice, which initializes LoRA adapters by accumulating gradients from the current task and a replay buffer of past tasks. They reconcile these gradients using a projection operator and apply truncated SVD to derive the adapter weights. The method is tested on the TRACE benchmark and adversarial Super-NI sequences, comparing performance against baseline initialization methods.
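As a rough illustration of the gradient-surgery idea, the sketch below removes the component of a conflicting current-task gradient along the replay-buffer gradient (PCGrad-style) and seeds the LoRA factors from a truncated SVD of the result. The actual Slice projection operator and gradient accumulation follow the paper; everything here, including names and shapes, is a simplified assumption.

```python
import numpy as np

def slice_style_init(g_task, g_replay, rank=8):
    """Hypothetical gradient-surgery initialization for a LoRA pair (A, B)."""
    g_flat, r_flat = g_task.ravel(), g_replay.ravel()
    if g_flat @ r_flat < 0:  # conflicting directions: project out the overlap
        g_task = g_task - (g_flat @ r_flat) / (r_flat @ r_flat) * g_replay
    U, S, Vt = np.linalg.svd(g_task, full_matrices=False)
    B = U[:, :rank] * np.sqrt(S[:rank])              # (d_out, r)
    A = np.sqrt(S[:rank])[:, None] * Vt[:rank]       # (r, d_in), so B @ A ~ reconciled gradient
    return A, B

rng = np.random.default_rng(0)
A, B = slice_style_init(rng.normal(size=(256, 128)), rng.normal(size=(256, 128)), rank=8)
print(A.shape, B.shape)
```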
Results
Slice consistently achieves better stability-plasticity trade-offs compared to baseline methods, improving Average Performance, Final Performance, and Forgetting metrics while preserving General Performance and In Context Performance across both standard and adversarial continual learning sequences.
Implications
The proposed method has significant implications for the deployment of LLMs in dynamic environments where continual adaptation is necessary. By effectively addressing catastrophic forgetting, Slice can enhance the performance of models in real-world applications that require ongoing learning from new data.
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
Reinforcement Learning
Large Language Models
NLP
- Introduction of Reward-Decorrelated Policy Optimization (RDPO) for stabilizing multi-objective reinforcement learning.
- Utilization of Magnitude-Aware Quantile Normalization and Mahalanobis whitening to address reward heterogeneity and correlation.
- Demonstrated improvements in model performance on instruction following and writing quality through RDPO.
- Introduction of Effective Information Efficiency (η_eff) as a metric for assessing mixed-reward aggregation quality.
Summary
This paper addresses the challenges of multi-objective and mixed-reward reinforcement learning environments, where heterogeneous reward distributions and correlated reward dimensions can destabilize the training process. The authors propose a novel method called Reward-Decorrelated Policy Optimization (RDPO), which aims to enhance the stability and effectiveness of reward processing in such complex settings. RDPO employs a two-step approach: first, it utilizes Magnitude-Aware Quantile Normalization to stabilize advantage allocation across various reward types (binary, fractional, and continuous). Second, it applies Mahalanobis whitening to reduce correlation redundancy among reward dimensions before aggregation. The effectiveness of RDPO is demonstrated through post-training experiments on the LongCat-Flash model, showing improvements in instruction following, writing quality, and robustness to challenging prompts, while maintaining competitive performance in reasoning and coding tasks. The paper also introduces a diagnostic measure, Effective Information Efficiency (η_eff), to evaluate the quality of mixed-reward aggregation, highlighting the importance of balancing weights across reward dimensions and minimizing redundant variance.
Methodology
The methodology involves a two-step reward processing pipeline: (1) Magnitude-Aware Quantile Normalization for stabilizing advantage allocation across diverse reward types, and (2) Mahalanobis whitening to mitigate correlation redundancy among reward dimensions prior to aggregation. The paper also introduces Effective Information Efficiency (η_eff) as a diagnostic measure for evaluating the effectiveness of mixed-reward aggregation.
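The whitening step can be illustrated in a few lines: the sketch below applies ZCA-style Mahalanobis whitening to a batch of reward vectors so their covariance becomes the identity. The preceding Magnitude-Aware Quantile Normalization step is omitted, and all shapes are illustrative.

```python
import numpy as np

def whiten_rewards(R, eps=1e-6):
    """Mahalanobis (ZCA-style) whitening of a batch of reward vectors.

    R: (n_samples, n_reward_dims). Returns rewards with identity covariance,
    removing correlation redundancy before aggregation."""
    R = np.asarray(R, dtype=float)
    mu = R.mean(axis=0)
    cov = np.cov(R - mu, rowvar=False) + eps * np.eye(R.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # Sigma^{-1/2}
    return (R - mu) @ W

# Toy usage: two highly correlated reward dimensions plus one independent one.
rng = np.random.default_rng(0)
base = rng.normal(size=(1024, 1))
R = np.hstack([base + 0.1 * rng.normal(size=(1024, 1)),
               base + 0.1 * rng.normal(size=(1024, 1)),
               rng.normal(size=(1024, 1))])
print(np.cov(whiten_rewards(R), rowvar=False).round(2))
```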
Results
The application of RDPO in post-training experiments on LongCat-Flash resulted in enhanced instruction following, improved writing quality, and increased robustness to difficult prompts. The method also demonstrated competitive performance in reasoning and coding evaluations, indicating its effectiveness in handling complex multi-objective tasks.
Implications
The findings suggest that RDPO can significantly improve the training stability and performance of reinforcement learning models in environments with mixed-reward signals. This has potential applications in various domains requiring multi-task learning and complex reward structures, such as natural language processing and robotics.
A Unified Three-Stage Machine Learning Framework for Diabetes Detection, Subtype Discrimination, and Cognitive-Metabolic Hypothesis Testing
Interpretability
- Introduces a three-stage framework for diabetes detection and subtype discrimination.
- Achieves high performance metrics with SVM-RBF and Logistic Regression on diabetes prediction.
- Utilizes unsupervised K-Means clustering to identify diabetes subtypes without ground-truth labels.
- Demonstrates a significant association between glycaemic control and cognitive function.
Summary
This paper presents a novel three-stage machine learning framework aimed at improving diabetes detection, subtype discrimination, and exploring cognitive-metabolic associations. The authors identify significant gaps in existing machine learning approaches to diabetes prediction, particularly the lack of subtype discrimination and comprehensive evaluation metrics. In Stage 1, five supervised classifiers, including SVM-RBF and Logistic Regression, are benchmarked on the NCSU Diabetes Dataset, achieving a maximum ROC-AUC of 0.825. Stage 2 employs silhouette-validated K-Means clustering to identify diabetes subtypes without relying on ground-truth labels. In Stage 3, the authors conduct a statistical analysis using the Ohio Longitudinal Cognitive Dataset, revealing a significant positive correlation between glycaemic control and cognitive function. The framework emphasizes reproducibility, with all code and methodologies made publicly available, enhancing the potential for clinical application and decision support.
Methodology
The methodology consists of three stages: (1) Benchmarking five supervised classifiers and a stacking ensemble on the NCSU Diabetes Dataset using stratified five-fold cross-validation, (2) Applying silhouette-validated K-Means clustering to identify diabetes subtypes, and (3) Conducting statistical analysis on the Ohio Longitudinal Cognitive Dataset to test the association between glycaemic control and cognitive function.
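A minimal sketch of the first two stages on synthetic data (not the NCSU or Ohio datasets): stratified five-fold cross-validation of two of the benchmarked classifiers, followed by silhouette-validated K-Means to choose a cluster count. All dataset parameters below are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stage 1: benchmark classifiers with stratified five-fold CV (ROC-AUC).
X, y = make_classification(n_samples=600, n_features=12, weights=[0.7], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "SVM-RBF": make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True)),
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: ROC-AUC = {auc.mean():.3f} ± {auc.std():.3f}")

# Stage 2: silhouette-validated K-Means to pick a cluster count for "subtypes".
X_pos = StandardScaler().fit_transform(X[y == 1])
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pos)
    print(f"k={k}: silhouette = {silhouette_score(X_pos, labels):.3f}")
```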
Results
The study reports that SVM-RBF and Logistic Regression achieved the highest ROC-AUC of 0.825, while Random Forest had the highest accuracy of 0.762. The K-Means clustering identified clinically plausible diabetes subtypes, and the statistical analysis revealed a significant positive correlation (ρ_s = 0.208, p = 5.29 × 10⁻⁵) between glycaemic control and cognitive function.
Implications
The findings suggest that the proposed framework can enhance diabetes detection and management by providing subtype-specific insights and cognitive risk assessments, which could be integrated into clinical decision support systems.
Scaling Laws for Mixture Pretraining Under Data Constraints
NLP
Large Language Models
Optimization
- Mixture training allows for higher repetition of target data compared to single-source training.
- Optimal repetition rates for target data range from 15 to 20 times, depending on various factors.
- A new scaling law is introduced that predicts target-domain loss based on mixture configurations.
- Empirical findings demonstrate that larger models can extract more from limited data despite faster overfitting.
Summary
This paper investigates the trade-off in mixture pretraining of language models when faced with limited target data, such as low-resource languages or specialized domains. The authors conduct over 2,000 training runs across various model sizes and data types to explore how the mixture of scarce target data with abundant generic data affects model performance. They find that while too little target data underexposes the model, excessive repetition of target data leads to overfitting. The study reveals that mixture training can tolerate higher repetition rates than single-source training, with optimal repetitions ranging from 15 to 20 times, depending on the target data size and compute budget. The authors introduce a repetition-aware mixture scaling law that predicts target-domain loss based on target data size, mixture ratio, and model size, providing practical recommendations for effective pretraining under data constraints. This work contributes to the understanding of how to optimally mix constrained and abundant data sources in language model training.
Methodology
The authors conducted a systematic empirical study involving over 2,000 training runs across different model sizes (from 101M to 805M parameters) and various target data types, including multilingual and domain-specific datasets. They analyzed the effects of data repetition on model performance and developed a scaling law that incorporates the diminishing returns of repeated tokens in mixture training.
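The paper's repetition-aware mixture scaling law is not reproduced here; as a hedged illustration, one plausible ingredient is a data-constrained-style "effective unique tokens" term with diminishing returns in the repetition count. The functional form and the R* ≈ 15 constant below are assumptions, not the paper's fitted law, which additionally depends on the mixture ratio and model size.

```python
import numpy as np

def effective_target_tokens(unique_tokens, repetitions, r_star=15.0):
    """Hypothetical diminishing-returns value of repeated target tokens:
    D_eff = U * R* * (1 - exp(-R / R*)), where U is the unique-token count
    and R the number of repetitions."""
    return unique_tokens * r_star * (1.0 - np.exp(-repetitions / r_star))

for reps in (1, 5, 15, 20, 40):
    ratio = effective_target_tokens(1e8, reps) / 1e8
    print(f"R={reps:>2}: effective data ≈ {ratio:.1f}x the unique target tokens")
```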
Results
The study found that repetition is a significant factor in target-domain performance, with optimal repetition rates allowing for effective learning from limited data. The introduced scaling law accurately predicts target-domain loss and provides insights into the optimal mixture configurations for pretraining, demonstrating that higher repetition is feasible without performance degradation when abundant generic data is present.
Implications
The findings have important implications for training language models in scenarios with limited target data, such as low-resource languages or specialized domains. The proposed scaling law and recommendations can guide practitioners in optimizing their pretraining strategies, potentially improving model performance in underrepresented areas.
Strategic PAC Learnability via Geometric Definability
Theory
- Strategic behavior can significantly impact the learnability of hypothesis classes.
- The authors provide a counterexample showing that learnability is not preserved under strategic behavior in simple cases.
- Introducing geometric definability allows for the preservation of learnability and manageable sample complexity.
- The framework accommodates a variety of cost functions and hypothesis classes commonly used in machine learning.
Summary
This paper investigates the concept of strategic classification, where individuals can alter their features at a cost to influence a classifier's decision. The authors explore how the sample complexity of the induced strategic hypothesis class is affected by the complexities of the underlying hypothesis class and the cost structure governing feature manipulations. They demonstrate that previous assumptions about learnability under strategic behavior are not universally valid, presenting a counterexample where a hypothesis class with VC dimension 1 leads to an induced class with infinite VC dimension. To address this, the authors introduce a geometric definability assumption, allowing both the hypothesis class and cost-induced neighborhood relations to be described using first-order formulas over the reals. This framework captures a wide range of natural classes and cost functions. The authors prove that under this geometric structure, learnability is preserved, and the sample complexity is manageable, depending on the complexity of the defining formulas. This work highlights the necessity of geometric structure in maintaining learnability in strategic settings.
Methodology
The authors utilize a theoretical approach, constructing counterexamples and proving results based on first-order definability over the reals. They analyze the implications of strategic behavior on sample complexity and learnability through geometric structures.
Results
The paper establishes that strategic behavior can lead to non-learnable hypothesis classes, even in simple scenarios. However, by imposing a geometric definability structure, the authors prove that learnability can be preserved, with sample complexity linked to the complexity of the defining formulas.
Implications
The results suggest that strategic classification systems need to account for geometric structures to ensure robust learnability. This has practical implications for designing classifiers in various fields where individuals can manipulate their features to influence outcomes.
Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning
Reinforcement Learning
Efficient ML
Theory
- Introduces probabilistic HD-CB, a low-precision variant of HD-CB for resource-constrained devices.
- Replaces deterministic accumulation with a probabilistic update rule to enhance decision-making efficiency.
- Demonstrates improved performance over binarized HD-CB while maintaining low precision.
- Addresses the overflow issue in low-precision components without the need for periodic binarization.
Summary
This paper addresses the challenges of deploying contextual bandit (CB) algorithms on resource-constrained devices, where standard linear CB methods are often impractical due to their high computational and memory costs. The authors introduce a novel approach called probabilistic HD-CB, which enhances the existing hyperdimensional computing-based CB method (HD-CB) by replacing deterministic accumulation with a probabilistic update rule. This new method updates only a random subset of vector components at each step, allowing for low-precision components while preventing overflow and reducing update costs. The authors demonstrate that probabilistic HD-CB outperforms the previously proposed binarized HD-CB method at equal precision levels and approaches the performance of the original HD-CB with significantly lower precision requirements. The findings suggest that probabilistic HD-CB is a promising solution for implementing CB algorithms on devices with strict resource constraints, making it suitable for various applications in edge computing and adaptive services.
Methodology
The authors developed probabilistic HD-CB by implementing a probabilistic update mechanism that randomly selects a subset of vector components to update at each decision step. This method is inspired by low-precision neural networks and incorporates time-decaying update probabilities to control the learning rate, allowing for efficient memory usage and preventing overflow in low-precision components.
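A minimal sketch of what such a probabilistic, low-precision update might look like, assuming bipolar context hypervectors, a time-decaying update probability, and components clipped to an n-bit integer range; the actual HD-CB update rule and hyperparameters may differ.

```python
import numpy as np

def probabilistic_update(model, context_hv, reward, t, p0=0.5, decay=0.01, n_bits=3):
    """One hypothetical probabilistic HD-CB update step.

    Instead of deterministically accumulating the bipolar context hypervector
    into the chosen arm's model, each component is updated only with a
    time-decaying probability, and components are clipped to an n-bit range
    so low precision never overflows."""
    rng = np.random.default_rng(t)
    p_t = p0 / (1.0 + decay * t)               # time-decaying update probability
    mask = rng.random(model.shape) < p_t       # random subset of components
    step = 1 if reward > 0 else -1
    lo, hi = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    model[mask] = np.clip(model[mask] + step * context_hv[mask], lo, hi)
    return model

# Toy usage: 2048-dim model vector updated over 200 rewarded steps.
D = 2048
rng = np.random.default_rng(0)
model = np.zeros(D, dtype=np.int8)
for t in range(1, 200):
    ctx = rng.choice([-1, 1], size=D).astype(np.int8)
    model = probabilistic_update(model, ctx, reward=1.0, t=t)
print(model.min(), model.max())   # stays inside the 3-bit range
```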
Results
Probabilistic HD-CB consistently outperformed the binarized HD-CB approach in off-policy evaluations on standardized synthetic CB benchmarks, achieving performance levels comparable to the original HD-CB with as few as 3 bits per component. This indicates a significant reduction in resource requirements while maintaining decision quality.
Implications
The findings suggest that probabilistic HD-CB can facilitate the deployment of contextual bandit algorithms on resource-constrained devices, enabling real-time decision-making in various applications such as online advertising, personalized recommendations, and adaptive resource management in edge computing environments.
Multimodal Graph-based Classification of Esophageal Motility Disorders
Multimodal
Graph Learning
- Proposes a multimodal ML approach combining HRIM data with patient-specific information.
- Uses graph-based modeling to represent HRIM data, enhancing the analysis of esophageal motility.
- Demonstrates improved classification accuracy over traditional methods and vision-based classifiers.
- Highlights the importance of integrating multiple data modalities for better diagnostic outcomes.
Summary
This paper addresses the challenges in diagnosing esophageal motility disorders, particularly dysphagia, by proposing a multimodal machine learning approach that combines high-resolution impedance manometry (HRIM) data with patient-specific information. The authors collected data from 104 patients and represented HRIM recordings as spatio-temporal graphs, where nodes correspond to pressure values and edges represent spatial adjacency and impedance dynamics. A graph neural network (GNN) was employed to learn meaningful representations from these graphs, which were then fused with patient embeddings for multi-category classification of swallow events. The study demonstrated that integrating patient-specific information significantly improved classification accuracy compared to models relying solely on HRIM features. Additionally, the graph-based approach outperformed vision-based classifier baselines, highlighting the importance of multimodal data integration in enhancing diagnostic precision. The findings suggest that this method could lead to more accurate and personalized assessments of esophageal motility disorders, although further validation with larger datasets is necessary to confirm these results.
Methodology
The study utilized HRIM recordings and patient data from 104 patients, transforming HRIM data into spatio-temporal graphs for analysis. A graph neural network was applied to learn representations from these graphs, which were then combined with patient embeddings for classification tasks. Ablation studies were conducted to evaluate the impact of different features and modeling approaches.
Results
The multimodal approach showed significant improvements in classification accuracy across all categories compared to models that used only HRIM-derived features. The graph-based modeling also provided advantages over vision-based classifier baselines, indicating the effectiveness of the proposed method.
Implications
The findings suggest that integrating patient-level data with graph-based representations of HRIM signals can lead to more accurate classifications of esophageal motility disorders, potentially improving clinical decision-making and patient outcomes. This approach could pave the way for personalized medicine in the diagnosis and treatment of dysphagia and related disorders.
A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning
NLP
Large Language Models
Theory
- Introduces synthetic languages with hierarchical structures for precise analysis of context and reasoning in autoregressive generation.
- Derives explicit asymptotic predictions for distributional statistics in two broadcast process settings.
- Establishes a lower bound on context length for faithful sampling and demonstrates an exponential improvement using reasoning models.
- Empirical results validate theoretical predictions, showing the relationship between context size and model performance.
Summary
This paper introduces a family of synthetic languages characterized by a hierarchical structure generated through a broadcast process on trees. The authors analyze the impact of context length and reasoning in autoregressive generation using an exact k-gram ansatz, which serves as a substitute for traditional transformers. They derive asymptotic predictions for the distributional statistics of sequences produced by trained models in two settings: the Ising broadcast process and the coloring broadcast process. The findings reveal that the variance of generated sequences scales log-linearly with context depth, and the kurtosis approaches that of a Gaussian distribution, indicating deviations from the true language for sublinear contexts. Furthermore, the authors establish a lower bound on the context length required for accurate sampling of sequences, contrasting this with an autoregressive reasoning model that can sample exactly from the true language using significantly less memory. Empirical validation with transformers confirms the theoretical predictions across various context sizes, demonstrating the effectiveness of the proposed model in understanding the trade-offs between context length and reasoning in language modeling.
Methodology
The authors utilize a k-gram ansatz to analyze autoregressive processes, replacing traditional transformers. They derive theoretical results based on a broadcast process on trees, focusing on two instantiations: the Ising broadcast process and the coloring broadcast process. Empirical validation is conducted using transformers trained on the synthetic languages to confirm theoretical predictions.
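For concreteness, the Ising broadcast process on a complete binary tree can be sampled in a few lines; the snippet below generates leaf sequences whose correlations decay with tree distance. Depth, flip probability, and sample counts are illustrative only.

```python
import numpy as np

def ising_broadcast_leaves(depth, eps=0.2, rng=None):
    """Leaves of an Ising broadcast process on a complete binary tree.

    The root is a uniform ±1 spin; each child copies its parent's spin and
    flips it independently with probability eps. The leaves, read left to
    right, form one sequence of length 2**depth."""
    rng = rng if rng is not None else np.random.default_rng()
    level = np.array([rng.choice([-1, 1])])
    for _ in range(depth):
        children = np.repeat(level, 2)                 # each node gets two children
        flips = rng.random(children.size) < eps
        level = np.where(flips, -children, children)
    return level

rng = np.random.default_rng(0)
samples = np.stack([ising_broadcast_leaves(8, eps=0.15, rng=rng) for _ in range(2000)])
# Nearby leaves share more tree ancestry, so their correlation decays with distance.
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1],
      np.corrcoef(samples[:, 0], samples[:, -1])[0, 1])
```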
Results
The study finds that the variance of generated sequences scales log-linearly with context depth, while kurtosis converges to that of a Gaussian distribution. A lower bound of Ω(n) on context length is established for accurate sampling, while an autoregressive reasoning model with Θ(log n) memory can sample exactly from the true language, demonstrating significant efficiency gains.
Implications
The findings suggest that understanding the interplay between context length and reasoning can lead to more efficient language models, potentially reducing computational costs and improving performance in tasks requiring long-range dependencies.
Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model
Theory
- Introduces a minimal binary model to study shortcut features and OOD failure.
- Demonstrates that training-side observations can indicate potential cross-family failures.
- Establishes that positive training shortcut correlation and shortcut-rule transitions are distinct phenomena.
- Shows that the same training solution can yield different outcomes depending on the held-out family.
Summary
This paper investigates the relationship between shortcut features and out-of-distribution (OOD) failure in a minimal binary model. The author introduces a model with one invariant coordinate and one family-dependent shortcut coordinate, aiming to clarify how training correlation, learned shortcut use, and test-time failure interact. The study reveals that positive average shortcut correlation can lead to a transition towards shortcut reliance during training, but ridge regularization can maintain an invariant-dominated classifier, preventing deterministic OOD failure. However, when the invariant coordinate is noisy, the model shows that the transition to shortcut reliance can occur if the training shortcut signal surpasses the invariant signal. The consequences of this transition vary depending on the held-out family, indicating that weaker shortcut correlation can result in positive excess risk, while sign-flipped families can lead to above-chance error. The findings emphasize the distinction between shortcut attraction, shortcut-rule transition, and cross-family OOD failure, providing a clearer understanding of these phenomena in machine learning.
Methodology
The author employs a closed-form binary model with two observed coordinates: an invariant signal and a family-dependent shortcut. The model analyzes the effects of training shortcut correlation on classifier behavior, using deterministic and noisy regimes to derive conditions for shortcut transitions and OOD failure. Theoretical results are supported by synthetic checks to validate the model's predictions.
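A toy numerical instance of the noisy regime, assuming specific signal and noise scales: when the shortcut coordinate is cleaner than the invariant one, the closed-form ridge solution leans on the shortcut and collapses on a sign-flipped held-out family. The numbers below are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y = rng.choice([-1, 1], size=n)
x_inv = y + 1.0 * rng.normal(size=n)            # noisy invariant coordinate
x_short = 0.8 * y + 0.3 * rng.normal(size=n)    # cleaner family-dependent shortcut
X = np.column_stack([x_inv, x_short])

lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)   # closed-form ridge solution
print("weights (invariant, shortcut):", w.round(3))        # leans on the shortcut

# Held-out family with the shortcut's sign flipped: above-chance error appears.
X_ood = np.column_stack([y + 1.0 * rng.normal(size=n),
                         -0.8 * y + 0.3 * rng.normal(size=n)])
acc = ((X_ood @ w) * y > 0).mean()
print("accuracy on sign-flipped family:", round(acc, 3))
```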
Results
The study finds that in a deterministic setting, ridge regularization prevents deterministic OOD failure despite positive shortcut correlation. In a noisy regime, a transition to shortcut reliance occurs when the training shortcut signal exceeds the invariant signal, leading to varying outcomes based on the held-out family. The results indicate that positive training shortcut correlation does not guarantee robustness, as it can lead to different levels of risk depending on the test family's characteristics.
Implications
The findings have significant implications for understanding shortcut learning and OOD failure in machine learning models. By clarifying the distinctions between shortcut attraction, transitions, and failures, the research can inform the design of more robust training methodologies and diagnostics for identifying potential OOD issues.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
NLP
Large Language Models
Efficient ML
- Introduces Orthrus, a dual-architecture framework that combines autoregressive and diffusion models.
- Achieves up to 7.8× speedup in token generation while maintaining exact predictive fidelity.
- Utilizes a shared Key-Value cache to eliminate redundant memory usage.
- Incorporates a consensus mechanism for lossless inference.
Summary
Orthrus introduces a novel dual-architecture framework that combines the high fidelity of autoregressive Large Language Models (LLMs) with the parallel token generation capabilities of diffusion models. Traditional autoregressive models face inefficiencies during the decoding phase due to their sequential nature, leading to high inference latency. On the other hand, diffusion models can generate tokens in parallel but often suffer from performance degradation and high training costs. Orthrus addresses these challenges by integrating a lightweight, trainable diffusion module alongside a frozen autoregressive model, allowing both components to share the same high-fidelity Key-Value (KV) cache. This design enables lossless inference while achieving significant speedups—up to 7.8 times faster—without incurring substantial memory overhead. The framework maintains the exact predictive distribution of the autoregressive model through a consensus mechanism that validates the outputs of the diffusion head against the autoregressive head, ensuring high-quality token generation. Orthrus is lightweight, requiring only a small fraction of the model parameters to be fine-tuned, making it a practical solution for enhancing the efficiency of existing LLMs.
Methodology
Orthrus integrates a frozen autoregressive model with a lightweight diffusion module. During the pre-filling stage, the autoregressive model constructs a high-fidelity Key-Value cache, which is then utilized by the diffusion head for parallel token generation. The framework employs a two-head consensus mechanism to ensure that the generated tokens match the autoregressive model's predictive distribution, achieving lossless inference.
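The consensus mechanism is described only at a high level above; the sketch below shows one speculative-decoding-style reading in which drafted tokens are kept only while they agree with the autoregressive head, so the final output matches pure AR decoding. The acceptance rule and the helper names are assumptions, not Orthrus's actual mechanism.

```python
def consensus_accept(draft_tokens, ar_greedy_step):
    """Keep the longest prefix of a parallel draft that the autoregressive head
    would also produce under greedy decoding (hypothetical acceptance rule).

    draft_tokens: token ids proposed in parallel by the diffusion head.
    ar_greedy_step: callable(prefix) -> next token id under the AR head."""
    accepted = []
    for tok in draft_tokens:
        ar_tok = ar_greedy_step(accepted)
        accepted.append(ar_tok)        # the AR head's choice is always what is kept
        if ar_tok != tok:              # first disagreement ends this parallel round
            break
    return accepted

# Toy usage: a stand-in AR head that continues the sequence 0, 1, 2, ...
ar = lambda prefix: (prefix[-1] + 1) if prefix else 0
print(consensus_accept([0, 1, 2, 9, 4], ar))   # -> [0, 1, 2, 3]
```

In a real system the AR head would score all drafted positions in one batched forward pass over the shared KV cache rather than through sequential calls; the loop above is only for clarity.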
Results
The Orthrus framework demonstrates significant improvements in inference speed, achieving up to 7.8 times faster token generation compared to traditional autoregressive methods. It also maintains the exact predictive distribution of the base autoregressive model, ensuring high-quality outputs.
Implications
Orthrus has the potential to enhance the efficiency of existing large language models, making them more suitable for real-time applications where speed is critical. Its lightweight design allows for easy integration into current systems, promoting broader adoption of efficient token generation techniques in natural language processing tasks.
Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise
Theory
Optimization
- First population risk bounds for KANs trained with mini-batch SGD and correlated noise.
- Establishes bounds for both non-private and differentially private settings.
- Introduces a novel analysis framework for correlated-noise DP training in non-convex regimes.
- Demonstrates that correlated noise can improve the privacy-utility tradeoff compared to independent noise.
Summary
This paper presents the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained using mini-batch Stochastic Gradient Descent (SGD) with gradient clipping, addressing both non-private and differentially private (DP) settings with correlated noise. The authors highlight that traditional analyses have been limited to full-batch gradient descent and independent noise, which do not reflect practical training scenarios. The study introduces a novel analysis framework that accounts for the challenges posed by temporal correlations in noise, which can improve the privacy-utility tradeoff. The results extend existing KAN theory by providing sharper bounds and covering various special cases, including non-private mini-batch SGD and independent-noise DP-SGD. The paper's contributions include establishing population risk bounds for two-layer KANs, demonstrating that correlated noise can enhance performance, and providing a new analytical route for DP training in non-convex settings. This work is significant as it fills a gap in the literature regarding population risk guarantees for non-convex neural networks under practical training conditions.
Methodology
The authors develop a new analysis route that incorporates an auxiliary unprojected dynamics and a shifted iterate to manage the challenges posed by correlated noise in the optimization process. They utilize high-probability bootstrapping to certify projection inactivity, allowing for a more accurate assessment of population risk bounds in the context of mini-batch SGD and DP training.
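The analyzed training procedure, stripped to its essentials, is clipped mini-batch SGD with temporally correlated Gaussian noise. The sketch below uses an AR(1) noise process as a stand-in for the correlation structure studied in the paper (which is not reproduced here); the loss, model, and hyperparameters are all illustrative.

```python
import numpy as np

def dp_sgd_correlated(grads_fn, w, steps, lr=0.1, clip=1.0, sigma=1.0, rho=0.5, rng=None):
    """Clipped mini-batch SGD with temporally correlated Gaussian noise.

    Per-example gradients are clipped to norm `clip`, averaged, and perturbed
    by z_t = rho * z_{t-1} + sqrt(1 - rho^2) * g_t, g_t ~ N(0, sigma^2 I)
    -- an AR(1) stand-in for correlated-noise DP mechanisms."""
    rng = rng if rng is not None else np.random.default_rng(0)
    z = np.zeros_like(w)
    for _ in range(steps):
        per_example = grads_fn(w)                                   # (batch, dim)
        norms = np.linalg.norm(per_example, axis=1, keepdims=True)
        clipped = per_example * np.minimum(1.0, clip / (norms + 1e-12))
        z = rho * z + np.sqrt(1 - rho ** 2) * rng.normal(0, sigma, size=w.shape)
        w = w - lr * (clipped.mean(axis=0) + z / per_example.shape[0])
    return w

# Toy usage: per-example gradients of a 2-parameter linear regression.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
true_w = np.array([1.0, -2.0])
y = X @ true_w + 0.1 * rng.normal(size=256)
grads = lambda w: 2 * (X @ w - y)[:, None] * X
print(dp_sgd_correlated(grads, np.zeros(2), steps=500).round(2))
```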
Results
The paper establishes that the population risk bounds for two-layer KANs trained by clipped mini-batch SGD are valid under specific width regimes. In the DP setting, the results provide the first population risk bounds for correlated-noise DP training in non-convex settings, matching the convex DP-SCO lower bound up to logarithmic factors. The findings also confirm that the bounds for KANs encompass various training scenarios, including non-private and independent-noise cases.
Implications
The results have significant implications for the training of neural networks in sensitive applications, such as healthcare and finance, where privacy is crucial. The improved understanding of population risk in the context of correlated noise can lead to better model performance and privacy guarantees in practical machine learning deployments.
Multi-Quantile Regression for Extreme Precipitation Downscaling
Time Series
Generative Models
Theory
- Q-SRDRN significantly improves detection rates of extreme precipitation events compared to traditional methods.
- The use of pinball loss allows for better handling of heavy-tail distributions in precipitation data.
- Data augmentation through cVAE is beneficial when aligned with the model architecture and regional characteristics.
- The architecture shows strong performance across diverse climatic conditions, indicating its robustness.
Summary
This paper addresses the limitations of deep super-resolution networks in predicting extreme precipitation events, which are critical for flood risk assessment. The authors introduce Q-SRDRN, a multi-quantile super-resolution network that utilizes pinball loss to improve the detection of heavy-tail precipitation events. They identify that traditional data augmentation methods fail due to the averaging effect of intensity-weighted MAE loss, which dilutes the predictive power for extreme events. The proposed architecture includes IncrementBound to enforce monotonicity and separate output heads for different quantiles, allowing for better specialization of convolutional filters. The methodology is validated across three distinct U.S. climates: Florida, California, and a Texas substate, demonstrating significant improvements in detection rates for extreme precipitation events compared to deterministic baselines. The findings suggest that multi-quantile regression can effectively capture extreme events, and when paired with appropriate data augmentation, it can enhance model performance without introducing bias.
Methodology
The authors developed Q-SRDRN, a multi-quantile super-resolution network that employs pinball loss for training. The architecture includes IncrementBound for monotonicity and separate convolutional heads for different quantiles, allowing for specialized learning. The model was validated using precipitation data from Florida, California, and Texas, with a focus on extreme events.
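Two ingredients are easy to make concrete: the pinball (quantile) loss and a monotone multi-quantile head. The sketch below uses cumulative softplus increments as one plausible reading of IncrementBound; the head structure, quantile levels, and shapes are assumptions, not Q-SRDRN's exact architecture.

```python
import torch

def pinball_loss(pred, target, quantile):
    """Pinball loss: asymmetric penalty that makes the output estimate
    the given quantile rather than the mean."""
    err = target - pred
    return torch.mean(torch.maximum(quantile * err, (quantile - 1) * err))

def monotone_quantiles(base, raw_increments):
    """Hypothetical IncrementBound-style head: higher quantiles are the base
    prediction plus cumulative non-negative increments, so outputs never cross."""
    return base + torch.cumsum(torch.nn.functional.softplus(raw_increments), dim=-1)

# Toy usage: three quantile heads (0.5, 0.9, 0.99) over a batch of 4 pixels.
base = torch.tensor([[1.0], [2.0], [0.5], [3.0]])
raw = torch.randn(4, 3)
q_preds = monotone_quantiles(base, raw)
target = torch.tensor([1.2, 2.5, 0.4, 4.0]).unsqueeze(1)
loss = sum(pinball_loss(q_preds[:, i:i + 1], target, q)
           for i, q in enumerate([0.5, 0.9, 0.99]))
print(q_preds, loss.item())
```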
Results
The Q-SRDRN model achieved an 18-fold increase in detection rates for extreme precipitation events in Florida, detecting 1,598 out of 2,111 events at 200 mm/day compared to only 88 by the deterministic baseline. In California, the model reached near-perfect detection rates for extreme events, while in Texas, it detected 8,776 out of 10,720 events at the same threshold. The median channel's performance improved significantly with the introduction of cVAE-generated samples.
Implications
The findings suggest that multi-quantile regression can enhance predictive modeling for extreme weather events, which is crucial for flood risk management and infrastructure planning. The methodology can be applied to other regions and types of extreme weather, potentially improving climate resilience strategies.
EMO: Frustratingly Easy Progressive Training of Extendable MoE
Large Language Models
Efficient ML
- EMO allows for progressive expansion of the expert pool during training, improving efficiency.
- The framework is based on a sparsity scaling law that optimizes token allocation across training stages.
- EMO matches or exceeds the performance of fixed-expert models while reducing training time and costs.
- The approach leverages the principle that MoE capacity should grow with data availability.
Summary
The paper introduces EMO, a progressive training framework for Sparse Mixture-of-Experts (MoE) models that addresses the inefficiencies associated with training large expert pools from the outset. The authors argue that the traditional approach of allocating a large number of experts at the beginning of training leads to increased memory and communication costs, which can hinder training efficiency. EMO proposes a method to incrementally expand the expert pool as training progresses, treating MoE capacity as expandable memory. This approach is grounded in a sparsity scaling law that helps determine optimal token budgets for each stage of training, allowing for efficient utilization of compute resources. The authors validate EMO through large-scale experiments, demonstrating that it achieves comparable performance to fixed-expert setups while significantly improving wall-clock efficiency and reducing GPU costs.
Methodology
The authors developed EMO by starting with a smaller dense model and progressively expanding it into a larger MoE model through multiple stages. They conducted scaling-law experiments to determine the optimal allocation of tokens for each stage, ensuring that the model effectively utilizes its capacity as training data increases. This involved calibrating the expert count and adjusting the training schedule based on the predicted performance at each stage.
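A hypothetical sketch of a stage transition, assuming the pool grows by cloning and perturbing existing experts; EMO's actual expansion rule, router handling, and token-budget calibration are not reproduced here.

```python
import copy
import torch
import torch.nn as nn

def expand_expert_pool(experts: nn.ModuleList, new_total: int, noise: float = 1e-3):
    """Grow the expert pool by cloning existing experts and perturbing the
    copies so they can diverge during the next training stage (hypothetical)."""
    originals = list(experts)
    i = 0
    while len(experts) < new_total:
        clone = copy.deepcopy(originals[i % len(originals)])
        with torch.no_grad():
            for p in clone.parameters():
                p.add_(noise * torch.randn_like(p))
        experts.append(clone)
        i += 1
    return experts

# Toy usage: expand a pool of 4 small MLP experts to 8 at a stage boundary.
make_expert = lambda: nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
pool = nn.ModuleList([make_expert() for _ in range(4)])
print(len(expand_expert_pool(pool, new_total=8)))
```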
Results
In experiments, EMO transitioned a 1.1B dense model into a 9.6B MoE model with 128 experts over five stages, achieving a final pretraining loss of 1.017, which is competitive with a fixed-expert baseline of 0.994. Additionally, EMO saved 10% in GPU hours compared to the fixed-expert setup, demonstrating its efficiency. Downstream evaluations across various benchmarks showed that EMO outperformed a fixed-expert model with 64 experts while remaining comparable to the 128-expert baseline.
Implications
The EMO framework has significant implications for the training of large-scale MoE models, providing a more efficient method to leverage expert capacity as data scales. This could lead to advancements in various applications requiring large language models, optimizing resource usage and potentially enabling the development of even larger models without proportional increases in training costs.
Modeling Heterophily in Multiplex Graphs: An Adaptive Approach for Node Classification
Graph Learning
- HAAM explicitly models both homophilic and heterophilic interactions in multiplex graphs.
- The use of dimension-specific compatibility matrices allows for tailored representation learning.
- Product-composed Chebyshev filters enable the model to capture non-linear interactions effectively.
- The framework improves node classification performance compared to existing methods.
Summary
This paper addresses the limitations of existing multiplex graph models that primarily assume homophily, where connected nodes share similar attributes or classes. The authors introduce HAAM (Heterophily-Aware Adaptive Multiplex model), a novel framework designed for node classification in multiplex graphs that accommodates both homophilic and heterophilic interactions. HAAM employs dimension-specific compatibility matrices to capture varying levels of homophily and heterophily across different graph dimensions. A significant innovation of HAAM is the use of a product of trainable low-pass and high-pass Chebyshev filters to effectively model both smooth and abrupt changes in graph signals. This allows the model to adaptively adjust to the heterophilic characteristics of each dimension. The training process utilizes a proximal-gradient optimization method to refine label predictions while promoting sparsity in the consensus predictions. The experimental results demonstrate that HAAM outperforms state-of-the-art methods in node classification tasks on both synthetic and real-world datasets, showcasing its ability to effectively capture the complex interplay of interactions in multiplex graphs.
Methodology
The authors propose HAAM, which utilizes learnable compatibility matrices to model varying degrees of homophily and heterophily across dimensions. The model incorporates a product of low-pass and high-pass Chebyshev filters to capture different frequency components of graph signals. Training is conducted using a dual loss function that includes cross-entropy loss and a divergence minimization term, optimized through a proximal-gradient method.
Results
Extensive experiments on synthetic and real-world datasets indicate that HAAM significantly improves node classification performance compared to state-of-the-art methods, effectively capturing the complexities of multiplex graphs with both homophilic and heterophilic interactions.
Implications
HAAM's approach can be applied to various domains such as social networks, biological systems, and recommendation systems, where understanding the interplay of different types of interactions is crucial for accurate predictions and insights.
Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations
Efficient ML
Generative Models
Theory
- Di-BiLPS effectively addresses both forward and inverse PDE problems under extreme data sparsity.
- The framework utilizes a combination of variational autoencoders, latent diffusion models, and contrastive learning.
- It achieves state-of-the-art performance with significantly reduced computational costs.
- The proposed denoising algorithm integrates physical constraints for improved inference.
Summary
The paper introduces Di-BiLPS, a novel neural framework designed to address the challenges of solving partial differential equations (PDEs) under extremely sparse observational data. Traditional numerical solvers and existing neural approaches struggle with high-resolution inference and accuracy when data is limited. Di-BiLPS combines a variational autoencoder for dimensionality reduction, a latent diffusion module for uncertainty modeling, and contrastive learning for representation alignment. This framework operates in a compressed latent space, enhancing computational efficiency and flexibility in input-output mapping. A key innovation is the PDE-informed denoising algorithm, which utilizes a variance-preserving diffusion process to improve inference efficiency. Extensive experiments across five PDE benchmark datasets demonstrate that Di-BiLPS consistently outperforms state-of-the-art methods in both accuracy and computational cost, even with as little as 3% input data. Furthermore, it supports zero-shot super-resolution, allowing predictions over continuous spatial-temporal domains without retraining.
Methodology
Di-BiLPS employs a three-component architecture: (1) a contrastive learning module for aligning representations between sparse and full observations, (2) a pre-trained variational autoencoder to compress inputs into a latent space, and (3) a latent diffusion model that facilitates bidirectional inference for PDE solutions. The framework also includes a PDE-informed denoising algorithm based on a variance-preserving diffusion process.
Results
The experiments conducted on five PDE benchmark datasets reveal that Di-BiLPS consistently outperforms existing methods in terms of prediction accuracy and computational efficiency, even with extremely sparse inputs. The framework demonstrates the ability to generalize to unseen spatial resolutions without the need for retraining.
Implications
The advancements presented in Di-BiLPS could significantly enhance the modeling of complex physical and natural phenomena in various fields, including engineering, physics, and environmental science, where data is often sparse. The ability to perform zero-shot super-resolution may also open new avenues for real-time applications and simulations.
Tight Sample Complexity Bounds for Entropic Best Policy Identification
Reinforcement Learning
Theory
- Introduces a new lower bound for best policy identification in risk-sensitive reinforcement learning.
- Develops the Entropic-BPI algorithm that achieves optimal sample complexity.
- Improves concentration bounds for exponential utilities, enhancing exploration strategies.
- Demonstrates that the maximal achievable reward G_max is a better metric for sample complexity than the horizon H.
Summary
This paper addresses the problem of best-policy identification in finite-horizon risk-sensitive reinforcement learning, specifically under the entropic risk measure. Previous research highlighted a significant gap between the lower and upper bounds on the sample complexity required to identify an approximately optimal policy, with lower bounds scaling as Ω(e^{|β|H}) and upper bounds achieving O(e^{2|β|H}). The authors identify that the extra exponential factor in the upper bound arises from loose concentration control for exponential utilities. To bridge this gap, they propose a forward-model based algorithm that incorporates KL-based exploration bonuses tailored to the entropic criterion. The authors introduce two main innovations: sharper concentration bounds derived from the smoothness properties of exponential utility and a new stopping rule that optimally exploits this tightness. The paper presents a new lower bound for the best policy identification problem, expressed in terms of the maximal achievable reward G_max, which is argued to be more suitable for the entropic risk measure. The proposed algorithm, Entropic-BPI, achieves optimal sample complexity, matching the lower bound with only logarithmic factors and an exponential dependence on G_max, thus eliminating the additional exponential factor seen in prior work.
Methodology
The authors utilize a forward-model based approach, adapting KL-based exploration bonuses to the entropic risk measure. They derive sharper concentration bounds and propose a new stopping rule to optimize sample complexity in identifying the best policy.
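For reference, the entropic risk objective being optimized is U_β(G) = (1/β) · log E[exp(βG)]; the snippet below evaluates it on toy return samples and shows how β < 0 penalizes downside risk. This only illustrates the objective, not the Entropic-BPI algorithm itself, and the return distribution is made up.

```python
import numpy as np

def entropic_value(returns, beta):
    """Entropic risk measure: U_beta(G) = (1 / beta) * log E[exp(beta * G)].
    beta < 0 is risk-averse, beta > 0 risk-seeking, beta -> 0 recovers the mean."""
    returns = np.asarray(returns, dtype=float)
    return np.log(np.mean(np.exp(beta * returns))) / beta

rng = np.random.default_rng(0)
G = rng.normal(loc=1.0, scale=2.0, size=100_000)   # toy return samples
print(f"mean={G.mean():.3f}  "
      f"risk-averse(beta=-1)={entropic_value(G, -1):.3f}  "
      f"risk-seeking(beta=+1)={entropic_value(G, +1):.3f}")
```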
Results
The paper successfully closes the gap between lower and upper bounds on sample complexity for entropic best policy identification, demonstrating that the proposed algorithm achieves optimal sample complexity with improved bounds based on G_max.
Implications
The findings have significant implications for risk-sensitive decision-making in various fields, including finance and robotics, where understanding and managing downside risk is crucial. The proposed methods can enhance the efficiency of reinforcement learning algorithms in uncertain environments.
WriteSAE: Sparse Autoencoders for Recurrent State
NLP
Large Language Models
Theory
- WriteSAE is the first sparse autoencoder that effectively addresses matrix cache write operations in recurrent language models.
- The method allows for closed-form predictions of logit shifts, achieving high accuracy (R² = 0.98).
- Substitution of learned rank-1 atoms consistently outperforms traditional matched-norm ablation tests.
- WriteSAE demonstrates significant improvements in performance metrics, including a 3× lift in midrank target-in-continuation tasks.
Summary
This paper introduces WriteSAE, a novel sparse autoencoder designed to enhance the performance of state-space and hybrid recurrent language models by addressing the limitations of existing sparse autoencoders (SAEs) that primarily read residual streams. WriteSAE innovatively decomposes and edits the matrix cache write operations of recurrent models like Gated DeltaNet, Mamba-2, and RWKV-7, which utilize rank-1 updates that cannot be replaced by vector atoms. The proposed method factors each decoder atom into a native write shape, allowing for a closed-form expression for per-token logit shifts and training under a matched Frobenius norm. The results demonstrate that atom substitution significantly outperforms matched-norm ablation in 92.4% of cases across 4,851 firings, achieving an R² of 0.98 for the closed-form predictions. Furthermore, WriteSAE shows sustained improvements in target-in-continuation tasks, marking a significant advancement in the behavioral installation at the matrix-recurrent write site.
Methodology
The methodology involves the development of a sparse autoencoder that factors decoder atoms into rank-1 outer products, allowing for targeted cache-slot substitutions. The model is trained using a matched Frobenius norm, and a three-factor closed form is derived to predict logit shifts based on observable quantities from forward passes. The performance of WriteSAE is evaluated through various tests, including cache-slot substitution and matched substitution tests across different architectures.
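The core intervention can be pictured as replacing one rank-1 cache write with a learned rank-1 atom rescaled to the same Frobenius norm. The toy sketch below illustrates only that substitution step; the vector names, dimensions, and the omission of the architectures' gating and decay terms are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 8, 16

# Hypothetical rank-1 cache write k v^T (delta-rule / linear-attention style).
k, v = rng.normal(size=d_k), rng.normal(size=d_v)
write = np.outer(k, v)

# Hypothetical learned decoder atom factored into its rank-1 "write shape" a b^T.
a, b = rng.normal(size=d_k), rng.normal(size=d_v)
atom = np.outer(a, b)

# Matched-Frobenius-norm substitution: rescale the atom so it injects the same
# write magnitude as the update it replaces.
scale = np.linalg.norm(write, "fro") / np.linalg.norm(atom, "fro")
substituted_write = scale * atom

# Apply to a matrix cache state (gating and decay terms omitted).
S = np.zeros((d_k, d_v))
S_edited = S + substituted_write
assert np.isclose(np.linalg.norm(substituted_write, "fro"),
                  np.linalg.norm(write, "fro"))
```

Matching the Frobenius norm keeps the intervention's magnitude comparable to the write it replaces, so any behavioral change can be attributed to the atom's direction rather than its scale.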
Results
The results indicate that WriteSAE's atom substitution outperforms matched-norm ablation in 92.4% of the tested cases. The closed-form predictions for logit shifts achieved a median R² of 0.98 across 200 atom-by-ε cells. Additionally, the method demonstrated a significant increase in performance for target-in-continuation tasks, achieving a 100% success rate under greedy decoding for midrank targets.
Implications
The findings suggest that WriteSAE can enhance the performance of recurrent language models by improving how they manage matrix cache writes. This could lead to more efficient training and better performance in various natural language processing tasks, particularly in applications requiring real-time updates and adaptations.
Uncertainty-Aware Prediction of Lung Tumor Growth from Sparse Longitudinal CT Data via Bayesian Physics-Informed Neural Networks
Time Series
- Introduces a Bayesian physics-informed framework for tumor growth prediction under sparse CT data.
- Combines mechanistic Gompertz constraints with probabilistic inference for improved prediction accuracy.
- Utilizes a two-stage procedure, MAP initialization followed by HMC sampling, for stable and efficient posterior inference.
- Demonstrates the model's capability to provide calibrated uncertainty estimates alongside predictions.
Read more
Uncertainty-Aware Prediction of Lung Tumor Growth from Sparse Longitudinal CT Data via Bayesian Physics-Informed Neural Networks
Summary
This paper addresses the challenge of predicting lung tumor growth using sparse and irregular longitudinal CT data, which often suffers from measurement variability. The authors propose a novel Bayesian physics-informed neural network (PINN) framework that integrates Gompertz growth dynamics with low-dimensional Bayesian inference in the log-volume domain. The framework employs a two-stage inference strategy that combines maximum a posteriori (MAP) estimation and Hamiltonian Monte Carlo (HMC) sampling to estimate posterior predictive distributions and quantify uncertainty. The model was evaluated using longitudinal data from the National Lung Screening Trial involving 30 patients. The results demonstrate that the proposed method effectively captures heterogeneous tumor growth patterns while providing calibrated uncertainty estimates, which are crucial for clinical decision-making. The model achieved a cohort-level log-space RMSE of approximately 0.20 and maintained well-calibrated 95% credible interval coverage across the patient cohort. These findings indicate that the Bayesian physics-informed modeling approach is promising for uncertainty-aware tumor growth assessment, particularly in scenarios with limited longitudinal follow-up scans.
Methodology
The authors developed a Bayesian physics-informed neural network that integrates Gompertz growth dynamics with Bayesian inference. The methodology includes a two-stage inference strategy involving MAP initialization followed by HMC sampling to estimate posterior distributions and quantify uncertainty in tumor growth predictions.
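The mechanistic constraint is the Gompertz law, which becomes a linear ODE in the log-volume variable y = ln V: dy/dt = α(ln K − y), with closed-form solution y(t) = ln K + (y0 − ln K)·exp(−α·t). The sketch below only evaluates that forward model at sparse scan times; the parameter values, units, and patient data are illustrative, and the MAP/HMC inference layer described above is not shown.

```python
import numpy as np

def gompertz_log_volume(t, y0, alpha, logK):
    """Gompertz growth in log-volume space: dy/dt = alpha * (logK - y),
    whose closed form is y(t) = logK + (y0 - logK) * exp(-alpha * t)."""
    return logK + (y0 - logK) * np.exp(-alpha * np.asarray(t, dtype=float))

# Hypothetical sparse follow-up scan times (days since baseline) for one patient.
scan_times = [0.0, 180.0, 400.0]
y_pred = gompertz_log_volume(scan_times, y0=np.log(250.0),
                             alpha=2e-3, logK=np.log(4000.0))
print(np.exp(y_pred))  # predicted tumor volumes at the scan times
```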
Results
The proposed framework successfully captured heterogeneous tumor growth patterns and achieved a cohort-level log-space RMSE of approximately 0.20. It also provided well-calibrated 95% credible interval coverage across the 30 patients, indicating reliable uncertainty quantification.
Implications
The findings suggest that Bayesian physics-informed modeling can significantly enhance uncertainty-aware tumor growth assessment, which is critical for clinical decision-making, especially when only limited longitudinal follow-up scans are available.
Discovery of Hidden Miscalibration Regimes
Large Language Models
NLP
Interpretability
- Introduces the concept of hidden miscalibration regimes that are not detectable through traditional calibration methods.
- Defines an input-dependent miscalibration field to measure calibration error across the input space.
- Demonstrates the prevalence of calibration heterogeneity in large language models across various datasets.
- Provides a diagnostic framework that supports local confidence corrections, enhancing model reliability.
Read more
Discovery of Hidden Miscalibration Regimes
Summary
This paper addresses the issue of model calibration in machine learning, particularly focusing on the limitations of traditional calibration evaluation methods that rely solely on confidence scores. The authors argue that such methods can obscure significant calibration failures by treating all inputs with the same confidence as exchangeable, leading to a lack of insight into the model's performance across different input types. To tackle this problem, the authors propose a novel framework for discovering hidden miscalibration regimes without requiring predefined data slices. They introduce the concept of a miscalibration field, which captures the signed calibration error across the input space. By learning a calibration-aware representation of the input space, the framework enables the identification of regions where models are systematically overconfident or underconfident. The authors validate their approach through synthetic experiments and a large-scale study involving four real-world benchmarks and twelve large language models (LLMs). The findings reveal that input-dependent calibration heterogeneity is common, and the discovered miscalibration fields can inform targeted local confidence corrections, improving model reliability in areas where traditional methods fall short.
Methodology
The authors develop a diagnostic framework that learns a representation of the input space, allowing for the estimation of a miscalibration field. This field captures local calibration errors by averaging residuals in a learned geometry, facilitating the identification of coherent regions of overconfidence and underconfidence without relying on predefined data slices.
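As a rough illustration of the residual-averaging idea, the sketch below estimates a signed local calibration error by averaging (confidence − correctness) over nearest neighbors in an embedding space. It uses a plain k-nearest-neighbor average in a generic representation rather than the paper's learned calibration-aware geometry, and all inputs are synthetic placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def miscalibration_field(embeddings, confidences, correct, k=50):
    """Signed local calibration error: average (confidence - correctness) over
    the k nearest neighbors of each point in the representation space.
    Positive = locally overconfident, negative = locally underconfident."""
    residuals = np.asarray(confidences, float) - np.asarray(correct, float)
    nn = NearestNeighbors(n_neighbors=k).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    return residuals[idx].mean(axis=1)

# Synthetic placeholders standing in for model outputs on an evaluation set.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2_000, 32))           # input representations
conf = rng.uniform(0.5, 1.0, size=2_000)   # predicted confidences
correct = rng.integers(0, 2, size=2_000)   # 1 if the prediction was right
field = miscalibration_field(Z, conf, correct)
print(field.min(), field.max())
```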
Results
The framework successfully identifies hidden miscalibration structures in controlled synthetic experiments and reveals significant calibration heterogeneity in real-world LLM benchmarks. The results indicate that some model-dataset pairs exhibit pronounced regions of miscalibration, and the learned fields effectively reduce calibration error in these areas, outperforming traditional confidence-based methods.
Implications
The findings suggest that understanding and addressing hidden miscalibration regimes can lead to more reliable machine learning models, particularly in complex domains like natural language processing. The proposed framework can be applied to improve model calibration in various applications, enhancing decision-making processes that rely on model predictions.
Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks
Time Series
- Introduces a novel deep learning model for downscaling LST from geostationary observations to the resolution of polar-orbiting satellite data.
- Achieves high accuracy in LST forecasting with low RMSE and bias errors.
- Demonstrates the applicability of the model across major European cities.
- Provides a framework for intraday LST nowcasting, enhancing urban climate studies.
Read more
Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks
Summary
This paper addresses the challenge of accurately forecasting Land Surface Temperature (LST) in urban areas by combining satellite data with high spatial and high temporal resolution. The authors propose a two-step approach: first, a U-Net model is developed to downscale LST fields from geostationary satellites (SEVIRI/MSG) to a higher resolution (1 km) using collocated data from polar-orbiting satellites (Terra/Aqua MODIS). This model is trained on LST data from major European cities with populations over 1 million, achieving an RMSE of 1.92 °C and a mean bias error of 0.01 °C on the test set. The second step involves a nowcasting model based on a ConvLSTM architecture, which forecasts LST fields for lead times of 15 to 75 minutes. This model outperforms traditional benchmarks, yielding RMSEs between 0.57 and 1.15 °C and biases ranging from -0.1 to 0.14 °C. Validation against independent MODIS overpasses confirms the robustness of the forecasts. This research represents a significant advancement in urban climate monitoring, providing a practical tool for operational satellite-based LST monitoring.
Methodology
The study employs a U-Net architecture for downscaling LST fields from SEVIRI/MSG to MODIS resolution and a ConvLSTM model for nowcasting LST fields. The models are trained on extensive datasets from large European cities, focusing on both spatial and temporal resolution improvements.
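The nowcasting component follows the common pattern of mapping a short sequence of past frames to the next frame with convolutional LSTM layers. The sketch below shows only that generic pattern; the frame size, sequence length, depth, and layer widths are illustrative assumptions rather than the authors' configuration.

```python
import tensorflow as tf

# Map a short sequence of past LST frames to the next frame.
past_frames, height, width = 5, 64, 64  # illustrative sizes, not the paper's

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(past_frames, height, width, 1)),
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=True),
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=False),
    tf.keras.layers.Conv2D(1, kernel_size=1, padding="same"),  # predicted next LST field
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```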
Results
The downscaling model achieved an RMSE of 1.92 °C and a mean bias error of 0.01 °C. The nowcasting model outperformed benchmarks with RMSEs of 0.57 to 1.15 °C for lead times of 15 to 75 minutes, demonstrating effective short-term forecasting capabilities.
Implications
The proposed models can significantly enhance urban climate monitoring, allowing for better management of urban heat islands and related phenomena. The high-resolution, high-frequency LST data can be utilized in various applications, including energy demand forecasting, ecosystem monitoring, and climate change studies.
RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems
Theory
Interpretability
- RISED Framework introduces a five-dimension evaluation for clinical AI systems.
- Framework identifies critical deployment risks not captured by traditional metrics.
- Validation across multiple cohorts shows varying failure patterns, supporting construct validity.
- Equity dimension highlights the need for independent measures of clinical need.
Read more
RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems
Summary
The paper introduces the RISED Framework, a comprehensive five-dimension pre-deployment evaluation approach for clinical AI decision-support systems. Traditional evaluation metrics often fail to capture critical deployment-phase failures such as input reliability, subgroup equity, threshold sensitivity, and operational feasibility. The RISED Framework encompasses five dimensions: Reliability, Inclusivity, Sensitivity, Equity, and Deployability, each defined by formal sub-criteria and pass/fail thresholds. The framework employs bias-corrected accelerated bootstrap confidence intervals to assess each dimension, allowing for a quantitative verdict on the model's readiness for clinical deployment. The author demonstrates that even classifiers meeting conventional high-discrimination benchmarks can fail in critical areas, highlighting the need for a more nuanced evaluation approach. The framework was validated across synthetic and real-world cohorts, revealing varying failure patterns across datasets, thus providing preliminary evidence of its construct validity. Additionally, the Equity dimension is reframed to address proxy-dependence issues, emphasizing the importance of independent measures of clinical need. RISED is made available as an open-source Python package, facilitating the transition from in-silico validation to clinical evaluation.
Methodology
The RISED Framework operationalizes five evaluation dimensions through measurable sub-criteria, employing bootstrap 95% confidence intervals to derive PASS, FAIL, and INCONCLUSIVE verdicts. The framework was validated using a synthetic cohort and three real-world datasets spanning 35 years of clinical data.
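The verdict logic can be approximated as comparing a bootstrap confidence interval for a dimension's metric against its pass/fail threshold. The sketch below uses SciPy's BCa bootstrap for that comparison; the metric, threshold, and data are illustrative, and this is not the RISED package's API.

```python
import numpy as np
from scipy import stats

def dimension_verdict(per_case_scores, threshold, confidence_level=0.95):
    """PASS / FAIL / INCONCLUSIVE verdict for one dimension: compare a BCa
    bootstrap confidence interval for the mean score (higher = better)
    against a pass/fail threshold."""
    res = stats.bootstrap((np.asarray(per_case_scores, float),), np.mean,
                          confidence_level=confidence_level, method="BCa")
    ci = res.confidence_interval
    if ci.low >= threshold:
        return "PASS"
    if ci.high < threshold:
        return "FAIL"
    return "INCONCLUSIVE"  # the interval straddles the threshold

# Illustrative use: per-case sensitivity scores against a 0.80 threshold.
rng = np.random.default_rng(1)
scores = rng.beta(9, 2, size=300)
print(dimension_verdict(scores, threshold=0.80))
```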
Results
The evaluation revealed that, despite an AUROC of 0.961, the assessed classifier failed two dimensions and was statistically inconclusive on a third. The framework demonstrated that reliability and sensitivity can be compromised even in high-performing models, underscoring the necessity of comprehensive pre-deployment assessment.
Implications
The RISED Framework provides a structured approach for evaluating clinical AI systems before deployment, potentially improving patient safety and care quality by identifying risks that traditional metrics overlook. Its open-source nature allows for widespread adoption and adaptation in clinical settings.