AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 43
Update frequency: 8h
Days of history: 7
Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation
Computer Vision
Generative Models
Optimization
- Development of an optimization-grade de-biased VLM-as-3D-judge protocol.
- Identification and correction of three failure modes in the judging process.
- Lightweight adaptation methods achieve parity with a strong base but do not exceed it.
- The study emphasizes the need for engineered quality-contrastive signals in training.
Summary
This paper builds upon a previous study that established a de-biased VLM-as-3D-judge for evaluating single-image-to-3D mesh quality. The authors explore whether the preferences of this judge can be utilized to optimize a strong open generator, TRELLIS, specifically for furniture assets without relying on human labels. The main contribution is the development of an optimization-grade protocol that enhances the judge's reliability by separating training and evaluation judges, correcting position bias, and addressing three identified failure modes. The study finds that while the adapted generator matches the strong base in performance, it does not exceed it, indicating that lightweight adaptation methods may not be sufficient for significant improvements. The results highlight the importance of engineering a quality-contrastive signal for effective training and provide insights into the mechanistic aspects of the optimization process.
Methodology
The authors employed a de-biased VLM-as-3D-judge protocol that involved using distinct judges for training (Qwen2.5-VL-7B) and evaluation (InternVL3-8B). They implemented position-bias correction and addressed failure modes such as image overload and geometry defects. The study involved six adaptation methods across two input regimes and a severity sweep to evaluate the performance of the TRELLIS generator.
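One common way to correct position bias in a pairwise judge is to query it in both presentation orders and count a win only when the verdict is consistent. The sketch below illustrates that idea and the order-flip rate; it is not the paper's protocol. The `Judge` signature, the toy judges, and the mesh names are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: returns "first" or "second" for the preferred item.
Judge = Callable[[str, str], str]

def debiased_preference(judge: Judge, a: str, b: str) -> Tuple[float, bool]:
    """Ask the judge in both orders.

    Returns (score for a, flipped). The score is 1.0 if a wins in both
    orders, 0.0 if b wins in both, and 0.5 if the verdict depends on order.
    """
    a_wins_ab = judge(a, b) == "first"
    a_wins_ba = judge(b, a) == "second"
    if a_wins_ab == a_wins_ba:
        return (1.0 if a_wins_ab else 0.0), False
    return 0.5, True

def summarize(judge: Judge, pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (win-rate of the first item in each pair, order-flip rate)."""
    results = [debiased_preference(judge, a, b) for a, b in pairs]
    win_rate = sum(score for score, _ in results) / len(results)
    flip_rate = sum(flipped for _, flipped in results) / len(results)
    return win_rate, flip_rate

# Toy judges standing in for a VLM call (assumptions for illustration only).
def always_first(x: str, y: str) -> str:
    return "first"  # pure position bias

def longer_name(x: str, y: str) -> str:
    return "first" if len(x) > len(y) else "second"  # order-independent rule

pairs = [("candidate_mesh_long", "base"), ("cand", "baseline_mesh")]
biased = summarize(always_first, pairs)
consistent = summarize(longer_name, pairs)
```

A purely position-biased judge ends up with a 0.5 win-rate and a flip rate of 1.0, which is the signature of a pair set that carries no learnable preference.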
Results
The adapted TRELLIS generator reached parity with the strong base (0.50 win-rate) under severe degradation conditions but fell short of the target win-rate of ≥0.65. The training judge showed a high win-rate (0.89) on quality-contrastive pairs, while independent samples showed a high order-flip rate (0.94), indicating a lack of learnable preference.
Implications
The findings suggest that achieving significant improvements in single-image-to-3D generation may require more than lightweight adaptation techniques. The optimization-grade judge protocol can be reused for future studies, potentially leading to better training signals and improved generator performance in specialized applications.
Predictability as a Fine-Grained Measure for Privacy
Theory
Efficient ML
Optimization
- Introduces predictability as a new privacy metric that incorporates the attacker's knowledge and sensitive queries.
- Demonstrates that predictability and differential privacy are generally incomparable but can align under specific conditions.
- Develops a GMM framework for analyzing asymptotic predictability in the context of compromised data.
- Proposes an improved output perturbation scheme for ERM that enhances accuracy compared to traditional isotropic perturbation.
Summary
This paper introduces a novel framework for assessing privacy in machine learning through the concept of 'predictability.' Unlike traditional differential privacy (DP), which provides worst-case privacy guarantees, predictability offers a fine-grained measure that accounts for the attacker's knowledge and the specific sensitive queries of interest. The authors argue that predictability can reveal how much an attacker can infer about unknown individuals after observing the algorithm's output, beyond what they could deduce from compromised data. They demonstrate that predictability and DP are generally incomparable, but under certain worst-case scenarios, predictability can imply mutual-information DP. The paper presents a generalized method of moments (GMM) framework to analyze asymptotic predictability, particularly when the compromised data is generated by a stationary, ergodic, mixing process. Furthermore, the authors propose a predictability-calibrated output perturbation scheme for empirical risk minimization (ERM), which adapts noise based on loss curvature and attacker knowledge, leading to improved accuracy in linear regression tasks. The framework is also extended to handle various query families and scenarios where the attacker's process is unknown.
Methodology
The authors utilize the generalized method of moments (GMM) to analyze the asymptotic predictability of algorithms with respect to sensitive queries. They derive a predictability-calibrated output perturbation scheme that adjusts noise based on the characteristics of the loss function and the attacker's knowledge.
Results
The study shows that predictability provides a finer-grained privacy metric compared to differential privacy, allowing for tailored privacy guarantees based on specific attacker models and sensitive queries. The proposed perturbation scheme yields improved accuracy bounds for linear regression tasks compared to isotropic noise injection.
Implications
This work has significant implications for privacy-preserving machine learning, suggesting that practitioners can achieve better privacy-accuracy trade-offs by adopting predictability as a guiding principle. It opens avenues for developing more nuanced privacy mechanisms that are sensitive to the context of data breaches and attacker capabilities.
An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling
Generative Models
Graph Learning
Theory
- Introduces a novel framework for graph novelty generation based on latent mixture modeling.
- Imposes explicit novelty and reliability conditions using the Minimum Description Length principle.
- Provides theoretical guarantees for controlling misclassification probabilities.
- Demonstrates effectiveness through empirical experiments on synthetic and real-world datasets.
Summary
This paper introduces an information-theoretic framework for generating novel graphs that are distinct from existing patterns while maintaining global structural consistency. The authors propose a method that embeds data into a latent space and models the latent distribution using finite mixture models (FMM). Novelty is enforced by ensuring that generated samples are poorly explained by existing mixture components, while reliability is maintained through the Minimum Description Length (MDL) principle. The framework includes a theoretical analysis that guarantees the convergence of misclassification probabilities to zero under appropriate threshold choices. Experiments conducted on synthetic and benchmark graph datasets demonstrate the effectiveness of the proposed method in achieving principled novelty generation with quantifiable risk. The paper distinguishes novelty generation from related concepts like data augmentation and extrapolation, emphasizing its importance in applications requiring genuine creativity.
Methodology
The proposed methodology involves embedding data into a latent space and modeling the latent distribution with finite mixture models. Novelty is enforced by requiring generated samples to be poorly explained by existing components, while reliability is ensured by bounding the global structure's description length. An MDL-guided sampling scheme is introduced to balance these conditions, and theoretical analysis is provided to support the framework.
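The two acceptance conditions can be illustrated with a one-dimensional toy. In this minimal sketch, a candidate counts as novel when every existing mixture component needs a long code to describe it, and as reliable when a global model describes it with a short code. The single-Gaussian global model, the thresholds, and all function names are assumptions for illustration, not the paper's construction.

```python
import math

def gaussian_code_length(x, mean, std):
    """Code length in nats: -log N(x | mean, std^2)."""
    return 0.5 * math.log(2 * math.pi * std ** 2) + (x - mean) ** 2 / (2 * std ** 2)

def is_novel(x, components, tau_novel):
    """Novel if even the best-fitting (weight, mean, std) component explains x poorly."""
    best = min(-math.log(w) + gaussian_code_length(x, m, s) for w, m, s in components)
    return best >= tau_novel

def is_reliable(x, global_mean, global_std, tau_rel):
    """Reliable if the global model still describes x with a short code."""
    return gaussian_code_length(x, global_mean, global_std) <= tau_rel

# Two existing latent clusters and a broad global model (toy values).
components = [(0.5, -2.0, 0.5), (0.5, 2.0, 0.5)]
global_mean, global_std = 0.0, 2.5
candidates = [-2.1, 0.0, 2.0, 12.0]

accepted = [
    x for x in candidates
    if is_novel(x, components, tau_novel=4.0)
    and is_reliable(x, global_mean, global_std, tau_rel=4.0)
]
```

Here only the point between the two clusters passes both checks: points near a cluster center are not novel, and the far outlier is novel but not reliable.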
Results
The experiments show that the proposed method effectively controls novelty and reliability through varying threshold parameters, outperforming competing methods. The theoretical analysis confirms that the probabilities of misclassifying non-novel or unreliable samples converge to zero with explicit rates, validating the framework's robustness.
Implications
The framework has significant implications for applications requiring the generation of genuinely novel data, such as community formation and material design. It provides a rigorous mathematical foundation for novelty generation, enabling the development of principled algorithms with theoretical guarantees.
Computational Identifiability
Theory
- Introduction of computational identifiability as a practical alternative to theoretical identifiability.
- Formalization of the connection between causal effect estimation and meta-learning.
- Empirical demonstration of the framework across various complex identification scenarios.
- Focus on finite computational procedures rather than idealized mathematical derivations.
Summary
This paper introduces the concept of 'computational identifiability,' which contrasts with traditional theoretical identifiability in causal inference. The authors argue that while theoretical identifiability relies on idealized conditions such as infinite data, computational identifiability focuses on practical, finite computational procedures to determine if a causal effect can be estimated within a specified error tolerance. The framework involves defining a meta-prior over parameters and a hypothesis space of estimators, allowing for empirical identification even in complex scenarios with limited data. The authors demonstrate the utility of this approach through experiments that address identification challenges in small sample sizes, ambiguous graphical criteria, and mixed observational-interventional data. The proposed method aims to provide actionable insights into causal inference, making it relevant for real-world applications where traditional methods may fall short.
Methodology
The authors propose a framework for computational identifiability that involves defining a meta-prior distribution over structural causal models (SCMs) and a hypothesis space of estimators. They establish criteria for successful identification based on finite sample sizes, error tolerances, and confidence bounds, allowing for empirical estimations of causal effects.
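The finite-sample criterion can be read as a Monte Carlo check: draw SCMs from the meta-prior, simulate a dataset of size n, run an estimator, and ask whether the error stays within the tolerance with high enough probability. The sketch below uses a toy linear SCM with an optional hidden confounder and a naive regression slope. The meta-prior, the estimator, and the values of `eps` and `delta` are all assumptions for illustration.

```python
import random

def sample_scm(rng, confounded):
    """Toy meta-prior: causal effect beta and confounding strength c."""
    beta = rng.uniform(-1.0, 1.0)
    c = rng.uniform(0.5, 1.5) if confounded else 0.0
    return beta, c

def simulate(rng, beta, c, n):
    """Draw n samples of (x, y) from z -> x, z -> y, x -> y with z unobserved."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        x = c * z + rng.gauss(0, 1)
        y = beta * x + c * z + rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return xs, ys

def slope(xs, ys):
    """Naive estimator: ordinary least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def success_rate(confounded, n=200, eps=0.3, trials=300, seed=0):
    """Fraction of sampled SCMs whose effect is recovered within eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        beta, c = sample_scm(rng, confounded)
        xs, ys = simulate(rng, beta, c, n)
        hits += abs(slope(xs, ys) - beta) <= eps
    return hits / trials

delta = 0.05
identifiable_without_confounding = success_rate(False) >= 1 - delta
identifiable_with_confounding = success_rate(True) >= 1 - delta
```

The same estimator passes the check under one meta-prior and fails under the other, so the verdict depends on the pairing of prior, estimator class, sample size, and tolerance rather than on a graphical criterion alone.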
Results
The experiments conducted show that the computational identifiability framework successfully addresses identification questions in scenarios with small sample sizes, ambiguous graphical criteria, and mixed data types. The results indicate that practical identification is achievable even when traditional theoretical methods may not apply.
Implications
The concept of computational identifiability has significant implications for causal inference in real-world applications, particularly in fields where data is limited or complex. This framework can enhance the ability to derive actionable insights from empirical data, making it a valuable tool for researchers and practitioners in causal analysis.
Self-Adaptive Scale Handling for Forecasting Time Series with Scale Heterogeneity
Time Series
- Introduces a self-Adaptive Scale-handling (AS) module for scale-heterogeneous time series forecasting.
- The Scale Calibrating (SC) sub-module calibrates prior scaling factors to reduce inverse-scaling errors.
- The Scaling Selection (SS) sub-module autonomously decides on the use of calibrated or original scaling factors.
- Demonstrates significant performance improvements on real-world datasets.
Summary
This paper addresses the challenge of forecasting time series data that exhibit scale heterogeneity, where different time series can differ significantly in their numerical magnitudes. Traditional time series forecasting methods often assume scale-homogeneous data, leading to performance degradation when applied to scale-heterogeneous scenarios. The authors propose a novel self-Adaptive Scale-handling (AS) module that learns adaptive scale factors for each input time series, thereby preserving semantic discriminability and reducing inverse-scaling errors. The AS module consists of two components: Scale Calibrating (SC), which calibrates prior mean scaling factors using neural networks, and Scaling Selection (SS), which determines whether to apply the calibrated factor or retain the original one. The proposed method was tested on real-world datasets from Ant Fortune and Alipay, demonstrating consistent improvements in forecasting performance when integrated with popular time series forecasting models.
Methodology
The authors developed the AS module, which includes two sub-modules: SC for calibrating scale factors using neural networks, and SS for selecting whether to use calibrated or original scale factors based on a Bernoulli distribution parameterized via Gumbel-Softmax. This approach allows the model to adaptively handle scale differences while maintaining the integrity of the data's semantic meaning.
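A minimal sketch of the selection step, assuming mean-absolute-value scaling as the prior factor and a fixed log-offset standing in for the SC network's output. The Gumbel-Softmax draw is the standard relaxed categorical sample over two options (calibrated vs. original); the function names, logits, and temperature are hypothetical.

```python
import math
import random

def mean_scale(series):
    """Prior scaling factor: mean absolute value of the input window."""
    scale = sum(abs(v) for v in series) / len(series)
    return scale if scale > 0 else 1.0

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed categorical sample with temperature tau."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_scale(series, log_offset, select_logits, tau, rng):
    """Blend calibrated and prior factors with a relaxed Bernoulli choice."""
    prior = mean_scale(series)
    calibrated = prior * math.exp(log_offset)  # stand-in for the SC output
    use_calibrated, use_prior = gumbel_softmax(select_logits, tau, rng)
    return use_calibrated * calibrated + use_prior * prior

rng = random.Random(0)
series = [1200.0, 1500.0, 900.0]
scale = adaptive_scale(series, log_offset=0.1, select_logits=[2.0, 0.0], tau=0.5, rng=rng)
scaled = [v / scale for v in series]    # model input
restored = [v * scale for v in scaled]  # inverse scaling of the forecast
```

Lowering `tau` pushes each sample toward a hard choice between the two factors while keeping the selection differentiable during training.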
Results
Experiments conducted on real-world fund sales datasets showed that the AS module, when integrated with existing time series forecasting models, consistently outperformed traditional scaling methods. The results highlighted improvements in forecasting accuracy and robustness against scale-related errors.
Implications
The findings suggest that the AS module can significantly enhance the performance of time series forecasting models in industrial applications where scale heterogeneity is prevalent, such as finance and e-commerce. This approach could lead to better data utilization and improved forecasting accuracy across various domains.
Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach
Federated Learning
Graph Learning
Multimodal
- Introduces a novel framework, FedMGS, for addressing modality imbalance in federated graph learning.
- Identifies and characterizes two types of modality imbalance: client-level and node-level.
- Employs a graph-aware approach to recover missing modalities while preserving semantic integrity.
- Demonstrates significant performance improvements over existing methods in various tasks.
Summary
This paper addresses the challenges of modality imbalance in MultiModal Federated Graph Learning (MM-FGL), which arises when multimodal graph data is distributed across different clients with varying availability of modalities. The authors identify two types of modality imbalance: client-level imbalance, where certain clients lack entire modalities, and node-level imbalance, where individual nodes miss specific attributes. Existing methods primarily focus on centralized or graph-agnostic scenarios, making them unsuitable for federated settings. To tackle these issues, the authors propose FedMGS (Federated Modality-aware Graph Synthesis), a framework that synthesizes missing modal semantics within the representation space while maintaining alignment with the original data's semantic distribution. FedMGS incorporates three key components: an availability-aware graph encoder to prevent contamination from missing modalities, a prototype-guided latent semantic synthesizer to create semantic anchors for unavailable modalities, and a reliability-calibrated semantic fusion mechanism to regulate the impact of synthesized representations. Experimental results demonstrate that FedMGS significantly outperforms competitive baselines, achieving improvements of up to 17.41% in performance while maintaining efficiency.
Methodology
The methodology involves a client-server framework where FedMGS integrates three components: an availability-aware graph encoder to filter out missing modalities during local training, a prototype-guided latent semantic synthesizer to generate missing representations based on cross-client prototypes, and a reliability-calibrated semantic fusion mechanism to manage the contribution of synthesized semantics before final predictions.
Results
Experiments on four tasks show that FedMGS consistently outperforms competitive baselines, with performance gains of up to 17.41% while remaining efficient.
Implications
The findings suggest that FedMGS can enhance collaborative learning in environments where data privacy and modality availability are significant concerns, making it applicable in fields such as healthcare, finance, and any domain requiring federated learning with multimodal data.
Uncertainty-Aware Reward Modeling for Stable RLHF
Reinforcement Learning
Large Language Models
Optimization
- UARM introduces calibrated uncertainty in reward models to signal prediction reliability.
- The method employs quantile regression and conformal prediction for uncertainty estimation.
- Heteroscedastic advantage reweighting suppresses the influence of unreliable reward signals.
- Experiments show significant improvements in reward model calibration and alignment quality.
Summary
This paper addresses critical challenges in Reinforcement Learning from Human Feedback (RLHF), particularly the inability of reward models to signal prediction uncertainty and the amplification of unreliable reward signals in group-based policy optimization methods like Group Relative Policy Optimization (GRPO). The authors propose Uncertainty-Aware Reward Modeling (UARM), which integrates calibrated uncertainty into reward models using quantile-based conformal prediction and employs a heteroscedastic variance decomposition approach to reweight advantages in GRPO. The proposed method enhances the reliability of reward estimates, reduces the risk of reward hacking, and improves the alignment quality of language models. The experimental results across three datasets (HelpSteer, UltraFeedback, and PKU-SafeRLHF) demonstrate that UARM significantly outperforms standard GRPO and other uncertainty-agnostic baselines in terms of reward model calibration and alignment quality.
Methodology
The proposed UARM framework operates in two phases: an offline phase where a quantile regression estimator is trained to provide conditional quantiles of the reward distribution, and an online phase where the width of prediction intervals is interpreted as observation noise. This allows for a heteroscedastic variance decomposition that generates sample-specific reliability weights, which are then used to construct a heteroscedastic advantage for GRPO, effectively down-weighting high-uncertainty samples.
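The two phases can be sketched with split-conformal calibration of quantile intervals followed by a width-based weight. The calibration step below follows the usual conformalized quantile regression recipe. The weight formula (group variance divided by group variance plus squared half-width) is an assumption chosen for illustration, not the paper's decomposition, and all names are hypothetical.

```python
import math
from statistics import mean, pstdev

def conformal_margin(cal_lo, cal_hi, cal_y, alpha):
    """Offline phase: margin that widens quantile intervals to 1 - alpha coverage."""
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else scores[k - 1]

def reliability_weights(widths, rewards, eps=1e-8):
    """Online phase: treat half the interval width as observation noise."""
    group_var = pstdev(rewards) ** 2
    return [group_var / (group_var + (w / 2) ** 2 + eps) for w in widths]

def weighted_advantages(rewards, widths, eps=1e-8):
    """Group-normalized advantages, shrunk for high-uncertainty samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    weights = reliability_weights(widths, rewards)
    return [w * (r - mu) / (sigma + eps) for w, r in zip(weights, rewards)]

# Calibration set: predicted lower/upper quantiles and observed rewards (toy values).
margin = conformal_margin(
    cal_lo=[0.0, 0.0, 0.0, 0.0],
    cal_hi=[1.0, 1.0, 1.0, 1.0],
    cal_y=[0.5, 1.2, -0.1, 1.5],
    alpha=0.2,
)

# One GRPO group: the third response has a much wider calibrated interval.
rewards = [1.0, 0.0, 1.0, 0.0]
widths = [0.2, 0.2, 3.0, 0.2]
weights = reliability_weights(widths, rewards)
advantages = weighted_advantages(rewards, widths)
```

The first and third responses receive the same raw reward, but the third contributes a far smaller advantage because its interval is wide.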
Results
The experiments conducted across three preference datasets demonstrate that UARM significantly improves the calibration of reward models, reduces instances of reward hacking, and enhances the overall alignment quality of the language models when compared to standard GRPO and other uncertainty-agnostic approaches.
Implications
The findings suggest that incorporating uncertainty awareness into reward modeling can lead to more stable and reliable reinforcement learning systems, particularly in applications involving human feedback. This could enhance the deployment of large language models in real-world scenarios where alignment with human values is critical.
Unsupervised Causal Abstractions Discovery
Theory
Interpretability
Graph Learning
- Introduces an unsupervised method for discovering high-level causal abstractions from low-level measurements.
- Demonstrates that low-rank causal structures can induce identifiable latent variables forming causal abstractions.
- Establishes the 'anchor assumption' for ensuring the uniqueness of latent factors.
- Validates theoretical results through empirical studies, including simulations and DNN analysis.
Summary
This paper addresses the problem of discovering high-level causal abstractions directly from low-level measurements in complex systems. Traditional approaches often rely on hypothesis testing, where an expert proposes a candidate high-level model, which is then evaluated against low-level data. The authors propose a novel unsupervised method that leverages low-rank causal discovery principles to identify causal abstractions. They demonstrate that observations from low-rank graphs can induce latent variables that form a causal abstraction. The paper provides identifiability results for these latent variables under certain assumptions, specifically the 'anchor assumption,' which ensures that each latent factor is uniquely influenced by specific low-level variables. The authors validate their theoretical findings through empirical studies, including simulations and qualitative analyses of abstractions learned from a deep neural network (DNN) performing an arithmetic task. This work contributes to the field of mechanistic interpretability by enabling the unsupervised discovery of high-level causal models, which can facilitate understanding and intervention in complex systems.
Methodology
The authors utilize a framework based on low-rank causal discovery, specifically focusing on factor directed acyclic graphs (f-DAGs). They establish theoretical foundations for identifying causal abstractions from low-level data, proving that the causal discovery algorithm can yield a high-level structural causal model (SCM) that accurately reflects the low-level system's behavior. The anchor assumption is introduced to enhance identifiability, and empirical validation is conducted through simulations and analyses of a DNN.
Results
The paper shows that the proposed unsupervised method successfully identifies causal abstractions from low-level measurements. The theoretical results establish the identifiability of latent factors under the anchor assumption. Empirical studies support the approach in simulations and in a qualitative analysis of abstractions learned from a DNN performing an arithmetic task.
Implications
This research has significant implications for fields requiring mechanistic interpretability, such as neuroscience and artificial intelligence. By enabling the unsupervised discovery of causal abstractions, it allows researchers to better understand complex systems and design interventions based on high-level causal insights.
SEAGAN: domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes
Graph Learning
- SEAGAN improves the identification of biochemical limitation states in A–Ci curves using graph-based approaches.
- The model utilizes domain-specific graph representations and edge attributes to enhance classification accuracy.
- SEAGAN outperforms traditional methods and automated fitting benchmarks, achieving high F1-scores and accuracy.
- The approach reduces the need for manual intervention in estimating photosynthetic parameters.
Summary
This paper introduces SEAGAN, a novel graph attention network (a graph neural network, or GNN, variant) designed for identifying biochemical limitation states along A–Ci curves in plant physiology. The A–Ci curve relates net CO2 assimilation rate (Anet) to leaf intercellular CO2 concentration (Ci) and is critical for estimating photosynthetic parameters. Traditional methods for estimating these parameters often rely on manual or heuristic assignments of data points to biochemical regimes, which can introduce subjectivity and uncertainty. SEAGAN addresses this challenge by framing the limitation-state identification as a graph-based node classification problem, where curve points are represented as nodes. The authors develop domain-specific graph representations using distance-based k-nearest-neighbor (kNN) and auxiliary-signal-guided (ASG) connectivity, with edge attributes capturing pairwise relationships. The framework is evaluated against conventional learning baselines and automated fitting benchmarks, demonstrating that graph-based models significantly enhance classification accuracy, particularly in regions of biochemical transition. SEAGAN achieves an F1-score of 0.857 and an accuracy of 0.882, indicating its effectiveness in improving biochemical limitation-state analysis through edge-aware attention mechanisms.
Methodology
The authors formulated the limitation-state identification as a graph-based node classification problem, creating domain-specific graph representations using distance-based k-nearest-neighbor (kNN) and auxiliary-signal-guided (ASG) connectivity. Edge attributes were used to encode pairwise relationships, and the model was trained with a weighted cross-entropy loss function to optimize classification performance.
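A minimal sketch of the distance-based kNN construction and a class-weighted cross-entropy, assuming each curve point is a (Ci, Anet) pair and that edge attributes hold the coordinate differences plus the distance. The ASG connectivity and the attention layers are not reproduced, and the point values, class weights, and function names are hypothetical.

```python
import math

def knn_edges(points, k):
    """Directed kNN edges (i, j, attrs) with attrs = (dCi, dAnet, distance)."""
    edges = []
    for i, (ci, anet) in enumerate(points):
        neighbors = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(len(points)) if j != i
        )
        for dist, j in neighbors[:k]:
            edges.append((i, j, (points[j][0] - ci, points[j][1] - anet, dist)))
    return edges

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class), normalized by the summed weights."""
    total = sum(-class_weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(class_weights[y] for y in labels)
    return total / norm

# Toy A-Ci curve points as (Ci, Anet); real inputs would be normalized first.
points = [(50.0, 2.0), (100.0, 6.0), (200.0, 12.0), (400.0, 18.0), (800.0, 20.0)]
edges = knn_edges(points, k=2)

# Two-class toy prediction with a heavier weight on the rarer class 1.
loss = weighted_cross_entropy(
    probs=[[0.9, 0.1], [0.2, 0.8]],
    labels=[0, 1],
    class_weights=[1.0, 3.0],
)
```

Without normalization, the Ci axis dominates the distances in this toy, which is one reason a domain-specific connectivity rule can help.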
Results
SEAGAN demonstrated significant improvements in classification performance on a large synthetic dataset, achieving an F1-score of 0.857 and an accuracy of 0.882. The model particularly excelled in accurately classifying points near biochemical transition regions, showcasing the effectiveness of graph-based representations in this context.
Implications
The findings suggest that SEAGAN can enhance the accuracy of biochemical parameter estimation in plant physiology, facilitating high-throughput phenotyping and ecological studies. By automating the identification of biochemical limitation states, the model can streamline research processes and improve the scalability of plant physiological analyses.
ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification
Multimodal
Efficient ML
- ProMUSE reduces reliance on costly MRI and PET imaging by up to 90% while maintaining diagnostic accuracy.
- The framework employs a staged acquisition strategy guided by uncertainty quantification.
- ProMUSE integrates clinical, MRI, and PET data to enhance the predictive performance for AD classification.
- The model demonstrates competitive accuracy across multiple datasets, indicating its robustness and generalizability.
Summary
The paper presents ProMUSE, a novel framework for Alzheimer’s disease (AD) classification that utilizes a progressive, multi-modal, uncertainty-guided approach to improve diagnostic accuracy while minimizing the reliance on costly imaging modalities like MRI and PET. Recognizing that early diagnosis is crucial for effective treatment, the authors highlight the limitations of existing models that assume full modality availability, which can impose significant financial burdens on patients and healthcare systems. ProMUSE begins by performing evidential classification using low-cost clinical data and quantifies uncertainty through a Dirichlet-based subjective logic model. When the uncertainty surpasses a predefined threshold, the model progressively incorporates MRI or PET features, employing Dempster–Shafer theory to fuse modality-specific beliefs and uncertainties. This staged acquisition strategy allows for accurate diagnosis while significantly reducing the need for expensive imaging techniques. The framework was evaluated on three independent datasets: ADNI, AIBL, and OASIS, demonstrating that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%. These findings suggest that ProMUSE is a practical, resource-efficient solution for real-world AD screening.
Methodology
ProMUSE employs a progressive multi-modal framework that starts with low-cost clinical data for initial classification. It quantifies uncertainty using a Dirichlet-based subjective logic model and incorporates additional modalities (MRI/PET) only when necessary, guided by uncertainty thresholds. The fusion of modality-specific beliefs and uncertainties is achieved through Dempster–Shafer theory, allowing for calibrated multi-modal predictions.
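The staged decision can be sketched with the usual subjective-logic mapping from Dirichlet evidence to beliefs and uncertainty, combined with the reduced Dempster rule often used in evidential multi-view classification. Whether ProMUSE uses exactly this form is not stated in the summary. The evidence values, threshold, and function names are hypothetical.

```python
def opinion(evidence):
    """Map non-negative class evidence to (beliefs, uncertainty).

    Uses Dirichlet parameters alpha_k = e_k + 1, so beliefs and
    uncertainty sum to one.
    """
    k = len(evidence)
    strength = sum(evidence) + k
    return [e / strength for e in evidence], k / strength

def combine(op1, op2):
    """Reduced Dempster combination of two opinions over the same classes."""
    (b1, u1), (b2, u2) = op1, op2
    n = len(b1)
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    norm = 1.0 - conflict
    beliefs = [(b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) / norm for i in range(n)]
    return beliefs, u1 * u2 / norm

def staged_predict(clinical_evidence, acquire_imaging, threshold):
    """Start from clinical data; request imaging only if uncertainty is high."""
    beliefs, uncertainty = opinion(clinical_evidence)
    used_imaging = False
    if uncertainty > threshold:
        beliefs, uncertainty = combine(
            (beliefs, uncertainty), opinion(acquire_imaging())
        )
        used_imaging = True
    return beliefs.index(max(beliefs)), uncertainty, used_imaging

# Hypothetical imaging branch returning class evidence for (AD, control).
def mri_evidence():
    return [8.0, 1.0]

uncertain_case = staged_predict([1.0, 0.5], mri_evidence, threshold=0.4)
confident_case = staged_predict([20.0, 1.0], mri_evidence, threshold=0.4)
fused_beliefs, fused_u = combine(opinion([1.0, 0.5]), opinion([8.0, 1.0]))
```

In the uncertain case the fused uncertainty falls below that of either modality alone, while the confident case never triggers the imaging call.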
Results
ProMUSE was tested on three datasets (ADNI, AIBL, OASIS) and achieved competitive or superior accuracy compared to full-modality baselines while significantly reducing the need for MRI and PET imaging by 50-90%. This demonstrates the model's effectiveness in maintaining diagnostic performance while lowering costs.
Implications
The ProMUSE framework has significant implications for clinical practice, particularly in enhancing the accessibility and affordability of early Alzheimer’s disease diagnosis. By reducing the reliance on expensive imaging techniques, it could lead to broader screening and earlier interventions, ultimately improving patient outcomes.
Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures
Generative Models
Optimization
Efficient ML
- Comprehensive performance analysis of Med-DDPM across three NVIDIA GPU architectures.
- Identification of architecture-specific bottlenecks in convolution and normalization processes.
- Implementation of TF32 Tensor Core activation and a 3D channels-last layout as optimizations.
- Significant performance improvements without degrading synthesis quality.
Summary
This paper investigates the performance of Med-DDPM, a state-of-the-art medical diffusion model for 3D MRI synthesis, across three generations of NVIDIA GPU architectures (Volta, Ampere, and Hopper). The authors conduct a comprehensive performance analysis focusing on kernel-level runtime breakdowns, instruction-mix characteristics, and memory system utilization. They identify that the training process is heavily dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies stemming from memory-access patterns and limited Tensor Core utilization. To address these issues, the authors propose two architecture-aware optimizations: enabling TF32 Tensor Core activation and restructuring the model's memory layout to a 3D channels-last format. These optimizations significantly enhance performance metrics, achieving reductions in SM cycles and dynamic instructions while increasing Tensor Core utilization and IPC, all without compromising the quality of the synthesized images. The findings provide practical guidelines for optimizing 3D diffusion models on modern GPUs, emphasizing the importance of architectural considerations in medical imaging research.
Methodology
The authors utilized NVIDIA’s Nsight Compute profiler to perform a full-stack GPU performance study of Med-DDPM. They analyzed kernel-level execution characteristics, dynamic instruction-mix profiles, and memory-hierarchy behavior across three GPU architectures. Based on the profiling results, they applied two targeted optimizations to address identified bottlenecks.
Results
The optimizations led to a reduction in SM cycles by up to 100×, a decrease in dynamic instructions by 100×, an increase in Tensor Core utilization from 1.45 to 9.98×, and a 7% increase in IPC on the A100 architecture, all while maintaining the quality of the synthesized images.
Implications
The findings suggest that architectural considerations are crucial for the efficient deployment of diffusion models in medical imaging. The proposed optimizations can enhance the performance of generative models, making them more viable for practical applications in clinical settings.
Federated Bilevel Performative Prediction
Optimization
Federated Learning
Theory
- Introduction of federated bilevel performative prediction framework.
- Formalization of federated bilevel performatively stable (FBPS) point.
- Development of two algorithms: FBi-RRM and FBi-SGD with convergence guarantees.
- Experimental validation showing improved performance over non-performative baselines.
Summary
This paper addresses the challenges of federated bilevel optimization in the context of performative prediction, where client-specific, decision-dependent distribution shifts occur due to the feedback loop between model decisions and client behavior. The authors introduce a unified framework for federated bilevel performative prediction, which incorporates these shifts into both upper-level (UL) and lower-level (LL) objectives. They formalize the concept of a federated bilevel performatively stable (FBPS) point and provide sufficient conditions for its existence and uniqueness. The paper presents two algorithms: FBi-RRM, which converges linearly under a contraction condition, and FBi-SGD, a communication-efficient stochastic method that leverages federated hypergradient estimation. The proposed methods are evaluated through experiments on strategic regression and meta-strategic classification, demonstrating improved meta-generalization and practical effectiveness in nonconvex neural network settings. The results validate the predicted stability thresholds and highlight the importance of considering performativity in federated learning scenarios.
Methodology
The authors develop a unified framework for federated bilevel optimization that incorporates client-specific decision-dependent distributions. They analyze the stability of the FBPS point and propose two algorithms: FBi-RRM, which ensures linear convergence under specific conditions, and FBi-SGD, which is designed for communication efficiency and convergence under diminishing step sizes.
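The contraction behind repeated risk minimization can be seen on a deliberately simplified toy. The sketch below is single-level and scalar, so it is not the paper's FBi-RRM; the client parameters `mus` and `epsilons` are hypothetical. Each client's data mean shifts to `mu_i + eps_i * theta` once `theta` is deployed, and with squared loss the server's retrained model is the average of those means.

```python
def federated_rrm(mus, epsilons, theta0=0.0, rounds=50):
    """Toy repeated risk minimization with decision-dependent client data.

    Assumption: client i's data mean becomes mu_i + eps_i * theta after theta
    is deployed, and every client uses squared loss, so the minimizer of the
    averaged risk is the average of the induced client means.
    """
    theta = theta0
    history = [theta]
    for _ in range(rounds):
        induced_means = [mu + eps * theta for mu, eps in zip(mus, epsilons)]
        theta = sum(induced_means) / len(induced_means)  # retrain on induced data
        history.append(theta)
    return history


mus = [1.0, 2.0, 3.0]        # hypothetical base means
epsilons = [0.1, 0.3, 0.5]   # hypothetical performative sensitivities
history = federated_rrm(mus, epsilons)
stable_point = (sum(mus) / 3) / (1 - sum(epsilons) / 3)
print(history[-1], stable_point)
```

In this toy the error shrinks by the average sensitivity (0.3) every round, which is linear convergence; if the average sensitivity reached 1 the iteration would no longer contract.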
Results
The experiments demonstrate that the proposed methods effectively handle performative shifts, leading to improved meta-generalization in strategic regression and classification tasks. The results confirm the predicted stability thresholds and show that the proposed algorithms outperform traditional non-performative approaches.
Implications
This work has significant implications for federated learning applications, particularly in scenarios where model decisions influence client behavior and data distributions. It emphasizes the need for adaptive learning strategies that account for performative effects, potentially leading to more robust and effective machine learning systems in distributed environments.
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
Large Language Models
NLP
Theory
- Introduces the Contagion Networks framework for measuring evaluator bias propagation in multi-agent LLM systems.
- Finds that evaluator biases propagate consistently across agents, with specific coefficients indicating the strength of this contagion.
- Identifies three propagation regimes governed by the spectral radius of the contagion matrix.
- Demonstrates that increasing the evaluator committee size significantly reduces effective contagion.
Summary
This paper introduces the concept of Contagion Networks to analyze how evaluator biases propagate in multi-agent systems utilizing large language models (LLMs). The study highlights that biases from one agent can influence the evaluations and outputs of other agents, potentially leading to a systemic preference collapse. Through a controlled experiment involving three agents with distinct evaluator bias profiles, the authors measure the Cross-Agent Contagion Matrix (Γ3) and find that biases consistently propagate, with coefficients ranging from 0.157 to 0.352. The research identifies three distinct propagation regimes based on the spectral radius of the contagion matrix, demonstrating that homogeneous agents exhibit significantly weaker contagion compared to cross-model agents. Furthermore, the study reveals that increasing the size of the evaluator committee can effectively mitigate bias propagation, reducing effective contagion by 72.4%. The authors also provide an open-source framework for further exploration of bias propagation in multi-agent systems.
Methodology
The authors conducted a controlled experiment with three agents using the DeepSeek-chat model, each with different evaluator bias profiles. They measured the Cross-Agent Contagion Matrix (Γ3) to quantify bias propagation and analyzed the dynamics using a formal framework that includes the spectral radius to characterize propagation regimes. The methodology also involved Test-Time Reinforcement Learning (TTRL) for agents to adapt their strategy weights based on evaluations.
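As a rough illustration of the spectral-radius diagnostic, the sketch below estimates the spectral radius of a small contagion matrix by power iteration. The matrix entries are hypothetical values chosen inside the reported 0.157 to 0.352 range, not the paper's measurements. The summary does not state the regime thresholds; the only fact relied on here is the standard one that a linear update x ← Γx decays when the spectral radius is below 1.

```python
def spectral_radius(matrix, iterations=200):
    """Power-iteration estimate for a nonnegative, irreducible square matrix."""
    n = len(matrix)
    vec = [1.0] * n
    radius = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        radius = max(abs(x) for x in nxt)  # max-norm of the iterate
        vec = [x / radius for x in nxt]    # renormalize to avoid under/overflow
    return radius


# Hypothetical 3-agent contagion matrix (row i: influence received by agent i).
gamma = [
    [0.0, 0.157, 0.352],
    [0.2, 0.0, 0.25],
    [0.3, 0.18, 0.0],
]
rho = spectral_radius(gamma)
print(rho, "decaying" if rho < 1.0 else "not decaying")
```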
Results
The study found that evaluator biases propagate between agents with coefficients ranging from 0.157 to 0.352. It identified three propagation regimes based on the spectral radius of the contagion matrix, with homogeneous agents showing 3-5 times weaker contagion compared to cross-model agents. Importantly, increasing the evaluator committee size from one to three reduced effective contagion by 72.4%, confirming the effectiveness of this mitigation strategy.
Implications
The findings suggest that bias propagation in multi-agent LLM systems can lead to a loss of cognitive diversity, which is critical for effective collaboration. The proposed mitigation strategies could be applied to enhance the robustness of multi-agent systems against evaluator bias, ensuring more reliable and diverse outputs. The open-source framework allows for broader investigation into bias dynamics in AI systems.
Physics-Informed Neural Network with Squeeze-Excitation-like Attention
Theory
Optimization
Efficient ML
- Introduction of SEA-PINN architecture with Squeeze-Excitation-like attention mechanism.
- Demonstrated stability with negligible variance and reduced initial loss on benchmark problems.
- Achieved competitive accuracy without Fourier feature embeddings.
- Improved performance when integrated with TSA-PINN.
Summary
This paper introduces SEA-PINN, a novel architecture that integrates a Squeeze-Excitation-like attention mechanism into Physics-Informed Neural Networks (PINNs). The primary innovation of SEA-PINN is its ability to dynamically recalibrate the importance of neurons across layers, which enhances the model's stability and performance. The authors demonstrate that SEA-PINN achieves a highly stable initialization, exhibiting negligible variance and significantly reduced initial loss on 17 out of 20 benchmark problems. This stability provides a quasi-deterministic starting point for optimization, which is crucial for the convergence of PINNs. Notably, SEA-PINN achieves competitive accuracy without the need for Fourier feature embeddings or periodic activation functions, showing an 83% improvement in high-frequency cases compared to traditional FNN-PINN models. Furthermore, when integrated with TSA-PINN, SEA-PINN boosts performance by 42.49%. These findings highlight SEA-PINN as a lightweight module that enhances the nonlinear representation power of PINNs, promotes robust convergence, and strengthens the reliability of physics-informed learning.
Methodology
The SEA-PINN architecture incorporates a Squeeze-Excitation-like attention mechanism that dynamically assigns weights to neuron outputs in each hidden layer. This mechanism is designed to recalibrate neuron importance based on input, allowing for more effective learning and representation of high-frequency components in the data. The architecture maintains the backbone structure of standard FNN-PINN while introducing this dynamic reweighting process.
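A generic squeeze-excitation-style gate on a dense hidden layer might look like the sketch below. It follows the usual bottleneck, ReLU, sigmoid pattern; the exact form used in SEA-PINN is not given in the summary, so the function name, the bottleneck width, and the weights are all assumptions.

```python
import math
import random


def se_like_gate(hidden, w_down, w_up):
    """Rescale each neuron output by an input-dependent gate in (0, 1).

    hidden: list of n activations from one hidden layer
    w_down: r x n bottleneck weights, w_up: n x r expansion weights
    """
    squeezed = [max(0.0, sum(w * h for w, h in zip(row, hidden))) for row in w_down]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * s for w, s in zip(row, squeezed))))
        for row in w_up
    ]
    return [h * g for h, g in zip(hidden, gates)], gates


rng = random.Random(0)
n, r = 8, 2  # hypothetical layer width and bottleneck size
hidden = [math.tanh(rng.uniform(-2, 2)) for _ in range(n)]
w_down = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(r)]
w_up = [[rng.uniform(-1, 1) for _ in range(r)] for _ in range(n)]
reweighted, gates = se_like_gate(hidden, w_down, w_up)
print(gates)
```

Because the gates depend on the whole layer's activations, the reweighting changes with the input rather than being a fixed per-neuron scale.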
Results
SEA-PINN exhibited nearly negligible variance and significantly reduced initial loss across 17 out of 20 benchmark problems. It achieved competitive accuracy with an 83% improvement in high-frequency cases compared to FNN-PINN and a 42.49% performance boost when combined with TSA-PINN.
Implications
The SEA-PINN architecture has the potential to improve the application of PINNs in various scientific and engineering fields by enhancing their stability and performance, particularly in high-frequency problems. This could lead to more reliable solutions in complex physical simulations and broader applicability in data-sparse environments.
Adaptive Distance-Aware Trunk Deep Operator Learning for Long-Span Roadway Bridges
Theory
Efficient ML
- Introduction of AD-DeepONet for localized structural response prediction in bridges.
- Utilization of KNN for adaptive influence-domain selection to capture localized phenomena.
- Incorporation of distance-aware trunk features for improved representation of structural behavior.
- Significant reduction in computational time for response evaluation, achieving FEM-level accuracy.
Summary
This paper presents the Adaptive Distance-Aware Trunk Deep Operator Network (AD-DeepONet), a novel machine learning framework designed for predicting localized structural responses in long-span roadway bridges, which are characterized by complex behaviors under vehicular loading. Traditional finite element methods (FEM) are computationally expensive, particularly for generating influence lines and surfaces, which are critical for real-time analysis and digital twin applications. The proposed AD-DeepONet addresses this challenge by employing a K-nearest neighbors (KNN) strategy to dynamically select influence domains, allowing the model to focus on regions of significant structural response. Additionally, distance-aware trunk features are integrated to enhance the representation of localized behaviors by encoding the geometric relationships between loads and structural nodes. The framework incorporates a physics-based full-field reconstruction method through a stiffness-informed Schur complement formulation, enabling predictions at adaptive nodes to be extended across the entire structure. The model is validated using both a benchmark bridge model and the real-world Mussafah Bridge, demonstrating its effectiveness in achieving high accuracy while significantly reducing computational time compared to traditional FEM approaches.
Methodology
The AD-DeepONet framework employs a K-nearest neighbors strategy for influence-domain selection, allowing it to focus on localized structural responses. It integrates distance-aware features to encode geometric relationships and uses a stiffness-informed Schur complement formulation for full-field reconstruction. Training data is generated using a reduced-order equivalent shell model to maintain computational efficiency.
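The influence-domain selection can be sketched as a plain nearest-neighbour query around the load position, with the load-to-node distance appended as an extra trunk input. The node coordinates, the value of `k`, and the feature layout `(x, y, distance)` are illustrative assumptions rather than the paper's configuration.

```python
import math


def influence_domain(load_xy, nodes, k):
    """Return (node_index, distance) for the k nodes nearest to the load, nearest first."""
    ranked = sorted((math.dist(load_xy, xy), idx) for idx, xy in enumerate(nodes))
    return [(idx, dist) for dist, idx in ranked[:k]]


def trunk_features(load_xy, nodes, k):
    """Distance-aware trunk input per selected node: (x, y, distance to load)."""
    return [
        (nodes[idx][0], nodes[idx][1], dist)
        for idx, dist in influence_domain(load_xy, nodes, k)
    ]


# Hypothetical deck nodes every 10 m along a span, with a load at x = 12 m.
nodes = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0), (40.0, 0.0)]
print(trunk_features((12.0, 0.0), nodes, k=2))
```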
Results
The AD-DeepONet achieved finite element method (FEM)-level accuracy with relative errors below 5%. The total response evaluation time, including full-field reconstruction, was reduced by approximately 60 times, and the inference speed of the model was up to four orders of magnitude faster than traditional FEM methods.
Implications
The proposed framework has significant implications for large-scale bridge analysis and digital twin applications, enabling rapid generation of influence lines and surfaces under various vehicular loading conditions. This could lead to more efficient structural health monitoring and real-time analysis in civil engineering.
Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices
NLP
Large Language Models
Efficient ML
- Introduces techniques to significantly reduce peak memory usage during LoRA fine-tuning of LLMs.
- Demonstrates up to 26× and 28× memory reduction for Llama-3.2 3B and Qwen-2.5 3B models, respectively.
- Techniques include quantization, memory-efficient checkpointing, softmax approximation, and logits masking.
- Enables fine-tuning on resource-constrained devices, enhancing privacy and user experience.
Summary
This paper addresses the challenge of fine-tuning Large Language Models (LLMs) using Low-Rank Adaptation (LoRA) on resource-constrained edge devices, where peak memory usage often exceeds device limits. The authors propose a suite of techniques aimed at reducing the memory footprint during fine-tuning without compromising model quality. These techniques include base model quantization with on-the-fly dequantization, memory-efficient checkpointing that combines selective activation caching and disk offloading, softmax approximation using semantically relevant token subsets, and logits masking. Experimental results demonstrate significant reductions in peak memory usage, achieving up to 26× and 28× reductions for Llama-3.2 3B and Qwen-2.5 3B models, respectively. This enables the fine-tuning of LLMs on devices with limited memory, unlocking new possibilities for personalized user experiences while maintaining data privacy.
Methodology
The authors employed a combination of techniques to optimize memory usage during the fine-tuning process. These include quantizing the base model, implementing memory-efficient checkpointing strategies, approximating softmax calculations, and applying logits masking. Each technique was tested on Llama-3.2 3B and Qwen-2.5 3B models to evaluate their effectiveness in reducing peak memory requirements.
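One way to picture the softmax approximation is a cross-entropy whose normalizer runs over a token subset instead of the full vocabulary, so only that subset's logits need to be materialized per position. The sketch below assumes the subset `keep` is supplied from outside; how the paper chooses its semantically relevant tokens is not described in the summary.

```python
import math


def cross_entropy(logits, target, keep=None):
    """Cross-entropy at one position.

    If `keep` is given, the softmax normalizer only covers those token ids
    (the target id is always included), approximating the full-vocabulary loss.
    """
    ids = range(len(logits)) if keep is None else sorted(set(keep) | {target})
    peak = max(logits[i] for i in ids)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(logits[i] - peak) for i in ids))
    return log_norm - logits[target]


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical 4-token vocabulary
print(cross_entropy(logits, target=0))            # full softmax
print(cross_entropy(logits, target=0, keep=[1]))  # subset {0, 1} only
```

The subset loss never exceeds the full loss, because dropping tokens can only shrink the normalizer; the approximation is tight when the dropped tokens carry little probability mass.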
Results
The experimental results indicate that the proposed techniques can drastically lower the peak memory required for fine-tuning LLMs. For instance, the peak memory for Llama-3.2 3B with a 2048-token context was reduced from 26.20 GB to as low as 1.02 GB, making it feasible to fine-tune on devices with limited memory resources.
Implications
The findings of this paper have significant implications for deploying personalized LLMs on edge devices, allowing for enhanced user experiences while ensuring data privacy. The techniques presented can facilitate the broader adoption of LLMs in consumer applications, such as smartphones and IoT devices, where memory constraints are a critical concern.
Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models
NLP
Large Language Models
Theory
- Introduction of Free-Energy Signatures (FES) for hallucination detection in LLMs.
- FES captures the full spectral structure of attention Laplacians using thermodynamic and RMT metrics.
- Empirical results show FES significantly outperforms existing spectral methods in detecting hallucinations.
- The paper establishes theoretical foundations for FES, including stability and expressiveness results.
Summary
This paper addresses the critical issue of hallucination detection in large language models (LLMs) by introducing Free-Energy Signatures (FES), a novel spectral descriptor derived from the attention Laplacian of LLMs. Unlike previous methods that summarize the Laplacian spectrum using a limited number of eigenvalues, FES utilizes thermodynamic potentials and random-matrix-theory (RMT) metrics to capture the full structure of the spectrum. The author proves three key results: (i) FES is Lipschitz stable under perturbations, (ii) it enriches finite spectral summaries and approximates spectral functionals, and (iii) it provides a PAC-style bound on the AUROC of a training-free detector. Empirical evaluations across six open-weight LLMs demonstrate that FES outperforms existing spectral baselines, achieving a significant improvement in AUROC scores. The study also reveals that healthy LLM outputs exhibit Wigner-Dyson-like spectral statistics, while hallucinations show Poisson-like patterns, providing a new perspective on reasoning quality in LLMs.
Methodology
The methodology involves treating the attention Laplacian as a Hamiltonian and extracting thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) and RMT metrics (spectral form factor). This approach allows for a comprehensive analysis of the Laplacian spectrum across different layers of the LLM, resulting in a multiscale descriptor that can be used for hallucination detection without retraining the model.
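Given the eigenvalues of one attention Laplacian, the listed quantities can be computed with textbook statistical-mechanics formulas, as in the sketch below. The inverse temperature `beta`, the normalization of the spectral form factor, and the example spectrum are assumptions; the paper's exact definitions may differ.

```python
import cmath
import math


def thermodynamic_signature(eigenvalues, beta=1.0):
    """Partition function, free energy, spectral entropy, and heat capacity."""
    weights = [math.exp(-beta * lam) for lam in eigenvalues]
    z = sum(weights)
    probs = [w / z for w in weights]  # Boltzmann distribution over the spectrum
    mean_e = sum(p * lam for p, lam in zip(probs, eigenvalues))
    mean_e2 = sum(p * lam * lam for p, lam in zip(probs, eigenvalues))
    return {
        "partition_function": z,
        "free_energy": -math.log(z) / beta,
        "spectral_entropy": -sum(p * math.log(p) for p in probs if p > 0.0),
        "heat_capacity": beta ** 2 * (mean_e2 - mean_e ** 2),
    }


def spectral_form_factor(eigenvalues, t):
    """|sum_k exp(-i * lambda_k * t)|^2 / N, one common normalization."""
    amplitude = sum(cmath.exp(-1j * lam * t) for lam in eigenvalues)
    return abs(amplitude) ** 2 / len(eigenvalues)


spectrum = [0.0, 0.5, 1.5]  # hypothetical Laplacian eigenvalues
print(thermodynamic_signature(spectrum))
print(spectral_form_factor(spectrum, t=2.0))
```

Sweeping `beta` and `t` over a grid and stacking the results per layer would give a multiscale descriptor in the spirit of the one described above.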
Results
FES achieved the highest aggregate AUROC scores across six open-weight LLMs and six benchmarks, outperforming the strongest existing spectral baselines by an average of 6.5 AUROC points. In a fully unsupervised setting, an RMT-deviation score achieved a mean AUROC of 0.71. The analysis also confirmed that valid generations exhibit Wigner-Dyson-like spectral statistics, while hallucinations show Poisson-like statistics.
Implications
The findings suggest that FES can serve as a robust tool for real-time hallucination detection in LLMs, enhancing the reliability of these models in practical applications. The insights into spectral statistics may also inform future research on model interpretability and reasoning quality.
The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups
Robotics
Computer Vision
Theory
- Introduces the concept of using bare group elements as attention tokens, shifting the paradigm in equivariant models.
- Develops a closed-form attention score based on the negative squared algebra norm of the relative pose, avoiding the need for learned kernels.
- Demonstrates applicability to a range of matrix Lie groups, including non-compact and non-abelian cases.
- Empirical results show superior performance compared to traditional vector-token methods and learned kernels.
Summary
This paper introduces Lie-Algebra Attention, an attention mechanism that treats each attention token as an element of a matrix Lie group rather than a feature vector. This approach eliminates the need for traditional representation-theoretic machinery, allowing for direct manipulation of transformations in spatial reasoning tasks. The attention score is derived from the closed-form algebra norm of the relative pose between tokens, enabling the model to reach affine full-frame groups that previous methods could not. The paper provides closed-form instantiations for various matrix Lie groups, including SO(2), SE(2), SO(3), SE(3), Aff(2), and Aff(3). Experimental results demonstrate that the proposed method outperforms learned kernel approaches while using significantly fewer parameters, showcasing its efficiency and effectiveness in sequence-completion tasks.
Methodology
The methodology involves defining attention tokens as elements of matrix Lie groups, allowing for direct computation of pairwise invariants and attention scores without relying on external representations or learned kernels. The construction is validated through closed-form expressions and empirical experiments across various Lie groups.
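For the simplest group, SO(2), the score can be written out in a few lines: the relative pose of two planar rotations is the angle difference, and its logarithm is that angle wrapped to the principal range. The sketch below is an illustration under that reading only; scaling, temperature, and the other groups' closed forms are not taken from the paper.

```python
import math


def wrap_angle(theta):
    """Map an angle to [-pi, pi), the principal value of the SO(2) logarithm."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def lie_attention_weights(angles, query_index):
    """Softmax over scores -|log(g_q^{-1} g_j)|^2 for planar rotation tokens."""
    scores = [-(wrap_angle(a - angles[query_index]) ** 2) for a in angles]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = [0.1, 0.5, 2.0, -1.0]  # hypothetical rotation angles in radians
print(lie_attention_weights(tokens, query_index=0))
```

Because the score depends only on relative poses, rotating every token by the same amount leaves the attention weights unchanged, which is the invariance a vector-token baseline can lose.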
Results
The proposed Lie-Algebra Attention mechanism was empirically validated through three sequence-completion experiments on SE(2), SO(3), and Aff(2). The results indicated that the closed-form score matched a learned MLP kernel while using 50 to 80 times fewer parameters, and outperformed it on SE(2). In contrast, a vector-token baseline demonstrated significant loss of invariance.
Implications
This work has significant implications for fields requiring spatial reasoning and transformation handling, such as robotics, computer vision, and molecular modeling. By simplifying the attention mechanism and enhancing performance, it opens avenues for more efficient models in these domains.
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
Reinforcement Learning
Large Language Models
NLP
- Introduces Bayesian Manifold Curriculum (BMC) for structured problem sampling in LLM training.
- Frames problem sampling as a manifold-structured bandit problem, highlighting the importance of latent relationships.
- Demonstrates non-trivial trade-offs between productivity, diversity, and utility in sampling strategies.
- Proposes Latent Task Trees for hierarchical representation of task relationships.
Summary
This paper addresses the challenges of training large language models (LLMs) through reinforcement learning (RL) by proposing a novel framework called Bayesian Manifold Curriculum (BMC). Traditional adaptive curriculum learning methods often prioritize prompts of intermediate difficulty without considering the structured nature of the task space. The authors frame problem sampling as a manifold-structured bandit problem, where the relationships between problems are informed by the model's latent representation space. BMC organizes problems into a hierarchical task tree and employs Bayesian learning to optimize sampling decisions. The study reveals that different sampling strategies yield varying trade-offs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). The findings emphasize that merely focusing on difficulty is insufficient for achieving strong downstream performance, advocating for a structured and type-aware approach to problem sampling.
Methodology
The authors developed a hierarchical method called Latent Task Trees to approximate the structure of the task manifold using model embeddings. BMC utilizes Bayesian decision-making over these trees to guide the sampling of problems, accounting for the non-stationary dynamics induced by the model's learning process. The framework allows for efficient exploration while balancing the trade-offs between productivity, diversity, and utility in problem selection.
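As a much-simplified picture of Bayesian sampling over task clusters, the sketch below runs Beta-Bernoulli Thompson sampling over a flat set of leaves. It ignores the tree hierarchy and the non-stationarity that BMC is said to handle, and the per-leaf `signal_rates` (probability that a sampled problem yields a useful learning signal) are hypothetical.

```python
import random


def thompson_curriculum(signal_rates, rounds, seed=0):
    """Pick a leaf each round by sampling from per-leaf Beta posteriors."""
    rng = random.Random(seed)
    wins = [1.0] * len(signal_rates)    # Beta(1, 1) prior on each leaf
    losses = [1.0] * len(signal_rates)
    pulls = [0] * len(signal_rates)
    for _ in range(rounds):
        draws = [rng.betavariate(a, b) for a, b in zip(wins, losses)]
        leaf = max(range(len(draws)), key=draws.__getitem__)
        useful = 1 if rng.random() < signal_rates[leaf] else 0  # simulated feedback
        wins[leaf] += useful
        losses[leaf] += 1 - useful
        pulls[leaf] += 1
    return pulls


print(thompson_curriculum([0.2, 0.5, 0.8], rounds=2000))
```

A sampler like this optimizes productivity alone; the trade-off with diversity and utility discussed above would need extra terms that this sketch does not attempt.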
Results
The empirical results indicate that BMC significantly improves training efficiency and broad data coverage compared to traditional methods. The analysis of sampling strategies shows that they induce distinct trade-offs among productivity, diversity, and utility, suggesting that a more nuanced approach to problem selection can enhance downstream performance.
Implications
The findings of this paper could lead to more effective training methodologies for LLMs, improving their reasoning capabilities and generalization across diverse tasks. The structured approach to problem sampling may also be applicable to other areas of machine learning where task relationships are significant.
On the Oracle Complexity of Interpolation-Based Gradient Descent
Optimization
Theory
Efficient ML
- PPI-GD achieves improved oracle complexity for strongly convex problems compared to existing methods.
- The algorithm operates effectively in a broader regime of data dimensionality than previous methods.
- The paper provides new insights into multivariate polynomial interpolation, establishing novel error bounds.
- PPI-GD outperforms traditional gradient descent and stochastic gradient descent in specific smoothness conditions.
Summary
This paper introduces a novel first-order optimization algorithm called Piecewise Polynomial Interpolation-based Gradient Descent (PPI-GD) aimed at improving the oracle complexity of gradient descent methods for empirical risk minimization (ERM). The authors argue that leveraging the smoothness of ERM loss functions with respect to training data can enhance optimization efficiency. PPI-GD approximates the full gradient by querying a first-order oracle at equidistant points in the data domain, constructing polynomial interpolants over small patches of the data. The authors analyze the oracle complexity of PPI-GD for both strongly convex and non-convex loss functions, demonstrating that it outperforms existing gradient descent variants in scenarios where the loss function exhibits sufficient smoothness. The paper also extends techniques from bicubic spline error analysis to multivariate polynomial interpolants, contributing to the theoretical understanding of interpolation in optimization contexts.
Methodology
The authors propose PPI-GD, which divides the data space into small chunks and performs polynomial interpolation to approximate gradients. This method reduces the number of oracle calls needed for gradient estimation, enhancing efficiency. The oracle complexity is analyzed for both strongly convex and non-convex loss functions, with theoretical results derived using techniques from numerical analysis.
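The core idea, replacing one oracle call per data point with oracle calls at a few equidistant nodes plus interpolation, can be shown on a one-dimensional toy. The sketch below uses piecewise-linear interpolation and a made-up loss `0.5 * (theta - sin(x)) ** 2`; the paper uses higher-degree polynomial patches in several dimensions, so this is a simplification, not PPI-GD itself.

```python
import math


def oracle(theta, x):
    """Hypothetical per-sample gradient of 0.5 * (theta - sin(x)) ** 2 w.r.t. theta."""
    return theta - math.sin(x)


def interpolated_full_gradient(theta, data, num_nodes):
    """Approximate the mean gradient over data in [0, 1] from num_nodes oracle calls."""
    step = 1.0 / (num_nodes - 1)
    node_grads = [oracle(theta, i * step) for i in range(num_nodes)]
    total = 0.0
    for x in data:
        left = min(int(x / step), num_nodes - 2)  # index of the patch containing x
        frac = x / step - left                    # position of x inside the patch
        total += (1.0 - frac) * node_grads[left] + frac * node_grads[left + 1]
    return total / len(data)


data = [i / 999 for i in range(1000)]  # hypothetical training inputs
exact = sum(oracle(0.3, x) for x in data) / len(data)   # 1000 oracle calls
approx = interpolated_full_gradient(0.3, data, num_nodes=11)  # 11 oracle calls
print(exact, approx)
```

The smoother the gradient is as a function of the data, the fewer nodes are needed for a given accuracy, which is where the dependence on smoothness in the complexity bound comes from.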
Results
PPI-GD achieves an oracle complexity of $\tilde{O}\big((p/\varepsilon)^{d/(2\ell)}\big)$ for strongly convex problems, outperforming traditional gradient descent and stochastic gradient descent under specific conditions. The method also shows asymptotic dominance in the non-convex setting. The theoretical analysis includes new error bounds for tensor product polynomial interpolants, contributing to the broader understanding of interpolation methods.
Implications
The findings suggest that PPI-GD can be a valuable tool for optimizing machine learning models, particularly in scenarios where loss functions are smooth with respect to the data. The theoretical advancements in polynomial interpolation may also inspire further research in numerical analysis and optimization techniques.
Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale
Theory
Large Language Models
Optimization
- Introduces a forward-pass-only method to identify dead directions in LayerNorm transformers.
- Demonstrates that the dead direction can be computed from the LayerNorm scale parameter alone.
- Validates the method across 14 pretrained transformers, achieving high accuracy in predictions.
- Shows that training increases the depth of dead directions, revealing more complex structures.
Summary
This paper introduces a novel method for identifying dead directions in LayerNorm transformers, which are directions in parameter space where the Fisher information metric degenerates. The authors propose a closed-form solution that allows for the identification of these dead directions using only the LayerNorm scale parameter, without the need for forward or backward passes or eigendecomposition. The method is validated on 14 pretrained transformers, demonstrating that the predicted dead direction aligns closely with the measured bottom singular direction in LayerNorm models, while being absent in RMSNorm models. The study also reveals that training significantly deepens the covariance eigenvalue along the predicted direction, indicating the emergence of additional dead directions. The findings provide insights into the structure of pretrained transformers and offer a diagnostic tool for assessing normalization methods based solely on model parameters.
Methodology
The authors derive a closed-form expression for the dead direction in LayerNorm transformers using the inverse-scale direction of the LayerNorm affine parameter. They validate this method through empirical testing on various pretrained models, comparing predicted directions with measured singular directions using a single forward pass for verification.
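One way to see why an inverse-scale direction is special: LayerNorm subtracts the mean, so the normalized values sum to zero, and therefore the output minus the bias always has zero component along the elementwise inverse of the scale vector. The sketch below checks this numerically with plain Python and contrasts it with RMSNorm, which does not center. This is my reading of "inverse-scale direction", not the paper's derivation, and all parameter values are random placeholders.

```python
import math
import random


def layer_norm(x, gamma, beta, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-5):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]


def projection_on_inverse_scale(y, gamma, beta):
    """Unnormalized component of (y - beta) along the direction 1 / gamma."""
    return sum((yi - bi) / gi for yi, gi, bi in zip(y, gamma, beta))


rng = random.Random(0)
dim = 16
gamma = [rng.uniform(0.5, 1.5) for _ in range(dim)]
beta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
print(projection_on_inverse_scale(layer_norm(x, gamma, beta), gamma, beta))
print(projection_on_inverse_scale(rms_norm(x, gamma), gamma, [0.0] * dim))
```

The first value is zero up to rounding for every input, while the second generally is not, matching the summary's statement that the direction is absent in RMSNorm models.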
Results
The predicted dead direction matches the measured bottom singular direction to four decimal places in all LayerNorm models tested, while being absent in RMSNorm models. Additionally, the covariance eigenvalue along the predicted direction deepens significantly after training, indicating the opening of further dead directions.
Implications
This work provides a new diagnostic tool for understanding the structure of pretrained transformers, which could aid in model optimization and the selection of normalization techniques. It also opens avenues for further research into the implications of dead directions on model performance and generalization.
Enhancing Graph Neural Networks Using Proximity Graphs for Dust Source Emission Forecasting
Graph Learning
Time Series
- Proximity graphs enhance the modeling capabilities of Graph Neural Networks for dust emission forecasting.
- The study demonstrates significant performance improvements over traditional forecasting methods and LSTM models.
- The proposed methodology effectively captures complex spatiotemporal dependencies in dust source emissions.
- This research is the first to apply proximity graph models with GNNs in the field of dust emission forecasting.
Summary
This paper addresses the challenge of accurately forecasting dust source emissions, which are critical for mitigating environmental and health hazards associated with dust storms. Traditional forecasting methods often fail to capture the complex spatiotemporal dynamics of dust emissions. The authors propose a novel approach that integrates proximity graphs with Graph Neural Networks (GNNs) to model the intricate spatial and temporal relationships among dust sources. By employing various proximity graph constructions, including Delaunay triangulation, Gabriel graph, k-Nearest Neighbor graph, and Yao graph, the authors demonstrate that GNNs can effectively perform message passing and enhance prediction accuracy. The study compares the performance of GNNs using proximity graphs against those using random graphs and Long Short-Term Memory (LSTM) models, revealing that GNNs with proximity graphs significantly outperform both alternatives. This research represents the first application of proximity graph models with GNNs in the context of dust emission forecasting, offering a robust solution for proactive environmental management.
Methodology
The methodology involves constructing proximity graphs to represent spatial and temporal relationships among dust sources, followed by the application of various GNN architectures (GraphSAGE, Graph Convolutional Networks, and Graph Attention Networks) for predictive modeling. The approach treats the forecasting task as a binary classification problem, identifying active dust source emissions based on historical data.
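Of the proximity graphs mentioned, the Gabriel graph has a particularly compact definition: two points are connected when no third point lies in the closed disk whose diameter is the segment between them. A brute-force sketch follows; the coordinates are made up, and the paper's actual graph construction and features are not shown in the summary.

```python
def gabriel_graph(points):
    """Return Gabriel-graph edges (i, j) with i < j for 2D points, in O(n^3) time."""

    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            span = sq_dist(points[i], points[j])
            # A third point k lies in the closed disk on diameter (i, j) exactly
            # when the angle at k is at least 90 degrees.
            blocked = any(
                sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) <= span
                for k in range(len(points))
                if k not in (i, j)
            )
            if not blocked:
                edges.append((i, j))
    return edges


sources = [(0, 0), (1, 0), (2, 0)]  # hypothetical dust-source coordinates
print(gabriel_graph(sources))       # [(0, 1), (1, 2)]
```

The resulting edge list can serve as the adjacency over which a GNN passes messages between nearby sources.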
Results
The results indicate that GNNs utilizing proximity graphs significantly outperform those using random graphs and traditional LSTM models in forecasting dust source emissions. The integration of proximity graphs allows for a more accurate representation of the spatiotemporal dynamics involved in dust emissions.
Implications
The findings suggest that integrating proximity graphs with GNNs can lead to more effective forecasting models for environmental phenomena, potentially aiding policymakers and businesses in planning mitigation strategies against dust storms and their associated impacts.
Predicting gestational age at birth in the context of preterm birth from multi-modal fetal MRI
Multimodal
- Developed a machine learning pipeline for predicting gestational age using multi-modal fetal MRI data.
- Achieved a mean absolute error of 2.74 weeks in gestational age predictions.
- Identified key predictive features such as cervical length and placental T2* values.
- Provided a novel approach by treating preterm birth prediction as a regression problem.
Summary
This paper addresses the critical issue of preterm birth, which poses significant risks for neonatal mortality and long-term morbidity. The authors developed a machine learning pipeline that integrates multi-modal fetal MRI data to predict gestational age (GA) at birth, distinguishing between term and preterm births. The study utilized data from 333 control cases and 93 preterm birth cases, employing bespoke methods for data imputation, feature selection, and regression modeling. The performance of the pipeline was evaluated through stratified 10-fold cross-validation, yielding an R² score of 0.13 and a mean absolute error of 2.74 weeks. The model achieved an accuracy of 0.77, sensitivity of 0.59, and specificity of 0.82. Key features influencing predictions included cervical length and placental T2* statistics. This research is pioneering in treating preterm birth prediction as a regression problem rather than a classification task, providing a proof of concept for future studies. The authors aim to expand the cohort size for more refined stratification within the preterm birth group.
Methodology
The study employed a machine learning pipeline that included data imputation, feature selection, and regression modeling. It utilized multi-modal morphological and functional fetal MRI data from a total of 426 cases, with performance evaluated using stratified 10-fold cross-validation.
Results
The pipeline achieved an R² score of 0.13, a mean absolute error of 2.74 weeks, and classification metrics of 0.77 accuracy, 0.59 sensitivity, and 0.82 specificity. Key features selected were cervical length and statistics from placental T2* values.
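As a rough illustration of how one set of regression outputs can yield both kinds of figures reported above, the sketch below computes MAE and R² on predicted gestational ages and then thresholds the same predictions to get accuracy, sensitivity, and specificity. The 37-week cutoff is the conventional definition of preterm birth; whether the paper used exactly this cutoff is an assumption, and the function name `ga_metrics` is hypothetical.

```python
def ga_metrics(y_true, y_pred, cutoff_weeks=37.0):
    """Regression metrics plus term/preterm metrics derived by thresholding.

    y_true, y_pred: gestational age at birth in weeks.
    cutoff_weeks: births before this age count as preterm (assumed cutoff).
    """
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    # Preterm is treated as the positive class.
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        is_preterm, predicted_preterm = t < cutoff_weeks, p < cutoff_weeks
        if is_preterm and predicted_preterm:
            tp += 1
        elif is_preterm:
            fn += 1
        elif predicted_preterm:
            fp += 1
        else:
            tn += 1
    return {
        "mae": mae,
        "r2": r2,
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

A low R² can coexist with a reasonable accuracy here, because the classification metrics only depend on which side of the cutoff each prediction falls.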
Implications
The findings suggest that machine learning can enhance the prediction of gestational age at birth, which is crucial for improving care for preterm infants. This approach could lead to better clinical decision-making and resource allocation in maternal and neonatal healthcare.
Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids
Theory
Efficient ML
- Introduction of a lightweight defense framework against FDIA in DNNs for CPS.
- Dynamic padding with pseudo-features increases input dimensionality and complexity.
- No modifications to existing DNN architectures are required, enhancing deployability.
- Demonstrated effectiveness through simulations on IEEE power grid test systems.
Summary
This paper addresses the vulnerability of Deep Neural Networks (DNNs) in Cyber-Physical Systems (CPS), particularly in the context of False Data Injection Attacks (FDIA) that can disrupt critical operations such as state estimation in power grids. The authors propose a novel defense mechanism called Pseudo-Feature Padding, which introduces an additional input layer that pads input samples with pseudo-feature values derived from the statistical distribution of the input data. This approach increases the input dimensionality in a randomized and data-aware manner, making it more difficult for adversaries to generate effective perturbations. The method is lightweight, model-agnostic, and does not require changes to the existing DNN architecture, facilitating its deployment in real-world CPS environments. The authors validate their framework through extensive experiments on various IEEE test systems, demonstrating that their padding strategy significantly enhances model robustness against FDIA while maintaining performance integrity.
Methodology
The proposed framework integrates an additional input layer that pads input samples with pseudo-feature values, which are generated based on the statistical distribution of the input data. This method identifies low-importance input features using tree-based models and samples new values, thereby increasing the dimensionality and complexity of the input data. The padding is randomized during inference to further enhance model robustness.
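The padding step described above might look roughly like the following. This is a minimal sketch, not the authors' implementation: it assumes the low-importance feature indices are already known (the paper derives them from tree-based models), that pseudo-features are drawn from a Gaussian fitted to those features, and that they are appended at the end of the input. The names `fit_padding_stats` and `pad_sample` are hypothetical.

```python
import random
import statistics


def fit_padding_stats(samples, low_importance_idx):
    """Return (mean, std) for each feature chosen as a padding source."""
    stats = []
    for j in low_importance_idx:
        column = [row[j] for row in samples]
        stats.append((statistics.fmean(column), statistics.pstdev(column)))
    return stats


def pad_sample(x, stats, n_pad, rng):
    """Append n_pad pseudo-features sampled from the fitted statistics.

    The values are re-drawn on every call, so the padded part of the
    input changes between inferences (assumed Gaussian sampling).
    """
    padded = list(x)
    for _ in range(n_pad):
        mean, std = rng.choice(stats)
        padded.append(rng.gauss(mean, std))
    return padded
```

The original features are left untouched, so a model trained on the padded dimensionality sees the same measurements plus randomized extra inputs that an attacker cannot observe in advance.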
Results
The experiments conducted on the IEEE 14-bus, 30-bus, 118-bus, and 300-bus systems show that the proposed padding strategy significantly improves the robustness of DNNs against FDIA. The results indicate that the framework effectively mitigates attacks that could bypass conventional defenses, while maintaining a negligible drop in accuracy compared to baseline models.
Implications
The proposed Pseudo-Feature Padding framework has significant implications for enhancing the security of DNNs in CPS, particularly in critical infrastructure like power grids. Its lightweight and model-agnostic nature makes it suitable for real-world applications, potentially leading to more resilient systems against adversarial attacks.
Human-like autonomy emerges from self-play and a pinch of human data
Reinforcement Learning
Robotics
- Spiced self-play combines self-play reinforcement learning with minimal human data to improve policy alignment with human driving behavior.
- Only 30 minutes of human driving data is sufficient to enhance coordination with human proxies, significantly less than traditional imitation learning approaches.
- The method avoids extensive reward engineering and domain randomization, simplifying the training process.
- The resulting policies exhibit lower collision rates and more human-like behavior in driving scenarios.
Summary
This paper introduces a novel approach to training autonomous driving policies using self-play reinforcement learning (RL) combined with a minimal amount of human driving data. The authors identify a key limitation in traditional self-play methods, which can lead to the development of driving strategies that are misaligned with human norms. To address this, they propose a method called 'spiced self-play,' which incorporates a small fraction of human data (only 30 minutes) as a regularization objective alongside a sparse reward for safe goal-reaching. This approach significantly enhances the coordination of the trained policies with human driving behaviors without the need for extensive reward engineering or domain randomization. The authors demonstrate that their method achieves human-like driving performance using only a fraction of the human data typically required by imitation learning methods, thus showcasing the effectiveness of integrating minimal human input into a predominantly self-play framework.
Methodology
The authors utilize Proximal Policy Optimization (PPO) to train a policy under a sparse reward structure for safe goal-reaching. They regularize this policy towards a behavioral cloning anchor derived from a small set of human driving demonstrations, effectively integrating human data into the self-play reinforcement learning framework.
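One way to picture the regularized objective is a PPO clipped surrogate with a penalty that pulls the policy toward the behavioral cloning anchor. The sketch below is an assumption-laden illustration for a single discrete-action step: the KL direction, the weight `beta`, and the names `kl_to_anchor` and `spiced_loss` are hypothetical and not taken from the paper.

```python
import math


def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample."""
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_to_anchor(policy_probs, anchor_probs):
    """KL(policy || anchor) over discrete actions.

    Anchor probabilities are assumed to be strictly positive.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(policy_probs, anchor_probs)
        if p > 0.0
    )


def spiced_loss(ratio, advantage, policy_probs, anchor_probs,
                beta=0.1, clip_eps=0.2):
    """Loss to minimize: negative PPO surrogate plus a BC-anchor penalty."""
    surrogate = ppo_clip_term(ratio, advantage, clip_eps)
    return -surrogate + beta * kl_to_anchor(policy_probs, anchor_probs)
```

With `beta` set to zero this reduces to plain PPO under the sparse reward; larger values keep the self-play policy closer to the human demonstrations.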
Results
The spiced self-play method yields policies that achieve lower collision rates and more human-like behavior in terms of distributional realism and collision severity profiles. Training is efficient: it takes only 15 hours on a single consumer-grade GPU and uses a total of 20 billion self-play transitions.
Implications
This research suggests that minimal human data can significantly enhance the training of autonomous agents, making it feasible to develop more human-compatible driving policies with reduced reliance on extensive human demonstrations. The findings could have broad applications in autonomous driving and other domains requiring coordination among agents.
Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System
Optimization
Efficient ML
Theory
- PIBLS is the first application of Broad Learning System (BLS) to solve PDEs, providing a backpropagation-free scientific computing paradigm.
- The framework reformulates PDE solving as a direct least-squares optimization, enhancing computational efficiency.
- A rigorous proof of the universal approximation property of PIBLS is provided, ensuring its capability to model complex physical fields.
- Experiments show PIBLS outperforms conventional PINNs in speed and accuracy, making it suitable for real-time applications.
Summary
This paper introduces the Physics-Informed Broad Learning System (PIBLS), a novel framework designed to solve partial differential equations (PDEs) efficiently. Traditional numerical methods for PDEs, while robust, often require extensive computational resources due to mesh dependencies. Conversely, Physics-Informed Neural Networks (PINNs) offer a mesh-free approach but struggle with slow convergence and optimization instability. PIBLS addresses these challenges by reformulating PDE solving as a direct least-squares optimization problem, eliminating the need for backpropagation. The authors present a unique solver strategy that includes an analytical solution for linear PDEs and a nonlinear least-squares perturbation algorithm for nonlinear PDEs, ensuring stable convergence. Furthermore, they provide a rigorous mathematical proof of PIBLS's universal approximation property, confirming its ability to approximate solutions to PDEs. Experimental results demonstrate that PIBLS is one to three orders of magnitude faster than conventional PINNs while achieving significantly higher accuracy, establishing it as a promising alternative for real-time simulations and design optimization tasks in scientific machine learning.
Methodology
The PIBLS framework utilizes a Broad Learning System architecture, where input coordinates are projected into a network of randomly generated feature and enhancement nodes. The output is computed as a linear combination of these nodes, with the optimal weights determined through least-squares optimization. The methodology includes analytical solutions for linear PDEs and a perturbation algorithm for nonlinear PDEs, along with a mathematical proof of universal approximation capability.
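The backpropagation-free idea for a linear PDE can be sketched on a toy problem. The example below fits u''(x) = f(x) on [0, 1] with zero boundary values using frozen random tanh nodes and a single least-squares solve. It is a simplified stand-in, not the PIBLS architecture itself: one layer of random nodes replaces the separate feature and enhancement nodes, and the test problem, node count, weight ranges, and `bc_weight` are all assumptions chosen for illustration.

```python
import numpy as np


def solve_poisson_1d(n_nodes=60, n_colloc=100, bc_weight=10.0, seed=0):
    """Least-squares fit of u'' = -pi^2 sin(pi x), u(0) = u(1) = 0."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-4.0, 4.0, n_nodes)  # random input weights, never trained
    b = rng.uniform(-4.0, 4.0, n_nodes)  # random biases, never trained

    def phi(x):
        # Node outputs tanh(w * x + b), shape (len(x), n_nodes).
        return np.tanh(np.outer(x, w) + b)

    def phi_xx(x):
        # Second derivative of tanh(w * x + b) with respect to x.
        t = np.tanh(np.outer(x, w) + b)
        return -2.0 * w**2 * t * (1.0 - t**2)

    x = np.linspace(0.0, 1.0, n_colloc)
    f = -np.pi**2 * np.sin(np.pi * x)
    x_bc = np.array([0.0, 1.0])

    # PDE rows and weighted boundary rows form one linear system in beta.
    A = np.vstack([phi_xx(x), bc_weight * phi(x_bc)])
    rhs = np.concatenate([f, np.zeros(2)])
    beta, *_ = np.linalg.lstsq(A, rhs, rcond=None)

    return lambda xq: phi(np.asarray(xq, dtype=float)) @ beta
```

Because the PDE is linear in the output weights, the whole "training" step is the `lstsq` call; the exact solution of this toy problem is sin(pi x), which makes the fit easy to check.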
Results
PIBLS demonstrated a speed improvement of one to three orders of magnitude over traditional PINNs while achieving significantly higher solution accuracy in experiments involving both linear and nonlinear PDEs.
Implications
The PIBLS framework offers a computationally efficient alternative for solving PDEs, which can be applied in various fields such as physics, engineering, and biology for real-time simulations and design optimization tasks.
Exploring the potential of AlphaEarth and TESSERA embeddings for Fine-scale Local Climate Zone Mapping: A case study across five cities in Switzerland
Computer Vision
- TESSERA embeddings outperform traditional Sentinel-1/2 composites and AlphaEarth for LCZ mapping.
- The study demonstrates the feasibility of generating fine-scale LCZ maps at 10 m resolution.
- Embedding-based models can reduce preprocessing and manual feature engineering efforts.
- Higher-quality reference data significantly enhances classification accuracy.
Summary
This study investigates the use of precomputed embeddings from TESSERA and AlphaEarth to enhance Local Climate Zone (LCZ) mapping at a fine scale (10 m resolution) across five Swiss cities. Traditional LCZ mapping relies on coarse 100 m resolution data, which is inadequate for detailed urban analysis. The authors employ an attention-based U-Net architecture to upscale LCZ maps using both the embeddings and Sentinel-1/2 composites. Three experiments are conducted to evaluate multi-city transferability, the impact of higher-resolution reference data, and the temporal robustness of the models. Results indicate that TESSERA embeddings consistently outperform both Sentinel-1/2 and AlphaEarth in classification accuracy, achieving Intersection-over-Union (IoU) scores of 0.59 to 0.69 in the first experiment and 0.77 to 0.82 in the second. The study highlights the potential of embedding-based models to streamline the LCZ mapping process, reduce preprocessing time, and improve regional transferability, while also emphasizing the importance of high-quality reference data for further accuracy improvements.
Methodology
The authors utilize an attention-based U-Net architecture to process multi-channel input images, comparing precomputed embeddings from TESSERA and AlphaEarth against Sentinel-1/2 composites. Three experiments assess the models' performance across different cities, the impact of higher-resolution reference data, and the temporal transferability of the models.
Results
The study finds that TESSERA embeddings consistently yield higher classification accuracy than both Sentinel-1/2 and AlphaEarth, with IoU scores of 0.59 to 0.69 in the first experiment and 0.77 to 0.82 in the second. The results also indicate challenges in transferring models across different years, highlighting the need for improved reference data.
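For readers unfamiliar with the metric, IoU for one class is the number of pixels where reference and prediction both carry that class, divided by the number where either does. The sketch below computes it on flattened label maps; averaging over classes is an assumption about how the reported scores were aggregated, and the function names are hypothetical.

```python
def class_iou(reference, predicted, label):
    """Intersection-over-Union for one class on flattened label maps."""
    pairs = list(zip(reference, predicted))
    intersection = sum(1 for r, p in pairs if r == label and p == label)
    union = sum(1 for r, p in pairs if r == label or p == label)
    return intersection / union if union else float("nan")


def mean_iou(reference, predicted, labels):
    """Average the per-class IoU over classes that occur at least once."""
    scores = [class_iou(reference, predicted, c) for c in labels]
    valid = [s for s in scores if s == s]  # drop NaN for absent classes
    return sum(valid) / len(valid)
```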
Implications
The findings suggest that embedding-based approaches can significantly enhance the accuracy and efficiency of LCZ mapping, which is crucial for urban climate modeling and sustainable urban planning. The open-source nature of the workflow allows for its application in various urban contexts globally, potentially aiding in climate risk assessments and urban design.
ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models
NLP
Large Language Models
Efficient ML
- Identifies the performance degradation in efficient reasoning methods due to sequence-level coupling of efficiency and correctness signals.
- Proposes ADaPT, a token-level framework that decouples efficiency and correctness during training.
- Enables precise control over the efficiency-performance trade-off at inference time.
- Demonstrates significant reductions in inference costs without sacrificing reasoning performance across multiple benchmarks.
Summary
The paper introduces Adaptive Dual-Process Thinking (ADaPT), a novel framework designed to enhance the efficiency of large reasoning models (LRMs) while maintaining their reasoning capabilities. Traditional methods for improving efficiency often lead to a degradation in reasoning performance due to a sequence-level coupling between efficiency incentives and correctness optimization. ADaPT addresses this issue by implementing a token-level dual-process framework that decouples efficiency and correctness signals during training. It introduces a mode-selection token that governs fast and slow reasoning, applying efficiency-related rewards exclusively to this token. This approach allows the model to preserve long, correct reasoning paths while encouraging efficiency when suitable. Furthermore, ADaPT provides a mechanism for precise control over the efficiency-performance trade-off during inference, enabling a single trained model to navigate the efficiency-performance Pareto frontier. The authors conducted extensive experiments demonstrating that ADaPT significantly reduces inference costs while maintaining strong reasoning performance across various benchmarks.
Methodology
ADaPT employs a two-stage training pipeline consisting of a supervised fine-tuning (SFT) stage to establish basic reasoning behaviors, followed by a reinforcement learning stage utilizing a token-level variant of Group Relative Policy Optimization (GRPO) to optimize reasoning-mode selection.
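The token-level decoupling can be sketched as follows. Group-relative advantages are computed as in GRPO (reward minus group mean, divided by group standard deviation); the correctness advantage is then spread over every token of a response, while the efficiency advantage is added only at the mode-selection token. Treating the mode token as position 0, summing the two advantages at that position, and the function names are assumptions made for illustration.

```python
import statistics


def group_relative(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def token_advantages(correct_rewards, efficiency_rewards, lengths, mode_pos=0):
    """Per-token advantages for each response in a group.

    Every token receives the correctness advantage; only the token at
    mode_pos additionally receives the efficiency advantage.
    """
    a_correct = group_relative(correct_rewards)
    a_efficiency = group_relative(efficiency_rewards)
    per_response = []
    for a_c, a_e, n_tokens in zip(a_correct, a_efficiency, lengths):
        advantages = [a_c] * n_tokens
        advantages[mode_pos] += a_e
        per_response.append(advantages)
    return per_response
```

Under this scheme a long but correct reasoning chain is not penalized token by token for its length; any pressure toward brevity reaches the model only through the choice of reasoning mode.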
Results
The experiments conducted show that ADaPT significantly reduces inference costs while preserving strong reasoning performance across various benchmarks, indicating its effectiveness in balancing efficiency and correctness.
Implications
ADaPT's approach could lead to more efficient deployment of large reasoning models in real-world applications, where computational resources are limited, and maintaining reasoning quality is crucial. This framework may also inspire further research into decoupling efficiency and correctness in other machine learning contexts.
Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET
Multimodal
- First study to combine 3D MRI and PET images with advanced fusion methods and a Mixture-of-Experts classifier.
- Demonstrates the effectiveness of input-adaptive multimodal modeling for Alzheimer's diagnosis.
- Utilizes Grad-CAM for model interpretability, enhancing trust in clinical applications.
- Achieves high classification accuracies across multiple binary tasks related to Alzheimer's disease.
Summary
This paper presents a novel approach for diagnosing Alzheimer's Disease (AD) by integrating 3D MRI and PET neuroimaging data using advanced machine learning techniques. The authors highlight the importance of early diagnosis, particularly at the Mild Cognitive Impairment (MCI) stage, where timely interventions can slow disease progression. The study addresses limitations in existing multimodal models that typically rely on static concatenation of MRI and PET data, which can hinder robustness and efficiency. The proposed method employs 3D convolutional neural networks (CNNs) for feature extraction, combined with three fusion strategies: concatenation, Gated Multimodal Unit (GMU), and gated self-attention. Additionally, a sparsely gated Mixture-of-Experts (MoE) classifier is introduced to enhance input-adaptive routing, activating only the most relevant experts for each case. To improve interpretability, Grad-CAM is utilized to visualize the regions of the brain that influence the model's decisions. The experiments conducted across three binary classification tasks (NC vs. MCI, MCI vs. AD, and NC vs. AD) demonstrate the effectiveness of the proposed approach, with GMU achieving accuracies of 80.46% and 95.47% for NC vs. MCI and NC vs. AD, respectively, while gated self-attention reached 82.08% for MCI vs. AD. Ablation studies confirm that the MoE component is crucial for maintaining high accuracy across all tasks.
Methodology
The methodology involves preprocessing MRI and PET images, extracting features using a 3D CNN, and applying three fusion techniques (concatenation, GMU, gated self-attention). A sparsely gated Mixture-of-Experts model is integrated to dynamically select relevant subnetworks for classification. Grad-CAM is employed for visualizing the model's decision-making process.
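Of the three fusion strategies, the Gated Multimodal Unit is the easiest to show compactly. In its standard form, each modality is projected through a tanh layer and a sigmoid gate computed from both inputs decides, per hidden unit, how much each modality contributes. The sketch below covers only this fusion step on already-extracted feature vectors; the dimensions and weight names are hypothetical, bias terms are omitted, and the 3D CNN and Mixture-of-Experts stages are not shown.

```python
import numpy as np


def gmu_fuse(x_mri, x_pet, W_mri, W_pet, W_gate):
    """Gated Multimodal Unit for two modalities.

    x_mri, x_pet: 1-D feature vectors from the two encoders.
    W_mri, W_pet: projections to a shared hidden size.
    W_gate: maps the concatenated inputs to one gate value per hidden unit.
    Returns the fused vector and the gate values.
    """
    h_mri = np.tanh(W_mri @ x_mri)
    h_pet = np.tanh(W_pet @ x_pet)
    gate_input = W_gate @ np.concatenate([x_mri, x_pet])
    z = 1.0 / (1.0 + np.exp(-gate_input))  # sigmoid gate in (0, 1)
    fused = z * h_mri + (1.0 - z) * h_pet
    return fused, z
```

Unlike static concatenation, the gate `z` depends on the input, so the balance between MRI and PET can differ from one subject to the next.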
Results
With GMU fusion, the proposed model achieved accuracies of 80.46% for NC vs. MCI and 95.47% for NC vs. AD; with gated self-attention fusion, it reached 82.08% for MCI vs. AD. Ablation studies indicated that removing the MoE layer consistently reduced accuracy across all tasks, highlighting its importance.
Implications
The findings suggest that integrating multimodal neuroimaging data with adaptive machine learning techniques can significantly enhance early diagnosis of Alzheimer's disease, potentially leading to better patient outcomes through timely interventions. The model's interpretability may also facilitate its acceptance in clinical settings.
Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics
Optimization
Theory
Large Language Models
- Establishes a connection between the Weibull weight-scale parameter λ and AdamW squared-norm dynamics.
- Demonstrates that alignment force is the dominant contributor to the rise phase of λ during training.
- Identifies a transition from alignment dominance to a balance with decay forces near saturation.
- Introduces a method for recovering alignment force from sparse checkpoints with high accuracy.
Summary
This paper investigates the dynamics of the Weibull weight-scale parameter λ during the training of transformer models using the AdamW optimizer. The author builds on a two-parameter Weibull framework to diagnose weight distributions and explores the reasons behind the observed growth, overshoot, and relaxation of λ during training. A leading-order three-force decomposition of the squared weight norm is derived, consisting of an alignment force, an injection force from adaptive step magnitude, and a decay force from weight decay. The study utilizes self-trained Pythia-70M models with ground-truth optimizer moments to demonstrate that the alignment force predominantly drives the rise phase of λ, contributing 88-94% of the total force budget. Near saturation, the balance between alignment and decay forces explains the transition from growth to relaxation of λ. The paper also introduces a spline displacement method to recover alignment force from sparse checkpoints, achieving high accuracy. Additionally, the peak value of λ is shown to vary with training data coherence, indicating a potential data-dependent aspect of weight-scale growth.
Methodology
The study employs a three-force decomposition approach to analyze the dynamics of the squared weight norm during AdamW training. It uses self-trained transformer models with known optimizer moments to measure the forces acting on the weight scale. The spline displacement method is introduced to recover alignment forces from sparse checkpoints in practical scenarios.
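A decomposition of this kind follows from expanding one AdamW step. Writing the update as w ← w − lr · (u + wd · w), where u is Adam's normalized step, the change in the squared norm splits into a term from the overlap of w and u, a term from the step magnitude, a term from weight decay, and small higher-order remainders. The sketch below checks that algebra numerically; mapping these terms to the paper's "alignment", "injection", and "decay" forces is an inference from the summary, and the function name is hypothetical.

```python
def adamw_norm_forces(w, u, lr, weight_decay):
    """Split the one-step change of ||w||^2 under w <- w - lr * (u + wd * w).

    w: current weights; u: Adam's normalized update direction m_hat / (sqrt(v_hat) + eps).
    Returns (alignment, injection, decay, higher_order); their sum is the
    exact change in the squared norm.
    """
    dot_wu = sum(wi * ui for wi, ui in zip(w, u))
    norm_w2 = sum(wi * wi for wi in w)
    norm_u2 = sum(ui * ui for ui in u)

    alignment = -2.0 * lr * dot_wu                 # grows ||w||^2 when the step opposes... w·u < 0
    injection = lr**2 * norm_u2                    # always non-negative
    decay = -2.0 * lr * weight_decay * norm_w2     # always non-positive
    higher_order = (2.0 * lr**2 * weight_decay * dot_wu
                    + (lr * weight_decay) ** 2 * norm_w2)
    return alignment, injection, decay, higher_order
```

In this form the norm grows while the alignment and injection terms outweigh decay, and it stalls once decay, which scales with the norm itself, catches up.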
Results
The analysis reveals that alignment forces dominate the weight-scale growth phase, contributing 88-94% of the force budget. The transition from growth to relaxation of λ is explained by the balance between alignment and decay forces. The spline displacement method successfully recovers alignment forces with approximately 92-94% accuracy, outperforming naive methods. The peak value of λ varies with training data coherence, indicating a relationship between data characteristics and weight-scale dynamics.
Implications
Understanding the dynamics of weight evolution during training can enhance optimization strategies, improve model generalization, and inform architecture design. The findings may lead to better training practices and insights into the effects of data diversity on model performance.
Comparing Linear Probes with Mahalanobis Cosine Similarity
Interpretability
Theory
Large Language Models
- MCS provides a task-aware refinement for comparing linear probes, outperforming Euclidean cosine similarity.
- The linear relationship between MCS and OOD AUROC is validated across multiple models, layers, and datasets.
- Theoretical foundations explain the linearity of the AUROC-MCS relationship based on signal-to-noise ratios.
- Conditions for failure of linearity are identified and empirically verified.
Summary
This paper investigates the effectiveness of Mahalanobis Cosine Similarity (MCS) as a method for comparing linear probes in machine learning, particularly in interpretability research. Linear probes are commonly used to assess model behavior, but their performance can vary significantly across different datasets. The authors extend previous findings that MCS, which adjusts the inner product by the covariance of out-of-distribution (OOD) data, is linearly related to the OOD AUROC (Area Under the Receiver Operating Characteristic curve) of a probe. They demonstrate that this relationship holds across various models, layers, and datasets, with a high goodness of fit (R² > 0.93). The paper also provides a theoretical foundation for this linear relationship, showing that both OOD AUROC and MCS are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR). The authors identify conditions under which this linearity may fail and verify these conditions through empirical tests. Overall, MCS is presented as a robust alternative to traditional Euclidean cosine similarity for evaluating linear probes.
Methodology
The authors employed logistic regression probes trained on in-distribution (ID) and out-of-distribution (OOD) datasets across various models (Llama-70B, Llama-8B, Qwen-7B) and layers. They calculated MCS using the covariance of the OOD data and compared it to the traditional Euclidean cosine similarity (ECS). The study involved extensive empirical testing across 24 datasets from three domains: truthfulness, gender classification, and general NLP benchmarks.
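The difference between the two similarity measures can be sketched in a few lines. The version below weights the inner product of two probe directions by a covariance matrix, so that directions along which the data barely varies contribute little. Using the covariance itself rather than its inverse is an assumption based on the summary's wording, and the function names are hypothetical; in practice the matrix could be estimated with `np.cov(ood_activations, rowvar=False)`.

```python
import numpy as np


def euclidean_cosine(a, b):
    """Plain cosine similarity between two probe weight vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mahalanobis_cosine(a, b, sigma):
    """Cosine similarity under the inner product <a, b> = a^T sigma b.

    sigma: covariance matrix of the OOD activations (assumed weighting).
    """
    numerator = a @ sigma @ b
    denominator = np.sqrt((a @ sigma @ a) * (b @ sigma @ b))
    return float(numerator / denominator)
```

As a small example, the probes [1, 1] and [1, -1] are orthogonal in the Euclidean sense, yet if the data only varies along the first coordinate they produce identical outputs, and the covariance-weighted similarity is 1.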
Results
The study found that MCS consistently showed a near-linear relationship with OOD AUROC across different models and datasets, with R² values exceeding 0.93 in all tested conditions. In contrast, ECS fit far less well, with R² values as low as 0.06. The theoretical analysis confirmed that both metrics are S-shaped functions of the probe's SNR, leading to their linear relationship under certain conditions.
Implications
The findings suggest that MCS can be a more reliable metric for evaluating the generalization performance of linear probes, particularly in interpretability research. This could enhance the understanding of model behavior across varying datasets and improve the robustness of interpretability tools in machine learning.
Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection
Optimization
Efficient ML
Theory
- Introduces a continuous relaxation of the DPP-MAP problem to address computational challenges.
- Develops an NEPV framework that allows for efficient diversity-aware data selection.
- Proposes an algorithm (NEPV-DPP) with near-linear scaling in large datasets.
- Focuses on theoretical guarantees of the algorithm's convergence and performance.
Summary
This paper addresses the challenge of selecting a diverse and high-quality subset from a large pool of candidates, a common task in machine learning applications such as data curation and active learning. The authors focus on Determinantal Point Processes (DPPs), which provide a principled approach to diversity but face computational challenges due to the NP-hard nature of their MAP objective. The paper introduces a continuous relaxation of the DPP-MAP problem over the Stiefel manifold, leading to a new family of Nonlinear Eigenvalue Problems with eigenvector dependency (NEPV). The proposed NEPV-DPP algorithm utilizes a self-consistent field (SCF) iteration that guarantees local contraction, allowing for efficient computation with a time complexity of O((ndk + nk²)t), where n is the number of candidates, k is the subset size, and t is the number of iterations. This approach scales near-linearly with n, making it feasible for large-scale data selection scenarios. The paper emphasizes the theoretical aspects of the relaxation and the algorithm's scalability, with plans for empirical validation in future work.
Methodology
The authors recast the DPP-MAP problem as a continuous optimization problem on the Stiefel manifold, leading to a Nonlinear Eigenvalue Problem with eigenvector dependency. They derive a self-consistent field iteration for solving this NEPV, ensuring efficient computation through matrix-vector products with the kernel.
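The general shape of a self-consistent field iteration can be shown on a toy eigenvector-dependent problem H(V) V = V Λ: build H from the current V, take its leading k eigenvectors as the new V, and repeat until the spanned subspace stops changing. The matrix function `toy_h` below is a hypothetical stand-in chosen only to make the example runnable; it is not the operator derived in the paper, and the dense eigendecomposition does not reflect the paper's matrix-vector-product cost.

```python
import numpy as np


def toy_h(V, A, alpha):
    """Hypothetical V-dependent matrix: A plus a diagonal term from V V^T."""
    rho = np.sum(V * V, axis=1)  # diagonal of the projector V V^T
    return A + alpha * np.diag(rho)


def scf(A, k, alpha=0.05, max_iter=100, tol=1e-12):
    """Self-consistent field iteration for H(V) V = V Lambda (top-k eigenvectors)."""
    _, vecs = np.linalg.eigh(A)
    V = vecs[:, -k:]  # start from the linear problem (alpha = 0)
    for _ in range(max_iter):
        _, vecs = np.linalg.eigh(toy_h(V, A, alpha))
        V_new = vecs[:, -k:]  # eigh sorts eigenvalues in ascending order
        # Compare projectors so that sign or rotation changes are ignored.
        change = np.linalg.norm(V_new @ V_new.T - V @ V.T)
        V = V_new
        if change < tol:
            break
    return V
```

The iterate stays on the Stiefel manifold by construction, since each V consists of orthonormal eigenvectors; convergence in this toy case relies on the V-dependent term being small relative to the eigenvalue gap.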
Results
The NEPV-DPP algorithm demonstrates a significant reduction in computational complexity, allowing for diversity-aware subset selection in large-scale datasets, with a theoretical guarantee of local contraction in the SCF iteration.
Implications
This work has potential applications in various machine learning tasks requiring efficient data selection, such as active learning, data curation, and retrieval diversification, particularly in scenarios involving large datasets.
Agentic Symbolic Search: Characterizing PDEs Beyond Hand-crafted Expressions, Meshes, and Neural Networks
Theory
Interpretability
- ASYS automates the search for symbolic representations of PDE solutions, integrating prior knowledge and evolutionary search.
- The framework produces interpretable mathematical structures that can guide further analysis of complex PDEs.
- ASYS demonstrates the ability to recover known analytical forms and generate new approximations for PDEs lacking closed-form solutions.
- The approach highlights the limitations of traditional numerical methods and neural networks in providing explicit mathematical structures.
Summary
The paper introduces Agentic Symbolic Search (ASYS), a novel framework designed to characterize solutions of partial differential equations (PDEs) by leveraging prior knowledge from PDE theory and accumulated search experience. Unlike traditional methods that rely on hand-crafted expressions or numerical simulations, ASYS employs a coding agent to generate testable symbolic programs that encapsulate the mathematical structures of PDE solutions. The framework refines these structures through evolutionary search while optimizing continuous parameters using gradient-based techniques. ASYS successfully recovers known analytical forms for certain problems and constructs analytical approximations for others, thereby facilitating further mathematical analysis. The authors demonstrate ASYS's effectiveness across five diverse problems, yielding interpretable representations such as a geometric interface formula for Allen–Cahn 2D dynamics and a nine-parameter contraction law for Keller–Segel chemotactic blow-up, where no closed-form solutions were previously available. This work suggests a paradigm shift in PDE characterization, moving beyond traditional analytical and numerical approaches.
Methodology
ASYS employs a coding agent that translates PDE theory and constraints into symbolic programs. It utilizes evolutionary search to refine mathematical forms and gradient-based optimization to fit continuous parameters, allowing for a structured exploration of potential solutions.
Results
Across five PDE problems, ASYS produced interpretable representations, including a geometric interface formula for Allen–Cahn 2D dynamics and a nine-parameter contraction law for Keller–Segel chemotactic blow-up, cases where no closed-form solutions were previously available.
Implications
The implications of ASYS extend to various fields that rely on PDEs, including physics, engineering, and applied mathematics. By providing a method to derive interpretable solutions, ASYS can enhance understanding and facilitate further research into complex dynamic systems.
Efficient Neural Network Model Selection for Few-Class Application Datasets
Robotics
Efficient ML
- Introduces a measure of classification difficulty based on dataset properties for efficient model selection.
- Demonstrates 'few-class distinctiveness', showing different accuracy behaviors for few-class datasets compared to many-class datasets.
- Identifies and utilizes scaled models that are smaller and more efficient for few-class applications.
- Provides experimental evidence supporting the advantages of few-class model selection in practical applications.
Summary
This paper addresses the challenge of selecting efficient neural network models for applications with few-class datasets, which are often overlooked in favor of models evaluated on larger datasets with many classes. The authors introduce a novel measure of classification difficulty based on data-side properties, termed 'few-class distinctiveness', which allows for faster model selection by quantifying the relationship between classification difficulty and class similarity. The proposed metric enables model comparisons that are 6 to 29 times faster than traditional training and testing methods. The authors also extend existing model families to create smaller models that maintain similar accuracy, achieving models up to 42% smaller than the YOLOv5-nano for mobile robot tasks. The paper demonstrates practical applications in resource-constrained environments such as mobile robots, drones, and IoT devices, highlighting significant efficiency gains without compromising performance.
Methodology
The authors develop a quantitative measure of classification difficulty that incorporates five data-side properties: number of classes, intra-class similarity, inter-class similarity, scale-resolution, and color. They conduct experiments to analyze how these properties affect classification performance and use this information to select existing models that are more efficient for few-class datasets.
Results
The proposed difficulty measure allows model comparisons that are 6 to 29 times faster than conventional training and testing. The extended model families yield models up to 42% smaller than YOLOv5-nano for mobile robot tasks while maintaining similar accuracy, with drones and IoT devices as further resource-constrained targets.
Implications
The findings suggest that practitioners can leverage data-side properties to make informed decisions about model selection, leading to more efficient neural network applications in real-world scenarios with limited classes. This approach can significantly reduce computational resources and time in model training and selection processes.
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution
Optimization
Reinforcement Learning
Large Language Models
- Introduces Marginal Advantage Accumulation (MAA) for effective memory optimization in agents.
- Addresses the lack of cross-batch evidence accumulation in existing methods.
- Achieves the best results in 14 of 16 settings across four benchmarks and four target models.
- Reduces optimization-phase token consumption by about 75%.
Summary
This paper addresses the challenges in batch-style trace distillation for memory-driven agent self-evolution, where contradictory feedback across different batches complicates the optimization of memory operations. The authors propose a novel method called Marginal Advantage Accumulation (MAA), which introduces a cross-batch evidence accumulation mechanism that satisfies two structural conditions: alignability and comparability. MAA constructs differential signals to enable meaningful comparisons across batches and accumulates evidence for each operation using Exponential Moving Average (EMA). This approach allows for the identification of stably effective operations while reducing the noise from batch-specific characteristics. The method is designed as a post-processing architecture and demonstrates superior performance, achieving the best results in 14 out of 16 settings across four benchmarks and four target models, significantly outperforming existing batch-level distillation methods and matching or surpassing online alternatives in most scenarios. Furthermore, MAA reduces optimization-phase token consumption by approximately 75%.
Methodology
The MAA method involves creating a stable memory bank with identifiable operations across batches, utilizing differential signals to compare performance across different task distributions, and employing Exponential Moving Average (EMA) for evidence accumulation. The method focuses on aligning and comparing operations semantically to ensure effective aggregation of evidence.
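The accumulation step can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the differential signal is an operation's score minus a baseline score measured on the same batch, and that evidence is smoothed with a standard EMA. The class name, the `alpha` value, and the operation identifiers are hypothetical.

```python
class MarginalAdvantageTracker:
    """Accumulates per-operation evidence across batches with an EMA."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # EMA smoothing factor (assumed value)
        self.evidence = {}   # operation id -> accumulated marginal advantage

    def update(self, batch_baseline, op_scores):
        """Fold one batch into the running evidence.

        batch_baseline: score on this batch without the operation.
        op_scores: operation id -> score on this batch with the operation.
        """
        for op_id, score in op_scores.items():
            # Differential signal: comparable across batches of differing difficulty.
            advantage = score - batch_baseline
            prev = self.evidence.get(op_id)
            if prev is None:
                self.evidence[op_id] = advantage
            else:
                self.evidence[op_id] = (1 - self.alpha) * prev + self.alpha * advantage

    def stable_ops(self, threshold=0.0):
        """Operations whose accumulated evidence exceeds the threshold."""
        return sorted(op for op, e in self.evidence.items() if e > threshold)

tracker = MarginalAdvantageTracker(alpha=0.5)
# Three made-up batches with very different baseline difficulty.
tracker.update(0.50, {"add_hint": 0.60, "drop_rule": 0.70})
tracker.update(0.80, {"add_hint": 0.90, "drop_rule": 0.60})
tracker.update(0.30, {"add_hint": 0.40, "drop_rule": 0.20})
kept = tracker.stable_ops()
```

Here `add_hint` helps by a similar margin in every batch and is kept, while `drop_rule` looks good in one batch but hurts in the next two, so its accumulated evidence ends up negative.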
Results
MAA outperformed existing batch-level distillation baselines in 14 out of 16 experimental settings across four benchmarks and four target models. It also matched or surpassed online alternatives in most cases while significantly reducing token consumption during the optimization phase.
Implications
The findings suggest that MAA can enhance the efficiency and effectiveness of memory-driven agents in various applications, including autonomous systems in scientific discovery, engineering, and everyday tasks. The method's ability to distinguish between stable and accidental operations could lead to more robust agent self-evolution strategies.
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
NLP
Large Language Models
Theory
- Latent Chain-of-Thought models face challenges due to weak supervision signals leading to gradient attenuation and representational drift.
- The paper introduces trajectory supervision and space supervision as two complementary dimensions of process supervision.
- The Unified Latent Probe (ULP) is proposed to quantify the mutual information between latent trajectories and reasoning steps.
- Experiments reveal a strong correlation between information fidelity in latent chains and reasoning accuracy.
Summary
This paper investigates the challenges of robust latent reasoning in Latent Chain-of-Thought (CoT) models, which utilize continuous hidden states for reasoning instead of verbose discrete sequences. The authors identify a dual collapse phenomenon where gradient attenuation and representational drift hinder effective learning. They propose a framework for process supervision, decomposing it into trajectory supervision, which provides dense reasoning signals, and space supervision, which maintains the semantic structure of the latent space. The study introduces the Unified Latent Probe (ULP) to measure the mutual information between latent trajectories and reasoning steps. Experimental results demonstrate a clear relationship between information fidelity in the latent chain and reasoning accuracy, suggesting a shift from geometric imitation to mutual information maximization as a more effective supervision strategy.
Methodology
The authors analyze Latent CoT using an information-theoretic framework, decomposing process supervision into trajectory and space supervision. They introduce the Unified Latent Probe (ULP) to measure mutual information and conduct empirical experiments to assess the effectiveness of different supervision strategies.
Results
The experiments show that trajectory supervision enhances training stability and reasoning accuracy by increasing gradient magnitudes. Space supervision's effectiveness varies, with generative reconstruction preserving information capacity better than geometric compression, which can collapse the reasoning manifold.
Implications
The findings suggest that improving supervision strategies in latent reasoning models can lead to better performance in complex reasoning tasks, potentially influencing future research in model training and architecture design.
IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows
Audio & Speech
NLP
Large Language Models
- IHBench defines post-interruption recovery as a distinct evaluation axis for voice agents.
- The benchmark includes six types of interruptions and a two-axis scoring system for evaluation.
- Closed-weight models consistently outperform open-weight models in handling interruptions.
- The study highlights significant gaps in current models' recovery capabilities, indicating areas for improvement.
Summary
The paper introduces IHBench, a benchmark designed to evaluate how voice agents recover from interruptions during structured workflows across various domains such as customer service and healthcare. While existing benchmarks focus on the mechanics of interruptions, IHBench specifically assesses the agent's ability to resume workflows correctly after an interruption. The authors define post-interruption recovery as a distinct evaluation axis, identifying six types of interruptions and proposing a two-axis scoring system based on task fulfillment and recovery quality. The benchmark includes synthetic multi-turn conversations with controlled interruptions and evaluation rubrics. The study evaluates 27 audio-language model configurations from prominent organizations, revealing that closed-weight models generally perform better in handling interruptions compared to open-weight models. The findings indicate significant gaps in post-interruption recovery capabilities among current models, suggesting that targeted evaluations can help identify training deficiencies.
Methodology
The authors developed IHBench, a benchmark consisting of synthetically generated multi-turn conversations that simulate interruptions in structured workflows. They evaluated 27 audio-language model configurations using a two-axis scoring system focused on task fulfillment and recovery quality. The evaluation included controlled interruptions and was validated through inter-judge agreement studies and comparisons with existing benchmarks.
Results
The evaluation revealed that closed-weight models are more robust to interruptions, achieving higher task fulfillment rates and degrading more slowly as conversation length increases. Open-weight models showed a significant decline in performance across various metrics, including a notable audio-versus-text modality gap. The study also found that many models struggled with post-interruption recovery, indicating a need for targeted training.
Implications
The findings suggest that improving post-interruption recovery in voice agents is crucial for enhancing user experience in real-time applications. The IHBench benchmark can guide future research and development efforts in training more effective voice agents capable of handling interruptions seamlessly.
Advances in Scientific Machine Learning for Coupled Fluid Flow and Transport
Theory
Efficient ML
- Introduction of SciML as a solution to computational challenges in fluid dynamics.
- Review of surrogate modeling techniques including PINNs and β-VAEs.
- Demonstration of applications in turbidity currents and thermal flow modeling.
- Discussion of high-performance computing strategies to enhance model efficiency.
Summary
This chapter reviews recent advancements in Scientific Machine Learning (SciML) for modeling coupled fluid flow and transport phenomena, particularly focusing on systems governed by the incompressible Navier–Stokes and scalar transport equations. These systems, prevalent in applications like turbidity currents and thermal convection, are characterized by strong nonlinear coupling and multiscale behavior, leading to high computational costs for traditional simulations. The authors discuss state-of-the-art SciML approaches, including linear reduced-order methods such as Dynamic Mode Decomposition and nonlinear techniques like Physics-Informed Neural Networks (PINNs) and β-Variational Autoencoders (β-VAEs). They present their contributions, including surrogate modeling of turbidity currents using PINNs and extracting disentangled nonlinear modes from thermal flows with β-VAEs. The chapter emphasizes the mathematical and physical foundations of coupled fluid flow, alongside computational strategies like Adaptive Mesh Refinement/Coarsening (AMR/C) and scientific floating-point data compression. Overall, it illustrates how SciML can provide fast and accurate approximations of complex coupled systems while significantly reducing computational costs compared to full-order simulations. The chapter also highlights ongoing research directions, including real-time prediction and uncertainty quantification, which depend on the specific problem context.
Methodology
The authors employ a combination of linear reduced-order methods, such as Dynamic Mode Decomposition, and nonlinear neural network-based techniques, including Physics-Informed Neural Networks (PINNs) and β-Variational Autoencoders (β-VAEs), to construct surrogate models for coupled fluid flow and transport phenomena. They also integrate high-performance computing strategies like Adaptive Mesh Refinement/Coarsening (AMR/C) and data compression techniques.
Results
The chapter presents successful applications of SciML techniques in modeling turbidity currents and extracting nonlinear modes from thermal flows. These methods demonstrate the capability to provide efficient and accurate approximations of complex fluid dynamics, significantly reducing the computational burden associated with traditional high-fidelity simulations.
Implications
The findings suggest that SciML can revolutionize the modeling of complex fluid systems, enabling faster simulations and real-time predictions. This has potential applications in environmental monitoring, engineering design, and any field requiring efficient modeling of fluid dynamics and transport phenomena.
The Significance of Style Diversity in Annotation-Free Synthetic Data Generation
NLP
Large Language Models
Generative Models
- Introduces an annotation-free framework for synthetic dialogue generation.
- Demonstrates that style diversity is more critical than topic diversity for data utility.
- Proposes two novel stylization models (Univ and Exam) for enhancing linguistic style.
- Achieves up to 93.3% of the performance of models trained on human-annotated data.
Summary
This paper presents a novel framework for generating synthetic dialogue data for intent classification without relying on human-annotated seed data. The authors emphasize the importance of style diversity over topic diversity in enhancing the utility of synthetic data. The proposed framework generates dialogues based on intent definitions and incorporates two types of attributes: topic and style. To further improve the diversity of the generated data, the authors introduce two stylization models, Univ and Exam, which transform the generated utterances into varied, human-like styles. An LLM-as-a-judge filtering process is employed to ensure high data quality. Experimental results demonstrate that the proposed method achieves up to 93.3% of the performance of models trained on human-annotated data, highlighting that style diversity is crucial in preventing models from learning spurious correlations. The findings suggest that incorporating style attributes during the generation process is more effective than post-hoc adaptations.
Methodology
The authors developed a framework that generates synthetic dialogues using intent definitions without human annotations. They categorized attributes into topic and style, and introduced two stylization models to adapt generated utterances to human-like styles. An LLM was used to filter out low-quality samples, enhancing the overall quality of the generated data.
Results
The experimental results showed that the proposed framework reached 90.7% and 93.3% of the performance of models trained on human-annotated data on the industrial and public datasets, respectively. The study confirmed that style diversity significantly improves the utility of synthetic data, while topic diversity had a lesser impact.
Implications
This work has significant implications for industries that require rapid development and testing of models in new domains without access to human-annotated data. It suggests that focusing on style diversity can lead to more effective synthetic data generation, which can be crucial for applications in dialogue systems and intent classification.
Learner-based Concept Drift Detection: Analysis and Evaluation
Theory
Time Series
Efficient ML
- Concept drift poses significant challenges to machine learning models in dynamic environments.
- The paper categorizes drift detection methods into SPC-based, Window-based, and Ensemble-based approaches.
- A total of 15 drift detection algorithms are reviewed and empirically evaluated.
- The study emphasizes the need for adaptive algorithms capable of handling non-stationary data distributions.
Summary
This paper addresses the critical challenge of concept drift in machine learning, particularly in evolving streaming environments where data distributions change over time. The authors provide a comprehensive survey of concept drift characteristics, types, and detection methods, focusing on learner-based detection approaches. They categorize drift detection methods into three main types: Statistical Process Control (SPC), Window-based, and Ensemble-based methods, reviewing 15 representative detectors. The study includes an empirical evaluation of these methods on both synthetic datasets, where drift locations are known, and real-world datasets, assessing their performance under various drift scenarios, including abrupt and gradual changes. The findings highlight the importance of timely and efficient drift detection for maintaining predictive accuracy in applications such as fraud detection, health monitoring, and environmental monitoring. By combining theoretical analysis with practical evaluation, the paper enhances the understanding of concept drift and the effectiveness of different detection strategies.
Methodology
The authors conducted a theoretical analysis of concept drift characteristics and categorized drift detection methods. They performed empirical evaluations using synthetic datasets with known drift locations and real-world datasets to assess the performance of various drift detection algorithms under different scenarios.
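To make the SPC-based family concrete, here is a minimal sketch in the style of the well-known Drift Detection Method (DDM), which monitors a learner's running error rate and raises a warning or drift signal when it rises two or three standard deviations above its recorded minimum. The summary does not list the 15 detectors, so whether this exact detector is among them is an assumption; the class name, defaults, and synthetic error stream are illustrative.

```python
import math

class DDM:
    """DDM-style detector: tracks the learner's error rate and its minimum."""

    def __init__(self, min_samples=30, warn_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0                 # running error rate
        self.p_min = float("inf")    # error rate at the best point seen so far
        self.s_min = float("inf")    # its standard deviation

    def update(self, error):
        """error is 1 if the learner misclassified the sample, else 0.

        Returns "stable", "warning", or "drift".
        """
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(max(0.0, self.p * (1.0 - self.p)) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()             # start fresh statistics after a drift
            return "drift"
        if self.p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"

# Synthetic error stream: 10% errors for 200 samples, then 50% errors.
errors = [1 if i % 10 == 0 else 0 for i in range(200)]
errors += [1 if i % 2 == 0 else 0 for i in range(200)]

detector = DDM()
statuses = [detector.update(e) for e in errors]
first_drift = statuses.index("drift")
```

Because the drift location in the synthetic stream is known (sample 200), the detection delay `first_drift - 200` can be read off directly, which mirrors the kind of evaluation the summary describes for synthetic datasets.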
Results
The empirical evaluation revealed that different drift detection methods exhibited varying levels of effectiveness depending on the nature of the drift (sudden vs. gradual) and the characteristics of the datasets. The study provided insights into the strengths and weaknesses of each method, contributing to a better understanding of their applicability in real-world scenarios.
Implications
The findings of this study have significant implications for the development of adaptive machine learning systems that can maintain high accuracy in the face of changing data distributions. This is particularly relevant for applications in fraud detection, health monitoring, and other domains where timely decision-making is critical.
Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting
Time Series
- Mixture-of-experts models outperform other architectures in influenza forecasting.
- Numerical transformer-based models are reliable, especially with appropriate pretraining.
- Hospitalization data can enhance forecasting accuracy when used as an auxiliary signal.
- The study emphasizes the importance of evaluating models under realistic public health constraints.
Summary
This paper addresses the critical need for accurate short-term forecasting of seasonal influenza, which affects millions in the U.S. annually. The authors conduct a systematic evaluation of various forecasting models using influenza-like illness (ILI) and hospitalization time series data, focusing on both temporal and spatial generalization for 1-4 week ahead predictions. They compare classical neural networks, transformer-based models, pretrained time series foundation models, and LLM-based approaches. The study finds that a mixture-of-experts model, which integrates multiple pretrained forecasters, yields the best performance, highlighting the value of heterogeneous representations. Additionally, numerical transformer models are shown to provide reliable forecasts, with pretraining significantly enhancing performance, especially at longer prediction horizons. The research also explores the role of hospitalization data as an auxiliary signal, demonstrating its potential to improve forecasting accuracy. Overall, the findings offer actionable insights for model selection and pretraining strategies in epidemic forecasting.
Methodology
The authors compiled and standardized weekly ILI and hospitalization time series data at the U.S. HHS-region level. They evaluated 17 deep forecasting models, including classical neural networks and modern foundation models, under two generalization regimes: temporal (within-region) and spatial (across-region). The evaluation focused on 1-4 week ahead predictions, using a consistent preprocessing and training pipeline while reporting metrics such as Mean Squared Error (MSE) and Normalized Mean Squared Error (NNSE).
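The 1-4 week ahead evaluation setup can be illustrated with a short sketch. It assumes a single weekly series, a fixed lookback window, and a naive persistence baseline in place of any of the 17 evaluated models; the function names and the ILI values are made up for illustration.

```python
def make_windows(series, lookback, horizon):
    """Build (history, target) pairs for horizon-week-ahead forecasting."""
    pairs = []
    for t in range(lookback, len(series) - horizon + 1):
        history = series[t - lookback:t]
        target = series[t + horizon - 1]   # value `horizon` weeks after the window
        pairs.append((history, target))
    return pairs

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def persistence_forecast(history):
    """Naive baseline: repeat the last observed value."""
    return history[-1]

# Made-up weekly ILI percentages for one region.
ili = [1.2, 1.5, 2.1, 3.0, 4.2, 5.1, 4.8, 3.9, 2.7, 1.9]

scores = {}
for h in range(1, 5):                      # 1- to 4-week-ahead horizons
    pairs = make_windows(ili, lookback=4, horizon=h)
    y_true = [target for _, target in pairs]
    y_pred = [persistence_forecast(hist) for hist, _ in pairs]
    scores[h] = mse(y_true, y_pred)
```

Scoring each horizon separately, as in `scores`, is what makes it possible to see whether a model's advantage grows or shrinks at longer prediction horizons.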
Results
The study found that the mixture-of-experts model achieved the highest performance across the evaluated tasks. Numerical transformer-based models provided reliable forecasts, particularly benefiting from pretraining aligned with influenza dynamics. LLM-based methods underperformed compared to numerical forecasters. The incorporation of hospitalization data as an auxiliary covariate showed improvements in specific scenarios, indicating its value in enhancing multi-horizon forecasting robustness.
Implications
The findings of this research have significant implications for public health agencies in improving epidemic forecasting accuracy, which can inform vaccination strategies, hospital resource allocation, and overall preparedness for influenza outbreaks. The insights on model selection and pretraining strategies can guide future research and operational practices in epidemic surveillance.
Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving
NLP
Reinforcement Learning
Efficient ML
- FlashRT provides a low-latency execution environment optimized for small-batch AI serving.
- Execution-State Capsules enable efficient checkpointing and restoring of execution states, enhancing session management.
- The proposed system achieves a 2.6–2.8x cold time-to-first-token (TTFT) speedup over vLLM.
- The design integrates execution state management as a first-class object, improving responsiveness in interactive applications.
Summary
This paper introduces Execution-State Capsules, a novel approach designed to enhance the efficiency of low-latency, small-batch on-device AI serving, particularly for physical-world interactive applications. The proposed system, FlashRT, is a backend-facing kernel runtime that utilizes captured graphs over contiguous static buffers, eliminating block-table indirection. This design choice results in a significant reduction in cold time-to-first-token (TTFT), achieving a speedup of 2.6–2.8 times compared to existing systems like vLLM. The Execution-State Capsule allows for checkpointing and restoring the execution state, enabling efficient session reuse, interruption handling, and re-entry without compromising performance. The paper demonstrates that the capsule mechanism not only improves latency but also manages execution state as a first-class object, enhancing the overall system's responsiveness. The results indicate that the capsule's restore process is byte-identical and token-identical, ensuring consistency across different applications, including language models and reinforcement learning tasks. The findings suggest that this approach is particularly beneficial for scenarios with strict latency requirements and limited computational resources.
Methodology
The methodology involves developing a white-box kernel runtime (FlashRT) that captures execution graphs over static buffers, allowing for efficient state management. The execution-state capsule mechanism is introduced to checkpoint and restore the execution state, facilitating low-latency interactions in AI applications. Performance is evaluated on NVIDIA GPUs, measuring time-to-first-token and comparing against existing systems.
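The checkpoint-and-restore idea can be sketched in plain Python. This is a toy model under stated assumptions, not FlashRT's API: runtime state is represented as a set of fixed-size contiguous buffers, a capsule is an immutable snapshot of their bytes, and restore copies the bytes back in place so the buffer objects themselves never change. All class, function, and buffer names are hypothetical.

```python
class StaticBuffers:
    """Fixed-size contiguous buffers standing in for runtime execution state."""

    def __init__(self, sizes):
        self.buffers = {name: bytearray(size) for name, size in sizes.items()}

class Capsule:
    """Immutable snapshot of every buffer at checkpoint time."""

    def __init__(self, snapshot):
        self.snapshot = snapshot   # buffer name -> bytes

def checkpoint(state):
    """Copy the current contents of all buffers into a capsule."""
    return Capsule({name: bytes(buf) for name, buf in state.buffers.items()})

def restore(state, capsule):
    """Write the snapshot back in place, keeping the same buffer objects."""
    for name, data in capsule.snapshot.items():
        buf = state.buffers[name]
        if len(buf) != len(data):
            raise ValueError(f"size mismatch for buffer {name!r}")
        buf[:] = data              # same-length slice assignment: in-place copy

state = StaticBuffers({"kv_cache": 8, "position": 1})
state.buffers["kv_cache"][:4] = b"\x01\x02\x03\x04"
state.buffers["position"][0] = 4
kv_object = state.buffers["kv_cache"]

capsule = checkpoint(state)

# The session continues and overwrites the state.
state.buffers["kv_cache"][4:8] = b"\xff\xff\xff\xff"
state.buffers["position"][0] = 8

restore(state, capsule)
```

After `restore`, the buffer contents match the checkpoint byte for byte and the buffer object is unchanged, which is the property a graph captured over static buffers would rely on.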
Results
The results show that FlashRT achieves a cold time-to-first-token speedup of 2.6–2.8x over vLLM, rising to as much as 27x as prefix length increases. The capsule mechanism ensures byte-identical state restoration and token-identical outputs across different tasks, demonstrating that it preserves both performance and correctness.
Implications
The findings suggest that Execution-State Capsules can significantly enhance the performance of on-device AI systems, particularly in applications requiring real-time interaction, such as language processing, speech recognition, and robotics. This approach could lead to more responsive AI applications in constrained environments.
Neural Additive and Basis Models with Feature Selection and Interactions
Interpretability
Efficient ML
Theory
- Introduction of a feature selection mechanism in NAM and NBM to enhance computational efficiency.
- Ability to handle high-dimensional datasets and capture feature interactions effectively.
- NAM-FS and NBM-FS models show better or comparable performance to existing GAMs.
- Demonstrated improved throughput over vanilla NAM and NBM on high-dimensional data.
Summary
This paper addresses the challenge of low interpretability in deep neural networks (DNNs) by enhancing neural additive models (NAM) and neural basis models (NBM) with a feature selection mechanism. While NAM and NBM are known for their interpretability and performance, they struggle with computational efficiency when handling high-dimensional datasets or feature interactions. The authors propose incorporating a feature selection layer into both models, allowing for dynamic feature selection during training. This innovation reduces computational costs and model sizes, enabling the use of two-input neural networks even in high-dimensional contexts. The proposed models, termed NAM-FS and NBM-FS, demonstrate improved computational efficiency and maintain or exceed the predictive performance of existing generalized additive models (GAMs). Experimental results indicate that these models outperform vanilla NAM and NBM, particularly in high-dimensional scenarios, showcasing the effectiveness of the integrated feature selection mechanism.
Methodology
The authors developed NAM-FS and NBM-FS by integrating a feature selection layer into the existing architectures of NAM and NBM. This layer updates selection weights during training, allowing the models to dynamically choose relevant features and consider pairwise interactions without incurring excessive computational costs. The performance of the proposed models was evaluated against existing GAMs and other interpretable models using high-dimensional classification datasets.
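A forward pass through an additive model with a selection layer might look like the sketch below. The summary does not specify the layer's exact form, so this assumes a soft selection: each of k slots holds one logit per input feature, a softmax turns the logits into selection weights, and the slot's small shape network sees the weighted combination. Only k shape networks are evaluated instead of one per feature. All names and parameter values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shape_fn(x, w1, b1, w2):
    """Tiny one-hidden-layer network mapping a scalar to a scalar."""
    return sum(v * math.tanh(w * x + b) for w, b, v in zip(w1, b1, w2))

def nam_fs_forward(x, selection_logits, shape_params, bias=0.0):
    """Additive prediction from k selected inputs.

    x: list of d feature values.
    selection_logits: k lists of d logits (the trainable selection weights).
    shape_params: k tuples (w1, b1, w2), one shape network per slot.
    Returns (prediction, per-slot contributions).
    """
    contributions = []
    for logits, (w1, b1, w2) in zip(selection_logits, shape_params):
        weights = softmax(logits)
        selected = sum(w * xi for w, xi in zip(weights, x))
        contributions.append(shape_fn(selected, w1, b1, w2))
    return bias + sum(contributions), contributions

# Five input features, two slots; large logits make the selection nearly hard.
x = [1.0, 9.0, -2.0, 9.0, 9.0]
selection_logits = [[50.0, 0.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 50.0, 0.0, 0.0]]
shape_params = [([1.0], [0.0], [1.0]),
                ([1.0], [0.0], [1.0])]
prediction, contributions = nam_fs_forward(x, selection_logits, shape_params)
```

The per-slot `contributions` are what keep the model interpretable: each one can be attributed to the feature its slot has selected.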
Results
The experimental results showed that NAM-FS and NBM-FS are more computationally efficient than vanilla NAM and NBM, particularly on high-dimensional datasets. They achieved better or comparable predictive performance relative to state-of-the-art GAMs, demonstrating the effectiveness of the integrated feature selection mechanism.
Implications
The proposed models can be applied in fields requiring high interpretability and performance, such as healthcare and finance, where understanding feature contributions is crucial. The ability to efficiently handle high-dimensional data with interactions opens new avenues for research and application in various domains.