AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Reinforcement Learning
Large Language Models
Interpretability
- Introduction of boundary tokens <swi> and </swi> to facilitate latent reasoning and mechanistic analysis.
- Switch framework allows for effective optimization of hidden-state recurrence using on-policy RL.
- Significant performance improvement on MATH-500 benchmark, outperforming previous methods.
- Mechanistic analysis reveals the functional role of latent steps in computation.
Summary
This paper introduces Switch, a framework for switchable latent reasoning that addresses the difficulty of optimizing hidden-state recurrence with reinforcement learning (RL). The authors argue that existing methods struggle with causal interpretability and with on-policy RL optimization because nothing explicitly marks where latent computation begins and ends. By introducing discrete boundary tokens, <swi> and </swi>, Switch gives latent reasoning explicit boundaries, which yields a well-defined policy ratio during training and makes mechanistic analysis of the latent steps possible. Training follows a curriculum that transitions from visible reasoning to latent reasoning and uses a Switch-GRPO objective that propagates gradients through the latent computation. Switch achieves 79.3% accuracy on the MATH-500 benchmark, a 25.7-point improvement over the strongest existing baseline that uses hidden-state recurrence. The mechanistic analysis indicates that the boundary tokens serve as learned policies that enable meaningful computation within the latent steps rather than acting as inert placeholders. The work thus improves the performance of latent reasoning models and opens avenues for analyzing their internal mechanisms.
Methodology
The authors propose a three-phase training process for the Switch model, which includes a supervised fine-tuning (SFT) stage that wraps reasoning spans in boundary tokens, a curriculum that replaces visible reasoning with latent steps, and a Switch-GRPO optimizer that allows for gradient propagation through the latent computation. This approach enables the model to learn when to invoke latent reasoning effectively.
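As a rough illustration of the group-relative machinery that a GRPO-style objective builds on, the sketch below computes group-normalized advantages and a clipped policy-ratio term for a discrete token. It shows the standard GRPO recipe only, not the paper's Switch-GRPO objective, which additionally propagates gradients through the latent steps. The function names, the clip range of 0.2, and the toy rewards are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one token, given its log-probs under both policies."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy group (made-up data): four sampled answers to one problem, reward 1.0 if correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

The policy ratio in `clipped_surrogate` is only defined for tokens that have a probability under the policy, which is one way to read the summary's point that discrete boundary tokens make the ratio well-defined.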
Results
Switch achieves 79.3% accuracy on the MATH-500 benchmark, a 25.7-point increase over the strongest baseline that uses hidden-state recurrence. Switch-GRPO further reduces the latent invocation rate while raising accuracy by 12.6 points on the problems where latent reasoning is applied.
Implications
The findings suggest that the Switch framework can enhance the performance and interpretability of large language models in reasoning tasks. It opens up new possibilities for mechanistic analysis in RL settings, potentially leading to more robust and explainable AI systems.
Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems
NLP
Large Language Models
Graph Learning
- GTBP provides a structured approach to credit assignment in multi-LLM systems using graph-based modeling.
- The framework ensures stable prompt updates over iterations, enhancing the effectiveness of context adaptation.
- GTBP consistently outperforms existing methods on benchmark datasets while maintaining computational efficiency.
- The method allows for automated prompt engineering, reducing reliance on manual intervention in LLM workflows.
Summary
This paper introduces Graph-based Target Back-Propagation (GTBP), a novel framework designed for context adaptation in multi-LLM agentic systems. Context adaptation automates the process of prompt engineering by iteratively revising prompts based on task feedback while keeping model weights unchanged. The authors identify limitations in existing methods, such as inaccurate credit assignment and lack of convergence guarantees, particularly in multi-module systems. GTBP addresses these issues by modeling workflows as directed acyclic graphs and propagating local target outputs backward through the graph. This method utilizes discrepancies between target outputs and actual outputs to guide prompt updates in a stage-wise manner. Theoretical analysis demonstrates that GTBP's updates stabilize over iterations and can effectively minimize the overall objective when paired with a capable LLM optimizer. Empirical results show that GTBP outperforms strong baseline methods across three benchmark datasets, achieving improved prompt optimization performance without incurring additional computational costs.
Methodology
The methodology involves modeling agentic workflows as directed acyclic graphs and employing a target-back-propagation mechanism to infer local target outputs for sub-modules. GTBP updates prompts based on discrepancies between these targets and actual outputs, facilitating a structured credit assignment process without requiring gradient access.
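The backward pass over a workflow graph can be sketched as follows. This is a minimal illustration that assumes a hypothetical three-module workflow and a stub `infer_input_target` standing in for the LLM call that would propose local targets. None of these names come from the paper, and when a module feeds several children this toy version simply keeps the first target it receives.

```python
from graphlib import TopologicalSorter

# Hypothetical three-module workflow: each key lists the modules it depends on.
workflow = {
    "retrieve": [],
    "summarize": ["retrieve"],
    "answer": ["summarize"],
}


def infer_input_target(module, output_target):
    """Stand-in for the LLM call that proposes what a module's input should have been."""
    return f"input that lets {module} produce: {output_target}"


def backpropagate_targets(graph, sink, final_target):
    """Walk the DAG in reverse topological order, assigning a local target to each module."""
    targets = {sink: final_target}
    forward_order = list(TopologicalSorter(graph).static_order())
    for module in reversed(forward_order):
        if module not in targets:
            continue  # this module does not feed the sink
        for parent in graph[module]:
            # A parent's target output is the input its child needs.
            targets.setdefault(parent, infer_input_target(module, targets[module]))
    return targets


targets = backpropagate_targets(workflow, "answer", "a correct final answer")
```

In a full system, each module's prompt would then be revised using the gap between its entry in `targets` and the output it actually produced. No gradients are involved at any point.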
Results
GTBP was empirically validated on three benchmark datasets, demonstrating consistent improvements in prompt optimization performance compared to strong baseline methods, while maintaining a comparable computational cost.
Implications
This work suggests that GTBP can improve the efficiency and effectiveness of multi-LLM systems, particularly by automating prompt engineering and improving task performance in complex workflows.
Gefen: Optimized Stochastic Optimizer
Optimization
Efficient ML
- Gefen reduces the memory footprint of AdamW by approximately 8× without sacrificing performance.
- The optimizer automatically shares second-moment estimates and quantizes first moments using a learned codebook.
- The method is grounded in theoretical insights regarding Hessian affinity and squared gradients.
- Gefen enables larger micro-batches and improves throughput in distributed training settings.
Summary
The paper introduces Gefen, a memory-efficient optimizer designed to reduce the memory footprint associated with the widely used AdamW optimizer in deep learning. AdamW typically requires additional memory for first and second moment states, which can be substantial, especially for large models. Gefen addresses this by sharing second-moment estimates across parameter blocks and quantizing the first moment using a learned codebook, achieving an approximate 8× reduction in memory usage while maintaining performance levels comparable to AdamW. The authors provide a theoretical foundation for their approach, demonstrating that parameters with high Hessian affinity can effectively share second-moment statistics. Gefen infers the block structure from initial squared gradients, eliminating the need for architecture-specific metadata or hyperparameters. The empirical results show that Gefen achieves the lowest peak memory usage among AdamW-like optimizers while improving throughput in distributed training scenarios. The implementation is made publicly available, facilitating further research and application.
Methodology
Gefen employs a novel approach that combines automatic parameter grouping based on Hessian affinity with a learned quantization codebook for first-moment statistics. It infers block structures from initial squared gradients, avoiding manual configurations and architecture-specific requirements.
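To make the idea of sharing second-moment estimates concrete, here is a minimal pure-Python sketch of an Adam-like step in which all parameters in a block share one second-moment scalar. It is not Gefen itself: the block assignment is written by hand rather than inferred from initial squared gradients, the first moment is kept in full precision instead of being quantized with a learned codebook, and bias correction and weight decay are omitted. All names and numbers are hypothetical.

```python
import math


def shared_second_moment_step(params, grads, m, v_block, blocks,
                              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-like step where every parameter in a block shares one second-moment scalar.

    `blocks` maps a block id to the list of parameter indices it contains.
    """
    for block_id, indices in blocks.items():
        # One running average of squared gradients per block instead of per parameter.
        mean_sq = sum(grads[i] ** 2 for i in indices) / len(indices)
        v_block[block_id] = beta2 * v_block[block_id] + (1 - beta2) * mean_sq
        denom = math.sqrt(v_block[block_id]) + eps
        for i in indices:
            m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
            params[i] -= lr * m[i] / denom
    return params


params = [1.0, 2.0, 3.0, 4.0]
grads = [0.1, 0.1, -0.2, -0.2]
m = [0.0] * len(params)
blocks = {"a": [0, 1], "b": [2, 3]}  # hypothetical grouping
v_block = {block_id: 0.0 for block_id in blocks}
shared_second_moment_step(params, grads, m, v_block, blocks)
```

In this toy setup the second-moment state shrinks from four entries to two, which is where the memory saving from sharing comes from.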
Results
Gefen demonstrated the lowest peak memory usage compared to other AdamW-like optimizers across various models and datasets, while maintaining or improving performance metrics. The reduced memory requirements allowed for larger batch sizes and enhanced training throughput, particularly in distributed training environments.
Implications
The development of Gefen has significant implications for training large-scale deep learning models, as it allows for reduced memory usage and increased throughput. This can lead to more efficient training processes and the ability to work with larger datasets or models, ultimately enhancing the capabilities of deep learning applications.
Machine Learning for Biomedical Raman Spectroscopy: From Spectral Acquisition to Clinical Translation
Multimodal
- Machine learning is essential for extracting diagnostically relevant information from complex Raman spectra.
- The paper discusses various preprocessing techniques and machine learning methods for unsupervised and supervised learning in Raman spectroscopy.
- Challenges such as dataset size, inter-instrument variability, and reproducibility must be addressed for effective clinical translation.
- Future developments should focus on standardization, explainability, and the integration of multimodal data.
Summary
This review paper discusses the integration of machine learning (ML) techniques in the workflow of biomedical Raman spectroscopy, which is a powerful tool for non-invasive and chemically specific analysis of biological samples. The authors highlight the challenges posed by high-dimensional, noisy Raman spectra, which are affected by factors such as fluorescence background and biological variability. The paper covers various aspects of the Raman spectroscopy pipeline, including preprocessing, signal correction, unsupervised learning for structure discovery, and supervised learning for diagnosis and molecular stratification. It emphasizes the importance of explainability and the integration of Raman data with other modalities like imaging and pathology. The authors also address the practical challenges hindering clinical translation, such as limited dataset sizes and variability across instruments. They advocate for a coordinated approach towards standardization, robust validation, and the development of deployable analytical frameworks to enhance the clinical utility of Raman spectroscopy powered by machine learning.
Methodology
The paper reviews existing literature and methodologies related to machine learning applications in Raman spectroscopy, including preprocessing techniques, clustering methods, and supervised learning approaches for diagnostic classification. It also discusses the integration of multimodal data and the importance of explainability in machine learning models.
Results
The review synthesizes findings from various studies, demonstrating the effectiveness of machine learning in improving the analysis of Raman spectra for biomedical applications. It identifies key challenges that need to be overcome for successful clinical implementation, such as data sharing and standardization.
Implications
The findings suggest that with improved machine learning methodologies and standardized practices, Raman spectroscopy can become a more reliable tool for clinical diagnostics and decision support, potentially transforming patient care in areas like cancer diagnosis and treatment monitoring.
Two-Layer Linear Auto-Regressive Models Estimate Latent States
Theory
Time Series
Optimization
- Two-layer linear auto-regressive models can approximate Kalman filtering.
- The optimization landscape of the model is benign despite non-convexity.
- Finite-sample guarantees are provided for prediction and parameter estimation errors.
- Numerical simulations confirm the recovery of latent state estimates.
Summary
This paper explores the capabilities of two-layer linear auto-regressive models in estimating latent states from partially observed linear dynamical systems. The authors demonstrate that these models, when trained using empirical risk minimization, can effectively approximate Kalman filtering. The study reveals that the hidden representations learned by the model align closely with the state estimates produced by the optimal Kalman filter, despite the model lacking explicit knowledge of the underlying dynamics. The authors provide three main insights: first, they establish that the Kalman filter can be approximated by an auto-regressive model with bounded truncation error; second, they show that the optimization landscape for the two-layer model is benign, with all stationary points being either strict saddles or global minima; and third, they offer finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations validate these theoretical findings, illustrating that the latent representations of the auto-regressive models successfully recover state estimates.
Methodology
The authors formulated a two-layer auto-regressive model for estimating latent states and analyzed its optimization landscape. They established theoretical results regarding the model's ability to approximate the Kalman filter and provided finite-sample guarantees on various error metrics. Numerical simulations were conducted to support the theoretical claims.
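The link between Kalman filtering and a finite auto-regression can be checked numerically in a small case. The sketch below assumes a scalar system with known, hypothetical parameters. It unrolls the steady-state Kalman predictor into a weighted sum of past observations and truncates it to a fixed number of lags. It illustrates the truncation argument only, not the paper's learned two-layer model or its guarantees.

```python
import random


def steady_state_gain(a, c, q, r, iters=500):
    """Iterate the scalar Riccati recursion and return the steady-state predictor gain."""
    p = q
    for _ in range(iters):
        p = a * a * p - (a * p * c) ** 2 / (c * c * p + r) + q
    return a * p * c / (c * c * p + r)


def kalman_predictions(ys, a, c, gain):
    """Recursive steady-state predictor: x_hat <- (a - gain * c) * x_hat + gain * y."""
    x_hat, out = 0.0, []
    for y in ys:
        x_hat = (a - gain * c) * x_hat + gain * y
        out.append(x_hat)
    return out


def ar_predictions(ys, a, c, gain, lags):
    """The same predictor unrolled into an auto-regression over the last `lags` observations."""
    weights = [gain * (a - gain * c) ** k for k in range(lags)]
    out = []
    for t in range(len(ys)):
        window = ys[max(0, t - lags + 1): t + 1][::-1]  # most recent observation first
        out.append(sum(w * y for w, y in zip(weights, window)))
    return out


a, c, q, r = 0.9, 1.0, 0.1, 1.0  # hypothetical system parameters
rng = random.Random(0)
x, ys = 0.0, []
for _ in range(200):
    x = a * x + rng.gauss(0.0, q ** 0.5)
    ys.append(c * x + rng.gauss(0.0, r ** 0.5))

gain = steady_state_gain(a, c, q, r)
recursive = kalman_predictions(ys, a, c, gain)
truncated = ar_predictions(ys, a, c, gain, lags=30)
max_gap = max(abs(u - v) for u, v in zip(recursive, truncated))
```

Because the weights decay geometrically in `a - gain * c`, the gap between the recursive filter and the truncated auto-regression shrinks geometrically as the number of lags grows.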
Results
The study found that the two-layer linear auto-regressive models learned representations that closely matched the state estimates from the Kalman filter. The optimization landscape was shown to be favorable for convergence, and the finite-sample guarantees indicated reliable performance in terms of prediction and parameter estimation.
Implications
This work has significant implications for the development of auto-regressive models in various applications, particularly in fields requiring accurate state estimation from sequential data, such as robotics, control systems, and time series analysis. It also suggests a new direction for integrating classical control theory with modern deep learning techniques.
Emerging Flexible Designs for Geospatial Multimodal Foundation Models
Computer Vision
Multimodal
- Standardized benchmarking allows for fair comparison of geospatial foundation models.
- Insights into how tokenization and fusion strategies affect model robustness and spectral reasoning.
- Flexibility versus homogeneity trade-offs highlight the importance of aligning architecture with data diversity.
- The study emphasizes the need for adaptable models that can handle missing or novel modalities.
Summary
This paper investigates the architectural diversity of foundation models (FMs) in the context of geospatial multimodal reasoning, particularly focusing on their flexibility across various spectral band configurations. The authors conduct a systematic comparison of three leading FM architectures—DOFA, SatMAE, and Flex—by standardizing pretraining conditions, including self-supervised learning objectives and datasets. The evaluation is performed using the GEOBench benchmark for classification and segmentation tasks. The study reveals significant insights into the trade-offs between model flexibility, modality alignment, and performance on downstream tasks. The findings indicate that while Flex's modular design enhances adaptability to heterogeneous data, it may underperform in spectrally homogeneous scenarios. This research provides practical guidance for developing next-generation geospatial foundation models that can effectively reason across multiple modalities.
Methodology
The authors standardized the pretraining of three FM architectures (DOFA, SatMAE, Flex) using a shared Sentinel-2 dataset and identical self-supervised learning strategies. They employed consistent model configurations and evaluated performance on the GEOBench benchmark for classification and segmentation tasks. The evaluation involved linear probing for classification and a shared decoder for segmentation, ensuring uniform adaptation across models.
Results
The study found that the architectural choices significantly impacted model performance, with Flex demonstrating improved adaptability to missing or heterogeneous bands compared to a standard Vision Transformer (ViT). However, Flex showed limitations in spectrally homogeneous settings, indicating a trade-off between flexibility and generalization. The standardized benchmarking revealed clear insights into the strengths and weaknesses of each architecture under controlled conditions.
Implications
The findings suggest that future geospatial foundation models should prioritize flexibility and adaptability to diverse data modalities. This research can inform the design of models for various applications in Earth observation, such as climate monitoring, natural hazard assessment, and agricultural yield prediction, ultimately enhancing human understanding of complex environmental phenomena.
M*: A Modular, Extensible, Serving System for Multimodal Models
Multimodal
Efficient ML
Audio & Speech
- M* is designed to serve diverse multimodal models efficiently.
- It utilizes a Walk Graph abstraction to represent model architectures as dataflow graphs.
- M* achieves lower latency and higher throughput compared to existing serving systems.
- The system supports flexible component placement and model-agnostic optimizations.
Summary
The paper introduces M*, a universal serving system designed for efficient deployment of composite multimodal models that integrate various components such as vision encoders, language backbones, and audio codecs. Current serving frameworks struggle with the architectural diversity of modern multimodal models, which often require different execution paths for different tasks. M* addresses this by representing models as dataflow graphs, allowing for flexible composition of model components and optimized execution across physical clusters. The authors present the Walk Graph abstraction, which captures the complexity of multimodal models and enables model-agnostic optimizations. M* is instantiated on several representative models, demonstrating significant improvements in latency and throughput compared to existing systems. The findings suggest that M* can effectively serve complex multimodal models with minimal developer effort, paving the way for more efficient AI applications.
Methodology
The authors developed M* by creating a flexible intermediate abstraction that decouples model architecture from system runtime. They implemented the Walk Graph to capture diverse model architectures and facilitate efficient execution. The system was tested on multiple representative models, measuring performance metrics such as latency and throughput.
Results
M* achieved, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image tasks and up to 2.9× lower real-time factor for text-to-speech workloads. It also outperformed the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5×, demonstrating substantial efficiency gains.
Implications
The development of M* could significantly enhance the deployment of complex AI models across various applications, including real-time speech interaction, image and video processing, and robotic planning. Its modular design may also reduce the development burden for AI practitioners.
Scalable anomaly detection via a univariate Christoffel function
Theory
Efficient ML
Interpretability
- Introduction of a univariate Christoffel function (UCF) for scalable anomaly detection.
- UCF addresses the computational limitations of traditional Christoffel function methods in high dimensions.
- Extensive benchmarking shows UCF outperforms 14 state-of-the-art anomaly detection methods.
- The method retains key theoretical properties such as support shape capture and on-off support behavior.
Summary
This paper addresses anomaly detection in high-dimensional datasets using an approach based on a univariate Christoffel function (UCF). Traditional Christoffel function methods, while mathematically robust, scale poorly because they require inverting a matrix whose size grows exponentially with the data dimension. The proposed UCF simplifies evaluation by working with the squared distance between a target point and the support points, which allows anomaly detection without high-dimensional matrix operations. Extensive experiments on the ADBench benchmark show that UCF outperforms 14 state-of-the-art anomaly detection methods in terms of Average Precision, providing a scalable and theoretically grounded alternative to existing techniques. The work makes Christoffel function-based methods more practical while retaining the desirable properties of previous approaches.
Methodology
The authors developed a univariate Christoffel function that evaluates the squared distance between a target point and support points, simplifying the anomaly detection process. This method avoids the need for high-dimensional matrix inversion, making it computationally efficient and scalable for larger datasets.
Results
UCF consistently outperformed 14 state-of-the-art anomaly detection methods on the ADBench benchmark in terms of Average Precision, while avoiding the high-dimensional matrix inversion that limits previous Christoffel function-based approaches.
Implications
The proposed UCF method has significant implications for various domains requiring anomaly detection, such as fraud detection, network security, and industrial fault diagnosis. Its scalability and theoretical grounding make it a valuable addition to the toolkit of data scientists and practitioners in these fields.
ProPlay: Procedural World Models for Self-Evolving LLM Agents
Reinforcement Learning
Large Language Models
Robotics
- ProPlay integrates procedural world models with self-evolving agents to improve learning in partially observable environments.
- The framework uses a procedure graph to represent environment knowledge and causal transitions among tasks.
- Reliability records for transitions help agents estimate the effectiveness of past experiences.
- ProPlay allows agents to simulate future trajectories as soft guidance, balancing exploitation of knowledge with exploration.
Summary
The paper introduces ProPlay, a novel procedural world model designed for self-evolving agents that learn through interaction in partially observable environments. Traditional methods often rely on memory or planning modules but fail to integrate these components effectively to refine an agent's understanding of environmental dynamics. ProPlay addresses this gap by enabling agents to rehearse future procedural paths using a learned world knowledge framework. It abstracts successful trajectories into procedures organized in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding, which estimates its contribution to task success based on past outcomes. Before each episode, ProPlay simulates procedural trajectories over the graph as structured soft guidance, allowing agents to exploit accumulated knowledge while remaining adaptable. After execution, the graph is refined based on environmental feedback. Experimental evaluations on public benchmarks demonstrate that ProPlay enhances agents' understanding of their environments and their self-evolution capabilities, outperforming strong baseline methods.
Methodology
ProPlay employs a procedural world model represented as a procedure graph, where nodes are procedures and edges encode transitions. It performs procedure-level preplay to simulate task-specific trajectories and refines the graph based on feedback from executed episodes.
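A procedure graph with per-transition reliability can be sketched in a few lines. The version below is a simplification under stated assumptions: reliability is a smoothed success count rather than the learned record the paper describes, preplay is a greedy walk, and the class, method, and procedure names are all hypothetical.

```python
from collections import defaultdict


class ProcedureGraph:
    """Toy procedure graph: nodes are procedure names, edges carry success statistics."""

    def __init__(self):
        # edge -> [successes, attempts]; a simple count-based reliability record.
        self.records = defaultdict(lambda: [0, 0])
        self.successors = defaultdict(set)

    def update(self, trajectory, success):
        """Refine the graph with one executed episode (a list of procedure names)."""
        for src, dst in zip(trajectory, trajectory[1:]):
            self.successors[src].add(dst)
            record = self.records[(src, dst)]
            record[0] += int(success)
            record[1] += 1

    def reliability(self, src, dst):
        """Smoothed success rate, so rarely tried transitions are not ruled out."""
        successes, attempts = self.records[(src, dst)]
        return (successes + 1) / (attempts + 2)

    def preplay(self, start, max_steps=5):
        """Greedily follow the most reliable transitions to propose a trajectory."""
        path = [start]
        while len(path) < max_steps and self.successors[path[-1]]:
            current = path[-1]
            nxt = max(sorted(self.successors[current]),
                      key=lambda dst: self.reliability(current, dst))
            if nxt in path:
                break  # avoid cycles in this toy version
            path.append(nxt)
        return path


graph = ProcedureGraph()
graph.update(["find_key", "unlock_door", "enter_room"], success=True)
graph.update(["find_key", "break_window", "enter_room"], success=False)
suggested = graph.preplay("find_key")
```

The proposed path is meant as soft guidance: an agent could follow it when it fits the situation and deviate otherwise, then call `update` with the outcome to refine the graph.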
Results
ProPlay was evaluated on three public interactive benchmarks: ScienceWorld, τ-Bench, and PlanCraft. The results indicate that the procedural abstraction significantly enhances agents' ability to learn and adapt to environment dynamics, particularly in tasks with complex, reusable multi-step causal structures.
Implications
The findings suggest that integrating procedural world models can lead to more effective self-evolving agents capable of adapting to new environments and tasks without extensive retraining. This approach could have applications in robotics, autonomous systems, and interactive AI.
A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health
Time Series
- No single architecture dominates; PatchTST leads among trained models.
- TimesFM zero-shot model matches or exceeds trained models in low-data conditions.
- Participant-level fine-tuning significantly improves forecasting accuracy.
- The study provides insights into architecture selection and personalization strategies.
Summary
This paper presents a comprehensive evaluation of various deep learning architectures for multi-horizon behavioral forecasting using data from wearable devices and smartphones. The authors benchmark six deep learning models, two zero-shot Foundation Models (FMs), and statistical baselines across three public datasets involving over 800 participants. The study focuses on predicting health-related features such as step counts, screen time, and sleep duration over 1-8 day horizons. Key findings indicate that no single architecture consistently outperforms others; however, PatchTST excels among trained models, while TimesFM, a zero-shot FM, performs comparably or better in low-data scenarios. The research also highlights the significant impact of participant-level fine-tuning, which can reduce root mean square error (RMSE) by 16-60%, with sleep metrics benefiting the most. This study is notable for being the first to jointly assess modern deep learning architectures, FMs, and personalization strategies for behavioral forecasting in mobile health contexts.
Methodology
The authors benchmarked six deep learning architectures, two zero-shot FMs, and statistical baselines on three public datasets. They reported per-feature metrics for step counts, screen time, and sleep duration across multiple forecasting horizons (1-8 days). A personalization study was conducted to assess the impact of fine-tuning on forecasting accuracy.
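For readers unfamiliar with the metrics, the sketch below shows how per-horizon RMSE and the relative reduction from personalization could be computed. The numbers are made-up illustrative values (for example, hours of sleep over a two-day horizon), not results from the paper, and the function names are hypothetical.

```python
import math


def rmse_per_horizon(y_true, y_pred):
    """RMSE for each forecast horizon; inputs are lists of per-sample horizon vectors."""
    horizons = len(y_true[0])
    out = []
    for h in range(horizons):
        sq_errors = [(t[h] - p[h]) ** 2 for t, p in zip(y_true, y_pred)]
        out.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return out


def relative_reduction(baseline_rmse, tuned_rmse):
    """Fractional RMSE reduction of a fine-tuned model relative to its baseline."""
    return (baseline_rmse - tuned_rmse) / baseline_rmse


# Made-up data: two samples, two forecast horizons each.
y_true = [[7.0, 8.0], [6.0, 7.0]]
global_pred = [[8.0, 10.0], [5.0, 5.0]]
personal_pred = [[7.5, 9.0], [5.5, 6.0]]

global_rmse = rmse_per_horizon(y_true, global_pred)
personal_rmse = rmse_per_horizon(y_true, personal_pred)
reductions = [relative_reduction(g, p) for g, p in zip(global_rmse, personal_rmse)]
```

Reporting the reduction per feature and per horizon, rather than one pooled number, matches the per-feature reporting described above.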
Results
The results showed that PatchTST was the top-performing model among those trained, while TimesFM performed well in zero-shot scenarios, particularly in low-data environments. Fine-tuning at the participant level led to a significant reduction in RMSE, with sleep metrics showing the greatest improvement. Overall, the findings suggest that architecture performance varies by feature and that personalization is crucial for enhancing forecasting accuracy.
Implications
The findings provide practical guidance for selecting forecasting architectures and implementing personalization strategies in mobile health applications. The results can inform the development of proactive health interventions based on wearable data, enhancing the effectiveness of health monitoring and support systems.
PostDeg: Placement Beats Parameterization in LayerNorm GNNs
Graph Learning
- LayerNorm in GNNs can erase important topology signals needed for effective node selection.
- The placement of positive scalars in relation to LayerNorm significantly affects the preservation of topology information.
- PostDeg, a parameter-free method, enhances performance in various graph-related tasks by restoring degree contrast.
- Empirical tests confirm that placement, rather than parameterization, is critical for achieving performance gains.
Summary
This paper addresses the issue of how LayerNorm-based Graph Neural Networks (GNNs) erase topology signals that are crucial for node-selection policies. The authors identify that the erasure occurs due to the placement of positive per-node scalars in relation to LayerNorm. They propose a new method called PostDeg, which is a parameter-free post-LayerNorm inverse-degree scale that effectively restores the degree-conditioned magnitude contrast necessary for node ranking in tasks like influence maximization, network dismantling, and maximum independent set. The paper empirically validates the placement rule by demonstrating that the gains from PostDeg are not due to parameterization but rather the strategic placement of the topology scalar after LayerNorm. The authors also pre-register four falsifiers to test their hypothesis, all of which fail to reject the placement rule, reinforcing the significance of placement over parameterization in GNNs.
Methodology
The authors developed a new operator, PostDeg, which is applied after LayerNorm to scale node representations based on their inverse degree. They conducted empirical tests across multiple graph tasks and compared the performance of PostDeg against various control conditions to validate their placement rule. The methodology included the use of learned variants to separate the effects of placement from parameterization.
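The placement effect follows from a basic property of LayerNorm: up to its epsilon term, it is invariant to multiplying its input by a positive scalar. The sketch below demonstrates this with a LayerNorm that has no learned affine parameters and a plain 1/degree scale. The exact form of the inverse-degree scale used by PostDeg is an assumption here, and the features and degrees are made up.

```python
import math


def layer_norm(x, eps=1e-12):
    """LayerNorm over one node's feature vector, without learned affine parameters."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def norm(x):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(v * v for v in x))


features = [0.5, -1.0, 2.0, 0.3]  # same raw features for both nodes
degrees = {"hub": 8, "leaf": 1}   # hypothetical node degrees

# Scalar applied before LayerNorm: the normalization cancels it.
pre = {n: layer_norm([v / d for v in features]) for n, d in degrees.items()}

# Scalar applied after LayerNorm: the degree contrast survives.
post = {n: [v / d for v in layer_norm(features)] for n, d in degrees.items()}

pre_ratio = norm(pre["leaf"]) / norm(pre["hub"])
post_ratio = norm(post["leaf"]) / norm(post["hub"])
```

With the scalar placed before LayerNorm both nodes end up with the same magnitude, while placing it afterwards keeps the magnitudes in proportion to the inverse degree.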
Results
PostDeg achieved performance improvements of +3.5% in influence maximization, +2.5% in network dismantling, and +5.6% in maximum independent set tasks, with consistent paired-seed wins across all tasks. The empirical absorption envelope for the placement rule was found to be tightly bounded, and none of the four falsifiers were able to reject the placement hypothesis.
Implications
The findings suggest that the design of GNN architectures should prioritize the placement of topology-sensitive scalars to enhance performance in graph-based tasks. This could lead to more effective GNN models that better utilize structural information in graphs.
To GAN or Not To GAN: Segmentation Analysis on Mars DEM
Computer Vision
Generative Models
- Automatic detection of Martian mounds is crucial for understanding the planet's surface and potential for life.
- Neural Network-based Semantic Segmentation methodologies were employed for mound detection.
- The study compared supervised segmentation models with GAN-based approaches.
- Data augmentation using GAN-generated images did not improve segmentation performance.
Summary
This paper addresses the challenge of automatically detecting mounds on the Martian surface using Neural Network-based Semantic Segmentation methodologies. The authors emphasize the importance of identifying these morphologies for understanding the Martian environment, which could provide insights into the potential for extraterrestrial life, particularly in relation to water presence. The study compares traditional supervised semantic segmentation models with a generative adversarial approach (GAN) to enhance mound detection accuracy. However, the findings indicate that augmenting the training dataset with artificially generated data did not yield significant improvements in segmentation results. The research highlights the complexities of image segmentation in planetary science and the need for tailored approaches depending on the specific characteristics of the data and the segmentation task at hand.
Methodology
The authors utilized supervised semantic segmentation models alongside a generative adversarial network (GAN) approach to automatically detect mounds in Digital Elevation Models (DEMs) of Mars. The study involved comparing the performance of these methodologies to assess the effectiveness of data augmentation through GAN-generated images.
Results
The results revealed that the addition of artificially generated data from GANs did not enhance the accuracy of mound detection compared to traditional supervised segmentation methods. This finding suggests that the effectiveness of data augmentation in image segmentation tasks may vary based on the nature of the data and the specific segmentation objectives.
Implications
The implications of this research extend to planetary exploration, particularly in improving the capabilities of rovers to navigate and analyze Martian terrain. Understanding the limitations of data augmentation in segmentation tasks can inform future studies and methodologies in both planetary science and broader computer vision applications.
Representing Time Series as Structured Programs for LLM Reasoning
Large Language Models
Time Series
- Introduction of T2SP, a structured representation for time series that aligns with LLM capabilities.
- T2SP is deterministic, training-free, and compatible with off-the-shelf LLMs.
- Significant improvements in reasoning performance and reduced inference time compared to traditional methods.
- T2SP allows for effective time-series editing, captioning, and question answering without fine-tuning.
Summary
This paper addresses the challenge of effectively representing time series data for large language models (LLMs) to enhance their reasoning capabilities. Traditional methods often serialize raw numerical sequences or fine-tune LLMs on time-series data, which can lead to performance degradation due to modality mismatch. The authors propose a novel representation method called Time-Series-to-Structured-Program (T2SP), which transforms time series into a structured symbolic program format. This approach decomposes time series into trends, periods, and salient events, allowing LLMs to leverage their existing reasoning capabilities without the need for extensive training or fine-tuning. The evaluation of T2SP across three reasoning tasks—editing, captioning, and question answering—demonstrates significant improvements in performance, reduced reasoning time, and lower failure rates compared to raw-string representations. The results indicate that T2SP effectively bridges the gap between time series and LLMs, enabling better understanding and analysis of temporal data.
Methodology
The authors developed the Time-Series-to-Structured-Program (T2SP) representation, which decomposes raw time series data into structured components (trends, periods, events) and expresses them in a program-friendly syntax. This method does not require training and is designed to be easily interpretable by LLMs, facilitating reasoning tasks directly.
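As a rough illustration of the decomposition described above, the sketch below fits a linear trend, picks a dominant period by autocorrelation, flags salient events by residual z-score, and renders the result as a short program-like string. It is not the authors' T2SP implementation: the function names, the output syntax (`trend(...)`, `period(...)`, `event(...)`), the autocorrelation heuristic, and the thresholds are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def linear_trend(series):
    """Least-squares slope and intercept over index positions 0..n-1."""
    n = len(series)
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    sxy = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(series))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def dominant_period(residual, max_lag):
    """Lag in [2, max_lag] with the highest autocorrelation of the detrended series."""
    mu = mean(residual)
    centered = [r - mu for r in residual]
    denom = sum(c * c for c in centered) or 1.0

    def acf(lag):
        pairs = zip(centered[lag:], centered[:-lag])
        return sum(a * b for a, b in pairs) / denom

    return max(range(2, max_lag + 1), key=acf)


def to_structured_program(series, max_lag=16, z_thresh=3.0):
    """Render a series as trend / period / event statements (hypothetical syntax)."""
    slope, intercept = linear_trend(series)
    residual = [y - (intercept + slope * i) for i, y in enumerate(series)]
    period = dominant_period(residual, max_lag)
    sigma = pstdev(residual) or 1.0
    events = [i for i, r in enumerate(residual) if abs(r) / sigma > z_thresh]
    lines = [
        f"trend(slope={slope:.2f}, intercept={intercept:.2f})",
        f"period(length={period})",
    ]
    lines += [f"event(index={i}, deviation={residual[i]:+.2f})" for i in events]
    return "\n".join(lines)


# Toy input: upward drift, a period-8 cycle, and one spike at index 40.
series = [
    0.5 * i + 3 * math.sin(2 * math.pi * i / 8) + (20 if i == 40 else 0)
    for i in range(64)
]
program = to_structured_program(series)
print(program)
```

The procedure is deterministic and needs no training, which matches the properties the summary attributes to T2SP; the resulting string could be placed in an LLM prompt in place of the raw numbers.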
Results
The evaluation of T2SP showed consistent improvements across three reasoning tasks—editing, captioning, and question answering—compared to raw-string representations. The method reduced reasoning time and lowered failure rates, demonstrating its effectiveness in enabling LLMs to analyze time series data more efficiently.
Implications
The T2SP representation has the potential to enhance various applications in time-series analysis, allowing LLMs to perform complex reasoning tasks without the need for extensive retraining. This could lead to advancements in fields such as finance, healthcare, and environmental monitoring, where time-series data is prevalent.
Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models
Graph Learning
Time Series
Theory
- Identification of 'attribution bypass' in graph-based neural marketing mix models.
- Introduction of DICE-MMM as a diagnostic framework for separating graph recovery, forecasting, and decoder influence.
- Demonstration that low forecasting error does not equate to effective attribution.
- Empirical evidence showing the need for improved graph-support selection methods.
Summary
This paper addresses a critical issue in marketing mix models (MMM) that arises from the conflation of forecasting and attribution. The authors identify a failure mode termed 'attribution bypass,' where a high-capacity decoder achieves low forecasting error without effectively utilizing the underlying graph for attribution. To tackle this, they introduce DICE-MMM, a two-stage diagnostic framework designed to separate three questions: how well the graph is recovered, how accurate the forecast is, and whether the influence induced by the decoder aligns with the graph. The first stage involves training a graph encoder with a restricted decoder, while the second stage focuses on training a graph-safe latent decoder. The authors employ counterfactual influence graphs (CIG), autoregressive rollout influence graphs (AR-CIG), and frozen-decoder graph-swap tests to evaluate the decoder's performance. The results indicate that while DICE-MMM improves graph recovery compared to existing models, it does not guarantee accurate attribution. The findings highlight the importance of distinguishing between forecasting performance and the faithful use of graph structures in attribution, ultimately localizing the challenge to the selection of deployable graph-support interfaces.
Methodology
The authors propose a two-stage framework, DICE-MMM, which first trains a graph encoder with a restricted graph-mediated decoder to ensure that the graph discovery process is not dominated by a high-capacity response model. In the second stage, they freeze the encoder and train a graph-safe latent decoder, ensuring that the decoder's influence is aligned with the provided graph. They utilize CIG and AR-CIG diagnostics along with frozen-decoder graph-swap tests to assess the decoder's performance and the effectiveness of the learned graph.
Results
The empirical results show that while DICE-MMM improves stable graph recovery compared to CausalMMM, low mean squared error (MSE) does not certify effective attribution. In tests, decoders without graph input performed similarly to those with full graph input in terms of forecasting accuracy, but their alignment with the attribution graph (AR-CIG) was near chance. The use of an oracle graph significantly improved AR-CIG scores, indicating that the learned graph interfaces are currently insufficient for reliable attribution.
Implications
The findings suggest that practitioners using graph-based neural MMMs should be cautious of assuming that low forecasting error implies effective attribution. The DICE-MMM framework provides a diagnostic tool for identifying when a model fails to utilize the graph appropriately, guiding future research towards developing better graph-support selection methods for attribution tasks.
Speculative Rollback Correction for Quality-Diverse Web Agent Imitation
Reinforcement Learning
Robotics
Optimization
- Introduction of Speculative Rollback Correction (SRC) for interactive web agent training.
- SRC allows agents to learn from their own exploratory actions while still receiving expert feedback.
- The framework effectively mitigates compounding errors and state drift in long-horizon tasks.
- Extensive evaluations show consistent performance gains over baseline methods.
Summary
This paper addresses the challenges of training interactive web agents through imitation learning, particularly focusing on the timing of expert intervention during the learning process. The authors introduce a novel framework called Speculative Rollback Correction (SRC), which aims to improve the robustness of web agents by allowing them to learn from their own exploratory actions while still benefiting from expert guidance. SRC operates on a branch-level basis, where the agent executes a short speculative segment before a teacher review. This approach helps in identifying the first harmful deviation from the expert trajectory, allowing for a rollback that preserves useful prefixes of the agent's actions. The framework also incorporates a quality-diversity archive to retain high-quality trajectories, facilitating next-action supervised fine-tuning. The authors demonstrate that SRC effectively mitigates compounding errors and state drift, achieving significant performance improvements on long-horizon tasks in both web and desktop environments. The results indicate that SRC not only enhances training stability but also supports the learning of diverse solution paths, making it a promising approach for developing autonomous interactive agents.
Methodology
The SRC framework employs a branch-level training mechanism where the agent first executes a short speculative segment. After this segment, a teacher reviews the actions to identify harmful deviations. If a deviation is found, the framework rolls back to preserve the useful prefix of actions while correcting the harmful suffix. Additionally, a hard verifier assesses the overall success of the trajectory, and a lightweight quality-diversity archive retains high-quality successful trajectories for training.
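The control flow described above can be sketched as a loop over speculative segments. This is a toy sketch under stated assumptions, not the SRC implementation: the callables `agent`, `first_harmful`, and `expert`, the integer state, and the `horizon` parameter are hypothetical stand-ins, and the hard verifier and quality-diversity archive are left out.

```python
from typing import Callable, List, Optional, Tuple

State = int
Action = int


def speculative_rollback_episode(
    start: State,
    step: Callable[[State, Action], State],
    agent: Callable[[State], Action],
    first_harmful: Callable[[State, List[Action]], Optional[int]],
    expert: Callable[[State], Action],
    is_done: Callable[[State], bool],
    horizon: int = 3,
    max_steps: int = 50,
) -> Tuple[List[Tuple[State, Action]], int]:
    """Return the kept (state, action) pairs and the number of teacher reviews."""
    trajectory: List[Tuple[State, Action]] = []
    state, reviews = start, 0
    while not is_done(state) and len(trajectory) < max_steps:
        # 1) The agent acts on its own for up to `horizon` steps.
        seg_states, seg_actions = [state], []
        for _ in range(horizon):
            if is_done(seg_states[-1]):
                break
            action = agent(seg_states[-1])
            seg_actions.append(action)
            seg_states.append(step(seg_states[-1], action))
        # 2) One teacher review per segment rather than per step.
        reviews += 1
        bad = first_harmful(state, seg_actions)
        keep = len(seg_actions) if bad is None else bad
        # 3) Keep the useful prefix and roll back to the state before the harmful step.
        trajectory += list(zip(seg_states[:keep], seg_actions[:keep]))
        state = seg_states[keep]
        # 4) Replace the harmful step with an expert correction.
        if bad is not None:
            action = expert(state)
            trajectory.append((state, action))
            state = step(state, action)
    return trajectory, reviews


# Toy task: walk from 0 to 6 with +1 steps; the agent wrongly steps back at states 2 and 4.
def toy_step(s: State, a: Action) -> State:
    return s + a


def toy_agent(s: State) -> Action:
    return -1 if s in (2, 4) else 1


def toy_first_harmful(_s: State, actions: List[Action]) -> Optional[int]:
    return next((i for i, a in enumerate(actions) if a < 0), None)


pairs, n_reviews = speculative_rollback_episode(
    0, toy_step, toy_agent, toy_first_harmful, lambda s: 1, lambda s: s >= 6
)
print(pairs, n_reviews)
```

The kept pairs play the role of next-action examples for supervised fine-tuning. In this toy run the teacher is consulted once per segment, which is the kind of query saving a fixed-horizon review aims for relative to step-level review.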
Results
On the WebArena-Infinity benchmark, SRC collected 977 verifier-passing trajectories and 9,183 next-action examples. Fixed-horizon review achieved a better trade-off between recovery and teacher queries than traditional step-level review, while also retaining diverse solution variants that passed the verifier.
Implications
The SRC framework has the potential to enhance the training of interactive agents in various applications, including web automation, GUI interaction, and other long-horizon decision-making tasks. Its ability to balance exploration and expert guidance may lead to more robust and adaptable agents capable of handling complex environments.
Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0
Generative Models
Efficient ML
- Identified inefficiencies in existing INT8 quantization for diffusion transformers on consumer GPUs.
- Developed a fused Triton INT8 GEMM kernel that effectively utilizes INT8 tensor cores.
- Achieved significant speed improvements, making INT8 the fastest variant for diffusion transformers.
- Demonstrated end-to-end performance gains without quality loss on consumer GPUs.
Summary
This paper addresses the inefficiencies in post-training INT8 quantization of diffusion transformers on consumer GPUs, particularly the NVIDIA RTX 3090. The authors identify that the current implementation of INT8 quantization is effectively a 'fake-quant' process, where weights and activations are quantized to INT8 but immediately dequantized back to bf16 for computation, thus failing to utilize the GPU's native INT8 tensor cores. To resolve this, they propose a new fused Triton INT8 GEMM kernel that performs the matrix multiplication directly on the INT8 tensor cores and folds dequantization and bias addition into the GEMM operation. The kernel runs 2.8–4.2 times faster than bf16 matrix multiplication. End to end, this yields a speedup of approximately 9–10% at 768px resolution and a generation time of 156.5 seconds per image at 1024px on a single RTX 3090, outperforming both FP8 and NF4 alternatives without compromising quality. The findings highlight that this optimization is specific to consumer Ampere GPUs, as the same kernel does not yield benefits on A100 or B200 architectures, where native bf16 and FP8 paths are faster.
Methodology
The authors developed a fused Triton INT8 GEMM kernel that performs int8×int8 to int32 accumulation directly on Ampere's tensor cores. This kernel integrates per-token and per-channel dequantization and bias addition into the GEMM epilogue, allowing for a single kernel launch instead of separate dequantization and matrix multiplication steps. The kernel was autotuned for different GEMM shapes to optimize performance.
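The arithmetic that such a kernel fuses can be written out as a plain-Python reference. This is only a numerical sketch of integer accumulation followed by a dequantize-and-bias epilogue; it is not the authors' Triton kernel and says nothing about speed. Symmetric quantization with a max-abs scale, the `[N][K]` weight layout, and the function names are assumptions.

```python
def quantize_rows(mat):
    """Symmetric INT8 quantization with one scale per row."""
    q_rows, scales = [], []
    for row in mat:
        scale = max(abs(v) for v in row) / 127 or 1.0  # avoid a zero scale
        scales.append(scale)
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
    return q_rows, scales


def int8_linear(x, w, bias):
    """Approximate y = x @ w.T + bias.

    x: [M][K] activations, quantized per token (row).
    w: [N][K] weights, quantized per output channel (row).
    bias: [N] floats.
    """
    xq, sx = quantize_rows(x)
    wq, sw = quantize_rows(w)
    out = []
    for i, x_row in enumerate(xq):
        out_row = []
        for j, w_row in enumerate(wq):
            # int8 x int8 products summed in an integer accumulator
            acc = sum(a * b for a, b in zip(x_row, w_row))
            # epilogue: per-token and per-channel dequantization, then bias
            out_row.append(acc * sx[i] * sw[j] + bias[j])
        out.append(out_row)
    return out


x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
w = [[1.0, 0.5, -0.5], [-2.0, 1.0, 0.25]]
bias = [0.1, -0.2]
y = int8_linear(x, w, bias)
y_ref = [
    [sum(a * b for a, b in zip(x_row, w_row)) + b_j for w_row, b_j in zip(w, bias)]
    for x_row in x
]
print(y)
print(y_ref)
```

In a fused kernel the epilogue line would run on the accumulator before the result is written back, so no separate dequantization pass over the output is needed.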
Results
The new kernel resulted in a speedup of 2.8–4.2 times compared to bf16 matrix multiplication and improved end-to-end generation times to 156.5 seconds per image at 1024px resolution, outperforming FP8 and NF4 alternatives. The quality metrics remained consistent with previous implementations, indicating no measurable quality loss.
Implications
This work has significant implications for optimizing diffusion transformers on consumer-grade GPUs, making high-resolution image generation more feasible and efficient. It also provides insights into the importance of utilizing hardware capabilities effectively, which could influence future developments in low-precision computing for machine learning.
Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios
Time Series
- Introduction of PowerPhase, a benchmark for probabilistic forecasting in power systems with up to 36,964 channels.
- Incorporation of voltage-safety evaluations into forecasting metrics, addressing the physical constraints of power systems.
- Identification of a safety-fidelity trade-off where models are ranked differently based on distributional accuracy and constraint satisfaction.
- Development of PowerForge, a scenario-based quantile forecaster tailored for high-dimensional power system forecasting.
Summary
This paper addresses the challenges of probabilistic forecasting in power systems, particularly focusing on the need for models that respect physical constraints while maintaining distributional accuracy. The authors introduce PowerPhase, a novel probabilistic forecasting benchmark that encompasses six transmission grids with up to 36,964 channels, significantly surpassing existing benchmarks that typically cap at 2,000 channels. PowerPhase incorporates voltage-safety evaluations and provides a suite of constraint-aware metrics to assess model performance beyond traditional scoring rules. The study reveals a trade-off between distributional accuracy and constraint satisfaction, termed safety-fidelity. To tackle the modeling challenges, the authors propose PowerForge, a scenario-based quantile forecaster designed to efficiently handle the high dimensionality of power systems. PowerForge employs architectural innovations such as anchor-based residual representations and type-aware decoding heads, and it ranks best on average against the baseline models across grid sizes.
Methodology
The authors developed PowerPhase as a benchmarking framework for probabilistic forecasting in power systems, generating target trajectories through AC power-flow solutions. They introduced a suite of constraint-aware metrics to evaluate models on both distributional accuracy and safety. PowerForge, the proposed forecasting model, utilizes a scenario-based approach with architectural features that account for the unique characteristics of power system data, including causal relationships among variables.
Results
The evaluation of PowerForge against eight baseline models across five grid sizes and three random seeds showed that it consistently achieved the best average rank. The results highlighted the importance of considering both distributional accuracy and constraint satisfaction, revealing that models could perform differently when assessed on these criteria.
Implications
The findings suggest that probabilistic forecasting in power systems can be significantly improved by integrating physical constraints into model evaluation. The introduction of PowerPhase and PowerForge could enhance decision-making processes in grid operations, particularly as the integration of renewable energy sources increases the complexity and uncertainty in power systems.
A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction
Graph Learning
- AlignGAD is a zero-shot GAD framework that generalizes to unseen graphs without fine-tuning.
- The framework reduces dependence on domain-specific semantics by aligning graph features and spectral distributions.
- It introduces a cluster-aware discrepancy scoring strategy that captures both individual node deviations and group-level abnormal patterns.
- Extensive experiments validate the effectiveness of AlignGAD across various real-world datasets.
Summary
The paper presents AlignGAD, a novel zero-shot generalized graph anomaly detection (GAD) framework designed to identify abnormal nodes in unseen target graphs. Traditional GAD methods often rely on dataset-specific features and structures, limiting their generalizability across different domains. AlignGAD addresses this limitation through three main components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that creates cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancies and aggregates anomaly evidence from various graph views. The framework is evaluated on multiple real-world datasets, demonstrating its effectiveness in the zero-shot GAD setting, thereby providing a practical solution for detecting anomalies across heterogeneous graphs without the need for fine-tuning.
Methodology
AlignGAD employs a three-component architecture: (1) a Global Unification Module for aligning node features and normalizing graph signals, (2) a Clustering Module for constructing cluster-aware graph views, and (3) a Node Discrepancy Scoring Module that reconstructs node features and measures discrepancies to generate anomaly scores. This approach allows for the integration of both node-level and cluster-level information in anomaly detection.
Results
The experiments conducted on multiple real-world datasets show that AlignGAD effectively identifies anomalies in unseen graphs, outperforming existing methods in the zero-shot GAD setting. The results indicate that the framework can generalize well across different domains, providing robust anomaly detection capabilities.
Implications
AlignGAD has significant implications for real-world applications in areas such as cybersecurity, social network analysis, and intelligent surveillance, where detecting anomalies in heterogeneous graph data is crucial. Its ability to generalize across domains without requiring extensive retraining makes it a valuable tool for practitioners in these fields.
Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit
Theory
- The gap between formal verification and mathematical value is a significant constraint in AI-generated mathematics.
- Sound coverage is achievable only with a verifier, which can assert valid statements while covering unseen valuable ones.
- A phase transition occurs in trivia generation: with finitely many trivial outputs, coverage is capped at a bounded optimum, while an infinite stream of trivia raises coverage significantly.
- Generating trivial statements is proven necessary for accessing valuable mathematical insights.
Summary
This paper addresses the challenge of generating valuable mathematics using AI systems linked to proof assistants. The authors propose a model of nested language generation in the limit, where a formal language (F) is verified by a membership oracle (proof checker) and contains an unknown valuable language (H). The study identifies a core (C) of literature that influences the generation process. The authors explore four critical questions: the relationship between verification and taste, the concept of sound coverage, the phase transition in trivia generation, and the necessity of generating trivial statements to uncover valuable mathematics. They conclude that while a perfect verifier cannot replace the need for discernment in mathematical discovery, a continuous stream of trivial outputs is essential for accessing unrecorded valuable mathematics. The 'flood' refers to this trivial output stream, while the 'harvest' signifies the valuable insights gained from it.
Methodology
The authors develop a theoretical framework for nested language generation in the limit, utilizing a membership oracle to model the relationship between formal verification and the generation of valuable mathematical statements. They analyze the implications of trivia generation through mathematical proofs and case studies.
Results
The paper establishes that verification does not equate to taste, sound coverage is contingent on the verifier, and a clear dichotomy exists in trivia generation that affects the coverage of valuable mathematics. The findings indicate that an infinite stream of trivial statements is necessary to uncover valuable insights, highlighting the importance of trivia in mathematical discovery.
Implications
The findings suggest that AI systems in mathematical discovery must balance between generating valid statements and ensuring the value of those statements. This has implications for the design of future AI systems that aim to assist in mathematical research and discovery, emphasizing the need for integrating discernment mechanisms alongside verification.
Provably Safe, Yet Scalable Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduction of the PS2-RL framework for safe reinforcement learning.
- Utilization of a safe-arrival value function to train a backup policy that generates implicit control-invariant sets.
- Implementation of a control-invariant layer for end-to-end training of safe RL policies.
- The framework achieves formal safety guarantees without excessive conservatism.
Summary
The paper introduces the PS2-RL framework, a novel approach to safe reinforcement learning (RL) that aims to optimize rewards while ensuring safety constraints are met. Traditional methods often lack formal safety guarantees or become overly conservative due to the reliance on explicit certificate functions for control-invariant sets. PS2-RL addresses these limitations through a two-phase architecture. In the first phase, a backup policy is trained using a safe-arrival value function, which helps in generating an implicit control-invariant set online. The second phase involves training an RL policy that incorporates safety guarantees from the learned backup policy through a differentiable projection layer. This method allows for scalability to high-dimensional systems without the need for explicit invariant set synthesis. The authors provide theoretical guarantees for the framework and demonstrate its effectiveness on robotic control tasks, achieving 100% safety during both training and deployment while outperforming existing methods.
Methodology
The PS2-RL framework consists of two phases: Phase I involves training a backup policy using a novel safe-arrival value function that characterizes optimal behavior for reaching a target set safely. Phase II trains the final RL policy through a differentiable projection layer that enforces safety constraints derived from the backup policy. This approach allows for the implicit generation of control-invariant sets without the need for explicit synthesis.
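To make the idea of an implicit control-invariant set concrete, the sketch below uses a simple switching safety filter: a proposed action is accepted only if the backup policy can keep the system safe from the resulting state, and otherwise the backup action is applied. This is a swapped-in, non-differentiable stand-in for the paper's differentiable projection layer, and the braking policy here is hand-written rather than learned from a safe-arrival value function. The 1-D double-integrator dynamics, the limits, and all function names are assumptions.

```python
def step(state, accel, dt=0.1):
    """Double integrator: position is updated with the current velocity."""
    x, v = state
    return (x + v * dt, v + accel * dt)


def backup_policy(state, a_max=1.0, dt=0.1):
    """Hand-written backup: brake toward zero velocity without overshooting."""
    _, v = state
    return max(-a_max, min(a_max, -v / dt))


def is_safe(state, x_limit=1.0):
    return abs(state[0]) <= x_limit


def backup_keeps_safe(state, horizon=50):
    """Membership test for the implicit safe set: roll the backup policy forward."""
    for _ in range(horizon):
        if not is_safe(state):
            return False
        state = step(state, backup_policy(state))
    return is_safe(state)


def safety_filter(state, proposed_accel):
    """Accept the proposal only if the backup policy can recover afterwards."""
    if backup_keeps_safe(step(state, proposed_accel)):
        return proposed_accel
    return backup_policy(state)


# A reckless "policy" that always accelerates toward the wall at x = 1.
state, visited, interventions = (0.0, 0.0), [], 0
for _ in range(200):
    proposed = 1.0
    applied = safety_filter(state, proposed)
    interventions += applied != proposed
    state = step(state, applied)
    visited.append(state)
print(max(x for x, _ in visited), interventions)
```

The safe set is never written down explicitly; it is whatever set of states passes the rollout check, which is the sense in which the backup policy generates it implicitly.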
Results
The PS2-RL framework was evaluated on robotic control tasks, including unicycle lane-keeping and powerloop tracking for a 10-dimensional quadrotor. PS2-RL achieved 100% safety during both training and deployment while significantly outperforming all baseline methods.
Implications
The PS2-RL framework has the potential to enhance the deployment of reinforcement learning in real-world applications, particularly in robotics, where safety is critical. Its scalability and formal safety guarantees could lead to broader acceptance of RL technologies in safety-sensitive environments.
When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals
Interpretability
- Introduces a routing-ablation framework for analyzing Block AttnRes models.
- Demonstrates that explicit depth routing does not guarantee mechanistic interpretability.
- Identifies three distinct causal motifs in the Block AttnRes model.
- Finds a dissociation between routing mass and causal importance in the model.
Summary
This paper investigates the interpretability of routing mechanisms in Block Attention Residuals (Block AttnRes), a model that enhances the traditional additive residuals in Transformers by introducing a learned softmax over earlier depth-source representations. The study aims to determine whether the explicit exposure of routing weights during the forward pass is sufficient for mechanistic interpretation. Two Block AttnRes checkpoints are analyzed: a vanilla Qwen3 model wrapped with a deterministic recency-bias schedule and a Block AttnRes Qwen3 trained from scratch. The findings reveal that while the baseline model's routing weights are content-independent and align with the recency-bias schedule, the trained Block AttnRes model exhibits distinct localized routing motifs that reflect different pathways of information flow. Notably, the study uncovers a dissociation between average routing mass and causal importance, indicating that high routing mass does not necessarily correlate with significant causal contributions. The authors conclude that while architectural exposure of routing is necessary, it is not sufficient for mechanistic interpretation, and any descriptive routing summaries should be treated as hypotheses requiring further causal testing.
Methodology
The study employs a routing-ablation framework to mask mutually exclusive source families and renormalize the remaining routing weights. It compares two models under identical routing-ablation interventions to assess the interpretability of routing weights. The analysis includes a vanilla Qwen3 model wrapped with a deterministic recency-bias schedule and a Block AttnRes model trained from scratch.
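The mask-and-renormalize step can be sketched on a single routing distribution. The family labels and function names below are hypothetical, and the sketch covers only the intervention on the weights, not the downstream measurement of its causal effect.

```python
def ablate_and_renormalize(weights, families, masked_family):
    """Zero the routing weights of one source family and rescale the rest to sum to 1."""
    kept = [0.0 if fam == masked_family else w for w, fam in zip(weights, families)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("cannot renormalize: every source was masked")
    return [w / total for w in kept]


def family_mass(weights, families):
    """Total routing mass per source family (a descriptive summary only)."""
    mass = {}
    for w, fam in zip(weights, families):
        mass[fam] = mass.get(fam, 0.0) + w
    return mass


# One softmax routing distribution over three depth sources (hypothetical labels).
weights = [0.5, 0.3, 0.2]
families = ["embedding", "recent_block", "recent_block"]

ablated = ablate_and_renormalize(weights, families, "embedding")
print(family_mass(weights, families))
print(ablated)
```

Routing mass is read directly off the weights, whereas causal importance has to be measured as the change in model behavior after the ablation, which is why the two quantities can diverge.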
Results
The results indicate that the baseline model's routing weights reproduce the analytic predictions of the recency-bias schedule, while the Block AttnRes model reveals three localized routing motifs. Additionally, the study finds that the largest routing mass does not correspond to the largest causal contribution, highlighting a significant dissociation between routing mass and causal importance.
Implications
The findings suggest that while the architectural design of routing in Block AttnRes models allows for greater interpretability, it does not inherently provide mechanistic insights. This has implications for the development of interpretable AI systems, emphasizing the need for causal testing of routing mechanisms rather than relying solely on architectural exposure.
DIFF-ERO: A Conformance-Aware Loss for Deep Learning in Process Mining
Theory
Optimization
Time Series
- DIFF-ERO is a differentiable loss function that incorporates control-flow conformance into deep learning training.
- The loss function improves predictive performance in process mining tasks, particularly in maintaining structural fidelity.
- The methodology allows for batch-level supervision of conformance, enhancing the training signal during backpropagation.
- The empirical results indicate that models trained with DIFF-ERO converge towards the structural ground truth of process models.
Summary
The paper introduces DIFF-ERO, a novel conformance-aware loss function designed for deep learning applications in process mining. Traditional loss functions like cross-entropy optimize local next-step predictions but fail to capture the global control-flow structure of processes, leading to models that may perform well on individual predictions yet exhibit imprecise overall behavior. DIFF-ERO addresses this limitation by integrating control-flow information directly into the training process through a differentiable formulation of entropy-based stochastic conformance. This approach constructs batch-level stochastic transition matrices with soft edge memberships, allowing for structural precision and recall signals to inform backpropagation. The authors demonstrate DIFF-ERO's effectiveness by implementing it in transformer encoder-decoder architectures for next-activity prediction, showing that it enhances predictive performance in scenarios where structural fidelity is crucial while maintaining comparable performance in other contexts. The paper also provides a theoretical analysis of the loss function's convergence properties and showcases its integration with cross-entropy to balance local accuracy with global structural fidelity.
Methodology
The authors developed DIFF-ERO as a differentiable loss function that utilizes entropy-based stochastic conformance metrics. It constructs stochastic transition matrices from mini-batches of training data, allowing for the integration of control-flow information into the training process. The loss function is implemented within transformer encoder-decoder architectures for next-activity prediction, and is analyzed theoretically for convergence properties.
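As a simplified picture of a batch-level stochastic transition matrix, the sketch below counts directly-follows transitions in a mini-batch of traces, row-normalizes them, and reports the entropy of each row. It uses hard counts instead of the soft edge memberships described above, is not differentiable, and does not compute the DIFF-ERO loss itself; the function names and toy traces are assumptions.

```python
import math
from collections import defaultdict


def transition_matrix(traces):
    """Row-stochastic directly-follows matrix estimated from a batch of traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(row.values()) for dst, c in row.items()}
        for src, row in counts.items()
    }


def row_entropy(row):
    """Shannon entropy (bits) of the next-activity distribution for one activity."""
    return -sum(p * math.log2(p) for p in row.values() if p > 0)


# Toy mini-batch of process traces (activity labels are made up).
batch = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
matrix = transition_matrix(batch)
for activity, row in matrix.items():
    print(activity, row, round(row_entropy(row), 3))
```

Comparing such a matrix built from predicted traces with one built from the event log is one way to see how a structural signal differs from per-step cross-entropy.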
Results
The empirical evaluation demonstrated that DIFF-ERO outperforms traditional loss functions in scenarios where structural accuracy is critical, while still achieving comparable performance in other cases. The models trained with DIFF-ERO showed improved convergence towards the structural ground truth of process models, indicating effective internalization of control-flow structures.
Implications
DIFF-ERO has significant implications for enhancing the accuracy and reliability of predictive models in process mining, particularly in applications where understanding and adhering to process structures is essential. This approach can be beneficial in various domains such as business process management, workflow optimization, and predictive analytics.
CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation
Large Language Models
Optimization
- CARE provides a structured framework for integrating LLMs into scientific experimentation while ensuring safety through auditing.
- The Public-Evidence Intervention Gate allows for the evaluation of LLM proposals against empirical evidence before execution.
- CARE outperforms traditional optimization methods on benchmark datasets, demonstrating the potential of LLMs in high-throughput experimentation (HTE).
- The framework emphasizes the importance of maintaining a non-LLM optimizer as the default decision-maker.
Summary
The paper introduces CARE, a novel framework designed to enhance the safety and reliability of large language models (LLMs) in high-throughput experimentation (HTE) by implementing an auditable control mechanism. Letting LLMs dictate experimental decisions directly can lead to unsafe outcomes, so CARE keeps a non-LLM optimizer as the primary decision-maker while permitting LLMs to propose alternative actions. The framework employs a Public-Evidence Intervention Gate that evaluates the proposed actions against existing evidence before authorizing any changes, ensuring that only well-supported decisions are executed. This hybrid approach combines the innovative capabilities of LLMs with the reliability of traditional optimization methods. The authors validate CARE through extensive experiments on the Minerva and ChemLex benchmarks, demonstrating significant improvements in performance metrics compared to existing methods. The results indicate that LLMs can effectively contribute to HTE when their proposals are subjected to rigorous auditing, thus enhancing the overall optimization process without compromising safety.
Methodology
The methodology involves modeling high-throughput experimentation as a finite-pool sequential interaction problem. CARE separates the proposal generation by LLMs from the decision-making authority, where the incumbent optimizer acts as the default choice. The Public-Evidence Intervention Gate assesses LLM-generated proposals against public evidence before authorizing any changes, ensuring a systematic audit trail for all decisions made.
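A minimal sketch of the gating logic described above, assuming a scalar evidence score and a fixed acceptance threshold. The names `Proposal`, `AuditRecord`, and `gate`, the `support` field, and the threshold rule are hypothetical illustrations and not the paper's implementation. The only properties taken from the summary are that the incumbent optimizer is the default and that every decision leaves an audit record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proposal:
    candidate: int   # index into the finite experiment pool
    support: float   # hypothetical evidence score in [0, 1]
    rationale: str   # free-text justification kept for the audit trail


@dataclass
class AuditRecord:
    step: int
    incumbent_choice: int
    proposal: Optional[Proposal]
    accepted: bool


def gate(step: int, incumbent_choice: int, proposal: Optional[Proposal],
         threshold: float, log: List[AuditRecord]) -> int:
    """Return the candidate to run next.

    The incumbent's choice is the default. The LLM proposal replaces it
    only when its evidence score clears the (assumed) threshold.
    """
    accepted = proposal is not None and proposal.support >= threshold
    log.append(AuditRecord(step, incumbent_choice, proposal, accepted))
    return proposal.candidate if accepted else incumbent_choice


audit_log: List[AuditRecord] = []
chosen = gate(0, incumbent_choice=3,
              proposal=Proposal(7, 0.9, "similar conditions scored well"),
              threshold=0.8, log=audit_log)
```

Keeping the log append inside `gate` means a decision cannot be made without leaving a record, whether or not the proposal is accepted.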
Results
CARE achieved the highest mean final-best value and best-so-far AUC on both Minerva and ChemLex, improving over the public incumbent from 80.0 to 88.5 on Minerva and from 83.9 to 92.1 on ChemLex. The results indicate that the hybrid approach makes LLM contributions to HTE more reliable and effective.
Implications
The implications of this research extend to various scientific fields where high-throughput experimentation is critical. By ensuring that LLMs can contribute to decision-making processes without compromising safety, CARE could facilitate more efficient and innovative experimental designs, potentially accelerating discoveries in chemistry and other domains reliant on experimental data.
Is Spurious Correlation Removal Always Learnable?
Theory
- Invariant learning can fail even with identifiable invariant structures due to computational barriers.
- A separation parameter γ quantifies environment diversity and its impact on identifiability and sample complexity.
- Polynomial-time recovery algorithms may not achieve optimal rates under certain conditions, highlighting a gap between computational and statistical learnability.
- Experiments validate theoretical predictions regarding sample size and performance gaps in spurious correlation removal.
Summary
This paper investigates the learnability of spurious correlation removal in machine learning, particularly in the context of invariant learning. The authors establish a computational barrier indicating that even when the invariant structure is statistically identifiable, it may not be learnable within polynomial time constraints. They introduce a black-box samplable supervised sparse recovery primitive to construct multi-environment instances where the predictive invariant direction can be identified but requires exhaustive search for recovery. The study emphasizes the importance of environment diversity, quantified by a separation parameter γ, which influences identifiability and sample complexity. The authors derive theoretical results regarding the minimax risk and sample size transitions based on the diversity of environments. Through experiments on synthetic and real datasets, they illustrate the predicted gaps in performance and provide diagnostics for assessing diversity in environments. The findings suggest that a few diverse environments can be more informative than many similar ones, challenging existing assumptions about the quantity of data needed for effective learning.
Methodology
The authors model spurious correlation removal as invariant subspace recovery from multi-environment data, focusing on a linear-Gaussian setting. They construct instances using a black-box samplable supervised sparse recovery primitive and analyze the implications of environment diversity through theoretical derivations and empirical experiments.
Results
The study establishes that certain multi-environment instances allow recovery from polynomially many samples via exhaustive search, whereas any polynomial-time constant-accuracy recovery algorithm would contradict the black-box primitive. The required sample size decreases as environment diversity increases, and the critical sample size at which the transition occurs scales as 1/γ². Experiments confirm the theoretical predictions regarding compute-sample trade-offs and the importance of environment diversity.
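The 1/γ² scaling can be made concrete with a small calculation. The sketch below assumes the critical sample size has the form c/γ² for some constant `c`. The constant and the function name are hypothetical, since the summary states only the proportionality.

```python
def critical_sample_size(gamma: float, c: float = 1.0) -> float:
    """Critical sample size under the assumed form n = c / gamma**2.

    gamma is the environment-separation parameter. The constant c is a
    placeholder, not a value reported in the paper.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return c / gamma ** 2


# Halving the separation quadruples the required sample size.
ratio = critical_sample_size(0.25) / critical_sample_size(0.5)
```

Under this form, environments that are twice as well separated need roughly a quarter of the samples, which is consistent with the claim that a few diverse environments can be more informative than many similar ones.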
Implications
The findings have significant implications for the design of machine learning models that aim to generalize across different environments. Understanding the role of environment diversity can inform strategies for data collection and model training, potentially leading to more robust models that are less susceptible to spurious correlations.
Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport
Optimization
Interpretability
- Introduces two inverse optimal transport models for estimating urban access costs.
- Demonstrates the application of the framework on large-scale school choice data in the Philippines.
- Estimates a subsidy-equivalent distance metric to inform subsidy calibration and facility placement.
- Highlights the spatial footprint of subsidies and their varying effectiveness based on geographic factors.
Summary
This paper addresses the challenge of understanding urban access costs in mixed public-private service networks, specifically focusing on school choice in the Philippines. The study utilizes inverse optimal transport (OT) to recover latent cost functions that influence household decisions regarding school enrollment. By analyzing school-to-school enrollment flows, the author introduces two complementary inverse OT models: a distance-banded piecewise model with explicit subsidy terms and a neural cost model trained using a differentiable Sinkhorn forward pass. The framework is applied to a substantial dataset comprising 283,016 learner trips across 23,820 observed flows. The results yield a subsidy-equivalent distance metric, which quantifies the perceived travel cost offset by subsidies, providing insights into how subsidies can effectively redirect learners from congested public schools to private institutions. The findings highlight the importance of spatial considerations in subsidy design and urban service allocation, demonstrating that the impact of subsidies varies based on their geographic context and the distance sensitivity of households.
Methodology
The study employs inverse optimal transport to estimate cost functions from observed origin-destination flows. It utilizes two models: a piecewise distance-banded model with subsidy terms and a neural cost model optimized through a differentiable Sinkhorn operator. The analysis is based on a large dataset of school enrollment flows, linking observed data to road-network distances and subsidy amounts.
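For readers unfamiliar with the Sinkhorn operator mentioned above, here is a minimal sketch of the forward pass only: given a cost matrix and origin/destination marginals, it returns an entropically regularized transport plan. The function name, the regularization strength `eps`, and the toy inputs are assumptions for illustration. The inverse step, where the cost is fitted so that the plan matches observed flows, is not shown.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], a: List[float], b: List[float],
             eps: float = 0.5, iters: int = 500) -> List[List[float]]:
    """Entropic optimal transport plan with row sums a and column sums b."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheaper origin-destination pairs get larger weight.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two origins, two destinations, cheaper to stay "local".
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], a=[0.6, 0.4], b=[0.5, 0.5])
```

Every operation in the loop is differentiable with respect to the cost entries, which is what allows a neural cost model to be trained through it in an autodiff framework.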
Results
The piecewise model achieved a reduction in the cost function error from 5.00 to 3.29, while the neural model further improved the fit, achieving an error of 2.84. The estimated subsidy-equivalent distances indicate that a 1,000-peso increase in subsidy offsets approximately 5.35 km, 6.07 km, and 1.00 km of perceived travel cost across different distance bands. The neural model revealed a nuanced interaction between distance and subsidy, suggesting that the effects of subsidies are more complex than previously understood.
Implications
The findings suggest that urban planners should consider the spatial distribution of subsidized facilities and the distance sensitivity of households when designing subsidy programs. The framework can be applied to various urban services beyond education, including healthcare and transportation, to enhance accessibility and service allocation.
Towards More General Control of Diffusion Models Using Jeffrey Guidance
Generative Models
Computer Vision
Theory
- Introduction of Jeffrey guidance for diffusion models, extending control capabilities beyond standard methods.
- Demonstrated significant improvements in FID scores when matching output distributions to target embeddings.
- Successfully applied Jeffrey guidance to achieve fairness in image generation by decorrelating attributes.
- Provides a principled framework for updating distributions with minimal perturbation to the original model.
Summary
This paper introduces Jeffrey guidance, a novel framework for enhancing the control of diffusion models during sampling. Traditional guidance methods often rely on heuristic approaches that do not explicitly define target distributions. Jeffrey guidance leverages Jeffrey's rule of conditioning to update marginal distributions towards a specified target while minimally altering the joint distribution. The authors demonstrate this approach by targeting a prescribed embedding distribution, achieving significant reductions in Fréchet Inception Distance (FID) on CIFAR-10 and FFHQ datasets. Additionally, they apply Jeffrey guidance to enforce fairness in the CelebA-HQ dataset, successfully decorrelating attributes such as gender and age. This work not only generalizes existing classifier guidance techniques but also opens new avenues for controlling diffusion models in more complex scenarios.
Methodology
The authors utilize Jeffrey's rule of conditioning to formulate a guidance method that updates the marginal distributions of diffusion models towards a target distribution. This approach is implemented during the sampling process, allowing for real-time adjustments without the need for retraining the model. The methodology includes targeting specific embedding distributions and applying fairness constraints to the generated outputs.
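Jeffrey's rule itself is easy to state on a discrete joint distribution: the marginal of one variable is replaced by the target, while the conditional of the other variable given it is left untouched. The sketch below shows the rule on a small table. It is not the paper's diffusion-sampling procedure, and the function name and the example numbers are illustrative.

```python
from typing import List


def jeffrey_update(joint: List[List[float]],
                   target_marginal: List[float]) -> List[List[float]]:
    """Apply Jeffrey's rule to joint[x][y] with a new marginal over y.

    New joint: p'(x, y) = p(x | y) * q(y) = p(x, y) * q(y) / p(y).
    """
    n_y = len(target_marginal)
    p_y = [sum(row[y] for row in joint) for y in range(n_y)]
    if any(p == 0.0 and q > 0.0 for p, q in zip(p_y, target_marginal)):
        raise ValueError("target puts mass where the original marginal is zero")
    return [
        [row[y] * target_marginal[y] / p_y[y] if p_y[y] > 0.0 else 0.0
         for y in range(n_y)]
        for row in joint
    ]


# Original marginal over y is (0.5, 0.5); move it to (0.8, 0.2).
joint = [[0.3, 0.1],
         [0.2, 0.4]]
updated = jeffrey_update(joint, [0.8, 0.2])
```

In the example, p(x=0 | y=0) is 0.6 both before and after the update, which is the sense in which the joint distribution is changed as little as possible.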
Results
The application of Jeffrey guidance resulted in substantial reductions in FID scores on CIFAR-10 and FFHQ datasets, indicating improved image quality. In the fairness application on CelebA-HQ, the method successfully decorrelated attributes, demonstrating its effectiveness in achieving gender parity and reducing bias in generated images.
Implications
Jeffrey guidance has the potential to enhance the flexibility and control of diffusion models in various applications, including image generation, text-to-image synthesis, and fairness in AI systems. This framework can lead to more interpretable and ethically aligned generative models, addressing concerns related to bias and representation in AI-generated content.
Uncertainty Estimation and Generalization Bounds for Modern Deep Learning
Theory
- Introduction of the Deep Variational Implicit Process (DVIP) for scalable Bayesian modeling.
- Development of two methods (VaLLA and FMGP) for calibrating uncertainty in deterministic networks.
- Unified probabilistic framework explaining generalization through diversity, smoothness, and stochasticity.
- Insights into double-descent behavior and the role of SGD as implicit regularization.
Summary
This dissertation explores the intersection of Bayesian principles and modern deep learning, focusing on uncertainty estimation and generalization bounds. It introduces the Deep Variational Implicit Process (DVIP), a scalable Bayesian framework that extends implicit processes to deep architectures, allowing for efficient variational inference and expressive non-Gaussian priors. Additionally, two post-hoc methods, Variational Linearized Laplace Approximation (VaLLA) and Fixed-Mean Gaussian Process (FMGP), are proposed to enhance uncertainty calibration in pretrained deterministic networks. The theoretical contributions address the question of why large, over-parameterized neural networks generalize well, developing a unified probabilistic framework that connects diversity, smoothness, and stochasticity. This framework formalizes how ensemble diversity reduces generalization error and interprets smoothness in the loss landscape as a factor in empirical loss concentration. The PAC-Chernoff bounds derived provide insights into double-descent behavior and analyze stochastic gradient descent (SGD) as a form of implicit regularization. Overall, the work presents practical tools for uncertainty estimation and theoretical insights into the probabilistic structure of deep learning, advocating for a Bayesian perspective in designing learning systems.
Methodology
The dissertation employs a combination of Bayesian inference, function-space modeling, and large-deviation theory. It introduces DVIP for scalable Bayesian modeling, along with VaLLA and FMGP for uncertainty calibration. The theoretical framework connects ensemble diversity, smoothness, and stochasticity to generalization performance.
Results
The DVIP framework achieves competitive performance with deep Gaussian processes at lower computational costs. The VaLLA and FMGP methods provide well-calibrated uncertainty estimates on large-scale tasks. The theoretical framework offers a coherent explanation for generalization in neural networks, including insights into double-descent behavior and the effects of stochastic optimization.
Implications
The findings suggest that integrating Bayesian principles into deep learning can enhance model reliability and interpretability, particularly in applications requiring uncertainty quantification. This work could influence future research directions in Bayesian deep learning and its applications across various domains.
The Weight Norm Sets the Grokking Timescale: A Causal Delay Law
Theory
Interpretability
- Weight norm causally controls the timescale of grokking in neural networks.
- A matched-counterfactual clamp shows that grokking can occur at any norm, with an exponential delay law governing the timescale.
- The study establishes a scaling law with a shared exponent across different tasks, indicating the weight norm's dominant role in grokking timescale.
- Normalization techniques like LayerNorm affect the relationship between weight norm and function, altering the delay law.
Summary
This paper investigates the relationship between weight norm and the timescale of grokking in neural networks, aiming to reconcile conflicting theories regarding the role of weight norm in generalization. The authors demonstrate that the weight norm causally influences the grokking timescale, rather than acting as a strict threshold. Through experiments on modular arithmetic tasks, they find that networks first memorize by increasing weight norm and then generalize as the norm stabilizes at a concentrated value, denoted as ‖𝑊‖𝑐. A matched-counterfactual clamp experiment shows that grokking occurs at any norm held during training, with the time to grok following an exponential delay law. This indicates that while the norm does not prevent grokking, holding it above ‖𝑊‖𝑐 delays the process. The authors establish a scaling law with a shared exponent across different tasks, suggesting that the weight norm is a dominant factor in grokking timescale, significantly more influential than learning rate. They also explore the effects of normalization techniques like LayerNorm and find that the delay law is consistent across different architectures, indicating a broader applicability of their findings. The paper concludes with a discussion on the implications of these results for understanding neural network generalization and the interplay between weight norms and circuit formation.
Methodology
The authors conducted experiments on neural networks trained on modular arithmetic tasks, measuring weight norms during training and employing a matched-counterfactual clamp to hold the weight norm at specific values throughout the training process. They analyzed the relationship between weight norm and grokking timescale using statistical methods to establish causal relationships and scaling laws.
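One simple way to hold the weight norm at a chosen value is to rescale the parameters after every optimizer step. The sketch below assumes a single global L2 norm over all parameters and plain SGD. The paper's matched-counterfactual clamp may differ (for example, per-layer norms or a different optimizer), and all function names here are hypothetical.

```python
import math
from typing import List

Params = List[List[float]]  # one flat list of weights per layer


def global_norm(params: Params) -> float:
    """L2 norm over all weights in all layers."""
    return math.sqrt(sum(w * w for layer in params for w in layer))


def clamp_norm(params: Params, target: float) -> Params:
    """Rescale all weights so the global norm equals target."""
    norm = global_norm(params)
    if norm == 0.0:
        return params  # nothing to rescale
    scale = target / norm
    return [[w * scale for w in layer] for layer in params]


def sgd_step_with_clamp(params: Params, grads: Params,
                        lr: float, target: float) -> Params:
    """Plain SGD update followed by projection back to the target norm."""
    updated = [[w - lr * g for w, g in zip(lw, lg)]
               for lw, lg in zip(params, grads)]
    return clamp_norm(updated, target)


clamped = sgd_step_with_clamp([[3.0, 4.0]], [[0.0, 0.0]], lr=0.1, target=2.0)
```

Rescaling changes only the length of the weight vector and not its direction, so the clamp isolates the effect of the norm from the effect of where the weights point.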
Results
The results indicate that grokking occurs at whatever norm is held during training, rather than only once the weight norm reaches the concentrated value ‖𝑊‖𝑐, with the time to grok following an exponential delay law. The shared exponent across different tasks supports a scaling law, and the findings reveal that the weight norm is a more significant factor in grokking timescale than learning rate. Normalization techniques such as LayerNorm were also observed to change the relationship between weight norm and function, which alters the delay law.
Implications
These findings have implications for the design and training of neural networks, particularly in understanding how weight norms influence generalization and the dynamics of learning. The results could inform strategies for optimizing training processes and improving model performance on various tasks.
CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees
Efficient ML
Interpretability
Optimization
- CLARITree offers a near-optimal algorithm for sparse piecewise linear regression trees.
- The method integrates lookahead-style split optimization with efficient Cholesky updates.
- Empirical results show significant improvements in accuracy and scalability over greedy baselines.
- The algorithm is designed to handle continuous features effectively.
Summary
The paper introduces CLARITree, a novel algorithm designed for constructing interpretable piecewise linear regression trees. Traditional methods for building regression trees often rely on greedy induction, which can lead to suboptimal performance. While optimal methods exist, they are computationally expensive and not scalable for general linear regression trees. CLARITree combines a lookahead search strategy with efficient rank-one Cholesky updates of the Gram matrix to achieve a favorable balance between computational efficiency, predictive accuracy, and sparsity. The authors demonstrate both theoretically and empirically that CLARITree outperforms greedy approaches and scales better than existing optimal methods, making it suitable for small to large-scale datasets. The algorithm maintains numerical stability and efficiency during split evaluations, allowing for fast and exact assessments of candidate splits without the need for repeated refitting.
Methodology
The authors developed CLARITree by combining a lookahead search strategy with rank-one Cholesky updates of the Gram matrix. This approach allows for efficient evaluation of candidate splits while maintaining numerical stability. The algorithm is designed to optimize splits globally up to a certain depth before applying greedy induction, thus enhancing both performance and scalability.
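To illustrate why rank-one Cholesky updates avoid refitting: when one sample with feature vector x is added to a leaf, the Gram matrix changes from A to A + x xᵀ, and the Cholesky factor can be updated in O(n²) instead of being recomputed in O(n³). The sketch below is the standard textbook update for a lower-triangular factor. It is not CLARITree's code, and the function name is illustrative.

```python
import math
from typing import List


def chol_update(L: List[List[float]], x: List[float]) -> List[List[float]]:
    """Given lower-triangular L with A = L L^T, return the factor of A + x x^T."""
    n = len(x)
    L = [row[:] for row in L]  # work on copies, leave inputs untouched
    x = list(x)
    for k in range(n):
        r = math.hypot(L[k][k], x[k])
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            # Rotate column k of L and the remainder of x together.
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L


# Start from A = I and add the sample x = (1, 1): A becomes [[2, 1], [1, 2]].
L_new = chol_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Removing a sample corresponds to a downdate (A minus x xᵀ), which follows the same pattern with a sign change and is less numerically forgiving; it is omitted here.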
Results
CLARITree consistently outperformed greedy regression trees, achieving near-optimal accuracy on small and medium datasets while demonstrating significant scalability on larger datasets. The empirical results indicated a mean squared error (MSE) of 4.03 and an R-squared value of 0.97 for CLARITree, compared to 15.41 MSE and 0.88 R-squared for greedy trees.
Implications
The development of CLARITree has potential implications for various applications requiring interpretable machine learning models, particularly in fields where understanding model decisions is crucial. Its efficiency and scalability make it suitable for real-world datasets where traditional methods may falter.
Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability
Theory
Optimization
Time Series
- PINNs can effectively model chemotherapy pharmacokinetics, providing insights into unobservable tissue drug concentrations.
- The PINN approach matches the performance of traditional NLS estimators while also revealing parameter identifiability issues.
- In cases where traditional methods fail, PINNs can still converge to meaningful solutions, demonstrating their robustness.
- Sparse observations can significantly enhance the identifiability of parameters in complex pharmacokinetic models.
Summary
This paper explores the application of Physics-Informed Neural Networks (PINNs) in the context of chemotherapy pharmacokinetics (PK), where drug concentrations in plasma are measurable but tissue concentrations, crucial for understanding tumor efficacy and toxicity, are not. The authors benchmark a PINN against traditional nonlinear least-squares (NLS) estimators and a data-only multilayer perceptron (MLP) on two PK problems. In a linear two-compartment model, the PINN performs comparably to NLS while also estimating tissue concentration in a single training pass, outperforming the MLP significantly. In a more complex Michaelis-Menten model, the NLS fails due to mis-specification, while the PINN reveals the non-identifiability of certain parameters when only plasma data is available. By incorporating sparse tissue observations, the PINN demonstrates improved parameter recovery, highlighting its ability to expose structural identifiability issues that traditional methods may overlook. The authors argue that PINNs provide a unified framework that integrates known dynamics with heterogeneous measurements, offering insights into parameter identifiability in pharmacokinetic modeling.
Methodology
The authors employed Physics-Informed Neural Networks (PINNs) to model chemotherapy pharmacokinetics, comparing their performance against a standard nonlinear least-squares estimator and a data-only multilayer perceptron. They conducted experiments on both a linear two-compartment model and a Michaelis-Menten extension, analyzing the ability of each method to recover tissue concentration and identify parameters from plasma data.
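The physics term in a PINN loss penalizes any mismatch between the network's time derivatives and the known dynamics. The sketch below writes out a standard linear two-compartment model and its squared residual, assuming both compartments are expressed in amounts (or equal-volume concentrations) and using the conventional rate names `k10`, `k12`, `k21`. The paper's exact parameterization is not given in the summary, so these equations and function names are assumptions.

```python
from typing import List, Tuple


def rhs(c1: float, c2: float, k10: float, k12: float,
        k21: float) -> Tuple[float, float]:
    """Two-compartment dynamics: plasma (c1) and tissue (c2)."""
    dc1 = -(k10 + k12) * c1 + k21 * c2   # elimination plus exchange
    dc2 = k12 * c1 - k21 * c2            # exchange only
    return dc1, dc2


def physics_residual(c1: float, c2: float, dc1_dt: float, dc2_dt: float,
                     k10: float, k12: float, k21: float) -> float:
    """Squared ODE residual at one collocation point.

    In a PINN, dc1_dt and dc2_dt would come from automatic differentiation
    of the network output with respect to time.
    """
    f1, f2 = rhs(c1, c2, k10, k12, k21)
    return (dc1_dt - f1) ** 2 + (dc2_dt - f2) ** 2


def simulate(c1: float, c2: float, k10: float, k12: float, k21: float,
             dt: float, steps: int) -> List[Tuple[float, float]]:
    """Forward-Euler reference trajectory for sanity checks."""
    traj = [(c1, c2)]
    for _ in range(steps):
        d1, d2 = rhs(c1, c2, k10, k12, k21)
        c1, c2 = c1 + dt * d1, c2 + dt * d2
        traj.append((c1, c2))
    return traj


trajectory = simulate(1.0, 0.0, k10=0.3, k12=0.2, k21=0.1, dt=0.01, steps=1000)
```

The tissue state `c2` appears in the residual even when no tissue measurements exist, which is how the physics term lets the network estimate the unobserved compartment from plasma data alone.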
Results
In the linear two-compartment model, the PINN achieved results comparable to the NLS estimator while successfully estimating tissue concentration. In the Michaelis-Menten model, the NLS produced meaningless results due to mis-specification, while the PINN identified the non-identifiability of certain parameters, converging to a solution that indicated k12 approaching zero. With additional sparse tissue observations, the PINN improved parameter recovery significantly, demonstrating its effectiveness in complex scenarios.
Implications
The findings suggest that PINNs can serve as a powerful tool for modeling complex biological systems, particularly in pharmacokinetics where traditional methods may struggle. This approach could lead to better understanding and optimization of chemotherapy dosing strategies, ultimately improving patient outcomes. Additionally, the ability of PINNs to expose structural identifiability issues may guide future research in parameter estimation and model development in various biological contexts.
Federated Learning for Feature Generalization with Convex Constraints
Federated Learning
- FedCONST introduces convex constraints to enhance feature generalization in Federated Learning.
- The method adaptively adjusts update magnitudes based on global model parameter strengths.
- Empirical results show significant improvements in generalization and robustness compared to existing FL methods.
- The approach maintains high computational and communication efficiency.
Summary
The paper addresses the challenges of generalization in Federated Learning (FL) due to heterogeneous client data, which often leads to local models overfitting their specific data distributions. The authors propose a novel approach called FedCONST, which adaptively modulates update magnitudes based on the strength of global model parameters. This method employs linear convex constraints to stabilize training and enhance feature transferability during aggregation. By ensuring that well-learned features remain close during local updates while emphasizing under-learned features, FedCONST effectively aligns local and global objectives, reducing overfitting and promoting better generalization across diverse FL environments. The authors validate their approach through theoretical foundations and empirical analyses, demonstrating that FedCONST outperforms existing FL methods across various datasets and model architectures, achieving state-of-the-art performance.
Methodology
FedCONST applies client-consistent convex constraints derived from the global model's weight magnitudes to stabilize training and enhance feature generalization. The method focuses on retaining well-learned features while emphasizing under-learned ones, using a Gradient Signal to Noise Ratio (GSNR) analysis to validate its effectiveness.
Results
The experiments demonstrate that FedCONST significantly outperforms existing Federated Learning methods across various models and datasets, achieving state-of-the-art performance while maintaining efficiency in computation and communication.
Implications
The findings suggest that FedCONST can be applied to improve generalization in Federated Learning scenarios, particularly in environments with heterogeneous data distributions. This could have significant implications for applications in healthcare, finance, and other fields where data privacy and local data distribution are critical.
Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention
NLP
Large Language Models
Theory
- Introduces Boltzmann Attention, which incorporates learnable Ising couplings for enhanced attention modeling.
- Addresses limitations of standard attention mechanisms by allowing for inter-position correlations.
- Demonstrates significant performance improvements in language modeling tasks compared to traditional softmax attention.
- Establishes a connection between attention mechanisms and statistical physics, particularly through the Ising model.
Summary
This paper introduces Boltzmann Attention, an innovative attention mechanism that enhances the standard attention model by incorporating learnable Ising couplings to capture inter-position correlations. Traditional attention mechanisms primarily rely on individual query-key similarities, which limits their ability to model cooperative or antagonistic relationships between attention decisions. The proposed Boltzmann Attention formulates attention as an interacting spin system, where each position is represented as a binary spin that can either attend or ignore. By introducing pairwise couplings, the model allows for richer representations of dependencies, particularly in tasks where relational relevance is crucial. The authors demonstrate the effectiveness of Boltzmann Attention through experiments on character-level language modeling and synthetic bracket matching, showing consistent improvements over standard softmax attention, especially as sequence lengths increase. The findings suggest that explicitly modeling inter-position interactions can significantly enhance attention-based sequence modeling. Additionally, the Ising model framework opens avenues for quantum-computing-based sampling strategies, with the authors showcasing that diabatic quantum annealing can be utilized for practical training while maintaining competitive performance.
Methodology
The authors formulate attention as an interacting spin system using an Ising model, where each key position is assigned a binary spin. They introduce learnable pairwise couplings between these spins to capture inter-position dependencies, and derive attention weights from the marginal spin magnetizations under the Boltzmann distribution. The model is evaluated through experiments on character-level language modeling and synthetic bracket matching.
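The spin-system formulation can be illustrated by brute force on a handful of key positions. The sketch below assumes spins take values in {0, 1} (ignore or attend), fields `h` play the role of query-key scores, `J` holds pairwise couplings for i < j, and attention weights are the marginals normalized to sum to one. Those conventions, the exact enumeration, and the function names are assumptions for illustration, not the authors' implementation, which would need an approximate or sampled estimate at realistic sequence lengths.

```python
import itertools
import math
from typing import List


def magnetizations(h: List[float], J: List[List[float]]) -> List[float]:
    """Exact marginals E[s_i] under P(s) proportional to
    exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j), with s_i in {0, 1}."""
    n = len(h)
    Z = 0.0
    m = [0.0] * n
    for s in itertools.product((0, 1), repeat=n):
        log_w = sum(h[i] * s[i] for i in range(n))
        log_w += sum(J[i][j] * s[i] * s[j]
                     for i in range(n) for j in range(i + 1, n))
        w = math.exp(log_w)
        Z += w
        for i in range(n):
            m[i] += w * s[i]
    return [mi / Z for mi in m]


def attention_weights(h: List[float], J: List[List[float]]) -> List[float]:
    """Normalize marginals into weights that sum to one (assumed convention)."""
    m = magnetizations(h, J)
    total = sum(m)
    return [mi / total for mi in m]


# With zero couplings each position is independent: m_i = sigmoid(h_i).
independent = magnetizations([0.0, 1.0], [[0.0, 0.0], [0.0, 0.0]])
# A positive coupling makes the two positions more likely to be attended together.
cooperative = magnetizations([0.0, 0.0], [[0.0, 2.0], [0.0, 0.0]])
```

The enumeration costs 2ⁿ terms, so it is useful only as a reference for checking faster approximations.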
Results
Boltzmann Attention consistently outperformed standard softmax attention in various tasks, with improvements becoming more pronounced as the sequence length increased. An ablation study confirmed that the performance gains were primarily due to the learnable pairwise couplings introduced in the model.
Implications
The findings suggest that incorporating learnable inter-position interactions can significantly enhance the capabilities of attention mechanisms in sequence modeling tasks. Furthermore, the connection to quantum computing opens new avenues for efficient training and scaling of attention models.
Emotional regulation improves deep learning-based image classification
Computer Vision
- Introduction of Emotional Regulation as a framework for modeling emotion in deep learning.
- Demonstrated improvements in image classification tasks using emotion-augmented models.
- Emotional pre-training enhances performance over traditional non-emotional models.
- Evidence of the effectiveness of emotion-inspired architectures in deep learning.
Summary
This paper explores the impact of emotional regulation on deep learning, particularly in image classification tasks. The authors introduce a framework called Emotional Regulation, which incorporates artificial subjective experiences to enhance neural network performance. Unlike existing methods that focus solely on objective neurophysiological factors, this approach balances non-emotional and emotionally influenced responses during task optimization. The study pre-trains ResNet and Vision Transformer (ViT) architectures on various emotional datasets and benchmarks them on CIFAR-10 and CIFAR-100. Results demonstrate that models using Emotional Regulation significantly outperform traditional models, establishing it as a new state-of-the-art method in emotion-augmented deep learning. The findings suggest that incorporating emotional states can improve optimization in machine learning tasks, paving the way for further research into emotion-inspired architectures.
Methodology
The study utilized a framework called Emotional Regulation, which involved pre-training deep learning architectures (ResNet and ViT) on various emotional datasets. The methodology focused on balancing non-emotional and emotionally-influenced predictions during the optimization of image classification tasks.
Results
The results indicated that models trained with the Emotional Regulation framework achieved superior performance compared to traditional models on CIFAR-10 and CIFAR-100 benchmarks. This improvement supports the hypothesis that emotional states can enhance deep learning task optimization.
Implications
The findings suggest that integrating emotional regulation into deep learning frameworks can lead to better generalization and performance in image classification tasks. This approach encourages further exploration of emotion-inspired architectures in artificial intelligence, potentially influencing various applications in computer vision and beyond.
Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning
Optimization
- Zeta introduces a dual whitening approach to optimize matrix operations in neural networks.
- The method corrects scale heterogeneity in momentum matrices, which is prevalent in deep learning models.
- Theoretical proofs support the effectiveness of coordinate whitening followed by spectral whitening.
- Empirical results demonstrate Zeta's superior performance compared to existing optimizers like AdamW and Muon.
Summary
The paper introduces Zeta, a dual whitening optimizer designed to enhance matrix optimization in large-scale neural network training. It addresses a critical vulnerability in existing matrix-aware optimizers such as Muon, which rely on a Newton–Schulz iteration that is sensitive to input conditioning. The authors demonstrate that raw momentum matrices exhibit significant coordinate-wise scale heterogeneity, which can hinder optimization performance. To correct this, Zeta employs a two-step whitening process: coordinate whitening normalizes the momentum matrix entries by their running second moments, and spectral whitening then equalizes the singular values of the matrix. This ordering is mathematically justified, as coordinate whitening establishes the statistical isotropy needed for effective spectral whitening. Theoretical analysis confirms that this dual pipeline reduces orthogonalization error and improves the condition number of the input matrix. Empirical evaluations across language modeling and vision tasks show that Zeta consistently outperforms strong baselines in convergence speed and generalization performance, highlighting the importance of addressing scale imbalance in matrix optimization.
Methodology
Zeta employs a dual whitening process consisting of coordinate whitening and spectral whitening. Coordinate whitening normalizes the momentum matrix entries to reduce scale disparity, while spectral whitening applies Newton–Schulz orthogonalization to equalize singular values. The ordering of these operations is mathematically justified, ensuring effective optimization.
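The two-step ordering described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the `eps` constant, the step count, and the toy data are all assumptions, and the classical cubic Newton-Schulz iteration stands in for whatever variant the paper actually uses.

```python
import numpy as np


def coordinate_whiten(momentum, second_moment, eps=1e-8):
    """Divide each entry by the square root of its running second moment."""
    return momentum / (np.sqrt(second_moment) + eps)


def newton_schulz_orthogonalize(matrix, steps=40):
    """Push all singular values toward 1 with the cubic Newton-Schulz iteration."""
    # Frobenius normalization keeps every singular value in (0, 1],
    # which lies inside the convergence region of the iteration.
    x = matrix / (np.linalg.norm(matrix) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def dual_whiten(momentum, second_moment):
    """Coordinate whitening first, spectral whitening second."""
    return newton_schulz_orthogonalize(coordinate_whiten(momentum, second_moment))


rng = np.random.default_rng(0)
column_scales = np.logspace(-3, 3, 4)            # hypothetical heterogeneous scales
momentum = rng.standard_normal((8, 4)) * column_scales
second_moment = np.broadcast_to(column_scales**2, momentum.shape)

whitened = coordinate_whiten(momentum, second_moment)
update = dual_whiten(momentum, second_moment)

print("condition number before:", np.linalg.cond(momentum))
print("condition number after coordinate whitening:", np.linalg.cond(whitened))
```

In this toy setup the coordinate step removes the artificial column scaling, so the matrix handed to Newton-Schulz is far better conditioned than the raw momentum, which is the effect the summary attributes to the ordering.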
Results
Zeta was evaluated on language models ranging from 0.6B to 8B parameters, mixture-of-experts architectures, and vision tasks. The results indicate that Zeta matches or surpasses the performance of established optimizers like AdamW and Muon in terms of convergence speed and downstream task performance.
Implications
The findings suggest that addressing scale imbalance in matrix optimization can significantly enhance the training efficiency and performance of large-scale neural networks. This has potential applications in various domains, including NLP and computer vision, where large models are prevalent.
FedSPC: Shared Parameter Correction for Personalized Federated Learning
Federated Learning
- FedSPC corrects shared-parameter updates in PFL to mitigate the impact of client-specific objectives.
- The method is modular and applicable across various PFL settings, enhancing flexibility.
- Experimental results show significant performance improvements across multiple PFL methods and datasets.
Summary
The paper introduces Federated Shared Parameter Correction (FedSPC), a novel approach designed to enhance personalized federated learning (PFL) by addressing the optimization challenges associated with shared parameters. In PFL, models are typically divided into shared and personalized parameters, with the shared parameters being updated based on client-specific local objectives. This can lead to inconsistent updates that weaken the shared representation. FedSPC applies a control-variate correction specifically to the shared parameters, leaving personalized parameters unchanged. This modular correction method can be integrated into various PFL frameworks, including those with shared feature extractors, shared classifiers, and fully shared models with local regularization. The authors conducted experiments on CIFAR-100 and Tiny-ImageNet datasets using different architectures (ViT, ResNet-34, and VGG-11) and demonstrated that FedSPC significantly improves the performance of several representative PFL methods, thereby providing a robust solution to the shared-parameter optimization problem in PFL settings.
Methodology
FedSPC employs a control-variate correction technique tailored for PFL, focusing solely on the shared parameters of the model while keeping personalized parameters intact. The method identifies the shared parameter block and applies corrections to its updates based on client-specific data distributions.
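A rough sketch of what a control-variate correction restricted to the shared block might look like is given below. It assumes a SCAFFOLD-style form (subtract a client drift estimate, add a global one); the summary does not state the exact correction, and every name and number here is hypothetical.

```python
import numpy as np


def local_step(shared, personal, grad_shared, grad_personal,
               c_client, c_global, lr=0.1):
    """One local SGD step with a control-variate correction on the shared block only."""
    # SCAFFOLD-style correction (an assumption): remove the client's drift
    # estimate and add the global one.
    corrected = grad_shared - c_client + c_global
    new_shared = shared - lr * corrected
    # Personalized parameters take a plain, uncorrected step.
    new_personal = personal - lr * grad_personal
    return new_shared, new_personal


shared = np.zeros(3)
personal = np.zeros(2)
g_s = np.array([1.0, -2.0, 0.5])
g_p = np.array([0.3, 0.3])
c_i = np.array([0.8, -1.5, 0.0])   # hypothetical client control variate
c = np.array([0.2, -0.5, 0.1])     # hypothetical global control variate

new_shared, new_personal = local_step(shared, personal, g_s, g_p, c_i, c)
print(new_shared, new_personal)
```

Because only `grad_shared` is corrected, the same step can wrap any method that splits a model into a shared block and a personalized block.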
Results
Experiments on CIFAR-100 and Tiny-ImageNet show that adding FedSPC improves representative PFL methods, including FedPer, FedRep, FedBABU, LG-FedAvg, and Ditto, indicating that the correction stabilizes shared-parameter updates and improves overall model performance.
Implications
FedSPC has the potential to enhance the effectiveness of personalized federated learning in various applications, particularly in scenarios where client data is heterogeneous. This could lead to better model performance in real-world applications such as healthcare, finance, and mobile device learning.
DTVEM-RE: A Hierarchical Random-Effects Extension of the Differential Time-Varying Effect Model for Person-Specific Multi-Lag Estimation in Intensive Longitudinal Data
Time Series
- DTVEM-RE estimates person-specific multi-lag coefficients, removing the original DTVEM's assumption of a single group-level lag structure.
- The model demonstrates strong parameter recovery and credible interval coverage in simulation studies.
- Empirical results indicate substantial variability in autoregressive effects across individuals, highlighting the importance of idiographic approaches.
- DTVEM-RE outperforms traditional methods in predictive accuracy for intensive longitudinal data.
Summary
This paper introduces DTVEM-RE, an extension of the Differential Time-Varying Effect Model (DTVEM), which addresses the limitation of assuming a single group-level lag structure in intensive longitudinal data analysis. The original DTVEM model, while effective in identifying optimal lag structures, does not account for individual differences in dynamics, which is crucial in psychopathology research. DTVEM-RE incorporates hierarchical random effects to estimate person-specific multi-lag coefficients, utilizing Hamiltonian Monte Carlo in Stan for estimation. The paper presents three main contributions: a simulation study confirming accurate recovery of the between-person variance parameter, an empirical demonstration using ecological momentary assessment data showing significant variability in autoregressive effects across individuals, and a multi-lag extension revealing robust heterogeneity across multiple lags. The findings suggest that DTVEM-RE provides a principled approach to idiographic analysis in clinical psychology, enhancing predictive accuracy and understanding of individual symptom dynamics.
Methodology
DTVEM-RE employs a hierarchical Bayesian framework with random effects to estimate individual lag profiles. The exploratory stage utilizes factor-smooth generalized additive mixed models (GAMMs) to derive individual lag curves, while the confirmatory stage applies a state-space vector autoregression (VAR) model estimated via Hamiltonian Monte Carlo in Stan.
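The random-effects idea can be illustrated with a small simulation: each person's lag-1 coefficient is drawn from a shared group-level distribution, and per-person estimates are then compared with the true values. This is only a sketch of the generative assumption. It uses per-person least squares rather than the hierarchical Bayesian estimation in Stan, and the values of `mu_a`, `tau_a`, and the sample sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_obs = 30, 500
mu_a, tau_a = 0.3, 0.15     # hypothetical group mean and between-person SD

# Person-specific lag-1 coefficients drawn from the group-level distribution,
# clipped so every simulated series stays stationary.
a_true = np.clip(rng.normal(mu_a, tau_a, size=n_people), -0.9, 0.9)

a_hat = np.empty(n_people)
for i, a in enumerate(a_true):
    y = np.zeros(n_obs)
    for t in range(1, n_obs):
        y[t] = a * y[t - 1] + rng.standard_normal()
    # Per-person least-squares estimate of the lag-1 coefficient (no pooling).
    a_hat[i] = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

print("true between-person SD:", a_true.std(ddof=1))
print("SD of per-person estimates:", a_hat.std(ddof=1))
```

The spread of the unpooled estimates mixes true between-person variance with sampling noise; a hierarchical model is meant to separate the two.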
Results
The simulation study confirmed accurate recovery of the between-person variance parameter (τa), with minimal bias and credible interval coverage of 90% to 93%. In empirical applications, significant variability in lag-1 autoregressive effects was observed across three affect items, with high correlation between hierarchical Bayesian and independent GAMM estimates. DTVEM-RE achieved the best predictive log-likelihood and root-mean-square error among the methods compared, which included four baselines. The multi-lag analysis indicated that all nine τk estimates across items and lags were statistically significant, with varying degrees of heterogeneity.
Implications
DTVEM-RE enhances the understanding of individual psychological dynamics in clinical settings, allowing for more personalized treatment approaches. Its ability to capture person-specific lag structures can lead to improved predictive models in psychopathology and related fields.
Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration
Graph Learning
Efficient ML
Optimization
- Introduces a unified graph-based framework for dataset pruning that combines intrinsic and extrinsic sample evaluations.
- Frames the dataset pruning problem as a Maximum Weight Clique Problem (MWCP) and provides a principled greedy solution.
- Proves formal approximation guarantees for a broad family of importance metrics under mild conditions.
- Reduces training time by over 40% on ImageNet-1k with ResNet-50 without sacrificing accuracy.
Summary
This paper addresses the high computational cost of training on large datasets by proposing a unified dataset pruning (DP) framework built on graph-based modeling. Traditional DP methods focus either on intrinsic signals, assessing samples independently, or on extrinsic signals, promoting diversity through pairwise relations. However, these methods often lack robustness across varying pruning ratios and data distributions. The authors introduce a graph-based approach in which dataset samples are represented as nodes whose weights reflect intrinsic value, with edges representing extrinsic relationships. This formulation allows the pruning problem to be framed as a Maximum Weight Clique Problem (MWCP), which is NP-hard. To tackle this, the authors develop a greedy algorithm based on sample-wise marginal gains and prove that their unified objective has formal approximation guarantees under certain conditions. Extensive experiments show that the method reduces training time by over 40% on ImageNet-1k with ResNet-50 while maintaining accuracy, outperforming existing DP methods.
Methodology
The authors model the dataset as a weighted graph, where node weights represent intrinsic importance and edge weights represent extrinsic relationships. They reformulate the dataset pruning problem as a Maximum Weight Clique Problem (MWCP) and develop a greedy algorithm that evaluates samples based on marginal gains, ensuring efficient and scalable pruning. The theoretical foundation includes proving approximation guarantees for various importance metrics.
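A greedy selection driven by marginal gains could look roughly like the sketch below. The objective used here (kept node weights minus a penalty on pairwise similarity among kept samples) is an assumption for illustration, as are the function name, the `lam` trade-off parameter, and the toy data. The paper's actual objective and guarantees are not reproduced.

```python
import numpy as np


def greedy_select(node_weights, similarity, budget, lam=1.0):
    """Greedily keep up to `budget` samples by marginal gain.

    Assumed objective: sum of kept node weights minus `lam` times the
    pairwise similarity among kept samples.
    """
    node_weights = np.asarray(node_weights, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    gains = node_weights.copy()          # marginal gain w.r.t. the empty set
    selected = []
    for _ in range(min(budget, len(node_weights))):
        candidate = int(np.argmax(gains))
        selected.append(candidate)
        # Adding `candidate` lowers the gain of every sample similar to it.
        gains -= lam * similarity[candidate]
        gains[selected] = -np.inf        # never pick the same sample twice
    return selected


weights = [1.0, 0.9, 0.5]                # intrinsic importance per sample
sim = [[0.0, 0.8, 0.0],                  # samples 0 and 1 are near-duplicates
       [0.8, 0.0, 0.1],
       [0.0, 0.1, 0.0]]
kept = greedy_select(weights, sim, budget=2)
print(kept)  # [0, 2]
```

Sample 1 has the second-highest intrinsic weight but is skipped because it is redundant with sample 0, which is the interplay between intrinsic and extrinsic signals that the framework targets.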
Results
The proposed framework consistently outperforms existing dataset pruning methods, achieving over 40% reduction in training time on ImageNet-1k with ResNet-50, while maintaining model accuracy. The experiments validate the effectiveness of the unified approach in diverse scenarios.
Implications
This work has significant implications for improving the efficiency of training deep learning models on large datasets, potentially enabling faster model development and deployment in various applications, including computer vision and natural language processing.
Small LLMs: Pruning vs. Training from Scratch
Large Language Models
Efficient ML
- Pruning provides a strong initialization advantage over random initialization for small LLMs.
- The advantage of pruning diminishes as the pruning ratio increases and with extended training.
- Training from scratch with the full token budget can match or surpass coarser pruning.
- Pruning is recommended when the training token budget is limited, while training from scratch can be viable with sufficient resources.
Summary
This paper compares pruning an existing model against training from scratch as ways to obtain small large language models (LLMs). The authors prune the Llama-3.1-8B model at ratios of 0.5 to 0.8 using various methods that target different granularities (depth, width, and sparsity). They conduct experiments under two controlled settings: one where both models are trained with the same token budget, and another where training from scratch is given the full token budget of the pruning pipeline. The findings reveal that pruned models consistently outperform randomly initialized ones when the training budget is limited, although this advantage diminishes at higher pruning ratios. When training from scratch is given the full token budget, pruned models still retain an edge, particularly with fine-grained pruning methods. The results suggest that pruning serves as an effective initialization strategy, especially when resources are constrained, while training from scratch can be competitive with coarser pruning when ample training tokens are available.
Methodology
The authors employed six pruning methods across different granularities (depth, width, and sparsity) to prune the Llama-3.1-8B model. They compared the performance of pruned models against those initialized randomly under two settings: equal training token budget and equal total token budget, to isolate the effects of initialization and training resources.
Results
Pruned models consistently outperformed randomly initialized models under a limited training token budget, although the advantage shrank as the pruning ratio increased. When training from scratch was allowed the full token budget, pruned models still performed better with fine-grained pruning methods, while training from scratch became competitive with coarser structured pruning.
Implications
The findings suggest that pruning can be a practical approach for developing efficient small LLMs, particularly in scenarios with limited training resources. This could have significant implications for deploying LLMs in resource-constrained environments, enhancing accessibility and efficiency in various applications.
SemPiper: Interactive Code Synthesis for Semantic Operators in Machine Learning Pipelines
Large Language Models
NLP
Optimization
- SemPipes extends ML pipelines with LLM-powered semantic data operators for enhanced flexibility and control.
- Developers can use natural language instructions to define high-level operations, which are synthesized into optimized Python code.
- The SemPiper interface allows users to visualize and interact with pipeline components, enhancing understanding and usability.
- The approach reduces the need for LLM calls during inference, streamlining the pipeline development process.
Summary
The paper presents SemPipes, a novel programming model designed to enhance machine learning (ML) pipelines by integrating declarative, LLM-powered semantic data operators (SemOps). Traditional ML pipeline development is often tedious and error-prone due to extensive data preparation and feature engineering requirements. SemPipes addresses these challenges by allowing developers to specify high-level natural language instructions for data-centric operations, which are then translated into optimized Python code at training time based on dataset characteristics and pipeline context. The authors introduce SemPiper, an interactive interface that visualizes the computational graphs of the pipelines, synthesized operator implementations, and optimization trajectories. Through three end-to-end scenarios, including fraud detection and multimodal data integration, the demonstration showcases how SemPipes enables controllable and optimizable integration of LLM capabilities into ML pipeline development. The approach reduces reliance on LLMs during inference, allowing for semi-automated pipeline construction while maintaining flexibility for iterative development in interactive environments.
Methodology
The authors developed SemPipes, a programming model that utilizes LLMs to generate task-specific implementations of semantic operators during pipeline training. The model separates the specification of what to compute from how to execute it, allowing for high-level natural language instructions. The synthesized code is optimized using an evolutionary search process based on downstream model performance. The SemPiper interface provides an interactive platform for users to explore and modify ML pipelines.
Results
The demonstration of SemPipes through SemPiper highlighted its effectiveness in synthesizing operator implementations and optimizing them for various ML tasks. The interactive scenarios showcased the ability to generate and refine code based on user instructions and dataset characteristics, leading to improved predictive performance in the presented applications.
Implications
SemPipes has the potential to significantly streamline the development of ML pipelines by reducing the complexity of data preparation and feature engineering. Its integration of LLM capabilities allows for more intuitive and efficient pipeline construction, which could enhance productivity for data scientists and machine learning practitioners. The approach also opens avenues for further research in interactive ML development tools and the application of semantic operators in diverse data processing scenarios.
A Stationarity-and-Coupling Criterion for Training-Free Time-Lagged Spectral Embeddings of Multivariate Time Series
Time Series
- Introduces a training-free descriptor D(τ) for multivariate time series based on time-lagged correlation matrices.
- Establishes a falsifiable applicability criterion for the descriptor, focusing on stationarity and temporal coupling.
- Validates the descriptor's effectiveness on four paradigms while demonstrating its limitations on three others.
- Proposes a two-part pre-flight test to predict the applicability of the descriptor before training.
Summary
This paper investigates training-free fixed-length descriptors for multivariate time series, focusing on the applicability of such descriptors rather than just their performance on benchmarks. The authors introduce D(τ), a descriptor derived from the time-lagged correlation matrix, which is designed to work under specific conditions of stationarity and temporal coupling. The central contribution is a falsifiable applicability criterion that determines when D(τ) can effectively separate classes based on their temporal coupling rather than marginal power. The authors derive conditions under which the descriptor is effective and propose a two-part pre-flight test to assess applicability before training. They validate this criterion across various paradigms, demonstrating that D(τ) performs competitively in scenarios that meet the stationarity and coupling conditions, while failing in cases that do not. The descriptor is highlighted for its compactness, training-free nature, and operational efficiency, making it suitable for applications where labeled data and computational resources are limited.
Methodology
The authors derive the applicability criterion from a stationary Gaussian VAR(1) generative model, focusing on the conditions under which D(τ) can distinguish between classes based on cross-channel temporal coupling. They conduct a two-part pre-flight test involving an augmented Dickey–Fuller stationarity check and a power-baseline saturation check to assess the descriptor's applicability.
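As a rough illustration of a training-free, fixed-length descriptor built from a time-lagged correlation matrix, the sketch below standardizes each channel, forms the lag-τ cross-channel correlation matrix, and returns its singular values. Using the singular values as the descriptor is an assumption; the summary does not spell out how D(τ) is extracted from the matrix, and the function name and toy data are hypothetical.

```python
import numpy as np


def lagged_spectral_descriptor(x, lag):
    """Singular values of the lag-`lag` cross-channel correlation matrix.

    x has shape (time, channels); the output has one value per channel.
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)     # standardize each channel
    n = z.shape[0] - lag
    # Entry (i, j) correlates channel i at time t with channel j at time t + lag.
    corr = z[:n].T @ z[lag:] / n
    return np.linalg.svd(corr, compute_uv=False)


rng = np.random.default_rng(0)
signal = rng.standard_normal((2000, 3))          # white noise: no temporal coupling
d0 = lagged_spectral_descriptor(signal, lag=0)
d5 = lagged_spectral_descriptor(signal, lag=5)
print(d0, d5)
```

For uncoupled white noise the lagged descriptor is close to zero, which matches the intuition that the descriptor carries information only when cross-channel temporal coupling is present.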
Results
The descriptor D(τ) achieved competitive performance on four paradigms (Sleep-EDF sleep staging, BCI-IV-2a motor imagery, MIT-BIH arrhythmia, and ESC-50 environmental sound), reaching an accuracy of 88.5±4.5% on Sleep-EDF. In contrast, it failed on three paradigms that violated the applicability criterion, collapsing to chance in non-stationary cases and being outperformed by simple power baselines in power-discriminated tasks.
Implications
The findings suggest that D(τ) can serve as a robust, training-free feature extractor for multivariate time series analysis in scenarios with limited labeled data and computational resources. The applicability criterion provides a framework for selecting appropriate methods for different time series tasks, potentially guiding future research in training-free representation learning.
Neural Variability Enhances Artificial Network Robustness
Theory
- Structured noise derived from activation covariance improves ANN robustness.
- Robustness benefits most from structured noise in response to naturalistic modifications.
- Noise structure from adversarial attacks generalizes better across different attack types.
- The approach is biologically plausible, reflecting neural variability in the brain.
Summary
This paper investigates the role of structured noise in enhancing the robustness of artificial neural networks (ANNs) against adversarial attacks and naturalistic image modifications. The authors draw parallels between the inherent variability observed in biological neural responses and the potential benefits of incorporating correlated noise into ANNs. They hypothesize that structured noise, derived from the covariance of activations under modified versus clean inputs, can improve network robustness more effectively than unstructured noise. The study employs a standard neural network architecture, specifically a modified LeNet model, and evaluates its performance on the Fashion MNIST dataset. By injecting Gaussian noise with a covariance structure that reflects the nature of the modifications, the authors demonstrate that this approach can lead to a more robust decision boundary, capable of generalizing across different types of adversarial attacks. The findings suggest that structured noise not only enhances robustness but also aligns with biological principles of neural variability, providing a novel strategy for improving the resilience of artificial networks.
Methodology
The authors utilized a standard neural network architecture based on LeNet, trained on the Fashion MNIST dataset. They introduced structured Gaussian noise into the network's activations, with the covariance derived from the differences in model activations between clean and modified inputs. The network's robustness was evaluated against various adversarial attacks and naturalistic image modifications, comparing performance with established defense methods such as Gaussian Data Augmentation and Adversarial Training.
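The noise construction described above can be sketched as follows: estimate the covariance of the activation differences between modified and clean inputs, then sample zero-mean Gaussian noise with that covariance. The arrays below are random stand-ins for one layer's activations, not outputs of the LeNet model, and the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_units = 200, 6

# Stand-ins for one layer's activations on clean and modified inputs.
act_clean = rng.standard_normal((n_samples, n_units))
perturbation_scale = np.linspace(0.1, 1.0, n_units)
act_modified = act_clean + rng.standard_normal((n_samples, n_units)) * perturbation_scale

# Covariance of the activation differences defines the noise structure.
diff = act_modified - act_clean
cov = np.cov(diff, rowvar=False)

# Draw zero-mean Gaussian noise with that covariance and add it to the activations.
noise = rng.multivariate_normal(np.zeros(n_units), cov, size=n_samples)
act_noisy = act_clean + noise
```

Unstructured noise would correspond to replacing `cov` with a scaled identity matrix, which is the comparison the hypothesis rests on.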
Results
The results indicate that injecting structured noise significantly enhances the robustness of the ANN against both adversarial attacks and naturalistic modifications. The structured noise approach led to improved decision boundaries, allowing the network to maintain performance even when faced with perturbations. The study found that while structured noise is particularly effective against specific types of modifications, its benefits can generalize across different adversarial scenarios.
Implications
The findings suggest that incorporating structured noise into ANNs could be a viable strategy for developing more robust machine learning models. This approach not only enhances performance in adversarial settings but also aligns with biological insights into neural variability, potentially informing future research in both artificial intelligence and neuroscience.
Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning
Federated Learning
- FED-FBD provides architecturally guaranteed block-level isolation to prevent adversarial contamination.
- The framework achieves inherent privacy-by-design, reducing the risk of membership inference.
- Surgical unlearning is enabled, allowing for the removal of a client's contributions in sub-second time without retraining.
- Experimental results show that FED-FBD maintains competitive accuracy compared to FedAvg while offering enhanced security features.
Summary
The paper introduces FED-FBD, a novel federated learning architecture designed to address critical issues in medical data training, such as adversarial contamination, privacy concerns, and the right to be forgotten. Unlike traditional federated learning methods like FedAvg, which treat clients as black boxes, FED-FBD decomposes a ResNet backbone into six functional blocks and maintains a warehouse of color variants. Each block is independently tracked and contributor-stamped, allowing for block-level isolation, inherent privacy, and surgical unlearning capabilities. The architecture ensures that adversarial clients cannot contaminate clean model variants, and it structurally suppresses memorization of individual data points, achieving a membership-inference AUC of 0.50 before any privacy mechanisms are applied. The paper presents experimental results across multiple datasets, demonstrating that FED-FBD incurs only a modest accuracy trade-off while effectively isolating adversarial influences and enabling rapid unlearning of contributions from departed clients.
Methodology
FED-FBD employs a modular architecture that decomposes a ResNet model into six functional blocks, each independently tracked and contributor-stamped. The framework maintains a warehouse of color variants, ensuring that each block's contributions are isolated and can be audited. The methodology includes experiments on six MedMNIST-2D datasets, PathMNIST, and CIFAR-10 to evaluate performance and security features.
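One way to picture contributor-stamped blocks and surgical unlearning is a small bookkeeping structure like the one below. This is a hypothetical sketch: the class names, fields, and the rule that a variant is dropped whenever the departing client appears among its contributors are all assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class BlockVariant:
    block_id: int                 # which functional block this variant belongs to
    color: str                    # variant label
    weights: list                 # placeholder for the block's parameters
    contributors: set = field(default_factory=set)   # contributor stamps


class Warehouse:
    def __init__(self):
        self.variants = []

    def add(self, variant):
        self.variants.append(variant)

    def unlearn(self, client_id):
        """Drop every variant stamped by `client_id`; return how many were removed."""
        before = len(self.variants)
        self.variants = [v for v in self.variants if client_id not in v.contributors]
        return before - len(self.variants)


warehouse = Warehouse()
warehouse.add(BlockVariant(0, "red", [0.1], {"client_a"}))
warehouse.add(BlockVariant(0, "blue", [0.2], {"client_b"}))
warehouse.add(BlockVariant(1, "red", [0.3], {"client_a", "client_c"}))
removed = warehouse.unlearn("client_a")
print(removed, [v.color for v in warehouse.variants])
```

Because removal is a filter over stamped records rather than a retraining run, it finishes almost instantly, which is consistent with the sub-second unlearning the summary reports.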
Results
The experiments show that FED-FBD incurs a modest accuracy gap of 0.3%–3.1% on adequately sized datasets compared to traditional methods. It confines adversarial attacks to the poisoned client's own blocks, with a maximum AUC drift of ±0.01 on clean colors. It also attains a membership-inference AUC of 0.50 ± 0.01 before any privacy mechanisms are applied, meaning the attack performs no better than chance.
Implications
FED-FBD is relevant to deploying federated learning in sensitive domains such as healthcare, where data privacy and security are paramount. The ability to isolate adversarial influences and rapidly unlearn a client's contributions could enhance trust in federated learning systems and facilitate compliance with data protection regulations.
The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics
Generative Models
Theory
- Introduces a geometric perspective on phase transitions in generative models.
- Defines projection caustics as critical regions where multiple data support branches coexist.
- Develops the Critical Boundary Detector (CBD) for diagnosing score-direction instability.
- Demonstrates the CBD's effectiveness in various generative models for predicting mode commitment.
Summary
This paper explores the geometric underpinnings of phase transitions in continuous-state generative models, such as diffusion and flow-matching models. It identifies that these models, while evolving continuously, can exhibit abrupt qualitative changes in their outputs, akin to phase transitions. The authors propose a framework where denoising is viewed as gradient descent on a free energy landscape, with sharp transitions occurring near projection caustics—regions where multiple nearest-point projections onto the data support exist. To diagnose these transitions, they introduce the Critical Boundary Detector (CBD), which identifies instability in score directions along generative trajectories. The CBD is tested across various models, including toy examples and latent text-to-image diffusion models, demonstrating its ability to localize mode commitment and predict sensitive intervention windows. The findings connect the geometry of data with the dynamics of diffusion generation, providing insights into the mechanisms behind discrete-like behaviors in continuous generative processes.
Methodology
The authors analyze the behavior of generative models through geometric interpretations, particularly focusing on projection caustics. They develop the Critical Boundary Detector (CBD) to identify regions of instability in score directions along generative trajectories. The methodology includes theoretical analysis and empirical testing on toy models and established diffusion models.
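A very simple proxy for score-direction instability along a trajectory is the change in direction between consecutive score vectors, measured as one minus their cosine similarity. The sketch below uses that proxy only to make the idea concrete; it is not the paper's Critical Boundary Detector, and the function name and toy scores are hypothetical.

```python
import numpy as np


def score_direction_instability(scores):
    """1 - cosine similarity between consecutive score vectors along a trajectory."""
    scores = np.asarray(scores, dtype=float)
    unit = scores / np.linalg.norm(scores, axis=1, keepdims=True)
    cosines = np.sum(unit[:-1] * unit[1:], axis=1)
    return 1.0 - cosines


# Toy trajectory: the score points along x, then abruptly along y.
trajectory_scores = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
instability = score_direction_instability(trajectory_scores)
critical_step = int(np.argmax(instability))
print(instability, critical_step)  # [0. 1. 0.] 1
```

A spike in this quantity marks the step at which the direction of the score changes sharply, which is the kind of event the summary associates with mode commitment.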
Results
The CBD successfully identifies critical regions where generative models exhibit rapid changes in output, correlating with mode commitment and semantic branching. The results indicate that the CBD can effectively predict intervention-sensitive windows, enhancing control over generative processes.
Implications
The findings suggest that understanding the geometric structure of data can improve the controllability of generative models. The CBD could be utilized in various applications, including targeted interventions in generative tasks, enhancing the robustness and interpretability of generative dynamics.
Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability
Interpretability
Graph Learning
- The study combines various XAI techniques to enhance the interpretability of DTI models.
- It highlights the role of bridge nodes and edges in linking drug and protein features.
- The results indicate that explainability can reveal important biological patterns and relationships.
- The findings can help prioritize external validation in computational drug discovery.
Summary
This paper addresses the challenge of interpretability in drug-target interaction (DTI) and drug-target affinity (DTA) prediction models, which often operate as 'black boxes'. The authors conduct an interpretability audit of the BridgeDPI architecture across three datasets: Gao, Human, and C. elegans. They combine gradient-based attribution methods (such as integrated gradients and saliency maps) with feature-wise occlusion ablation to mitigate single-explainer bias. The study shows that explainability works as a model criticism tool, uncovering modality dominance, dataset-dependent effects, and chemistry-consistent motifs. The findings suggest that while these analyses do not replace structural or experimental validation, they can generate testable hypotheses for drug discovery. The work is among the first to apply multiple post-hoc XAI techniques to DTI models, emphasizing the importance of interpretability in understanding model predictions and guiding therapeutic design.
Methodology
The authors applied multiple post-hoc explainable AI (XAI) techniques, including integrated gradients, saliency maps, layer-wise relevance propagation, SmoothGrad, and perturbation-based methods, to the BridgeDPI model. They conducted a comprehensive analysis across three datasets, focusing on sensitivity and effects at various input levels and through graph convolution.
Results
The analysis revealed that explainability is most effective when used as a model criticism tool, uncovering modality dominance and dataset-dependent effects. The study identified key predictive features and chemistry-consistent motifs, demonstrating that different methods can converge on similar insights, thus enhancing the robustness of the findings.
Implications
The insights gained from this study can inform future research in computational drug discovery by providing a clearer understanding of how DTI models make predictions. This can lead to the development of more reliable and interpretable models, ultimately aiding in the design of novel therapeutics.
SupraBench: A Benchmark for Supramolecular Chemistry
Large Language Models
NLP
- Introduction of SUPRABENCH, the first benchmark for evaluating LLMs in supramolecular chemistry.
- Definition of four fundamental tasks and one auxiliary task for comprehensive evaluation.
- Release of SUPRAPMC, a large corpus of supramolecular chemistry articles for domain adaptation.
- Benchmarking reveals substantial headroom for improvement in LLM performance across tasks.
Summary
The paper introduces SUPRABENCH, the first benchmark specifically designed for evaluating large language models (LLMs) in the context of supramolecular chemistry. This field, which focuses on non-covalent host-guest assemblies, faces challenges in the design and verification of host-guest systems, often requiring extensive computational resources and time. The authors collaborate with domain experts to define four fundamental tasks: binding affinity prediction, top-binder selection, solvent identification, and host-guest description, along with an auxiliary vision-based task for molecular identification. To support these tasks, they release SUPRAPMC, a curated corpus of 16 million tokens from supramolecular chemistry literature. The benchmarking results reveal that while LLMs show promise, there remains significant room for improvement across all tasks. Domain adaptation through pretraining on SUPRAPMC enhances performance but also presents trade-offs. The study highlights distinct difficulty profiles and failure modes across the tasks, providing insights into the current limitations of LLMs in this domain. Overall, SUPRABENCH and SUPRAPMC aim to facilitate future research and development in supramolecular chemistry.
Methodology
The authors worked with domain experts to define the four fundamental tasks and the auxiliary vision-based task, then built the benchmark around them. They released the 16-million-token SUPRAPMC corpus for domain adaptation and systematically evaluated a range of LLMs, including domain-adapted models, on every task.
Results
The benchmarking results show that current LLMs leave substantial headroom on every task. Domain adaptation on SUPRAPMC improves results, but not uniformly, and it comes with trade-offs. Each task shows its own difficulty profile and failure modes, which point to areas for future research.
Implications
The introduction of SUPRABENCH and SUPRAPMC has the potential to accelerate advancements in supramolecular chemistry by providing a standardized evaluation framework for LLMs, facilitating the development of more effective computational tools for host-guest system design.
Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting
Theory
- Introduces a unified mathematical framework for OOD detection in RF fingerprinting based on information theory.
- Demonstrates OOD detector tuning without the need for auxiliary OOD data, addressing a major practical challenge.
- Detectors tuned without OOD data perform comparably to detectors tuned on true OOD data, and outperform baselines that lack OOD tuning data.
- Establishes a baseline for future research in open-set RF fingerprinting.
Summary
This paper addresses the challenges of applying out-of-distribution (OOD) detection methods to open-set radio-frequency (RF) fingerprinting, which is critical for identifying wireless emitters in dynamic environments. Traditional RF fingerprinting systems struggle with unknown transmitters and temporal drift, leading to distribution shifts at test time. The authors propose a unified mathematical framework based on information theory to systematically analyze and adapt existing OOD detection methods for RF fingerprinting. A significant contribution is the introduction of tuning techniques for OOD detectors that do not require auxiliary OOD data, which is often impractical to collect in RF environments. The authors evaluate their methods on the POWDER RF fingerprinting dataset, demonstrating that detectors tuned without OOD data can achieve performance comparable to those using true OOD tuning data, while significantly outperforming baseline methods that lack OOD tuning data. This work establishes a foundation for future research in open-set RF fingerprinting and highlights the potential of OOD detection techniques in this domain.
Methodology
The authors adapt state-of-the-art OOD detection methods from the machine learning literature, particularly focusing on feature-shaping approaches. They develop a unified framework that allows for systematic analysis and tuning of these methods without requiring OOD data. The evaluation is conducted using the POWDER RF fingerprinting dataset to assess the effectiveness of the proposed methods.
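As a rough illustration of tuning a detector without OOD samples, the sketch below thresholds an energy score (log-sum-exp of the classifier logits) so that a fixed fraction of in-distribution validation samples is accepted. This is a generic recipe from the OOD literature, not the paper's procedure: the energy score, the 95% acceptance rate, the function names, and the toy logits are all assumptions made for the example.

```python
import math

def energy_score(logits):
    """Log-sum-exp of the logits; larger values look more in-distribution."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def fit_threshold(id_scores, target_tpr=0.95):
    """Choose a threshold that accepts at least target_tpr of the ID validation scores.

    Only in-distribution scores are needed, so no OOD data is required.
    """
    ranked = sorted(id_scores)
    k = int(math.floor((1.0 - target_tpr) * len(ranked)))
    return ranked[k]

def is_known(logits, threshold):
    """Flag a sample as a known emitter if its energy score clears the threshold."""
    return energy_score(logits) >= threshold

# Hypothetical ID validation logits: one confident class per sample.
id_logits = [[6.0 + 0.1 * i, 0.5, -0.5] for i in range(20)]
id_scores = [energy_score(row) for row in id_logits]
threshold = fit_threshold(id_scores)

confident = [7.0, 0.5, -0.5]   # looks like a known transmitter
flat = [0.3, 0.2, 0.1]         # no class stands out, treated as unknown
```

With these toy values, `is_known(confident, threshold)` accepts the sample and `is_known(flat, threshold)` rejects it. A feature-shaping method would modify the network's activations before the score is computed; that step is omitted here.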
Results
The experimental results show that OOD detectors tuned without any OOD data perform comparably to detectors tuned on true OOD data. They also significantly outperform baseline methods that have no access to OOD tuning data, which supports the practical viability of the approach for RF fingerprinting.
Implications
The findings suggest that OOD detection methods can be effectively applied to RF fingerprinting, enhancing the security and reliability of wireless communication systems. This work opens avenues for further research in open-set recognition and the development of robust RF fingerprinting systems capable of handling unknown transmitters.
A fully GPU-based workflow for building physics emulators of hypersonic flows
Efficient ML
- Introduction of a fully GPU-based workflow for hypersonic flow emulation.
- Integration of data generation, surrogate pre-training, and physics-aware refinement in a single pipeline.
- Evaluation of two neural architectures and their performance trade-offs in different data scenarios.
- Demonstration of a target-free refinement method that enhances physical consistency without reference fields.
Summary
This paper presents a fully GPU-based workflow for building physics emulators of hypersonic flows, which involve extreme conditions and complex phenomena such as shock waves. Traditional reduced-order models and neural emulators often struggle to predict these flows accurately because of steep gradients and the need for physical consistency. The authors use a differentiable high-fidelity solver, JAX-Fluids, for rapid dataset generation and for improving the neural emulator through residual-based refinement. The workflow covers GPU-accelerated data generation, pre-training of neural emulators, and a target-free refinement step that improves physical consistency without requiring reference flow fields. The study evaluates two neural architectures (AB-UPT and a vision transformer) under varying data conditions. AB-UPT excels in data-rich settings, while the vision transformer performs better in data-scarce ones. The target-free refinement significantly reduces conservation residuals, suggesting that the pre-trained models already capture the dominant flow structures. Overall, the work contributes to reliable physics emulators that can be used in real-world engineering applications.
Methodology
The methodology involves a fully GPU-accelerated workflow that integrates data generation using a differentiable solver (JAX-Fluids), pre-training of neural emulators with two different architectures (AB-UPT and vision transformer), and a target-free residual-based refinement process. The refinement process backpropagates the residuals of the underlying PDE into the emulator weights, improving physical consistency without the need for reference flow fields.
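The idea of target-free refinement can be sketched on a toy problem: minimize the squared residual of a governing equation with respect to the emulator's parameters, with no reference solution in the loss. Everything below is an assumption made for illustration: the one-parameter "emulator", the stand-in equation du/dx + 2u = 0, the grid, and the learning rate. A finite-difference gradient stands in for the backpropagation through a differentiable solver that the summary describes.

```python
import math

H = 0.05                               # grid spacing (assumed)
GRID = [i * H for i in range(1, 20)]   # interior points of [0, 1]

def emulator(theta, x):
    """Toy one-parameter emulator; satisfies u(0) = 1 by construction."""
    return math.exp(-theta * x)

def residual_loss(theta):
    """Mean squared residual of du/dx + 2u = 0, using central differences in x."""
    total = 0.0
    for x in GRID:
        dudx = (emulator(theta, x + H) - emulator(theta, x - H)) / (2.0 * H)
        r = dudx + 2.0 * emulator(theta, x)
        total += r * r
    return total / len(GRID)

def grad(theta, eps=1e-6):
    """Central-difference gradient in theta, standing in for automatic differentiation."""
    return (residual_loss(theta + eps) - residual_loss(theta - eps)) / (2.0 * eps)

def refine(theta, lr=0.2, steps=300):
    """Gradient descent on the residual alone; no reference field appears anywhere."""
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

theta_pre = 0.5                 # plays the role of an imperfect pre-trained emulator
theta_ref = refine(theta_pre)   # refined purely from the equation residual
```

The exact solution of the toy equation is u(x) = exp(-2x), so the refined parameter should land close to 2, up to the discretization error of the finite-difference stencil. Residual-only losses can admit trivial minimizers in general, which is one reason starting from a reasonable pre-trained emulator matters.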
Results
The experiments show that AB-UPT achieves the highest accuracy in data-abundant scenarios, while the vision transformer performs better in data-scarce ones. The target-free refinement significantly reduces conservation residuals, indicating that the pre-trained models already capture the essential flow structures. The study also compares deterministic and probabilistic training paradigms and finds that flow matching gives better out-of-distribution performance.
Implications
The proposed workflow has significant implications for engineering applications involving hypersonic flows, as it enables the development of reliable physics emulators that can be integrated into design processes. This approach can potentially reduce computational costs and improve the accuracy of simulations in high-speed transport and propulsion systems.
Decoupled Latent Optimization of Diffusion Models for Full Waveform Inversion
Optimization
Generative Models
- DLO improves the robustness and realism of FWI by reformulating latent optimization as a decoupled quadratic-penalty objective.
- The method preserves classical FWI's smoothed-velocity initialization while a diffusion sampler contributes decoded prior samples.
- DLO outperforms traditional regularization techniques and existing diffusion-based methods in various acquisition conditions.
- The trained diffusion model shows effective transferability to different geological benchmarks, recovering intricate fault structures.
Summary
This paper introduces Decoupled Latent Optimization (DLO) as a novel approach to Full Waveform Inversion (FWI), which aims to recover subsurface velocity from seismic recordings. Traditional methods struggle with the ill-posed nature of FWI, often leading to unrealistic geological structures or sensitivity to noise. DLO addresses these issues by reformulating the latent optimization process into a quadratic-penalty objective that operates over both an auxiliary physical variable and a latent variable. This method allows for the preservation of classical FWI's smoothed-velocity initialization while utilizing a diffusion sampler that contributes through decoded prior samples. The authors demonstrate that DLO significantly outperforms classical regularizers and existing diffusion-based methods across various scenarios, including clean, noisy, and missing-trace acquisitions. Furthermore, the trained prior model shows effective transferability to different benchmarks, successfully recovering complex geological features and maintaining robustness against initialization and measurement noise.
Methodology
The authors propose a decoupled latent optimization framework that reformulates the standard latent optimization into a quadratic-penalty objective. This involves an auxiliary physical variable and a latent variable, where the data-fidelity gradient operates in physical space and the diffusion sampler provides decoded prior samples. The method retains the smoothed-velocity initialization typical of classical FWI.
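For orientation, a quadratic-penalty splitting of this kind is often written in the following generic form. This is a sketch and not necessarily the paper's exact objective; the symbols are assumptions: m is the auxiliary physical variable (the velocity model), z the latent variable, F the forward wave operator, d the recorded data, G the decoder, J_F the Jacobian of F, and ρ > 0 the penalty weight.

```latex
\min_{m,\,z}\;
  \tfrac{1}{2}\,\lVert F(m) - d \rVert_2^2
  \;+\; \tfrac{\rho}{2}\,\lVert m - G(z) \rVert_2^2,
\qquad
\nabla_m = J_F(m)^{\top}\bigl(F(m) - d\bigr) + \rho\,\bigl(m - G(z)\bigr).
```

In this form the data-fidelity term depends only on m, so its gradient is taken in physical space, while the prior enters only through the decoded sample G(z) in the penalty term.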
Results
DLO was tested on the OpenFWI benchmark and demonstrated superior performance compared to classical regularizers and existing diffusion-based methods, even under challenging conditions such as noise and missing data. The prior model, trained on a set of OpenFWI models, effectively transferred to other benchmarks, successfully recovering complex geological features.
Implications
The DLO framework has significant implications for seismic imaging and geophysical exploration, providing a more reliable and realistic method for subsurface modeling. Its robustness to noise and initialization issues could enhance applications in resource exploration, seismic hazard assessment, and environmental monitoring.