AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
62 papers today · 8h update frequency · 7 days of history
Forgetting to Witness: Efficient Federated Unlearning and Its Visible Evaluation
Federated Learning
Generative Models
Efficient ML
- Introduction of a complete pipeline for federated unlearning, including an efficient unlearning approach and a novel evaluation framework.
- Utilization of knowledge distillation to facilitate the unlearning process without requiring historical data.
- Development of Skyeye, a visualization framework that assesses the forgetting capacity of federated unlearning models.
- Demonstration of the effectiveness of the proposed methods through comprehensive experimental results.
Summary
This paper addresses the emerging field of federated unlearning, which is crucial for ensuring data privacy in federated learning environments. The authors propose a comprehensive pipeline for federated unlearning that includes a novel unlearning approach and an evaluation framework called Skyeye. The proposed unlearning method is efficient and maintains model accuracy without requiring historical data storage. It uses knowledge distillation, in which a deliberately incompetent teacher model guides the student model (the one that must unlearn) during training; by matching the teacher's outputs on the deleted data, the student learns to ignore that data. Additionally, the Skyeye framework visualizes the forgetting capacity of unlearning models by integrating the unlearning model into a Generative Adversarial Network (GAN), which generates samples that reflect the model's knowledge and forgetting capabilities. The paper presents comprehensive experiments demonstrating the effectiveness of both the federated unlearning approach and the Skyeye evaluation framework, highlighting their potential to address regulatory compliance and enhance data privacy in federated learning.
Methodology
The authors propose a federated unlearning approach that employs a knowledge distillation model, where a teacher model guides a student model to unlearn deleted data. The evaluation framework, Skyeye, integrates the unlearning model into a GAN to visualize the model's forgetting capability through sample generation.
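A minimal numeric sketch of this distillation idea, assuming a simple KL-plus-cross-entropy objective (the `alpha` weighting and the loss shapes are illustrative, not the paper's exact formulation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), summed over classes and averaged over the batch.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def unlearning_loss(student_logits_forget, teacher_logits_forget,
                    student_logits_retain, retain_labels, alpha=0.5):
    """Illustrative combined objective: on the forget set, pull the
    student toward the incompetent teacher's predictions; on the retain
    set, keep ordinary cross-entropy so accuracy is preserved."""
    forget_term = kl_divergence(softmax(teacher_logits_forget),
                                softmax(student_logits_forget))
    probs = softmax(student_logits_retain)
    n = len(retain_labels)
    retain_term = -np.mean(np.log(probs[np.arange(n), retain_labels] + 1e-12))
    return alpha * forget_term + (1 - alpha) * retain_term
```

When the student's forget-set outputs match the teacher's exactly, the distillation term vanishes, which is the intended fixed point of the unlearning process.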
Results
The experiments conducted illustrate that the proposed federated unlearning approach is efficient and maintains model accuracy while effectively removing the influence of deleted data. The Skyeye framework successfully visualizes the forgetting capacity of the models, providing a novel method for evaluating federated unlearning.
Implications
The findings of this paper have significant implications for enhancing data privacy in federated learning systems, particularly in light of regulatory requirements for data deletion. The proposed methods can be applied in various domains where federated learning is utilized, ensuring compliance with data protection laws while maintaining model performance.
LMI-Net: Linear Matrix Inequality-Constrained Neural Networks via Differentiable Projection Layers
Optimization
Theory
Efficient ML
- LMI-Net introduces a differentiable projection layer specifically designed for enforcing LMI constraints in neural networks.
- The framework utilizes Douglas–Rachford splitting for efficient projection and implicit differentiation.
- Theoretical convergence guarantees ensure that the model reliably satisfies LMI constraints.
- Experimental results show improved feasibility and robustness under various conditions compared to traditional soft-constrained models.
Summary
This paper introduces LMI-Net, a novel framework for integrating linear matrix inequality (LMI) constraints into neural networks through differentiable projection layers. LMIs are crucial for certifying the stability and robustness of dynamical systems, yet traditional learning-based methods often struggle to maintain these constraints. LMI-Net addresses this by employing a differentiable projection layer that enforces LMI constraints during the learning process. The authors utilize a Douglas–Rachford splitting method to project onto the feasible set defined by the LMI constraints, allowing for efficient forward and backward passes in neural networks. Theoretical guarantees are provided to ensure convergence to feasible points, thus certifying the reliability of the model. Experimental evaluations demonstrate that LMI-Net significantly enhances feasibility over soft-constrained models, particularly under distribution shifts, while maintaining fast inference speeds. This work bridges the gap between semidefinite programming and modern learning techniques, offering a scalable solution for incorporating convex control constraints into learning systems.
Methodology
The authors developed a tailored splitting scheme for LMI constraints, decomposing the feasible set into manageable components. The Douglas–Rachford algorithm is employed to perform projections efficiently, allowing for both forward pass computations and implicit differentiation during backpropagation.
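The paper's tailored splitting scheme is not reproduced in this summary, but a toy Douglas–Rachford feasibility iteration conveys the mechanics. Here the LMI feasible set is replaced by a deliberately simplified stand-in: the intersection of the PSD cone with a unit-trace affine constraint.

```python
import numpy as np

def proj_psd(X):
    # Euclidean projection of a symmetric matrix onto the PSD cone:
    # eigendecompose and clip negative eigenvalues to zero.
    w, V = np.linalg.eigh((X + X.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

def proj_trace(X, target=1.0):
    # Projection onto the affine set {X : trace(X) = target}.
    n = X.shape[0]
    return X - ((np.trace(X) - target) / n) * np.eye(n)

def douglas_rachford_feasible(X0, iters=500):
    """Douglas-Rachford splitting for a feasibility problem: find a
    point in the intersection of the PSD cone and the unit-trace
    affine set. The 'shadow' iterate proj_psd(Z) converges to a
    feasible point."""
    Z = X0
    for _ in range(iters):
        Y = proj_psd(Z)
        Z = Z + proj_trace(2 * Y - Z) - Y
    return proj_psd(Z)
```

Each iteration only needs the two individual projections, which is what makes splitting attractive when the full feasible set is awkward to project onto directly.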
Results
LMI-Net was evaluated on tasks such as invariant ellipsoid synthesis and joint controller-and-certificate design for disturbed linear systems. The results indicated a substantial improvement in feasibility over soft-constrained models, particularly in scenarios involving distribution shifts, while also ensuring fast inference times.
Implications
The proposed LMI-Net framework has significant implications for control systems and machine learning applications where stability and robustness are critical. It enables the development of reliable models that can adapt to varying system parameters and disturbances, enhancing the applicability of neural networks in safety-critical domains.
Training Without Orthogonalization, Inference With SVD: A Gradient Analysis of Rotation Representations
Computer Vision
Robotics
Theory
- Removing orthogonalization during training improves rotation estimation in deep learning.
- SVD orthogonalization introduces gradient pathologies that hinder training performance.
- Direct 9D regression is preferable to 6D regression in the unorthogonalized training regime.
- The paper provides a theoretical analysis of the SVD Jacobian's spectrum and its implications for gradient flow.
Summary
This paper investigates the effects of orthogonalization during the training of neural networks for 3D rotation estimation, specifically focusing on the use of Singular Value Decomposition (SVD) at inference. The author presents a detailed gradient analysis of SVD orthogonalization for 3x3 matrices and projections onto the special orthogonal group SO(3). The findings reveal that incorporating SVD during training introduces significant gradient distortions, particularly when the predicted rotation matrix deviates from the SO(3) manifold. The paper derives the exact spectrum of the SVD Jacobian, demonstrating that it retains only a fraction of the gradient information, which can lead to ambiguous updates and suboptimal convergence. In contrast, the analysis shows that training with direct 9D regression without orthogonalization and applying SVD only at inference yields better performance. The paper also compares SVD with Gram-Schmidt orthogonalization, highlighting the asymmetric gradient signals produced by the latter, thus reinforcing the preference for 9D parameterization over 6D. Overall, this work provides a theoretical foundation for the recent empirical findings advocating for the removal of orthogonalization during training.
Methodology
The author conducts a gradient analysis of SVD and Gram-Schmidt orthogonalization methods, focusing on their impact on the training of neural networks for 3D rotation estimation. The analysis includes deriving the spectrum of the SVD Jacobian and comparing it with the Gram-Schmidt Jacobian to quantify gradient distortions and information loss.
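The inference-time projection itself is standard and compact; a minimal sketch:

```python
import numpy as np

def svd_orthogonalize(M):
    """Map a raw 3x3 network output to the nearest rotation matrix in
    Frobenius norm: R = U diag(1, 1, det(U V^T)) V^T, where the
    determinant factor rules out reflections. In the regime the paper
    favors, this is applied only at inference, never during training."""
    U, _, Vt = np.linalg.svd(M)
    d = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt
```

A prediction that already lies on SO(3) passes through unchanged, so applying the projection at inference costs nothing when the unorthogonalized 9D regression has converged close to the manifold.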
Results
The study finds that the SVD Jacobian has a rank of 3 with specific nonzero singular values, leading to quantifiable gradient distortion. It shows that SVD backpropagation retains only one-third of the gradient energy, while Gram-Schmidt produces asymmetric gradient signals. The results support the conclusion that training without orthogonalization and applying SVD only at inference is more effective.
Implications
This research has significant implications for the design of neural networks in computer vision and robotics, particularly in tasks involving 3D rotation estimation. It suggests that training strategies should avoid orthogonalization to enhance convergence and performance, potentially influencing future methodologies in rotation representation learning.
Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space
Generative Models
- FOT-CFM generalizes Conditional Flow Matching to infinite-dimensional Hilbert spaces.
- Incorporation of Optimal Transport theory allows for efficient and accurate turbulence generation.
- The method achieves high fidelity in reproducing turbulent statistics with fewer function evaluations.
- Neural Operators are used for parameterizing the vector field, enabling resolution-invariant dynamics.
Summary
This paper presents Functional Optimal Transport Conditional Flow Matching (FOT-CFM), a novel generative framework designed for high-fidelity turbulence modeling in infinite-dimensional Hilbert spaces. Traditional generative models struggle with turbulence data due to their discrete nature, which does not align well with the continuous, function-valued characteristics of turbulent flows. FOT-CFM addresses this by generalizing Conditional Flow Matching (CFM) to infinite dimensions, allowing for direct manipulation of probability measures and avoiding density-based constructions. The authors incorporate Optimal Transport (OT) theory to create straight-line probability paths between noise and data measures, enhancing the generative flow and reducing trajectory curvature. This method enables simulation-free training and significantly accelerates the sampling process. The framework is evaluated on complex chaotic systems, including the Navier-Stokes equations, demonstrating its ability to reproduce high-order turbulent statistics and energy spectra with reduced inference latency compared to existing methods.
Methodology
The authors developed FOT-CFM by extending Conditional Flow Matching to infinite-dimensional spaces, utilizing Optimal Transport to construct deterministic paths between noise and data distributions. The framework employs Neural Operators to parameterize the vector field, facilitating continuous operator learning independent of discretization. The training is simulation-free, allowing for efficient sampling.
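In finite dimensions, the conditional flow matching construction on a straight path reduces to a few lines. The sketch below shows a single training pair (point on the path, constant target velocity); the OT coupling between minibatches and the Neural Operator parameterization are omitted.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """One conditional flow matching training pair on a straight
    (OT-style) path: the interpolated point x_t on the line from noise
    x0 to data x1, and the constant velocity x1 - x0 that the learned
    vector field regresses to at (x_t, t)."""
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t

def cfm_loss(predicted_velocity, u_t):
    # Squared-error flow matching objective for a single sample.
    return float(np.mean((predicted_velocity - u_t) ** 2))
```

Because the conditional paths are straight lines, a well-trained vector field produces nearly straight sampling trajectories, which is why the method needs fewer function evaluations than curved ODE baselines.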
Results
FOT-CFM demonstrated superior performance in generating turbulent fields, accurately reproducing high-order statistics and energy spectra across various chaotic systems. The method achieved significant reductions in inference latency compared to diffusion-based and curved ODE-based baselines, showcasing its effectiveness in turbulence modeling.
Implications
The proposed framework has potential applications in various fields requiring accurate turbulence modeling, such as climate prediction, fluid dynamics, and engineering design. Its ability to handle function-valued data directly may also influence future developments in generative modeling for other scientific domains.
On the Geometry of Positional Encodings in Transformers
NLP
Large Language Models
Theory
- Positional information is essential for Transformers to perform order-sensitive tasks.
- Distinct vector representations for sequence positions are learned during training.
- An optimal positional encoding approximates statistical distances between word distributions.
- The sinusoidal encoding is theoretically justified as nearly optimal for certain corpora.
Summary
This paper addresses the theoretical underpinnings of positional encodings in Transformers, which are crucial for processing sequences of words. The author establishes a mathematical framework to explore three main questions: the necessity of positional information, the structure of learned positional encodings, and the characteristics of an optimal positional encoding. Theorem 1 proves that without positional signals, Transformers cannot distinguish between permutations of input sequences, rendering them ineffective for tasks sensitive to word order. The Positional Separation Theorem (Theorem 4) shows that distinct positions in a sequence are assigned unique vector representations during training. The paper also discusses the impossibility of achieving an exact reproduction of statistical distances between word distributions at different positions, proposing a multidimensional scaling approach to approximate this geometry. The study introduces a stress metric to evaluate the quality of various positional encodings, revealing that the sinusoidal encoding is nearly optimal for corpora with smoothly varying positional statistics. Experiments validate the theoretical findings using synthetic and real-world datasets, demonstrating that Attention with Linear Biases (ALiBi) outperforms sinusoidal and Rotary Position Embedding (RoPE) encodings in terms of stress reduction.
Methodology
The paper employs theoretical proofs to establish the necessity and structure of positional encodings, alongside multidimensional scaling to approximate optimal encodings. Experiments are conducted on synthetic and real-world datasets to validate the theoretical predictions.
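The sinusoidal encoding at the center of the analysis is easy to reproduce; the sketch below generates it and checks, in miniature, the separation property that distinct positions receive distinct vectors:

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """Standard Transformer sinusoidal positional encoding (assumes an
    even d_model): interleaved sines and cosines at geometrically
    spaced wavelengths."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

The pairwise Euclidean distances between these vectors are exactly the geometry the paper's stress metric compares against the statistical distances between positional word distributions.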
Results
Theoretical results confirm that Transformers without positional encodings cannot solve order-sensitive tasks. The Positional Separation Theorem indicates that training leads to distinct embeddings for different positions. The best approximation of positional encoding is achieved through multidimensional scaling, with empirical results showing that ALiBi encoding significantly reduces stress compared to sinusoidal and RoPE encodings.
Implications
The findings provide a foundational understanding of positional encodings, guiding future research and development of more effective encoding schemes in Transformers, potentially improving performance in NLP tasks that require sensitivity to word order.
Automated Attention Pattern Discovery at Scale in Large Language Models
Large Language Models
Interpretability
- Introduces AP-MAE as a scalable method for analyzing attention patterns in LLMs.
- Demonstrates that attention patterns can predict the correctness of model outputs.
- Shows that AP-MAE generalizes across different models with minimal loss.
- Establishes attention patterns as a tractable and informative object of study.
Summary
This paper addresses the challenge of interpretability in large language models (LLMs) by proposing a novel approach to analyze attention patterns at scale. The authors highlight the limitations of existing mechanistic interpretability methods, which often fail to generalize across tasks and are computationally expensive. To overcome these challenges, they introduce the Attention Pattern – Masked Autoencoder (AP-MAE), a vision transformer-based model designed to efficiently reconstruct masked attention patterns from Java code datasets. The study demonstrates that attention patterns can serve as scalable signals for global interpretability, revealing recurring behaviors across multiple inferences. The AP-MAE model shows high accuracy in reconstructing masked patterns and generalizes well across unseen models. Additionally, it can predict the correctness of model outputs without ground truth, achieving accuracies between 55% and 70%. The authors also explore targeted interventions that can enhance prediction accuracy by up to 13.6%. Overall, the findings establish attention patterns as a valuable object of study for interpretability and suggest that AP-MAE can guide more detailed mechanistic analyses in LLMs.
Methodology
The authors employ a vision transformer-based model, AP-MAE, to analyze attention patterns in large language models. They mine completion scenarios from Java code datasets, leveraging the structured nature of source code. The methodology includes three stages: analyzing learned attention patterns, predicting output correctness based on these patterns, and implementing dynamic interventions to improve model accuracy.
Results
The AP-MAE model successfully reconstructs masked attention patterns with high accuracy and generalizes well across unseen models. It can predict the correctness of model generations with accuracies ranging from 55% to 70%. Additionally, targeted interventions based on attention patterns lead to a performance improvement of 13.6% in next-token accuracy.
Implications
The findings suggest that attention patterns can significantly enhance the interpretability of large language models, providing insights into model behavior and guiding future mechanistic analyses. The AP-MAE framework can serve as a foundation for further research in large-scale interpretability and targeted model interventions.
Generative models for decision-making under distributional shift
Generative Models
Optimization
Theory
- Generative models can effectively represent and transform distributions under distributional shifts.
- The framework introduced leverages mathematical tools for constructing decision-relevant distributions.
- Generative models enhance the operations research toolkit by supporting representation, robustness, and inference.
- The paper provides theoretical guarantees for the use of generative models in decision-making contexts.
Summary
This paper presents a tutorial on the application of modern generative models, particularly flow- and score-based methods, for decision-making in scenarios where there is a distributional shift. The authors argue that traditional data-driven decision-making often relies on nominal distributions derived from historical data, which may not accurately reflect the deployment conditions that decision-makers face. The tutorial introduces a unified framework that utilizes mathematical concepts such as pushforward maps, Fokker–Planck equations, and Wasserstein geometry to construct decision-relevant distributions. The authors emphasize the importance of generative models in representing nominal uncertainty, generating stressed distributions for robustness, and producing conditional distributions under partial observations. The paper also discusses theoretical guarantees related to generative models, including convergence properties and error bounds for posterior sampling. Overall, the tutorial aims to provide a principled approach to using generative models for robust decision-making and uncertainty quantification in operations research.
Methodology
The authors develop a unified framework based on mathematical tools such as pushforward maps, continuity, Fokker–Planck equations, and optimization in probability space. They explore the use of flow- and score-based generative models to learn nominal uncertainty and construct distributions relevant for robust decision-making.
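The pushforward map is the most elementary of these tools: samples drawn from a base measure and passed through a map follow the transformed distribution. A minimal sketch (the affine map used below is an illustrative choice, not the paper's):

```python
import numpy as np

def pushforward_sample(base_sampler, transport_map, n):
    """Sampling from a pushforward measure T#mu: draw n samples from
    the base measure and apply the map. This is the basic primitive
    behind the flow-based generative models the tutorial builds on."""
    return transport_map(base_sampler(n))

# Illustrative usage: T(z) = 2z + 3 pushes a standard normal
# forward to N(3, 4).
rng = np.random.default_rng(0)
x = pushforward_sample(lambda n: rng.normal(size=n),
                       lambda z: 2.0 * z + 3.0, 100_000)
```

A learned flow model replaces the hand-written affine map with a neural transport map, but the sampling recipe is unchanged.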
Results
The tutorial outlines how generative models can be applied to learn nominal uncertainty, generate adverse distributions for robustness, and produce conditional distributions under partial observations. Theoretical guarantees are provided, including convergence for iterative flow models and error bounds for posterior sampling.
Implications
The findings suggest that generative models can significantly improve decision-making processes in operations research by providing more accurate representations of uncertainty. This can lead to better planning, robustness in decision-making, and enhanced inference capabilities in various applications.
One Model for All: Multi-Objective Controllable Language Models
NLP
Large Language Models
Optimization
- Introduction of Multi-Objective Control (MOC) for training LLMs to accommodate diverse user preferences.
- MOC integrates multi-objective optimization principles into RLHF, allowing for a single model to handle multiple objectives.
- Demonstrated superior performance in controllability, output quality, and generalization compared to baseline methods.
- MOC's training cost is comparable to single-objective RLHF, making it efficient for practical applications.
Summary
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences, which is crucial for enhancing their safety and helpfulness. Traditional reinforcement learning from human feedback (RLHF) methods typically focus on a fixed reward based on average human ratings, limiting adaptability to individual user preferences. The authors propose a novel approach called Multi-Objective Control (MOC), which allows a single LLM to generate personalized outputs across varying user preferences by employing multi-objective optimization (MOO) principles. MOC trains the model as a preference-conditioned policy network, improving computational efficiency and enabling fine-tuning of a 7B-parameter model on a single GPU. The paper demonstrates that MOC outperforms existing methods in three key areas: controllability of outputs concerning user preferences, quality and diversity of generated outputs, and generalization to unseen preferences. The findings suggest that MOC can effectively create scalable and customizable LLMs suitable for real-world applications requiring personalized interactions.
Methodology
The authors formulated multi-objective controllability as an MOO problem with preference vector constraints. MOC employs a novel optimization algorithm that allows for explicit policy improvement without relying on human preference data. The method uses dynamic weighting to manage trade-offs among multiple objectives while maintaining computational efficiency.
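The paper's dynamic-weighting algorithm is not detailed in this summary; the sketch below shows two generic preference-vector scalarizations that multi-objective methods of this kind build on (function names and the ideal-point convention are illustrative):

```python
def weighted_sum(rewards, preference):
    """Linear scalarization: collapse several objective rewards into
    one scalar using a preference vector (non-negative, summing to 1)."""
    return sum(r * w for r, w in zip(rewards, preference))

def chebyshev(rewards, preference, ideal):
    """Chebyshev scalarization: penalize the single worst weighted gap
    to an ideal point. Unlike the weighted sum, it can recover
    solutions on non-convex parts of the Pareto front."""
    return -max(w * (z - r) for r, w, z in zip(rewards, preference, ideal))
```

Conditioning the policy on the preference vector, rather than training one model per fixed weighting, is what lets a single network serve the whole preference space.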
Results
Extensive experiments showed that MOC consistently outperformed baseline methods in controllability, achieving a clear relationship between user preferences and model outputs. MOC also excelled in the quality of the solution set, as measured by the hyper-volume metric, and demonstrated robust generalization to unseen preferences.
Implications
The MOC framework has significant implications for developing LLMs that can adapt to individual user preferences in real-time, enhancing user satisfaction and engagement. This approach could be applied in various domains, including personalized education, customer service, and interactive entertainment.
On Dominant Manifolds in Reservoir Computing Networks
Time Series
Theory
- Establishes a link between training data and the geometry of reservoir computing dynamics.
- Demonstrates that dominant modes correspond to approximations of Koopman eigenfunctions.
- Introduces a spectral analysis framework for characterizing trained linear RC systems.
- Discusses generalization to nonlinear reservoirs through dominance theory.
Summary
This paper investigates the emergence of low-dimensional dominant manifolds in Reservoir Computing (RC) networks, particularly in the context of temporal forecasting tasks. The authors establish a connection between the geometry of recurrent network dynamics and the training data's intrinsic dimensionality and information content. They demonstrate that for training data derived from autonomous dynamical systems, the dominant modes of the trained reservoir can be related to approximations of the Koopman eigenfunctions of the original system. This relationship highlights a significant link between reservoir computing and the Dynamic Mode Decomposition algorithm. The authors perform a spectral analysis of the trained linear RC system, characterizing its dominant eigenvalues and eigenvectors in relation to the training data. They also discuss the potential for extending these insights to nonlinear reservoirs through tangent dynamics and differential p-dominance, providing a framework for understanding the emergence of low-dimensional dominant invariant manifolds in more complex systems.
Methodology
The authors utilize a simplified linear and continuous-time reservoir model to analyze the emergence of dominant manifolds during training. They perform spectral analysis to relate the dominant eigenvalues and eigenvectors of the trained reservoir to the training data, and explore the implications of dominance theory for both linear and nonlinear reservoirs.
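In the linear case, the spectral analysis amounts to ordering the reservoir matrix's eigenpairs by magnitude. A sketch, using a discrete-time update as a stand-in for the paper's continuous-time model:

```python
import numpy as np

def dominant_modes(A, k=2):
    """Leading k eigenpairs of a linear reservoir's update matrix,
    ordered by eigenvalue magnitude: the modes that dominate the
    long-run dynamics (a discrete-time stand-in for the paper's
    continuous-time analysis)."""
    w, V = np.linalg.eig(A)
    order = np.argsort(-np.abs(w))
    return w[order][:k], V[:, order[:k]]
```

The span of the leading eigenvectors is the dominant subspace; the paper's contribution is relating that subspace, after training, to Koopman eigenfunctions of the system that generated the data.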
Results
The study finds that training modifies the reservoir in a structured manner, resulting in low-dimensional dominant subspaces that are closely tied to the intrinsic properties of the training data. The connection to the Koopman operator is established, providing a theoretical foundation for understanding the dynamics of trained RC networks.
Implications
The findings could enhance the understanding of how reservoir computing can be applied to time-series prediction and system identification, and may inform future research on the dynamics of neural networks and their relation to brain computation.
Learning-Based Multi-Criteria Decision Making Model for Sawmill Location Problems
Optimization
- Introduction of a novel LB-MCDM framework that minimizes reliance on subjective expert judgment.
- Integration of machine learning with GIS-based spatial analysis for dynamic site suitability assessment.
- Identification of the Supply-Demand Ratio as the most influential factor in sawmill location selection.
- Demonstration of the framework's effectiveness through a case study in Mississippi.
Summary
This paper presents a Learning-Based Multi-Criteria Decision-Making (LB-MCDM) framework that integrates machine learning with GIS-based spatial location analysis to address the complexities of sawmill location selection. The proposed model aims to enhance the efficiency, profitability, and sustainability of timber supply chains by providing a data-driven and unbiased approach to site suitability assessment. The framework employs five machine learning algorithms—Random Forest, Support Vector Classifier, XGBoost, Logistic Regression, and K-Nearest Neighbors—to identify optimal sawmill locations in Mississippi. The Random Forest Classifier demonstrated the highest performance among the models tested. The study also utilizes SHAP (SHapley Additive exPlanations) to analyze the importance of various criteria, revealing that the Supply-Demand Ratio is the most significant factor influencing site suitability, followed by distances to roads, rail lines, and urban areas. The validation of the suitability maps indicates that approximately 10-11% of the Mississippi landscape is highly suitable for sawmill establishment, showcasing the practical utility of the LB-MCDM framework in real-world applications.
Methodology
The LB-MCDM framework combines machine learning algorithms with GIS-based spatial analysis to evaluate sawmill site suitability. Five classification models were trained on a dataset of over 11,000 candidate locations, incorporating various features such as distance to roads and rail lines, unemployment rates, and market dynamics. The SHAP technique was employed to assess the importance of each criterion in the decision-making process.
Results
The Random Forest Classifier achieved the highest accuracy in predicting suitable sawmill locations. The analysis revealed that 10-11% of the Mississippi landscape is highly suitable for sawmill establishment, with the Supply-Demand Ratio emerging as the most critical factor influencing site selection.
Implications
The LB-MCDM framework offers a robust tool for strategic planning in the timber industry, enabling stakeholders to make informed decisions based on comprehensive data analysis. The dynamic nature of the suitability maps allows for continuous updates, facilitating timely responses to changes in market conditions and resource availability.
General Multimodal Protein Design Enables DNA-Encoding of Chemistry
Generative Models
Multimodal
- DISCO enables simultaneous design of protein sequences and 3D structures without pre-defined motifs.
- The model generates enzymes that catalyze new-to-nature reactions with high activity.
- Random mutagenesis confirms that enzyme activity can be enhanced through directed evolution.
- DISCO expands the searchable space for DNA-encoded chemical reactivity.
Summary
The paper introduces DISCO, a novel multimodal model designed for the co-design of protein sequences and 3D structures without the need for pre-specified catalytic residues. This model addresses limitations in current enzyme design methodologies by allowing for the simultaneous optimization of sequence and structure, thus enabling the creation of enzymes capable of catalyzing new-to-nature reactions. DISCO is conditioned on reactive intermediates and generates diverse heme enzymes with unique active-site geometries. These enzymes demonstrate high catalytic activities, surpassing those of existing engineered enzymes. The authors also highlight the potential for improving enzyme activity through directed evolution, showcasing DISCO's ability to broaden the scope of genetically encodable transformations and enhance the discovery of biocatalysts previously unknown to biology.
Methodology
DISCO employs a joint distribution model for protein sequences and 3D structures, utilizing a masked discrete diffusion process for sequences and continuous diffusion for atomic coordinates. It trains a single deep neural network that integrates these modalities, allowing for the co-generation of sequences and structures based on arbitrary biomolecules. The architecture includes a protein language model and a Pairformer stack for contextual representation, along with a cross-modal recycling mechanism for enhanced generation.
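The forward process of masked discrete diffusion on the sequence track can be sketched in a few lines; the MASK token id and the linear masking schedule are illustrative assumptions, and the continuous diffusion over atomic coordinates is omitted:

```python
import numpy as np

MASK = -1  # hypothetical mask token id

def mask_sequence(tokens, t, rng):
    """Forward corruption of masked discrete diffusion: at noise level
    t in [0, 1], each token is independently replaced by MASK with
    probability t. The linear schedule is an illustrative choice; the
    reverse model learns to fill the masks back in."""
    tokens = np.asarray(tokens)
    corrupt = rng.random(tokens.shape) < t
    return np.where(corrupt, MASK, tokens)
```

Generation runs this process in reverse, starting from a fully masked sequence and iteratively unmasking tokens, while the structure track denoises coordinates in parallel.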
Results
DISCO successfully generates diverse heme enzymes with novel active-site geometries that catalyze various carbene-transfer reactions, achieving activities that exceed those of engineered enzymes. The model's design-to-test workflow facilitates the discovery of biocatalysts and protein motifs that were previously unknown, significantly broadening the potential for new chemical transformations.
Implications
The findings suggest that DISCO could revolutionize enzyme engineering by providing a scalable method for designing enzymes capable of catalyzing unprecedented reactions. This has significant implications for chemical manufacturing, drug synthesis, and therapeutic applications, potentially leading to more efficient and innovative biocatalytic processes.
Topological Characterization of Churn Flow and Unsupervised Correction to the Wu Flow-Regime Map in Small-Diameter Vertical Pipes
Theory
- Introduces the first topology-based characterization of churn flow using Euler Characteristic Surfaces.
- Develops an unsupervised learning framework that blends ECS-derived kernels with gas velocity.
- Demonstrates significant discrepancies between ECS-inferred transitions and existing models, indicating under-prediction of slug flow persistence.
- Achieves high classification accuracy and recall rates without labeled training data, surpassing traditional supervised methods.
Topological Characterization of Churn Flow and Unsupervised Correction to the Wu Flow-Regime Map in Small-Diameter Vertical Pipes
Summary
This paper addresses the long-standing lack of a quantitative mathematical definition for churn flow, a chaotic regime in vertical two-phase flow. The authors introduce a topology-based characterization using Euler Characteristic Surfaces (ECS) and propose an unsupervised regime discovery method through Multiple Kernel Learning (MKL). By applying this framework to 37 unlabeled air-water trials, the study reveals that 64% of the total weight in the model is derived from topology-based features. The ECS-inferred transition point for slug/churn flow is found to be significantly higher than predictions from existing models, highlighting the under-prediction of slug persistence in small-diameter pipes. Validation with additional data from Texas A&M University demonstrates that the unsupervised framework achieves high accuracy and recall rates, outperforming traditional supervised methods that require extensive labeled datasets. This work not only provides a novel mathematical definition of churn flow but also shows that unsupervised topological descriptors can effectively challenge and refine established mechanistic models.
Methodology
The authors utilize Euler Characteristic Surfaces (ECS) to characterize churn flow topologically and employ Multiple Kernel Learning (MKL) to blend ECS-derived kernels with gas velocity data. The framework is applied to unlabeled experimental trials to discover flow regimes without prior labeling.
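The kernel-blending step can be sketched as a convex combination of Gram matrices; the scalar features, RBF form, and fixed 0.64/0.36 weights below are illustrative stand-ins (in the paper the weights are learned by MKL, and 64% is the weight found for topology-based features):

```python
import math

def rbf_kernel(xs, gamma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel on scalar features."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in xs] for a in xs]

def blend_kernels(kernels, weights):
    """MKL combination K = sum_m w_m * K_m with w_m >= 0 summing to 1,
    so the blend remains a valid (positive semidefinite) kernel."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Hypothetical scalar stand-ins: one ECS-derived topological feature and
# the superficial gas velocity, one value per experimental trial.
ecs_feature = [0.2, 0.3, 1.1, 1.2]
gas_velocity = [1.0, 1.1, 3.0, 3.2]

K = blend_kernels([rbf_kernel(ecs_feature), rbf_kernel(gas_velocity)],
                  [0.64, 0.36])
```

A clustering method run on the blended kernel then discovers regimes without any labels.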
Results
The unsupervised framework achieves 95.6% accuracy in classifying four flow regimes and 100% recall for churn flow, while the ECS-derived transition point for slug/churn flow is found to be +3.81 m/s above predictions from existing models, indicating a significant improvement in understanding flow dynamics in small-diameter pipes.
Implications
This research has practical implications for industries relying on accurate flow regime identification in gas-lift operations and liquid loading in gas wells. The findings could lead to improved designs for siphon strings and better management of multiphase flow systems.
Autoencoder-Based Parameter Estimation for Superposed Multi-Component Damped Sinusoidal Signals
Time Series
- Development of an autoencoder-based method for parameter estimation of damped sinusoidal signals.
- Evaluation of the method under different training data distributions (Gaussian vs. uniform).
- High accuracy in estimating parameters even in challenging scenarios with noise and overlapping components.
- Demonstration of the method's robustness in practical applications.
Autoencoder-Based Parameter Estimation for Superposed Multi-Component Damped Sinusoidal Signals
Summary
This paper presents an innovative approach for estimating parameters of superposed multi-component damped sinusoidal signals using an autoencoder-based method. Damped sinusoidal oscillations are prevalent in various physical systems, and accurately estimating their parameters—such as frequency, phase, decay time, and amplitude—can be challenging, especially in the presence of noise and rapid decay. The authors develop a method that leverages the latent space of autoencoders to extract these parameters from noisy signals. They conduct experiments using both Gaussian and uniform training distributions to evaluate the impact of training data on estimation accuracy. The results demonstrate that the proposed method achieves high parameter estimation accuracy, even in complex scenarios involving subdominant components or nearly opposite-phase signals. The study highlights the robustness of the autoencoder approach, suggesting its potential as a valuable tool for analyzing short-duration, noisy signals across various applications.
Methodology
The authors utilize an autoencoder architecture to compress high-dimensional input data into a low-dimensional latent space, which captures essential information about the damped sinusoidal signals. They generate noisy waveform data composed of superposed components and train the autoencoder to estimate the physical parameters of interest. The performance is evaluated through waveform reconstruction and parameter estimation accuracy metrics.
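The waveform generation described above can be sketched directly; the component parameters, sampling rate, and noise level below are illustrative, not the paper's experimental settings:

```python
import math
import random

random.seed(1)

def damped_sinusoid(ts, amp, freq, phase, tau):
    """One component: A * exp(-t / tau) * sin(2*pi*f*t + phi)."""
    return [amp * math.exp(-t / tau) * math.sin(2 * math.pi * freq * t + phase)
            for t in ts]

def superpose(ts, components, noise_std=0.0):
    """Noisy superposition of several damped components -- the kind of
    waveform the autoencoder is trained on."""
    parts = [damped_sinusoid(ts, *c) for c in components]
    clean = [sum(vals) for vals in zip(*parts)]
    return [x + random.gauss(0.0, noise_std) for x in clean]

ts = [i / 100.0 for i in range(100)]  # 1 s at a 100 Hz sampling rate
# Two components: (amplitude, frequency, phase, decay time); the second is
# subdominant, one of the challenging cases discussed above.
components = [(1.0, 5.0, 0.0, 0.2), (0.3, 8.0, math.pi / 2, 0.1)]
clean = superpose(ts, components)                  # noiseless target
noisy = superpose(ts, components, noise_std=0.1)   # training input
```

The autoencoder compresses `noisy` into a latent vector from which the per-component parameters are read out.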
Results
The proposed autoencoder method successfully estimates the parameters of both two-component and five-component damped sinusoidal signals with high accuracy. The analysis shows that the method remains effective even when components are subdominant or nearly cancel each other out. Comparisons between Gaussian and uniform training distributions reveal that the method is robust to variations in training data.
Implications
This research has significant implications for fields that rely on the analysis of damped sinusoidal signals, such as structural health monitoring, vibration analysis, and various physical sciences. The autoencoder-based approach could enhance the accuracy and efficiency of parameter estimation in noisy environments, facilitating better understanding and monitoring of dynamic systems.
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
Large Language Models
NLP
Efficient ML
- Representational collapse occurs when agents produce similar outputs, reducing the effectiveness of majority voting.
- The Diversity-Aware Latent Consensus (DALC) protocol improves accuracy by incorporating diversity weights based on embedding geometry.
- Ablation studies show that hint sharing enhances performance more than diversity weighting alone.
- The choice of embedding proxy significantly affects both the severity of representational collapse and the accuracy of downstream tasks.
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
Summary
This paper investigates the phenomenon of representational collapse in multi-agent large language model (LLM) committees, where agents conditioned on different roles produce highly similar outputs, undermining the assumption of complementary evidence. The study measures pairwise similarity among agents' outputs across 100 GSM8K questions, revealing a mean cosine similarity of 0.888 and an effective rank of 2.17, indicating a lack of diversity in the agents' responses. To address this issue, the paper introduces a training-free consensus protocol called Diversity-Aware Latent Consensus (DALC), which computes diversity weights based on embedding geometry. DALC achieves an accuracy of 87% on GSM8K, outperforming the self-consistency method at a lower token cost. The research also highlights the importance of encoder choice in modulating collapse severity and downstream accuracy. Through ablation experiments, the study confirms that hint sharing significantly contributes to performance improvements, and that representational collapse is measurable and exacerbated in more challenging tasks. The findings underscore the necessity of considering embedding proxies as critical design decisions in latent communication protocols for multi-agent systems.
Methodology
The authors measure pairwise similarity of outputs from three Qwen2.5-14B agents conditioned on different role prompts using cosine similarity and effective rank metrics. They introduce the DALC protocol, which involves generating chains of thought, embedding them, optionally projecting to maximize orthogonality, and aggregating answers through diversity-weighted voting.
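A minimal sketch of diversity-weighted voting under assumed toy embeddings; the squared down-weighting rule here is an illustrative choice, not the paper's exact embedding-geometry weighting:

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_weights(embeddings):
    """Down-weight redundant agents: w_i = (1 - mean cosine similarity of
    agent i to the others)^2, normalized. The squared penalty is an
    illustrative choice that lets redundancy actually flip a vote."""
    n = len(embeddings)
    raw = []
    for i, e in enumerate(embeddings):
        mean_sim = sum(cosine(e, f) for j, f in enumerate(embeddings)
                       if j != i) / (n - 1)
        raw.append(max(1.0 - mean_sim, 0.0) ** 2)
    total = sum(raw) or 1.0
    return [w / total for w in raw]

def diversity_weighted_vote(answers, embeddings):
    score = defaultdict(float)
    for a, w in zip(answers, diversity_weights(embeddings)):
        score[a] += w
    return max(score, key=score.get)

# Two near-collapsed agents answer "12"; one genuinely different chain of
# thought answers "42". Plain majority voting would return "12".
embeddings = [(1.0, 0.0), (0.98, 0.199), (0.0, 1.0)]
answers = ["12", "12", "42"]
winner = diversity_weighted_vote(answers, embeddings)
```

With these embeddings the two redundant agents share their weight, and the diverse agent's answer wins the vote.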
Results
The DALC protocol achieved 87% accuracy on the GSM8K benchmark, surpassing the self-consistency method's 84% accuracy while using 26% fewer tokens. The study found a mean cosine similarity of 0.888 among agent outputs, indicating significant representational collapse, which worsened on more difficult tasks.
Implications
The findings suggest that multi-agent LLM systems may not be leveraging the full potential of diverse outputs, leading to inefficiencies. The DALC protocol offers a practical solution to enhance the performance of such systems, making it relevant for applications in collaborative AI and decision-making tasks.
Scaling DPPs for RAG: Density Meets Diversity
NLP
Large Language Models
Optimization
- Introduction of ScalDPP, a diversity-aware retrieval mechanism for RAG using DPPs.
- Development of a scalable dynamic kernel construction method to enhance complementarity in context selection.
- Introduction of Diverse Margin Loss (DML) to optimize the embedding space for better retrieval outcomes.
- Demonstration of ScalDPP's superiority over standard RAG models in experimental evaluations.
Scaling DPPs for RAG: Density Meets Diversity
Summary
This paper addresses the limitations of standard Retrieval-Augmented Generation (RAG) systems, which often prioritize relevance over diversity, leading to redundant and ineffective context during the generation process. The authors propose ScalDPP, a novel retrieval mechanism that incorporates Determinantal Point Processes (DPPs) to optimize for both density and diversity in the retrieved evidence. ScalDPP employs a lightweight P-Adapter to model inter-chunk dependencies and enhance context selection, while a new objective called Diverse Margin Loss (DML) is introduced to ensure that complementary evidence is prioritized over redundant alternatives. The methodology allows for scalable and efficient retrieval, overcoming the computational challenges typically associated with DPPs. Experimental results demonstrate that ScalDPP outperforms traditional RAG approaches, validating the importance of considering both density and diversity in retrieval tasks.
Methodology
The authors propose ScalDPP, which integrates DPPs into RAG systems through a P-Adapter that allows for efficient modeling of inter-chunk dependencies. The retrieval process is enhanced by dynamically constructing a kernel matrix during inference and employing MAP inference for subset selection. The Diverse Margin Loss (DML) is used to optimize the P-Adapter, ensuring that the selected contexts are both dense and complementary.
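Greedy MAP inference, the standard approximation for DPP subset selection, can be sketched as follows; treating it as ScalDPP's exact inference routine is an assumption:

```python
def det(M):
    """Determinant by Gaussian elimination with partial pivoting
    (small matrices only)."""
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        if abs(M[p][i]) < 1e-12:
            return 0.0
        if p != i:
            M[i], M[p] = M[p], M[i]
            d = -d
        d *= M[i][i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n):
                M[r][c] -= f * M[i][c]
    return d

def greedy_dpp_map(L, k):
    """Greedy MAP for a DPP: repeatedly add the chunk that most increases
    det(L_S). The diagonal of L encodes relevance (density); off-diagonal
    entries encode similarity, so redundant pairs shrink the determinant."""
    selected = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for j in range(len(L)):
            if j in selected:
                continue
            S = selected + [j]
            g = det([[L[a][b] for b in S] for a in S])
            if g > best_gain:
                best, best_gain = j, g
        if best is None:
            break
        selected.append(best)
    return selected

# Chunks 0 and 1 are highly relevant but near-duplicates; chunk 2 is less
# relevant but complementary. The DPP picks the complementary pair.
L = [[1.0, 0.95, 0.1],
     [0.95, 1.0, 0.1],
     [0.1, 0.1, 0.8]]
selected = greedy_dpp_map(L, 2)  # [0, 2]
```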
Results
Experimental results indicate that ScalDPP significantly improves the quality of retrieved contexts compared to traditional RAG models, effectively balancing the need for relevant and diverse evidence. The proposed methods demonstrate enhanced performance in generating factually grounded responses.
Implications
The findings suggest that incorporating diversity into retrieval mechanisms can lead to more effective use of external knowledge in language models, potentially improving the accuracy and reliability of generated content in various applications, such as question answering and conversational agents.
Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations
Theory
- Introduces a framework for group-theoretic spectral estimation from single observations.
- Proves that the symmetric group is optimal for algebraic diversity in signal processing.
- Demonstrates significant performance improvements in various applications, including MUSIC and massive MIMO.
- Reveals new algebraic structures in transformer models, suggesting enhancements for large language models.
Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations
Summary
This paper introduces a theoretical framework that allows for the replacement of temporal averaging over multiple independent observations of a noisy signal with an algebraic group action on a single observation. The author defines a group-averaged estimator, FG, which is constructed by applying the action of a finite group G to a single observation vector. The paper proves a General Replacement Theorem that establishes FG as a consistent estimator of population-level subspace decomposition under specific conditions regarding signal transformation and noise distribution. An Optimality Theorem demonstrates that the symmetric group SM is universally optimal for algebraic diversity, as its Cayley graph spectral decomposition yields the Karhunen–Loève (KL) transform, which is optimal for variance concentration and reconstruction error. The framework is applied to various scenarios, including the MUSIC algorithm for direction-of-arrival estimation, massive MIMO channel estimation, and single-pulse waveform characterization, achieving significant performance improvements over traditional methods. The paper also explores the implications of group actions in transformer neural networks, revealing new algebraic structures in internal representations and suggesting potential enhancements in model efficiency and accuracy. Additionally, the framework is extended to colored noise environments, providing a group-theoretic characterization of noise covariance. The central insight is that both temporal averaging and symmetric group action serve as mechanisms to extract invariant structures from noisy observations, with algebraic diversity enabling effective signal extraction from a single measurement.
Methodology
The methodology involves defining a group-averaged estimator FG based on group actions applied to single observation vectors. Theoretical proofs establish the consistency and optimality of this estimator, followed by empirical applications across various signal processing tasks. The framework is also extended to analyze the internal structures of transformer neural networks and to characterize noise covariance in colored noise environments.
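The group-averaged estimator can be illustrated with the cyclic group acting by circular shift on a single noisy observation of a shift-invariant signal (a toy case chosen for brevity; the paper's optimality result concerns the symmetric group):

```python
import random

random.seed(2)

def cyclic_shifts(x):
    """Orbit of x under the cyclic group C_n acting by circular shift."""
    n = len(x)
    return [x[k:] + x[:k] for k in range(n)]

def group_average(x):
    """Group-averaged estimator F_G = (1/|G|) sum_g g.x: averaging over
    the orbit of a single observation plays the role that temporal
    averaging over repeated observations usually plays, provided the
    signal is invariant under the group action."""
    orbit = cyclic_shifts(x)
    return [sum(v[i] for v in orbit) / len(orbit) for i in range(len(x))]

# A shift-invariant (constant) signal observed once under additive noise.
signal = [3.0] * 8
obs = [s + random.gauss(0.0, 0.5) for s in signal]
est = group_average(obs)  # constant vector: the empirical mean of obs
```

One measurement plus algebraic structure substitutes for many measurements: the noise is averaged out along the orbit rather than across time.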
Results
The results demonstrate that the group-averaged estimator achieves equivalent performance to multi-snapshot methods in direction-of-arrival estimation and massive MIMO channel estimation, with up to 64% higher effective throughput. In waveform classification, the proposed method outperforms FFT-based approaches, maintaining high accuracy even in challenging conditions. The exploration of transformer models reveals that the cyclic group is not the best fit for most attention heads, suggesting potential for model optimization.
Implications
The findings suggest that algebraic diversity can significantly enhance signal processing techniques, particularly in scenarios with limited observations. The insights into transformer models may lead to improved architectures and training strategies, while the group-theoretic approach to noise characterization could inform better noise mitigation strategies in various applications.
Controllable Image Generation with Composed Parallel Token Prediction
Generative Models
Computer Vision
Multimodal
- Introduces a new framework for composing conditional distributions in discrete generative models.
- Achieves a 63.4% relative reduction in error rates compared to previous methods across three datasets.
- Offers a 2.3× to 12× speed-up in real-time generation over comparable continuous methods.
- Demonstrates effective control over image generation through concept weighting and negation.
Controllable Image Generation with Composed Parallel Token Prediction
Summary
This paper addresses the limitations of conditional discrete generative models in composing multiple input conditions, particularly for novel combinations outside the training data. The authors propose a theoretically-grounded framework for composing discrete probabilistic generative processes, specifically focusing on masked generation as a special case. Their approach allows for precise specification of various input conditions, including the ability to emphasize or negate individual conditions through concept weighting. By leveraging the compositional vocabulary of VQ-VAE and VQ-GAN, the proposed method achieves a significant reduction in error rates and improvements in image quality across multiple datasets. The authors demonstrate that their method not only outperforms existing state-of-the-art techniques but also offers substantial speed advantages, making it applicable to pre-trained text-to-image models for enhanced control over image generation.
Methodology
The authors derive a theoretically-grounded formulation for composing discrete generative model outputs, particularly focusing on conditional parallel token prediction (absorbing diffusion). This framework allows for the integration of multiple input conditions and the application of concept weighting to control the emphasis or negation of specific attributes. The method is evaluated on three datasets: positional CLEVR, relational CLEVR, and FFHQ.
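A guidance-style sketch of composing conditional token distributions with weighting and negation; this is a generic log-space combination, not the paper's exact derivation for parallel token prediction:

```python
import math

def compose(logp_uncond, logps_cond, weights):
    """Compose per-condition token distributions in log space:
    score(x) = log p_0(x) + sum_i w_i * (log p_i(x) - log p_0(x)).
    w_i > 0 emphasizes condition i; w_i < 0 negates it."""
    scores = [lp0 + sum(w * (lp[t] - lp0)
                        for w, lp in zip(weights, logps_cond))
              for t, lp0 in enumerate(logp_uncond)]
    # Renormalize with a numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Vocabulary of three tokens; condition A prefers token 0, condition B
# prefers token 2. Weighting A at +1 and negating B at -1 concentrates
# mass on token 0 and suppresses token 2.
uncond = [math.log(1 / 3)] * 3
cond_a = [math.log(p) for p in (0.7, 0.2, 0.1)]
cond_b = [math.log(p) for p in (0.1, 0.2, 0.7)]
probs = compose(uncond, [cond_a, cond_b], [1.0, -1.0])
```

In masked generation this combination would be applied per masked position at each parallel prediction step.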
Results
The proposed method achieves state-of-the-art performance with a 63.4% relative reduction in error rates and an average absolute FID improvement of 9.58 points (lower FID is better) across the evaluated datasets. It outperforms existing methods in accuracy while generating images 2.3 to 12 times faster than comparable continuous approaches.
Implications
This research has significant implications for the field of image generation, particularly in enhancing the controllability and flexibility of generative models. The ability to compose multiple conditions and control their influence opens new avenues for applications in creative industries, automated design, and interactive media. The release of the source code also encourages further exploration and development in this area.
Improving Feasibility via Fast Autoencoder-Based Projections
Reinforcement Learning
Optimization
Efficient ML
- Proposes a data-driven approach for enforcing complex operational constraints using an autoencoder.
- Introduces a structured, convex latent representation for efficient feasibility corrections.
- Demonstrates significant speed improvements in feasibility enforcement compared to traditional methods.
- Empirical results show near-perfect feasibility in constrained optimization and safer actions in reinforcement learning.
Improving Feasibility via Fast Autoencoder-Based Projections
Summary
This paper addresses the challenge of enforcing complex operational constraints in learning and control systems, particularly those that are nonconvex. The authors propose a novel data-driven approach that utilizes a trained autoencoder as an approximate projector to rapidly correct infeasible predictions. By training the autoencoder with an adversarial objective, the authors create a structured, convex latent representation of the feasible set, allowing for quick projections of neural network outputs onto this representation before decoding back to the original feasible set. The method, referred to as FAB (Fast Autoencoder-Based projections), is designed to be a plug-and-play solution that enhances the feasibility of outputs from standard neural networks without the computational burden of traditional solvers. Empirical tests across various constrained optimization and reinforcement learning problems demonstrate that FAB consistently achieves near 100% feasibility in a fraction of a millisecond, outperforming existing methods in terms of speed and reliability.
Methodology
The authors train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set. This autoencoder serves as an approximate projector that allows for rapid corrections of neural network outputs by mapping them to feasible points in a low-latency manner.
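The encode-project-decode correction can be sketched with a hypothetical stand-in autoencoder (a fixed scaling rather than a trained network) and a latent ball as the convex region; all specifics below are illustrative:

```python
import math

def project_to_ball(z, radius=1.0):
    """Euclidean projection onto the convex set ||z|| <= radius."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= radius:
        return list(z)
    return [v * radius / norm for v in z]

def fab_correct(y, encode, decode, radius=1.0):
    """FAB-style correction: encode a possibly infeasible prediction,
    project inside the convex latent region, decode back toward the
    feasible set. Projection onto a ball is a cheap closed form, which is
    what makes the latent detour fast."""
    return decode(project_to_ball(encode(y), radius))

# Stand-in autoencoder: a pure scaling, under which the feasible set is
# the ball of radius 2 in output space. A real FAB encoder/decoder is a
# trained network whose latent image of the feasible set is made convex
# by the adversarial objective.
encode = lambda y: [v / 2.0 for v in y]
decode = lambda z: [v * 2.0 for v in z]

infeasible = [3.0, 4.0]  # norm 5, outside the feasible ball
corrected = fab_correct(infeasible, encode, decode)  # lands on the boundary
```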
Results
The empirical validation shows that the FAB method provides efficient approximate projections, achieving close to 100% feasibility in constrained optimization tasks within milliseconds. In reinforcement learning scenarios, FAB consistently yields safer actions compared to traditional methods like PPO and TRPO.
Implications
The proposed method has significant implications for real-world applications in robotics, energy systems, and industrial automation, where enforcing complex constraints is critical for safety and performance. The approach allows for faster and more reliable decision-making in systems that require adherence to operational constraints.
Modeling Patient Care Trajectories with Transformer Hawkes Processes
Time Series
- Introduces a Transformer Hawkes Process framework for modeling healthcare utilization trajectories.
- Imbalance-aware training objective improves sensitivity to rare but critical healthcare events.
- Demonstrates improved prediction of event types and timing on real-world healthcare data.
- Provides qualitative interpretability of model predictions related to patient risk and vulnerability.
Modeling Patient Care Trajectories with Transformer Hawkes Processes
Summary
This paper addresses the challenge of modeling patient healthcare utilization trajectories, which consist of irregularly time-stamped events such as outpatient visits, inpatient admissions, and emergency department encounters. The authors propose a novel framework that combines Transformer-based history encoding with a Hawkes point process to effectively model these trajectories in continuous time, particularly under conditions of severe class imbalance typical in healthcare data. The proposed method incorporates an imbalance-aware optimization strategy using weighted cross-entropy, which enhances sensitivity to rare but clinically significant events without distorting the underlying event distribution. The framework is evaluated on real-world patient utilization data, demonstrating improved predictive performance for both event types and timing. The findings suggest that this approach can provide valuable insights for healthcare systems in managing high-need, high-cost patient populations.
Methodology
The authors utilize a Transformer-based Hawkes Process framework to model multi-type healthcare events and their timing through a conditional intensity formulation. They introduce an imbalance-aware training objective based on weighted cross-entropy loss, with class weights derived from the inverse square root of event frequencies, to address the severe class imbalance in healthcare utilization data.
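The inverse-square-root weighting can be sketched directly; the event-type names and counts below are illustrative, not the paper's data:

```python
import math
from collections import Counter

def inverse_sqrt_weights(labels):
    """Class weight w_c proportional to 1/sqrt(count_c), normalized to
    mean 1, so rare event types get a larger loss weight without
    changing the overall loss scale."""
    counts = Counter(labels)
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

def weighted_ce(probs, label, weights):
    """Weighted cross-entropy for one observed event: -w_y * log p(y)."""
    return -weights[label] * math.log(probs[label])

# Illustrative event mix: outpatient visits dominate; emergency-department
# encounters are rare but clinically critical.
labels = ["outpatient"] * 90 + ["inpatient"] * 9 + ["ED"]
w = inverse_sqrt_weights(labels)

# Mispredicting the rare ED event now costs far more than mispredicting a
# routine outpatient visit at the same predicted probability.
loss_ed = weighted_ce({"ED": 0.1, "outpatient": 0.9}, "ED", w)
loss_out = weighted_ce({"ED": 0.9, "outpatient": 0.1}, "outpatient", w)
```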
Results
The proposed framework significantly enhances the modeling of individualized, imbalanced care trajectories, leading to improved predictions of future healthcare events and their timing. The evaluation on real-world data shows that the model effectively captures the temporal dynamics of patient interactions with the healthcare system.
Implications
The findings suggest that the proposed modeling approach can aid healthcare systems in identifying and managing high-need, high-cost patient populations by providing clinically meaningful insights into patient care trajectories.
Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction
NLP
Large Language Models
Efficient ML
- Introduces a hybrid attention module that combines exact anchors with Top-K retrieval and a fixed-size completion term.
- Maintains the original backbone language model and KV-cache format to ensure compatibility.
- Utilizes a subtractive completion cache computed at prefill time to estimate contributions from unretrieved tokens.
- Demonstrates improved performance in long-context benchmarks, especially in high-entropy attention scenarios.
Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction
Summary
The paper addresses the challenge of long-context generation in autoregressive decoder-only Transformers, which is increasingly limited by the traffic generated during decode-time key-value (KV) cache access. The authors propose a novel retrieval-completion attention module that maintains the backbone weights and KV-cache format while reducing KV-cache read traffic. Their method computes exact attention over a small set of anchors and a query-dependent Top-K selection of tokens, while estimating the contributions of unretrieved mid-region tokens using a fixed-size feature-map summary created during prefill. This approach allows for a single normalization step that recovers the softmax mass without incurring additional KV reads. The proposed method demonstrates significant improvements over traditional Top-K selection methods, particularly in scenarios with high-entropy attention heads, thereby enhancing the efficiency of long-context generation.
Methodology
The authors developed a retrieval-completion attention module that computes exact attention over a small anchor set and a query-dependent Top-K set. They created a fixed-size cache during prefill using positive feature maps and employed a one-way subtraction method to account for unretrieved tokens. The contributions from exact and estimated tokens are merged in the unnormalized domain, followed by a single normalization step.
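A toy rendering of the merge-then-normalize-once computation, with an assumed elementwise-exponential feature map standing in for the paper's positive feature maps (and the prefill-time subtractive cache replaced by a direct sum over unretrieved tokens):

```python
import math

def phi(x):
    """Positive elementwise feature map (an assumed stand-in): phi(q).phi(k)
    gives a crude positive surrogate for exp(q.k)."""
    return [math.exp(v) for v in x]

def hybrid_attention(q, keys, values, exact_idx):
    """Exact softmax terms for the anchor/Top-K tokens, a fixed-size
    linear-attention completion for the rest; numerators and denominators
    are merged unnormalized, then normalized once at the end."""
    dq, dv = len(q), len(values[0])
    num, den = [0.0] * dv, 0.0
    # Exact part over the retrieved set.
    for i in exact_idx:
        w = math.exp(sum(a * b for a, b in zip(q, keys[i])))
        den += w
        for d in range(dv):
            num[d] += w * values[i][d]
    # Completion: a fixed-size summary (S, z) of the unretrieved tokens.
    S = [[0.0] * dv for _ in range(dq)]  # sum of phi(k) v^T over the rest
    z = [0.0] * dq                       # sum of phi(k) over the rest
    for i in range(len(keys)):
        if i in exact_idx:
            continue
        fk = phi(keys[i])
        for a in range(dq):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * values[i][d]
    fq = phi(q)
    den += sum(fa * za for fa, za in zip(fq, z))
    for d in range(dv):
        num[d] += sum(fq[a] * S[a][d] for a in range(dq))
    return [x / den for x in num]  # single normalization step

q = [0.5, -0.2]
keys = [[0.4, 0.1], [-0.3, 0.8], [0.9, -0.5], [0.0, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
out = hybrid_attention(q, keys, values, {0, 2})  # exact set = anchors + Top-K
```

Because every weight is positive, the output is a convex combination of the value rows, exactly as in full softmax attention; only the weights on unretrieved tokens are approximated.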
Results
The proposed method outperformed selection-only Top-K approaches at matched token-equivalent read budgets, showing the most significant gains in high-entropy attention heads across various long-context benchmarks.
Implications
This work has potential applications in improving the efficiency of long-context generation in language models, particularly in scenarios where memory bandwidth is a limiting factor. The approach could be beneficial for applications requiring real-time processing of long sequences, such as conversational agents and document summarization.
FNO∠θ: Extended Fourier neural operator for learning state and optimal control of distributed parameter systems
Optimization
Theory
- Introduction of FNO∠θ, an extended architecture of the Fourier neural operator.
- Utilization of the Ehrenpreis-Palamodov principle to represent states and optimal controls of linear PDEs.
- Modification of the FNO layer to incorporate complex frequency variables for improved learning.
- Demonstrated significant performance enhancements in learning state and optimal control for the nonlinear Burgers’ equation.
FNO∠θ: Extended Fourier neural operator for learning state and optimal control of distributed parameter systems
Summary
This paper introduces an extended Fourier neural operator (FNO) architecture, termed FNO∠θ, designed to learn the state and linear quadratic optimal control of systems governed by partial differential equations (PDEs). The authors leverage the Ehrenpreis-Palamodov fundamental principle, demonstrating that both the state and optimal control of linear PDEs with constant coefficients can be expressed as integrals in the complex domain. This insight leads to a modification of the FNO layer, extending the frequency variable in the inverse Fourier transform to the complex domain, thereby enhancing its capability to capture the integral representation from the fundamental principle. The performance of the proposed FNO∠θ is evaluated through numerical experiments on the nonlinear Burgers’ equation, revealing significant improvements in training errors and prediction accuracy for non-periodic boundary values compared to the standard FNO architecture.
Methodology
The authors propose a new architecture, FNO∠θ, which modifies the traditional FNO by introducing a complex variable in the inverse Fourier transform. This allows the model to leverage the integral representation of solutions to linear PDEs as dictated by the Ehrenpreis-Palamodov principle. The methodology includes theoretical analysis for linear PDEs and numerical experiments to validate the performance of the proposed architecture.
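The effect of moving the inverse transform's frequency variable into the complex domain can be sketched with a plain DFT; this is a toy rendering of the idea, not the FNO∠θ layer itself:

```python
import cmath

def dft(x):
    """Plain forward DFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def idft_complex_freq(X, sigma):
    """Inverse transform evaluated at complex frequencies: mode k picks up
    a real exponential envelope exp(sigma[k] * j / n) on top of the usual
    oscillation, so the synthesis can represent growing or decaying,
    non-periodic behavior. sigma = 0 recovers the ordinary inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * j / n)
                * cmath.exp(sigma[k] * j / n)
                for k in range(n)) / n
            for j in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
roundtrip = idft_complex_freq(dft(x), [0.0] * 4)        # sigma = 0: recovers x
shaped = idft_complex_freq(dft(x), [0.5, 0.0, 0.0, 0.0])  # growing DC mode
```

In the actual architecture the imaginary shifts are learned, which is what lets the layer capture the Ehrenpreis-Palamodov integral representation rather than only periodic solutions.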
Results
The numerical experiments demonstrate that FNO∠θ achieves order-of-magnitude improvements in training error and yields more accurate predictions for non-periodic boundary values on the nonlinear Burgers' equation, outperforming the standard FNO.
Implications
The findings suggest that the extended FNO architecture can be effectively utilized in various applications involving distributed parameter systems, such as fluid dynamics and climate modeling, where accurate state prediction and optimal control are critical. The ability to learn from complex representations may also lead to advancements in computational efficiency for solving PDEs.
Towards Intelligent Energy Security: A Unified Spatio-Temporal and Graph Learning Framework for Scalable Electricity Theft Detection in Smart Grids
Graph Learning
Time Series
- Development of a comprehensive AI-driven framework for electricity theft detection.
- Integration of multi-source data including electrical, environmental, and renewable energy inputs.
- Use of hybrid anomaly detection models combining LSTM, TCN, and Autoencoders.
- Application of graph-based learning techniques to capture spatial dependencies.
Towards Intelligent Energy Security: A Unified Spatio-Temporal and Graph Learning Framework for Scalable Electricity Theft Detection in Smart Grids
Summary
This paper addresses the critical issue of electricity theft and non-technical losses in smart grids, which significantly impact economic sustainability and grid reliability. The authors propose the SmartGuard Energy Intelligence System (SGEIS), an integrated AI framework designed for real-time electricity theft detection and intelligent energy monitoring. The SGEIS combines various methodologies, including supervised machine learning, deep learning-based time-series modeling, Non-Intrusive Load Monitoring (NILM), and graph-based learning to analyze both temporal and spatial consumption patterns. A comprehensive data processing pipeline is established, featuring multi-scale temporal analysis and rule-based anomaly labeling. The framework employs deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Autoencoders for detecting abnormal usage patterns, alongside ensemble learning methods like Random Forest and XGBoost for classification. Graph Neural Networks (GNNs) are utilized to model grid topology and spatial dependencies, enhancing the detection of correlated anomalies across interconnected nodes. The experimental results indicate that the Gradient Boosting model achieves a ROC-AUC of 0.894, while graph-based models exceed 96% accuracy in identifying high-risk nodes. The proposed hybrid framework demonstrates improved detection robustness by integrating temporal, statistical, and spatial intelligence, making SGEIS a scalable and practical solution for real-world smart grid applications.
Methodology
The methodology involves a comprehensive data processing pipeline that integrates supervised machine learning, deep learning models for time-series analysis, NILM for appliance-level disaggregation, and graph-based learning to capture spatial dependencies. The framework employs various models including LSTM, TCN, Autoencoders, and ensemble classifiers like Random Forest and XGBoost, alongside GNNs for spatial anomaly detection.
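One piece of the pipeline, rule-based anomaly labeling, can be sketched on a synthetic meter trace; the window and threshold below are illustrative, not the paper's calibrated rule:

```python
def label_anomalies(daily_kwh, window=7, drop_ratio=0.4):
    """Rule-based anomaly labeling: flag any day whose consumption falls
    below drop_ratio of the trailing `window`-day mean -- the sudden,
    sustained drop pattern associated with meter tampering."""
    flags = [False] * len(daily_kwh)
    for i in range(window, len(daily_kwh)):
        baseline = sum(daily_kwh[i - window:i]) / window
        flags[i] = baseline > 0 and daily_kwh[i] < drop_ratio * baseline
    return flags

# Ten normal days around 10 kWh, then five days collapsing to 2 kWh.
usage = [10.0] * 10 + [2.0] * 5
flags = label_anomalies(usage)
```

Labels produced this way can bootstrap the supervised classifiers (Random Forest, XGBoost), while the deep models score anomalies directly from the raw series.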
Results
The experimental evaluation shows that the Gradient Boosting model achieves a ROC-AUC score of 0.894, while graph-based models successfully identify high-risk nodes with over 96% accuracy. The hybrid framework enhances detection robustness by effectively integrating temporal, statistical, and spatial intelligence.
Implications
The SGEIS framework offers a scalable solution for electricity theft detection in smart grids, with potential applications in enhancing grid reliability and reducing economic losses due to non-technical losses. Its integration of advanced machine learning techniques could lead to more efficient energy management and monitoring systems.
Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
Large Language Models
Efficient ML
- Introduction of Diagonal-Tiled Mixed-Precision Attention (DMA) for efficient LLM inference.
- Development of a fully fused GPU kernel that integrates quantization and attention computation.
- Empirical evaluations show DMA maintains generation quality with significant speedup.
- DMA addresses challenges of low-bit quantization and unfused operations in attention mechanisms.
Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
Summary
This paper addresses the high inference costs associated with transformer-based large language models (LLMs) due to the quadratic complexity of attention mechanisms and memory bandwidth limitations of high-precision operations. The authors introduce a novel low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA), which utilizes the microscaling floating-point (MXFP) data format. DMA employs a tiling-level mixed-precision design that partitions the attention matrix into low- and high-precision regions, optimizing for both speed and accuracy. The kernel is implemented using Triton, allowing for hardware-level parallelism and memory efficiency. Extensive evaluations on NVIDIA B200 GPUs demonstrate that DMA achieves lossless generation quality while significantly speeding up inference through kernel fusion. The authors also provide insights into the trade-offs between efficiency and accuracy, making their approach practical for real-world applications.
Methodology
The authors propose a mixed-precision design that partitions the attention matrix into regions of low and high precision, retaining critical information while leveraging low-precision formats for speed. They implement a fused memory-efficient kernel that integrates quantization, microscaling transformation, and attention computation into a single workflow, minimizing memory access and kernel launch overhead.
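The tiling-level partition can be sketched as a function that assigns each causal attention tile to a high- or low-precision path based on its distance from the diagonal. The band width and the "hi"/"lo" labels are illustrative assumptions; the paper's actual partition criterion and fused MXFP kernel are considerably more involved.

```python
def tile_precision_map(n_tiles, band=1):
    """Mark attention-matrix tiles within `band` of the diagonal as
    high precision; all other tiles take the low-bit path. Causal
    attention only materializes tiles with col <= row."""
    grid = []
    for r in range(n_tiles):
        row = []
        for c in range(r + 1):            # lower-triangular (causal) tiles
            row.append("hi" if r - c <= band else "lo")
        grid.append(row)
    return grid

for row in tile_precision_map(4, band=1):
    print(row)
```

Recent tokens (near-diagonal tiles) tend to carry the largest attention scores, which is the intuition for keeping them in high precision while distant context tolerates low-bit quantization.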
Results
Experimental results indicate that DMA achieves lossless generation quality compared to full-precision attention baselines. The method demonstrates significant speed improvements due to kernel fusion and efficient memory usage, with ablation studies confirming the effectiveness of mixed-precision tiling and quantization granularity.
Implications
The proposed DMA approach has the potential to enhance the efficiency of LLM inference in practical applications, making it feasible to deploy large language models in resource-constrained environments without sacrificing performance. This could lead to broader adoption of LLMs in various applications requiring real-time processing.
Learning Stable Predictors from Weak Supervision under Distribution Shift
Theory
- Formalization of supervision drift as a failure mode in weakly supervised learning.
- Development of a controlled evaluation protocol to isolate supervision drift effects.
- Demonstration of partial cross-domain robustness but severe temporal non-transferability.
- Introduction of feature stability analysis as a diagnostic for detecting non-transferability.
Learning Stable Predictors from Weak Supervision under Distribution Shift
Summary
This paper addresses the challenge of learning from weak supervision in the presence of distribution shifts, a phenomenon termed 'supervision drift'. The authors formalize supervision drift as changes in the conditional relationship between features and weak labels across different contexts. They conduct a case study using CRISPR-Cas13d transcriptomic perturbation experiments, where weak labels are derived from RNA-seq responses. The study establishes a controlled non-IID benchmark with explicit domain and temporal shifts, allowing for the evaluation of model performance under these conditions. Results indicate that while weak supervision can yield meaningful learning within a fixed context, it fails to generalize across timepoints, leading to significant performance degradation. The authors propose feature stability analysis as a diagnostic tool to detect non-transferability, emphasizing the importance of understanding supervision drift in weakly supervised learning scenarios. This work highlights a critical risk in relying solely on in-domain performance metrics, suggesting that robust evaluation must consider potential shifts in supervision mechanisms.
Methodology
The authors utilize a controlled experimental design to study supervision drift in transcriptomic perturbation experiments. They construct weak labels from RNA-seq responses and evaluate model performance across structured domain and temporal shifts. Various models, including linear and tree-based approaches, are assessed for their predictive accuracy and transferability across contexts.
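A toy version of the feature stability analysis might compare each feature's correlation with the weak label across two contexts and flag large shifts. The Pearson criterion and the threshold are illustrative assumptions, not the authors' exact diagnostic.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def unstable_features(X1, y1, X2, y2, tol=0.5):
    """Flag a feature when its label correlation shifts by more than
    `tol` between two contexts (e.g. two timepoints)."""
    flagged = []
    for j in range(len(X1[0])):
        r1 = pearson([row[j] for row in X1], y1)
        r2 = pearson([row[j] for row in X2], y2)
        if abs(r1 - r2) > tol:
            flagged.append(j)
    return flagged

# Feature 0 tracks the label in both contexts; feature 1 flips sign.
X1 = [[1, 1], [2, 2], [3, 3], [4, 4]]; y1 = [1, 2, 3, 4]
X2 = [[1, 4], [2, 3], [3, 2], [4, 1]]; y2 = [1, 2, 3, 4]
print(unstable_features(X1, y1, X2, y2))   # → [1]
```

A model leaning on flagged features would be expected to fail under the corresponding shift, which is the practical use of the diagnostic.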
Results
The study finds that weak supervision supports meaningful learning within a fixed biological context, achieving a ridge regression R² of 0.356 and a Spearman correlation of 0.442. However, performance degrades sharply under temporal shifts, with models yielding negative R² values and near-zero rank correlations. These failures are attributed to supervision drift rather than limited model capacity.
Implications
The findings suggest that strong performance in a specific context does not guarantee robustness in changing environments, underscoring the need for careful evaluation of weakly supervised models. The proposed feature stability analysis could serve as a practical tool for practitioners to assess model transferability before deployment in real-world applications.
Cross-Machine Anomaly Detection Leveraging Pre-trained Time-series Model
Time Series
- Proposes a novel framework for cross-machine anomaly detection using pre-trained time-series models.
- Integrates a domain-invariant feature extractor to enhance generalization across different machines.
- Utilizes Random Forest Classifiers for feature disentanglement into machine-related and condition-related aspects.
- Demonstrates superior performance over existing methods in detecting anomalies in industrial settings.
Cross-Machine Anomaly Detection Leveraging Pre-trained Time-series Model
Summary
This paper addresses the challenge of anomaly detection in manufacturing systems where machines of the same type exhibit different behaviors due to various unobservable factors. The authors propose a cross-machine time-series anomaly detection framework that integrates a domain-invariant feature extractor with an unsupervised anomaly detection module. Utilizing the pre-trained foundation model MOMENT, the feature extractor employs Random Forest Classifiers to disentangle embeddings into machine-related and condition-related features. The latter are designed to be invariant to differences among individual machines, enhancing the generalization of downstream anomaly detectors to unseen target machines. Experiments conducted on an industrial dataset from three different machines performing the same operation demonstrate that the proposed method significantly outperforms both raw-signal-based and MOMENT-embedding feature baselines, confirming its effectiveness in improving cross-machine generalization and anomaly detection capabilities.
Methodology
The proposed framework combines a domain-invariant feature extractor with an unsupervised anomaly detection module. The feature extractor uses the pre-trained MOMENT model to generate embeddings, which are then processed by Random Forest Classifiers to separate machine-related features from condition-related features. This disentanglement yields features invariant to inter-machine differences, allowing downstream anomaly detectors to generalize to unseen machines.
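The disentanglement step can be caricatured as follows. Where the paper trains Random Forest Classifiers, this sketch substitutes a simple between/within-machine variance ratio to tag embedding dimensions as machine-related; the criterion and threshold are assumptions made purely for brevity.

```python
def machine_related_dims(embeddings, machine_ids, ratio=1.0):
    """Return dims whose between-machine variance exceeds `ratio` times
    the within-machine variance; remaining dims are treated as the
    condition-related, machine-invariant features."""
    machines = sorted(set(machine_ids))
    flagged = []
    for d in range(len(embeddings[0])):
        group_means, within = [], []
        for m in machines:
            vals = [e[d] for e, mid in zip(embeddings, machine_ids) if mid == m]
            mu = sum(vals) / len(vals)
            group_means.append(mu)
            within.append(sum((v - mu) ** 2 for v in vals) / len(vals))
        gm = sum(group_means) / len(group_means)
        between = sum((g - gm) ** 2 for g in group_means) / len(group_means)
        if between > ratio * (sum(within) / len(within)):
            flagged.append(d)
    return flagged

# Dim 0 separates machines; dim 1 varies with operating condition only.
emb = [[10, 0.1], [11, 0.2], [20, 0.15], [21, 0.25]]
ids = [0, 0, 1, 1]
print(machine_related_dims(emb, ids))   # → [0]
```

The anomaly detector would then be trained only on the unflagged (condition-related) dimensions.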
Results
The experimental results indicate that the proposed method outperforms both raw-signal-based approaches and those using MOMENT embeddings. The framework effectively enhances the generalization capabilities of anomaly detectors when applied to unseen target machines, confirming its practical applicability in industrial environments.
Implications
The findings suggest that the proposed anomaly detection framework can significantly improve the reliability and quality of manufacturing processes by enabling timely detection of anomalies across different machines. This has potential applications in various industries, particularly those reliant on complex machinery, such as semiconductor and pharmaceutical manufacturing.
Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning
Large Language Models
Reinforcement Learning
Efficient ML
- Introduces Turn-Adaptive Budgets (TAB) for efficient multi-turn reasoning in LLMs.
- Models multi-turn reasoning as a multi-objective Markov Decision Process.
- Achieves up to 35% token savings while maintaining accuracy on benchmarks.
- Proposes TAB All-SubQ for further optimization using prior knowledge of sub-questions.
Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning
Summary
This paper addresses the challenge of compute efficiency in multi-turn reasoning for large language models (LLMs) as their reasoning performance plateaus. The authors propose a novel approach called Turn-Adaptive Budgets (TAB), which formulates multi-turn reasoning as a sequential compute allocation problem modeled as a multi-objective Markov Decision Process (MDP). TAB learns to allocate computational resources adaptively based on the difficulty of each turn in a conversation, maximizing task accuracy while adhering to global token constraints. The method is trained using Group Relative Policy Optimization (GRPO) and demonstrates significant improvements in efficiency. The authors also introduce a variant called TAB All-SubQ, which utilizes knowledge of all sub-questions to further optimize token allocation. Experiments on mathematical reasoning benchmarks show that TAB can save up to 35% of tokens while maintaining accuracy compared to static baselines, and TAB All-SubQ can save up to 40% tokens, highlighting the importance of planning in multi-turn reasoning efficiency.
Methodology
The authors formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process (MDP). They develop the TAB policy using Group Relative Policy Optimization (GRPO), which learns to allocate token budgets based on the conversation history and the difficulty of each turn. The approach is evaluated against static and off-the-shelf LLM baselines on mathematical reasoning tasks.
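The budget-allocation idea can be sketched with a toy fixed rule: split a global token budget across turns in proportion to estimated difficulty, sequentially, without exceeding what remains. TAB learns this mapping with GRPO rather than applying a formula, so both the difficulty scores and the proportional rule below are illustrative assumptions.

```python
def allocate_budgets(difficulties, total_budget):
    """Assign each turn a thinking-token budget proportional to its
    estimated difficulty, clipped by the remaining global budget."""
    remaining = total_budget
    budgets = []
    total_diff = sum(difficulties)
    for d in difficulties:
        share = int(total_budget * d / total_diff)
        b = min(share, remaining)
        budgets.append(b)
        remaining -= b
    return budgets

print(allocate_budgets([1, 3, 1], 1000))   # easy, hard, easy → [200, 600, 200]
```

The learned policy additionally conditions on the conversation history, so the "difficulty" signal is itself inferred per turn rather than given.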
Results
TAB demonstrates a superior accuracy-tokens tradeoff, achieving up to 35% token savings while maintaining or improving accuracy compared to static baselines. The TAB All-SubQ variant further enhances efficiency, saving up to 40% tokens by considering the entire conversation trajectory and all sub-questions.
Implications
The findings suggest that adaptive compute allocation can significantly enhance the efficiency of multi-turn reasoning in LLMs, potentially leading to reduced costs and improved performance in real-world applications where computational resources are limited. This work emphasizes the importance of planning in reasoning tasks and could inform future designs of adaptive reasoning systems.
The Role of Generator Access in Autoregressive Post-Training
NLP
Large Language Models
Theory
- Generator access significantly influences the effectiveness of autoregressive post-training.
- Prefix control allows learners to revisit and extend previously constructed prefixes, breaking the no-reset barrier.
- Observation richness becomes meaningful only after prefix control is established.
- The study establishes a clear distinction between prefix control and observation richness in the context of learning.
The Role of Generator Access in Autoregressive Post-Training
Summary
This paper investigates the constraints imposed by generator access on autoregressive post-training in large language models. The central question is whether learners can only utilize fresh root-start rollouts or if they can revisit previously constructed prefixes to query the next-token rule. The study introduces two key concepts: prefix control, which determines the prefixes the learner can access, and prefix observation, which defines what the learner can see at those prefixes. The findings reveal that the ability to control prefixes significantly impacts the learning process, as it allows for richer observations and more effective sampling strategies. The paper demonstrates that weak prefix control can eliminate barriers in the no-reset regime, leading to improved outcomes in KL-regularized post-training. The results suggest that the interface to the generator is crucial for enhancing the performance of autoregressive models, highlighting the importance of how learners interact with the generator rather than solely focusing on optimization techniques.
Methodology
The paper employs a theoretical framework to analyze the interaction between learners and autoregressive generators. It characterizes the no-reset regime, explores the implications of weak local reset, and examines how different access models affect learning outcomes. The analysis is supported by formal proofs and a taxonomy of generator access types.
Results
The study finds that allowing weak local reset significantly enhances the learner's ability to reach informative prefixes, thus overcoming the limitations of the no-reset regime. Once prefix control is granted, richer observations can lead to better sampling strategies, demonstrating that the generator's interface plays a critical role in the learning process. The paper also shows that modifying the generator access can create an exponential gap in the number of queries needed for effective post-training.
Implications
These findings suggest that improving the interface through which learners access generators can lead to substantial advancements in the performance of autoregressive models. This has potential applications in enhancing the efficiency and effectiveness of large language models in various reasoning tasks and could inform future research on model training and architecture design.
Grokking as Dimensional Phase Transition in Neural Networks
Theory
- Grokking is identified as a dimensional phase transition in neural networks.
- Effective dimensionality (D) transitions from sub-diffusive to super-diffusive at generalization onset.
- Gradient field geometry, rather than network architecture, determines the effective dimensionality.
- The study employs finite-size scaling to analyze gradient avalanche dynamics across multiple model sizes.
Grokking as Dimensional Phase Transition in Neural Networks
Summary
This paper investigates the phenomenon of 'grokking' in neural networks, characterized by an abrupt transition from memorization to generalization during training. The author proposes that grokking can be understood as a dimensional phase transition, where the effective dimensionality (D) of the gradient field geometry shifts from sub-diffusive (D < 1) to super-diffusive (D > 1) at the point of generalization onset. This transition is analyzed through finite-size scaling (FSS) of gradient avalanche dynamics across various model scales. The findings indicate that the effective dimensionality reflects the geometry of the gradient field rather than the architecture of the network itself. The study presents evidence of this transition through time-resolved evolution, aggregate scaling analysis, and phase-resolved validation, demonstrating that grokking induces a crossover in cascade dynamics. The methodology employs the XOR boolean function as a controlled testbed, allowing for precise identification of grokking epochs across multiple model sizes. The results reveal that the dimensionality transition is robust across different topologies, suggesting new insights into the trainability of overparameterized networks and the underlying mechanisms of learning dynamics.
Methodology
The methodology includes finite-size scaling (FSS) of gradient avalanche dynamics across eight model scales, using the XOR boolean function as a controlled testbed. The study analyzes time-resolved evolution of effective dimensionality and employs a Threshold-based Diffusion Update model to quantify correlation structures in gradients during training.
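The sub- vs. super-diffusive distinction can be illustrated by fitting a log-log slope to a power-law growth curve: an exponent below 1 is sub-diffusive, above 1 super-diffusive. This is only a sketch of the exponent estimate; the paper's finite-size scaling and avalanche analysis are far more involved, and the curve here is synthetic.

```python
import math

def diffusion_exponent(curve):
    """Least-squares slope of log(curve[t]) vs log(t+1): the power-law
    exponent of a growth curve, i.e. the effective dimensionality proxy."""
    xs = [math.log(t + 1) for t in range(len(curve))]
    ys = [math.log(v) for v in curve]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

sub = [t ** 0.9 for t in range(1, 50)]     # D < 1: sub-diffusive regime
sup = [t ** 1.2 for t in range(1, 50)]     # D > 1: super-diffusive regime
print(round(diffusion_exponent(sub), 2), round(diffusion_exponent(sup), 2))
```

The reported transition from D ≈ 0.90 to D ≈ 1.20 corresponds to the slope crossing 1 at the grokking epoch.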
Results
The results show that effective dimensionality D evolves from approximately 0.90 (sub-diffusive) to 1.20 (super-diffusive) during the grokking transition. The study confirms that this transition is robust across different network topologies and is indicative of self-organized criticality in the training dynamics.
Implications
The findings have significant implications for understanding the learning dynamics of neural networks, particularly in overparameterized settings. They suggest that the geometry of the gradient field plays a crucial role in the transition from memorization to generalization, potentially guiding future research on training algorithms and network architectures.
Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
Multimodal
- Identifies fragility in Symile's symmetric treatment of modalities, which can degrade performance under misalignment.
- Introduces Gated Symile, a gating mechanism that adapts modality contributions based on reliability.
- Demonstrates improved retrieval accuracy on both synthetic and real-world datasets compared to existing methods.
- Highlights the importance of modeling modality-specific reliability in multimodal contrastive learning.
Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
Summary
This paper addresses the limitations of existing multimodal contrastive learning methods, particularly focusing on the Symile approach, which utilizes a multiplicative interaction objective to capture higher-order cross-modal dependencies. The authors identify a critical fragility in Symile's design: it treats all modalities symmetrically without accounting for differences in reliability, which can lead to performance degradation when dealing with misaligned or unreliable modalities. To overcome this issue, they propose Gated Symile, a novel contrastive gating mechanism that adapts the contributions of each modality based on their reliability. This mechanism uses an attention-based approach to suppress unreliable inputs by interpolating embeddings towards learnable neutral directions and includes a NULL option for cases where reliable cross-modal alignment is unlikely. The effectiveness of Gated Symile is demonstrated through evaluations on a synthetic benchmark designed to expose the fragility of the multiplicative interaction, as well as on three real-world tri-modal datasets. The results show that Gated Symile consistently outperforms well-tuned versions of Symile and CLIP models in terms of top-1 retrieval accuracy, highlighting the importance of incorporating modality-specific reliability in multimodal contrastive learning.
Methodology
The authors propose Gated Symile, which employs an attention-based gating mechanism to modulate the contributions of different modalities in a contrastive learning framework. This involves computing candidate-conditioned gate weights to down-weight unreliable modalities and interpolating embeddings towards learnable neutral directions. The method is evaluated on a synthetic benchmark and three real-world tri-modal datasets to assess its performance and robustness.
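A minimal sketch of the two pieces described above: the Symile-style multiplicative score, and a reliability gate that interpolates an embedding toward a neutral direction. Scalar gates and a zero neutral vector are simplifying assumptions; the paper learns both with an attention mechanism.

```python
def trilinear_score(a, b, c):
    """Symile-style multiplicative interaction: sum_d a_d * b_d * c_d.
    One corrupted factor can flip or zero the whole product."""
    return sum(x * y * z for x, y, z in zip(a, b, c))

def gate_embedding(z, gate, neutral):
    """gate=1 keeps z unchanged; gate=0 collapses z onto the neutral
    direction, muting an unreliable modality's contribution."""
    return [gate * zi + (1 - gate) * ni for zi, ni in zip(z, neutral)]

a, b = [1.0, 1.0], [1.0, 1.0]
c_bad = [-5.0, -5.0]                        # misaligned third modality
neutral = [0.0, 0.0]
print(trilinear_score(a, b, c_bad))                                # → -10.0
print(trilinear_score(a, b, gate_embedding(c_bad, 0.0, neutral)))  # → 0.0
```

The example shows the fragility directly: a single misaligned modality drives the trilinear score strongly negative, while gating it toward neutral removes its distortion.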
Results
Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models across various datasets. The synthetic benchmark explicitly reveals how misalignment in a single modality can distort the contrastive score, demonstrating the effectiveness of the proposed gating mechanism in improving robustness and performance.
Implications
The findings suggest that incorporating modality-specific reliability can significantly enhance the robustness of multimodal contrastive learning systems, making them more effective in real-world applications where data may be incomplete or misaligned. This has potential applications in fields such as medical imaging, where multiple data modalities are often used.
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
Large Language Models
Theory
Optimization
- Introduces LOO-AUC as a method to evaluate test reliability without knowing code correctness.
- Proposes ACES, a scoring system with two variants for weighting tests based on their discriminative power.
- Demonstrates that ACES achieves state-of-the-art results in Pass@k across multiple benchmarks.
- Establishes a theoretical foundation linking test consistency to discriminative power.
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
Summary
The paper addresses the challenge of selecting code candidates generated by large language models (LLMs) using tests that may themselves be incorrect. Traditional methods either treat all tests equally or rely on heuristics, leading to a circular dependency where the correctness of tests cannot be determined without knowing the correctness of the code. The authors propose a novel approach called leave-one-out AUC (LOO-AUC), which evaluates the ability of tests to distinguish between correct and incorrect code without requiring knowledge of their correctness. By holding out one test and ranking code candidates based on the remaining tests, the authors measure the agreement between the held-out test's results and the ranking. They introduce ACES (AUC ConsistEncy Scoring), which includes two variants: ACES-C, which provides closed-form weights for tests, and ACES-O, which iteratively optimizes test weights. Both methods operate on the binary pass matrix and achieve state-of-the-art performance in code generation benchmarks.
Methodology
The authors formalize code ranking as a weighted voting problem over a binary pass matrix, using leave-one-out evaluation to assess the discriminative power of tests. They derive the LOO-AUC metric and develop two algorithms: ACES-C, which uses closed-form weights, and ACES-O, which optimizes weights iteratively through a differentiable objective.
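The leave-one-out evaluation can be sketched directly on a small binary pass matrix (rows = code candidates, columns = tests). Unweighted vote counts stand in here for the ACES weights, so this illustrates LOO-AUC itself rather than either ACES variant.

```python
def auc(scores, labels):
    """Probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        return 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def loo_auc(pass_matrix, t):
    """Agreement between test t's verdicts and the candidate ranking
    induced by all the remaining tests."""
    scores = [sum(v for j, v in enumerate(row) if j != t)
              for row in pass_matrix]
    labels = [row[t] for row in pass_matrix]
    return auc(scores, labels)

# Three candidates, three tests; test 2 disagrees with the other two.
M = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]
print([round(loo_auc(M, t), 2) for t in range(3)])   # → [0.75, 0.75, 0.25]
```

Tests with low LOO-AUC are inconsistent with the rest of the suite and would receive low weight when re-ranking candidates.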
Results
The proposed ACES methods significantly improve Pass@k performance on various code generation benchmarks, with ACES-O performing well as a standalone method and ACES-C serving effectively as a plug-and-play scorer when combined with other filtering techniques.
Implications
The findings suggest that the proposed methods can enhance the reliability of code generation systems by improving the selection of candidate solutions based on test performance, potentially leading to more robust applications in software development and automated programming.
Gym-Anything: Turn any Software into an Agent Environment
Reinforcement Learning
Multimodal
Generative Models
- Gym-Anything enables the automatic creation of interactive environments for a wide variety of software applications.
- The framework uses a multi-agent approach for environment setup and auditing, enhancing reliability and scalability.
- CUA-World includes over 10,000 long-horizon tasks, significantly expanding the scope of agent training and evaluation.
- The proposed auditing mechanism improves task completion rates in long-horizon benchmarks.
Gym-Anything: Turn any Software into an Agent Environment
Summary
The paper introduces Gym-Anything, a novel framework designed to convert any software into an interactive environment for computer-use agents (CUAs). The authors highlight the limitations of existing benchmarks that focus on short-horizon tasks and a narrow range of software, which do not reflect the complexities of real-world workflows. Gym-Anything addresses this by framing environment creation as a multi-agent task, where a coding agent sets up the software and an audit agent verifies the setup against a quality checklist. The framework was applied to 200 software applications, resulting in CUA-World, a comprehensive collection of over 10,000 long-horizon tasks across various domains, including medical science and engineering. The paper also presents a long-horizon benchmark, CUA-World-Long, which challenges agents with tasks requiring over 500 steps. The authors demonstrate that distilling successful trajectories into a vision-language model significantly improves performance, and they provide a feedback mechanism for auditing completed tasks. All code and data are made publicly available to support future research in realistic computer-use agents.
Methodology
The authors developed Gym-Anything, which standardizes environment creation through a coding agent that writes setup scripts and an audit agent that verifies the setup. They employed a propose-and-amplify strategy to generate realistic tasks, using a combination of agentic and non-agentic models to scale task generation effectively.
Results
The implementation of Gym-Anything led to the creation of CUA-World, encompassing over 10,000 tasks across 200 software applications. The long-horizon benchmark, CUA-World-Long, demonstrated improved performance metrics, with a notable increase in task completion rates from 11.5% to 14.0% through the auditing process.
Implications
The framework has the potential to revolutionize the training and evaluation of computer-use agents by providing realistic environments that reflect actual economic activities. This could lead to more capable agents that can assist in complex digital tasks across various industries.
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
Computer Vision
- Introduction of the first foundation model tailored for SEM image analysis.
- Utilization of a self-supervised transformer architecture with a Mixture of Experts mechanism.
- Demonstrated capability of defocus-to-focus image translation without paired supervision.
- Outperformed existing state-of-the-art techniques in multiple evaluation metrics.
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis
Summary
This paper presents the first foundation model specifically designed for analyzing Scanning Electron Microscopy (SEM) images, addressing the limitations of existing task-specific models that hinder scalability and adaptability in diverse applications. The proposed model is pretrained on a substantial dataset of 125,000 unlabeled SEM images, utilizing a self-supervised transformer architecture enhanced by a Mixture of Experts (MoE) mechanism. This allows the model to dynamically allocate its capacity based on the characteristics of the input images, effectively managing variations in texture, resolution, and noise. A key application demonstrated is the defocus-to-focus image translation, which is crucial for automating SEM workflows. The model successfully restores focused details from defocused inputs without requiring paired supervision, outperforming state-of-the-art techniques across multiple evaluation metrics. This work establishes a foundation for developing adaptable SEM models that can bridge the gap between foundational representation learning and real-world imaging needs, ultimately accelerating materials discovery.
Methodology
The model employs a masked autoencoding framework with a ViT-Large backbone, pretrained on a diverse corpus of SEM images. The integration of a Mixture of Experts mechanism allows for dynamic routing of model capacity based on input characteristics, enhancing adaptability across various imaging conditions.
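The dynamic routing idea can be sketched as toy top-1 Mixture-of-Experts dispatch: a gate scores the input, and the highest-probability expert processes it. The two linear "experts" and the scalar gate scores are illustrative assumptions; in the paper the routing sits inside a ViT-Large masked autoencoder.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, gate_scores, experts):
    """Route input x to the argmax expert, scaling its output by the
    gate probability (top-1 routing)."""
    probs = softmax(gate_scores)
    k = max(range(len(probs)), key=probs.__getitem__)
    return [probs[k] * v for v in experts[k](x)], k

experts = [lambda x: [2 * v for v in x],     # expert 0: doubles input
           lambda x: [-v for v in x]]        # expert 1: negates input
out, chosen = moe_forward([1.0, 3.0], [0.1, 2.0], experts)
print(chosen)   # → 1 (gate prefers expert 1)
```

Conditioning the gate on image characteristics (texture, noise, resolution) is what lets capacity be allocated per input rather than shared uniformly.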
Results
The proposed model effectively restores focused details from defocused SEM images, achieving superior performance compared to existing methods. It demonstrates robustness to noise and variations in imaging conditions, validating its potential for practical applications in automated microscopy workflows.
Implications
This foundation model can significantly enhance the efficiency and accuracy of SEM image analysis, enabling high-throughput automated inspections in materials science and semiconductor manufacturing. It paves the way for future developments in adaptable imaging models that can handle the complexities of real-world data.
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
NLP
Large Language Models
Theory
- Theoretical analysis of MTP reveals its potential for inducing belief states but also highlights risks of structural hallucinations.
- LSE-MTP framework aligns multi-token predictions with ground-truth trajectories to enforce latent consistency.
- Experiments show that LSE-MTP improves path legality and robustness in multi-step planning tasks.
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
Summary
This paper addresses the challenge of developing coherent internal world models in Large Language Models (LLMs) through a novel approach called Latent Semantic Enhancement Multi-Token Prediction (LSE-MTP). The authors analyze the limitations of conventional Next-Token Prediction (NTP), which focuses on one-step-ahead predictions and often fails to capture long-range dependencies and structural integrity in latent representations. They introduce Multi-Token Prediction (MTP) as a promising alternative that encourages models to consider longer-term predictions, thereby promoting representational contractivity and the emergence of internal belief states. However, the authors identify a critical issue with MTP: structural hallucinations, where models exploit shortcuts in latent space that violate environmental constraints. To mitigate this, LSE-MTP anchors predictions to ground-truth hidden state trajectories, ensuring valid transitions and reducing the risk of hallucinations. The paper presents empirical evidence from experiments on synthetic graphs and real-world scenarios, demonstrating that LSE-MTP enhances representation alignment, path legality, and robustness against perturbations, ultimately bridging the gap between discrete token predictions and continuous state representations.
Methodology
The authors conducted a theoretical analysis of the gradient coupling mechanism in MTP and proposed the LSE-MTP framework. They performed extensive experiments on both synthetic graphs and real-world data, specifically focusing on Manhattan taxi navigation, to evaluate the effectiveness of their approach in enhancing representation alignment and reducing structural hallucinations.
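The shape of the LSE-MTP objective can be sketched as a multi-token prediction loss plus a latent-alignment penalty tying predicted hidden states to the ground-truth trajectory. The plain-list tensors, the squared-distance penalty, and the weight `lam` are all illustrative assumptions about the objective's form.

```python
def lse_mtp_loss(token_losses, pred_latents, true_latents, lam=0.1):
    """token_losses[k] = prediction loss k+1 steps ahead; the alignment
    term penalizes squared distance between predicted and ground-truth
    hidden states, discouraging shortcut (hallucinated) transitions."""
    mtp = sum(token_losses) / len(token_losses)
    align = sum(sum((p - t) ** 2 for p, t in zip(ph, th))
                for ph, th in zip(pred_latents, true_latents)) / len(pred_latents)
    return mtp + lam * align

loss = lse_mtp_loss([0.8, 1.1, 1.4],
                    pred_latents=[[0.9, 0.1], [0.4, 0.6]],
                    true_latents=[[1.0, 0.0], [0.5, 0.5]])
print(round(loss, 4))   # → 1.102
```

Setting `lam=0` recovers plain MTP, which is exactly the regime where the paper observes structural hallucinations.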
Results
The results indicate that LSE-MTP significantly improves the legality of paths generated by the model, enhances belief compression, and increases robustness to perturbations in multi-step planning scenarios. The empirical findings support the theoretical claims regarding the benefits of aligning predictions with ground-truth trajectories.
Implications
The findings suggest that enhancing the internal world models of LLMs can lead to better reasoning capabilities and more reliable performance in complex tasks. This work has potential applications in areas requiring robust planning and decision-making, such as autonomous navigation and interactive AI systems.
Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation
Generative Models
Theory
Efficient ML
- Introduction of Jeffreys Flow as a robust generative framework for rare event sampling.
- Utilizes symmetric Jeffreys divergence to mitigate mode collapse in Boltzmann generators.
- Demonstrates scalability and accuracy on complex multi-dimensional benchmarks.
- Provides theoretical guarantees for improved sampling accuracy and reduced mode collapse.
Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation
Summary
This paper addresses the challenge of sampling physical systems characterized by rough energy landscapes, particularly focusing on rare events and metastable trapping. While traditional Boltzmann generators have been employed for this purpose, they often suffer from mode collapse due to their reliance on the reverse Kullback–Leibler divergence. The authors introduce the Jeffreys Flow, a novel generative framework that utilizes the symmetric Jeffreys divergence to mitigate mode collapse by distilling empirical sampling data from Parallel Tempering (PT) trajectories. This approach balances local precision in targeting modes with global coverage of the distribution. The paper demonstrates that minimizing the Jeffreys divergence not only suppresses mode collapse but also corrects inaccuracies inherent in the sampling process. The authors validate the scalability and accuracy of the Jeffreys Flow on complex, non-convex benchmarks, including applications in Replica Exchange Stochastic Gradient Langevin Dynamics and Path Integral Monte Carlo for quantum thermal states. The results indicate that the Jeffreys Flow can effectively generate statistically independent samples across multi-modal distributions, providing a robust solution for rare event sampling.
Methodology
The Jeffreys Flow employs a sequence of invertible normalizing flows trained using reference samples generated from Parallel Tempering. By minimizing the Jeffreys divergence, the framework distills knowledge from empirical data to improve sampling accuracy and prevent mode collapse.
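As a rough illustration of the training signal (not the paper's implementation), the symmetric Jeffreys divergence is simply the sum of the forward and reverse KL terms, so a model is penalized both for missing target modes and for placing mass where the target has none:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    """Symmetric Jeffreys divergence: KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)

p = [0.5, 0.5]       # bimodal target: two modes, equal weight
q = [0.9, 0.1]       # mode-collapsed model favoring one mode
j = jeffreys(p, q)   # symmetric: missing and spurious mass both hurt
```

Because the divergence is symmetric, collapsing onto one mode of `p` is penalized just as the reverse-KL-only objective of a standard Boltzmann generator would not be.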
Results
The numerical tests reveal that the Jeffreys Flow significantly outperforms traditional Boltzmann generators in terms of sample diversity and accuracy, particularly in high-dimensional, multi-modal distributions. The framework successfully corrects stochastic gradient biases and accelerates exact importance sampling in quantum thermal states.
Implications
The Jeffreys Flow has the potential to enhance sampling techniques in various fields such as statistical mechanics, computational physics, and machine learning, particularly in scenarios involving complex, multi-modal distributions and rare event sampling.
Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game
Reinforcement Learning
- Adaptation of AlphaZero for asymmetric games like Tablut.
- Implementation of separate policy and value heads for each player.
- Identification of catastrophic forgetting as a challenge in self-play.
- Significant performance improvements observed over 100 iterations.
Summary
This paper explores the application of the AlphaZero reinforcement learning framework to Tablut, an asymmetric board game. The authors adapt the original AlphaZero algorithm, which was designed for symmetric games like Chess and Go, to accommodate the unique challenges posed by Tablut's asymmetric structure. A key modification involves implementing separate policy and value heads for each player, addressing the distinct objectives of attackers and defenders. The study highlights the challenges of training stability, particularly the issue of catastrophic forgetting, which was not previously reported in the original AlphaZero work. The authors implemented the system using JAX and conducted training on NVIDIA GPUs, achieving significant performance improvements over 100 self-play iterations. The results indicate a steady increase in Elo ratings and a notable disparity in win rates between attackers and defenders, suggesting potential imbalances in the game dynamics. Overall, this work demonstrates the feasibility of adapting AlphaZero to asymmetric games while identifying critical areas for future research.
Methodology
The authors replicated the AlphaZero architecture with modifications for Tablut, including separate policy and value heads for each player. They utilized Monte Carlo Tree Search (MCTS) with GumbelMuZero for efficient search and implemented the system in JAX. Training involved 100 self-play iterations with data augmentation and a replay buffer to stabilize learning.
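The per-player head structure can be sketched as follows; the linear layers, sizes, and the `TwoHeadedNet` name are illustrative stand-ins for the paper's deep JAX network:

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoHeadedNet:
    """Shared trunk with a separate (policy, value) head pair per role:
    0 = attacker, 1 = defender. Illustrative linear layers only."""
    def __init__(self, obs_dim, n_actions, hidden=32):
        self.trunk = rng.normal(size=(obs_dim, hidden)) * 0.1
        self.policy = {p: rng.normal(size=(hidden, n_actions)) * 0.1 for p in (0, 1)}
        self.value = {p: rng.normal(size=(hidden, 1)) * 0.1 for p in (0, 1)}

    def forward(self, obs, player):
        h = np.tanh(obs @ self.trunk)
        logits = h @ self.policy[player]       # role-specific move priors
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        v = np.tanh(h @ self.value[player]).item()  # role-specific value in [-1, 1]
        return probs, v

net = TwoHeadedNet(obs_dim=81, n_actions=100)  # 9x9 Tablut board, toy action count
probs, v = net.forward(rng.normal(size=81), player=0)
```

Keeping the trunk shared but the heads separate lets the two roles pursue their distinct objectives without learning two full networks.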
Results
The model achieved a BayesElo rating of 1235 after 100 iterations, with a decrease in policy entropy and a reduction in the average number of pieces remaining at the end of games. The attacker’s win rate improved significantly, while the defender’s win rate declined, indicating potential asymmetries in the game.
Implications
This research provides insights into the adaptation of reinforcement learning algorithms for asymmetric games, highlighting the importance of tailored architectures and training strategies. The findings may inform future developments in game AI and broader applications in asymmetric decision-making scenarios.
Data Distribution Valuation Using Generalized Bayesian Inference
Theory
- Introduction of the Generalized Bayes Valuation (GBV) framework for data distribution valuation.
- Utilization of generalized Bayesian inference with a transferability loss function.
- Extension of GBV to Continual Generalized Bayes Valuation (CGBV) for continuous data streams.
- Demonstration of GBV's applicability to annotator evaluation and data augmentation.
Summary
This paper addresses the problem of data distribution valuation, which quantifies the values of data distributions based on their samples. The authors propose a novel framework called Generalized Bayes Valuation (GBV) that employs generalized Bayesian inference with a loss function derived from transferability measures. This approach allows for a unified solution to various practical problems, including annotator evaluation and data augmentation, without the restrictive assumptions present in previous methods. The framework is scalable and efficient, making it suitable for large-scale applications. Additionally, the authors extend GBV to a continuous data stream setting, resulting in Continual Generalized Bayes Valuation (CGBV), which is designed for scenarios where data is acquired sequentially. The effectiveness of the proposed methods is empirically validated across different datasets, demonstrating their applicability in real-world scenarios.
Methodology
The authors develop the GBV framework by replacing the classical negative log-likelihood with a general loss function based on transferability measures. This allows for the construction of a posterior over data sources, which serves as the valuation. The framework is further extended to handle continuous data streams through CGBV, enabling dynamic re-evaluation of data quality.
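A minimal sketch of the generalized-Bayes update this describes, assuming the standard Gibbs-posterior form with a tempering parameter λ; the concrete transferability loss and prior are the paper's, not shown here:

```python
import numpy as np

def generalized_posterior(losses, prior=None, lam=1.0):
    """Generalized Bayes update: posterior_i ∝ prior_i · exp(-λ · loss_i),
    where `losses` are transferability losses per data source (lower = better)."""
    losses = np.asarray(losses, float)
    n = len(losses)
    prior = np.full(n, 1.0 / n) if prior is None else np.asarray(prior, float)
    logw = np.log(prior) - lam * losses
    logw -= logw.max()            # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

# three hypothetical vendors; vendor 0 transfers best (lowest loss)
vals = generalized_posterior([0.2, 0.9, 1.5], lam=2.0)
```

The posterior weights themselves serve as the valuations, so a lower transferability loss directly translates into a higher data value.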
Results
The experimental results indicate that the GBV framework is effective and efficient in various real-world applications, including annotator evaluation and data augmentation. The extension to CGBV also shows promise in managing continuous data acquisition scenarios.
Implications
The proposed framework has significant implications for data buyers in evaluating the quality of data from different vendors, enhancing decision-making processes in data acquisition. It also opens new avenues for research in data valuation and transfer learning.
PRIME: Prototype-Driven Multimodal Pretraining for Cancer Prognosis with Missing Modalities
Multimodal
- PRIME is designed to handle missing modalities in clinical data for cancer prognosis.
- It employs a prototype memory bank for semantic imputation and representation learning.
- The framework achieves superior performance on multiple cancer prognosis tasks compared to existing methods.
- PRIME supports robust predictions even when modalities are missing during inference.
Summary
The paper introduces PRIME, a novel framework for multimodal self-supervised pretraining aimed at cancer prognosis, particularly in scenarios where clinical data is often incomplete. Traditional approaches to multimodal learning require fully paired inputs, which is impractical in real-world clinical settings where modalities may be missing. PRIME addresses this challenge by employing a missing-aware strategy that learns robust representations from partially observed cohorts. It utilizes a unified token space and a shared prototype memory bank to facilitate semantic imputation of missing modalities through patient-level consensus retrieval. The framework incorporates two main pretraining objectives: inter-modality alignment and post-fusion consistency, which together enhance the model's ability to make predictions despite missing data. The authors evaluate PRIME on The Cancer Genome Atlas (TCGA) dataset across multiple cancer types and various prognostic tasks, demonstrating its effectiveness in overall survival prediction, mortality classification, and recurrence classification. The results indicate that PRIME outperforms existing methods, achieving high performance metrics while maintaining robustness against missing modalities. This work highlights the potential of missing-aware multimodal pretraining as a viable approach for prognosis modeling in fragmented clinical data environments.
Methodology
PRIME employs a missing-aware multimodal self-supervised pretraining framework that integrates histopathology images, gene expression data, and pathology reports. It utilizes a unified token space and a shared prototype memory bank for latent-space semantic imputation. The framework optimizes two complementary objectives: inter-modality alignment for paired modality subsets and post-fusion consistency under structured missingness augmentation, allowing it to learn robust representations from incomplete data.
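The latent-space imputation step might look roughly like this attention-weighted retrieval from the prototype bank; the dimensions, the `impute_missing` name, and the dot-product similarity rule are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def impute_missing(query, prototypes):
    """Semantic imputation sketch: replace a missing modality's embedding
    with an attention-weighted mix of shared prototype vectors, queried by
    the patient-level embedding of the observed modalities."""
    sims = prototypes @ query / np.sqrt(len(query))  # (K,) similarity scores
    attn = softmax(sims)
    return attn @ prototypes                         # convex combination of prototypes

rng = np.random.default_rng(1)
bank = rng.normal(size=(16, 64))   # K=16 prototypes in a shared 64-d token space
obs = rng.normal(size=64)          # embedding from the observed modalities
imputed = impute_missing(obs, bank)
```

Because the imputed vector lives in the same unified token space as real modality tokens, the downstream fusion model can treat it uniformly.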
Results
PRIME was evaluated on TCGA across three tasks: overall survival prediction (C-index of 0.653), 3-year mortality classification (AUROC of 0.689), and 3-year recurrence classification (AUROC of 0.637). It demonstrated the best macro-average performance among compared methods and showed improved robustness under test-time missingness.
Implications
The findings suggest that missing-aware multimodal pretraining can significantly enhance cancer prognosis modeling, making it a practical strategy for leveraging fragmented clinical data. This approach could lead to more accurate and personalized treatment planning in oncology.
Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
Reinforcement Learning
Theory
Time Series
- Introduction of Anticipatory Reinforcement Learning (ARL) framework for non-Markovian environments.
- Development of a 'Single-Pass' policy evaluation mechanism that avoids high-variance Monte Carlo methods.
- Utilization of Marcus-compliant Neural Controlled Differential Equations (CDEs) for accurate path dynamics.
- Formulation of a Self-Consistent Field (SCF) equilibrium to ensure consistency between deterministic and stochastic representations.
Summary
This paper introduces Anticipatory Reinforcement Learning (ARL), a framework aimed at addressing the limitations of traditional reinforcement learning (RL) in non-Markovian environments, particularly when only a single trajectory is observed. Traditional RL methods often fail in complex environments characterized by jump-diffusions and structural breaks, as they rely on the Markov property. The ARL framework overcomes this by embedding the state space into a signature-augmented manifold, allowing the agent to utilize the history of the process as a dynamic coordinate. By employing a self-consistent field approach, ARL enables a deterministic evaluation of expected returns through a path-law proxy, significantly reducing computational complexity and variance compared to stochastic methods. The paper demonstrates that this approach preserves contraction properties and ensures stable generalization even in the presence of heavy-tailed noise. The results indicate that grounding RL in the topological features of path-space can lead to improved risk management and policy stability in volatile environments.
Methodology
The ARL framework employs a signature-augmented state space to represent path-dependent decision processes. It utilizes a self-consistent field approach for deterministic evaluation of expected returns and integrates Neural Controlled Differential Equations to model path dynamics. The framework also introduces a novel temporal difference operator to enhance learning stability.
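A depth-2 truncated path signature, the kind of history-dependent coordinate a signature-augmented state space relies on, can be approximated from a discrete path as below (a generic sketch, not the paper's exact construction):

```python
import numpy as np

def truncated_signature(path):
    """Depth-2 signature of a discrete d-dimensional path of shape (T, d):
    level 1 is the total increment, level 2 the iterated-integral matrix."""
    inc = np.diff(path, axis=0)            # (T-1, d) step increments
    s1 = inc.sum(axis=0)                   # level 1: X_T - X_0
    run = np.cumsum(inc, axis=0) - inc     # X_t - X_0 before each step
    s2 = run.T @ inc                       # level 2 approximation, (d, d)
    return s1, s2

t = np.linspace(0.0, 1.0, 200)
path = np.stack([t, np.sin(4 * np.pi * t)], axis=1)  # a 2-d toy trajectory
s1, s2 = truncated_signature(path)
```

Feeding these signature terms to the agent as extra coordinates lets a single observed trajectory carry its own history, which is what makes the non-Markovian setting tractable.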
Results
The ARL framework successfully demonstrates that agents can achieve proactive risk management and superior policy stability in environments with heavy-tailed noise. The transition to a deterministic evaluation method significantly reduces computational complexity while maintaining the ability to generalize effectively.
Implications
The ARL framework has potential applications in high-frequency finance and other domains where environments exhibit non-Markovian characteristics. It offers a mathematically rigorous alternative to traditional RL methods, enabling more efficient and stable decision-making in complex, dynamic settings.
A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data
Time Series
Theory
Interpretability
- Introduces a hybrid framework that combines symbolic regression with Gaussian processes for modeling stochastic dynamics.
- Successfully identifies both symbolic and stochastic components of dynamical systems from noisy data.
- Demonstrates data efficiency, requiring only 10²–10³ data points for effective modeling.
- Validates the approach on both numerical benchmarks and experimental biological systems.
Summary
This paper presents a novel machine learning framework designed to uncover stochastic nonlinear dynamics from noisy data. The authors address the challenge of modeling real-world systems that are influenced by noise, which can arise from various sources such as financial markets, biological systems, and environmental factors. Traditional methods for symbolic regression often overlook uncertainty, while Gaussian processes provide uncertainty quantification but lack insights into underlying dynamics. The proposed hybrid framework combines deep symbolic regression with Gaussian process-based maximum likelihood estimation, allowing for the recovery of symbolic governing equations alongside uncertainty inference in system parameters. The methodology is validated through numerical benchmarks, including harmonic, Duffing, and van der Pol oscillators, and is further tested on an experimental system of coupled biological oscillators. The results demonstrate that the framework is data-efficient, requiring only 10²–10³ data points, and is robust against noise, showcasing its potential for applications where uncertainty is inherent and both the structure and variability of dynamical systems need to be understood.
Methodology
The framework integrates deep symbolic regression with Gaussian process-based maximum likelihood estimation to model deterministic dynamics and noise structure separately. This approach does not require prior assumptions about the functional forms of the dynamics or noise.
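The Gaussian-process half of such a pipeline reduces to maximizing a marginal likelihood in the kernel and noise hyperparameters; a standard RBF-kernel version is sketched below (hyperparameter names are generic, not the paper's):

```python
import numpy as np

def gp_neg_log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """Negative log marginal likelihood of a GP with an RBF kernel.
    Minimizing this over (length_scale, signal_var, noise_var) infers the
    noise structure once symbolic regression has fixed the deterministic part."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = signal_var * np.exp(-0.5 * d2 / length_scale**2) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                       # stable inversion via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 40)
y = np.sin(X) + 0.1 * rng.normal(size=40)          # toy noisy observations
nll = gp_neg_log_marginal_likelihood(X, y, 1.0, 1.0, 0.01)
```

In practice one would pass this objective to an optimizer over the hyperparameters, with the fitted `noise_var` quantifying the stochastic component.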
Results
The framework was validated on various numerical benchmarks and an experimental system, successfully recovering the symbolic equations and quantifying uncertainty in system parameters. It proved to be data-efficient and robust to noise.
Implications
This framework has significant implications for fields requiring accurate modeling of complex dynamical systems under uncertainty, such as finance, biology, and environmental science. It enhances the understanding of system dynamics while providing reliable uncertainty quantification.
SODA: Semi On-Policy Black-Box Distillation for Large Language Models
NLP
Large Language Models
Efficient ML
- Introduction of semi on-policy distillation, utilizing a static snapshot of student responses for effective black-box alignment.
- SODA achieves a 10× speedup in training time and 27% reduction in peak GPU memory usage compared to GAD.
- Extensive evaluations show SODA outperforms or matches GAD on 15 out of 16 benchmark results.
- The method eliminates the need for adversarial training, simplifying the distillation process.
Summary
The paper introduces SODA (Semi On-policy Distillation with Alignment), a novel approach to black-box knowledge distillation for large language models (LLMs). Traditional methods face a trade-off between off-policy and fully on-policy strategies, with off-policy methods like sequence-level knowledge distillation (SeqKD) failing to correct inherent student errors, while fully on-policy methods like Generative Adversarial Distillation (GAD) suffer from instability and high computational costs. SODA addresses these issues by leveraging a static snapshot of the student model's responses to create a contrastive signal that aligns the student with the teacher's superior outputs. This semi on-policy approach allows for effective distribution alignment without the need for continuous online sampling or additional models, leading to significant improvements in training efficiency. The authors validate SODA across multiple compact models, demonstrating its ability to match or exceed the performance of state-of-the-art methods while being faster and more memory-efficient.
Methodology
SODA employs a semi on-policy approach by first stabilizing the student model with a brief warmup on teacher data, followed by Direct Preference Optimization (DPO). This method uses the teacher's responses as preferred outputs and the student's own responses as dispreferred, creating a contrastive signal that facilitates effective learning without the overhead of adversarial training.
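The DPO step reduces, per preference pair, to a logistic loss on implicit reward margins; a minimal sketch with illustrative log-probability values:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one pair: the teacher response
    is preferred (_w), the student's own snapshot response dispreferred (_l).
    logp_* are sequence log-probs under the student; ref_logp_* under the
    frozen reference (here, the warmed-up student)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# loss shrinks as the student raises the teacher response's relative likelihood
small = dpo_loss(-5.0, -8.0, -7.0, -7.0)  # already prefers teacher
large = dpo_loss(-9.0, -6.0, -7.0, -7.0)  # still prefers its own response
```

Because the dispreferred responses come from a static snapshot, no online sampling or discriminator is needed during training, which is where the efficiency gains over adversarial distillation come from.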
Results
SODA matches or outperforms the state-of-the-art GAD on 15 out of 16 model-dataset combinations, achieving improvements of up to +2.1 points in performance metrics. It also demonstrates a significant reduction in training time and memory usage, being 10× faster and consuming 27% less peak GPU memory.
Implications
The SODA framework has the potential to enhance the efficiency of training smaller language models, making them more accessible for deployment in various applications. Its ability to maintain high distillation quality while reducing resource requirements could lead to broader adoption of compact models in real-world scenarios.
LLMs Should Express Uncertainty Explicitly
Large Language Models
NLP
- Uncertainty in LLMs should be treated as an explicit interface rather than a latent quantity.
- Two complementary interfaces are proposed: verbalized confidence for final answers and an <uncertain> marker during reasoning.
- The verbalized-confidence interface improves calibration and reduces overconfident errors.
- The reasoning-time interface enhances visibility of failures and aids in retrieval control.
Summary
This paper addresses the critical issue of how large language models (LLMs) communicate uncertainty, which is essential for downstream decision-making processes such as abstention, retrieval, and verification. The authors propose that uncertainty should not be treated merely as a latent quantity to be estimated post-generation, but rather as an explicit interface that the model is trained to communicate. They introduce two complementary uncertainty interfaces: a global interface that verbalizes a calibrated confidence score for the final answer, and a local interface that emits an explicit <uncertain> marker during reasoning when the model encounters high-risk states. The study demonstrates that the verbalized-confidence interface significantly enhances calibration, reduces overconfident errors, and outperforms existing calibration baselines, while the reasoning-time interface reveals previously hidden failures, improves wrong-answer coverage, and serves as an effective signal for retrieval control. The authors analyze the mechanisms behind these improvements, highlighting the importance of training models to express uncertainty in a task-matched manner. Overall, the paper argues for the necessity of making uncertainty explicit in LLMs to facilitate better decision-making in practical applications.
Methodology
The authors implemented a unified post-training framework to evaluate the two uncertainty interfaces. They assessed calibration quality, behavioral reliability, and downstream retrieval control through empirical experiments and analyses, including error analysis and PCA.
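Calibration of a verbalized-confidence interface is typically scored with expected calibration error (ECE); a standard binned version is sketched below (the paper's exact metric suite may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence and compare each bin's
    mean confidence to its empirical accuracy, weighted by bin mass."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# toy verbalized confidences and answer correctness
conf = [0.95, 0.9, 0.6, 0.55, 0.2]
hit = [1, 1, 1, 0, 0]
ece = expected_calibration_error(conf, hit)
```

A well-calibrated model keeps each bin's stated confidence close to its hit rate, driving ECE toward zero.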
Results
The verbalized-confidence interface showed substantial improvements in calibration and reduced overconfident errors, outperforming strong calibration baselines. The reasoning-time interface made failures visible during generation and improved wrong-answer coverage, providing effective signals for retrieval control.
Implications
The findings suggest that training LLMs to explicitly communicate uncertainty can enhance their reliability and usability in real-world applications, particularly in scenarios requiring critical decision-making based on model outputs.
Learning $\text{AC}^0$ Under Graphical Models
Theory
- Introduces quasipolynomial-time algorithms for learning AC0 under graphical models with strong spatial mixing.
- Circumvents traditional Fourier analysis limitations by leveraging new sampling algorithms.
- Extends the applicability of low-degree polynomial approximation to other function classes.
- Addresses the critique of reliance on product structure in learning theory.
Summary
This paper addresses the challenge of learning constant-depth circuits (AC0) under more realistic correlated distributions, specifically those modeled by graphical structures. Building on the foundational work of Linial, Mansour, and Nisan (1993), which provided a quasipolynomial-time algorithm for learning AC0 under uniform distributions, the authors propose new algorithms that extend these results to graphical models exhibiting strong spatial mixing. The main innovation lies in circumventing the reliance on Fourier analysis, traditionally used in learning theory, by employing novel sampling techniques that facilitate the transfer of low-degree polynomial approximation results from uniform distributions to graphical models. This approach not only applies to AC0 but also extends to other function classes like monotone functions and halfspaces, thereby broadening the applicability of the low-degree algorithm in computational learning theory.
Methodology
The authors develop quasipolynomial-time algorithms that utilize tailored sampling techniques to analyze and learn from graphical models. They focus on the existence of low-degree approximations in these models, allowing them to bypass the need for Fourier analysis, which is typically a barrier in learning under non-product distributions.
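At its core, the low-degree algorithm is L2 regression onto low-degree monomials followed by sign rounding; a toy version on uniform ±1 samples is below (the paper's contribution is showing that the needed low-degree approximations survive under strong-spatial-mixing graphical models, not just product distributions):

```python
import numpy as np
from itertools import combinations

def monomials(X, degree):
    """All monomial features of degree <= `degree` over ±1 inputs."""
    n, d = X.shape
    feats = [np.ones(n)]
    for k in range(1, degree + 1):
        for S in combinations(range(d), k):
            feats.append(X[:, list(S)].prod(axis=1))
    return np.stack(feats, axis=1)

def low_degree_learn(X, y, degree):
    """Fit a low-degree polynomial by least squares, predict by sign."""
    coef, *_ = np.linalg.lstsq(monomials(X, degree), y, rcond=None)
    return lambda Xq: np.sign(monomials(Xq, degree) @ coef)

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(400, 4))
y = np.where((X[:, 0] > 0) & (X[:, 1] > 0), 1.0, -1.0)  # AND of two bits, exactly degree 2
h = low_degree_learn(X, y, degree=2)
acc = (h(X) == y).mean()
```

For AC0, degree quasipolynomial in the circuit size suffices, which is what yields the quasipolynomial running time.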
Results
The paper demonstrates that AC0 can be efficiently learned from samples drawn from any graphical model with polynomial growth and strong spatial mixing. The results indicate that low-degree approximations exist even in the absence of product structure, ensuring the success of the low-degree algorithm in these contexts.
Implications
The findings have significant implications for computational learning theory, particularly in enhancing the understanding and capabilities of learning algorithms in more complex, real-world scenarios. This work could lead to improved algorithms for various applications in computer science, economics, and statistics where correlated data is prevalent.
Collapse-Free Prototype Readout Layer for Transformer Encoders
Theory
Efficient ML
NLP
- Introduction of DDCL-Attention, a prototype-based competitive readout layer for transformers.
- Mathematical guarantees against prototype collapse and formal training stability.
- Versatile application in multiple paradigms, including readout layers and hierarchical compression.
- Empirical validation demonstrating the effectiveness and efficiency of the method across various datasets.
Summary
This paper presents DDCL-Attention, a novel prototype-based competitive readout layer designed for transformer encoders. Traditional methods for summarizing token representations, such as averaging or using a single class token, often lead to information loss and lack feedback on representational quality. DDCL-Attention addresses these issues by utilizing a small bank of globally learned prototype vectors that summarize recurring data patterns. Each token is assigned to these prototypes through a soft probabilistic rule, allowing for a weighted combination of prototypes as output. This mechanism operates with linear complexity concerning sequence length, contrasting with the quadratic cost of standard self-attention. The authors provide mathematical guarantees against prototype collapse, ensuring that prototypes do not converge to a single point, which is a common failure in existing methods. Additionally, they demonstrate training stability under specific conditions and showcase the versatility of DDCL-Attention across three application paradigms: as a final readout layer, a differentiable codebook, and a hierarchical document compressor. Experimental results across multiple datasets confirm the theoretical predictions, including zero violations of the loss decomposition and full utilization of the codebook, highlighting the effectiveness of the proposed approach in various contexts, including scientific tabular data classification.
Methodology
The authors developed DDCL-Attention, which employs a competitive readout mechanism using a bank of globally learned prototypes. The method utilizes soft probabilistic assignments for token-to-prototype mapping, ensuring linear complexity. Theoretical analysis includes exact loss decomposition and stability conditions derived from Tikhonov's singular perturbation theory.
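The soft token-to-prototype assignment can be sketched in a few lines; the distance-based similarity and mean pooling here are illustrative choices meant to show the linear O(T·K) cost, not the exact DDCL-Attention rule:

```python
import numpy as np

def prototype_readout(tokens, prototypes, temperature=1.0):
    """Competitive prototype readout sketch: each token is softly assigned
    to nearby prototypes; the sequence summary is the assignment-weighted
    prototype mixture. Cost is O(T·K), linear in sequence length T."""
    d2 = ((tokens[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (T, K)
    logits = -d2 / temperature
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)      # soft token -> prototype assignment
    return a.mean(axis=0) @ prototypes     # (d,) pooled summary

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 16))   # T=128 token embeddings
bank = rng.normal(size=(8, 16))       # K=8 globally learned prototypes
summary = prototype_readout(tokens, bank)
```

Contrast this with self-attention pooling, whose token-token interactions cost O(T²); here each token only interacts with the fixed prototype bank.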
Results
Experiments showed that the loss decomposition holds without violations, prototype separation increases as predicted, and the codebook achieves 100% utilization, significantly outperforming standard hard vector quantization methods. Additional experiments demonstrated the applicability of DDCL-Attention to scientific tabular data classification.
Implications
The proposed DDCL-Attention layer can enhance transformer models by providing a more structured and informative summary of token representations, potentially improving performance in various tasks across NLP and beyond. Its stability and efficiency make it suitable for large-scale applications.
NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
NLP
Large Language Models
Efficient ML
- Introduces NativeTernary, a binary encoding scheme for ternary values.
- Utilizes unary run-length encoding to represent semantic hierarchy levels.
- Addresses the lack of native binary formats for ternary neural networks.
- Offers three encoding variants with different delimiter configurations.
Summary
The paper introduces NativeTernary, a novel binary encoding scheme designed to facilitate the representation of ternary values (either balanced {-1, 0, +1} or unsigned {0, 1, 2}) within existing binary infrastructures. The key innovation is the use of unary run-length encoding to denote semantic hierarchy levels, allowing for the inline encoding of structural information alongside data. This method addresses the limitations of current binary formats that lack native support for ternary weights, particularly in the context of large language models like BitNet b1.58, which operate on ternary weights but are stored in binary formats. NativeTernary provides three encoding variants, with the primary scheme utilizing {11} as a delimiter, and offers a path toward ternary-native computing without requiring hardware changes. The encoding efficiently represents hierarchical structures, such as character, word, and sentence boundaries, with the cost of encoding increasing with the rarity of the boundary. The decoder is designed as a stateless state machine, ensuring resilience to bitstream corruption. Overall, NativeTernary presents a significant advancement in data encoding for ternary neural networks and structured data applications.
Methodology
The methodology involves partitioning the 2-bit pair space into data symbols and delimiters, using unary run-length encoding to denote hierarchy levels. The encoding scheme allows for inline representation of both data and structural boundaries, with a focus on minimizing overhead and maximizing efficiency. The paper also explores different configurations for delimiters and their implications for various applications.
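An illustrative round-trip for the {11}-delimiter variant, with unary runs of the delimiter pair marking hierarchy depth; this is a sketch of the idea, not the paper's exact wire format:

```python
SYMBOL = {0: "00", 1: "01", 2: "10"}   # data pairs; "11" is reserved as the delimiter

def encode(trits, boundaries):
    """Encode unsigned trits, inserting a run of k '11' pairs after position i
    for each (i, k) in `boundaries`: the unary run length is the hierarchy depth."""
    marks = dict(boundaries)
    out = []
    for i, t in enumerate(trits):
        out.append(SYMBOL[t])
        out.extend(["11"] * marks.get(i, 0))
    return "".join(out)

def decode(bits):
    """Stateless pair-wise decode: walk 2 bits at a time, counting '11' runs."""
    rev = {v: k for k, v in SYMBOL.items()}
    trits, bounds, run = [], [], 0
    for j in range(0, len(bits), 2):
        pair = bits[j:j + 2]
        if pair == "11":
            run += 1
        else:
            if run:
                bounds.append((len(trits) - 1, run))
                run = 0
            trits.append(rev[pair])
    if run:
        bounds.append((len(trits) - 1, run))
    return trits, bounds

bits = encode([0, 2, 1], {1: 2})   # level-2 boundary after the second trit
```

Note the cost structure the paper describes: rarer (deeper) boundaries pay more pairs, while ordinary trits always cost exactly one 2-bit pair.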
Results
The NativeTernary encoding scheme successfully demonstrates the ability to encode ternary data and structural information simultaneously, achieving efficient representation without additional overhead. The proposed decoder is resilient to bitstream corruption, enhancing reliability in practical applications.
Implications
NativeTernary has potential applications across various domains, including ternary neural network weight storage, hierarchical natural language encoding, edge computing, IoT, satellite telemetry, industrial sensors, automotive systems, medical devices, gaming, and financial data processing. It opens avenues for more efficient data representation and processing in systems that utilize ternary values.
Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning
NLP
Large Language Models
Reinforcement Learning
- Fine-tuning on single best moves leads to effective RL but unfaithful reasoning.
- Training on multi-move trajectories results in more stable RL and faithful reasoning.
- Reinforcement learning improves move quality and reduces hallucination rates.
- SFT-checkpoint metrics can predict final RL performance.
Summary
This paper investigates how reasoning capabilities in language models can be enhanced through supervised fine-tuning (SFT) and reinforcement learning (RL), specifically in the context of chess. The authors analyze the impact of different training datasets on model performance and reasoning quality. They find that while fine-tuning a model to predict the best move leads to effective RL, it often results in unfaithful reasoning. In contrast, training on multi-move trajectories yields comparable performance with more stable RL and faithful reasoning. The study also reveals that RL significantly improves move quality and reduces hallucination rates. Additionally, several SFT-checkpoint metrics are identified as predictive of post-RL performance. The authors release their checkpoints, models, and code, achieving superior performance compared to existing open-source reasoning models in chess with a 7B-parameter model.
Methodology
The authors employed a 7B-parameter language model trained on custom datasets using supervised fine-tuning (SFT) to predict chess moves, followed by reinforcement learning (RL) to enhance reasoning capabilities. They analyzed the effects of different training datasets and evaluated model performance based on various metrics.
Results
The study found that training on multi-move trajectories led to comparable downstream performance with faithful reasoning, while fine-tuning on single best moves resulted in unfaithful reasoning. RL was shown to induce a positive shift in move quality and reduce hallucination rates. Several SFT-checkpoint metrics were identified as predictive of the model's performance post-RL.
Implications
The findings suggest that careful selection of training strategies and datasets can significantly enhance reasoning capabilities in language models, particularly in complex domains like chess. This approach could be applied to other domains requiring reasoning and decision-making, potentially improving AI performance in various applications.
Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem
Reinforcement Learning
Optimization
- Introduces a unified DRL framework for solving HFVRP and its variants.
- Develops the Vehicle-as-Prompt mechanism to streamline decision-making.
- Achieves superior performance compared to existing DRL methods and traditional heuristics.
- Demonstrates strong zero-shot generalization across diverse problem scales.
Summary
This paper addresses the Heterogeneous Fleet Vehicle Routing Problem (HFVRP), which presents unique challenges due to varying fixed costs, variable travel costs, and capacity constraints associated with heterogeneous vehicle fleets. Traditional Deep Reinforcement Learning (DRL) methods have primarily focused on homogeneous scenarios, leading to suboptimal performance in HFVRP. To overcome these limitations, the authors propose a unified DRL framework called Vehicle-as-Prompt (VaP), which models the problem as a single-stage autoregressive decision process. The framework includes a cross-semantic encoder and a multi-view decoder that effectively capture the relationships between vehicle heterogeneity and customer node attributes. Extensive experiments demonstrate that VaP-CSMV outperforms existing state-of-the-art DRL-based neural solvers and achieves competitive results compared to traditional heuristic solvers, significantly reducing inference time. Additionally, the framework shows strong zero-shot generalization capabilities on large-scale and previously unseen problem variants, validating the contributions of its individual components through ablation studies.
Methodology
The proposed methodology involves a unified DRL framework that integrates a cross-semantic encoder and a multi-view decoder. The Vehicle-as-Prompt mechanism formulates the routing problem as a single-stage autoregressive decision process, allowing for efficient joint optimization of vehicle dispatching and routing. The framework leverages dual-attention mechanisms and cross-semantic feature fusion to adapt to various HFVRP variants.
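The single-stage autoregressive decision process can be illustrated with a toy greedy decoder in which each vehicle's capacity and cost attributes act as the "prompt" conditioning customer selection. This is a hand-written sketch, not the paper's learned encoder/decoder: the fleet, customers, and distance-based score are all invented for illustration.

```python
import math

# Toy single-stage autoregressive decode for a heterogeneous fleet.
# The learned cross-semantic encoder / multi-view decoder is replaced by a
# simple hand-written score; every value here is illustrative.

depot = (0.0, 0.0)
customers = {1: ((2, 1), 3), 2: ((1, 4), 2), 3: ((4, 3), 5)}  # id -> (xy, demand)
fleet = [   # (capacity, fixed_cost, cost_per_km): the "vehicle prompt"
    (6, 10.0, 1.0),
    (8, 15.0, 0.8),
]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

routes, unserved = [], set(customers)
for cap, fixed, per_km in fleet:          # joint dispatching and routing
    if not unserved:
        break
    pos, load, route = depot, cap, []
    while True:
        # score feasible customers conditioned on the vehicle prompt
        feas = [c for c in unserved if customers[c][1] <= load]
        if not feas:
            break
        nxt = min(feas, key=lambda c: per_km * dist(pos, customers[c][0]))
        route.append(nxt)
        load -= customers[nxt][1]
        pos = customers[nxt][0]
        unserved.discard(nxt)
    routes.append(route)

print(routes, unserved)
```

In the real framework the per-step scores come from the dual-attention decoder rather than raw distances, but the control flow (condition on the active vehicle, pick the next node, repeat) is the same single-stage loop.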
Results
The VaP-CSMV framework significantly outperforms existing state-of-the-art DRL-based neural solvers and achieves competitive solution quality compared to traditional heuristic solvers. It also reduces inference time to mere seconds and exhibits robust zero-shot generalization capabilities on large-scale and previously unseen problem variants.
Implications
The findings suggest that the proposed framework can enhance the efficiency of logistics operations by optimizing vehicle routing in heterogeneous fleets. Its ability to generalize across different problem variants could facilitate real-time applications in logistics and supply chain management.
Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO
Reinforcement Learning
- Introduces Territory Paint Wars as a benchmark for studying competitive MARL failure modes.
- Identifies five critical implementation failure modes affecting PPO performance.
- Characterizes competitive overfitting, where self-play performance does not indicate generalization ability.
- Proposes opponent mixing as a simple yet effective solution to improve generalization.
Summary
This paper introduces Territory Paint Wars, a competitive multi-agent reinforcement learning (MARL) environment designed to investigate the failure modes of Proximal Policy Optimization (PPO) during self-play. The study reveals that an agent trained for 84,000 episodes achieves only a 26.8% win rate against a random opponent, highlighting significant implementation-level failures such as reward-scale imbalance and ineffective long-horizon credit assignment. After addressing these issues, the authors identify a new phenomenon termed competitive overfitting, where agents perform well in self-play but fail to generalize against random opponents, leading to a win rate drop from 73.5% to 21.6%. To mitigate this, the authors propose opponent mixing, which substitutes a portion of training episodes with a fixed random policy, successfully restoring generalization to 77.1%. The findings emphasize the necessity of opponent diversity in competitive settings and provide a reproducible benchmark for further research in MARL.
Methodology
The authors developed a deterministic, zero-sum two-player grid game in Unity and systematically tested various configurations of PPO agents through controlled ablations. They identified and corrected implementation failures, then analyzed the effects of these corrections on agent performance, particularly focusing on the emergence of competitive overfitting and the effectiveness of opponent mixing.
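The opponent-mixing schedule itself is simple enough to sketch: with some fixed probability, an episode's opponent is a frozen random policy rather than the current self-play agent. A minimal illustration, with hypothetical policies, a made-up 4-action space, and a made-up 30% mixing rate:

```python
import random

def random_policy(_obs):
    return random.randrange(4)        # frozen uniform policy over 4 toy actions

def self_play_policy(obs):
    return obs % 4                    # stand-in for the current learned agent

def pick_opponent(mix_prob, rng):
    """Opponent mixing: a fraction of episodes face a fixed random policy."""
    return random_policy if rng.random() < mix_prob else self_play_policy

rng = random.Random(42)
opponents = [pick_opponent(0.3, rng) for _ in range(10_000)]
frac_random = sum(o is random_policy for o in opponents) / len(opponents)
print(round(frac_random, 2))          # close to the 0.3 mixing rate
```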
Results
The initial PPO agent achieved a 26.8% win rate against a random opponent. After addressing five identified failure modes, the agent's performance improved, but competitive overfitting was observed, causing a drop in generalization win rate to 21.6%. Implementing opponent mixing restored the generalization win rate to 77.1% (±12.6%), demonstrating the importance of opponent diversity in training.
Implications
The findings suggest that maintaining a diverse set of opponents is crucial for developing robust agents in competitive MARL settings. The open-sourcing of Territory Paint Wars provides a valuable resource for researchers to explore and address similar failure modes in multi-agent systems.
Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks
Optimization
Efficient ML
Theory
- Introduction of curvature-aware optimization techniques for PINNs.
- Demonstration of improved convergence and accuracy on complex differential equations.
- Comparison of new PINN methods against high-order numerical methods.
- Addressing scalability for batched training in large data-driven problems.
Summary
This paper presents advanced optimization strategies aimed at enhancing the convergence and accuracy of Physics-Informed Neural Networks (PINNs) when solving complex partial and ordinary differential equations. The authors introduce curvature-aware optimization techniques, including Natural Gradient (NG), Self-Scaling BFGS, and Broyden optimizers, which are designed to better exploit the geometric properties of the loss landscape. The study evaluates these optimizers on various challenging problems, such as the Helmholtz equation, Stokes flow, and stiff ODEs relevant to pharmacokinetics. Additionally, the authors propose new PINN-based methods for specific equations and compare their performance against high-order numerical methods, demonstrating the effectiveness of their approaches. The paper also addresses the scalability of these optimizers for batched training, making them suitable for large-scale data-driven applications. Overall, the work highlights the importance of incorporating curvature information into optimization processes to improve the reliability and speed of training PINNs.
Methodology
The authors implemented advanced optimization strategies, including Natural Gradient, Self-Scaling BFGS, and Broyden methods, to enhance the training of PINNs. They conducted empirical studies to evaluate the performance of these optimizers on various differential equations and proposed new methods for solving specific equations. The study also focused on the scalability of these optimizers for batched training.
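A minimal sketch of the kind of curvature-aware step these optimizers take, assuming a least-squares residual loss: a damped Gauss-Newton (natural-gradient-style) update solves a linear system built from the residual Jacobian instead of following the raw gradient. The toy linear fit below is illustrative only, not the paper's PINN setup:

```python
import numpy as np

# Curvature-aware update sketch: for a residual loss L(theta) = 0.5*||r(theta)||^2,
# a damped Gauss-Newton step solves (J^T J + lam*I) delta = J^T r rather than
# stepping along the raw gradient J^T r.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0                                # toy target: y = a*x + b

theta = np.zeros(2)                              # parameters (a, b)
for _ in range(3):
    r = theta[0] * x + theta[1] - y              # residuals r(theta)
    J = np.stack([x, np.ones_like(x)], axis=1)   # Jacobian dr/dtheta
    G = J.T @ J + 1e-8 * np.eye(2)               # damped curvature matrix
    theta -= np.linalg.solve(G, J.T @ r)         # preconditioned step

print(theta)   # close to [2.0, 1.0]
```

Because the toy problem is linear, the curvature-preconditioned step recovers the exact solution almost immediately; on a real PINN loss the same structure gives the improved conditioning the paper exploits.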
Results
The proposed curvature-aware optimization techniques significantly improved the convergence rates and accuracy of PINNs when applied to challenging problems. The new PINN methods demonstrated competitive performance compared to traditional high-order numerical methods, validating the effectiveness of incorporating curvature information into the optimization process.
Implications
The findings suggest that curvature-aware optimization can lead to more efficient and accurate solutions in scientific machine learning, particularly in fields requiring the solution of complex differential equations. This could have broad applications in engineering, physics, and biomedical fields where accurate modeling of physical phenomena is crucial.
ReLU Networks for Exact Generation of Similar Graphs
Generative Models
Graph Learning
Theory
- Introduces ReLU networks for exact graph generation within specified edit distances.
- Eliminates reliance on training data, ensuring validity of generated graphs.
- Demonstrates scalability and exactness for large graphs (up to 1400 vertices).
- Outperforms existing generative models in meeting edit distance constraints.
Summary
This paper addresses the challenge of generating graphs that are constrained by a specified graph edit distance from a source graph, which is crucial in fields such as cheminformatics and network anomaly synthesis. The authors propose a novel approach using ReLU neural networks that can deterministically generate graphs within a bounded edit distance, eliminating the reliance on training data that characterizes existing models. They theoretically demonstrate the existence of constant depth and O(n²d) size ReLU networks capable of generating valid graphs with n vertices and an edit distance d. Experimental results show that the proposed networks can generate valid graphs for instances with up to 1400 vertices and edit distances up to 140, outperforming baseline models like GraphRNN and GraphGDP, which fail to meet the edit distance constraints. This work provides a theoretical foundation for constructing compact generative models that guarantee the validity of generated graphs, marking a shift from probabilistic sampling to exact synthesis under similarity constraints.
Methodology
The authors theoretically characterize ReLU neural networks that can generate graphs within a specified graph edit distance. They demonstrate the existence of networks with constant depth and polynomial size that deterministically produce valid graphs. The methodology includes experimental evaluations to validate the performance of the proposed networks against existing models.
Results
The proposed ReLU networks successfully generated valid graphs for instances with up to 1400 vertices and edit distances up to 140. In contrast, baseline models such as GraphRNN and GraphGDP failed to generate any graphs that satisfied the desired edit distance constraints, highlighting the effectiveness of the proposed approach.
Implications
This research has significant implications for applications requiring exact graph generation, such as molecule design and network analysis. By providing a method for generating graphs with guaranteed structural validity, it opens new avenues for rigorous applications in bioinformatics and other fields where precise graph representations are essential.
The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
Theory
- Identifies the Geometric Alignment Tax, the geometric distortion incurred when continuous manifolds are forced into discrete categorical representations, as a fundamental limitation of foundation models.
- Replacing cross-entropy with continuous objectives significantly improves geometric stability.
- Learned codebooks can worsen geometric stability despite better reconstruction.
- Three distinct failure regimes in biological foundation models were identified: Local-Global Decoupling, Representational Compression, and Geometric Vacuity.
Summary
This paper investigates the limitations of foundation models in biology and physics, particularly their failure to maintain the continuous geometry of the systems they model due to the Geometric Alignment Tax. This tax arises from the necessity of converting continuous manifolds into discrete categorical representations, leading to significant geometric distortion. Through controlled experiments on synthetic dynamical systems, the author demonstrates that using continuous objectives instead of cross-entropy can reduce geometric distortion by up to 8.5 times. The study also reveals that learned codebooks can exacerbate geometric instability despite improving reconstruction quality. The paper evaluates 14 biological foundation models using rate-distortion theory and MINE, identifying three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. The findings suggest that no model can achieve low distortion, high mutual information, and global coherence simultaneously, highlighting the intrinsic challenges in modeling continuous systems with discrete methods.
Methodology
The study employs controlled synthetic experiments using three architectures (Transformer, SSM, hybrid) trained on datasets with known continuous geometries. The evaluation of geometric stability is conducted through a standardized harness that computes Representational Dissimilarity Matrices (RDMs) and assesses various metrics to quantify geometric preservation under perturbations.
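The RDM-based stability check can be sketched as follows: build pairwise-distance matrices for clean and perturbed representations and correlate their upper triangles. The data below is synthetic, and the specific metric choices (Euclidean distances, Pearson correlation) are assumptions rather than the paper's exact harness:

```python
import numpy as np

def rdm(reps):
    """Representational Dissimilarity Matrix: pairwise Euclidean distances."""
    sq = np.sum(reps ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * reps @ reps.T
    return np.sqrt(np.maximum(d2, 0.0))

def rdm_agreement(a, b):
    """Correlation of the RDMs' upper triangles (1 = geometry preserved)."""
    iu = np.triu_indices_from(a, k=1)
    return np.corrcoef(a[iu], b[iu])[0, 1]

rng = np.random.default_rng(0)
reps = rng.normal(size=(20, 8))                      # clean representations
perturbed = reps + 0.01 * rng.normal(size=reps.shape)  # small perturbation
unrelated = rng.normal(size=reps.shape)              # geometry destroyed

stable = rdm_agreement(rdm(reps), rdm(perturbed))
broken = rdm_agreement(rdm(reps), rdm(unrelated))
print(round(stable, 3), round(broken, 3))   # near 1 vs. near 0
```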
Results
The results indicate that under continuous objectives, the geometric stability of different architectures varies by 1.3 times, while under discrete tokenization, the divergence can reach 3,000 times. The experiments confirm that finer quantization in learned codebooks leads to increased geometric instability. The analysis of 14 foundation models reveals three failure regimes, emphasizing the limitations of current models in maintaining geometric fidelity.
Implications
The findings have significant implications for the design and evaluation of foundation models in scientific domains, suggesting a need for approaches that better preserve continuous geometries. This could lead to advancements in model architectures and training methodologies that enhance the fidelity of representations in biological and physical systems.
Expectation Maximization (EM) Converges for General Agnostic Mixtures
Theory
Optimization
- The paper extends the EM algorithm to a general agnostic setting, allowing for arbitrary parametric functions.
- Gradient EM is proposed as a modification of the traditional EM algorithm, focusing on exponential convergence to loss minimizers.
- The framework encompasses various problems, including mixed linear classifiers and generalized linear regression.
- Convergence results are established under proper initialization and separation conditions, highlighting the robustness of the approach.
Summary
This paper investigates the convergence properties of the Expectation Maximization (EM) algorithm in the context of agnostic learning, particularly for fitting parametric functions to data points without assuming a generative model. The author extends the existing work on mixed linear regression to a broader class of problems by allowing arbitrary parametric functions and strongly convex, smooth loss functions. The proposed method, termed gradient EM, is shown to converge exponentially to population loss minimizers under certain initialization and separation conditions. This work demonstrates the effectiveness of EM-type algorithms in non-generative settings, expanding their applicability beyond traditional linear regression problems.
Methodology
The author employs a gradient EM algorithm that iteratively refines parameter estimates based on observed data, utilizing a soft-min loss function that is related to the optimal maximum likelihood loss. The analysis focuses on the convergence properties of this algorithm in the agnostic learning framework, where no generative model is assumed.
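One plausible gradient-EM iteration for a two-component mixture with squared loss can be sketched as: an E-step that forms soft-min responsibilities from per-component losses, then an M-step that takes a gradient step on the responsibility-weighted loss. The temperature, learning rate, and data below are illustrative choices, not the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
w_true = np.where(rng.random(200) < 0.5, 3.0, -2.0)   # two linear components
y = w_true * x                                        # no generative noise model assumed

w = np.array([1.0, -1.0])      # initialization on the right side of separation
lr, beta = 0.1, 5.0
for _ in range(100):
    resid = w[None, :] * x[:, None] - y[:, None]      # (n, 2) per-component residuals
    losses = resid ** 2
    shifted = losses - losses.min(axis=1, keepdims=True)  # stabilize the exp
    z = np.exp(-beta * shifted)
    resp = z / z.sum(axis=1, keepdims=True)           # E-step: soft-min responsibilities
    grad = 2.0 * (resp * resid * x[:, None]).mean(axis=0)
    w -= lr * grad                                    # M-step: gradient step

print(np.sort(w))   # close to the true slopes [-2, 3]
```

The soft-min weighting is what connects this to the paper's soft-min loss: as beta grows, each point's gradient is dominated by its best-fitting component.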
Results
The paper shows that with appropriate initialization and separation conditions, the iterates of the gradient EM algorithm converge exponentially to the population loss minimizers with high probability. This result indicates that the proposed method is effective for a wide range of problems beyond the traditional mixed linear regression.
Implications
The findings have significant implications for machine learning applications where generative models are not available or applicable. The ability to fit complex parametric functions using EM-type algorithms can enhance performance in various domains, including regression, classification, and other predictive modeling tasks.
Convolutional Neural Network and Adversarial Autoencoder in EEG images classification
Computer Vision
- Combination of computer vision and neural networks for EEG classification.
- Development of a dataset of 2D EEG topograms from raw EEG signals.
- Implementation of a CNN architecture tailored for EEG image classification.
- Successful classification of motor cortex activities during hand movements.
Summary
This paper explores the application of computer vision algorithms combined with neural network methods for classifying EEG data, specifically focusing on human brain activity during hand movements. The authors pre-processed raw EEG signals to create 2D EEG topograms and employed both supervised and semi-supervised neural networks to classify different motor cortex activities. The study involved 15 healthy male participants, and EEG data was collected during specific hand movement tasks. The authors utilized wavelet analysis techniques to convert raw EEG signals into topographical images, which were then classified using a Convolutional Neural Network (CNN). The dataset consisted of 939 images, split into training and testing sets. The CNN architecture included multiple convolutional and pooling layers, with ReLU and Softmax activation functions. The results demonstrated the effectiveness of the proposed methods in classifying EEG images, indicating a promising approach for brain activity analysis in neuroscience.
Methodology
The methodology involved collecting EEG data from 15 participants, preprocessing the signals to create 2D topograms, and employing a CNN for classification. The CNN architecture included four convolutional layers, pooling layers, and fully connected layers, with ReLU and Softmax activation functions utilized for performance optimization.
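The building blocks of such a classifier (one convolution, ReLU, max-pooling stage, and a softmax head) can be sketched in plain NumPy. The paper's network stacks four convolutional layers on 2D topograms, so this is a single illustrative stage with random weights, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernel):
    """'Valid' 2-D convolution (cross-correlation, as in most CNN layers)."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2(img):
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return img[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

topogram = rng.random((16, 16))              # stand-in 2-D EEG topogram
kernel = rng.normal(size=(3, 3))
feat = maxpool2(np.maximum(conv2d(topogram, kernel), 0.0))  # conv -> ReLU -> pool
logits = feat.ravel() @ rng.normal(size=(feat.size, 3))     # fully connected head
probs = softmax(logits)                                     # 3 toy classes
print(feat.shape, probs)
```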
Results
The CNN achieved effective classification of EEG images, demonstrating the potential of the proposed approach. The dataset comprised 939 images, with an 80/20 split for training and testing, leading to satisfactory classification results.
Implications
The findings suggest that integrating computer vision techniques with neural networks can enhance the analysis of EEG data, potentially leading to better understanding and classification of brain activities. This approach may have applications in neuroscience research, brain-computer interfaces, and clinical diagnostics.
Is Prompt Selection Necessary for Task-Free Online Continual Learning?
Efficient ML
Theory
- Prompt selection strategies in task-free OCL often yield suboptimal results.
- The proposed SinglePrompt framework eliminates the need for prompt selection.
- SinglePrompt focuses on classifier optimization with a single prompt per self-attention block.
- The framework employs cosine similarity-based logit design to reduce forgetting.
Summary
This paper addresses the challenges of task-free online continual learning (OCL), where data arrives in a non-stationary stream without clear task boundaries. While recent approaches have utilized prompt selection strategies to adaptively choose prompts based on input signals, the authors observe that these strategies often fail to select appropriate prompts, leading to suboptimal performance. To overcome this limitation, the authors propose a novel framework called SinglePrompt, which eliminates the need for prompt selection and focuses on optimizing the classifier. The SinglePrompt method involves injecting a single prompt into each self-attention block, employing a cosine similarity-based logit design to mitigate forgetting effects, and masking logits for unexposed classes in the current mini-batch. The authors demonstrate that this simplified approach achieves state-of-the-art performance across various online continual learning benchmarks, suggesting that prompt selection may not be necessary for effective continual learning in task-free settings.
Methodology
The authors introduce the SinglePrompt framework, which consists of three main components: injecting a single prompt into each self-attention block, using a cosine similarity-based logit design to alleviate forgetting, and masking logits for classes not exposed in the current mini-batch. This approach simplifies the continual learning process by removing the complexities associated with prompt selection.
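Two of these components, cosine-similarity logits and the masking of classes not exposed in the current mini-batch, can be sketched directly. The scale factor, dimensions, and labels below are hypothetical:

```python
import numpy as np

def cosine_logits(features, prototypes, scale=16.0):
    """Logits as scaled cosine similarity between features and class prototypes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return scale * f @ p.T

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))       # mini-batch of 4 embeddings
prototypes = rng.normal(size=(10, 8))    # classifier weights for 10 classes
batch_labels = np.array([2, 2, 5, 7])    # classes exposed in this mini-batch

logits = cosine_logits(features, prototypes)
masked = np.full_like(logits, -np.inf)   # unexposed classes contribute nothing
present = np.unique(batch_labels)
masked[:, present] = logits[:, present]
print(np.isfinite(masked[0]))            # True only at exposed class columns
```

Bounding logits by the cosine range (here, plus or minus the scale) is what limits drift on old classes, and masking keeps the loss from penalizing classes absent from the stream.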
Results
The SinglePrompt framework outperforms existing prompt selection methods and achieves state-of-the-art results on various online continual learning benchmarks, demonstrating its effectiveness in handling task-free scenarios.
Implications
The findings suggest that continual learning models can be simplified by removing the prompt selection process, potentially leading to more efficient and robust learning systems in dynamic environments. This has implications for real-world applications where data is continuously streamed without clear task boundaries.
The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Models
Theory
- Introduces the UNDO Flip-Flop task to evaluate reversible semantic state retrieval in SSMs.
- Demonstrates that existing models fail to learn the required stack-based rollback mechanism.
- Establishes a systematic failure in retrieving historical states under adversarial conditions.
- Highlights the gap between theoretical expressibility of models and their practical learning outcomes.
Summary
This paper introduces the UNDO Flip-Flop task, designed to evaluate the ability of state space models (SSMs) to manage reversible semantic states. While existing benchmarks like the standard Flip-Flop task and Dyck languages assess monotonic state tracking and structural nesting, they do not address the retrieval of historical states under non-monotonic updates. The UNDO Flip-Flop extends the standard Flip-Flop by incorporating a stack-based pop operator, requiring models to retrieve specific historical values. The author evaluates the Mamba-2 architecture, both in one-layer and two-layer configurations, and finds that neither configuration successfully implements the required stack-based rollback mechanism. Instead, both converge on a local toggle heuristic that merely inverts the current state rather than retrieving stored history. This failure is systematic, as evidenced by a significant drop in accuracy under adversarial conditions, highlighting a critical distinction between theoretical expressibility and what can be reliably learned through gradient descent. The findings suggest that while SSMs theoretically possess the capacity for complex state management, practical training outcomes may not align with these capabilities.
Methodology
The paper employs the UNDO Flip-Flop task to assess the performance of the Mamba-2 architecture in managing reversible semantic states. The evaluation involves training one-layer and two-layer configurations of Mamba-2 on the task and analyzing their ability to retrieve historical states under non-monotonic updates. The study includes causal ablation to identify bottlenecks in retrieval versus storage capabilities.
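A data generator for a task of this shape is easy to sketch: write operations push a bit onto a stack, UNDO pops, and the correct answer after each operation is the stack top, so a model that merely toggles the current state fails whenever the popped history disagrees. The op names and probabilities here are illustrative, not the paper's exact specification:

```python
import random

def gen_undo_flipflop(n_ops, rng):
    """Generate an UNDO flip-flop sequence: WRITE pushes a bit, UNDO pops.

    The target after each op is the stack top, so correct behavior requires
    retrieving stored history, not just inverting the current state.
    """
    stack, ops, answers = [rng.randrange(2)], [], []
    for _ in range(n_ops):
        if stack[1:] and rng.random() < 0.4:   # keep the initial state pushed
            ops.append("UNDO")
            stack.pop()
        else:
            bit = rng.randrange(2)
            ops.append(f"WRITE {bit}")
            stack.append(bit)
        answers.append(stack[-1])
    return ops, answers

rng = random.Random(0)
ops, answers = gen_undo_flipflop(8, rng)
print(list(zip(ops, answers)))
```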
Results
The evaluation reveals that neither the one-layer nor the two-layer Mamba-2 models successfully acquire the stack-based rollback mechanism. Both configurations resort to a local toggle heuristic, resulting in a significant accuracy drop to 41.10% under adversarial retraction pressure, which is below random chance. Causal ablation indicates that the primary issue lies in the retrieval process rather than storage.
Implications
The findings underscore the importance of distinguishing between theoretical capabilities of machine learning architectures and their practical learning outcomes. This has implications for the design of future models and training algorithms, particularly in tasks requiring complex memory management and state retrieval.
Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings
Generative Models
Graph Learning
Theory
- Introduces Linearization Uncertainty (LU) as a metric for evaluating the consistency of likelihoods across different linearizations of graphs.
- Demonstrates that biased linearization strategies can lead to lower NLL but higher calibration errors under random permutations.
- Shows that LU correlates better with molecular stability than NLL, suggesting it is a more reliable quality measure for generated graphs.
Summary
This paper addresses the calibration of autoregressive graph generators, which assign likelihoods through a sequential construction process. The authors highlight the importance of consistent likelihoods across different linearizations of the same graph. They introduce the Segmented Eulerian Neighborhood Trails (SENT) encoding method, which allows for multiple equivalent linearizations of a graph. The study quantifies the discrepancies in negative log-likelihood (NLL) across these linearizations using a new metric called Linearization Uncertainty (LU). The authors demonstrate that biased linearization strategies can lead to lower NLL in their native order but result in significantly higher expected calibration error (ECE) when subjected to random permutations. Their experiments on the QM9 molecular graph benchmark reveal that LU provides a more reliable measure of generated molecular quality compared to NLL alone, indicating that models may overfit to their training linearization rather than capturing the underlying graph structure.
Methodology
The authors utilize the SENT encoding to linearize graphs into sequences and evaluate four distinct linearization strategies: Random Order, Min-Degree First, Max-Degree First, and Anchor Expansion. They compute Linearization Uncertainty (LU) by measuring the coefficient of variation of NLL across different linearizations and assess calibration using Expected Calibration Error (ECE).
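Since LU is the coefficient of variation of NLL across equivalent linearizations, it reduces to a few lines. The NLL values below are invented purely to contrast a model that scores all linearizations consistently with one overfit to its training order:

```python
import statistics

def linearization_uncertainty(nlls):
    """LU: coefficient of variation of NLL across equivalent linearizations."""
    return statistics.pstdev(nlls) / statistics.fmean(nlls)

consistent = [12.1, 12.0, 12.2, 11.9]   # near-identical likelihoods
overfit = [8.0, 14.5, 13.9, 15.2]       # low NLL only in the training order

print(round(linearization_uncertainty(consistent), 3))
print(round(linearization_uncertainty(overfit), 3))
```

A low LU says the model assigns (nearly) the same likelihood to every linearization of the same graph, which is the permutation-consistency property the paper argues NLL alone cannot certify.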
Results
The study finds that biased orderings yield lower NLL on their native order but exhibit significantly higher ECE under random permutations. On the QM9 dataset, the correlation between generated graphs' NLL and molecular stability is AUC = 0.43, while LU achieves AUC = 0.85, indicating that LU is a more effective metric for evaluating the quality of generated molecules.
Implications
The findings suggest that autoregressive graph generators need to be calibrated properly to ensure reliable likelihoods across different linearizations. The introduction of LU could improve the evaluation of generative models in graph learning, particularly in applications involving molecular generation and other permutation-invariant structures.
Beauty in the Eye of AI: Aligning LLMs and Vision Models with Human Aesthetics in Network Visualization
Computer Vision
Large Language Models
Generative Models
- A large-scale user study collected human-preferred visualizations for 11,531 graphs, creating a benchmark dataset.
- Prompt engineering significantly improved the alignment of LLM outputs with human aesthetic preferences.
- Vision Models can achieve alignment with human preferences comparable to human-human agreement.
- The findings indicate the feasibility of using AI as a scalable proxy for human labelers in network visualization.
Summary
This paper addresses the challenge of aligning network visualization with human aesthetic preferences, moving beyond traditional heuristic metrics that often fail to capture user satisfaction. The authors conducted a large-scale user study involving 27 participants to collect human preference labels for 11,531 graphs, resulting in a benchmark dataset of 64,436 labels. They explored the use of Large Language Models (LLMs) and Vision Models (VMs) as proxies for human judgment in visualizations. The study revealed that while humans share common aesthetic preferences, there is significant diversity in individual choices. The authors employed prompt engineering techniques to enhance LLM performance, achieving alignment with human preferences comparable to human-human agreement. Additionally, they demonstrated that VMs can be trained to align with human aesthetics effectively. The findings suggest that AI can serve as a scalable alternative to human labelers, potentially reducing the costs associated with large-scale data collection in network visualization.
Methodology
The authors conducted a user study with 27 participants to gather human preference labels for various graph visualizations. They utilized LLMs and VMs, employing prompt engineering and confidence score filtering to enhance alignment with human preferences. The study involved analyzing the effectiveness of these models in predicting human aesthetic judgments based on the collected data.
Results
The study found that LLMs could achieve alignment with human preferences that closely matched human-human agreement through effective prompt engineering. VMs trained with a focus on human aesthetic diversity also demonstrated comparable alignment. The results indicate that AI can effectively emulate human judgment in network visualization, significantly reducing the need for extensive human labeling.
Implications
The research suggests that AI can streamline the process of generating aesthetically pleasing and informative network visualizations, making it more efficient and cost-effective. This could lead to advancements in various applications that rely on graph visualization, such as social networks, biological data representation, and transport systems.
Weight-Informed Self-Explaining Clustering for Mixed-Type Tabular Data
Interpretability
- Introduction of WISE framework for clustering mixed-type tabular data.
- Development of Binary Encoding with Padding (BEP) for feature alignment.
- Implementation of Leave-One-Feature-Out (LOFO) for diverse feature weighting.
- Creation of Discriminative FreqItems (DFI) for coherent explanations.
Summary
The paper addresses the challenges of clustering mixed-type tabular data, which combines numerical and categorical attributes, by proposing a novel framework called WISE (Weight-Informed Self-Explaining). Traditional methods struggle with representation misalignment, uneven feature relevance, and disconnected explanations. WISE integrates representation, feature weighting, clustering, and interpretation into a unified, unsupervised pipeline. Key innovations include Binary Encoding with Padding (BEP) for aligning heterogeneous features, a Leave-One-Feature-Out (LOFO) strategy for diverse feature weighting, and Discriminative FreqItems (DFI) for providing coherent, instance-level explanations that are consistent across clusters. The framework ensures intrinsic interpretability and allows for a more nuanced understanding of the clustering process. Extensive experiments on six real-world datasets demonstrate that WISE outperforms classical and neural network baselines in clustering quality while maintaining efficiency and producing human-interpretable explanations.
Methodology
The WISE framework employs a combination of Binary Encoding with Padding (BEP) to unify feature representation, a Leave-One-Feature-Out (LOFO) strategy to derive diverse feature weights, and a two-stage weight-aware clustering procedure to aggregate semantic partitions. Discriminative FreqItems (DFI) are utilized to provide consistent and interpretable explanations that align with the clustering process.
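One plausible reading of the BEP step, mapping each categorical value to the binary digits of its category index and zero-padding to a shared width so heterogeneous features occupy aligned slots, can be sketched as follows. This is an illustrative interpretation of the description above, not the authors' exact encoding:

```python
def bep_encode(columns):
    """Illustrative Binary Encoding with Padding: each categorical column
    becomes the binary digits of its value's index within that column's
    vocabulary, zero-padded so every feature spans the same number of bits."""
    vocabs = [sorted(set(col)) for col in columns]
    width = max(max(len(v) - 1, 1).bit_length() for v in vocabs)
    rows = []
    for record in zip(*columns):
        bits = []
        for value, vocab in zip(record, vocabs):
            idx = vocab.index(value)
            bits += [int(b) for b in format(idx, f"0{width}b")]
        rows.append(bits)
    return rows, width

color = ["red", "green", "blue", "red"]
size = ["S", "L", "S", "M"]
encoded, width = bep_encode([color, size])
print(width, encoded)
```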
Results
The experiments conducted on six real-world datasets show that WISE consistently outperforms classical clustering methods and neural network approaches in terms of clustering quality. Additionally, it provides faithful and interpretable explanations that are grounded in the same features driving the clustering.
Implications
The WISE framework has significant implications for exploratory data analysis in various fields such as demography, health records, and financial risk assessment. Its ability to provide interpretable clustering results can enhance decision-making processes and facilitate better understanding of data structures.
PCA-Driven Adaptive Sensor Triage for Edge AI Inference
Time Series
Efficient ML
Optimization
- PCA-Triage optimizes bandwidth allocation by adapting sampling rates based on sensor data correlations.
- The algorithm runs efficiently with zero trainable parameters and operates within a strict time budget.
- Empirical results show PCA-Triage achieves superior fault detection performance compared to existing methods.
- The method is robust against data loss and noise, making it suitable for industrial IoT applications.
Read more
PCA-Driven Adaptive Sensor Triage for Edge AI Inference
Summary
This paper introduces PCA-Triage, a novel streaming algorithm designed to optimize bandwidth allocation in multi-channel sensor networks within industrial IoT settings. The algorithm leverages incremental Principal Component Analysis (PCA) to determine proportional sampling rates for each sensor channel under a defined bandwidth budget. By focusing on the correlation structure of sensor data, PCA-Triage effectively prioritizes informative channels while reducing the sampling rates of redundant ones. The algorithm operates in O(wdk) time complexity, requires no trainable parameters, and makes decisions in approximately 0.67 ms. Evaluations across seven benchmarks demonstrate that PCA-Triage outperforms nine baseline methods, achieving significant improvements in fault detection accuracy, particularly at constrained bandwidths. The paper also discusses the robustness of PCA-Triage against packet loss and sensor noise, highlighting its potential for real-time applications in edge AI environments.
Methodology
The authors developed PCA-Triage, a streaming algorithm that utilizes incremental PCA to derive per-channel sampling rates based on the correlation structure of sensor data. The algorithm is unsupervised and operates under a bandwidth budget, updating sampling rates dynamically as new data arrives. The authors also introduced targeted extensions to enhance performance, including hybrid PCA and variance importance scoring.
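A hedged sketch of the core allocation step may help: score each channel by its loading energy on the top-k principal components of a recent window, then split a bandwidth budget proportionally. This is not the authors' exact streaming algorithm (it uses a batch SVD on one window rather than incremental PCA, and the floor/cap allocation rule is an assumption).

```python
# Simplified PCA-based channel triage: variance-weighted loadings -> rates.
import numpy as np

def pca_triage_rates(window, budget=0.5, k=3, floor=0.05):
    """window: (w, d) recent samples; returns per-channel rates in [floor, 1]."""
    Xc = window - window.mean(axis=0)
    # SVD of the centered window; rows of Vt are principal directions.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / max(len(window) - 1, 1)        # component variances
    k = min(k, len(var))
    # Channel importance: variance-weighted squared loadings on top-k components.
    scores = (var[:k, None] * Vt[:k] ** 2).sum(axis=0)
    scores = scores / scores.sum()
    # Proportional allocation under the budget, with a floor and a cap of 1.
    return np.clip(budget * len(scores) * scores, floor, 1.0)

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
# Channels 0-1 carry a shared informative signal; channels 2-3 are weak noise.
window = np.hstack([base * 3, base * 3 + rng.normal(0, 0.1, (200, 1)),
                    rng.normal(0, 0.1, (200, 2))])
rates = pca_triage_rates(window, budget=0.5)
```

The informative channels end up sampled at full rate while the redundant ones are throttled to the floor, mirroring the prioritization behavior described above.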
Results
PCA-Triage was evaluated on seven benchmarks with varying numbers of sensor channels (8-82) and demonstrated superior performance on three out of six datasets at 50% bandwidth. It achieved an F1 score of 0.961 ± 0.001 on the Tennessee Eastman Process dataset, closely matching full-data performance, and maintained an F1 score above 0.90 at a 30% bandwidth budget. The algorithm's robustness was confirmed, showing only a 3.7-4.8% degradation in performance under worst-case conditions.
Implications
The findings suggest that PCA-Triage can significantly enhance the efficiency of data transmission in bandwidth-constrained environments, particularly in industrial IoT applications where real-time fault detection is critical. Its ability to adaptively prioritize sensor data could lead to more effective monitoring and control systems in various industrial settings.
k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS
Graph Learning
Efficient ML
Theory
- Introduction of k-Maximum Inner Product (k-MIP) attention for graph transformers.
- Achieves linear memory complexity and up to ten-fold speedup over full attention mechanisms.
- Proves that k-MIP transformers can approximate full-attention transformers to arbitrary precision.
- Demonstrates competitive performance on large-scale graph benchmarks.
Read more
k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS
Summary
This paper introduces k-Maximum Inner Product (k-MIP) attention, a novel self-attention mechanism designed for graph transformers, addressing the limitations of traditional graph neural networks (GNNs) in modeling long-range dependencies and overcoming issues like oversquashing. The k-MIP attention mechanism operates by selecting the top-k relevant key nodes for each query, resulting in a sparse attention pattern that maintains linear memory complexity and significantly improves computational efficiency. The authors demonstrate that k-MIP attention can be integrated into the GraphGPS framework without sacrificing expressive power, proving that it can approximate any full-attention transformer to arbitrary precision. Empirical evaluations on various benchmarks, including the Long Range Graph Benchmark and City-Networks, show that k-MIP attention consistently ranks among the top-performing scalable graph transformers, enabling the processing of graphs with over 500k nodes on a single A100 GPU.
Methodology
The authors propose the k-MIP attention mechanism, which dynamically selects the k most influential keys for each query using a top-k operation. This approach is combined with symbolic matrices to achieve linear memory complexity and significant computational speedups. The paper includes a theoretical analysis of the expressive power of k-MIP attention and its integration into the GraphGPS framework, alongside empirical evaluations on multiple graph benchmarks.
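The top-k selection at the heart of k-MIP attention can be illustrated compactly. Note this dense sketch still materializes the full score matrix, so it only shows the sparse attention *pattern*; the paper's memory savings come from avoiding that matrix, which is not reproduced here.

```python
# Minimal dense sketch of k-MIP (top-k inner product) attention.
import numpy as np

def kmip_attention(Q, K, V, k):
    """Q: (n, d), K/V: (m, d). Each query attends to its k largest-score keys."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])        # (n, m) scaled inner products
    # Indices of the top-k keys per query (order within the top-k is irrelevant).
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    rows = np.arange(Q.shape[0])[:, None]
    sel = scores[rows, topk]                      # (n, k) selected scores
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)             # softmax over the k keys only
    return np.einsum('nk,nkd->nd', w, V[topk])    # weighted sum of their values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
out = kmip_attention(Q, K, V, k=3)
```

Setting k equal to the number of keys recovers ordinary full softmax attention exactly, which is the intuition behind the approximation result quoted above.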
Results
The k-MIP attention mechanism enables the processing of large graphs (over 500k nodes) on a single A100 GPU, achieving up to a ten-fold speedup compared to traditional full attention mechanisms. The theoretical analysis confirms that k-MIP transformers retain the expressive power of full-attention transformers, and empirical results show competitive performance across various benchmarks.
Implications
The development of k-MIP attention has significant implications for scalable graph learning, allowing researchers and practitioners to effectively model large-scale graphs while maintaining high performance. This could enhance applications in social networks, molecular biology, and recommendation systems, where large graphs are common.
Blind-Spot Mass: A Good-Turing Framework for Quantifying Deployment Coverage Risk in Machine Learning Systems
Theory
- Introduction of blind-spot mass as a metric for quantifying deployment coverage risk in ML systems.
- Demonstration of how heavy-tailed distributions can lead to coverage blindness in model performance.
- Validation of the framework in two distinct domains: wearable human activity recognition and clinical data analysis.
- Identification of dominant contributors to coverage risk through blind-spot decomposition.
Read more
Blind-Spot Mass: A Good-Turing Framework for Quantifying Deployment Coverage Risk in Machine Learning Systems
Summary
This paper introduces the concept of 'blind-spot mass', a metric based on the Good-Turing estimation framework, aimed at quantifying deployment coverage risk in machine learning systems. The authors highlight that many modern ML systems operate under heavy-tailed distributions, leading to a phenomenon termed 'coverage blindness', where models may perform well on standard test sets but fail in under-supported regions of the operational state space. The proposed blind-spot mass Bn(τ) estimates the total probability mass assigned to states with empirical support below a defined threshold τ, providing insights into the reliability of models in deployment. The authors validate this framework through experiments in wearable human activity recognition (HAR) and the MIMIC-IV hospital database, demonstrating that approximately 95% of deployment probability mass is concentrated in regions with low empirical support. This work emphasizes the importance of understanding coverage risk and provides actionable insights for data collection and model deployment strategies.
Methodology
The authors adapt Good-Turing unseen-event estimation to define the blind-spot mass Bn(τ), which quantifies the probability mass of operational states with empirical support below a threshold τ. They validate this framework through empirical studies in wearable HAR and the MIMIC-IV hospital database, employing frequency-of-frequencies estimators to derive the blind-spot mass and analyze the impact of state refinement on coverage risk.
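The frequency-of-frequencies estimator can be sketched directly from the classic Good-Turing identity: states observed exactly r times carry estimated mass (r+1)·N(r+1)/n, where N(r) counts distinct states seen r times. Summing from r = 0 (never seen) up to the threshold gives a blind-spot mass estimate. This is the unsmoothed textbook estimator, shown for illustration; the paper's exact estimator may differ.

```python
# Good-Turing style blind-spot mass: estimated probability mass of states
# observed fewer than tau times in the sample.
from collections import Counter

def blind_spot_mass(observations, tau):
    counts = Counter(observations)                 # state -> count
    n = sum(counts.values())
    freq_of_freq = Counter(counts.values())        # r -> N(r), frequency of frequencies
    # Mass of states seen r < tau times, including r = 0 (never seen).
    return sum((r + 1) * freq_of_freq.get(r + 1, 0) / n for r in range(tau))

# Toy stream: two heavy states plus a long tail of singletons.
stream = ['a'] * 50 + ['b'] * 30 + [f'rare{i}' for i in range(20)]
mass = blind_spot_mass(stream, tau=5)
```

On this toy stream the 20 singletons contribute all the low-support mass (20/100 = 0.2), while letting tau exceed the largest count recovers a total mass of 1, a useful sanity check on the estimator.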
Results
The empirical validation revealed that approximately 95% of the deployment probability mass is located in regions with low empirical support (below τ=5) in both the HAR dataset and the MIMIC-IV database. This consistent finding across different domains indicates that the blind-spot mass framework is a robust tool for assessing coverage risk in machine learning applications.
Implications
The findings suggest that machine learning practitioners should be aware of the potential for coverage blindness in their models, particularly in applications with heavy-tailed operational distributions. The blind-spot mass framework provides a systematic approach to identify under-supported regions, guiding targeted data collection and model refinement to enhance reliability in deployment.
Context is All You Need
Generative Models
NLP
Computer Vision
- Introduction of CONTXT, a simple method for contextual adaptation in ANNs.
- Demonstrates consistent performance improvements across various tasks without retraining.
- Addresses the limitations of existing domain adaptation methods that are complex and resource-intensive.
- Incorporates insights from neuroscience regarding context processing in biological systems.
Read more
Context is All You Need
Summary
The paper addresses the challenges posed by domain shift in artificial neural networks (ANNs), which can lead to performance degradation when models encounter data distributions different from those seen during training. The authors introduce CONTXT (Contextual augmentation for Neural feature X Transforms), a novel method designed for contextual adaptation that is lightweight and easy to integrate. CONTXT modulates internal representations through simple additive and multiplicative feature transforms, enhancing robustness in both Test-Time Adaptation (TTA) and Domain Generalization (DG) settings. The method demonstrates consistent performance improvements across various discriminative tasks (like ANN/CNN classification) and generative models (such as large language models). By steering information flow without the need for retraining, CONTXT offers a compact solution to the limitations of existing complex and resource-intensive approaches to domain adaptation. The authors draw inspiration from biological systems, particularly the dual-process theory of context representation, to inform their method's design, suggesting that effective contextual adaptation can occur without extensive retraining or complex architectures.
Methodology
The authors developed CONTXT, which utilizes additive and multiplicative transforms to modulate internal neural representations. This approach allows for contextual adaptation during inference without requiring retraining, making it efficient and easy to implement in existing models.
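The scale-and-shift modulation described above can be sketched as follows. How CONTXT actually derives its multiplicative (gamma) and additive (beta) transforms is not detailed in this summary, so the derivation here (matching a frozen layer's activation statistics to a small context batch) is purely an assumption for illustration.

```python
# Assumed sketch of additive/multiplicative feature modulation at inference.
import numpy as np

def contextual_modulation(features, context_features, eps=1e-6):
    """Re-center and re-scale features toward the context batch's statistics."""
    mu_f, sd_f = features.mean(axis=0), features.std(axis=0) + eps
    mu_c, sd_c = context_features.mean(axis=0), context_features.std(axis=0) + eps
    gamma = sd_c / sd_f                  # multiplicative transform
    beta = mu_c - gamma * mu_f           # additive transform
    return features * gamma + beta       # modulated representations

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(256, 16))   # activations under training distribution
shifted = rng.normal(3.0, 2.0, size=(32, 16))   # small batch from the shifted context
adapted = contextual_modulation(source, shifted)
```

The appeal of this family of transforms, as the summary notes, is that they touch only per-feature statistics: no weights are retrained and the overhead is a single scale and shift per layer.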
Results
The CONTXT method showed significant performance gains across multiple tasks, including both discriminative and generative models, under conditions of domain shift. The results indicate that CONTXT can effectively enhance model robustness while maintaining low computational overhead.
Implications
The findings suggest that CONTXT can be widely applied in real-world scenarios where models must adapt to new contexts without extensive retraining, potentially improving the deployment of ANNs in various applications such as image classification, speech recognition, and text generation.
Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
Time Series
- DynLMC generates synthetic multivariate time series with realistic, nonstationary correlation structures.
- The model incorporates time-varying correlations, regime-switching, and lagged dependencies.
- Fine-tuning on DynLMC-generated data improves the robustness and generalization of foundation models for time series (FMTS).
- The approach addresses the limitations of existing synthetic data generators that rely on static correlations.
Read more
Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
Summary
The paper introduces DynLMC, a novel synthetic data generator for multivariate time series that addresses the limitations of existing models which assume static correlations. DynLMC incorporates time-varying, regime-switching correlations and lagged dependencies, enabling the generation of realistic synthetic datasets that reflect the dynamic nature of real-world time series data. The authors demonstrate that fine-tuning three foundational models on data generated by DynLMC leads to significant improvements in zero-shot forecasting across various benchmarks. This highlights the importance of capturing dynamic inter-channel correlations for enhancing the transferability of foundation models for time series (FMTS). The study emphasizes the need for data-centric pretraining to improve the adaptability of forecasting models to real-world signals.
Methodology
DynLMC extends the Linear Model of Coregionalization (LMC) by modeling smooth correlation drift through autoregressive updates, introducing regime-switching correlations using a Hidden Markov Model, and incorporating lagged dependencies. The generative process involves sampling latent Gaussian processes and applying time-varying mixing weights to produce observed channels with realistic correlation dynamics.
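The generative process above can be sketched as a toy simulator. This is a simplified stand-in, not the paper's model: AR(1) signals replace the latent Gaussian processes, the Hidden Markov Model is reduced to a symmetric two-state Markov chain over base mixing matrices, and the drift and lag mechanisms are assumptions chosen for brevity.

```python
# Toy DynLMC-style generator: drifting mixing weights, regime switching,
# and a lagged cross-channel dependency (all simplified assumptions).
import numpy as np

def dynlmc_sample(T=500, n_latent=2, n_channels=3, switch_p=0.02,
                  drift=0.02, lag_coef=0.4, seed=0):
    rng = np.random.default_rng(seed)
    # Latent AR(1) signals standing in for the latent Gaussian processes.
    z = np.zeros((T, n_latent))
    for t in range(1, T):
        z[t] = 0.9 * z[t - 1] + rng.normal(0, 0.3, n_latent)
    # Each regime has its own base mixing matrix (channels x latents).
    regimes = rng.normal(size=(2, n_channels, n_latent))
    state, W = 0, regimes[0].copy()
    X = np.zeros((T, n_channels))
    for t in range(T):
        if rng.random() < switch_p:                 # regime switch
            state = 1 - state
            W = regimes[state].copy()
        W = W + drift * rng.normal(size=W.shape)    # smooth correlation drift
        X[t] = W @ z[t] + 0.1 * rng.normal(size=n_channels)
        if t > 0:
            X[t, 0] += lag_coef * X[t - 1, 1]       # lagged cross-channel term
    return X

X = dynlmc_sample()
```

Because the mixing matrix both drifts and switches, the inter-channel correlation measured on early and late windows of X differs, which is the nonstationary structure that static generators miss.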
Results
The experiments demonstrate that fine-tuning three different multivariate time series forecasting models on DynLMC-generated data results in consistent zero-shot forecasting improvements across nine benchmarks. The results indicate that the incorporation of dynamic inter-channel correlations significantly enhances the transferability and performance of foundation models for time series.
Implications
The findings suggest that synthetic data generation methods that capture dynamic relationships can significantly improve the performance of forecasting models in various domains, including finance, healthcare, and climate. This approach can facilitate better model training and adaptation to real-world scenarios, ultimately leading to more accurate predictions and analyses.