AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
66 papers today · updated every 8 hours · 7 days of history
Improving Feasibility via Fast Autoencoder-Based Projections
Reinforcement Learning
Optimization
Efficient ML
- Introduces a data-driven approach for enforcing complex operational constraints using autoencoders.
- Develops a structured latent representation of the feasible set through adversarial training.
- Demonstrates significant computational efficiency in correcting infeasible predictions.
- Empirical results show near 100% feasibility in constrained optimization tasks and improved safety in reinforcement learning.
Summary
This paper addresses the challenge of enforcing complex operational constraints in learning and control systems, particularly those that are nonconvex. The authors propose a novel data-driven approach that utilizes a trained autoencoder to act as an approximate projector, enabling fast corrections to infeasible predictions. By training the autoencoder with an adversarial objective, the authors create a structured, convex latent representation of the feasible set. This allows for rapid mapping of neural network outputs to feasible points, significantly improving computational efficiency compared to traditional constraint enforcement methods. The proposed method, referred to as FAB (Fast Autoencoder-Based projections), is tested across various constrained optimization and reinforcement learning problems, demonstrating its effectiveness in enforcing constraints with low computational cost. The results indicate that FAB can achieve near-complete feasibility in constrained optimization tasks and provides safer actions in reinforcement learning scenarios compared to existing methods like PPO and TRPO.
Methodology
The authors train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set. This autoencoder serves as a plug-and-play component for standard neural networks, allowing for rapid projection of outputs onto feasible points.
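As a loose sketch of the projection idea (the toy box constraint and the `encode`/`decode` functions below are illustrative stand-ins, not the paper's trained networks):

```python
# Sketch only: FAB replaces an expensive projection onto the feasible set
# with one encoder pass and one decoder pass. Here the "feasible set" is
# the box [-1, 1]^d and the decoder's tanh guarantees feasible outputs by
# construction; in the paper both maps are learned adversarially.
import math

def encode(y):
    # Toy "encoder": clamped inverse tanh. In FAB this is a trained network
    # whose latent image of the feasible set is convex.
    return [math.atanh(max(-0.999, min(0.999, v))) for v in y]

def decode(z):
    # Toy "decoder": tanh maps any latent point into the feasible box,
    # so decode(anything) is feasible.
    return [math.tanh(v) for v in z]

def fab_project(y):
    # Fast approximate projection: two forward passes instead of solving
    # a constrained optimization problem per query.
    return decode(encode(y))

infeasible = [2.5, -3.0]            # violates the box constraint
corrected = fab_project(infeasible)
assert all(-1.0 <= v <= 1.0 for v in corrected)  # now feasible
```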
Results
The empirical validation shows that the FAB method consistently provides efficient approximate projections, achieving nearly 100% feasibility in constrained optimization problems within milliseconds. In reinforcement learning settings, FAB outperforms traditional methods like PPO and TRPO in terms of safety.
Implications
The proposed approach has significant implications for real-world applications in robotics, energy systems, and industrial automation, where enforcing complex constraints efficiently is crucial for safety and usability. It provides a scalable solution for integrating feasibility improvements into existing learning systems.
From Model-Based Screening to Data-Driven Surrogates: A Multi-Stage Workflow for Exploring Stochastic Agent-Based Models
Theory
- Proposes a multi-stage workflow for exploring stochastic Agent-Based Models (ABMs).
- Integrates automated model-based screening with machine learning surrogates.
- Demonstrates methodology using a predator-prey case study.
- Automates the discovery of unstable regions in parameter space.
Summary
This paper addresses the challenges of systematically exploring Agent-Based Models (ABMs), which are often hindered by high dimensionality and stochasticity. The authors propose a multi-stage workflow that integrates experimental design with machine learning surrogates to enhance the exploration of ABMs. The methodology is demonstrated through a predator-prey case study, where the first stage involves automated model-based screening to identify key variables and assess outcome variability. The second stage employs machine learning models to capture nonlinear interactions among the remaining parameters. This approach automates the identification of unstable regions where outcomes are sensitive to complex interactions, providing a robust framework for sensitivity analysis and policy testing in high-dimensional stochastic simulations. The authors emphasize the importance of separating different sources of uncertainty and propose a pipeline that combines machine learning with rigorous uncertainty quantification techniques, ultimately enhancing the credibility and interpretability of ABM predictions.
Methodology
The methodology consists of a two-step process: first, an automated model-based screening is conducted to identify dominant variables and assess outcome variability. Second, machine learning models are trained to map the nonlinear interactions among the remaining parameters, facilitating sensitivity analysis and uncertainty quantification.
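The first screening stage can be sketched as follows; the toy simulator, perturbation size, and threshold below are invented for illustration, not taken from the paper:

```python
# Sketch of stage one (model-based screening): perturb each parameter
# one at a time, average the effect over replicates to separate stochastic
# from parametric variability, and keep only dominant parameters for the
# surrogate stage.
import random
random.seed(0)

def abm(params):
    # Toy stand-in for a stochastic ABM: output depends strongly on p0,
    # weakly on p1, plus simulation noise. The real workflow would run
    # the predator-prey simulator here.
    p0, p1 = params
    return 3.0 * p0 + 0.01 * p1 + random.gauss(0, 0.05)

def screen(base, delta=1.0, reps=20):
    effects = []
    for i in range(len(base)):
        diffs = []
        for _ in range(reps):
            hi = list(base); hi[i] += delta
            lo = list(base); lo[i] -= delta
            diffs.append(abs(abm(hi) - abm(lo)))
        effects.append(sum(diffs) / reps)
    return effects

effects = screen([0.0, 0.0])
dominant = [i for i, e in enumerate(effects) if e > 0.5]
assert dominant == [0]   # p0 survives screening; p1 is dropped
```

Stage two would then fit a machine-learning surrogate over the surviving parameters only.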
Results
The proposed workflow successfully identifies key parameters influencing model outcomes and highlights regions of high sensitivity to parameter changes. The integration of machine learning allows for efficient exploration of complex interactions, improving the understanding of model dynamics and enhancing the robustness of predictions.
Implications
This work provides a systematic approach for researchers to explore ABMs, making it easier to assess the robustness and policy relevance of model predictions. The methodology can be applied across various domains where ABMs are used, such as ecology and socio-environmental systems, ultimately aiding in decision-making processes.
Homophily-aware Supervised Contrastive Counterfactual Augmented Fair Graph Neural Network
Graph Learning
- Introduces a two-phase training strategy for fairness-aware GNNs.
- Enhances the CAF framework by incorporating graph editing and new loss functions.
- Demonstrates improved performance in classification accuracy and fairness metrics.
- Addresses topology bias by manipulating homophily ratios in the graph.
Summary
This paper addresses the critical challenge of fairness in Graph Neural Networks (GNNs), which can be biased due to both node attributes and graph structure. The authors propose a novel model that enhances the Counterfactual Augmented Fair Graph Neural Network (CAF) framework by introducing a two-phase training strategy. In the first phase, the graph is edited to increase homophily with respect to class labels while reducing it with respect to sensitive attribute labels. The second phase integrates a modified supervised contrastive loss and an environmental loss into the optimization process, allowing the model to improve both predictive performance and fairness. The experimental results on five real-world datasets demonstrate that the proposed model outperforms CAF and other state-of-the-art graph-based learning methods in terms of classification accuracy and fairness metrics. This work contributes to the growing field of fair machine learning by providing a hybrid approach that combines pre-processing and in-training techniques to mitigate bias in GNNs.
Methodology
The methodology involves a two-phase training strategy: first, a graph editing process to adjust homophily ratios, and second, the integration of modified supervised contrastive loss and environmental loss during model training. This hybrid approach combines pre-processing and in-training techniques to effectively reduce bias in GNNs.
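A toy version of the phase-one homophily editing (the greedy edge-drop rule here is our simplification for illustration, not the paper's procedure):

```python
# Sketch: measure homophily with respect to a labeling, then edit the
# graph to raise class homophily while lowering sensitive-attribute
# homophily. The drop rule below is an assumed greedy heuristic.
def homophily(edges, labels):
    # Fraction of edges whose endpoints share the same label.
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def edit_graph(edges, y, s):
    # Drop edges that are heterophilous in the class label AND homophilous
    # in the sensitive attribute: each drop pushes both ratios the right way.
    return [(u, v) for u, v in edges
            if not (y[u] != y[v] and s[u] == s[v])]

y = {0: 1, 1: 1, 2: 0, 3: 0}          # class labels
s = {0: 'a', 1: 'a', 2: 'a', 3: 'b'}  # sensitive attribute
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]

edited = edit_graph(edges, y, s)
assert homophily(edited, y) >= homophily(edges, y)
```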
Results
The proposed model consistently outperformed the original CAF framework and several state-of-the-art graph-based learning methods across five real-world datasets, achieving higher classification accuracy and improved fairness metrics.
Implications
The findings suggest that the proposed approach can be effectively applied in high-stakes domains where fairness is critical, such as healthcare and social networks. By ensuring that GNNs produce fair and unbiased predictions, this work contributes to the ethical deployment of machine learning models in sensitive applications.
MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
NLP
Large Language Models
Efficient ML
- MUXQ introduces an auxiliary matrix to mitigate activation outliers in quantization.
- The method enables uniform low-precision INT8 quantization while preserving accuracy.
- Experiments show improved performance over existing quantization techniques like LLM.int8() and SmoothQuant.
- MUXQ is designed to be hardware-friendly, particularly for neural processing units (NPUs).
Summary
The paper introduces MUXQ, a novel quantization framework designed to enhance the efficiency of large language models (LLMs) by addressing the challenges posed by activation outliers during quantization. Traditional quantization methods struggle with the presence of outlier values in activation matrices, which can lead to significant accuracy degradation when converting from floating-point to low-precision integer formats. MUXQ tackles this issue by detecting outlier channels in input activations and incorporating a small auxiliary matrix that redistributes the magnitudes of these outliers across channels. This approach allows for more balanced activation distributions, facilitating effective quantization at low-precision INT levels while maintaining a hardware-friendly computation structure. Experimental results demonstrate that MUXQ consistently achieves lower perplexity compared to naive quantization methods across various scales of GPT-2 models, indicating its potential for efficient and accurate LLM inference on edge devices. The framework not only reduces memory usage but also enhances computational efficiency, making it suitable for deployment in on-device environments where integer operations are preferred.
Methodology
MUXQ employs a mixed-to-uniform quantization approach by detecting outlier channels in activation matrices and adding an auxiliary matrix that redistributes outlier magnitudes. This process balances the activation distribution, allowing for effective quantization at low precision (INT8) while maintaining computational efficiency and accuracy.
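To illustrate why redistributing outlier magnitudes helps uniform per-tensor INT8 quantization, here is a deliberately crude sketch; the per-channel rescaling below is only a stand-in for MUXQ's low-rank auxiliary-matrix construction:

```python
# Sketch: one outlier channel inflates the per-tensor scale and crushes
# the resolution of every other channel; balancing magnitudes first,
# then quantizing, recovers the small channels.
def quantize_int8(xs):
    # Uniform symmetric per-tensor INT8 quantization.
    scale = max(abs(v) for v in xs) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

acts = [0.1, -0.2, 0.15, 50.0]        # one dominant outlier channel
q, s = quantize_int8(acts)
recon = dequantize(q, s)
err_naive = max(abs(a - r) for a, r in zip(acts[:3], recon[:3]))

# Redistribute magnitudes before quantizing (crude per-channel rescaling,
# standing in for the auxiliary matrix), then undo it after dequantization.
scales = [max(abs(v), 1e-8) for v in acts]
balanced = [v / c for v, c in zip(acts, scales)]       # magnitudes ~1
q2, s2 = quantize_int8(balanced)
recon2 = [r * c for r, c in zip(dequantize(q2, s2), scales)]
err_mux = max(abs(a - r) for a, r in zip(acts[:3], recon2[:3]))

assert err_mux < err_naive   # small channels survive quantization
```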
Results
In experiments conducted on GPT-2 models of varying sizes (0.1B, 0.3B, and 0.7B parameters), MUXQ consistently achieved lower perplexity than naive quantization methods. It demonstrated comparable accuracy to mixed-precision approaches while enabling uniform per-tensor INT8 quantization with modest computational overhead.
Implications
MUXQ presents a promising direction for the deployment of large language models in resource-constrained environments, such as mobile devices and embedded systems, where efficient inference and low memory usage are critical. Its ability to maintain accuracy while reducing computational demands could facilitate broader adoption of LLMs in practical applications.
Do We Need Frontier Models to Verify Mathematical Proofs?
NLP
Large Language Models
Theory
- Smaller open-source models can achieve nearly the same accuracy as frontier models in verifying mathematical proofs.
- Self-consistency is a significant challenge for smaller models, being up to 25% less consistent than frontier models.
- Specialized prompt ensembles can significantly enhance the performance of smaller models in proof verification tasks.
- The study provides insights into the capabilities of LLMs for mathematical reasoning and verification.
Summary
This paper investigates the necessity of using frontier large language models (LLMs) for verifying mathematical proofs, particularly in the context of complex problems. The authors systematically evaluate the performance of two frontier models (GPT-5.2 and Gemini 3.1 Pro) and four smaller open-source models (Qwen3.5-35B, Qwen3.5-122B, GPT-OSS-120B, and GPT-OSS-20B) on datasets of human-graded natural language proofs from math competitions. The evaluation focuses on two main metrics: verifier accuracy and self-consistency. The findings reveal that smaller models are only about 10% less accurate than frontier models but exhibit up to 25% less consistency in their judgments. The authors demonstrate that smaller models possess the necessary mathematical capabilities for proof verification but struggle to express these capabilities reliably without tailored prompts. By employing an ensemble of specialized prompts, they enhance the performance of smaller models, achieving up to 9.1% improvement in accuracy and 15.9% in self-consistency, allowing models like Qwen3.5-35B to match the performance of frontier models like Gemini 3.1 Pro. The study concludes that while frontier models excel in solving complex mathematical problems, verifying their solutions does not necessarily require such advanced models.
Methodology
The authors conducted a systematic evaluation of LLMs by testing their ability to verify mathematical proofs across multiple datasets. They measured the models' performance based on balanced accuracy and self-consistency, using human-graded proofs as ground truth. Additionally, they developed an ensemble of specialized prompts to improve the performance of smaller models.
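The prompt-ensemble idea can be sketched with stand-in judges; the `make_judge` rule below is fabricated for illustration, whereas in the paper each judge is one specialized prompt sent to a small LLM:

```python
# Sketch: aggregate several prompt-specialised verifiers by majority vote,
# so one judge's idiosyncratic failure mode is outvoted.
from collections import Counter

def make_judge(biased):
    def judge(proof):
        # Toy rule: a proof is "valid" if it mentions 'induction'; the
        # biased judge flips its answer on proofs containing 'hence'.
        verdict = 'induction' in proof
        if biased and 'hence' in proof:
            verdict = not verdict
        return verdict
    return judge

judges = [make_judge(False), make_judge(False), make_judge(True)]

def ensemble_verdict(proof):
    votes = Counter(j(proof) for j in judges)
    return votes.most_common(1)[0][0]

proof = "By induction on n, hence the claim holds."
assert ensemble_verdict(proof) is True   # majority outvotes the biased judge
```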
Results
The results indicate that smaller models are only about 10% less accurate than frontier models in proof verification, but they are significantly less self-consistent. The introduction of specialized prompt ensembles led to improvements of up to 9.1% in accuracy and 15.9% in self-consistency for smaller models, enabling them to perform comparably to frontier models.
Implications
The findings imply that smaller LLMs can be effectively utilized for mathematical proof verification, potentially reducing the reliance on larger, more resource-intensive models. This could lead to more accessible and efficient verification tools in mathematical reasoning and education.
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
Large Language Models
Efficient ML
- First systematic analysis of outlier patterns in LLMs, identifying three types: Row-wise, Column-wise, and None.
- Introduction of AdaHOP, which adapts Hadamard transforms based on the identified outlier patterns.
- Achieves significant memory savings and acceleration in training while maintaining model quality.
- Demonstrates the importance of tailored strategies for different tensor operations in low-precision training.
Summary
The paper presents AdaHOP, a novel approach to low-precision training (LPT) that addresses the challenges posed by outliers in large language models (LLMs). Traditional methods utilize Hadamard transforms uniformly across layers, which is ineffective due to the varying structures of outliers in different tensors. The authors conduct a systematic analysis of outlier patterns in weights, activations, and gradients, identifying three distinct types: Row-wise, Column-wise, and None. Based on these findings, AdaHOP employs an adaptive strategy that selects the optimal transform for each matrix multiplication, either using Inner Hadamard Transform (IHT) or combining it with selective Outlier Extraction (OE) for dominant outliers. This method not only minimizes quantization error but also ensures training stability and efficiency. The implementation of AdaHOP on AMD CDNA4 architecture demonstrates significant improvements, achieving BF16 training quality at MXFP4 precision, with up to 3.6× memory compression and 1.8× kernel acceleration compared to full-precision training.
Methodology
The authors conducted a systematic study of outlier patterns in LLMs, characterizing them into three types. They developed AdaHOP, which uses an adaptive approach to select the optimal low-precision strategy for matrix multiplications, utilizing Inner Hadamard Transform and selective Outlier Extraction based on the identified outlier patterns. The implementation leverages hardware-aware fused kernels for efficient computation.
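A minimal example of how a Hadamard rotation flattens an outlier channel while leaving the matrix product recoverable; the 4-point transform below illustrates the Inner Hadamard Transform idea, not AdaHOP's fused kernels:

```python
# Sketch: the normalised Hadamard transform is orthogonal and self-inverse,
# so it can be absorbed into both operands of a matmul. Applied to a spiky
# vector, it spreads the outlier's energy across all channels, shrinking
# the dynamic range that the quantizer must cover.
def hadamard4(x):
    # 4-point normalised Hadamard transform (self-inverse after the 1/2 scale).
    a, b, c, d = x
    return [(a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2]

x = [100.0, 1.0, 1.0, 1.0]          # one dominant outlier channel
hx = hadamard4(x)

# The outlier's magnitude is spread out...
assert max(abs(v) for v in hx) < max(abs(v) for v in x)
# ...and the transform is exactly invertible, so no information is lost.
assert all(abs(u - v) < 1e-9 for u, v in zip(hadamard4(hx), x))
```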
Results
AdaHOP consistently reduces quantization error and achieves BF16 training quality at MXFP4 precision. The method provides up to 3.6× memory compression and 1.8× kernel speedup over traditional BF16 full-precision training, demonstrating improved training stability and efficiency.
Implications
The findings suggest that adaptive strategies in low-precision training can significantly enhance the performance of large language models, making them more efficient and effective. This approach could be applied to other deep learning frameworks and architectures to improve training processes.
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Large Language Models
Reinforcement Learning
Theory
- Introduces a systematic analysis of noisy label mechanisms in RLVR.
- Distinguishes between inactive and active noisy labels and their impacts on training.
- Proposes Online Label Refinement (OLR) to correct noisy labels dynamically.
- Demonstrates the Early Correctness Coherence phenomenon during training.
Summary
This paper investigates the robustness of large language models (LLMs) in reasoning tasks when trained under noisy supervision, particularly within the framework of Reinforcement Learning with Verifiable Rewards (RLVR). The authors identify two types of noise in labels: inactive noisy labels, which reduce data efficiency without actively misleading the model, and active noisy labels, which can skew the model's learning towards incorrect distributions. They observe a phenomenon termed 'Early Correctness Coherence,' where accuracy on both clean and noisy samples improves similarly in the early stages of training, despite later discrepancies. To address the challenges posed by noisy labels, the authors propose a novel method called Online Label Refinement (OLR). This method progressively corrects potentially noisy labels based on majority-voted answers from the model's own rollouts, guided by two criteria: the positive slope of the majority answer's rollout pass rate and the historical consistency of the majority answer across updates. The effectiveness of OLR is evaluated on six in-distribution mathematical reasoning benchmarks and three out-of-distribution tasks, demonstrating significant improvements in robustness against both inactive and active noise.
Methodology
The authors conducted experiments to analyze the effects of noisy labels on training LLMs using RLVR. They proposed OLR, which refines labels based on majority answers generated by the model itself, monitored through the slope of rollout pass rates and historical consistency of answers. The method was tested on multiple reasoning benchmarks to assess its effectiveness in handling noise.
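A toy sketch of the OLR decision rule; the rollout histories below are invented, but the two criteria mirror the ones described above (positive pass-rate slope and majority-answer consistency):

```python
# Sketch: refine a potentially noisy label using majority-voted answers
# from the model's own rollouts across training updates.
from collections import Counter

# Rollout answers for one question at successive training steps; the
# dataset label '5' is wrong, the model converges toward '7'.
rollouts = [
    ['7', '3', '7', '5'],
    ['7', '7', '3', '7'],
    ['7', '7', '7', '7'],
]

def refine_label(history, noisy_label):
    majorities, rates = [], []
    for step in history:
        ans, cnt = Counter(step).most_common(1)[0]
        majorities.append(ans)
        rates.append(cnt / len(step))
    rising = rates[-1] > rates[0]              # positive pass-rate slope
    consistent = len(set(majorities)) == 1     # same majority each update
    if rising and consistent and majorities[-1] != noisy_label:
        return majorities[-1]                  # replace the noisy label
    return noisy_label

assert refine_label(rollouts, '5') == '7'
```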
Results
OLR showed consistent improvements in robustness against both inactive and active noisy labels, achieving average gains of 3.6%–3.9% on in-distribution benchmarks and 3.3%–4.6% on out-of-distribution tasks across varying noise ratios (0.1 to 0.9).
Implications
The findings suggest that LLMs can be trained more effectively in environments with noisy supervision, enhancing their reasoning capabilities. This has potential applications in various domains where data labeling is challenging, such as educational tools, automated reasoning systems, and AI-driven decision-making processes.
Collapse-Free Prototype Readout Layer for Transformer Encoders
Theory
Efficient ML
NLP
- Introduction of DDCL-Attention, a prototype-based readout layer for transformers.
- Mathematical guarantees against prototype collapse and formal training stability.
- Versatile application in multiple paradigms, including readout layers and hierarchical compression.
- Empirical validation showing effective prototype separation and high codebook utilization.
Summary
This paper presents DDCL-Attention, a novel prototype-based competitive readout layer designed for transformer encoders, addressing the limitations of traditional pooling methods that often discard valuable information. The proposed mechanism maintains a small bank of globally learned prototype vectors that summarize recurring data patterns, allowing for a soft probabilistic assignment of tokens to these prototypes. This approach not only ensures linear complexity in sequence length but also provides mathematical guarantees against prototype collapse, a common issue in existing methods. The authors demonstrate the stability of the training dynamics under specific conditions, ensuring that prototypes do not converge to a single point. Additionally, DDCL-Attention can be applied in various paradigms, including as a final readout layer, a differentiable codebook, and a hierarchical document compressor. Empirical results across multiple datasets validate the theoretical claims, showing effective prototype separation and full utilization of the codebook, with applications extending beyond standard NLP and vision tasks to scientific tabular data.
Methodology
The authors developed DDCL-Attention, which employs a competitive readout mechanism that maps token embeddings to soft centroid representations using Boltzmann assignments over a global prototype bank. The framework includes a detailed stability analysis using Tikhonov's singular perturbation theory and provides an exact loss decomposition for coupled encoder-prototype systems.
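The Boltzmann soft assignment at the heart of the readout can be sketched directly; the temperature and toy embeddings below are ours, and in the paper the prototype bank is learned jointly with the encoder:

```python
# Sketch: soft (Boltzmann/softmax) assignment of each token to a small
# global prototype bank. Cost is O(tokens x prototypes), i.e. linear in
# sequence length for a fixed bank size, and every token keeps a full
# distribution over prototypes rather than a single hard code.
import math

def soft_assign(tokens, prototypes, temp=1.0):
    pooled = []
    for t in tokens:
        d2 = [sum((ti - pi) ** 2 for ti, pi in zip(t, p)) for p in prototypes]
        w = [math.exp(-d / temp) for d in d2]   # weight ~ exp(-dist^2 / temp)
        z = sum(w)
        pooled.append([wi / z for wi in w])
    return pooled

tokens = [[0.1, 0.0], [1.0, 0.9]]
protos = [[0.0, 0.0], [1.0, 1.0]]
assignments = soft_assign(tokens, protos)

# Each row is a proper distribution, and nearer prototypes get more mass.
for row in assignments:
    assert abs(sum(row) - 1.0) < 1e-9
assert assignments[0][0] > assignments[0][1]
assert assignments[1][1] > assignments[1][0]
```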
Results
Experiments demonstrated that the proposed loss decomposition holds with zero violations, and prototype separation increased as predicted when stability conditions were met. The codebook achieved full utilization (100%) compared to only 39% for standard hard vector quantization. Additional experiments showed the applicability of the method to scientific tabular data, confirming its versatility.
Implications
The DDCL-Attention layer has the potential to enhance the efficiency and effectiveness of transformer models in various applications, particularly in scenarios where structured representations are crucial. Its ability to maintain prototype diversity and stability could lead to improved performance in tasks involving complex data patterns.
Multirate Stein Variational Gradient Descent for Efficient Bayesian Sampling
Efficient ML
Optimization
Theory
- Introduces multirate SVGD to separately handle drift and repulsion in Bayesian sampling.
- Develops adaptive error-controlled multirate methods for improved stability and efficiency.
- Demonstrates significant performance improvements over traditional SVGD in complex posterior scenarios.
- Provides a comprehensive benchmark suite for evaluating the proposed methods.
Summary
This paper introduces a multirate version of Stein Variational Gradient Descent (SVGD) aimed at improving the efficiency of Bayesian sampling. Traditional particle-based Bayesian inference methods often utilize a single global step size for updates, which can lead to instability and inefficiency, particularly in high-dimensional or complex posterior distributions. The proposed multirate SVGD separates the attractive and repulsive components of the update process, allowing them to evolve at different rates. This results in several practical algorithms: a symmetric split method, a fixed multirate method (MR-SVGD), and an adaptive multirate method (Adapt-MR-SVGD) that incorporates local error control. The effectiveness of these methods is evaluated across a comprehensive benchmark suite that includes various problem families, such as Gaussian targets, Bayesian logistic regression, and hierarchical models. The results demonstrate that multirate SVGD variants enhance robustness and improve the quality-cost tradeoffs compared to standard SVGD, particularly in challenging scenarios involving anisotropic and multimodal distributions.
Methodology
The paper derives multirate SVGD formulations that update the attractive and repulsive components of the particle flow on different time scales. It includes symmetric splitting, fixed multirate schedules, and an adaptive variant that adjusts sub-steps based on local error estimations. The methods are empirically evaluated against established benchmarks using various performance metrics.
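A bare-bones 1D illustration of the multirate split; the RBF bandwidth, step sizes, and target below are chosen for illustration and are not taken from the paper:

```python
# Sketch: RBF-kernel SVGD with the update separated into its attractive
# (drift) and repulsive components, each advanced with its own step size.
import math, random
random.seed(1)

def grad_logp(x):
    return -x   # target: standard normal, so grad log p(x) = -x

def svgd_terms(xs, h=0.5):
    n = len(xs)
    drift = [0.0] * n
    repulse = [0.0] * n
    for i in range(n):
        for j in range(n):
            k = math.exp(-(xs[i] - xs[j]) ** 2 / (2 * h))
            drift[i] += k * grad_logp(xs[j]) / n          # attraction to mode
            repulse[i] += k * (xs[i] - xs[j]) / h / n     # kernel repulsion
    return drift, repulse

xs = [random.gauss(2.0, 0.1) for _ in range(20)]  # particles start collapsed
for _ in range(200):
    drift, repulse = svgd_terms(xs)
    # Multirate idea: the stiff repulsive term takes a smaller step than
    # the smooth drift, instead of one global step size for both.
    xs = [x + 0.2 * d + 0.05 * r for x, d, r in zip(xs, drift, repulse)]

mean = sum(xs) / len(xs)
assert abs(mean) < 0.5   # particles have drifted toward the target mean 0
```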
Results
The empirical evaluation shows that multirate SVGD variants outperform standard SVGD across six benchmark families, particularly excelling in scenarios with stiff hierarchical, strongly anisotropic, and multimodal targets. The adaptive multirate method generally yields the best performance, while the fixed multirate method offers a simpler, robust alternative at a lower computational cost.
Implications
The findings suggest that multirate SVGD can significantly enhance the efficiency and robustness of Bayesian sampling methods, making it a valuable tool for practitioners dealing with complex, high-dimensional posterior distributions. This approach could be particularly beneficial in fields requiring accurate Bayesian inference, such as machine learning, statistics, and data science.
Forgetting to Witness: Efficient Federated Unlearning and Its Visible Evaluation
Federated Learning
Generative Models
Efficient ML
- Introduction of the first complete pipeline for federated unlearning.
- Development of an efficient unlearning approach that does not require historical data.
- Creation of the Skyeye framework for visualizing the forgetting capacity of unlearning models.
- Utilization of knowledge distillation to facilitate the unlearning process.
Summary
This paper addresses the emerging field of federated unlearning, which is crucial for ensuring data privacy in federated learning systems. The authors propose a comprehensive pipeline for federated unlearning, consisting of an efficient unlearning approach and a novel evaluation framework named Skyeye. The proposed unlearning method utilizes knowledge distillation to enable a model to forget specific data without the need for historical data storage, thus enhancing efficiency and maintaining model accuracy. The Skyeye framework visualizes the forgetting capacity of the unlearning models by integrating them into a Generative Adversarial Network (GAN), allowing for the generation of samples that reflect the model's knowledge. The effectiveness of the proposed methods is demonstrated through extensive experiments, showcasing their ability to efficiently remove data contributions while preserving model performance.
Methodology
The authors propose a federated unlearning approach that employs a knowledge distillation model, where a teacher model (an incompetent model) guides a student model (the one that needs to unlearn). The process involves inputting deleted data into the teacher model and using its outputs to train the student model, effectively enabling the student to forget the deleted data. Additionally, the Skyeye framework visualizes the unlearning process by generating samples through a GAN that incorporates the unlearning model as a classifier.
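A minimal sketch of distillation-based forgetting, assuming a softmax-output classifier; the single interpolation step below stands in for gradient descent on the KL loss:

```python
# Sketch: on deleted data, distil the trained "student" toward an
# incompetent "teacher" whose uniform outputs carry no information about
# that data, so the student's confident (memorised) prediction is erased.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

student = [0.9, 0.05, 0.05]        # confident prediction on a deleted sample
teacher = [1 / 3, 1 / 3, 1 / 3]    # incompetent teacher: uniform over classes

def unlearn_step(student, teacher, lr=0.5):
    # One interpolation step toward the teacher (a stand-in for a gradient
    # step on KL(student || teacher)); renormalise for safety.
    mixed = [(1 - lr) * s + lr * t for s, t in zip(student, teacher)]
    z = sum(mixed)
    return [m / z for m in mixed]

after = unlearn_step(student, teacher)
assert kl(after, teacher) < kl(student, teacher)   # closer to "forgotten"
```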
Results
The experiments conducted demonstrate that the proposed federated unlearning approach achieves high efficiency and maintains model accuracy without the need for historical data. The Skyeye framework effectively visualizes the forgetting capacity of the models, providing insights into the unlearning process and its effectiveness.
Implications
The findings of this research have significant implications for data privacy and compliance with regulations like GDPR and CCPA, as they provide a robust method for ensuring that federated learning models can effectively forget specific data contributions. This can enhance user trust in federated learning systems and facilitate broader adoption in sensitive applications.
The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
Theory
- Introduction of the Geometric Alignment Tax, highlighting the cost of discretizing continuous manifolds.
- Controlled experiments show that continuous objectives significantly outperform discrete tokenization in geometric stability.
- Identification of three failure regimes in biological foundation models, revealing systematic issues in representation.
- Demonstration that finer quantization in learned codebooks can worsen geometric stability despite better reconstruction.
Summary
This paper investigates the limitations of foundation models in biology and physics, particularly their failure to maintain the continuous geometric properties of the systems they represent. The author introduces the concept of the Geometric Alignment Tax, which describes the intrinsic geometric distortion that occurs when continuous manifolds are forced through discrete categorical frameworks. Through controlled experiments on synthetic dynamical systems, the study demonstrates that using a continuous head instead of cross-entropy loss can significantly reduce geometric distortion. The paper evaluates 14 biological foundation models and identifies three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. The findings suggest that no model can achieve low distortion, high mutual information, and global coherence simultaneously, highlighting the challenges posed by discretization in modeling continuous phenomena.
Methodology
The study employs controlled synthetic experiments using three architectures (Transformer, SSM, hybrid) trained on datasets with known continuous geometry. Geometric stability is evaluated through Representational Dissimilarity Matrices (RDMs) and various metrics to assess the preservation of geometric relationships under perturbations. The paper also applies rate-distortion theory and mutual information estimation (MINE) across multiple foundation models to analyze their performance.
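The RDM comparison can be sketched in a few lines; the toy embeddings below are ours, and the paper's stability metrics are richer than this mean absolute gap:

```python
# Sketch: build Representational Dissimilarity Matrices (RDMs) for the
# inputs and for their embeddings, then compare them. A geometry-preserving
# map yields near-identical RDMs; a discretising (tokenising) map does not.
def rdm(points):
    n = len(points)
    return [[sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
             for j in range(n)] for i in range(n)]

def rdm_distortion(xs, ys):
    # Mean absolute difference between corresponding dissimilarities.
    rx, ry = rdm(xs), rdm(ys)
    n = len(xs)
    return sum(abs(rx[i][j] - ry[i][j])
               for i in range(n) for j in range(n)) / (n * n)

inputs = [[0.0], [0.4], [0.8], [1.2]]
continuous = [[v[0] + 0.01] for v in inputs]     # smooth, distance-preserving
tokenised = [[float(round(v[0]))] for v in inputs]  # snapped to a discrete grid

# Discretisation pays the geometric alignment tax; the continuous map does not.
assert rdm_distortion(inputs, continuous) < rdm_distortion(inputs, tokenised)
```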
Results
The results indicate that replacing cross-entropy with continuous objectives reduces geometric distortion by up to 8.5 times. The architectures trained under discrete tokenization diverge significantly in performance, with differences as large as 3,000 times. The study also finds that learned codebooks do not mitigate the geometric alignment tax, and the ESM-2 protein Transformer suite shows a decline in geometric stability with increased model size. The experiments confirm that Evo 2's robustness is due to conserved sequence composition rather than learned symmetry.
Implications
The findings have significant implications for the design and evaluation of foundation models in scientific domains, suggesting that reliance on discrete tokenization may lead to fundamental limitations in model performance. This work encourages the exploration of continuous modeling approaches to better capture the complexities of biological and physical systems.
Physical Sensitivity Kernels Can Emerge in Data-Driven Forward Models: Evidence From Surface-Wave Dispersion
Theory
Interpretability
Optimization
- Neural network surrogates can recover depth-dependent structures of surface-wave sensitivity kernels.
- Learned sensitivities are influenced by both wave physics and the training distribution.
- Surrogate gradients and Fisher information can effectively capture local inverse-problem geometry for inversion.
- Emergent differential physics allows data-driven models to recover physical structures from observable data.
Summary
This paper investigates the ability of data-driven neural network models to recover the underlying physical sensitivity structures in geophysical applications, specifically focusing on surface-wave dispersion. The authors train a neural network surrogate to predict how surface-wave speeds depend on subsurface velocity structures and analyze the gradients of these predictions to determine sensitivity patterns. The study finds that the learned sensitivity patterns closely align with established theoretical sensitivity kernels, suggesting that neural networks can capture physically meaningful information rather than merely acting as black-box predictors. Additionally, the research highlights that strong structural priors in the training data can introduce artifacts into the inferred sensitivities. The results indicate that neural forward models can provide valuable physical insights for inversion and uncertainty analysis, emphasizing the conditions under which these models maintain physical consistency.
Methodology
The authors trained a neural network surrogate model to approximate the mapping from one-dimensional shear-wave velocity structures to Rayleigh-wave dispersion curves. They utilized automatic differentiation to compute the gradients of the predicted dispersion with respect to the velocity structure and compared these gradients with theoretical sensitivity kernels. The study also evaluated the effectiveness of these gradients in gradient-based inversion processes.
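The core move, probing a trained forward model with derivatives to read off a sensitivity kernel, can be illustrated with a toy sketch. Everything here is invented for illustration (the surrogate is a random two-layer net, not the paper's model), and finite differences stand in for automatic differentiation:

```python
import numpy as np

# Illustrative sketch (not the paper's model): a tiny surrogate mapping a
# 1-D shear-wave velocity profile v (5 depth bins) to phase velocities c at
# 3 periods, probed with central finite differences to recover a
# "sensitivity kernel" dc_i/dv_j. All weights and sizes are made up.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 5)) * 0.3   # hidden-layer weights
W2 = rng.normal(size=(3, 8)) * 0.3   # output layer: 3 periods

def surrogate(v):
    return W2 @ np.tanh(W1 @ v)      # toy forward model

v0 = np.linspace(3.0, 4.5, 5)        # reference velocity profile (km/s)
eps = 1e-5
kernel = np.zeros((3, 5))
for j in range(5):                   # sensitivity per depth bin
    dv = np.zeros(5)
    dv[j] = eps
    kernel[:, j] = (surrogate(v0 + dv) - surrogate(v0 - dv)) / (2 * eps)

# Each row approximates how one period's phase velocity responds to a
# depth-localized velocity perturbation -- the analogue of a depth-dependent
# sensitivity kernel.
print(kernel.shape)                  # (3, 5)
```

In the paper the analogous gradients come from automatic differentiation of the trained network and are compared against theoretical kernels.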
Results
The study demonstrated that the gradients produced by the neural network closely matched the depth-dependent structure of theoretical surface-wave sensitivity kernels. Furthermore, the application of these gradients in gradient-based inversion successfully recovered velocity models from synthetic dispersion data, indicating that the differential structure of the surface-wave dispersion operator can emerge in data-driven models.
Implications
The findings suggest that neural network surrogates in geophysical modeling can provide more physically informative insights than previously assumed. This has significant implications for improving inversion techniques and uncertainty analysis in geophysical research, potentially leading to more accurate subsurface modeling.
Towards Realistic Class-Incremental Learning with Free-Flow Increments
Theory
Optimization
- Introduction of Free-Flow Class-Incremental Learning (FFCIL) to address realistic class arrival scenarios.
- Development of a model-agnostic framework that stabilizes learning through a class-wise mean objective.
- Method-wise adaptations to enhance robustness, including replay-constrained distillation and loss scale normalization.
- Extensive experiments reveal significant performance drops in existing CIL methods under FFCIL, while the proposed approach shows consistent improvements.
Read more
Towards Realistic Class-Incremental Learning with Free-Flow Increments
Summary
This paper addresses the limitations of traditional Class-Incremental Learning (CIL) methods that operate under predefined schedules with equal-sized tasks. The authors introduce a new paradigm called Free-Flow Class-Incremental Learning (FFCIL), which allows for a more realistic scenario where new classes can arrive in variable sizes at any time. This setting exposes existing CIL methods to significant performance degradation due to unstable optimization dynamics and catastrophic forgetting. To tackle these challenges, the authors propose a model-agnostic framework that includes a class-wise mean (CWM) objective to stabilize learning signals and method-wise adjustments to enhance robustness across various CIL paradigms. Key innovations include constraining distillation to replayed data, normalizing loss scales, and implementing Dynamic Intervention Weight Alignment (DIWA) to mitigate over-adjustment issues. Experimental results demonstrate that existing CIL methods suffer accuracy drops under FFCIL conditions, while the proposed strategies consistently improve performance across multiple datasets and methods.
Methodology
The authors propose a model-agnostic framework that includes a class-wise mean (CWM) objective to replace traditional sample frequency weighted loss with uniformly aggregated class-conditional supervision. This is complemented by method-wise adjustments such as constraining knowledge distillation to replayed samples, normalizing the scale of contrastive and knowledge transfer losses, and introducing Dynamic Intervention Weight Alignment (DIWA) to prevent over-adjustment from unstable statistics.
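The class-wise mean idea can be sketched in a few lines. This is a hedged illustration of the weighting change only (the function name and toy losses are invented, not the paper's implementation):

```python
# Sketch of a class-wise mean (CWM) style objective: instead of averaging
# the loss over all samples (which weights classes by how often they appear
# in the increment), average per-class mean losses so every class
# contributes equally regardless of increment size.
from collections import defaultdict

def cwm_loss(per_sample_losses, labels):
    by_class = defaultdict(list)
    for loss, y in zip(per_sample_losses, labels):
        by_class[y].append(loss)
    class_means = [sum(v) / len(v) for v in by_class.values()]
    return sum(class_means) / len(class_means)

# A frequent class (0) with low loss no longer drowns out a rare class (1):
losses = [0.1, 0.1, 0.1, 0.1, 0.9]   # four samples of class 0, one of class 1
labels = [0, 0, 0, 0, 1]
print(sum(losses) / len(losses))      # sample-frequency weighted: 0.26
print(cwm_loss(losses, labels))       # class-wise mean: 0.5
```

Under free-flow increments, where class sizes vary arbitrarily, this kind of uniform class-conditional aggregation is what stabilizes the learning signal.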
Results
The experiments indicate that existing CIL methods experience substantial accuracy drops under the FFCIL setting, with maximum drops reaching 19.4%. In contrast, the proposed framework consistently yields performance gains across various CIL baselines and datasets, demonstrating its effectiveness in handling free-flow class increments.
Implications
The findings suggest that CIL systems need to adapt to more realistic data streams and class arrival patterns, which could enhance their applicability in real-world scenarios where class distributions are non-stationary and unpredictable. This work opens avenues for further research into robust learning mechanisms in dynamic environments.
The limits of bio-molecular modeling with large language models: a cross-scale evaluation
Large Language Models
- Introduction of BioMol-LLM-Bench for evaluating LLMs in bio-molecular tasks.
- Limited benefits of chain-of-thought data in biological modeling.
- Hybrid mamba-attention architectures outperform traditional transformers for long sequences.
- Supervised fine-tuning improves specialization but reduces generalization.
Read more
The limits of bio-molecular modeling with large language models: a cross-scale evaluation
Summary
This paper addresses the challenges of bio-molecular modeling using large language models (LLMs) by introducing a comprehensive evaluation framework called BioMol-LLM-Bench. The authors identify a significant gap between LLM performance and mechanistic understanding in bio-molecular tasks. The benchmark consists of 26 downstream tasks across four difficulty levels, allowing for a systematic evaluation of LLM capabilities. The study evaluates 13 models and reveals several key findings: (1) chain-of-thought (CoT) data offers limited benefits and can reduce performance on biological tasks; (2) hybrid mamba-attention architectures are more effective for processing long bio-molecular sequences; (3) supervised fine-tuning enhances specialization but diminishes generalization; and (4) while LLMs excel in classification tasks, they struggle with complex regression tasks. The findings provide insights for future LLM-based bio-molecular modeling and highlight the need for improved evaluation frameworks that integrate computational tools.
Methodology
The authors developed the BioMol-LLM-Bench framework, which includes 26 tasks organized into four hierarchical levels of complexity. They conducted evaluations on 13 general-purpose and domain-specific models, analyzing performance across various metrics, including accuracy and validity for classification tasks, and RMSE for regression tasks. The evaluation also integrated computational tools to assess model capabilities in real-world scientific applications.
Results
The evaluation revealed that chain-of-thought data did not significantly enhance performance and could be detrimental. Hybrid mamba-attention architectures showed superior performance for long sequences compared to traditional models. Supervised fine-tuning led to specialization at the expense of generalization, and none of the models achieved satisfactory performance on challenging regression tasks, indicating a need for further development in this area.
Implications
The findings suggest that while LLMs have potential in bio-molecular modeling, there are significant limitations that need to be addressed. The BioMol-LLM-Bench framework can guide future research and development of LLMs tailored for bio-molecular applications, emphasizing the importance of integrating computational tools and improving evaluation methodologies.
Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It
Reinforcement Learning
Optimization
Theory
- Introduction of a new RMAB framework that incorporates individual penalty constraints for users.
- Development of the Penalty-Optimal Whittle (POW) index policy, which is asymptotically optimal and computationally tractable.
- Proposal of the DeepPOW algorithm for online learning of the POW index without prior knowledge of user dynamics.
- Comprehensive simulations validate the effectiveness of the POW index policy and DeepPOW in various network scheduling scenarios.
Read more
Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It
Summary
This paper addresses the challenges of resource allocation in dynamic wireless networks using the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints. Unlike traditional RMAB models that focus solely on maximizing total system rewards, this work incorporates distinct performance constraints for each user, such as energy limits and quality-of-service requirements. The authors introduce a new Penalty-Optimal Whittle (POW) index policy, which is computationally tractable and asymptotically optimal, ensuring that all individual penalty constraints are satisfied. The POW index is derived from a reformulated dual problem that simplifies the complexity of the original RMAB model. Additionally, the paper presents a deep reinforcement learning algorithm, DeepPOW, which learns the POW index in real-time without prior knowledge of user behaviors. Extensive simulations demonstrate that both the POW index policy and DeepPOW achieve near-optimal performance, significantly outperforming existing policies across various applications.
Methodology
The authors extend the RMAB model to include individual penalty constraints and derive the POW index policy through a reformulated dual problem. They also develop a deep reinforcement learning algorithm, DeepPOW, to learn the POW indices from interactions with the environment.
Results
Simulation results indicate that the POW index policy achieves near-optimal performance while satisfying all penalty constraints. DeepPOW efficiently learns the POW index, outperforming baseline policies in multiple practical scenarios.
Implications
The proposed framework and algorithms can be applied to various dynamic resource allocation problems in wireless networks, IoT systems, and other applications requiring user-specific performance guarantees.
Evaluation of Bagging Predictors with Kernel Density Estimation and Bagging Score
Theory
Efficient ML
- Introduces a new method for evaluating bagging predictors using Kernel Density Estimation.
- Presents the Bagging Score as a confidence metric for ensemble predictions.
- Demonstrates improved prediction accuracy over traditional mean or median methods.
- Ranks highly against existing nonlinear regression approaches without optimization.
Read more
Evaluation of Bagging Predictors with Kernel Density Estimation and Bagging Score
Summary
This paper presents a novel approach for evaluating bagging predictors in machine learning by utilizing Kernel Density Estimation (KDE) and introducing a new metric called Bagging Score (BS). Traditional methods for aggregating predictions from multiple models, such as taking the mean or median, can lead to inaccuracies, especially in regions where the prediction distribution is asymmetric. The authors propose a method that determines a representative value from a set of predictions using KDE, which accounts for the underlying distribution of the predictions. This method not only provides a more accurate ensemble prediction but also quantifies the confidence of the prediction through the Bagging Score, which ranges from zero to one. The paper demonstrates that this approach yields better prediction accuracy compared to conventional statistical methods and ranks highly against other nonlinear regression techniques without requiring optimization or feature selection. The findings suggest that the new method can significantly enhance the performance of ensemble models in various machine learning applications.
Methodology
The authors trained an ensemble of 1000 neural networks with three hidden layers to generate predictions. They applied Kernel Density Estimation to derive a representative value from the predictions and introduced the Bagging Score to assess the confidence of these predictions. The method was compared against traditional mean and median calculations and other nonlinear regression techniques using real-life datasets.
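The benefit of a density-based representative value over the mean is easy to see in a small sketch. This is a minimal illustration of the idea, not the paper's exact estimator; the bandwidth, grid, and toy predictions are arbitrary choices:

```python
import math

# Smooth the ensemble's predictions with a Gaussian kernel and take the
# density peak as the representative value, rather than the mean or median.
def kde_mode(predictions, bandwidth=0.1, grid_points=200):
    lo, hi = min(predictions), max(predictions)
    best_x, best_d = lo, -1.0
    for i in range(grid_points + 1):
        x = lo + (hi - lo) * i / grid_points
        d = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in predictions)
        if d > best_d:
            best_x, best_d = x, d
    return best_x

# Asymmetric prediction cloud: most models agree near 1.0, a few stragglers
# drag the mean upward, but the KDE mode stays near the dense cluster.
preds = [0.95, 1.0, 1.02, 1.05, 0.98, 2.5, 2.6]
print(sum(preds) / len(preds))    # mean pulled above 1.4 by the tail
print(kde_mode(preds))            # mode near 1.0
```

The Bagging Score then quantifies how concentrated the prediction density is around that mode, giving a confidence value between zero and one.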
Results
The proposed method outperformed traditional aggregation methods (mean and median) in terms of prediction accuracy. It also ranked favorably against various nonlinear regression methods, demonstrating its effectiveness in handling asymmetric prediction distributions without the need for optimization or feature selection.
Implications
The findings suggest that using KDE and the Bagging Score can significantly improve the reliability of ensemble predictions in machine learning, making it a valuable approach for applications where prediction accuracy is critical, particularly in scenarios with limited data.
Learning from Equivalence Queries, Revisited
Theory
Efficient ML
Interpretability
- Introduces symmetric counterexample generators to reduce adversarial behavior in learning from equivalence queries.
- Establishes tight bounds on learning rounds under both full-information and bandit feedback settings.
- Combines game-theoretic perspectives with adaptive weighting algorithms for improved learning efficiency.
- Retains the requirement for proper hypothesis proposals to ensure computational efficiency and interpretability.
Read more
Learning from Equivalence Queries, Revisited
Summary
This paper revisits the classical model of learning from equivalence queries, originally introduced by Angluin in 1988, in the context of modern machine learning systems that learn from user interactions. The authors highlight the limitations of the traditional model, particularly its pessimistic worst-case behavior under adversarial counterexample generation and the assumption of full-information feedback. To address these issues, they introduce a new class of counterexample generators termed 'symmetric', which provide a less adversarial environment for generating counterexamples based on the symmetric difference between the hypothesis and the target. The study explores learning from equivalence queries under both full-information and bandit feedback settings, establishing tight bounds on the number of learning rounds required in each case. The authors employ a game-theoretic approach and combine adaptive weighting algorithms with minimax arguments to derive their results. This work not only mitigates the pessimism associated with worst-case scenarios but also retains the requirement for learners to propose hypotheses from a fixed class, ensuring efficiency and interpretability in practical applications.
Methodology
The authors develop a framework for learning from equivalence queries that incorporates symmetric counterexample generators. They analyze the learning process under both full-information and bandit feedback conditions, employing game-theoretic techniques and adaptive weighting algorithms to derive bounds on the number of learning rounds.
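The contrast between an adversarial and a symmetric counterexample generator can be sketched concretely. In this hedged illustration, concept classes are simply finite sets of domain points (an assumption made for the sketch; the paper's setting is more general):

```python
import random

# A 'symmetric' counterexample generator samples uniformly from the
# symmetric difference of the hypothesis h and the target f, rather than
# letting an adversary pick the worst-case counterexample.
def symmetric_counterexample(hypothesis, target, seed=0):
    diff = sorted(hypothesis ^ target)      # points where h and f disagree
    if not diff:
        return None                         # equivalent: learning is done
    return random.Random(seed).choice(diff)

target = {1, 2, 3, 4}
hypothesis = {3, 4, 5}
print(symmetric_counterexample(hypothesis, target))   # one of {1, 2, 5}
print(symmetric_counterexample(target, target))       # None
```

Because every disagreement point is equally likely, the generator cannot steer the learner toward pathological worst cases, which is what enables the tighter round bounds.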
Results
The study finds that by using symmetric counterexample generators, the learning process can be made significantly more efficient, with tighter bounds on the number of rounds needed for convergence compared to traditional adversarial settings. The results demonstrate that the proposed framework allows for effective learning even in less-than-ideal feedback scenarios.
Implications
This research has potential applications in various interactive machine learning systems, such as recommendation engines and generative models, where user feedback is integral to model updates. The findings suggest that adopting less adversarial feedback mechanisms can enhance learning efficiency and model performance.
Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
Reinforcement Learning
Theory
Time Series
- Introduces Anticipatory Reinforcement Learning (ARL) framework for non-Markovian environments.
- Utilizes a signature-augmented manifold for dynamic path-law representation.
- Enables 'Single-Pass' policy evaluation, reducing computational complexity.
- Develops a generative engine based on Neural Controlled Differential Equations (CDEs).
Read more
Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions
Summary
This paper introduces Anticipatory Reinforcement Learning (ARL), a novel framework that addresses the challenges posed by non-Markovian decision processes in reinforcement learning (RL) architectures, particularly when only a single observed trajectory is available. Traditional RL methods often struggle in environments characterized by jump-diffusions and structural breaks, as they fail to capture the path-dependent geometry necessary for accurate foresight. The ARL framework overcomes this limitation by embedding the state space into a signature-augmented manifold, where the history of the process is treated as a dynamic coordinate. By employing a self-consistent field approach, the agent can maintain an anticipated proxy of the future path-law, enabling deterministic evaluations of expected returns. This shift from stochastic branching to a single-pass linear evaluation significantly reduces computational complexity and variance. The paper demonstrates that this framework preserves essential contraction properties and ensures stable generalization, even in the presence of heavy-tailed noise. The results indicate that grounding reinforcement learning in the topological features of path-space allows agents to achieve proactive risk management and enhanced policy stability in volatile, continuous-time environments.
Methodology
The methodology involves lifting the state space into a signature-augmented manifold, allowing for the representation of path-laws as dynamic objects. The ARL framework employs a self-consistent field approach to maintain an anticipated proxy of future path-laws, facilitating deterministic evaluations of expected returns. The paper also introduces a novel temporal difference operator that aligns the agent's value expectations with the anticipated path's topological evolution.
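The signature features underlying the lifted state space can be illustrated in closed form for a piecewise-linear path. This sketch computes only the first two signature levels of a 2-D path (the paper's full Neural CDE machinery is not reproduced here):

```python
# First two signature levels of a piecewise-linear path: level 1 is the
# total increment per coordinate; level 2 collects the iterated integrals
# S^{ij} = int x^i dx^j, which encode the order in which the path moves.
def signature_levels(points):
    d = len(points[0])
    lvl1 = [points[-1][i] - points[0][i] for i in range(d)]
    lvl2 = [[0.0] * d for _ in range(d)]
    for k in range(len(points) - 1):
        delta = [points[k + 1][i] - points[k][i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                lvl2[i][j] += (points[k][i] - points[0][i]) * delta[j] \
                              + 0.5 * delta[i] * delta[j]
    return lvl1, lvl2

path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]   # move right, then up
lvl1, lvl2 = signature_levels(path)
print(lvl1)                      # [1.0, 1.0]: total increments
print(lvl2[0][1], lvl2[1][0])    # 1.0 0.0: the ordering is encoded
```

Swapping the order of the two moves would swap the off-diagonal level-2 terms, which is exactly the path-dependent information a Markovian state discards.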
Results
The results show that the ARL framework effectively bridges the gap between non-Markovian decision processes and classical RL architectures, achieving significant reductions in computational complexity and variance. The framework maintains fundamental contraction properties and ensures stable generalization in environments with heavy-tailed noise, demonstrating proactive risk management and superior policy stability.
Implications
The implications of this work extend to various fields, particularly in high-frequency finance and complex physical systems, where non-Markovian dynamics are prevalent. The ARL framework provides a mathematically grounded alternative to traditional RL methods, potentially improving decision-making processes in volatile environments.
Gradient Boosting within a Single Attention Layer
NLP
Large Language Models
Theory
- Introduction of gradient-boosted attention, a multi-round attention mechanism.
- Demonstrates a formal correspondence to gradient boosting under a squared reconstruction objective.
- Shows that separate projections for correction can recover residual information lost in standard attention.
- Achieves significant improvements in test perplexity over standard and alternative attention mechanisms.
Read more
Gradient Boosting within a Single Attention Layer
Summary
This paper introduces a novel mechanism called gradient-boosted attention, which enhances the traditional attention mechanism in transformers by incorporating principles from gradient boosting. The standard attention mechanism computes a single softmax-weighted average over values, which can lead to errors that are not corrected in subsequent passes. In contrast, gradient-boosted attention performs a second attention pass that focuses on the prediction error of the first pass, applying learned projections and a gated correction. This approach is shown to map onto Friedman's gradient boosting framework, where each attention pass acts as a base learner, and the per-dimension gate serves as a shrinkage parameter. The paper demonstrates that this method can recover residual information that is typically lost in standard attention mechanisms. Experimental results on a 10M-token subset of WikiText-103 show that gradient-boosted attention achieves a test perplexity of 67.9, outperforming standard attention (72.2), Twicing Attention (69.6), and a parameter-matched wider baseline (69.0). The findings suggest that two rounds of attention capture most of the benefits of this approach.
Methodology
The methodology involves applying gradient boosting principles within a single attention layer. The first attention pass generates an initial estimate, and the residual error is then processed in a second attention pass with separate learned projections and a gating mechanism. This allows the model to correct its predictions based on the errors identified in the first pass.
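The two-pass structure can be sketched directly. This is a toy illustration of the mechanism under a squared reconstruction objective; the shapes, random projections, and fixed gate value are invented, not the paper's trained parameters:

```python
import numpy as np

# Pass 1 is ordinary softmax attention; pass 2 attends to the residual of
# the first pass with separate projections and adds a gated correction,
# mirroring gradient boosting with the gate as a shrinkage parameter.
rng = np.random.default_rng(1)
T, d = 4, 8                       # sequence length, model dimension
X = rng.normal(size=(T, d))

def softmax(A):
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def attn(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(d)) @ V

Wq1, Wk1, Wv1 = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Wq2, Wk2, Wv2 = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
gate = np.full(d, 0.5)            # per-dimension shrinkage (learned in the paper)

Y1 = attn(X @ Wq1, X @ Wk1, X @ Wv1)     # pass 1: base learner
R = X - Y1                               # residual under the reconstruction target
Y2 = attn(R @ Wq2, R @ Wk2, R @ Wv2)     # pass 2: fit the residual
Y = Y1 + gate * Y2                       # gated correction
print(Y.shape)                           # (4, 8)
```

In the boosting analogy, Y1 is the first base learner's fit, Y2 is a second learner trained on the residual, and the gate plays the role of the learning rate.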
Results
Gradient-boosted attention achieved a test perplexity of 67.9 on a 10M-token subset of WikiText-103, which is a 6.0% relative improvement over standard attention (72.2). It also outperformed Twicing Attention by 1.7 points and a parameter-matched wider baseline by 1.1 points. The results indicate that two rounds of attention are sufficient to capture most of the performance benefits.
Implications
The introduction of gradient-boosted attention could lead to more accurate and efficient transformer models, particularly in NLP tasks where attention mechanisms are critical. This approach may also inspire further research into integrating boosting techniques with other neural network architectures.
Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations
Theory
Efficient ML
Audio & Speech
- Introduces algebraic group actions as a method for spectral estimation from single observations.
- Establishes a General Replacement Theorem for consistent estimation of subspace decomposition.
- Demonstrates the optimality of the symmetric group for achieving superior spectral decomposition.
- Applies the framework to various domains, achieving significant performance improvements over traditional methods.
Read more
Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations
Summary
This paper introduces a novel theoretical framework that allows for the extraction of second-order statistical information from a single observation of a noisy signal through algebraic group actions, rather than relying on temporal averaging over multiple observations. The author defines a group-averaged estimator, FG, which utilizes the action of a finite group G on a single observation vector. A General Replacement Theorem is established, proving that FG can consistently estimate population-level subspace decomposition under specific conditions regarding signal transformation and noise distribution. The paper further demonstrates the optimality of the symmetric group SM for algebraic diversity, linking it to the Karhunen-Loève transform, which is optimal for linear decorrelating transforms. Practical applications include the MUSIC algorithm for direction-of-arrival estimation and massive MIMO channel estimation, where the proposed method significantly outperforms traditional approaches by reducing pilot overhead and improving effective throughput. The framework also extends to waveform characterization and transformer neural networks, revealing new algebraic structures in internal representations. The findings suggest that the algebraic diversity framework can effectively operate in colored noise environments and offers a systematic approach to group matching for optimal signal processing.
Methodology
The methodology involves defining a group-averaged estimator FG based on the action of a finite group G on a single observation vector. Theoretical proofs establish the conditions for consistent estimation and optimality, followed by practical demonstrations through various applications, including MUSIC for direction-of-arrival estimation and massive MIMO channel estimation.
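The group-averaged estimator is easiest to see with the cyclic group. This sketch is an illustration under assumed conditions (a shift-invariant signal model), not the paper's general construction:

```python
import numpy as np

# For a single noisy snapshot x, average the outer products g(x) g(x)^T over
# a finite group G -- here the cyclic shifts C_M -- instead of averaging
# over many snapshots in time. For stationary signals this yields a
# circulant (Toeplitz-structured) covariance estimate from one observation.
rng = np.random.default_rng(2)
M = 6
x = rng.normal(size=M)                       # one observation

R = np.zeros((M, M))
for s in range(M):                           # action of the cyclic group C_M
    g_x = np.roll(x, s)
    R += np.outer(g_x, g_x)
R /= M

# Group averaging enforces circulant structure: the diagonal is constant.
print(np.allclose(np.diag(R), R[0, 0]))      # True
```

A covariance estimate with this structure is exactly what subspace methods such as MUSIC need, which is why a single snapshot can replace multi-snapshot temporal averaging.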
Results
The results show that the group-averaged estimator achieves equivalent performance to multi-snapshot methods while significantly reducing pilot overhead in massive MIMO systems. In waveform classification, the proposed method outperforms FFT-based approaches, maintaining high accuracy even in low SNR conditions. The framework also identifies optimal algebraic structures in transformer models, leading to potential improvements in model efficiency.
Implications
The implications of this work suggest that algebraic diversity can transform signal processing by enabling effective estimation from single observations, reducing the need for multiple measurements, and enhancing performance in various applications, including communications and machine learning. Additionally, it opens avenues for further exploration of group-theoretic methods in signal processing and neural network analysis.
Beyond Imbalance Ratio: Data Characteristics as Critical Moderators of Oversampling Method Selection
Theory
- IR alone is a weak predictor of oversampling effectiveness; class separability is a stronger moderator.
- The study provides a new framework for method selection that considers multiple data characteristics.
- Controlled experiments reveal negative correlations between IR and oversampling benefits, challenging previous literature.
- The findings highlight the need for evidence-based guidelines in selecting oversampling techniques.
Read more
Beyond Imbalance Ratio: Data Characteristics as Critical Moderators of Oversampling Method Selection
Summary
This paper challenges the prevailing imbalance ratio (IR)-threshold paradigm in oversampling method selection, which posits a direct correlation between IR and oversampling effectiveness. Through 12 controlled experiments involving over 100 dataset variants, the authors systematically manipulated IR while maintaining constant data characteristics such as class separability and cluster structure. The results revealed a weak to moderate negative correlation between IR and oversampling benefits, contradicting previous assumptions. Class separability was identified as a significantly stronger moderator of method effectiveness than IR alone. The authors propose a 'Context Matters' framework that integrates IR, class separability, and cluster structure to guide practitioners in selecting appropriate oversampling methods based on data characteristics rather than relying solely on IR. This study emphasizes the importance of controlled experimentation in imbalanced learning research and provides evidence-based selection criteria for practitioners.
Methodology
The authors conducted 12 controlled experiments manipulating the imbalance ratio while keeping data characteristics constant, using algorithmically generated Gaussian mixture datasets. They also performed two additional validation experiments to assess ceiling effects and metric-dependence.
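The controlled-manipulation design can be sketched as follows. The parameters here are illustrative, not the paper's actual generation settings: class separability is held fixed by pinning the distance between the two Gaussian class means, and only the imbalance ratio varies:

```python
import random

# Generate dataset variants with a fixed class-mean separation (constant
# separability) while varying only IR = n_majority / n_minority.
def make_variant(ir, n_minor=50, separation=2.0, seed=0):
    rng = random.Random(seed)
    minority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), 1)
                for _ in range(n_minor)]
    majority = [(rng.gauss(separation, 1.0), rng.gauss(0.0, 1.0), 0)
                for _ in range(int(ir * n_minor))]
    return minority + majority

for ir in (1, 5, 20):                 # same separability, different IR
    data = make_variant(ir)
    n0 = sum(1 for *_, y in data if y == 0)
    n1 = len(data) - n0
    print(ir, n0 / n1)                # realized imbalance ratio
```

Holding everything except IR constant is what allows the study to attribute changes in oversampling benefit to IR itself rather than to confounded data characteristics.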
Results
The experiments demonstrated a weak to moderate negative correlation between IR and oversampling benefits, with class separability showing a strong negative correlation (ρ = -0.72, p = 0.003). This indicates that class separability accounts for more variance in method effectiveness than IR alone.
Implications
The proposed 'Context Matters' framework can help practitioners select more effective oversampling methods by considering data characteristics, potentially improving classification performance in imbalanced datasets across various applications such as fraud detection and medical diagnosis.
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
NLP
Large Language Models
Efficient ML
- Introduces Binary Weights and Ternary Activations (BWTA) for improved quantization in Transformers.
- Develops a Smooth Multi-Stage Quantization framework for stable training and convergence.
- Creates a custom BWTA MatMul CUDA kernel for efficient GPU execution.
- Achieves near full-precision performance for BERT with minimal accuracy drop.
Read more
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
Summary
The paper presents BWTA, a novel quantization scheme that combines Binary Weights and Ternary Activations to enhance the efficiency and accuracy of Transformer-based models under ultra-low-bit quantization. The authors address two main challenges in binarization: accuracy degradation and limited GPU support. They introduce a Smooth Multi-Stage Quantization framework that stabilizes training and improves convergence by gradually reducing activation integer bins and aligning magnitudes. For inference, a custom BWTA MatMul CUDA kernel is developed, enabling efficient execution of binary and ternary operations on GPUs. Experimental results demonstrate that BWTA achieves performance close to full precision for BERT with minimal accuracy loss and significantly improves computational efficiency, achieving 16-24x speedups over FP16 on NVIDIA GPUs. This work highlights the potential of algorithm-hardware co-design in facilitating low-latency, ultra-low-bit inference without compromising model quality.
Methodology
The methodology involves a two-pronged approach: (1) a Smooth Multi-Stage Quantization framework that includes a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor for effective training, and (2) the development of a BWTA MatMul CUDA kernel that supports efficient binary and ternary operations on GPUs, addressing the limitations of existing low-bit kernels.
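The quantization step itself can be sketched in isolation. This is a hedged illustration of binary-weight/ternary-activation quantization in general; the scaling rule and threshold are common conventions, not necessarily the paper's, and the training framework and CUDA kernel are not shown:

```python
# Weights go to {-scale, +scale} with a per-tensor scale; activations go to
# {-1, 0, +1} via a dead-zone threshold. The dot product then reduces to
# sign-and-mask arithmetic times a single scale factor.
def binarize_weights(w):
    scale = sum(abs(v) for v in w) / len(w)      # per-tensor scale factor
    return [scale if v >= 0 else -scale for v in w], scale

def ternarize_acts(a, threshold=0.5):
    return [0 if abs(v) < threshold else (1 if v > 0 else -1) for v in a]

w = [0.7, -0.2, 0.4, -0.9]
a = [1.3, 0.1, -0.8, 0.3]
wq, s = binarize_weights(w)
aq = ternarize_acts(a)
print(wq)   # approx [0.55, -0.55, 0.55, -0.55]
print(aq)   # [1, 0, -1, 0]
print(sum(x * y for x, y in zip(wq, aq)))
```

Because every product is a signed copy of one scale (or zero), a specialized kernel can replace multiplications with additions, sign flips, and masking, which is where the reported speedups come from.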
Results
BWTA achieves an average accuracy drop of only -3.5% on the GLUE benchmark with less than 2% drop across five tasks for BERT. It also provides 16-24x kernel-level speedups over FP16 and end-to-end prefill throughput of 216-330 tokens/s with reduced memory usage.
Implications
The findings suggest that BWTA can facilitate the deployment of Transformer models in resource-constrained environments, making ultra-low-bit quantization practical for real-world applications. This approach can potentially enhance the efficiency of various NLP tasks and large language models.
Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits
Theory
Large Language Models
Reinforcement Learning
- Introduces a Noisy-CBLI framework to evaluate the impact of noise in LLM-generated preference data.
- Empirical results show that warm-starting benefits diminish beyond 30% noise and become harmful at higher noise levels.
- Systematic misalignment of LLM preferences can lead to higher regret than cold-start bandits.
- Develops a theoretical analysis linking prior-error to performance outcomes in bandit algorithms.
Read more
Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits
Summary
This paper investigates the effectiveness of using Large Language Models (LLMs) to initialize contextual bandits, a method known as Contextual Bandits with LLM Initialization (CBLI). While previous studies indicated that LLM-generated preferences could significantly reduce early regret in bandit algorithms, this research critically examines the robustness of these findings under conditions of noise and misalignment. The authors introduce a Noisy-CBLI framework that simulates varying levels of noise in LLM-generated data and evaluate its impact on bandit performance. Through empirical studies across multiple datasets, they find that warm-starting with LLM-generated preferences remains beneficial up to a certain level of noise (approximately 30% preference-flipping), after which it becomes detrimental. Additionally, in cases of systematic misalignment, LLM-generated priors can lead to worse performance than a cold-start bandit. The paper also presents a theoretical analysis that identifies a prior-error term capturing the effects of noise and misalignment, providing a sufficient condition for when LLM-based warm starts outperform cold-start methods. Overall, the findings highlight the importance of alignment between LLM-generated preferences and actual user preferences in the context of bandit algorithms.
Methodology
The authors developed a Noisy-CBLI framework that injects synthetic noise into LLM-generated preference data. They conducted empirical studies using three conjoint datasets and multiple LLMs to assess the impact of noise and misalignment on cumulative regret. A theoretical analysis was also performed to derive conditions under which LLM-based warm starts are beneficial.
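The noise-injection mechanism can be sketched simply. The flip rate and arm names here are illustrative, not taken from the paper's datasets:

```python
import random

# Noisy-CBLI style corruption: each LLM-generated pairwise preference is
# flipped independently with probability p before it is used to warm-start
# the bandit's prior.
def inject_noise(preferences, p, seed=0):
    rng = random.Random(seed)
    return [(b, a) if rng.random() < p else (a, b) for a, b in preferences]

prefs = [("arm_1", "arm_2")] * 1000     # 1000 copies of "arm_1 preferred"
noisy = inject_noise(prefs, p=0.3)
flipped = sum(1 for a, _ in noisy if a == "arm_2")
print(flipped / len(noisy))             # close to the 30% flip rate
```

Sweeping `p` and measuring cumulative regret at each level is what produces the noise thresholds reported in the results.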
Results
The study found that warm-starting with LLM-generated preferences is effective up to 30% noise, loses its advantage around 40%, and becomes harmful beyond 50%. In cases of systematic misalignment, LLM-generated priors can lead to worse performance than cold-start bandits, even without noise. The theoretical analysis provided a sufficient condition for when LLM-based warm starts outperform cold-start methods, closely tracking observed performance transitions.
Implications
The findings suggest that while LLMs can provide valuable initialization for bandit algorithms, careful consideration of alignment and noise is essential before deployment in real-world applications. This research can inform the design of recommendation systems and other online learning frameworks that utilize LLMs.
Causal-Audit: A Framework for Risk Assessment of Assumption Violations in Time-Series Causal Discovery
Time Series
- Causal-Audit formalizes assumption validation as calibrated risk assessment.
- The framework computes risk scores across five assumption families and provides uncertainty intervals.
- An abstention-aware decision policy is implemented to guide method selection based on risk scores.
- Evaluation shows high calibration accuracy (AUROC > 0.95) and significant false positive reduction.
Causal-Audit: A Framework for Risk Assessment of Assumption Violations in Time-Series Causal Discovery
Summary
The paper introduces Causal-Audit, a novel framework designed to assess the risk of assumption violations in time-series causal discovery. Time-series causal discovery methods often depend on critical assumptions such as stationarity and regular sampling. When these assumptions are violated, the resulting causal graphs can be misleading. Causal-Audit formalizes the validation of these assumptions through calibrated risk assessment, computing effect-size diagnostics across five assumption families and aggregating them into four calibrated risk scores with uncertainty intervals. The framework employs an abstention-aware decision policy to recommend methods only when reliable inference is supported by evidence. The authors evaluate Causal-Audit on a synthetic dataset of 500 data-generating processes, demonstrating high calibration accuracy and significant reductions in false positives. The framework is open-source, allowing for broader application in various fields where causal discovery is essential.
Methodology
Causal-Audit employs a three-stage pipeline: (1) automatic diagnostics for assumption auditing, (2) risk calibration to transform diagnostics into calibrated risk scores, and (3) a decision policy that recommends or abstains from using specific causal discovery methods based on the assessed risk. The framework is method-agnostic in the diagnostic stage but includes method-specific components for risk calibration and decision thresholds.
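The recommend-or-abstain logic of stage (3) can be illustrated with a toy policy. The threshold values and risk-score names below are assumptions for illustration, not the paper's calibrated quantities:

```python
def decide(risk_scores, recommend_at=0.3, abstain_at=0.7):
    # Abstention-aware policy: recommend a causal-discovery method only
    # when every calibrated risk score is low; abstain when any score
    # signals a severe assumption violation; otherwise flag for review.
    worst = max(risk_scores.values())
    if worst <= recommend_at:
        return "recommend"
    if worst >= abstain_at:
        return "abstain"
    return "caution"

verdict = decide({"stationarity": 0.12, "sampling": 0.25, "faithfulness": 0.08})
```

The point of the abstention branch is that declining to recommend is itself an informative output when the evidence does not support reliable inference.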
Results
The evaluation of Causal-Audit on a synthetic atlas of 500 data-generating processes yielded well-calibrated risk scores with AUROC exceeding 0.95. The framework achieved a 62% reduction in false positives among recommended datasets and a 78% abstention rate in cases of severe assumption violations. External evaluations confirmed the consistency of recommend-or-abstain decisions with benchmark specifications.
Implications
Causal-Audit has significant implications for fields such as climate science, neuroscience, epidemiology, and economics, where reliable causal inference from time-series data is critical. By providing a systematic approach to assess assumption validity, it enhances the reliability of causal discovery methods and supports transparent reporting of data limitations.
A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
Multimodal
- Introduces HealthPoint (HP) for modeling multi-level incomplete EHRs.
- Utilizes a 4D coordinate system to represent clinical events as points.
- Employs Low-Rank Relational Attention for capturing high-order dependencies.
- Demonstrates state-of-the-art performance in mortality prediction.
A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
Summary
This paper introduces HealthPoint (HP), a novel paradigm for modeling incomplete Electronic Health Records (EHRs) to predict in-hospital mortality. Traditional approaches to multimodal EHRs often assume data completeness and fail to address the complexities arising from multi-level incompleteness, such as irregular sampling, missing modalities, and label sparsity. HP reformulates clinical events as independent points in a 4D coordinate system defined by content, time, modality, and case dimensions. It employs a Low-Rank Relational Attention mechanism to capture high-order dependencies among these points, allowing for flexible event-level interactions and fine-grained self-supervision. The methodology enhances the integration of multi-source information and supports robust modality recovery, effectively utilizing unlabeled data. Extensive experiments on large-scale EHR datasets demonstrate that HP achieves state-of-the-art performance in mortality prediction, showcasing its robustness against varying degrees of incompleteness.
Methodology
The methodology involves representing clinical events as points in a 4D space (content, time, modality, case) and using a Low-Rank Relational Attention mechanism to model interactions between these points. A hierarchical interaction and sampling strategy is employed to balance representation granularity and computational efficiency, allowing for robust event-level interactions and self-supervision.
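Concretely, the 4D event representation can be sketched as a plain record type. This is a minimal illustration: in the paper the content dimension is a learned encoding, not a raw scalar, and the field names here are invented:

```python
from dataclasses import dataclass

@dataclass
class ClinicalPoint:
    content: float   # encoded event value (e.g. a lab measurement)
    time: float      # hours since admission
    modality: int    # source, e.g. 0 = labs, 1 = vitals, 2 = notes
    case: int        # patient-stay identifier

# An irregular, incomplete record is just an unordered set of points;
# no fixed time grid or complete-modality assumption is needed.
record = [
    ClinicalPoint(content=7.2, time=0.0, modality=0, case=42),
    ClinicalPoint(content=0.9, time=3.5, modality=1, case=42),
    ClinicalPoint(content=1.4, time=11.0, modality=0, case=42),
]
```

Because every event is an independent point, missing modalities or irregular sampling simply mean fewer points rather than gaps in a tensor.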
Results
The HealthPoint model consistently outperforms existing methods in mortality prediction tasks across various datasets, demonstrating superior robustness to the challenges posed by multi-level incompleteness in EHRs.
Implications
The findings suggest that HealthPoint can significantly improve clinical risk prediction and decision support systems by effectively handling the inherent incompleteness of real-world EHR data, potentially leading to better patient outcomes and more informed clinical decisions.
The Role of Generator Access in Autoregressive Post-Training
NLP
Large Language Models
Theory
- Generator access significantly influences the effectiveness of autoregressive post-training.
- Prefix control is a primary boundary that affects learning outcomes.
- Weak local reset can eliminate barriers to reaching informative prefixes.
- Observation richness becomes meaningful only after prefix control is granted.
The Role of Generator Access in Autoregressive Post-Training
Summary
This paper investigates the constraints imposed by generator access on autoregressive post-training in large language models. The central question addressed is whether learners can only utilize fresh root-start rollouts or if they can revisit previously constructed prefixes to query the next-token rule. The study introduces two key concepts: prefix control, which determines the prefixes the learner can access, and prefix observation, which defines what the learner can see at those prefixes. The findings reveal that prefix control is a critical boundary; once it is granted, richer observations can significantly enhance performance. The paper outlines a four-step analysis: (1) characterizing the no-reset regime where each query starts from the root, (2) demonstrating how weak local reset eliminates the reachability barrier, (3) showing that observation richness becomes relevant only after control is established, and (4) illustrating how changes in generator access can lead to exponential differences in the number of queries needed for outcome-reward post-training. This work complements existing research by isolating the impact of generator access on learning dynamics.
Methodology
The paper employs a theoretical framework to analyze the interactions between learners and autoregressive generators, focusing on prefix control and observation. It characterizes different regimes of generator access and examines their implications for learning dynamics through formal proofs and conceptual analysis.
Results
The study demonstrates that allowing weak local resets significantly improves the learner's ability to access informative prefixes, thereby enhancing the overall performance of autoregressive post-training. It also shows that richer observations can outperform simpler methods only when prefix control is established.
Implications
The findings suggest that improving generator access could lead to more efficient training of large language models, particularly in complex reasoning tasks. This has potential applications in enhancing the performance of AI systems in natural language processing and other domains reliant on autoregressive models.
WGFINNs: Weak formulation-based GENERIC formalism informed neural networks
Theory
Interpretability
- WGFINNs enhance robustness to noisy data compared to GFINNs.
- The weak formulation approach allows for accurate modeling even in the presence of noise.
- Incorporation of state-wise weighted loss and residual-based attention improves performance.
- Theoretical analysis supports the effectiveness of weak formulations in maintaining trajectory consistency.
WGFINNs: Weak formulation-based GENERIC formalism informed neural networks
Summary
This paper addresses the challenge of data-driven discovery of governing equations from noisy observations in scientific machine learning. The authors propose a novel framework called Weak formulation-based GENERIC formalism informed neural networks (WGFINNs), which enhances the robustness of existing GENERIC-informed neural networks (GFINNs) by integrating weak formulations of dynamical systems. The primary limitation of GFINNs is their reliance on strong-form loss formulations, making them sensitive to measurement noise. WGFINNs mitigate this issue by employing a weak formulation approach that retains the structure-preserving architecture of GFINNs while ensuring compliance with thermodynamic laws. The authors introduce a state-wise weighted loss and a residual-based attention mechanism to address scale imbalances across state variables. Theoretical analysis reveals that the weak-form estimator remains accurate under noisy conditions, unlike the strong-form estimator, which diverges as the time step decreases. Numerical experiments demonstrate that WGFINNs consistently outperform GFINNs across various noise levels, leading to more accurate predictions and reliable recovery of physical quantities.
Methodology
The authors developed WGFINNs by combining the weak formulation of dynamical systems with the architecture of GFINNs. They utilized a state-wise weighted loss function and a residual-based attention mechanism to handle scale imbalances. Theoretical analysis was conducted to compare the strong-form and weak-form estimators, focusing on their behavior under noisy conditions.
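The core advantage of the weak form can be demonstrated on a toy ODE, dx/dt = -x: differentiating noisy data (the strong form) amplifies noise, while integrating it against a smooth test function (the weak form) averages noise out. This sketch is illustrative and independent of the paper's GENERIC-specific architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 201)
dt = t[1] - t[0]
x = np.exp(-t) + 0.01 * rng.standard_normal(t.size)  # noisy samples of exp(-t)

# Strong-form residual of dx/dt + x = 0: finite differences amplify noise.
strong_residual = np.abs(np.gradient(x, dt) + x).mean()

# Weak-form residual: integrate against a test function phi that vanishes
# at the endpoints, moving the derivative onto phi by parts:
#   int (x' + x) phi dt = int x (phi - phi') dt
phi = np.sin(np.pi * t) ** 2
dphi = np.pi * np.sin(2.0 * np.pi * t)   # exact derivative of phi
weak_residual = abs(np.sum(x * (phi - dphi)) * dt)
```

Consistent with the paper's theoretical comparison, the finite-difference residual is dominated by amplified noise while the weak-form residual stays close to zero.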
Results
Numerical experiments indicated that WGFINNs consistently outperformed GFINNs across varying levels of noise, achieving higher accuracy in predictions and better recovery of physical quantities. The weak-form estimator demonstrated stability and accuracy even when subjected to noisy data, contrasting with the divergence observed in the strong-form estimator.
Implications
The proposed WGFINNs framework has significant implications for scientific machine learning, particularly in fields where data is often noisy and the discovery of governing equations is crucial. This approach can enhance the reliability of predictive modeling in various applications, including fluid dynamics and other complex dynamical systems.
Understanding Latent Diffusability via Fisher Geometry
Generative Models
Theory
Efficient ML
- Introduces a theoretical framework linking latent diffusability with Fisher Information Geometry.
- Identifies and quantifies geometric distortions affecting latent diffusion performance.
- Derives conditions for preserving Fisher Information Rate (FIR) to ensure stable diffusion.
- Validates the framework through experiments, showing that standard VAEs often exhibit significant FIR deviations.
Understanding Latent Diffusability via Fisher Geometry
Summary
This paper addresses the degradation of diffusion models when applied to latent spaces, particularly in the context of Variational Autoencoders (VAEs). The authors introduce a framework to quantify latent-space diffusability by analyzing the rate of change of the Minimum Mean Squared Error (MMSE) along diffusion trajectories. They decompose this rate into contributions from Fisher Information (FI) and Fisher Information Rate (FIR), revealing that while global isometry ensures FI alignment, FIR is influenced by the local geometric properties of the encoder. The study identifies three measurable penalties associated with latent geometric distortion: dimensional compression, tangential distortion, and curvature injection. The authors derive theoretical conditions for preserving FIR across different spaces, which is crucial for maintaining diffusability. Through extensive experiments on various autoencoding architectures, they validate their framework and demonstrate that the proposed FI and FIR metrics serve as effective diagnostic tools for identifying and addressing latent diffusion failures.
Methodology
The authors utilize a theoretical approach based on Fisher geometry to analyze latent diffusability. They decompose the MMSE rate along diffusion trajectories into contributions from FI and FIR, and derive conditions for their preservation. Empirical validation is performed through experiments on toy and FFHQ datasets to assess the effectiveness of their proposed metrics.
Results
The study finds that maintaining both FI and FIR is essential for stable latent diffusion. The derived geometric bounds provide insights into the necessary conditions for FIR stability, and experiments confirm that standard VAEs exhibit significant FIR deviations, correlating with failures in generation performance. The proposed metrics effectively predict latent diffusion outcomes.
Implications
The findings have significant implications for the design of autoencoders and diffusion models, suggesting that careful consideration of latent space geometry can enhance the performance of generative models. The framework can guide future research in improving latent representations and mitigating diffusion failures.
Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback
Reinforcement Learning
Robotics
Theory
- Introduces Delayed Homomorphic Reinforcement Learning (DHRL) to address delayed feedback in RL.
- Utilizes MDP homomorphisms to create a compact abstract MDP, improving sample efficiency.
- Presents two algorithms: DHVI for finite domains and D2HPG for continuous domains.
- Demonstrates superior performance of DHRL over traditional augmentation-based methods in long-delay environments.
Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback
Summary
This paper addresses the challenges of reinforcement learning (RL) in environments with delayed feedback, which disrupts the Markov assumption and complicates learning and control. Traditional state augmentation methods, while theoretically sound, lead to state-space explosion and increased sample complexity. The authors propose a novel framework called Delayed Homomorphic Reinforcement Learning (DHRL), which utilizes MDP homomorphisms to create a compact abstract MDP by collapsing belief-equivalent augmented states. This approach allows for efficient policy learning without sacrificing optimality. The paper includes theoretical analyses of state-space compression and sample complexity, and introduces two algorithms: Delayed Homomorphic Value Iteration (DHVI) for finite domains and Deep Delayed Homomorphic Policy Gradient (D2HPG) for continuous domains. Experimental results on continuous control tasks in the MuJoCo benchmark demonstrate that DHRL significantly outperforms existing augmentation-based methods, particularly in scenarios with long delays.
Methodology
The authors define a belief-equivalence relation on the augmented state space to induce a compact abstract MDP. They analyze state-space compression bounds and sample complexity, and develop two algorithms: DHVI for finite domains and D2HPG for continuous domains, grounded in the stochastic homomorphic policy gradient theorem.
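The augmented-state construction that DHRL compresses can be sketched as follows. This is a generic illustration of delayed-MDP augmentation with invented names, not the paper's API:

```python
from collections import deque

class DelayedObservationWrapper:
    # With a delay of d steps, the Markov state the agent can act on is
    # (last observed state, the d actions still "in flight"). Its size
    # grows exponentially in d, which is what DHRL's belief-equivalence
    # abstraction collapses into a compact abstract MDP.
    def __init__(self, delay):
        self.delay = delay
        self.pending = deque()

    def step(self, action, observed_state):
        self.pending.append(action)
        if len(self.pending) > self.delay:
            self.pending.popleft()
        return (observed_state, tuple(self.pending))

env = DelayedObservationWrapper(delay=3)
aug = None
for a in ["left", "right", "left", "left"]:
    aug = env.step(a, observed_state=0)
```

Two augmented states whose pending-action queues induce the same belief over the true underlying state are exactly the ones the homomorphism merges.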
Results
The experimental results show that the DHRL framework outperforms strong augmentation-based baselines in continuous control tasks, especially under conditions of long delays, confirming the effectiveness of the proposed approach.
Implications
The findings suggest that DHRL can be applied to real-world RL problems where feedback delays are common, potentially enhancing the performance and efficiency of RL agents in various applications, including robotics and automated control systems.
Empowering Power Outage Prediction with Spatially Aware Hybrid Graph Neural Networks and Contrastive Learning
Graph Learning
- Introduction of SA-HGNN, a hybrid model integrating static and dynamic spatial dependencies for power outage prediction.
- Development of a dynamic graph learning module to capture complex spatial relationships across different weather events.
- Use of contrastive learning to generate location-specific embeddings, addressing the imbalance in outage datasets.
- Empirical studies show SA-HGNN outperforms existing models in four utility service territories.
Empowering Power Outage Prediction with Spatially Aware Hybrid Graph Neural Networks and Contrastive Learning
Summary
This paper addresses the critical issue of power outages caused by extreme weather events, which have significant economic and social impacts. The authors propose a novel modeling approach called Spatially Aware Hybrid Graph Neural Networks (SA-HGNN) that incorporates both static and dynamic spatial features to enhance power outage predictions. Traditional models often fail to account for the spatial relationships between geographic locations, leading to suboptimal predictions. The SA-HGNN model integrates a dynamic graph learning module that captures evolving spatial dependencies and employs contrastive learning to mitigate the imbalance in outage datasets. By minimizing intra-event distances for similar locations and maximizing inter-event distances, the model generates more discriminative location-specific embeddings. Empirical evaluations across four utility service territories demonstrate that SA-HGNN achieves state-of-the-art performance in predicting power outages, showcasing its effectiveness in addressing the limitations of existing predictive models.
Methodology
The authors developed the SA-HGNN model, which constructs a fixed adjacency matrix for static features and employs a dynamic graph learning module for capturing evolving spatial dependencies. Contrastive learning is utilized to create location-specific embeddings by minimizing distances between similar locations and maximizing distances between different events.
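The intra/inter-event objective described above is triplet-style at heart. The sketch below shows one illustrative form of such a loss; the paper's exact formulation may differ:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull embeddings of similar locations in the same weather event
    # together (d_pos) and push embeddings from different events apart
    # (d_neg) by at least `margin`.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

A well-separated embedding incurs zero loss; a collapsed one, where different events sit close together, is penalized until the margin is restored.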
Results
The SA-HGNN model demonstrated state-of-the-art performance in power outage prediction across four utility service territories, outperforming existing predictive models that do not account for spatial relationships.
Implications
The findings suggest that incorporating spatially aware models can significantly improve the accuracy of power outage predictions, which can lead to better preparedness and response strategies for utility companies during extreme weather events.
SIEVE: Sample-Efficient Parametric Learning from Natural Language
NLP
Large Language Models
Efficient ML
- SIEVE enables sample-efficient parametric learning from natural language context using as few as three examples.
- The method utilizes a synthetic data generation pipeline, SIEVE-GEN, which decomposes context to create high-quality training data.
- Empirical results show that SIEVE outperforms prior context distillation methods and can match or exceed ICL performance without context at inference time.
- The approach allows for persistent improvements in model performance with minimal input, making parametric learning more practical.
SIEVE: Sample-Efficient Parametric Learning from Natural Language
Summary
The paper introduces SIEVE, a novel method for sample-efficient parametric learning from natural language context, which can achieve effective model adaptation with as few as three query examples. Traditional in-context learning (ICL) allows models to adapt using prompts but lacks the persistent improvements offered by parametric learning, which typically requires extensive data. SIEVE addresses this by employing a synthetic data generation pipeline, SIEVE-GEN, that leverages the decomposability of natural language context. This method generates high-quality training data by pairing synthetic queries with only the relevant context units, enhancing the quality of context distillation. The authors evaluate SIEVE in various reasoning tasks, demonstrating that it outperforms existing context distillation methods and matches or exceeds ICL performance without needing context during inference. The findings suggest that parametric learning can be effectively achieved with minimal input, bridging the gap between the sample efficiency of ICL and the advantages of parametric learning.
Methodology
SIEVE employs a synthetic data generation pipeline (SIEVE-GEN) that decomposes natural language context into applicable units. It generates synthetic queries paired with relevant context, leading to higher-quality rollouts for training. The method then applies context distillation to internalize the model's behavior under this filtered context into its weights.
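The decompose-then-pair idea can be sketched in two small functions. Both are illustrative stand-ins: SIEVE-GEN's decomposition and relevance filtering are model-driven, not the naive string operations used here:

```python
def decompose_context(context):
    # Step 1 sketch: split the context into independent units.
    # Here we split naively on sentences.
    return [s.strip() for s in context.split(".") if s.strip()]

def pair_queries(units, relevance):
    # Step 2 sketch: pair each synthetic query with only the units judged
    # relevant to it, yielding (query, filtered context) training examples.
    return [(q, [units[i] for i in idx]) for q, idx in relevance.items()]

units = decompose_context("Rule A covers returns. Rule B covers refunds.")
pairs = pair_queries(units, {"Is a cash refund allowed?": [1]})
```

Training on (query, filtered context) pairs rather than the full context is what keeps the distilled rollouts high-quality from only a handful of seed examples.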
Results
SIEVE demonstrated superior performance compared to previous context distillation methods across multiple reasoning tasks, including Retail domain rule application, RuleArena for sports regulation reasoning, and Machine Translation from One Book tasks. The models trained with SIEVE achieved results that matched or exceeded those of in-context learning, even without context during inference.
Implications
The findings suggest that SIEVE can make parametric learning practical for incorporating natural language context into models, enabling more efficient training and improved performance in various applications, particularly in domains requiring complex reasoning.
Align Your Structures: Generating Trajectories with Structure Pretraining for Molecular Dynamics
Generative Models
- Introduces a two-stage framework for generating MD trajectories that combines structure pretraining and temporal interpolation.
- Addresses data scarcity in MD simulations by leveraging large-scale conformer datasets.
- Implements an equivariant temporal interpolator to model temporal dependencies in molecular dynamics.
- Demonstrates improved accuracy in generating chemically realistic MD trajectories across various molecular systems.
Align Your Structures: Generating Trajectories with Structure Pretraining for Molecular Dynamics
Summary
This paper addresses the challenges of generating molecular dynamics (MD) trajectories using deep generative models, particularly due to the limited availability of MD data and the complexities of high-dimensional MD distributions. The authors propose a novel framework called EGINTERPOLATOR, which utilizes structure pretraining to enhance MD trajectory generation. The framework consists of two main components: a diffusion-based structure generation model pretrained on a large-scale conformer dataset and an interpolator module that enforces temporal consistency among generated structures. By decomposing the MD modeling task into structural generation and temporal alignment, the approach effectively leverages abundant structural data to improve generalization across diverse molecular systems. The authors evaluate their method on the QM9 and DRUGS small-molecule datasets, as well as tetrapeptide and protein monomer systems, demonstrating significant improvements in the accuracy of geometric, dynamical, and energetic measurements in the generated MD trajectories.
Methodology
The proposed EGINTERPOLATOR framework consists of two stages: first, a conformer diffusion model is pretrained on a large-scale dataset to generate plausible molecular structures. Second, a temporal interpolator is integrated to model the temporal dependencies of these structures, allowing for the generation of MD trajectories that maintain both structural fidelity and realistic dynamics.
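The division of labor between the two stages can be illustrated with a deliberately naive stand-in for the second stage: blending two generated structures into a short trajectory. The paper's interpolator is a learned, equivariant module; linear blending here only shows where it sits in the pipeline:

```python
import numpy as np

def linear_frames(x0, x1, n_frames):
    # Naive stand-in for the temporal interpolator: blend two generated
    # structures (N x 3 atomic coordinates) into intermediate frames.
    ts = np.linspace(0.0, 1.0, n_frames)
    return np.stack([(1 - t) * x0 + t * x1 for t in ts])

x0 = np.zeros((5, 3))   # structure from the pretrained conformer model
x1 = np.ones((5, 3))    # a second generated structure
traj = linear_frames(x0, x1, n_frames=4)
```

The pretrained stage supplies plausible endpoints; the interpolator's job is to make the frames between them dynamically consistent rather than merely geometrically intermediate.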
Results
The experimental results show that the EGINTERPOLATOR framework excels in generating MD trajectories with high accuracy in geometric, dynamical, and energetic measurements. The method was validated on multiple datasets, including QM9 and DRUGS, and extended to tetrapeptide and protein monomer systems, confirming its effectiveness in generating chemically realistic molecular dynamics.
Implications
The findings suggest that structure pretraining can significantly enhance the generation of MD trajectories, making it a valuable approach for applications in drug discovery and molecular modeling. This framework could facilitate the exploration of new molecular systems and improve the efficiency of MD simulations.
Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment
NLP
Large Language Models
Theory
- The assumption of a globally fixed intervention layer for steering vectors is fundamentally limited.
- Different inputs may require steering at different layers to align with target behaviors.
- The W2S framework learns to predict the optimal steering layer based on input embeddings.
- W2S consistently outperforms traditional fixed-layer steering methods across various datasets.
Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment
Summary
This paper addresses the limitations of existing steering vector methods for aligning large language models (LLMs) at inference time, which typically apply interventions at a globally fixed layer. The authors argue that the optimal intervention layer should be input-dependent, as different inputs may require steering at different layers to achieve alignment with desired behaviors. They theoretically demonstrate this variability and provide empirical evidence showing that the optimal steering layer indeed varies across inputs. To tackle this issue, they introduce the Where to Steer (W2S) framework, which learns to adaptively select the intervention layer based on input embeddings. The authors evaluate W2S across multiple LLMs and alignment behaviors, showing that it consistently outperforms fixed-layer baselines in both in-distribution and out-of-distribution settings. This work emphasizes the importance of input-dependent control in LLM alignment and highlights adaptive layer selection as a critical design aspect missing in current methodologies.
Methodology
The authors propose the W2S framework, which formulates the problem of input-dependent layer selection as a learning task. They train a model to map input embeddings to optimal steering layers, allowing for adaptive intervention during inference. The methodology involves both theoretical analysis and empirical validation across multiple datasets and LLMs.
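At its simplest, the selector is a learned map from an input embedding to a layer index. The linear scorer below is an illustrative stand-in for whatever model W2S actually trains; the dimensions and random weights are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, dim = 12, 16
W = rng.standard_normal((n_layers, dim))  # stand-in for learned weights

def select_layer(embedding):
    # Score every candidate layer from the input embedding and steer at
    # the argmax -- the intervention layer is now input-dependent.
    return int(np.argmax(W @ embedding))

layers = {select_layer(rng.standard_normal(dim)) for _ in range(50)}
```

The observable contrast with fixed-layer steering is that different inputs land on different layers, which is exactly the variability the paper documents.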
Results
W2S demonstrated significant improvements in steering performance compared to fixed-layer baselines across 13 diverse datasets, achieving better alignment with target behaviors in both in-distribution and out-of-distribution scenarios.
Implications
The findings suggest that adaptive layer selection can enhance the effectiveness of steering vectors in LLMs, potentially leading to more robust and contextually aware model behaviors. This approach could be applied to various applications requiring LLM alignment, such as sentiment analysis, content moderation, and conversational agents.
Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look
Computer Vision
Efficient ML
Theory
- LDA consistently improves classification accuracy over full-dimensional features across multiple architectures and datasets.
- Dimensionality reduction using LDA can reduce feature dimensionality by 61-95% while enhancing accuracy.
- LDA outperforms PCA and more complex alternatives in both accuracy and computational cost.
- Two lightweight extensions to LDA are introduced, offering slight accuracy improvements.
Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look
Summary
This paper investigates the impact of dimensionality reduction on the performance of classifiers using frozen features from pretrained convolutional neural networks (CNNs). The authors focus on Linear Discriminant Analysis (LDA) as a method for reducing the dimensionality of high-dimensional feature vectors extracted from CNNs before classification. Through extensive experiments involving four backbone architectures (ResNet-18, ResNet-50, MobileNetV3-Small, EfficientNet-B0) and two datasets (CIFAR-100 and Tiny ImageNet), the study demonstrates that applying LDA consistently enhances classification accuracy by up to 4.6 percentage points while significantly reducing dimensionality by 61-95%. The results are statistically significant across all tested configurations. The paper also benchmarks LDA against other dimensionality reduction techniques, such as PCA, Local Fisher Discriminant Analysis, and Neighbourhood Components Analysis, showing that LDA outperforms these methods in terms of accuracy and computational efficiency. Additionally, the authors introduce two extensions to LDA, Residual Discriminant Augmentation (RDA) and Discriminant Subspace Boosting (DSB), which provide marginal gains in accuracy at the cost of increased computation. The paper concludes with practical guidelines for practitioners on when to use LDA over PCA, how many components to retain, and which backbone-dataset combinations yield the best results.
Methodology
The authors conducted controlled experiments comparing ten dimensionality reduction methods, including LDA, across four CNN architectures and two datasets. They measured classification accuracy and dimensionality reduction, applying statistical tests to validate their findings.
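The core pipeline (frozen features, LDA projection, simple classifier) is easy to reproduce in miniature. The two-class Fisher LDA below stands in for the multi-class version used in the paper, with synthetic Gaussians in place of CNN features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen CNN features of two classes (50-D Gaussians).
X0 = rng.standard_normal((200, 50))
X1 = rng.standard_normal((200, 50)) + 0.5

# Fisher LDA direction: w = Sw^{-1} (mu1 - mu0).
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
w = np.linalg.solve(Sw + 1e-6 * np.eye(50), mu1 - mu0)

# 50-D features reduced to 1-D, then a midpoint threshold classifier.
z0, z1 = X0 @ w, X1 @ w
thresh = (z0.mean() + z1.mean()) / 2.0
accuracy = ((z0 < thresh).mean() + (z1 > thresh).mean()) / 2.0
```

With C classes, LDA yields at most C-1 discriminant directions regardless of input dimension, which is where dimensionality reductions of 61-95% come from.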
Results
LDA improved classification accuracy by up to 4.6 percentage points while reducing dimensionality by 61-95%. The improvements were statistically significant (p < 0.001) across all configurations. LDA outperformed PCA in 7 out of 8 settings and showed better performance than more complex methods in terms of accuracy and computational efficiency.
Implications
The findings suggest that supervised dimensionality reduction, particularly using LDA, should be a standard practice in transfer learning pipelines involving frozen CNN features. This approach can lead to better model performance with lower computational costs, making it valuable for practitioners with limited resources.
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
NLP
Large Language Models
Generative Models
- Diffusion language models offer greater flexibility in generative tasks than autoregressive models.
- Current evaluation methodologies, particularly likelihood-based metrics, are inadequate for assessing dLLMs.
- Generative perplexity and entropy can be decomposed into KL divergence components, leading to a new evaluation framework.
- OpenWebText is recommended as the standard pretraining dataset for dLLMs over LM1B.
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Summary
This paper addresses the evaluation challenges associated with diffusion language models (dLLMs), which have gained traction due to their flexibility in generative trajectories compared to autoregressive models. The authors highlight the limitations of current evaluation methodologies, particularly the reliance on likelihood-based metrics and generative perplexity. They argue that these methods can lead to misleading conclusions about model performance. The paper proposes a new evaluation framework called 'generative frontiers,' which decomposes generative perplexity and entropy into components of KL divergence to provide a more principled assessment of generative quality. The authors emphasize the importance of using appropriate pretraining datasets, advocating for OpenWebText over alternatives like LM1B, which lacks coherence and is unsuitable for generative quality evaluation. Empirical observations are included to illustrate the proposed evaluation framework's effectiveness.
Methodology
The authors systematically analyze existing evaluation methodologies for diffusion language models, discussing the limitations of likelihood-based evaluations and generative perplexity. They propose the generative frontiers framework, which decomposes generative perplexity and entropy into KL divergence components to better assess generative quality. The paper also includes empirical observations to support their claims.
Results
The paper concludes that the generative frontiers framework offers a more principled approach to evaluating diffusion language models, addressing the shortcomings of traditional metrics. The empirical observations suggest that using OpenWebText as a pretraining dataset yields more meaningful performance metrics compared to LM1B.
Implications
The findings of this paper have significant implications for the evaluation of generative models, particularly in the context of diffusion language models. By establishing a more reliable evaluation framework, researchers can better assess the generative capabilities of their models, leading to improved model development and understanding of generative processes.
ArrowFlow: Hierarchical Machine Learning in the Space of Permutations
Theory
Efficient ML
- Introduces ArrowFlow, a permutation-based machine learning architecture.
- Utilizes ranking filters and hierarchical learning without floating-point parameters.
- Demonstrates robustness to monotone batch effects in gene expression data.
- Achieves competitive accuracy across various datasets, including UCI and MNIST.
Read more
ArrowFlow: Hierarchical Machine Learning in the Space of Permutations
Summary
ArrowFlow is a novel machine learning architecture that operates entirely within the space of permutations, utilizing ranking filters that compare inputs based on Spearman's footrule distance. The architecture is designed to learn ordinal representations hierarchically, where each layer's output ranking serves as the input for the next layer, eliminating the need for floating-point parameters in core computations. The encoding pipeline connects real-valued data to the permutation space, while a multi-view ensemble approach combines outputs from independent networks trained on diverse projections to mitigate information loss. The paper connects ArrowFlow to Arrow's impossibility theorem, suggesting that violations of social-choice fairness axioms can serve as inductive biases for nonlinearity, sparsity, and stability in learning. Experimental results demonstrate that ArrowFlow outperforms traditional methods on several benchmarks, including UCI datasets and gene expression data, showcasing its robustness to batch effects and its competitive performance in classification tasks. The architecture is presented as a proof of concept that competitive classification can be achieved without conventional gradient-based methods, emphasizing the advantages of ordinal structures in machine learning.
Methodology
ArrowFlow employs a hierarchical structure of ranking filters that learn to compare inputs using Spearman's footrule distance. The architecture incorporates an encoding pipeline to bridge real-valued data with permutations and utilizes a multi-view ensemble approach to enhance accuracy. The learning process is based on permutation-matrix accumulation, which updates filters based on displacement evidence rather than traditional gradient descent.
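Spearman's footrule distance itself is elementary: the sum of absolute rank displacements between two permutations. A minimal sketch (the template-matching use below is illustrative, not the paper's filter update rule):

```python
import numpy as np

def footrule(p, q):
    """Spearman's footrule: sum of absolute rank displacements between two permutations."""
    p, q = np.asarray(p), np.asarray(q)
    return int(np.abs(p - q).sum())

def rank_of(x):
    """Encode a real-valued vector as the permutation of its ranks."""
    return np.argsort(np.argsort(x))

# A ranking filter could score an input by its footrule distance to a stored template.
template = rank_of([0.1, 0.9, 0.4, 0.7])  # ranks [0, 3, 1, 2]
sample = rank_of([0.2, 0.8, 0.3, 0.9])    # ranks [0, 2, 1, 3]
print(footrule(template, sample))          # 2
```

Because ranks are invariant under any monotone transformation of the inputs, comparisons like this are unaffected by monotone batch effects, which is the property the gene-expression results above exploit.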
Results
ArrowFlow outperformed baseline models on the Iris dataset (2.7% vs. 3.3% error) and was competitive on most UCI datasets. In gene expression classification, it maintained a stable 2.5% error rate, demonstrating perfect invariance to monotone batch effects, while traditional methods like SVM collapsed to 82.6%. On MNIST, ArrowFlow achieved a 9.1% error rate using only ordinal comparisons, with performance improving as network width and depth increased.
Implications
ArrowFlow suggests a new paradigm for machine learning that prioritizes ordinal structures, which could lead to more robust and efficient models, particularly in domains where data is inherently ordinal or relational. Its integer-only arithmetic also aligns well with neuromorphic hardware, potentially paving the way for more energy-efficient machine learning systems.
Simple yet Effective: Low-Rank Spatial Attention for Neural Operators
Theory
Efficient ML
- Introduction of Low-Rank Spatial Attention (LRSA) for neural operators.
- Unification of global mixing modules under a low-rank perspective.
- Use of standard Transformer primitives for simplicity and efficiency.
- Achieved over 17% error reduction compared to existing methods.
Read more
Simple yet Effective: Low-Rank Spatial Attention for Neural Operators
Summary
This paper introduces Low-Rank Spatial Attention (LRSA), a novel approach designed to enhance neural operators for solving partial differential equations (PDEs). The authors observe that the global interaction kernels in many PDE regimes exhibit rapid spectral decay, allowing for low-rank approximations. LRSA is proposed as a unified framework that compresses high-dimensional features into a compact latent space, processes global interactions, and reconstructs the global context back to spatial points. Unlike previous methods that rely on complex aggregation or normalization techniques, LRSA utilizes standard Transformer components, making it straightforward to implement and compatible with hardware-optimized kernels. The authors demonstrate that this simple architecture achieves significant accuracy improvements, with an average error reduction of over 17% compared to the second-best methods, while maintaining stability and efficiency during mixed-precision training. Overall, the work highlights the potential of low-rank structures in enhancing the performance of neural operators in modeling physical phenomena.
Methodology
The methodology involves a three-step process: compressing spatial point features into latent tokens using cross-attention, processing global information within this latent space through standard self-attention and feed-forward networks, and reconstructing the spatial point features from the processed latent tokens. This approach leverages the low-rank structure of PDE interactions while utilizing standard Transformer components.
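The compress-mix-reconstruct pattern can be sketched with plain scaled dot-product attention in NumPy. This is a parameter-free caricature of LRSA (random latent queries stand in for learned ones, and projection matrices are omitted), but it shows why global mixing becomes cheap: the quadratic step runs over M latent tokens rather than N spatial points:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Standard scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
N, M, d = 512, 16, 32          # N spatial points, M << N latent tokens
X = rng.normal(size=(N, d))    # per-point features
L = rng.normal(size=(M, d))    # latent queries (learned in LRSA, random here)

Z = attend(L, X, X)            # 1) compress: latents cross-attend to points
Z = attend(Z, Z, Z)            # 2) global mixing within the small latent space
Y = attend(X, Z, Z)            # 3) reconstruct: points cross-attend to latents
```

The self-attention in step 2 costs O(M^2) instead of the O(N^2) of full spatial attention, which is the low-rank saving.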
Results
The LRSA framework was tested across various PDE benchmarks, yielding an average error reduction of over 17% compared to the second-best methods. The architecture proved to be stable and efficient, particularly in mixed-precision training scenarios.
Implications
The findings suggest that LRSA can significantly improve the efficiency and accuracy of neural operators in solving PDEs, which has broad applications in fields such as physics, engineering, and computational modeling. The approach may also inspire further research into low-rank methods in other areas of machine learning.
Autoencoder-Based Parameter Estimation for Superposed Multi-Component Damped Sinusoidal Signals
Time Series
- Development of an autoencoder-based method for parameter estimation of damped sinusoidal signals.
- Evaluation of the method's performance under Gaussian and uniform training data distributions.
- High accuracy in parameter estimation even in complex scenarios with overlapping components.
- Robustness of the method against noise and less informative training distributions.
Read more
Autoencoder-Based Parameter Estimation for Superposed Multi-Component Damped Sinusoidal Signals
Summary
This paper presents an innovative approach to estimating parameters of superposed multi-component damped sinusoidal signals using an autoencoder-based method. Damped sinusoidal signals are prevalent in various physical systems, but their analysis is complicated by rapid decay, superposition of multiple components, and observational noise. The authors develop a method that leverages the latent space of autoencoders to accurately estimate the frequency, phase, decay time, and amplitude of each signal component. The study evaluates the method under different training data distributions, specifically comparing Gaussian and uniform distributions. Results demonstrate that the proposed autoencoder method can effectively estimate parameters even in challenging scenarios, such as when components are subdominant or nearly cancel each other out. The method shows robustness against less informative training distributions, highlighting its potential for analyzing short-duration, noisy signals across various applications.
Methodology
The authors designed an autoencoder architecture that compresses high-dimensional input data into a low-dimensional latent space, enabling effective extraction of physical parameters from noisy multi-component signals. The method was evaluated through waveform reconstruction and parameter estimation accuracy, using both two-component and five-component signal cases.
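The signal model being estimated, a superposition of exponentially damped sinusoids, is simple to write down. The sketch below generates the kind of two-component training example discussed above (the specific amplitudes, frequencies, and decay times are illustrative, not taken from the paper):

```python
import numpy as np

def damped_sum(t, params, noise_std=0.0, rng=None):
    """Superpose damped sinusoids: sum_k A_k * exp(-t/tau_k) * sin(2*pi*f_k*t + phi_k)."""
    y = np.zeros_like(t)
    for A, f, tau, phi in params:
        y += A * np.exp(-t / tau) * np.sin(2 * np.pi * f * t + phi)
    if noise_std:
        rng = rng or np.random.default_rng(0)
        y = y + rng.normal(scale=noise_std, size=t.shape)
    return y

t = np.linspace(0.0, 1.0, 1024)
# Two components (A, f, tau, phi); the second is subdominant,
# one of the hard cases the evaluation highlights.
params = [(1.0, 20.0, 0.30, 0.0), (0.2, 35.0, 0.15, 1.0)]
clean = damped_sum(t, params)
noisy = damped_sum(t, params, noise_std=0.1)
```

Each waveform like `noisy` would be one training input to the autoencoder, with the tuples in `params` as the target physical parameters to recover from the latent space.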
Results
The proposed autoencoder method achieved high accuracy in estimating the parameters of damped sinusoidal signals, even in difficult conditions such as subdominant components or nearly opposite-phase components. The comparison between Gaussian and uniform training distributions revealed that the method remains effective even when trained on less informative data.
Implications
This research has significant implications for various fields, including physics, engineering, and signal processing, where accurate analysis of damped sinusoidal signals is crucial. The autoencoder-based approach could enhance the analysis of noisy signals in applications such as structural health monitoring, vibration analysis, and gravitational-wave astronomy.
Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
Theory
- Algorithmic bias in healthcare can exacerbate disparities among subgroups.
- Combining data sources may lead to unpredictable effects on model fairness and performance.
- Common data addition strategies are often ineffective and can introduce distribution shifts.
- A hybrid approach of data-centric methods and model calibration is most effective for improving subgroup performance.
Read more
Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
Summary
This paper addresses the critical issue of algorithmic bias in machine learning models used for decision-making in high-stakes environments, particularly in healthcare. The authors investigate the effectiveness of data interventions aimed at improving subgroup fairness, focusing on intensive care unit (ICU) data. They highlight the challenges of combining data sources to enhance model performance, noting that while adding data can sometimes improve fairness, it can also introduce distribution shifts that negatively impact subgroup performance. The study evaluates two datasets, the eICU Collaborative Research Database and the MIMIC-IV dataset, revealing that common data addition strategies are often unreliable. The authors compare model-based post-hoc calibration with data-centric approaches, concluding that a hybrid strategy combining both is necessary for optimal subgroup performance. This work questions the prevailing notion that simply acquiring 'better data' is a panacea for fairness issues, emphasizing the need for a nuanced understanding of data interventions.
Methodology
The authors conducted a comparative analysis of data interventions for subgroup fairness using two healthcare datasets. They evaluated the effectiveness of various data addition strategies and model-based post-hoc calibration techniques, identifying the impact of mean discrepancy as a barrier to success. The study employed both quantitative and qualitative assessments to analyze subgroup performance across different data interventions.
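One common form of model-based post-hoc calibration is per-subgroup threshold adjustment. The sketch below is a generic illustration of that idea, not the paper's specific calibration procedure: each subgroup gets its own decision threshold chosen so that all groups reach roughly the same true-positive rate, compensating for a model that systematically under-scores one group:

```python
import numpy as np

def group_thresholds(scores, y, groups, target_tpr=0.8):
    """Pick a per-group decision threshold so each subgroup reaches
    (approximately) the same true-positive rate on held-out data."""
    thresholds = {}
    for g in np.unique(groups):
        pos = scores[(groups == g) & (y == 1)]
        # Threshold at the (1 - target_tpr) quantile of positive scores.
        thresholds[g] = np.quantile(pos, 1.0 - target_tpr)
    return thresholds

rng = np.random.default_rng(0)
n = 2000
groups = rng.integers(0, 2, size=n)
y = rng.integers(0, 2, size=n)
# Simulate a model that scores group-1 positives systematically lower.
scores = y * 1.0 + rng.normal(scale=0.5, size=n) - 0.4 * (groups == 1) * y
th = group_thresholds(scores, y, groups, target_tpr=0.8)
```

Here the calibrated threshold for the disadvantaged group comes out lower, equalizing detection rates that a single global threshold would leave unequal; the study's point is that such model-side fixes work best combined with data-centric interventions.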
Results
The findings indicate that data addition can both enhance and hinder model fairness and performance, with many intuitive strategies proving unreliable. The hybrid approach of combining data-centric methods with model calibration was found to be the most effective in improving subgroup performance across the evaluated datasets.
Implications
The results suggest that practitioners in healthcare and other high-stakes domains should be cautious when applying data addition strategies for fairness. The study advocates for a more integrated approach that combines data interventions with model adjustments to address algorithmic bias effectively. This work has broader implications for the design of trustworthy AI systems in various fields.
Complex-Valued GNNs for Distributed Basis-Invariant Control of Planar Systems
Graph Learning
Robotics
Theory
- Introduces a complex-valued GNN architecture for distributed control of planar systems.
- Achieves global invariance to local basis choices, enhancing applicability in GPS-denied environments.
- Utilizes complex-valued linear layers and phase-equivariant activation functions.
- Demonstrates improved data efficiency and tracking performance over traditional real-valued GNNs.
Read more
Complex-Valued GNNs for Distributed Basis-Invariant Control of Planar Systems
Summary
This paper introduces a novel architecture for Graph Neural Networks (GNNs) that is designed to facilitate distributed control of planar systems without relying on a global reference frame. Traditional GNN architectures require nodes to collect geometric observations in compatible bases, which limits their applicability in environments where global positioning or compass data is unavailable. The authors propose a complex-valued GNN parameterization that is globally invariant to the choice of local basis, allowing for effective control in GPS-denied scenarios. By expressing 2D geometric features and transformations in the complex domain, the proposed architecture utilizes complex-valued linear layers with phase-equivariant activation functions. This design ensures that the learned control policies remain invariant to local frame choices. The authors demonstrate that their approach enhances data efficiency, tracking performance, and generalization capabilities compared to a real-valued baseline in an imitation learning flocking task.
Methodology
The authors adapt the GraphSAGE architecture to incorporate complex-valued neural networks (CVNNs), allowing for the representation of geometric transformations in the complex domain. The GNN layers are designed to maintain equivariance to local transformations, ensuring that the learned policies are invariant to the choice of local frames. The methodology includes the use of phase-amplitude activation functions to effectively manage the complex-valued inputs and outputs.
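A standard example of a phase-equivariant activation (not necessarily the exact one used in the paper) is modReLU, which acts only on the magnitude of a complex number and leaves its phase untouched. Rotating all inputs by a global angle then rotates all outputs by the same angle, which is exactly the basis-invariance property described above:

```python
import numpy as np

def mod_relu(z, b=-0.1):
    """Amplitude-phase activation: thresholds |z| while preserving the phase,
    so it commutes with any global rotation e^{i*theta} of the inputs."""
    r = np.abs(z)
    phase = z / np.maximum(r, 1e-12)
    return np.maximum(r + b, 0.0) * phase

rng = np.random.default_rng(0)
z = rng.normal(size=8) + 1j * rng.normal(size=8)
theta = 0.7
rotated = np.exp(1j * theta) * z

# Rotating the input frame rotates the output by the same angle:
lhs = mod_relu(rotated)
rhs = np.exp(1j * theta) * mod_relu(z)
print(np.allclose(lhs, rhs))  # True
```

Since a change of local 2D basis is exactly a complex rotation of the features, a network built from complex-linear layers and activations like this produces control outputs that transform consistently with whatever frame each agent happens to choose.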
Results
The proposed complex-valued GNN architecture outperformed the real-valued baseline in terms of data efficiency, tracking performance, and generalization in the context of an imitation learning flocking task. The results indicate that the complex representation allows for more effective learning and control in distributed systems.
Implications
This work has significant implications for multi-robot systems, particularly in environments where traditional navigation aids are unavailable. The ability to control systems without a global reference frame opens up new possibilities for applications in robotics, autonomous vehicles, and swarm intelligence.
Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems
NLP
Large Language Models
Reinforcement Learning
- Cog-DRIFT reformulates difficult reasoning problems into simpler task formats to facilitate learning.
- The framework utilizes an adaptive curriculum that progresses from easier to harder tasks based on model performance.
- Significant performance improvements were observed on challenging reasoning benchmarks, with gains of +10.11% for Qwen and +8.64% for Llama.
- The method shows strong generalization capabilities across held-out datasets.
Read more
Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems
Summary
The paper presents Cog-DRIFT, a novel framework designed to enhance the learning capabilities of large language models (LLMs) when faced with difficult reasoning problems. Traditional reinforcement learning from verifiable rewards (RLVR) often fails to provide meaningful learning signals for challenging tasks, resulting in a performance ceiling. To address this, the authors propose a task reformulation strategy that transforms complex open-ended problems into simpler formats, such as multiple-choice and cloze tasks. This approach reduces cognitive load and allows models to learn from easier variants, which can then be transferred back to improve performance on the original difficult problems. The Cog-DRIFT framework automatically generates these reformulated tasks and organizes them into an adaptive curriculum that progresses from easier to harder formats as the model improves. The results demonstrate that Cog-DRIFT significantly enhances performance on previously unsolvable problems, achieving notable gains on various reasoning benchmarks and showing strong generalization to held-out datasets. The findings underscore the effectiveness of task reformulation and curriculum learning in overcoming exploration barriers in LLM post-training.
Methodology
The authors developed the Cog-DRIFT framework, which reformulates complex open-ended problems into simpler formats (multiple-choice and cloze tasks) to reduce cognitive load. This reformulation allows models to learn from easier tasks, which subsequently improves their performance on the original difficult problems. The framework organizes these tasks into an adaptive curriculum that adjusts the difficulty level based on the model's learning progress.
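The adaptive-curriculum idea can be sketched as a small scheduler. This toy (the format names, threshold, and window size are illustrative assumptions, not the paper's hyperparameters) serves the easiest task format until the model's rolling accuracy crosses a threshold, then advances toward the original open-ended format:

```python
import random

FORMATS = ["multiple_choice", "cloze", "open_ended"]  # easy -> hard

class AdaptiveCurriculum:
    """Toy scheduler: stay on the current format until rolling accuracy
    over the last `window` attempts reaches `threshold`, then advance."""
    def __init__(self, threshold=0.7, window=20):
        self.level = 0
        self.threshold = threshold
        self.window = window
        self.history = []

    def next_format(self):
        return FORMATS[self.level]

    def record(self, correct):
        self.history.append(correct)
        recent = self.history[-self.window:]
        if (len(recent) == self.window
                and sum(recent) / self.window >= self.threshold
                and self.level < len(FORMATS) - 1):
            self.level += 1
            self.history = []  # restart the window on the harder format

cur = AdaptiveCurriculum()
rng = random.Random(0)
for step in range(200):
    fmt = cur.next_format()
    # Stand-in "model": more accurate on easier formats.
    p = {"multiple_choice": 0.9, "cloze": 0.8, "open_ended": 0.4}[fmt]
    cur.record(rng.random() < p)
```

After enough successful steps the scheduler has moved past the multiple-choice stage, mirroring how Cog-DRIFT transfers competence from reformulated variants back to the hard original problems.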
Results
Cog-DRIFT achieved substantial improvements on hard reasoning problems, with absolute gains of +10.11% for the Qwen model and +8.64% for the Llama model. The method consistently outperformed standard GRPO and guided-exploration baselines, with average improvements of +4.72% for Qwen and +3.23% for Llama over the second-best baseline. Additionally, the curriculum approach enhanced sample efficiency and improved pass@k metrics at test time.
Implications
The findings suggest that task reformulation and adaptive curriculum learning can significantly enhance the training of LLMs, particularly in scenarios where traditional methods struggle. This approach could be applied to various domains requiring complex reasoning, potentially leading to more robust and capable AI systems.
Integrating Artificial Intelligence, Physics, and Internet of Things: A Framework for Cultural Heritage Conservation
Theory
- Integration of AI, IoT, and physical knowledge for cultural heritage conservation.
- Development of a four-layer framework for effective monitoring and predictive maintenance.
- Utilization of Physics-Informed Neural Networks (PINNs) for enhanced simulation accuracy.
- Incorporation of Reduced Order Methods (ROMs) to improve computational efficiency.
Read more
Integrating Artificial Intelligence, Physics, and Internet of Things: A Framework for Cultural Heritage Conservation
Summary
This paper presents a novel framework aimed at enhancing the conservation of cultural heritage through the integration of Artificial Intelligence (AI), Internet of Things (IoT), and physical knowledge. The proposed framework is structured into four functional layers: data acquisition, knowledge base, inference engine, and application layer. A central innovation is the use of Physics-Informed Neural Networks (PINNs) that incorporate physical laws into deep learning models, allowing for more accurate simulations of degradation processes in cultural assets. The framework also employs Reduced Order Methods (ROMs) to improve computational efficiency and is compatible with classical Finite Element methods. The authors demonstrate the framework's capabilities through simulated scenarios involving complex geometries, addressing both direct and inverse problems in cultural heritage conservation. The methodology allows for the automatic processing of 3D digital replicas, facilitating reliable simulations that combine data-driven and physics-based approaches. Overall, this work contributes significantly to the field by providing a comprehensive solution for predictive maintenance and monitoring of cultural heritage assets.
Methodology
The framework employs a structured approach consisting of four layers: data acquisition for sensor data and digital replicas, a knowledge base for data storage and preprocessing, an inference engine for simulation and analysis, and an application layer for user interaction. The integration of PINNs allows for the incorporation of physical laws into machine learning models, while ROMs enhance computational efficiency. The framework is tested through simulated scenarios that involve both direct and inverse problems.
Results
The framework successfully demonstrates its efficacy through various test problems, including simulations of temperature monitoring and diffusion reactions in cultural assets. The results indicate that the integration of PINNs with ROMs allows for accurate modeling of degradation processes influenced by environmental and material parameters, showcasing the framework's potential for practical applications in cultural heritage conservation.
Implications
This framework has significant implications for the field of cultural heritage conservation, providing a robust methodology for predictive maintenance and risk assessment of cultural assets. By combining AI, IoT, and physical knowledge, it enhances the reliability of simulations and decision-making processes, potentially leading to better preservation strategies and resource allocation in the conservation of historical artifacts.
Hierarchical Planning with Latent World Models
Reinforcement Learning
Robotics
Optimization
- Introduces a hierarchical planning framework that mitigates prediction errors in long-horizon control tasks.
- Achieves a 70% success rate in real-world robotic tasks using zero-shot control with final goal specifications.
- Demonstrates up to 3× reduction in planning-time compute compared to traditional methods.
- Applies across diverse latent world-model architectures, enhancing generalizability.
Read more
Hierarchical Planning with Latent World Models
Summary
This paper presents a novel framework for hierarchical planning in robotic control using learned latent world models, addressing the challenges of long-horizon control and prediction errors. The proposed method, Hierarchical Planning with Latent World Models (HWM), operates by learning world models at multiple temporal scales and performing hierarchical planning across these scales. This approach allows for long-horizon reasoning while significantly reducing the computational complexity of planning. HWM serves as a plug-in abstraction applicable to various latent world-model architectures and domains. The authors demonstrate that this hierarchical planning framework enables zero-shot control in real-world robotic tasks, achieving a 70% success rate in pick-and-place tasks using only final goal specifications, compared to 0% with a single-level world model. Additionally, in simulated environments, hierarchical planning shows improved success rates while requiring up to three times less planning-time compute, highlighting its efficiency and effectiveness in complex decision-making scenarios.
Methodology
The authors propose a hierarchical model predictive control (MPC) framework that learns latent world models at different temporal resolutions. The high-level planner generates macro-actions using a long-horizon world model, while the low-level planner optimizes primitive actions based on subgoals derived from high-level predictions. This interaction between levels is facilitated by a learned action encoder that compresses action sequences, reducing the dimensionality of the planning space and enhancing computational efficiency.
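The two-timescale structure can be illustrated with a deliberately tiny example: a 1-D navigation task where the high-level planner emits coarse subgoals and the low-level planner fills in primitive unit actions. This toy omits everything learned (the latent world models and action encoder) and only shows the planning hierarchy:

```python
def high_level_plan(start, goal, stride=5):
    """Coarse planner: subgoals toward the goal, one per macro-step."""
    subgoals = []
    pos = start
    while pos != goal:
        step = max(min(goal - pos, stride), -stride)
        pos += step
        subgoals.append(pos)
    return subgoals

def low_level_control(start, subgoal):
    """Fine planner: primitive unit actions reaching the current subgoal."""
    actions = []
    pos = start
    while pos != subgoal:
        a = 1 if subgoal > pos else -1
        actions.append(a)
        pos += a
    return actions, pos

pos, total_actions = 0, 0
for sg in high_level_plan(pos, 17, stride=5):
    actions, pos = low_level_control(pos, sg)
    total_actions += len(actions)
```

The high level here reasons over 4 macro-steps instead of 17 primitive steps; in HWM the analogous reduction of the planning horizon at the top level is what cuts planning-time compute while the low level keeps fine-grained control.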
Results
The hierarchical planning approach significantly outperformed single-level models, achieving a 70% success rate in pick-and-place tasks and higher success rates in simulated environments like push manipulation and maze navigation. The method also demonstrated a reduction in planning time, requiring up to three times less compute while maintaining or exceeding success rates compared to flat planning methods.
Implications
The findings suggest that hierarchical planning with latent world models can be effectively applied to complex robotic control tasks, enabling zero-shot deployment in diverse environments. This approach could lead to advancements in autonomous systems, enhancing their adaptability and efficiency in real-world applications.
Fast NF4 Dequantization Kernels for Large Language Model Inference
Large Language Models
Efficient ML
Optimization
- Identification of dequantization as a critical bottleneck in LLM inference.
- Development of a lightweight shared memory optimization that reduces latency.
- Achieved 2.0–2.2× speedup in kernel execution and 1.54× end-to-end improvement.
- Compatibility with HuggingFace ecosystem facilitates easy adoption.
Read more
Fast NF4 Dequantization Kernels for Large Language Model Inference
Summary
The paper addresses the performance bottleneck in large language model (LLM) inference caused by the dequantization process from NF4 (4-bit NormalFloat) to FP16 format on NVIDIA GPUs. As LLMs grow in size, traditional methods of dequantization become increasingly inefficient, consuming significant computational resources and time. The authors propose a lightweight shared memory optimization that enhances the dequantization process by exploiting memory hierarchy effectively. This approach simplifies indexing logic and reduces instruction counts, achieving substantial speedups in kernel execution. The experimental results demonstrate a 2.0–2.2× speedup in kernel performance and up to 1.54× improvement in end-to-end latency across various model sizes, showcasing the potential of lightweight optimizations in enhancing LLM inference without requiring extensive engineering efforts. The proposed solution is compatible with existing frameworks like HuggingFace, making it accessible for practical deployment in both research and production environments.
Methodology
The authors conducted systematic profiling to identify the dequantization bottleneck in NF4 implementations. They designed a shared memory optimization that reduces global memory access by transforming redundant accesses into efficient per-block loading. This method leverages the latency advantages of on-chip memory and simplifies the indexing logic, resulting in reduced instruction counts.
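For readers unfamiliar with the operation being optimized, NF4 dequantization itself is a table lookup plus a per-block rescale. The NumPy reference below is a sketch of that operation, not the paper's GPU kernel; the 16 code values approximate the published NF4 table, and the nibble order (high nibble first) and blockwise absmax scaling follow bitsandbytes conventions but are assumptions here:

```python
import numpy as np

# Approximate code values of the 4-bit NormalFloat (NF4) format.
NF4_CODES = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
], dtype=np.float32)

def dequantize_nf4(packed, scales, block_size=64):
    """Reference NF4 -> float dequantization: unpack two 4-bit indices per
    byte, look up each code value, and rescale per block of values."""
    hi = packed >> 4
    lo = packed & 0x0F
    idx = np.empty(packed.size * 2, dtype=np.uint8)
    idx[0::2] = hi
    idx[1::2] = lo
    vals = NF4_CODES[idx]
    # One absmax scale per block of `block_size` values.
    return vals.reshape(-1, block_size) * scales[:, None]

packed = np.array([0x7F, 0x08], dtype=np.uint8)   # indices 7, 15, 0, 8
scales = np.array([2.0], dtype=np.float32)        # absmax of this block
out = dequantize_nf4(packed, scales, block_size=4)
```

On a GPU, the `NF4_CODES` lookup and the per-block scale are exactly the accesses the authors move into shared memory, since every thread in a block reuses them.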
Results
The proposed optimization resulted in a kernel speedup of 2.0–2.2× across three models (Gemma 27B, Qwen3 32B, and Llama3.3 70B) and an end-to-end improvement of up to 1.54×. The optimization reduced the instruction count for dequantization by 71%, demonstrating significant efficiency gains.
Implications
This work provides a practical solution for deploying large language models on existing GPU infrastructure, enabling more efficient inference and broader accessibility to advanced models. It highlights the importance of optimizing memory access patterns in machine learning systems, particularly as model sizes continue to grow.
How Long short-term memory artificial neural network, synthetic data, and fine-tuning improve the classification of raw EEG data
Time Series
- Integration of synthetic data generation with LSTM networks enhances EEG classification.
- The study addresses the limitations of traditional machine learning methods in EEG data analysis.
- Fine-tuning techniques are crucial for improving model performance on ambiguous visual stimuli.
- The experimental design involved a well-defined dataset with controlled ambiguity levels.
Read more
How Long short-term memory artificial neural network, synthetic data, and fine-tuning improve the classification of raw EEG data
Summary
This paper presents a novel machine learning pipeline aimed at enhancing the classification of electroencephalogram (EEG) data using a combination of synthetic data generation, Long Short-Term Memory (LSTM) artificial neural networks, and fine-tuning techniques. The study focuses on classifying EEG responses to implicit visual stimuli, specifically the Necker cube, which presents varying levels of ambiguity. The authors highlight the challenges posed by limited experimental data in neuroscience, where traditional machine learning methods often fall short. By generating synthetic datasets through band-pass filtering and employing LSTM networks, the authors demonstrate a significant improvement in classification accuracy. The experimental setup involved twenty healthy volunteers who interpreted images of the Necker cube, with EEG data collected using a standard electrode configuration. The results indicate that the proposed approach effectively addresses the classification problem, showcasing the potential of deep learning techniques in brain-computer interface applications.
Methodology
The methodology includes generating synthetic EEG data through band-pass filtering, implementing LSTM networks for classification, and applying fine-tuning techniques to optimize model performance. The dataset was structured based on varying levels of ambiguity in visual stimuli, and K-fold Cross-Validation was employed to ensure robust evaluation.
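The band-pass step behind the synthetic data can be illustrated with a minimal FFT-based filter. This is a sketch of the general technique only (an ideal brick-wall filter; the paper's exact filter design and band choices are not specified in this summary):

```python
import numpy as np

def bandpass_fft(x, fs, low, high):
    """Ideal band-pass: zero out FFT bins outside [low, high] Hz.
    Used here to carve band-limited 'synthetic EEG' out of noise."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(X, n=len(x))

rng = np.random.default_rng(0)
fs = 250                          # a typical EEG sampling rate
noise = rng.normal(size=fs * 4)   # 4 seconds of white noise
alpha_like = bandpass_fft(noise, fs, 8.0, 12.0)  # alpha-band surrogate signal
```

Batches of such surrogate channels, labeled by the condition they are meant to imitate, are the kind of synthetic training material that lets the LSTM see far more examples than the twenty-subject experiment alone provides.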
Results
The proposed approach resulted in a marked improvement in the classification accuracy of raw EEG data, effectively distinguishing between different levels of ambiguity in the visual stimuli presented to participants. The integration of synthetic data and fine-tuning of the LSTM model contributed significantly to these results.
Implications
The findings suggest that combining synthetic data generation with advanced neural network architectures can enhance the analysis of EEG data, potentially leading to better brain-computer interface applications. This approach may facilitate more accurate interpretations of brain activity in response to visual stimuli, with broader implications for neuroscience research and clinical applications.
Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
Large Language Models
Efficient ML
- Introduction of Diagonal-Tiled Mixed-Precision Attention (DMA) for efficient LLM inference.
- Development of a fully fused GPU kernel that integrates multiple processes into one workflow.
- Demonstration of lossless generation quality compared to traditional full-precision methods.
- Significant speedup achieved through kernel fusion and efficient memory usage.
Read more
Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
Summary
This paper addresses the high inference costs associated with transformer-based large language models (LLMs) due to the quadratic complexity of attention mechanisms and the limitations of high-precision operations. The authors propose a novel Diagonal-Tiled Mixed-Precision Attention (DMA) kernel that utilizes the microscaling floating-point (MXFP) data format to enhance computational efficiency on next-generation GPU architectures. DMA incorporates a mixed-precision design that partitions the attention matrix into low- and high-precision regions, allowing critical tokens to retain high precision while leveraging low-precision formats for others. The kernel is fully fused, integrating quantization, microscaling transformation, and attention computation into a single workflow, which minimizes memory access and kernel launch overhead. Empirical evaluations on NVIDIA B200 GPUs demonstrate that DMA achieves lossless generation quality compared to full-precision attention while significantly speeding up inference through kernel fusion. The authors also provide insights into the trade-offs between efficiency and accuracy through extensive ablation studies.
Methodology
The authors developed the DMA kernel by implementing a mixed-precision design that partitions the attention matrix into regions of low and high precision. They utilized Triton for kernel implementation, allowing for hardware-level parallelism and memory efficiency. The workflow includes quantization, microscaling transformation, and attention computation, all fused into a single kernel to reduce overhead.
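The partitioning idea can be illustrated with a tiny numpy sketch: score tiles on the diagonal (where critical local tokens live) stay in full precision, while off-diagonal tiles pass through a simulated low-bit quantizer. Everything here is an illustrative assumption — the fake quantizer stands in for the MXFP format, and the real DMA kernel is a fused Triton implementation, not Python loops.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Simulate low-bit quantization with one shared scale per block
    (a crude stand-in for MXFP microscaling; hypothetical)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def diagonal_tiled_attention(q, k, v, tile=4):
    """Toy mixed-precision attention: diagonal score tiles keep full
    precision, off-diagonal tiles are fake-quantized before softmax."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mixed = np.empty_like(scores)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            blk = scores[i:i + tile, j:j + tile]
            mixed[i:i + tile, j:j + tile] = blk if i == j else fake_quant(blk)
    p = np.exp(mixed - mixed.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = diagonal_tiled_attention(q, k, v)
```

The sketch only shows the tiling policy; the speedup in the paper comes from fusing quantization, microscaling transform, and attention into one GPU kernel, which this reference code does not attempt.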
Results
Experimental results indicate that DMA maintains generation quality comparable to full-precision attention while achieving significant speed improvements. The ablation studies confirm the effectiveness of the mixed-precision tiling and diagonal window design in optimizing performance without sacrificing accuracy.
Implications
The proposed DMA kernel has the potential to enhance the efficiency of LLM inference in practical applications, making low-bit quantization feasible for real-world deployment. This could lead to faster and more resource-efficient AI systems, particularly in environments with limited computational resources.
Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training
Graph Learning
- Introduction of ScaleGNN, a 4D parallel framework for mini-batch GNN training.
- Communication-free distributed sampling algorithm that enhances efficiency.
- 3D parallel matrix multiplication (PMM) for improved scalability.
- Achieved significant speedup and maintained high accuracy on large datasets.
Summary
This paper presents ScaleGNN, a novel framework designed to enhance the scalability of mini-batch training for Graph Neural Networks (GNNs). Traditional distributed mini-batch approaches face significant performance bottlenecks due to expensive sampling methods and limited scalability with data parallelism. ScaleGNN addresses these issues by introducing a 4D parallel framework that combines communication-free distributed sampling, 3D parallel matrix multiplication (PMM), and data parallelism. The framework employs a uniform vertex sampling algorithm that allows each GPU to construct its local mini-batch without inter-process communication, significantly reducing overhead. Additionally, it optimizes the training process by overlapping sampling with training, utilizing low-precision communication, kernel fusion, and communication-computation overlap. The authors evaluate ScaleGNN on five graph datasets, demonstrating its ability to scale effectively up to 2048 GPUs, achieving a 3.5× speedup over state-of-the-art methods while maintaining or exceeding accuracy levels.
Methodology
ScaleGNN employs a uniform vertex sampling algorithm that allows each GPU to independently construct local mini-batch subgraphs without inter-device communication. It organizes GPUs into a 4D virtual grid, combining data parallelism with 3D PMM to optimize matrix operations. The framework also incorporates optimizations such as overlapping sampling with training, low-precision communication, kernel fusion, and communication-computation overlap.
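One way such communication-free construction can work is for every rank to derive the same global permutation from a shared seed and then take its own disjoint slice, so no rank needs to ask another which vertices it holds. The sketch below assumes this shared-seed mechanism and uses random neighbor draws in place of real graph sampling; function names and arguments are illustrative, not ScaleGNN's API.

```python
import numpy as np

def local_minibatch(rank, world_size, num_vertices, batch_size, fanout, seed):
    """Each rank derives an identical global permutation from a shared
    seed, then takes its own slice -- no inter-process communication."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_vertices)   # identical on every rank
    batch = perm[:batch_size]              # the global mini-batch
    shard = batch[rank::world_size]        # this rank's seed vertices
    # stand-in for neighbor sampling: draw `fanout` ids from a local RNG
    local_rng = np.random.default_rng(seed + rank)
    neighbors = local_rng.integers(0, num_vertices, size=(len(shard), fanout))
    return shard, neighbors

# two ranks build disjoint shards of the same global mini-batch
s0, _ = local_minibatch(0, 2, 1000, 64, 10, seed=42)
s1, _ = local_minibatch(1, 2, 1000, 64, 10, seed=42)
```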
Results
ScaleGNN was evaluated on five graph datasets, achieving strong scaling up to 2048 GPUs on Perlmutter, 2048 GCDs on Frontier, and 1024 GPUs on Tuolumne. Notably, on Perlmutter, ScaleGNN achieved a 3.5× end-to-end training speedup over the state-of-the-art baseline on the ogbn-products dataset, while also reaching 81.3% test accuracy, outperforming existing methods like GraphSAINT and GraphSAGE.
Implications
The advancements presented in ScaleGNN could significantly impact the efficiency of GNN training on large-scale graph datasets, making it feasible to apply GNNs in real-world applications such as recommendation systems, fraud detection, and scientific discovery. The open-source nature of ScaleGNN encourages further research and development in scalable GNN training methodologies.
SODA: Semi On-Policy Black-Box Distillation for Large Language Models
NLP
Large Language Models
Efficient ML
- Introduction of semi on-policy distillation, utilizing static snapshots for effective training.
- SODA achieves up to 10× faster training and 27% less peak GPU memory usage compared to GAD.
- Outperforms state-of-the-art methods on 15 out of 16 benchmark results.
- Eliminates the need for adversarial training and additional models, simplifying the distillation process.
Summary
The paper introduces SODA (Semi On-policy Distillation with Alignment), a novel approach to knowledge distillation for large language models (LLMs) that addresses the limitations of existing methods. Traditional off-policy techniques, such as sequence-level knowledge distillation (SeqKD), fail to correct inherent errors in student models, while fully on-policy methods like Generative Adversarial Distillation (GAD) introduce significant computational overhead and instability. SODA leverages the capability gap between powerful teacher models and smaller student models by using a static snapshot of the student's responses to create a contrastive signal for training. This method allows for effective distribution alignment without the need for continuous online sampling or adversarial training. The authors validate SODA through extensive evaluations on multiple model-dataset combinations, demonstrating that it achieves superior distillation quality while being significantly faster and more memory-efficient than GAD.
Methodology
SODA employs a semi on-policy approach by using a one-time static snapshot of the student model's responses to create a contrastive signal. This involves a brief warmup on teacher data followed by Direct Preference Optimization (DPO), where the teacher's responses are treated as preferred and the student's responses as dispreferred. This method allows for effective error correction and distribution alignment without the complexities of adversarial training.
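The DPO step described above has a standard closed form: maximize the log-sigmoid of the policy's preference margin for the teacher response over the student snapshot's response, measured relative to a reference model. A minimal numpy sketch of that loss, with made-up log-probabilities (not values from the paper):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss with the teacher response as preferred (w) and the
    static student-snapshot response as dispreferred (l). Inputs are
    sequence log-probabilities under the trained policy and a frozen
    reference model; a sketch, not the paper's implementation."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# when the policy favors the teacher response more than the reference
# does, the margin is positive and the loss drops below log(2)
loss = dpo_loss(logp_w=-10.0, logp_l=-20.0, ref_logp_w=-12.0, ref_logp_l=-18.0)
```

Because the dispreferred responses come from a one-time snapshot, the contrastive pairs can be precomputed once, which is where the speed and memory savings over adversarial on-policy training come from.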
Results
SODA matches or exceeds the performance of GAD on 15 out of 16 model-dataset combinations, with improvements of up to +2.1 points. It also demonstrates a significant reduction in training time and GPU memory usage, achieving training speeds that are 10× faster than GAD while requiring 27% less peak GPU memory.
Implications
The findings suggest that SODA could be a practical solution for efficiently training smaller language models, making it easier to deploy capable models in real-world applications. This approach could enhance the accessibility of advanced language models by reducing the computational resources required for distillation.
Reflective Context Learning: Studying the Optimization Primitives of Context Space
Optimization
Reinforcement Learning
Theory
- Introduction of Reflective Context Learning (RCL) as a unified framework for context optimization.
- Emphasis on reflection as a mechanism to generate update signals for context space learning.
- Integration of classical optimization techniques to enhance learning in context space.
- Demonstrated improvements over strong baselines across multiple benchmarks.
Summary
The paper introduces Reflective Context Learning (RCL), a novel framework aimed at addressing the challenges of learning in context space, which is often overlooked in classical machine learning optimization. RCL emphasizes the importance of learning through interaction and reflection, allowing agents to iteratively update their context based on experiences. The authors argue that traditional optimization issues such as credit assignment, overfitting, and local optima also manifest in context space. RCL employs reflection to generate directional update signals from execution trajectories and current context, akin to gradients in parameter space. The framework systematically incorporates classical optimization techniques, including batching, improved credit assignment, auxiliary losses, and failure replay, to enhance learning efficiency. The authors validate RCL on various benchmarks, demonstrating that these optimization primitives significantly improve performance over existing methods, with their effectiveness varying across different task regimes. The study concludes that context updates should be viewed as an optimization problem, allowing for systematic improvements through transferable principles.
Methodology
The authors developed the RCL framework, which involves reflecting on agent behavior and experiences to create directional update signals for context. They systematically integrated classical optimization primitives such as batching, credit assignment, and failure replay into the learning process, and evaluated the framework on several benchmarks including AppWorld, BrowseComp+, and RewardBench2.
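The loop structure — batched rollouts, reflection on failures, and a failure-replay buffer — can be sketched in a few lines. Everything below is a toy: the `reflect` step is a stub standing in for an LLM-driven reflection, and the single-rule environment exists only to show the control flow, not the framework's actual interfaces.

```python
def reflect(context, failures):
    """Stub reflection step: turn a batch of failed trajectories into a
    directional context update (hypothetical interface)."""
    rule = "avoid: b"
    if failures and rule not in context:
        context = context + [rule]
    return context

def rcl_loop(run_episode, context, steps=3, batch_size=4):
    """Batched context optimization with a failure-replay buffer."""
    replay = []
    for _ in range(steps):
        batch = [run_episode(context) for _ in range(batch_size)]
        failures = [ep for ep in batch if not ep["success"]]
        context = reflect(context, failures + replay)
        replay = failures[-batch_size:]  # keep recent failures for replay
    return context

def toy_episode(context):
    # the agent picks the bad action until the reflected rule appears
    action = "a" if "avoid: b" in context else "b"
    return {"action": action, "success": action == "a"}

final_context = rcl_loop(toy_episode, context=[])
```

The point of the sketch is the analogy the paper draws: the reflected rule plays the role of a gradient step, and batching plus replay are the classical optimization primitives lifted into context space.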
Results
The application of RCL and its optimization primitives led to significant performance improvements over strong baselines in various tasks. The study highlighted the varying importance of different optimization techniques depending on the task regime, and provided insights into robustness to initialization and the effects of batch size.
Implications
The findings suggest that optimizing context rather than model parameters can lead to more efficient and interpretable learning processes for AI agents. This approach may facilitate the development of more adaptable and robust agents capable of generalizing across diverse tasks and environments.
Conditional Sampling via Wasserstein Autoencoders and Triangular Transport
Generative Models
Theory
Efficient ML
- Introduction of Conditional Wasserstein Autoencoders (CWAEs) for conditional sampling.
- Utilization of block-triangular decoders to exploit low-dimensional structures.
- Demonstration of substantial error reductions compared to traditional methods like LREnKF.
- Theoretical connections to conditional optimal transport problems.
Summary
This paper introduces Conditional Wasserstein Autoencoders (CWAEs), a novel framework for conditional sampling that leverages low-dimensional structures in both conditioned and conditioning variables. The authors modify the traditional Wasserstein autoencoder by employing a block-triangular decoder and imposing independence assumptions on latent variables. This approach allows the model to effectively exploit low-dimensional structures while enabling conditional simulation. The paper explores various theoretical properties of CWAEs, particularly their connections to conditional optimal transport problems. It also presents three architectural variants of CWAEs, forming the basis of the proposed algorithms. Through numerical experiments, the authors demonstrate that these variants significantly reduce approximation errors compared to the low-rank ensemble Kalman filter (LREnKF), especially in scenarios where the conditional measures exhibit low-dimensional support. The work emphasizes the importance of discovering low-dimensional structures directly from data, providing a scalable and data-driven alternative to existing methods in high-dimensional conditional sampling tasks.
Methodology
The authors propose a framework that combines the Wasserstein autoencoder with block-triangular transport maps. They introduce a low-dimensional latent variable to capture essential data structures and train an encoder-decoder pair by minimizing the Wasserstein distance between generated and true distributions. The model is designed to learn transport maps for generating samples from conditional distributions directly from data.
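The block-triangular structure means the decoder has the form T(x, z) = (x, f(x, z)): the conditioning variable x passes through unchanged, and the conditioned variable is generated from x together with a low-dimensional latent z, so conditional sampling reduces to fixing x and resampling z. A toy sketch with a fixed linear-tanh f (the paper's decoders are learned networks, and the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 4))  # toy weights for f: (dim x + dim z) -> dim y

def block_triangular_decoder(x, z):
    """Block-triangular map T(x, z) = (x, f(x, z)): x is passed through,
    y depends on both x and the low-dimensional latent z."""
    y = np.tanh(np.concatenate([x, z], axis=-1) @ W)
    return x, y

def conditional_sample(x, n):
    """Conditional simulation: hold the conditioning variable x fixed
    and resample only the latent z."""
    zs = np.random.default_rng(0).standard_normal((n, 3))
    return np.stack([block_triangular_decoder(x, z)[1] for z in zs])

samples = conditional_sample(np.array([0.5, -0.2]), n=8)
```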
Results
The numerical experiments show that the different variants of CWAEs achieve significant reductions in approximation error when compared to the low-rank ensemble Kalman filter, particularly in cases where the conditional measures are low-dimensional. The results validate the effectiveness of the proposed framework in high-dimensional conditional sampling tasks.
Implications
The proposed CWAEs have potential applications in Bayesian inference, nonlinear filtering, and other areas requiring efficient conditional sampling from high-dimensional distributions. The ability to automatically discover low-dimensional structures from data could enhance the performance of various machine learning models in complex scenarios.
Entropy, Disagreement, and the Limits of Foundation Models in Genomics
Theory
- High entropy in genomic sequences leads to uncertainty in model predictions.
- Models trained on DNA exhibit significant disagreement and instability compared to text models.
- Fisher information is concentrated in embedding layers of genomic models, limiting inter-token relationship exploitation.
- Self-supervised training from sequences alone may not be effective for genomic data.
Summary
This paper investigates the limitations of foundation models in genomics, particularly their mixed success compared to natural language processing models. The authors focus on the role of entropy in genomic sequences, arguing that high entropy leads to near-uniform output distributions and disagreement among models. They train ensembles of transformer models on both text and DNA sequences, analyzing predictions, static embeddings, and Fisher information flow. The findings reveal that genomic models struggle to learn meaningful relationships due to the low information content of DNA, which hinders their ability to produce confident predictions. The study questions the effectiveness of self-supervised training methods for genomic data, suggesting that current methodologies may not be suitable for developing robust genomic foundation models.
Methodology
The authors trained three ensembles of transformer models (BERT) on English text and DNA sequences, using both byte-pair encoding and k-mer tokenization. They ensured that all models had the same architecture, training data, and hyperparameters to facilitate direct comparison. The models' output distributions and embeddings were analyzed using Kullback-Leibler divergence and Jensen-Shannon distance metrics.
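The disagreement metrics named above are standard and easy to state exactly. The sketch below implements KL divergence and Jensen-Shannon distance for discrete output distributions; the two example distribution pairs are fabricated for illustration and are not data from the paper.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions
    with full support (no zero entries)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence,
    computed against the mixture m = (p + q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

# two near-uniform ("high-entropy") model outputs versus two confident
# but conflicting ones: the confident pair disagrees far more
flat_a, flat_b = [0.26, 0.24, 0.25, 0.25], [0.24, 0.26, 0.25, 0.25]
peaked_a, peaked_b = [0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]
```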
Results
The study found that DNA models produced output distributions with greater uncertainty compared to text models, as evidenced by lower Kullback-Leibler divergence values. The Jensen-Shannon distance indicated that DNA models exhibited significant disagreement in their predictions, particularly when sampling methods were applied. Additionally, the concentration of Fisher information in the embedding layers of DNA models suggested a failure to leverage inter-token relationships effectively.
Implications
The findings suggest that genomic foundation models may require different training methodologies than those used in natural language processing. This could lead to the development of more specialized models that better capture the unique characteristics of genomic data, potentially improving performance in various downstream biological tasks.
Which Leakage Types Matter?
Theory
- Estimation leakage has minimal impact on model performance, with ΔAUC ≤ 0.005.
- Selection leakage significantly inflates performance metrics, with effects ranging from +0.013 to +0.045.
- Memorization leakage produces the largest raw effects, varying by model capacity.
- Boundary leakage is often undetected in standard cross-validation, masking its impact.
Summary
This paper investigates the impact of various types of data leakage on model performance across a large set of datasets. Conducting twenty-eight experiments on 2,047 tabular datasets and a boundary experiment on 129 temporal datasets, the author proposes a four-class taxonomy of data leakage based on causal mechanisms. The findings reveal that estimation leakage has a negligible effect on model performance, while selection leakage significantly inflates reported scores, particularly in datasets of practical sizes. Memorization leakage, which occurs when models are trained on exact copies of evaluation data, produces the largest effects and is influenced by model capacity. Boundary leakage, arising from mismatches in validation procedures, is often undetected under standard cross-validation protocols. The study emphasizes the need to prioritize the mitigation of selection leakage and highlights the inadequacy of traditional cross-validation methods in revealing structural contamination in temporal data. Overall, the research provides a quantitative landscape of leakage effects, offering insights into which types of leakage warrant more attention in machine learning practices.
Methodology
The study employs a controlled experimental design with 29 experiments (28 core and 1 boundary) across 2,047 binary classification datasets. Each experiment measures the AUC difference between leaky and clean procedures using a standard 5-fold cross-validation workflow. The author utilizes deterministic hashes for internal validation, ensuring replicability of results across dataset splits.
Results
The results indicate that the effect sizes of leakage types vary significantly, with estimation leakage showing negligible effects, selection leakage causing substantial inflation in AUC scores, and memorization leakage yielding the largest raw effects. Boundary leakage is revealed to be invisible under random cross-validation, with a pure temporal effect averaging +0.023 AUC on genuine temporal datasets.
Implications
The findings suggest that machine learning practitioners should focus on mitigating selection leakage in their workflows, as it poses a more significant risk to model validity than previously recognized. Additionally, the study calls for a reevaluation of cross-validation practices, especially in temporal contexts, to better account for structural contamination.
ROMAN: A Multiscale Routing Operator for Convolutional Time Series Models
Time Series
- ROMAN is a deterministic operator that enhances time series representation by creating a structured channel format.
- It allows for better control over inductive biases in convolutional models, improving temporal awareness and multiscale interactions.
- The operator was evaluated through synthetic tasks and real datasets, showing task-dependent improvements in accuracy and efficiency.
- ROMAN can be integrated with existing convolutional architectures without replacing them, serving as a preprocessing step.
Summary
The paper introduces ROMAN (ROuting Multiscale representAtioN), a deterministic operator designed for time series classification that enhances the representation of temporal data by mapping temporal scale and coarse temporal position into a structured channel format while reducing sequence length. ROMAN constructs an anti-aliased multiscale pyramid, extracts fixed-length windows from various scales, and organizes them into pseudochannels. This transformation allows convolutional classifiers to operate on a more compact and informative representation, improving control over inductive biases such as temporal invariance and multiscale interactions. The author evaluates ROMAN through formal analysis and empirical studies, demonstrating its effectiveness as a preprocessing step for four convolutional classifiers: MiniRocket, MultiRocket, a standard CNN-based classifier, and a fully convolutional network (FCN) classifier. The evaluation includes synthetic tasks that isolate specific temporal characteristics and benchmarks on long-sequence datasets from the UCR and UEA archives, revealing that ROMAN can enhance accuracy and computational efficiency depending on the task. Overall, ROMAN serves as a valuable tool for modifying the inductive bias of convolutional time series models.
Methodology
The methodology involves the formalization of the ROMAN operator as a multiscale routing mechanism for time series data. The operator constructs a multiscale pyramid, extracts fixed-length windows, and organizes them into pseudochannels. The effectiveness of ROMAN is evaluated through synthetic classification tasks that isolate specific temporal characteristics and through benchmarking on datasets from the UCR and UEA archives, comparing the performance of convolutional classifiers with and without the ROMAN preprocessing step.
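The pyramid-to-pseudochannel transformation can be sketched directly. The version below uses plain average pooling as a crude stand-in for the paper's anti-aliased downsampling, and the scale set, window length, and stride are arbitrary choices, not ROMAN's defaults.

```python
import numpy as np

def roman_pseudochannels(x, scales=(1, 2, 4), window=16, stride=16):
    """Toy routing operator: build a multiscale pyramid by average
    pooling, cut fixed-length windows from each scale, and stack the
    windows as pseudochannels for a downstream convolutional model."""
    channels = []
    for s in scales:
        # average-pool by factor s (crude stand-in for anti-aliasing)
        coarse = x[: len(x) // s * s].reshape(-1, s).mean(axis=1)
        for start in range(0, len(coarse) - window + 1, stride):
            channels.append(coarse[start : start + window])
    return np.stack(channels)  # shape: (num_pseudochannels, window)

x = np.sin(np.linspace(0, 8 * np.pi, 128))  # a toy length-128 series
rep = roman_pseudochannels(x)
```

Note how the time axis shrinks from 128 samples to windows of 16, while scale and coarse position are rerouted into the channel dimension — the compaction that gives downstream convolutions their efficiency gain.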
Results
The results indicate that ROMAN consistently improves the performance of convolutional classifiers on tasks where class information is dependent on temporal structure, particularly in scenarios where standard pooling methods tend to suppress relevant temporal details. The operator also enhances computational efficiency by reducing the processed time axis, making it a practical alternative representation in time series classification.
Implications
The introduction of ROMAN has significant implications for time series classification, providing a method to enhance the representation of temporal data in convolutional models. It allows researchers and practitioners to better capture temporal dynamics and interactions, potentially leading to improved performance in various applications such as financial forecasting, sensor data analysis, and other domains where time series data is prevalent.
Re-analysis of the Human Transcription Factor Atlas Recovers TF-Specific Signatures from Pooled Single-Cell Screens with Missing Controls
Theory
- Successfully re-analyzed the Human TF Atlas dataset to recover TF-specific signatures despite missing controls.
- Assigned TF identities to 60,997 cells, significantly improving the detection of transcriptional effects.
- Identified strong transcriptional remodelers and linked them to specific biological pathways.
- Demonstrated the importance of using external controls for accurate differential expression analysis.
Summary
This paper presents a comprehensive re-analysis of the Human Transcription Factor (TF) Atlas dataset (GSE216481), which includes a pooled overexpression screen of 3,550 TF open reading frames across 254,519 cells. The authors address challenges posed by incomplete metadata and missing internal controls in the original dataset. By employing a reproducible computational pipeline, they successfully assign TF identities to 60,997 cells and recover TF-specific signatures for 59 out of 61 testable TFs. The study utilizes embryoid body (EB) cells as an external baseline for differential expression analysis, effectively removing batch effects and enhancing the detection of TF-specific signals. The results reveal strong transcriptional remodelers and highlight functional enrichment patterns linking specific TFs to biological pathways. The findings demonstrate that the TF Atlas data can support validated analyses when combined with appropriate controls and artifact removal strategies.
Methodology
The authors developed a fully automated pipeline that processes raw scRNA-seq data through quality control, normalization, and dimensionality reduction. They demultiplexed MORF barcodes to assign TF identities and performed differential expression analysis using EB cells as a negative control. Functional enrichment analysis was conducted to link TFs to biological pathways.
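The core comparison — TF-assigned cells against the external EB baseline — reduces to a per-gene effect size. A minimal sketch with synthetic counts (a stand-in for the real scRNA-seq matrices; the pseudocount and log2 scale are common conventions, not necessarily the authors' exact choices):

```python
import numpy as np

def log_fold_change(tf_cells, control_cells, pseudocount=1.0):
    """Per-gene log2 fold-change of TF-overexpressing cells versus an
    external control population (EB cells in the paper). A minimal
    sketch of the comparison, not the authors' full pipeline."""
    tf_mean = tf_cells.mean(axis=0) + pseudocount
    ctrl_mean = control_cells.mean(axis=0) + pseudocount
    return np.log2(tf_mean / ctrl_mean)

rng = np.random.default_rng(0)
control = rng.poisson(2.0, size=(100, 50)).astype(float)  # 100 cells x 50 genes
perturbed = control.copy()
perturbed[:, 0] += 6.0                                    # gene 0 upregulated
lfc = log_fold_change(perturbed, control)
```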
Results
The analysis revealed TF-specific signatures for 59 of 61 testable TFs, with HOPX, MAZ, PAX6, FOS, and FEZF2 identified as the strongest transcriptional remodelers. The study found significant enrichment patterns linking FEZF2 to differentiation regulation, EGR1 to Hippo and cardiac programs, FOS to focal adhesion, and NFIC to collagen biosynthesis. The per-TF effect sizes correlated with previously published rankings, indicating the robustness of the findings.
Implications
The findings underscore the potential of re-analyzing existing datasets with improved methodologies to uncover valuable biological insights. This approach can enhance our understanding of TF functions and their roles in cellular processes, which may have implications for developmental biology and disease research.
Isokinetic Flow Matching for Pathwise Straightening of Generative Flows
Generative Models
Efficient ML
Optimization
- Introduction of Isokinetic Flow Matching (Iso-FM) as a solution to curvature issues in flow-based generative models.
- Iso-FM utilizes a Jacobian-free regularizer to penalize pathwise acceleration, enhancing local velocity consistency.
- Demonstrated significant improvements in few-step sampling efficiency on CIFAR-10, with a 2.9× relative efficiency gain.
- Empirical results show a reduction in conditional non-OT FID@2 from 78.82 to 27.13.
Summary
This paper introduces Isokinetic Flow Matching (Iso-FM), a novel approach aimed at enhancing the efficiency of few-step sampling in flow-based generative models. Traditional Flow Matching (FM) suffers from strong curvature in the learned marginal velocity field due to trajectory superposition, leading to significant numerical truncation errors. Iso-FM addresses this issue by implementing a lightweight, Jacobian-free dynamical regularizer that penalizes pathwise acceleration, thereby enforcing local velocity consistency. The method employs a self-guided finite-difference approximation of the material derivative, allowing it to operate as a plug-and-play addition to existing FM training without the need for auxiliary encoders or costly second-order autodifferentiation. Empirical evaluations on the CIFAR-10 dataset demonstrate that Iso-FM significantly improves few-step generation efficiency, achieving a 2.9× relative efficiency gain in conditional non-OT FID@2, reducing it from 78.82 to 27.13, and attaining a best-observed FID@4 of 10.23. These findings establish acceleration regularization as a viable and computationally efficient strategy for fast generative sampling.
Methodology
The methodology involves the introduction of Iso-FM, which applies a Jacobian-free regularization technique to penalize pathwise acceleration in the learned velocity field. This is achieved through a self-guided finite-difference approximation of the material derivative, allowing for improved local velocity consistency without the need for complex auxiliary structures or second-order computations. The method is integrated into the existing FM training framework as a straightforward augmentation.
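The finite-difference trick is the heart of the method: step the state forward along its own velocity and difference the two velocity evaluations, which estimates the material derivative Dv/Dt = ∂v/∂t + (v·∇)v without any Jacobian. The toy velocity field, step size, and symbols below are our illustrative assumptions, not the paper's network or hyperparameters.

```python
import numpy as np

def velocity(x, t, theta):
    """Toy velocity field v(x, t); stands in for the learned network."""
    return np.tanh(theta[0] * x + theta[1] * t)

def acceleration_penalty(x, t, theta, h=1e-2):
    """Jacobian-free, self-guided finite-difference estimate of the
    pathwise acceleration Dv/Dt along the flow: advance the state by
    its own velocity, re-evaluate, and penalize the squared difference."""
    v = velocity(x, t, theta)
    v_ahead = velocity(x + h * v, t + h, theta)
    return float(np.mean(((v_ahead - v) / h) ** 2))

x = np.linspace(-1, 1, 32)
pen = acceleration_penalty(x, t=0.3, theta=(0.8, 0.5))
```

A perfectly straight (constant-velocity) field incurs zero penalty, so minimizing this term alongside the usual FM loss pushes trajectories toward the straight paths that few-step samplers need.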
Results
The implementation of Iso-FM on the CIFAR-10 dataset resulted in a substantial reduction in conditional non-OT FID@2 from 78.82 to 27.13, reflecting a 2.9× relative efficiency gain. Additionally, the best-observed FID@4 achieved was 10.23, indicating a significant enhancement in the quality of few-step generative sampling.
Implications
The findings suggest that acceleration regularization can be a powerful tool for improving the efficiency of generative models, particularly in scenarios where rapid sampling is critical. This could have applications in various domains requiring fast and high-quality generative outputs, such as image synthesis, video generation, and other creative AI applications.
Adaptive Semantic Communication for Wireless Image Transmission Leveraging Mixture-of-Experts Mechanism
Computer Vision
Efficient ML
Multimodal
- Introduces a novel multi-stage end-to-end image semantic communication system.
- Utilizes a dynamic expert gating mechanism for adaptive routing based on CSI and image semantics.
- Achieves significant improvements in image reconstruction quality over existing methods.
- Maintains high transmission efficiency despite increased model adaptability.
Summary
This paper addresses the limitations of traditional semantic communication systems in wireless image transmission, which often rely on fixed models and lack adaptability to varying image contents and dynamic channel conditions. The authors propose a novel multi-stage end-to-end image semantic communication system that utilizes a Mixture-of-Experts (MoE) mechanism integrated with a Swin Transformer architecture. The key innovation is a dynamic expert gating mechanism that evaluates both real-time Channel State Information (CSI) and the semantic content of image patches to compute adaptive routing probabilities. This allows for the selective activation of a specialized subset of experts, enhancing the system's adaptability and efficiency. The proposed method significantly improves image reconstruction quality while maintaining transmission efficiency, outperforming existing methods in simulations. Overall, the work highlights the potential of adaptive semantic communication frameworks in next-generation wireless networks.
Methodology
The authors developed an adaptive MoE Swin Transformer block that synthesizes real-time CSI and semantic content to determine optimal routing strategies. The architecture includes a specialized gating mechanism that computes routing probabilities for activating a targeted subset of experts, enhancing model capacity without proportional computational overhead. The system is designed for multi-input multi-output (MIMO) channels and is evaluated through extensive simulations.
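The gating mechanism can be sketched as a small function: concatenate a patch's semantic features with the current CSI, score the experts, and keep the top-k with renormalized routing weights. The linear gate, feature sizes, and expert count below are placeholder assumptions; the paper's gate sits inside a Swin Transformer block and its exact shapes are not reproduced here.

```python
import numpy as np

def gate(patch_feat, csi, Wg, top_k=2):
    """Adaptive expert gating: route on the concatenation of semantic
    patch features and channel state information (CSI), then activate
    only the top-k experts with renormalized weights."""
    logits = np.concatenate([patch_feat, csi]) @ Wg
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    experts = np.argsort(probs)[-top_k:][::-1]       # chosen expert indices
    weights = probs[experts] / probs[experts].sum()  # renormalized weights
    return experts, weights

rng = np.random.default_rng(0)
Wg = rng.standard_normal((8 + 4, 6))  # 8 patch dims + 4 CSI dims -> 6 experts
experts, weights = gate(rng.standard_normal(8), rng.standard_normal(4), Wg)
```

Because only k of the experts run per patch, capacity grows with the expert pool while per-patch compute stays roughly constant — the efficiency argument the summary makes.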
Results
Simulation results demonstrate that the proposed system achieves considerable performance gains in image reconstruction quality compared to existing methods, while preserving transmission efficiency. The adaptive routing mechanism significantly enhances the system's robustness to diverse image contents and dynamic channel conditions.
Implications
The findings suggest that adaptive semantic communication frameworks can greatly enhance the efficiency and robustness of wireless image transmission, making them suitable for data-intensive applications in next-generation wireless networks. This approach could lead to advancements in various fields, including telecommunication, remote sensing, and multimedia streaming.
Generalization Limits of Reinforcement Learning Alignment
NLP
Large Language Models
Reinforcement Learning
- RLHF does not acquire new capabilities but redistributes existing ones, limiting generalization to unknown attacks.
- The introduction of 'compound jailbreaks' effectively demonstrates vulnerabilities in LLM safety mechanisms.
- The attack success rate significantly increases when combining multiple attack techniques, indicating weaknesses in individual defenses.
- Safety training may overfit to training data, failing to generalize to diverse attack patterns.
Summary
This paper investigates the limitations of reinforcement learning from human feedback (RLHF) in ensuring the safety of large language models (LLMs). The authors argue that RLHF does not lead to the acquisition of new capabilities but rather redistributes the utilization of existing ones, which raises concerns about the generalization of safety mechanisms against unknown attack patterns. To empirically demonstrate these limitations, the authors introduce 'compound jailbreaks,' a novel attack strategy that combines multiple existing attack techniques to exploit vulnerabilities in the instruction hierarchy of OpenAI's gpt-oss-20b model. Their findings reveal that the attack success rate (ASR) significantly increases from 14.3% with individual methods to 71.4% when using the combined approach. This study highlights the need for multifaceted safety evaluations and suggests that current safety training may not generalize effectively to novel attack scenarios, emphasizing the importance of addressing these vulnerabilities in future LLM safety protocols.
Methodology
The authors conducted a theoretical analysis of RLHF limitations and developed a series of compound jailbreak attacks targeting the gpt-oss-20b model. They empirically tested the effectiveness of these attacks by measuring the attack success rate (ASR) when using both individual and combined attack techniques.
Results
The study found that the attack success rate increased from 14.3% with individual attack methods to 71.4% when employing the compound jailbreak approach, providing empirical evidence for the limitations of RLHF in generalizing to unknown attack patterns.
Implications
The findings suggest that current safety mechanisms in LLMs may be insufficient against novel attack strategies, highlighting the need for improved safety training and evaluation methods. This research could inform the development of more robust LLM safety protocols and encourage further exploration of multifaceted attack scenarios.
DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
NLP
Large Language Models
Multimodal
- DrugPlayGround is a comprehensive benchmarking platform for evaluating LLMs in drug discovery.
- The framework assesses LLM performance across four critical drug discovery tasks.
- Collaboration with domain experts is emphasized to enhance the reliability of LLM predictions.
- The study provides insights into the strengths and limitations of LLMs in pharmaceutical applications.
Summary
The paper introduces DrugPlayGround, a benchmarking framework designed to evaluate the performance of large language models (LLMs) in drug discovery. The authors highlight the transformative potential of LLMs in accelerating drug research by enhancing hypothesis generation and optimizing candidate prioritization. However, they note the absence of objective assessments to understand the advantages and limitations of LLMs compared to traditional drug discovery methods. DrugPlayGround addresses this gap by providing a systematic evaluation of LLMs across four key tasks: drug function analysis, drug-target interaction prediction, prediction of synergistic drug combinations, and drug perturbation prediction. The framework is built on paired datasets and allows for both qualitative and quantitative assessments of LLM-generated outputs. The authors emphasize the importance of collaboration with domain experts to ensure the reliability of LLM predictions and to explore their reasoning capabilities. Through detailed analyses, the framework aims to elucidate the strengths and weaknesses of current LLM approaches, ultimately guiding future developments in drug discovery methodologies.
Methodology
The authors developed DrugPlayGround as a benchmarking platform that utilizes paired datasets to evaluate LLM performance in drug discovery tasks. The framework includes both qualitative assessments of LLM-generated textual content and quantitative evaluations of embeddings. It systematically analyzes LLM outputs across four main tasks relevant to drug discovery, facilitating a comprehensive understanding of their capabilities and limitations.
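One common way to evaluate embeddings quantitatively in such a setting is retrieval-style scoring, e.g. for drug-target interaction. The sketch below is purely illustrative (synthetic vectors, cosine similarity, top-1 retrieval) and is not drawn from DrugPlayGround itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: 5 drugs and 5 targets in a shared 16-d space,
# constructed so that drug i truly interacts with target i.
targets = rng.normal(size=(5, 16))
drugs = targets + 0.1 * rng.normal(size=(5, 16))  # noisy copies of the true targets

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

sims = cosine_sim(drugs, targets)          # (5, 5) drug-target similarity matrix
top1 = sims.argmax(axis=1)                 # predicted interacting target per drug
top1_accuracy = float((top1 == np.arange(5)).mean())
print("top-1 retrieval accuracy:", top1_accuracy)
```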
Results
The results indicate that DrugPlayGround effectively benchmarks LLMs, revealing their strengths in generating meaningful drug representations and their limitations in handling complex chemical structures. The framework provides actionable insights through detailed error analyses and case studies, which help in understanding the performance of LLMs in various drug-related applications.
Implications
The development of DrugPlayGround has significant implications for the future of drug discovery, as it provides a structured approach to evaluate LLMs and their potential applications in pharmaceutical research. By identifying the strengths and weaknesses of LLMs, the framework can guide researchers in optimizing drug discovery pipelines and improving the safety and efficacy of therapeutic interventions.
Time-Warping Recurrent Neural Networks for Transfer Learning
Time Series
- Introduces a time-warping approach for transfer learning in RNNs.
- Demonstrates the ability of LSTMs to approximate time lag models with high accuracy.
- Applies the method to predict fuel moisture content, relevant for wildfire modeling.
- Achieves competitive accuracy compared to traditional transfer learning methods with fewer parameter adjustments.
Summary
This dissertation presents a novel approach to transfer learning using Time-Warping Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks. The research addresses the challenge of modeling dynamical systems that evolve at varying rates under different environmental conditions. By employing time-warping, the study demonstrates that LSTMs can accurately approximate linear, first-order differential equations known as time lag models, while allowing for modifications in time scaling without losing approximation accuracy. The proposed method is applied to predict fuel moisture content (FMC), a critical factor in wildfire modeling. An RNN is pretrained on data with a characteristic time scale of 10 hours, and then adapted through transfer learning to predict FMC for time scales of 1 hour, 100 hours, and 1000 hours. The effectiveness of the Time-Warping method is compared against established transfer learning techniques, revealing that it achieves comparable prediction accuracy while modifying significantly fewer parameters. This work not only advances the understanding of transfer learning in RNNs but also provides practical solutions for environmental modeling.
Methodology
The methodology involves mathematically deriving a time-warped LSTM model capable of approximating linear time lag systems. The model is pretrained on a large dataset and then fine-tuned for different time scales using the time-warping technique. The performance of the time-warping method is evaluated against various established transfer learning strategies, including full fine-tuning and frozen layers.
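The key property exploited here is that a discrete time lag model depends on the step size and the time scale only through their ratio, so warping time is equivalent to changing the time scale. A minimal numpy sketch of that invariance (the exponential-update discretization is a standard choice for this model, not necessarily the dissertation's exact formulation, and the equilibrium series is made up):

```python
import numpy as np

def time_lag_step(m, e, dt, T):
    """Exact one-step solution of the linear time lag model dm/dt = (E - m) / T."""
    return e + (m - e) * np.exp(-dt / T)

def simulate(equilibrium, dt, T, m0=0.0):
    m, out = m0, []
    for e in equilibrium:
        m = time_lag_step(m, e, dt, T)
        out.append(m)
    return np.array(out)

rng = np.random.default_rng(1)
E = rng.uniform(0.05, 0.35, size=50)  # hypothetical equilibrium moisture series

# A 10-hour time scale on unit steps, versus a 1-hour time scale with the
# step warped by the same factor, traverses the identical trajectory:
slow = simulate(E, dt=1.0, T=10.0)
fast_warped = simulate(E, dt=0.1, T=1.0)
print(np.allclose(slow, fast_warped))  # the update depends only on dt / T
```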
Results
The results indicate that the time-warping method provides predictions for FMC with accuracy levels comparable to traditional transfer learning methods, despite modifying only a small fraction of the parameters. The study also highlights the stability of the learned dynamics and the effectiveness of zero-shot transfer learning for different FMC classes.
Implications
The findings suggest that time-warping can be a powerful tool for enhancing transfer learning in RNNs, particularly in applications involving dynamical systems with varying time scales. This could have significant implications for environmental modeling, particularly in predicting phenomena like wildfires, where accurate and timely predictions are crucial.
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
Large Language Models
Efficient ML
Optimization
- SLaB combines sparse, low-rank, and binary matrix decomposition for efficient LLM compression.
- The framework eliminates the need for retraining, making it computationally efficient.
- SLaB reduces perplexity by up to 36% and improves zero-shot accuracy by up to 8.98%.
- Activation-aware pruning scores are utilized to guide the decomposition process.
Summary
The paper presents SLaB, a novel framework designed to address the deployment challenges of large language models (LLMs) due to their high computational and memory requirements. Traditional model compression techniques, such as network pruning, often struggle to maintain performance at high compression ratios. SLaB innovatively decomposes the weight of each linear layer into three components: a sparse matrix, a low-rank matrix, and a binary matrix. This decomposition allows for significant weight compression without the need for retraining. The authors utilize activation-aware pruning scores to guide the decomposition process effectively. Experimental results on Llama-family models demonstrate that SLaB achieves state-of-the-art performance, reducing perplexity by up to 36% at a 50% compression ratio and improving accuracy by up to 8.98% on zero-shot tasks compared to existing methods. The framework's ability to retain strong performance while simplifying computations positions it as a promising solution for deploying LLMs on resource-constrained devices.
Methodology
The SLaB framework employs an optimization strategy to decompose weight matrices into three components: a sparse matrix, a low-rank matrix, and a binary matrix. The optimization process involves alternating updates of these matrices while leveraging activation influences to guide pruning decisions. The framework operates in a layer-wise manner, performing forward propagation, pruning, and updating outputs iteratively.
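A much-simplified sketch of the three-component decomposition, using plain magnitude thresholding and truncated SVD in place of the paper's activation-aware pruning scores (the function, hyperparameters, and update order below are illustrative assumptions, not the authors' algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # stand-in for a linear layer's weight matrix

def decompose(W, sparsity=0.05, rank=4, iters=5):
    """Greedy sketch of W ~ S (sparse) + L (low-rank) + alpha * B (binary signs)."""
    S = np.zeros_like(W)
    L = np.zeros_like(W)
    for _ in range(iters):
        # Sparse part: keep the largest-magnitude entries of the residual.
        R = W - L
        k = int(sparsity * W.size)
        S = np.zeros_like(W)
        idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-k:], W.shape)
        S[idx] = R[idx]
        # Low-rank part: truncated SVD of the remaining residual.
        R = W - S
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Binary part: sign matrix with the least-squares-optimal scalar scale.
    R = W - S - L
    return S, L, np.abs(R).mean() * np.sign(R)

S, L, aB = decompose(W)
err = np.linalg.norm(W - S - L - aB) / np.linalg.norm(W)
print("relative reconstruction error:", err)
```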
Results
SLaB demonstrated significant improvements in performance metrics, achieving a reduction in perplexity by up to 36% at a 50% compression ratio and enhancing accuracy by up to 8.98% on zero-shot tasks when evaluated on Llama-family models. These results indicate that SLaB can maintain or even improve model performance while achieving high compression.
Implications
The SLaB framework has the potential to facilitate the deployment of large language models on resource-limited devices, making advanced AI capabilities more accessible. Its innovative approach to model compression could inspire further research into efficient machine learning techniques.
Neural Operators for Multi-Task Control and Adaptation
Reinforcement Learning
Robotics
Optimization
- Neural operators are established as effective models for multi-task control, capable of approximating mappings from task-defining functions to optimal policies.
- The architecture allows for structured adaptation strategies, enabling efficient updates and fine-tuning for new tasks.
- Meta-trained variants of the operator facilitate rapid few-shot adaptation, improving performance with limited data.
- The approach generalizes well across different task distributions and varying amounts of training data.
Summary
This paper explores the application of neural operator methods to multi-task control problems, where the goal is to learn a mapping from task descriptions to optimal control laws. The authors introduce a permutation-invariant neural operator architecture, specifically the SetONet, to approximate solution operators for various parametric optimal control environments and locomotion tasks. Through behavioral cloning, the trained operator demonstrates the ability to generalize to unseen tasks and varying amounts of task observations. The architecture's branch-trunk structure allows for efficient adaptation strategies, ranging from lightweight updates to full-network fine-tuning, achieving strong performance across different data and compute settings. Additionally, the authors propose meta-trained operator variants that optimize initialization for few-shot adaptation, outperforming traditional meta-learning approaches. The results indicate that neural operators can effectively model multi-task control, providing a unified framework for rapid task adaptation and robust performance in diverse scenarios.
Methodology
The authors utilize the SetONet architecture, built on DeepONet, to learn mappings between function spaces. They train the operator using behavioral cloning and develop various adaptation strategies, including last-layer updates and full-network fine-tuning. Two meta-training variants, SetONet-Meta and SetONet-Meta-Full, are introduced to optimize initialization for few-shot adaptation.
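The branch-trunk structure can be sketched in a few lines: a permutation-invariant branch pools a set of task observations into coefficients, a trunk encodes query points into basis values, and the output is their inner product. Everything below (shapes, pooling, activations, random weights) is a toy assumption, not the SetONet architecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 8  # hypothetical latent basis size

W_phi = rng.normal(size=(2, 16))   # per-observation encoder
W_rho = rng.normal(size=(16, P))   # pooled-set encoder (branch head)
W_trunk = rng.normal(size=(1, P))  # query-point encoder (trunk)

def branch(observations):
    """Permutation-invariant encoding of a set of (location, value) observations."""
    h = np.tanh(observations @ W_phi)       # (m, 16) per-element features
    return np.tanh(h.mean(axis=0) @ W_rho)  # (P,) pooled coefficients

def operator(observations, x_query):
    b = branch(observations)                       # task -> coefficients
    t = np.tanh(x_query.reshape(-1, 1) @ W_trunk)  # (n, P) trunk basis
    return t @ b                                   # inner product -> outputs

obs = rng.normal(size=(12, 2))    # 12 hypothetical (x, u(x)) task observations
x = np.linspace(0.0, 1.0, 5)      # query locations
y = operator(obs, x)
print(y.shape)
# Mean pooling makes the branch insensitive to observation order:
print(np.allclose(operator(obs[::-1], x), y))
```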
Results
The experiments demonstrate that the neural operator accurately approximates the solution operator across different tasks and environments. The structured adaptation strategies allow for effective performance even on out-of-distribution tasks, with partial fine-tuning achieving comparable accuracy to full fine-tuning at a lower cost. The meta-trained operators significantly enhance few-shot adaptation capabilities, outperforming a popular meta-learning baseline.
Implications
The findings suggest that neural operators can serve as a powerful framework for multi-task control and adaptation in various applications, including robotics and dynamic systems, where rapid and efficient task adaptation is crucial.
AXELRAM: Quantize Once, Never Dequantize
Large Language Models
Efficient ML
Optimization
- AXELRAM enables computation of attention scores directly from quantized KV cache indices, eliminating the need for dequantization.
- The architecture achieves a 102.4× reduction in multiplications required for attention score computation.
- Sign pattern sensitivity in KV cache quantization can lead to significant performance degradation in certain models.
- A gradient-free sign pattern selection method is proposed to address catastrophic spikes in perplexity.
Summary
The paper introduces AXELRAM, a novel smart SRAM macro architecture designed to compute attention scores directly from quantized key-value (KV) cache indices without the need for dequantization. This innovation addresses the bottleneck in large language model (LLM) inference caused by the substantial memory requirements of KV caches. AXELRAM employs a design-time fixed codebook based on orthogonal-transform-based quantization, which optimally concentrates the distribution of each coordinate. The architecture features an asymmetric write/read path that significantly reduces the number of multiplications required during attention score computation by a factor of 102.4. The authors also investigate the phenomenon of sign pattern sensitivity in KV cache quantization, revealing that certain models experience catastrophic spikes in perplexity due to layer-wise norm heterogeneity. To mitigate this issue, they propose a gradient-free sign pattern selection method that effectively eliminates these spikes without incurring additional hardware costs. The paper provides comprehensive evaluations across multiple models, demonstrating the effectiveness and stability of the AXELRAM architecture.
Methodology
The AXELRAM architecture integrates orthogonal-transform-based quantization with a table-lookup mechanism for attention computation. It utilizes a fixed codebook for quantization and employs an asymmetric path for writing and reading data, which allows for efficient computation without dequantization. The authors conducted multi-seed evaluations across three different models to analyze the effects of sign pattern sensitivity and proposed a novel method for selecting sign patterns that mitigates performance spikes.
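The "quantize once, never dequantize" trick is easiest to see with per-coordinate scalar quantization, a simplification of the paper's orthogonal-transform-based scheme: once a query's partial products against the fixed codebook are tabulated, every attention score reduces to gathers and additions, with no per-key multiplications. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
codebook = np.linspace(-2.0, 2.0, 16)  # hypothetical fixed 4-bit codebook

# Quantize keys once: store only indices into the codebook.
K = rng.normal(size=(128, d))
idx = np.abs(K[..., None] - codebook).argmin(axis=-1)  # (128, d) indices

q = rng.normal(size=d)

# Direct route: dequantize the keys, then multiply.
scores_dequant = codebook[idx] @ q

# Lookup route: tabulate q[j] * codebook once per query, then only gather + add.
lut = np.outer(q, codebook)                       # (d, 16) partial products
scores_lut = lut[np.arange(d), idx].sum(axis=-1)  # no per-key multiplications

print(np.allclose(scores_dequant, scores_lut))  # identical attention scores
```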
Results
The evaluation revealed that while some models, like Qwen2.5-3B, experienced catastrophic spikes in perplexity (over 50), others, such as LLaMA-3.1-8B, remained stable. The proposed sign pattern selection method successfully eliminated these spikes without additional hardware costs, demonstrating the robustness of the AXELRAM architecture.
Implications
The AXELRAM architecture has the potential to significantly enhance the efficiency of large language model inference by reducing memory requirements and computational overhead. This could lead to faster and more efficient deployment of LLMs in various applications, including real-time language processing and AI-driven systems.
MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation
Graph Learning
- MAVEN introduces a mesh-aware approach that incorporates higher-dimensional geometric features for improved simulation accuracy.
- The model enhances the representation of contact interactions and internal physical propagation by utilizing 3D cells and 2D facets.
- MAVEN consistently outperforms existing GNN-based methods on established datasets and specific tasks involving complex deformations.
- The architecture employs learnable position-aware aggregators to facilitate information propagation through higher-dimensional structures.
Summary
The paper presents MAVEN, a novel mesh-aware volumetric encoding network designed to improve the simulation of 3D flexible deformations. Traditional graph neural networks (GNNs) often represent meshes using only vertices and edges, neglecting higher-dimensional geometric features such as 2D facets and 3D cells. This limitation can lead to inaccuracies in capturing boundary representations and volumetric characteristics, which are crucial for modeling contact interactions and internal physical quantity propagation, especially in sparse mesh discretizations. MAVEN addresses these challenges by explicitly modeling geometric mesh elements, allowing for flexible transformations among 3D cells, 2D facets, and vertices. The architecture incorporates learnable mappings and aggregates information at higher-dimensional levels, enhancing the model's ability to simulate complex physical behaviors accurately. Experimental results demonstrate that MAVEN achieves state-of-the-art performance across various established datasets and a novel metal stretch-bending task, showcasing its effectiveness in handling large deformations and prolonged contacts.
Methodology
MAVEN employs an encoder-processor-decoder architecture that explicitly integrates geometric elements from the mesh, including vertices, facets, and cells. The model uses learnable mappings to encode vertex information into higher-dimensional elements and processes interactions at the cell level, allowing for comprehensive geometric modeling and accurate simulation under sparse mesh conditions.
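The lift-and-scatter pattern between vertices and 3D cells can be sketched with plain means standing in for the paper's learnable position-aware aggregators (the tiny two-cell mesh below is hypothetical):

```python
import numpy as np

# Toy mesh: 6 vertices with 2 features each, and 2 tetrahedral cells
# given as tuples of vertex indices.
vertex_feats = np.arange(12, dtype=float).reshape(6, 2)
cells = np.array([[0, 1, 2, 3], [2, 3, 4, 5]])

# Encode: lift vertex features onto cells (aggregate each cell's 4 vertices).
cell_feats = vertex_feats[cells].mean(axis=1)  # (2, 2)

# Decode: scatter cell features back, averaging over every cell
# that contains a given vertex.
num = np.zeros_like(vertex_feats)
cnt = np.zeros((6, 1))
for c, feat in zip(cells, cell_feats):
    num[c] += feat
    cnt[c] += 1
back = num / np.maximum(cnt, 1)
print(cell_feats.shape, back.shape)
```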
Results
MAVEN demonstrated state-of-the-art performance in simulations involving flexible deformations, outperforming traditional GNN methods on established datasets and a novel metal stretch-bending task. The results indicate that the incorporation of higher-dimensional geometric features significantly enhances the accuracy of physical simulations.
Implications
The advancements presented in MAVEN have potential applications in various fields requiring accurate modeling of flexible deformations, such as industrial manufacturing, aeronautical engineering, and materials science. The ability to simulate complex physical interactions more effectively could lead to improved design and analysis processes in these domains.
k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS
Graph Learning
Efficient ML
Theory
- Introduction of k-Maximum Inner Product (k-MIP) attention for graph transformers.
- Achieves linear memory complexity and significant speed improvements over traditional attention mechanisms.
- Proves that k-MIP attention retains the expressive power of full-attention transformers.
- Integrates seamlessly into the GraphGPS framework with established theoretical bounds.
Summary
This paper introduces k-Maximum Inner Product (k-MIP) attention, a novel self-attention mechanism designed for graph transformers. Traditional graph neural networks face challenges such as oversquashing and limited ability to model long-range dependencies, particularly when applied to large-scale graphs due to the quadratic complexity of all-to-all attention mechanisms. The proposed k-MIP attention addresses these issues by selecting the top-k relevant key nodes per query, resulting in a sparse yet flexible attention pattern that maintains linear memory complexity. The authors demonstrate that k-MIP attention can process graphs with over 500k nodes on a single A100 GPU, achieving speedups of up to ten-fold compared to full attention. The paper also provides a theoretical analysis confirming that k-MIP transformers can approximate any full-attention transformer to arbitrary precision, ensuring that expressive power is not compromised. Furthermore, the integration of k-MIP attention into the GraphGPS framework is explored, establishing an upper bound on its graph distinguishing capability. Empirical evaluations on various benchmarks, including the Long Range Graph Benchmark and City-Networks benchmark, show that k-MIP consistently ranks among the top-performing scalable graph transformers.
Methodology
The paper presents k-MIP attention, which dynamically selects the k most influential keys for each query using a top-k operation. This mechanism is combined with symbolic matrices to achieve linear memory complexity. The authors also conduct a theoretical analysis of the expressive power of k-MIP attention and its integration into the GraphGPS framework, alongside empirical evaluations on multiple graph benchmarks.
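The core mechanism is straightforward to sketch with a dense top-k; the paper's contribution is realizing it with linear memory on large graphs, so the version below is illustrative rather than their implementation. With k equal to the number of keys it reduces exactly to full attention, which mirrors the expressivity result:

```python
import numpy as np

def kmip_attention(Q, K, V, k):
    """Sketch of k-Maximum Inner Product attention: each query attends
    only to its k largest-inner-product keys."""
    scores = Q @ K.T                                    # (n, m) inner products
    top = np.argpartition(scores, -k, axis=1)[:, -k:]   # k best keys per query
    rows = np.arange(Q.shape[0])[:, None]
    s = scores[rows, top]
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax over the k keys
    return np.einsum("nk,nkd->nd", w, V[top])

def full_attention(Q, K, V):
    s = Q @ K.T
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 10, 4))  # 10 nodes, 4-d features

print(kmip_attention(Q, K, V, k=3).shape)
# With k equal to the number of keys, k-MIP recovers full attention:
print(np.allclose(kmip_attention(Q, K, V, k=10), full_attention(Q, K, V)))
```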
Results
The k-MIP attention mechanism allows for the processing of large graphs (over 500k nodes) on a single A100 GPU with up to ten-fold speedup compared to full attention. The theoretical analysis confirms that k-MIP transformers can approximate full-attention transformers to arbitrary precision, and empirical results show competitive performance across various benchmarks.
Implications
The k-MIP attention mechanism can significantly enhance the scalability and efficiency of graph transformers, making them more applicable to real-world large-scale graph data in various domains such as social networks, molecular biology, and recommendation systems.
Dynamic Free-Rider Detection in Federated Learning via Simulated Attack Patterns
Federated Learning
- Introduces S2-WEF, a novel method for detecting dynamic free-riders in Federated Learning.
- Demonstrates the limitations of existing WEF-defense methods against dynamic free-rider attacks.
- Combines simulation-based similarity scores with mutual deviation scores for improved detection accuracy.
- Validates the effectiveness of S2-WEF through extensive experiments on multiple datasets and attack types.
Summary
This paper addresses the challenge of detecting dynamic free-riders in Federated Learning (FL), where clients may initially contribute but later submit fake model parameters to benefit from the global model without actual training. The existing weight evolving frequency (WEF) defense struggles to identify these dynamic free-riders, particularly under global-model-mimicking attacks. The author proposes a novel detection method called S2-WEF, which simulates WEF patterns of potential attacks on the server side using previously broadcasted global models. S2-WEF identifies clients whose submitted WEF patterns resemble these simulated patterns. Additionally, it combines this simulation-based similarity score with a deviation score from mutual comparisons among submitted WEFs. This two-dimensional clustering approach allows for effective separation of benign and free-rider clients. The paper demonstrates that S2-WEF can dynamically detect clients transitioning into free-riding behavior without the need for proxy datasets or pre-training. Extensive experiments across three datasets and five attack types show that S2-WEF outperforms existing methods in robustness against free-rider attacks.
Methodology
The proposed S2-WEF method simulates weight evolving frequency (WEF) patterns of potential global-model-based attacks on the server side. It identifies clients by comparing their submitted WEF patterns to these simulated patterns. The method also computes a deviation score from mutual comparisons among submitted WEFs and utilizes two-dimensional clustering combined with threshold-based classification to distinguish between benign and free-rider clients.
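The two scores can be sketched as follows, with random vectors standing in for real WEF patterns and a simple argmax in place of the paper's two-dimensional clustering with thresholds (everything below is a toy assumption):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_clients(submitted_wefs, simulated_wef):
    """Toy version of S2-WEF's two per-client scores: similarity to a
    server-simulated attack WEF pattern, and deviation from the peers."""
    mean_wef = submitted_wefs.mean(axis=0)
    sims = np.array([cosine(w, simulated_wef) for w in submitted_wefs])
    devs = np.array([np.linalg.norm(w - mean_wef) for w in submitted_wefs])
    return sims, devs

rng = np.random.default_rng(0)
honest = rng.uniform(0.6, 1.0, size=(4, 20))       # hypothetical benign WEF patterns
simulated = np.linspace(1.0, 0.0, 20)              # hypothetical mimicking-attack pattern
attacker = simulated + 0.01 * rng.normal(size=20)  # free-rider tracks the simulation

wefs = np.vstack([honest, attacker])
sims, devs = score_clients(wefs, simulated)
flagged = int(np.argmax(sims))
print("flagged client:", flagged)  # the free-rider best matches the simulated pattern
```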
Results
The experiments conducted across three datasets and five different attack types demonstrate that S2-WEF achieves significantly higher robustness in detecting dynamic free-riders compared to existing detection methods. The results indicate that S2-WEF effectively identifies clients that switch from honest training to free-riding behavior.
Implications
The findings of this research have significant implications for the deployment of Federated Learning in real-world applications, particularly in cross-silo settings where clients may have competing interests. By improving the detection of free-riders, S2-WEF can enhance the reliability and fairness of collaborative model training, encouraging more organizations to participate in Federated Learning initiatives.
Choosing the Right Regularizer for Applied ML: Simulation Benchmarks of Popular Scikit-learn Regularization Frameworks
Theory
Optimization
Efficient ML
- Ridge, Lasso, and ElasticNet are nearly interchangeable for prediction accuracy with sufficient sample-to-feature ratios.
- Lasso's recall is fragile under multicollinearity, with significant performance degradation in challenging conditions.
- ElasticNet outperforms Lasso in high multicollinearity scenarios, maintaining higher recall rates.
- The paper provides a decision guide for selecting appropriate regularization frameworks based on feature space attributes.
Summary
This paper explores the evolution of regularization techniques in machine learning, from early methods like stepwise regression to contemporary approaches such as Bayesian methods and ℓ0-based regularization. The authors conduct an empirical evaluation of four prominent regularization frameworks (Ridge, Lasso, ElasticNet, and Post-Lasso OLS) across 134,400 simulations based on eight production-grade machine learning models. The study finds that when the sample-to-feature ratio is adequate (n/p ≥ 78), Ridge, Lasso, and ElasticNet yield similar prediction accuracies. However, Lasso's performance is notably sensitive to multicollinearity, exhibiting a significant drop in recall under high condition numbers and low signal-to-noise ratios, while ElasticNet remains robust. The authors recommend against using Lasso or Post-Lasso OLS in scenarios characterized by high multicollinearity and small sample sizes. The paper concludes with a decision guide to aid practitioners in selecting the most suitable regularization framework based on specific feature space characteristics.
Methodology
The authors simulate a diverse range of feature spaces to evaluate the performance of the four regularization frameworks. They focus on three criteria: feature recovery (F1 score), coefficient estimate accuracy (relative L2 error), and predictive performance (root mean squared error). A total of 134,400 simulations are conducted to ensure robust empirical findings.
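A toy version of one such benchmark cell, with a deliberately collinear design and support-recovery recall, might look like the following (hypothetical data and hyperparameters, nothing like the paper's 134,400-run grid; assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 200, 20

# Collinear design: the 5 true features come in near-duplicate pairs.
X = rng.normal(size=(n, p))
X[:, 5:10] = X[:, :5] + 0.01 * rng.normal(size=(n, 5))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + 0.5 * rng.normal(size=n)

def support_recall(coef, true_support, tol=1e-8):
    """Fraction of truly active features the model selects (nonzero coef)."""
    return float((np.abs(coef) > tol)[true_support].mean())

true_support = np.arange(5)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("lasso recall:", support_recall(lasso.coef_, true_support))
print("enet recall:", support_recall(enet.coef_, true_support))
```

Under this kind of collinearity, Lasso tends to pick only one member of each duplicated pair, which is the recall fragility the paper quantifies at scale.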
Results
The study reveals that Ridge, Lasso, and ElasticNet perform similarly in terms of prediction accuracy when the sample-to-feature ratio is high. However, Lasso's recall drops significantly under conditions of high multicollinearity, while ElasticNet maintains a high recall rate. The findings suggest that Lasso and Post-Lasso OLS should be avoided in high multicollinearity situations with small sample sizes.
Implications
The results of this study have practical implications for machine learning practitioners, guiding them in the selection of regularization techniques based on the characteristics of their datasets. The decision guide provided can help in optimizing model performance and avoiding pitfalls associated with specific regularization methods.