AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
Large Language Models
Efficient ML
NLP
- EdgeRazor integrates quantization and distillation for efficient LLM deployment.
- The framework introduces mixed-precision quantization for better resource allocation.
- Empirical results show significant performance improvements over existing quantization methods.
- EdgeRazor reduces storage requirements and accelerates decoding times.
Summary
The paper presents EdgeRazor, a novel framework designed to optimize the deployment of Large Language Models (LLMs) on resource-constrained devices through mixed-precision quantization-aware distillation. The authors identify the limitations of existing quantization methods, including Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and Quantization-Aware Distillation (QAD), particularly in terms of performance degradation and computational resource demands. EdgeRazor introduces three innovative modules: Mixed-Precision Quantization-Aware Distillation, which allows for fine-grained control of precision; Adaptive Feature Distillation, which selects the most informative layers for distillation; and Entropy-Aware KL Divergence, which balances the distillation process based on the teacher model's output distribution entropy. Empirical evaluations demonstrate that EdgeRazor achieves superior performance with a 1.88-bit precision, outperforming 3-bit precision competitors and leading 2-bit PTQ methods by a significant margin, while also reducing training costs. The framework effectively compresses models and accelerates decoding, making it a promising solution for deploying LLMs in environments with limited resources.
Methodology
EdgeRazor employs a mixed-precision quantization approach that combines three key modules: Mixed-Precision Quantization-Aware Distillation for precision control, Adaptive Feature Distillation for selecting informative layers, and Entropy-Aware KL Divergence to balance the distillation process based on the teacher model's output entropy. This methodology allows for efficient training and deployment of LLMs with lower bit-widths while maintaining performance.
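To make the entropy-aware distillation idea concrete, here is a minimal PyTorch sketch of an entropy-weighted KL distillation term. The weighting direction (down-weighting high-entropy teacher tokens) and the temperature handling are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_kl(student_logits, teacher_logits, temperature=2.0, eps=1e-8):
    """Sketch of an entropy-aware distillation loss: per-token KL(teacher || student)
    weighted by the (normalized) entropy of the teacher's output distribution."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)             # (batch, seq, vocab)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_logp = torch.log(teacher_probs + eps)

    # Per-token KL divergence between teacher and student distributions.
    kl = (teacher_probs * (teacher_logp - student_logp)).sum(dim=-1)  # (batch, seq)

    # Teacher entropy, normalized to [0, 1] by log(vocab_size).
    entropy = -(teacher_probs * teacher_logp).sum(dim=-1)
    entropy = entropy / torch.log(torch.tensor(float(teacher_logits.size(-1))))

    # Assumed weighting: down-weight uncertain (high-entropy) teacher tokens.
    weights = 1.0 - entropy
    return (weights * kl).mean() * (t ** 2)

# Toy usage with random logits.
s = torch.randn(2, 5, 100)
t_ = torch.randn(2, 5, 100)
print(entropy_weighted_kl(s, t_))
```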
Results
The framework demonstrated that a model quantized to 1.88 bits outperformed all 3-bit precision competitors and surpassed leading 2-bit PTQ methods by 11.3 points. Additionally, a 1.58-bit model reduced storage from 1.41 GB to 0.28 GB and accelerated decoding by 15.1 times compared to a 16-bit baseline, showcasing the effectiveness of the proposed methods.
Implications
EdgeRazor's lightweight framework has significant implications for deploying large language models in real-world applications where computational resources are limited. It enables more efficient use of hardware, potentially expanding the accessibility of advanced AI technologies in mobile and edge computing environments.
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
NLP
Large Language Models
Theory
- Benign fine-tuning can lead to a complete collapse of safety alignment in guard models.
- The phenomenon of safety geometry collapse is more severe in purpose-built guard models than in general-purpose LLMs.
- Fisher-Weighted Safety Subspace Regularization (FW-SSR) effectively restores safety alignment during fine-tuning.
- Structural representational geometry is a more reliable predictor of safety behavior than absolute displacement metrics.
Summary
This paper investigates the vulnerabilities of agentic guard models, which are designed to filter harmful content in AI systems, particularly when fine-tuned on benign data. The authors demonstrate that such fine-tuning can lead to a catastrophic loss of safety alignment, termed 'safety geometry collapse.' This phenomenon occurs when the latent safety boundary that distinguishes harmful from benign inputs is eroded, resulting in a guard model's inability to refuse harmful content. The study employs a latent geometry analysis protocol to reveal this collapse across three safety classifiers: LlamaGuard, WildGuard, and Granite Guardian. The authors propose a novel solution, Fisher-Weighted Safety Subspace Regularization (FW-SSR), which mitigates this issue by applying a training-time penalty that accounts for the curvature of safety subspaces. The results show that FW-SSR can recover significant refusal rates and improve safety without sacrificing domain performance, highlighting the importance of geometry-based monitoring in evaluating guard models.
Methodology
The authors conducted experiments on three safety classifiers to analyze the effects of benign fine-tuning on safety alignment. They employed Singular Value Decomposition (SVD) to extract per-layer safety subspaces and tracked the evolution of the harmful-benign representational boundary. The proposed FW-SSR method was developed to regularize safety-critical subspace directions during fine-tuning, utilizing curvature-aware direction weights derived from diagonal Fisher information.
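The core of FW-SSR can be pictured as a curvature-weighted penalty on weight movement along safety-subspace directions. The sketch below is a hypothetical form under that reading; the safety bases, diagonal Fisher weights, and single-layer structure are stand-ins, not the authors' implementation.

```python
import torch

def fw_ssr_penalty(params, ref_params, safety_bases, fisher_diags, lam=1.0):
    """Sketch of a Fisher-weighted safety-subspace penalty (hypothetical form):
    for each layer, penalize the component of the weight displacement that lies in
    the safety subspace, scaled by diagonal Fisher information of those directions."""
    penalty = 0.0
    for p, p0, U, f in zip(params, ref_params, safety_bases, fisher_diags):
        delta = (p - p0).reshape(-1)       # displacement from the safety-aligned model
        coords = U.T @ delta               # coordinates in the safety subspace (k,)
        penalty = penalty + (f * coords.pow(2)).sum()
    return lam * penalty

# Toy example: one "layer" with a 2-D safety subspace in a 10-D weight space.
p0 = torch.zeros(10)
p = torch.randn(10, requires_grad=True)
U, _ = torch.linalg.qr(torch.randn(10, 2))   # orthonormal safety basis (10, 2)
fisher = torch.rand(2)                        # diagonal Fisher weight per direction
loss = fw_ssr_penalty([p], [p0], [U], [fisher])
loss.backward()
print(loss.item())
```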
Results
The study found that Granite Guardian's refusal rate dropped from 85% to 0% after benign fine-tuning, indicating a complete safety geometry collapse. With the application of FW-SSR, the refusal rate was recovered to 75%, and the CKA score improved to 0.983. Additionally, WildGuard's Attack Success Rate was reduced to 3.6%, demonstrating the effectiveness of the proposed method in enhancing safety.
Implications
The findings suggest that guard models in agentic AI systems require careful monitoring and evaluation of their safety geometry, especially during specialization. The proposed FW-SSR method offers a promising approach to maintain safety alignment while allowing for domain-specific fine-tuning, which is crucial for the safe deployment of AI systems in sensitive applications.
FL-Sailer: Efficient and Privacy-Preserving Federated Learning for Scalable Single-Cell Epigenetic Data Analysis via Adaptive Sampling
Federated Learning
Efficient ML
Theory
- FL-Sailer is the first federated learning framework tailored for scATAC-seq data analysis.
- Adaptive leverage score sampling reduces dimensionality by 80% while preserving biological interpretability.
- The invariant VAE architecture effectively disentangles biological signals from technical confounders.
- FL-Sailer demonstrates superior performance compared to centralized methods in multi-institutional settings.
Summary
FL-Sailer introduces a novel federated learning framework specifically designed for single-cell ATAC-seq (scATAC-seq) data analysis, addressing significant challenges such as ultra-high dimensionality, extreme sparsity, and cross-institutional heterogeneity. The framework incorporates two main innovations: adaptive leverage score sampling, which reduces dimensionality by 80% while maintaining biological interpretability, and an invariant VAE architecture that disentangles biological signals from technical confounders through mutual information minimization. The authors provide a convergence guarantee, demonstrating that FL-Sailer can converge to an approximate solution of the original high-dimensional problem with bounded error. Extensive experiments on both synthetic and real epigenomic datasets show that FL-Sailer facilitates multi-institutional collaborations that were previously infeasible and outperforms centralized methods by effectively suppressing technical noise. This work highlights the potential of tailored federated learning approaches to enhance collaborative epigenomic research.
Methodology
FL-Sailer employs adaptive leverage score sampling to select biologically relevant features, significantly reducing dimensionality and communication costs. It utilizes an invariant VAE architecture to minimize mutual information between biological signals and technical confounders, ensuring robust analysis across heterogeneous datasets. The framework operates in a federated learning setting, where local models are trained on private data without transferring sensitive information.
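As a rough illustration of leverage-score feature sampling (one plausible variant), the sketch below scores features by their statistical leverage under a truncated SVD and samples proportionally; the rank, sampling fraction, and the adaptive update used in FL-Sailer are illustrative assumptions.

```python
import numpy as np

def leverage_score_feature_sampling(X, rank=20, keep_frac=0.2, rng=None):
    """Score each feature (column of the cell-by-peak matrix X) by its leverage
    under a rank-r SVD, then sample features proportionally to the scores."""
    rng = np.random.default_rng(rng)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    V_r = Vt[:rank].T                               # (n_features, rank)
    scores = (V_r ** 2).sum(axis=1)                 # leverage score per feature
    probs = scores / scores.sum()
    k = int(keep_frac * X.shape[1])
    keep = rng.choice(X.shape[1], size=k, replace=False, p=probs)
    return np.sort(keep), scores

# Toy scATAC-like matrix: 200 cells x 1000 sparse binary peaks.
rng = np.random.default_rng(0)
X = (rng.random((200, 1000)) < 0.05).astype(float)
kept, scores = leverage_score_feature_sampling(X, rank=10, keep_frac=0.2, rng=0)
print(kept.shape, scores.max())
```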
Results
The experiments conducted on synthetic and real epigenomic datasets reveal that FL-Sailer not only enables effective multi-institutional collaborations but also surpasses traditional centralized methods in performance. The adaptive sampling acts as an implicit regularizer, enhancing clustering performance and enabling meaningful biological discoveries despite the challenges posed by data sparsity and heterogeneity.
Implications
FL-Sailer paves the way for more effective and privacy-preserving collaborations in single-cell epigenomic research, potentially accelerating discoveries in precision medicine and personalized therapies. By addressing the barriers to data sharing and analysis, this framework can facilitate large-scale studies that integrate diverse datasets while ensuring compliance with privacy regulations.
ITBoost: Information-Theoretic Trust for Robust Boosting
Theory
Optimization
- ITBoost leverages an information-theoretic trust framework to address label noise in boosting.
- The method employs an MDL-based sample weighting mechanism that focuses on residual time-series characteristics.
- Theoretical analysis shows ITBoost has a tighter generalization error bound than standard GBDT under label noise.
- Empirical results indicate ITBoost outperforms existing boosting algorithms and deep tabular models in noisy settings.
Summary
The paper introduces ITBoost, a novel boosting method designed to enhance robustness against label noise in gradient boosting algorithms. Traditional boosting methods often prioritize samples with large gradients, which can lead to performance degradation when training labels are noisy. ITBoost addresses this issue by evaluating sample reliability based on the evolution of residuals across iterations rather than instantaneous errors. By applying the Minimum Description Length (MDL) principle, ITBoost measures the complexity of residual trajectories, down-weighting samples with erratic residual patterns deemed less trustworthy. The authors provide theoretical justification for ITBoost, demonstrating that it achieves a tighter generalization bound under label noise compared to standard Gradient Boosting Decision Trees (GBDT). Empirical evaluations on various tabular datasets show that ITBoost outperforms leading boosting algorithms and deep learning models in noisy environments while maintaining strong performance on clean data.
Methodology
ITBoost utilizes the Minimum Description Length principle to assess the complexity of residual trajectories over iterations. It computes a 'trust score' for each sample based on the sequential characteristics of its residuals, allowing the model to down-weight samples with high complexity and erratic patterns while emphasizing genuinely challenging samples.
Results
ITBoost demonstrated improved robustness in noisy environments compared to leading boosting algorithms and modern deep tabular models. It maintained competitive performance on clean datasets, validating its effectiveness in real-world applications where label noise is prevalent.
Implications
The findings suggest that ITBoost can be effectively applied in fields such as medicine and finance, where label noise is common, enhancing the reliability of predictive models in these critical areas.
An End-to-End Framework for Building Large Language Models for Software Operations
Large Language Models
Reinforcement Learning
NLP
- Introduction of OpsLLM, a domain-specific LLM for software operations.
- Implementation of a Human-in-the-Loop mechanism for high-quality data curation.
- Development of a domain process reward model (DPRM) to enhance RCA accuracy.
- Demonstrated significant performance improvements over existing LLMs in QA and RCA tasks.
Summary
This paper presents OpsLLM, a domain-specific large language model (LLM) designed to enhance software operations through efficient knowledge-based question answering (QA) and root cause analysis (RCA). The authors identify key challenges in existing LLM applications for software operations, including low-quality data, fragmented knowledge, and insufficient learning methodologies. To address these issues, the paper proposes an end-to-end workflow that incorporates a Human-in-the-Loop (HITL) mechanism for data curation, enabling the construction of high-quality fine-tuning datasets. The model undergoes supervised fine-tuning to establish a base model, followed by reinforcement learning with a novel domain process reward model (DPRM) to improve the accuracy and reliability of RCA tasks. Experimental results demonstrate that OpsLLM outperforms existing models, achieving accuracy improvements of 0.2% to 5.7% on QA tasks and 2.7% to 70.3% on RCA tasks, showcasing strong transferability across various operational scenarios. The authors also plan to open-source three versions of OpsLLM with varying parameter sizes and a fine-tuning dataset, contributing to the broader research community.
Methodology
The methodology involves a multi-stage learning framework that includes a Human-in-the-Loop mechanism for data quality control, supervised fine-tuning for model training, and reinforcement learning with a domain process reward model (DPRM) to optimize reasoning processes in RCA tasks. The authors also propose a multi-expert benchmark for evaluating QA capabilities.
Results
OpsLLM achieved accuracy improvements of 0.2% to 5.7% on QA tasks and 2.7% to 70.3% on RCA tasks compared to existing open-source and closed-source LLMs, demonstrating its effectiveness in learning and aligning with operational domain knowledge.
Implications
The development of OpsLLM has significant implications for automating and improving software operations, potentially reducing downtime and enhancing system reliability. Its open-source nature may foster further research and development in the application of LLMs in operational contexts.
Disentangling Shared and Task-Specific Representations from Multi-Modal Clinical Data
Multimodal
- Introduces Orthogonal Task Decomposition (OrthTD) for disentangling shared and task-specific representations.
- Utilizes a unified Transformer architecture for multimodal data fusion.
- Achieves superior performance in clinical outcome prediction compared to existing methods.
- Demonstrates significant improvements in identifying rare events within imbalanced datasets.
Summary
This paper addresses the challenge of effectively utilizing multi-modal clinical data for predicting multiple outcomes by proposing a novel multi-task learning framework called Orthogonal Task Decomposition (OrthTD). The authors highlight the limitations of existing approaches that either overly share representations across tasks or fail to adequately separate shared and task-specific information, leading to issues such as negative transfer and redundancy. OrthTD employs a unified Transformer architecture for multimodal fusion and introduces a geometric orthogonality constraint to disentangle shared representations from task-specific signals. The framework was evaluated on a cohort of 12,430 surgical patients, predicting four clinical outcomes. The results demonstrated that OrthTD achieved an average AUC of 87.5% and an average AUPRC of 37.2%, outperforming advanced tabular and multi-task methods, particularly in identifying rare events in imbalanced clinical data. These findings suggest that enforcing non-redundant representations can significantly enhance multi-outcome predictions in clinical settings.
Methodology
The OrthTD framework integrates multi-modal clinical data using a unified Transformer model. It decomposes the resulting patient representations into shared and task-specific subspaces while imposing an orthogonality constraint to minimize redundancy. This approach allows for structured multi-task learning, where each task can leverage both shared and unique information effectively.
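A minimal sketch of how a shared/task-specific decomposition with an orthogonality penalty could look in PyTorch; the projection heads, penalty form, and dimensions are illustrative assumptions rather than the OrthTD architecture.

```python
import torch
import torch.nn as nn

class OrthogonalDecomposition(nn.Module):
    """Project a fused patient embedding into one shared subspace and one subspace
    per task, and penalize overlap between shared and task-specific components."""
    def __init__(self, dim, n_tasks, sub_dim):
        super().__init__()
        self.shared = nn.Linear(dim, sub_dim)
        self.task_heads = nn.ModuleList([nn.Linear(dim, sub_dim) for _ in range(n_tasks)])

    def forward(self, h):
        z_shared = self.shared(h)                          # (batch, sub_dim)
        z_tasks = [head(h) for head in self.task_heads]    # list of (batch, sub_dim)
        # Orthogonality penalty: squared Frobenius norm of the cross-covariances.
        penalty = sum((z_shared.T @ z).pow(2).sum() for z in z_tasks) / h.size(0) ** 2
        return z_shared, z_tasks, penalty

# Toy usage: 32 patients, 128-dim fused embedding, 4 outcomes.
model = OrthogonalDecomposition(dim=128, n_tasks=4, sub_dim=32)
h = torch.randn(32, 128)
z_shared, z_tasks, penalty = model(h)
print(penalty.item())
```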
Results
OrthTD achieved an average AUC of 87.5% and an average AUPRC of 37.2% across four clinical outcomes in a cohort of 12,430 surgical patients. The framework consistently outperformed advanced tabular and multi-task learning methods, particularly excelling in the identification of rare clinical events.
Implications
The findings suggest that OrthTD can be a valuable tool for enhancing clinical decision-making by improving the accuracy of multi-outcome predictions from diverse clinical data sources. This could lead to better patient management and tailored treatment strategies in healthcare settings.
Replay-Based Continual Learning for Physics-Informed Neural Operators
Efficient ML
Theory
- Introduces a replay-based continual learning strategy for physics-informed neural operators.
- Utilizes a distillation-based constraint to preserve knowledge and mitigate catastrophic forgetting.
- Employs a PDE-based scoring strategy to focus on poorly performing samples for efficient training.
- Demonstrates improved adaptability to OOD data without requiring labeled datasets.
Summary
This paper addresses the challenge of performance degradation in neural operators when faced with out-of-distribution (OOD) data by introducing a replay-based continual learning strategy specifically for physics-informed neural operators, particularly those based on the Transolver architecture. The proposed method leverages a physics-informed approach that does not require labeled data, relying instead on input fields and physical constraints for training. To combat catastrophic forgetting, the method incorporates a distillation-based constraint that allows for the integration of a small number of past data points when new OOD data becomes available. Additionally, a transfer learning technique called LoRA is employed to facilitate rapid adaptation to new data. The framework is validated through systematic experiments on three physical problems: Darcy flow in fluid mechanics, a hyperelastic brain tumor problem in biomechanics, and a linear elastic Triply Periodic Minimal Surfaces problem in solid mechanics. The results indicate that the proposed method effectively mitigates catastrophic forgetting while maintaining adaptability to new data, outperforming conventional joint training strategies in terms of training efficiency, memory usage, and computational cost.
Methodology
The methodology involves a replay-based continual learning framework that integrates a distillation-based constraint for knowledge preservation and a PDE-based scoring strategy to selectively replay poorly performing samples. The approach is fully physics-informed, relying on input fields and physical constraints rather than labeled data, and utilizes transfer learning techniques for rapid adaptation to new data.
Results
The proposed method effectively reduces catastrophic forgetting and enhances adaptability to new OOD data across the three tested physical problems. Compared to traditional joint training methods, it shows significant improvements in training efficiency, reduced memory usage, and lower computational costs.
Implications
The findings suggest that continual learning strategies can significantly enhance the performance of physics-informed neural operators in real-world applications, particularly in scenarios where data distributions evolve over time. This approach could lead to more efficient and robust models for solving complex physical problems governed by partial differential equations.
Discovering Sparse Counterfactual Factors via Latent Adjustment for Survey-based Community Intervention
Optimization
Theory
Interpretability
- Introduces a framework for sparse, policy-feasible community interventions based on survey data.
- Utilizes a fixed-basis nonnegative latent representation for stable comparisons pre- and post-intervention.
- Employs Shapley attribution for identifying important latent factors relevant to intervention strategies.
- Combines optimal transport with weighted ℓ2,1 penalties to ensure sparsity in intervention adjustments.
Summary
This paper addresses the challenge of deriving actionable, sparse counterfactual interventions from transportation survey data. Traditional survey analyses often fall short of providing clear, policy-feasible strategies for community interventions. The authors propose a novel framework that formulates the problem as a distributional alignment task, where the aim is to adjust controllable survey variables to shift a target respondent group towards a desired reference group. The methodology employs a fixed-basis nonnegative latent representation to ensure comparability between pre- and post-intervention data. Key components include the identification of relevant latent factors through Shapley-guided attribution and the optimization of group-level adjustments using an entropy-regularized optimal transport approach. The framework is validated through experiments on real-world transportation datasets, demonstrating its ability to produce compact, interpretable interventions that enhance population-level conversion while maintaining sparsity in adjustments.
Methodology
The authors develop a conversion-by-alignment framework that encodes survey responses into a fixed-coordinate latent space using nonnegative matrix factorization (NMF). They identify target and reference groups through outcome-anchored clustering and utilize a logistic surrogate model alongside Shapley attribution to prioritize latent factors. The final intervention adjustments are learned by minimizing an entropy-regularized optimal transport discrepancy, promoting shared sparsity in the adjustments.
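The entropy-regularized optimal transport step can be illustrated with a bare-bones Sinkhorn iteration; this sketch covers only the transport discrepancy between groups, not the NMF encoding, the Shapley-guided factor selection, or the weighted ℓ2,1 sparsity penalty.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Minimal Sinkhorn iteration for entropy-regularized optimal transport:
    returns a coupling between the target group (weights a) and the reference
    group (weights b) under the given pairwise cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy example: 5 target respondents vs 4 reference respondents in a 3-D latent space.
rng = np.random.default_rng(0)
target = rng.random((5, 3))
reference = rng.random((4, 3)) + 0.5
cost = ((target[:, None, :] - reference[None, :, :]) ** 2).sum(-1)
P = sinkhorn(cost, np.full(5, 1 / 5), np.full(4, 1 / 4))
print(P.sum(), (P * cost).sum())   # total coupling mass (≈ 1) and transport cost
```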
Results
The proposed framework successfully generates compact and interpretable policy interventions that specify adjustment magnitudes. Experiments indicate improved population-level conversion rates and effective preservation of intervention sparsity, validating the approach's practical applicability in community interventions.
Implications
This research has significant implications for policymakers and transportation agencies, providing a structured methodology to derive actionable interventions from survey data. It enhances the ability to design targeted community interventions that can effectively shift public behavior towards desired outcomes, thereby improving transportation adoption and usage.
PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL
Time Series
- PRISM-CTG is the first foundation model for CTG analysis utilizing self-supervised learning.
- The model integrates multiple supervisory signals to enhance representation learning.
- It demonstrates significant performance improvements across various CTG tasks.
- PRISM-CTG shows strong generalization capabilities on external datasets.
Summary
The paper introduces PRISM-CTG, a self-supervised foundation model designed for cardiotocography (CTG) analysis, addressing the limitations of traditional supervised deep learning models that rely on narrowly curated labeled datasets. PRISM-CTG leverages a large-scale collection of unlabelled CTG recordings to learn transferable domain-level representations through a multi-view self-supervised learning framework. The model is pretrained using three complementary objectives: guided masked signal reconstruction, clinical variable prediction, and feature classification, each associated with specific task tokens to enhance representation learning. By incorporating patient metadata and domain knowledge as supervisory targets, PRISM-CTG effectively transforms underutilized clinical information into valuable learning signals. Extensive experiments across seven downstream CTG tasks demonstrate that PRISM-CTG consistently outperforms existing in-domain and self-supervised learning baselines, achieving strong generalization on external validation datasets and comparable performance to models trained on larger, labeled datasets. This study marks the first introduction of a large-scale foundation model for CTG that learns clinically relevant representations, showcasing the potential of self-supervised learning in medical signal analysis.
Methodology
PRISM-CTG employs a multi-view self-supervised learning framework that optimizes three pretext objectives: random-projected guided masked signal reconstruction, clinical variable prediction, and feature classification. Each objective is linked to task-specific tokens, allowing for specialized representation learning. The model also incorporates controlled cross-attention to facilitate information exchange across different clinical contexts.
Results
The model outperformed existing supervised and self-supervised learning baselines across seven downstream CTG tasks, achieving average performance improvements ranging from 4.93% to 19.31%. It also demonstrated strong generalization on two external datasets, with average improvements of 2.76% and 8.62%, despite limited intrapartum data during pretraining.
Implications
PRISM-CTG's ability to learn from unlabelled data and integrate clinical context has significant implications for automated CTG analysis, potentially improving diagnostic accuracy and decision-making in fetal health assessment. This approach may also be applicable to other medical signal analysis tasks where labeled data is scarce.
Proteo-R1: Reasoning Foundation Models for De Novo Protein Design
Generative Models
Large Language Models
Multimodal
- Proteo-R1 decouples molecular understanding from geometric generation, enhancing interpretability.
- The framework employs a dual-expert architecture combining a reasoning expert and a generation expert.
- Explicit residue-level decisions improve the incorporation of biochemical knowledge into the design process.
- Proteo-R1 allows for modular integration with various generative models, increasing flexibility.
Summary
The paper introduces Proteo-R1, a novel framework for de novo protein design that integrates reasoning with molecular generation. Traditional deep learning models in this field often lack deliberation, producing molecular structures without understanding the functional significance of residues. Proteo-R1 addresses this by employing a dual-expert architecture: a multimodal large language model (MLLM) acts as the understanding expert, identifying key functional residues, while a diffusion-based generation expert synthesizes molecular geometries based on these identified constraints. This separation of reasoning and generation mimics human expert practices in molecular design, enhancing interpretability and control. By operationalizing reasoning through explicit residue-level decisions, Proteo-R1 allows for the incorporation of biochemical knowledge and facilitates modularity, enabling the reasoning expert to guide various generative models. The framework achieves stable and interpretable results, marking a significant advancement in the field of protein design.
Methodology
Proteo-R1 utilizes a dual-expert framework where a multimodal large language model (MLLM) analyzes protein sequences and structures to identify critical residues. These residues are then used as hard constraints for a diffusion-based generative model, which creates molecular geometries while adhering to the specified interaction anchors. This method emphasizes explicit reasoning over implicit guidance during the generation process.
Results
The implementation of Proteo-R1 demonstrated stable and interpretable outcomes in protein design, with the ability to effectively integrate prior biochemical knowledge. The framework's modularity allows it to adapt to various generative models, showcasing its versatility and potential for broader applications in molecular engineering.
Implications
Proteo-R1 has the potential to revolutionize protein design by providing a more interpretable and controllable framework. Its ability to incorporate human reasoning and biochemical knowledge could lead to more efficient discovery pipelines in drug design, synthetic biology, and other areas of biotechnology.
Improving FMQA via Initial Training Data Design Considering Marginal Bit Coverage in One-Hot Encoding
Optimization
- Introduces a method for designing initial training data to ensure complete marginal bit coverage in FMQA.
- Proposes two sampling techniques, Latin Hypercube Sampling (LHS) and the Sobol’ sequence, for improved optimization.
- Demonstrates significant performance improvements in optimization tasks, especially with larger variable sets.
- Highlights the importance of initial training data design in the context of black-box optimization problems.
Summary
This paper addresses the limitations of the Factorization Machine with Quadratic-Optimization Annealing (FMQA) when applied to black-box optimization problems using one-hot encoding. The authors identify that uniform random initial sampling can lead to many binary variables remaining inactive, which prevents effective gradient updates during training. To overcome this issue, the authors propose a novel approach to design the initial training data to ensure complete marginal bit coverage, meaning every binary variable is activated at least once. They implement two space-filling sampling methods, Latin Hypercube Sampling (LHS) and Sobol’ sequence, resulting in LHS-FMQA and Sobol’-FMQA. The effectiveness of these methods is evaluated on a benchmark problem involving human-powered aircraft wing shape optimization with 17 and 32 design variables. The results demonstrate that both proposed methods significantly outperform the baseline FMQA, particularly in the 32-variable scenario, indicating that the initial training data design plays a crucial role in enhancing optimization performance.
Methodology
The authors developed two new sampling methods, LHS and Sobol’, to create initial training data that ensures every binary variable in one-hot encoding is activated at least once. This design aims to improve the gradient updates during the training of the FM model within the FMQA framework. The methods were tested on a benchmark optimization problem involving aircraft wing shapes.
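A small sketch of generating such space-filling initial designs with SciPy's quasi-Monte Carlo samplers. The variable counts and level discretization are illustrative, and a full implementation would adjust the design until every marginal bit is activated; this sketch only measures the resulting coverage.

```python
import numpy as np
from scipy.stats import qmc

def sample_one_hot_designs(n_vars=17, n_levels=4, n_samples=32, method="lhs", seed=0):
    """Draw points in [0, 1)^d with LHS or a Sobol' sequence, discretize each
    coordinate into categorical levels, and report marginal bit coverage."""
    if method == "lhs":
        sampler = qmc.LatinHypercube(d=n_vars, seed=seed)
    else:
        sampler = qmc.Sobol(d=n_vars, scramble=True, seed=seed)
    u = sampler.random(n_samples)                        # (n_samples, n_vars) in [0, 1)
    levels = np.minimum((u * n_levels).astype(int), n_levels - 1)

    # One-hot encode and check which bits are activated at least once.
    one_hot = np.zeros((n_samples, n_vars * n_levels), dtype=int)
    for i, row in enumerate(levels):
        one_hot[i, np.arange(n_vars) * n_levels + row] = 1
    coverage = one_hot.any(axis=0).mean()
    return one_hot, coverage

for m in ("lhs", "sobol"):
    _, cov = sample_one_hot_designs(method=m)
    print(m, f"marginal bit coverage: {cov:.2f}")
```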
Results
Both LHS-FMQA and Sobol’-FMQA achieved higher mean final cruising speeds compared to the baseline FMQA, with the advantage being more pronounced in the 32-variable optimization problem. This indicates that the proposed initial training data design significantly enhances the optimization capabilities of FMQA.
Implications
The findings suggest that careful design of initial training data can lead to better optimization outcomes in various applications, particularly in fields requiring efficient black-box optimization. This could have implications for engineering design, materials optimization, and other combinatorial optimization problems.
When Does Gene Regulatory Network Inference Break? A Controlled Diagnostic Study of Causal and Correlational Methods on Single-Cell Data
Graph Learning
- Causal methods for GRN inference often underperform compared to correlation-based methods in realistic benchmarks.
- A controlled diagnostic framework was developed to isolate and evaluate the impact of seven specific biological and technical pathologies.
- Causal methods show superiority in ideal conditions but are significantly hindered by dropout and latent confounders.
- An error-type decomposition reveals qualitatively different errors among methods with similar aggregate accuracy.
Summary
This paper addresses the puzzling observation that causal methods for Gene Regulatory Network (GRN) inference from single-cell RNA-seq data often do not outperform simpler correlation-based methods. The authors argue that existing benchmarks are insufficiently controlled, as they evaluate methods on data with multiple confounding pathologies. To tackle this issue, the authors introduce a controlled diagnostic framework that isolates seven biologically motivated pathologies: dropout, latent confounders, cell-type mixing, feedback loops, network density, sample size, and pseudotime drift. They conduct 6,120 controlled experiments to assess the performance of six representative methods across three inference paradigms. The findings reveal that while causal methods excel in clean conditions, specific pathologies, particularly dropout and latent confounders, can neutralize their advantages. The authors also introduce an error-type decomposition to analyze the qualitative differences in errors made by various methods. Their results indicate that the joint effects of multiple pathologies are sub-additive, providing new insights into the conditions under which different methods succeed or fail in GRN inference.
Methodology
The authors constructed a synthetic simulator based on a linear structural causal model to generate single-cell expression data with independently controllable pathologies. They evaluated six methods spanning correlational, tree-ensemble, and causal paradigms, using metrics such as Area Under the Precision–Recall Curve (AUPRC) and an error-type decomposition to analyze performance degradation.
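A toy generator in the spirit of that simulator, showing how a linear SCM over a random DAG plus a dropout mask can be composed; all parameters and the zero-inflation model are illustrative assumptions, not the paper's calibrated simulator.

```python
import numpy as np

def simulate_linear_scm(n_genes=20, n_cells=500, edge_prob=0.1,
                        dropout_rate=0.3, noise_std=0.5, seed=0):
    """Sample a random upper-triangular (acyclic) weighted gene-gene network,
    generate expression by ancestral sampling, then apply dropout as a
    zero-inflation mask."""
    rng = np.random.default_rng(seed)
    W = np.triu(rng.normal(0, 1, (n_genes, n_genes)), k=1)
    W *= rng.random((n_genes, n_genes)) < edge_prob       # sparse DAG adjacency

    X = np.zeros((n_cells, n_genes))
    for g in range(n_genes):                               # topological order = index order
        X[:, g] = X[:, :g] @ W[:g, g] + rng.normal(0, noise_std, n_cells)

    observed = X * (rng.random(X.shape) >= dropout_rate)   # dropout pathology
    return observed, (W != 0)

X, true_edges = simulate_linear_scm()
print(X.shape, int(true_edges.sum()), "true regulatory edges")
```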
Results
The study found that causal methods outperform correlation-based methods under clean conditions but are significantly affected by specific pathologies like dropout and latent confounders. The error-type decomposition indicated that different methods commit qualitatively distinct errors, and the joint effects of multiple pathologies were found to be sub-additive.
Implications
These findings provide actionable insights for the development of GRN inference methods and practical guidance for researchers in computational biology, emphasizing the need for controlled benchmarks to better understand method performance under various biological and technical challenges.
A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification
Graph Learning
Theory
Efficient ML
- Introduction of PALACE, an adaptive landmark kernel for point-cloud and graph classification.
- The method provides closed-form guarantees for distortion bounds and classification rates.
- Empirical results show PALACE outperforms existing methods on multiple benchmarks.
- Adaptive landmark placement significantly reduces computational budget compared to uniform grids.
Summary
This paper introduces PALACE (Persistence Adaptive-Landmark Analytic Classification Engine), a novel method for classifying point clouds and graphs using a closed-form adaptive landmark kernel. PALACE enhances the existing PLACE pipeline by adapting landmark placements based on the data distribution, which significantly improves classification performance while maintaining theoretical guarantees. The authors propose a self-contained cover-theoretic framework that provides four key closed-form guarantees related to distortion bounds, landmark weight optimization, classification rates, and per-prediction certificates. Empirical evaluations demonstrate that PALACE outperforms existing diagram-based methods on various benchmarks, achieving high accuracy rates while requiring fewer computational resources. The method's adaptability allows it to maintain performance even under increased domain complexity, showcasing its robustness and efficiency in practical applications.
Methodology
PALACE employs a data-adaptive configuration of landmarks placed using class-aware farthest-point sampling on training diagrams. It utilizes a summation embedding lifted into a Reproducing Kernel Hilbert Space (RKHS) via an additive landmark kernel. The method incorporates a small cross-validation tier for parameter selection, ensuring optimal performance while maintaining theoretical guarantees.
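The landmark-placement step can be illustrated with plain farthest-point sampling; the class-aware and persistence-diagram-specific details of PALACE are omitted here, so this is a sketch of the sampling routine only.

```python
import numpy as np

def farthest_point_landmarks(points, k, seed=0):
    """Greedy farthest-point sampling: each new landmark maximizes the distance
    to the landmarks already chosen, adapting placement to the data distribution."""
    rng = np.random.default_rng(seed)
    n = len(points)
    chosen = [rng.integers(n)]
    dists = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

# Toy usage: 12 landmarks over 500 points drawn from two clusters.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (250, 2)), rng.normal(5, 1, (250, 2))])
landmarks = farthest_point_landmarks(pts, k=12)
print(landmarks.shape)
```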
Results
PALACE achieved an accuracy of 91.3 ± 1.0% on the Orbit5k dataset, matching the performance of gradient-trained models like Persformer. It consistently outperformed competitors on COX2 and MUTAG datasets and remained competitive on DHFR. The method demonstrated robustness, maintaining 94% accuracy under domain inflation, while uniform grid methods collapsed to chance levels.
Implications
The development of PALACE has significant implications for machine learning applications involving point clouds and graphs, particularly in fields such as computer vision, bioinformatics, and material science. Its ability to provide certified predictions and adapt to data distributions can enhance the reliability and efficiency of classification tasks in these domains.
OCRR: A Benchmark for Online Correction Recovery under Distribution Shift
NLP
Efficient ML
Theory
- Introduction of OCRR, a benchmark for assessing online correction recovery in classification systems.
- Evaluation of nine baseline algorithms across multiple datasets and correction policies.
- Demonstration of the substrate's superior performance in recovering from errors compared to traditional methods.
- Highlighting the inadequacy of static benchmarks in capturing the dynamics of online learning and correction.
Summary
The paper introduces OCRR (Online Correction Recovery Rate), a novel benchmark designed to evaluate how well classification systems can recover from errors in real-time as they encounter distribution shifts. Unlike traditional static benchmarks that assess model performance at a fixed point in time, OCRR focuses on the recovery speed of models when corrections are applied during operation. The benchmark evaluates systems based on two metrics: novel-class accuracy and original-distribution accuracy, both measured against the number of corrections made. The author evaluates nine baseline algorithms from various families, including continual-learning methods and retrieval-augmented classifiers, against a proposed substrate that employs a hash-chained append-only structure with margin-band majority voting. The results demonstrate that the substrate outperforms existing methods significantly, achieving high accuracy in both novel-class recovery and retention of original-distribution accuracy, thus highlighting the importance of online correction capabilities in deployed systems.
Methodology
The methodology involves creating a streaming protocol for the OCRR benchmark that applies oracle or stochastic corrections to wrong predictions in real-time. The evaluation includes two axes of accuracy (novel-class and original-distribution) and tests various algorithms under different correction policies and storage constraints. The study conducts extensive experiments across two datasets (Banking77 and CLINC150) with multiple baseline systems to assess their performance.
Results
The substrate achieved a novel-class accuracy of 88.7% and an original-distribution accuracy of 95.4%, outperforming the next-best continual-learning baseline by 32.6 percentage points at the same memory budget. Additionally, it showed remarkable stability in classification accuracy (99%) even as retrieval performance degraded, indicating robustness against retrieval imperfections.
Implications
The findings suggest that online correction capabilities are critical for deployed machine learning systems, especially in dynamic environments where data distribution can shift. The OCRR benchmark provides a framework for future research to improve online learning methods and enhance the adaptability of classifiers in real-world applications.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
Reinforcement Learning
Generative Models
Robotics
- OGPO enables sample-efficient full-finetuning of GCPs using off-policy critic networks.
- The algorithm achieves state-of-the-art performance on complex manipulation tasks.
- OGPO can finetune poorly-initialized behavior cloning policies without expert data.
- An optimized variant, OGPO+, incorporates additional enhancements for improved performance.
Summary
This paper introduces Off-policy Generative Policy Optimization (OGPO), a novel algorithm designed for the sample-efficient finetuning of Generative Control Policies (GCPs) in robotic manipulation tasks. GCPs, which utilize expressive generative models like diffusion and flow models, have shown promise in robotic applications but face challenges in balancing sample efficiency and policy improvement. OGPO addresses these challenges by employing off-policy critic networks to maximize data reuse and propagate policy gradients through the entire generative process. The algorithm operates within a bi-level Markov Decision Process (MDP) framework, optimizing a nested inner denoising MDP and an outer environment dynamics MDP. This decoupled approach allows OGPO to leverage the cost asymmetry of sample collection, achieving both stable and expressive updates. The paper demonstrates that OGPO significantly outperforms existing methods in various manipulation tasks, including multi-task settings and high-precision control, and can effectively finetune poorly-initialized behavior cloning policies without requiring expert data. Additionally, an optimized variant, OGPO+, is proposed, which incorporates enhancements like Best-of-N planning and policy distillation, further improving performance while minimizing hyperparameter tuning.
Methodology
OGPO utilizes a bi-level MDP framework, where it performs off-policy Temporal Difference learning to optimize a Q function over expensive environment samples, while using on-policy reinforcement learning updates to extract policies from a denoising MDP over computationally cheap samples. This decoupled optimization approach enhances both sample efficiency and policy expressiveness.
Results
The empirical evaluations show that OGPO outperforms existing finetuning methods in various robotic manipulation tasks, achieving high success rates with minimal expert data and hyperparameter tuning. The OGPO+ variant further enhances performance through advanced planning and policy distillation techniques.
Implications
The findings suggest that OGPO could revolutionize the way robotic policies are finetuned, enabling more robust and adaptable robotic systems that require less human intervention and data collection. This could lead to broader applications in autonomous robotics and real-world deployment scenarios.
Using Common Random Numbers for Simulation-based Planning with Rollouts
Reinforcement Learning
Theory
Optimization
- Introduction of a new estimator for value difference in simulation-based planning.
- Demonstration of variance reduction using common random numbers in rollouts.
- Validation of the proposed method through experiments on synthetic tasks.
- Application of the method in real-world scenarios, including pension disbursement and game planning.
Summary
This paper investigates the use of common random numbers in simulation-based planning with rollouts, a technique commonly employed for decision-making in stochastic environments. The authors propose a new estimator that reduces variance in the estimation of relative utility when simulations utilize a rollout policy beyond a certain depth. By leveraging common random numbers, the proposed method enhances the accuracy of utility estimates, leading to improved task performance. The paper presents experimental results on synthetic tasks that validate the effectiveness of the proposed approach. Additionally, the authors demonstrate the practical significance of their innovation through two applications: a single-step lookahead planning task in pension disbursement and the deployment of the Upper Confidence bounds for Trees (UCT) algorithm in the game of Ludo. The findings suggest that the new estimator can outperform traditional estimators that rely on full independence or full dependence, which often exhibit higher variance in certain tasks.
Methodology
The authors analyze the statistical properties of different estimators for value differences between policies in a finite-horizon Markov Decision Problem (MDP). They propose a new estimator that utilizes common random numbers to reduce variance in utility estimates. The methodology includes theoretical derivations and empirical validation through experiments on synthetic tasks and practical applications.
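The variance-reduction mechanism is easy to see in a toy example: when two policies' rollouts reuse the same random numbers, the shared noise cancels in their difference. The one-step environment below is a stand-in, not one of the paper's tasks.

```python
import numpy as np

def rollout_return(policy_shift, noise):
    """Toy one-step 'rollout': the return is the policy parameter plus environment
    noise, so the true value difference between shifts 0.1 and 0.0 is exactly 0.1."""
    return policy_shift + noise

def estimate_value_difference(n=10_000, common=True, seed=0):
    rng = np.random.default_rng(seed)
    noise_a = rng.normal(0, 1, n)
    noise_b = noise_a if common else rng.normal(0, 1, n)   # CRN reuses the same noise
    diffs = rollout_return(0.1, noise_a) - rollout_return(0.0, noise_b)
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(n)

for common in (False, True):
    mean, se = estimate_value_difference(common=common)
    print(f"common={common}: estimate={mean:+.4f}, std. error={se:.4f}")
```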
Results
The experiments confirm that the proposed estimator significantly improves task performance compared to traditional estimators. The application of the method in the pension disbursement task and the UCT algorithm for Ludo demonstrates its practical utility and effectiveness in real-world scenarios.
Implications
The findings suggest that using common random numbers can enhance the performance of simulation-based planning methods, making them more reliable for decision-making in stochastic environments. This has potential applications in various fields, including finance, gaming, and robotics, where effective planning under uncertainty is crucial.
Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies
Reinforcement Learning
Interpretability
- Introduction of State Vector Space Partitioning (SVSP) for distilling RL policies.
- SVSP achieves a 7.4% improvement in mean return over Voronoi State Partitioning (VSP).
- Reduction of required sub-policies by 82.1% compared to VSP.
- Validation on LunarLanderContinuous shows SVSP outperforms both TD3 and VSP.
Summary
This paper introduces State Vector Space Partitioning (SVSP), a novel approach designed to mimic black-box reinforcement learning (RL) policies through the use of human-interpretable sub-policies. The authors propose a method that partitions a dataset of state-action pairs using linear support vector machine (SVM) splits, resulting in a compact and structured representation of the original policy. SVSP demonstrates significant improvements in performance, achieving a mean return increase of 7.4% over previous critic-driven state partitioning methods, such as Voronoi State Partitioning (VSP), and a 2.8% improvement over the original TD3 policy. Additionally, SVSP reduces the number of required sub-policies by 82.1% compared to VSP. The methodology leverages theoretical insights from SVMs to create decision boundaries that effectively capture the behavior of the original policy while maintaining interpretability. The validation of SVSP is conducted using the LunarLanderContinuous benchmark, where it outperforms both the original TD3 agent and VSP in terms of return and complexity. These findings suggest that SVSP can facilitate a more flexible form of policy distillation, allowing for the selection of decision boundaries and surrogate models that closely align with the original black-box behavior.
Methodology
The SVSP method involves hierarchical partitioning of the state space into regions where sub-policies can approximate the original RL policy. A binary SVM classifier is used to determine the decision boundaries based on the performance of the sub-policies, guided by a critic that evaluates the expected return of actions in each state. The process iteratively refines the partitioning until a maximum depth is reached.
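A deliberately simplified sketch of hierarchical SVM-based state partitioning; it replaces the critic-guided split criterion with action clustering and uses mean-action leaves, so it should be read as an illustration of the partitioning mechanics only, not the SVSP algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_partition(states, actions, depth, max_depth=3, min_leaf=50):
    """Cluster actions into two groups, learn a linear SVM boundary in state space
    separating them, and recurse; leaves store a mean-action sub-policy."""
    if depth >= max_depth or len(states) < 2 * min_leaf:
        return {"leaf": True, "action": actions.mean(axis=0)}
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(actions)
    svm = LinearSVC(max_iter=5000).fit(states, labels)
    side = svm.predict(states).astype(bool)
    if side.all() or (~side).all():
        return {"leaf": True, "action": actions.mean(axis=0)}
    return {"leaf": False, "svm": svm,
            "left": build_partition(states[~side], actions[~side], depth + 1, max_depth, min_leaf),
            "right": build_partition(states[side], actions[side], depth + 1, max_depth, min_leaf)}

def act(node, state):
    while not node["leaf"]:
        node = node["right"] if node["svm"].predict(state[None])[0] else node["left"]
    return node["action"]

# Toy data standing in for (state, action) pairs logged from a black-box policy.
rng = np.random.default_rng(0)
S = rng.normal(size=(2000, 8))
A = np.tanh(S[:, :2]) + 0.05 * rng.normal(size=(2000, 2))
tree = build_partition(S, A, depth=0)
print(act(tree, S[0]))
```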
Results
In experiments on the LunarLanderContinuous benchmark, SVSP achieved a mean return of 166.3, surpassing the original TD3's 161.8 and VSP's 154.8. SVSP utilized only 10 sub-policies compared to VSP's 55.9, demonstrating both higher performance and reduced complexity.
Implications
The findings suggest that SVSP can be utilized in various applications requiring interpretable reinforcement learning policies, such as robotics and automated decision-making systems, where understanding the decision process is crucial.
Quadrature-TreeSHAP: Depth-Independent TreeSHAP and Shapley Interactions
Interpretability
Efficient ML
Theory
- Introduces a quadrature-based reformulation of Path-Dependent TreeSHAP.
- Achieves depth-independent computation of Shapley values and higher-order interactions.
- Demonstrates significant speed improvements over existing TreeSHAP methods.
- Provides a stable and efficient implementation for both CPU and GPU.
Summary
The paper introduces Quadrature-TreeSHAP, a novel method for computing Shapley values and Shapley interaction values in decision tree ensembles. Traditional methods, particularly Path-Dependent SHAP, face challenges related to depth-dependent runtime, numerical stability, and support for higher-order interactions. Quadrature-TreeSHAP addresses these issues by reformulating the computation using a weighted-Banzhaf interaction polynomial and integrating it over a feature participation probability using Gauss–Legendre quadrature. This approach achieves numerical stability and is practically insensitive to tree depth, allowing for efficient computation on both CPU and GPU. The authors demonstrate that their method significantly outperforms existing techniques in terms of speed and stability across various benchmarks, making it a valuable tool for interpreting machine learning models.
Methodology
The authors developed Quadrature-TreeSHAP by expressing Shapley values as integrals of a weighted Banzhaf polynomial, evaluated using Gauss–Legendre quadrature. They empirically validated the number of quadrature points required for achieving machine precision, finding that only 8 points are sufficient. The method exploits SIMD for efficient computation and is implemented in C++ with GPU support.
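The quadrature step itself is standard and easy to illustrate. The snippet below integrates a low-degree polynomial over [0, 1] with an 8-point Gauss–Legendre rule, which is exact for the polynomial degrees involved; the weighted-Banzhaf interaction polynomial and tree traversal of the actual method are not reproduced here.

```python
import numpy as np

def integrate_unit_interval(poly_coeffs, n_points=8):
    """Gauss-Legendre quadrature of a polynomial over [0, 1]. With n nodes the
    rule is exact for degree <= 2n - 1, which is why a handful of quadrature
    points suffices for machine-precision integration of such polynomials."""
    nodes, weights = np.polynomial.legendre.leggauss(n_points)   # nodes on [-1, 1]
    x = 0.5 * (nodes + 1.0)                                      # map to [0, 1]
    values = np.polyval(poly_coeffs, x)
    return 0.5 * np.dot(weights, values)

# Example: integrate p(q) = q^3 (1 - q)^2 over [0, 1]; the exact value is 1/60.
coeffs = np.polymul([1, 0, 0, 0], np.polymul([-1, 1], [-1, 1]))  # q^3 * (1 - q)^2
print(integrate_unit_interval(coeffs), 1 / 60)
```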
Results
Quadrature-TreeSHAP computes Shapley values 1.06×–10.59× faster than TreeSHAP on CPU and 1.84×–6.95× faster than GPUTreeSHAP on GPU. For Shapley pairwise interactions, speedups of 3.80×–58.11× were observed, with higher-order interactions achieving up to 1200× speedup compared to TreeSHAP-IQ. The method also demonstrated greater numerical stability than existing approaches.
Implications
The development of Quadrature-TreeSHAP has significant implications for the interpretability of machine learning models, particularly in fields that rely on decision tree ensembles. Its efficiency and stability make it suitable for real-time applications in finance, healthcare, and other domains where model explanations are critical.
Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts
Multimodal
- S3 framework decomposes multimodal inputs into semantic experts for improved task-specific representation.
- The methodology includes three stages: Specialization, Selection, and Sparsification.
- S3 demonstrates superior performance on MultiBench benchmarks compared to existing multimodal learning methods.
- A reverse U-shaped trend in performance indicates optimal sparsity levels enhance accuracy.
Summary
The paper introduces S3 (Specialization, Selection, Sparsification), a novel framework for multimodal representation learning (MMRL) that emphasizes a structural approach to handling multimodal data. Unlike traditional methods that encode all signals into a fixed embedding, S3 decomposes multimodal inputs into semantic experts, allowing for selective routing based on task requirements. The framework consists of three stages: Specialization, which creates concept-level experts in a shared latent space; Selection, which adapts routing to activate task-relevant experts; and Sparsification, which prunes low-utility paths to produce compact representations. The authors demonstrate that S3 outperforms existing methods across four MultiBench benchmarks, revealing a reverse U-shaped trend in performance relative to sparsity, with optimal accuracy achieved at intermediate levels of sparsity. This suggests that structuring multimodal representations as selectable semantic components enhances efficiency and control over information use, providing a viable alternative to contrastive learning and InfoMax-driven approaches.
Methodology
The S3 framework employs a structural perspective on multimodal representation learning by decomposing inputs into interpretable semantic units. It utilizes a Mixture-of-Experts approach to create specialized components that adaptively align with task demands. The framework's three stages—Specialization, Selection, and Sparsification—allow for dynamic routing and pruning of low-utility paths to yield compact representations.
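A minimal top-k expert-routing sketch in PyTorch capturing the Selection/Sparsification idea of activating only a few experts per input; the expert and router definitions are illustrative stand-ins, not the S3 modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseExpertRouter(nn.Module):
    """A router scores semantic experts for each input, only the top-k experts are
    activated, and their outputs are combined with renormalized gate weights."""
    def __init__(self, dim, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                                    # x: (batch, dim)
        gate_logits = self.router(x)
        top_vals, top_idx = gate_logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)                  # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(16, 64)
print(SparseExpertRouter(64)(x).shape)
```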
Results
S3 consistently outperformed prior multimodal representation learning methods across four MultiBench benchmarks. The experiments revealed a reverse U-shaped trend in performance, with peak accuracy at intermediate levels of sparsity, indicating that the model effectively suppresses task-irrelevant noise while retaining essential signals.
Implications
The findings suggest that structuring multimodal representations as selectable semantic components can lead to more efficient and interpretable models. This approach has potential applications in various domains where multimodal data integration is crucial, such as computer vision, natural language processing, and robotics.
RFPrompt: Prompt-Based Expert Adaptation of the Large Wireless Model for Modulation Classification
Efficient ML
Multimodal
- RFPrompt offers a parameter-efficient adaptation mechanism for wireless foundation models to handle OOD tasks.
- The framework utilizes learnable prompt tokens to adapt a frozen pretrained backbone, minimizing parameter overhead.
- Empirical results show significant improvements in robustness and performance for real-world IQ classification tasks.
- RFPrompt effectively closes over 79% of the performance gap compared to fully fine-tuned models using only 0.34% of the parameters.
Summary
The paper presents RFPrompt, a novel framework for automatic modulation classification (AMC) that addresses the challenges of adapting large wireless models to out-of-distribution (OOD) tasks. Traditional deep learning approaches for AMC struggle with robustness due to distribution shifts caused by hardware impairments and varying propagation environments. RFPrompt introduces a parameter-efficient method of adaptation using learnable deep prompt tokens while keeping the pretrained backbone of the Large Wireless Model (LWM) frozen. This approach allows for task-specific adaptation with minimal additional parameters. The authors evaluate RFPrompt on the LWM, a mixture-of-experts model pre-trained on diverse wireless data, and demonstrate its effectiveness in both standard and OOD modulation classification scenarios. The results indicate that RFPrompt significantly enhances robustness and performance, particularly in real-world settings with limited supervision, outperforming conventional baselines and closing the performance gap with fine-tuned models while using only a fraction of the parameters.
Methodology
The authors propose RFPrompt, which employs prompt-based adaptation by introducing learnable tokens that guide the frozen LWM backbone during the classification task. This method allows for efficient specialization without modifying the pretrained model's weights, focusing on enhancing task-specific discriminability through the prompts.
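A minimal sketch of prompt-based adaptation of a frozen encoder: only the prompt tokens and a small classification head receive gradients. The stand-in backbone, token shapes, and pooling are assumptions for illustration, not the LWM interface.

```python
import torch
import torch.nn as nn

class PromptAdapter(nn.Module):
    """Prepend a handful of learnable tokens to the sequence embeddings, run the
    frozen pretrained encoder unchanged, and train only the prompts plus a head."""
    def __init__(self, backbone, embed_dim, n_prompts=8, n_classes=11):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                     # keep pretrained weights frozen
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, tokens):                           # tokens: (batch, seq, dim)
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        x = torch.cat([prompts, tokens], dim=1)
        h = self.backbone(x)
        return self.head(h[:, : self.prompts.size(1)].mean(dim=1))

# Stand-in backbone; the real framework would load the pretrained LWM instead.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
model = PromptAdapter(backbone, embed_dim=64)
logits = model(torch.randn(8, 128, 64))
print(logits.shape, sum(p.numel() for p in model.parameters() if p.requires_grad))
```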
Results
RFPrompt consistently outperformed conventional baselines in modulation classification tasks, particularly under OOD conditions. It closed over 79% of the performance gap to fully fine-tuned models while using only 0.34% of the parameters, demonstrating significant robustness and efficiency.
Implications
The findings suggest that prompt-based learning can be a practical and effective strategy for adapting large wireless models to challenging real-world RF environments, potentially improving the deployment of AMC systems in cognitive radio and spectrum monitoring applications.
Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning
Reinforcement Learning
Theory
- Introduces a unified framework for distributional regret in MAB and RL.
- Presents a novel algorithm (EQO+) with a flexible exploration bonus.
- Establishes both gap-independent and gap-dependent distributional regret bounds.
- Achieves optimal trade-offs between expected and distributional regret.
Read more
Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning
Summary
This paper presents a unified framework for analyzing distributional regret in stochastic multi-armed bandits (MAB) and episodic reinforcement learning (RL). The authors formalize a distributional regret bound that provides probabilistic guarantees uniformly across all confidence levels δ ∈ (0, 1]. They introduce a UCBVI-style algorithm, EQO+, which incorporates an exploration bonus based on visit counts and user-specified parameters. The paper derives both gap-independent and gap-dependent distributional regret bounds, elucidating how these parameters influence the trade-offs between expected performance, tail risk, and instance-dependent behavior. Notably, the authors achieve optimal trade-offs in both minimax and instance-dependent settings. For MAB with A arms and horizon T, they establish a distributional regret bound of order O(√AT log(1/δ)), confirming a conjecture from prior work. This work bridges gaps in the understanding of distributional behavior in both MAB and RL, providing a comprehensive analysis that extends previous results.
Methodology
The authors develop a UCBVI-style algorithm that incorporates a bonus term based on the visit count of state-action pairs. They analyze the distribution of cumulative regret and derive bounds that are uniform across all confidence levels. The methodology includes deriving both gap-independent and gap-dependent bounds, and introducing a regularity assumption to generalize existing noise models.
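As a toy illustration of a count-based exploration bonus parameterized by a user-chosen confidence level δ, the following bandit sketch uses a generic Hoeffding-style term; it is not the EQO+ bonus, whose exact form the summary does not specify.

```python
import numpy as np

def run_ucb(means, T=5000, delta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    A = len(means)
    counts, sums = np.zeros(A), np.zeros(A)
    regret = 0.0
    for t in range(T):
        if t < A:
            a = t                                              # pull each arm once
        else:
            bonus = np.sqrt(np.log(1.0 / delta) / counts)      # count-based exploration bonus
            a = int(np.argmax(sums / counts + bonus))
        counts[a] += 1
        sums[a] += rng.normal(means[a], 1.0)
        regret += max(means) - means[a]
    return regret

print(run_ucb(np.array([0.2, 0.5, 0.45]), delta=0.05))          # cumulative pseudo-regret
print(run_ucb(np.array([0.2, 0.5, 0.45]), delta=0.50))          # smaller bonus, less exploration
```

Smaller δ inflates the bonus and buys stronger tail guarantees at the price of extra exploration, which is the trade-off the paper's parameters expose.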
Results
The paper establishes a distributional regret bound of O(√AT log(1/δ)) for MAB, matching minimax lower bounds. It also provides expected regret bounds of O(√AT) and demonstrates that the proposed framework achieves optimal trade-offs in various settings, including both minimax and instance-dependent regimes.
Implications
This work has significant implications for the design of algorithms in online decision-making scenarios, particularly in MAB and RL contexts. The unified framework allows for better understanding and control of the distributional properties of regret, which can lead to more robust and efficient algorithms in practice.
Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning
Generative Models
- DiffICL is the first approach to frame tabular data generation as an in-context learning problem.
- The method effectively mitigates the memorization issue prevalent in small-data regimes, enhancing both data quality and privacy.
- DiffICL outperforms existing generative models across 14 datasets in terms of quality and privacy protection.
- The synthetic data generated can be used for data augmentation, improving downstream task performance.
Read more
Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning
Summary
This paper addresses the challenge of generating high-quality tabular data while maintaining privacy, particularly in scenarios with limited training data. The authors identify a significant tradeoff between data quality and privacy in existing generative models, where enhancing data quality often leads to increased memorization of training samples, thereby compromising privacy. To overcome this issue, they propose DiffICL, a novel approach that formulates tabular data generation as an in-context learning (ICL) problem. By leveraging pretrained structural priors from a diverse set of datasets, DiffICL enables the model to infer data distributions based on limited context rather than memorizing individual samples. The authors evaluate DiffICL on 14 real-world datasets, demonstrating that it significantly improves both data quality and privacy protection compared to existing generative methods. Additionally, the synthetic data generated by DiffICL serves as effective data augmentation, enhancing performance in downstream tasks. The findings suggest that the quality-privacy tradeoff can be effectively managed through innovative training paradigms.
Methodology
DiffICL employs in-context learning by pretraining on a large collection of datasets to learn structural priors. It partitions datasets into context and query sets, training the model to generate query samples conditioned on context samples. This approach reduces the incentive for memorization and enhances the model's ability to infer the underlying data distribution. A dual-axis attention Transformer is utilized for conditional generation over latent embeddings, accommodating the heterogeneity of tabular data.
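A minimal sketch of the context/query partitioning, with names and sizes chosen for illustration rather than taken from the paper:

```python
import numpy as np

def make_icl_batches(tables, n_context=64, n_query=32, rng=np.random.default_rng(0)):
    """tables: list of (n_rows, n_features) arrays drawn from different source datasets."""
    for X in tables:
        idx = rng.permutation(len(X))
        ctx, qry = X[idx[:n_context]], X[idx[n_context:n_context + n_query]]
        # A conditional generator would be trained to produce `qry` given `ctx`,
        # e.g. loss = generator.nll(query=qry, context=ctx); here we only yield the split.
        yield ctx, qry

tables = [np.random.default_rng(i).normal(size=(500, 8)) for i in range(3)]
for ctx, qry in make_icl_batches(tables):
    print(ctx.shape, qry.shape)                           # (64, 8) (32, 8)
```

Because the model never sees the query rows except as targets conditioned on the context, copying individual training samples is a poor strategy, which is the mechanism claimed to ease the quality-privacy tension.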
Results
The evaluation of DiffICL on 14 real-world datasets shows marked improvements in both data quality and privacy protection compared to traditional generative models like CTGAN, TVAE, and tabDDPM. The results indicate that DiffICL establishes a superior quality-privacy Pareto frontier, demonstrating its effectiveness in generating synthetic data that is both high-quality and privacy-preserving.
Implications
The findings of this study have significant implications for fields that rely on tabular data, such as healthcare and finance, where privacy concerns are paramount. By providing a method that balances data quality and privacy, DiffICL can facilitate safer data sharing and enhance the utility of synthetic data in various applications.
Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing
Reinforcement Learning
Robotics
Optimization
- DD-SRad achieves hard per-step constraint satisfaction with probability 1.
- The method provides exact ℓ∞ coverage of the feasible action space, addressing the limitations of existing spherical parameterization methods.
- Empirical results show significant improvements in task performance and constraint adherence compared to traditional methods.
- The approach is compatible with existing off-policy RL frameworks, enabling seamless integration.
Read more
Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing
Summary
This paper addresses the challenge of deploying reinforcement learning (RL) policies on physical robots, specifically focusing on actuator rate constraints that limit how quickly each joint can move. Existing methods inadequately handle the heterogeneity of these constraints across different joints, which can lead to significant performance issues. The authors propose a novel approach called Dynamic Decoupled Spherical Radial Squashing (DD-SRad), which computes a position-adaptive radius for each actuator independently. This method ensures that the feasible action space aligns closely with the actual constraints, allowing for hard constraint satisfaction at every control step. DD-SRad not only preserves well-conditioned gradients during training but also allows for exact policy gradient backpropagation without runtime solver overhead. Experimental results on MuJoCo benchmarks demonstrate that DD-SRad achieves the highest task return without any constraint violations, outperforming spherical baselines by 30% to 50% in constraint-space coverage. Additionally, simulations with Unitree H1 and G1 humanoid robots validate the approach, confirming its effectiveness in real-world applications.
Methodology
The authors introduce DD-SRad, which utilizes a smooth analytic action parameterization that computes a position-adaptive effective radius for each action dimension. This allows for precise alignment with the per-joint feasible region, overcoming the limitations of isotropic spherical constraints. The method supports exact backpropagation and integrates into existing RL architectures without requiring additional runtime solvers.
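The constraint-by-construction idea can be sketched as a per-joint squashing whose radius depends on the current position; the radius rule below (shrinking near position limits) is an illustrative assumption, not DD-SRad's exact formula.

```python
import numpy as np

def rate_command(u_raw, q, q_min, q_max, rate_max, dt=0.01):
    """u_raw: unbounded policy output; q: joint positions; returns a per-joint rate command."""
    # effective radius per joint: never above the rate limit, and never enough to
    # overshoot a position limit within one control step
    r_up = np.minimum(rate_max, (q_max - q) / dt)
    r_dn = np.minimum(rate_max, (q - q_min) / dt)
    s = np.tanh(u_raw)                                   # smooth squashing keeps gradients usable
    return np.where(s >= 0, s * r_up, s * r_dn)

q = np.array([0.1, 1.4])
dq = rate_command(np.array([2.0, 3.0]), q, q_min=-1.5, q_max=1.5, rate_max=2.0)
print(dq, bool(np.all(np.abs(dq) <= 2.0)))               # rates respect per-joint limits
```

Because the squashing is analytic, gradients flow through it directly, which is the property that lets the approach avoid a runtime constraint solver.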
Results
DD-SRad demonstrated superior performance on MuJoCo benchmarks, achieving the highest task returns at zero constraint violations. The method showed a 30% to 50% improvement in constraint-space coverage compared to spherical baselines. Additionally, it successfully transferred to real-world simulations with Unitree humanoid robots, validating its effectiveness in practical applications.
Implications
The findings suggest that DD-SRad can significantly enhance the deployment of RL policies in robotics, particularly in scenarios where actuator rate constraints are critical. This could lead to safer and more efficient robotic control systems in various applications, including industrial automation and humanoid robotics.
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
Computer Vision
NLP
Multimodal
- Introduction of TC-JEPA, which enhances I-JEPA with text conditioning for better semantic representation.
- Utilization of image captions to reduce prediction uncertainty in masked feature prediction.
- Demonstrated improvements in downstream performance, training stability, and scalability.
- Establishment of a new vision-language pretraining paradigm based on feature prediction, outperforming contrastive methods.
Read more
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
Summary
This paper introduces Text-Conditional JEPA (TC-JEPA), an enhancement of the Image-based Joint-Embedding Predictive Architecture (I-JEPA) aimed at improving visual self-supervised learning through masked feature prediction. The authors identify that the inherent visual uncertainty at masked positions complicates feature prediction, which can hinder the learning of semantic representations. TC-JEPA addresses this issue by incorporating image captions as a fine-grained text conditioner that utilizes sparse cross-attention over input text tokens to modulate predicted patch features. This approach allows patch features to be more predictable and semantically meaningful, as they are conditioned on relevant text. The authors demonstrate that TC-JEPA not only improves downstream performance and training stability but also exhibits promising scalability. Furthermore, TC-JEPA establishes a novel vision-language pretraining paradigm based solely on feature prediction, outperforming traditional contrastive methods in various tasks, particularly those requiring fine-grained visual understanding and reasoning. The paper emphasizes the importance of fine-grained text conditioning in enhancing the predictive power of visual representations, thus contributing to the broader field of self-supervised learning.
Methodology
TC-JEPA employs a fine-grained text conditioner that uses cross-attention over text tokens to modulate the predicted patch features at multiple layers of the predictor. This method aims to reduce prediction uncertainty by leveraging human- or LMM-generated image captions, allowing for a more semantically aligned representation learning process.
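A minimal PyTorch sketch of the cross-attention conditioning step, with a single attention layer and illustrative dimensions standing in for the paper's multi-layer design:

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats, text_tokens):
        # patch_feats: (B, N_patches, dim) predictor features at masked positions
        # text_tokens: (B, N_text, dim)    embedded caption tokens
        cond, _ = self.attn(query=patch_feats, key=text_tokens, value=text_tokens)
        return self.norm(patch_feats + cond)   # text-modulated patch predictions

cond = TextConditioner()
out = cond(torch.randn(2, 16, 256), torch.randn(2, 20, 256))
print(out.shape)                               # torch.Size([2, 16, 256])
```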
Results
The experiments conducted show that TC-JEPA achieves superior performance compared to traditional contrastive methods across various tasks, including classification and dense prediction tasks like segmentation. It also demonstrates strong scalability and stability during training, approaching the performance of state-of-the-art invariance learning methods.
Implications
The findings suggest that incorporating text conditioning into visual representation learning can significantly enhance the model's ability to understand and reason about fine-grained visual details. This approach could have applications in areas requiring detailed visual comprehension, such as image segmentation, visual question answering, and multimodal tasks.
From Video-to-PDE: Data-Driven Discovery of Nonlinear Dye Plume Dynamics
Computer Vision
Theory
Interpretability
- Development of a comprehensive video-to-PDE pipeline for modeling dye plume dynamics.
- Utilization of weak-form regression to mitigate issues with noisy video data.
- Implementation of rollout calibration and bootstrap diagnostics for coefficient assessment.
- The derived PDE model outperforms traditional advection-diffusion models.
Read more
From Video-to-PDE: Data-Driven Discovery of Nonlinear Dye Plume Dynamics
Summary
This paper presents a novel video-to-PDE pipeline designed to extract continuum models from uncalibrated video recordings of dye plume dynamics. The authors address the challenges of inferring physical models from image intensity data and the instability of direct numerical differentiation on noisy video frames. The pipeline involves several key steps: converting grayscale video data into a normalized scalar field, isolating bulk drift from intrinsic spreading, and employing weak-form sparse regression to identify an effective transport law. The methodology includes a unique rollout calibration process to refine model coefficients and assess their robustness using bootstrap diagnostics. The resulting reduced model, which incorporates a nonlinear-gradient transport law, demonstrates superior predictive performance compared to traditional advection-diffusion models. Additionally, the selected PDE allows for a Cole–Hopf linearization, showcasing the potential for deriving interpretable and simulable models from visual data.
Methodology
The methodology involves converting video data into a normalized scalar field, isolating bulk drift using intensity-weighted centroids, and applying weak-form sparse regression to discover effective transport laws. The model coefficients are refined through an inverse physics-informed neural network (iPINN) and recalibrated against forward rollouts, with uncertainty quantified using chronological block bootstrap techniques.
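For intuition about the sparse-regression step, here is the generic sequentially thresholded least-squares solver used in SINDy-style model discovery; the paper's weak-form library construction and iPINN refinement are not reproduced here.

```python
import numpy as np

def stlsq(Theta, b, threshold=0.1, n_iter=10):
    """Theta: candidate-term library (e.g. u_x, u_xx, u*u_x in weak form); b: target derivative."""
    xi = np.linalg.lstsq(Theta, b, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                                   # prune negligible terms
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], b, rcond=None)[0]
    return xi

rng = np.random.default_rng(0)
Theta = rng.normal(size=(200, 5))
true_xi = np.array([0.0, 1.5, 0.0, -0.8, 0.0])            # sparse ground-truth coefficients
b = Theta @ true_xi + 0.01 * rng.normal(size=200)
print(np.round(stlsq(Theta, b), 2))                       # recovers approx. [0, 1.5, 0, -0.8, 0]
```

The weak-form variant evaluates the library terms against smooth test functions by integration rather than by differentiating noisy frames, which is what makes the regression tolerant of video noise.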
Results
The pipeline successfully identifies a reduced nonlinear-gradient transport law that significantly outperforms traditional advection-diffusion models on held-out video frames. The selected PDE retains a positive Laplacian coefficient and admits a Cole–Hopf reduction to a linear advection-diffusion equation, demonstrating its robustness and interpretability.
Implications
The findings suggest that uncalibrated visual data can be effectively utilized to derive compact and predictive continuum models, which could have applications in various fields such as fluid dynamics, environmental monitoring, and any domain where visual data is prevalent.
Rethinking the Rank Threshold for LoRA Fine-Tuning
NLP
Large Language Models
Theory
- The rank requirement for LoRA fine-tuning can be reduced from 12 to 1 for binary classification tasks.
- The use of non-symmetric manifold dimension analysis leads to a weaker capacity requirement.
- The Polyak–Łojasiewicz inequality allows for the removal of the rank threshold in cross-entropy settings.
- Empirical results demonstrate that rank 1 performs competitively across various binary classification tasks.
Read more
Rethinking the Rank Threshold for LoRA Fine-Tuning
Summary
This paper addresses the rank threshold for Low-Rank Adaptation (LoRA) fine-tuning in the context of neural networks, particularly focusing on binary classification tasks. Previous work established a rank condition (r(r + 1)/2 > KN) for avoiding spurious local minima in the loss landscape, suggesting a rank of at least 12 for few-shot setups. The author presents three main results that significantly reduce this requirement to a rank of 1. First, by employing a non-symmetric count of the LoRA manifold dimension, a weaker capacity requirement is derived. Second, the application of the Polyak–Łojasiewicz inequality in the cross-entropy loss context eliminates the rank threshold entirely. Third, a Rademacher-complexity bound indicates that rank-one is optimal for binary classification when the bias term is saturated. Empirical evaluations across various binary tasks show that rank 1 is competitive with the previously recommended rank of 12, while in multi-class settings, the optimal rank tends to be higher than 1. The findings suggest that the theoretical understanding of rank requirements in LoRA fine-tuning can be significantly relaxed, particularly for binary classification tasks.
Methodology
The paper employs theoretical analysis to derive new rank conditions for LoRA fine-tuning, utilizing concepts from the neural tangent kernel (NTK) regime, Rademacher complexity, and the Polyak–Łojasiewicz inequality. Empirical evaluations are conducted on multiple binary classification tasks using different encoder architectures to validate the theoretical findings.
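A rank-1 LoRA adapter, the configuration the paper argues suffices for binary classification, can be sketched as follows; layer sizes and the scaling constant are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=1, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=1)
print(layer(torch.randn(4, 768)).shape)                    # torch.Size([4, 768])
```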
Results
The analysis shows that for binary classification, a rank of 1 is sufficient for effective fine-tuning, outperforming the previously established threshold of 12. In multi-class scenarios, the optimal rank is found to be greater than 1, confirming the theoretical predictions regarding bias saturation and complexity.
Implications
These findings could lead to more efficient fine-tuning practices in NLP applications, allowing practitioners to use lower rank adaptations without sacrificing performance. This could reduce computational costs and memory usage in deploying large transformer models.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Large Language Models
Reinforcement Learning
Multimodal
- Identification of two core bottlenecks in OPD: insufficient exploration of informative states and unreliable teacher supervision.
- Introduction of a dual-perspective optimization strategy that enhances both student exploration and teacher signal reliability.
- Comprehensive validation of Uni-OPD across diverse settings, showcasing its effectiveness and versatility.
- Demonstration of faster convergence and improved performance compared to existing OPD and reinforcement learning methods.
Read more
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Summary
The paper introduces Uni-OPD, a unified framework for On-Policy Distillation (OPD) aimed at enhancing the knowledge transfer from specialized expert models to a single student model. The authors identify two critical bottlenecks in existing OPD methods: insufficient exploration of informative states by the student and unreliable teacher supervision during student rollouts. To address these issues, Uni-OPD employs a dual-perspective optimization strategy that enhances both student exploration and the reliability of teacher signals. This is achieved through two data balancing strategies: offline difficulty-aware and online correctness-aware balancing, alongside an outcome-guided margin calibration mechanism. The framework is validated through extensive experiments across five domains and sixteen benchmarks, demonstrating its effectiveness in various distillation settings, including single-teacher and multi-teacher scenarios, strong-to-weak distillation, and cross-modal distillation. The results indicate that Uni-OPD consistently outperforms traditional OPD methods and converges faster than reinforcement learning approaches, providing valuable insights into the optimization of OPD.
Methodology
The methodology involves a dual-perspective optimization strategy that combines offline and online data balancing techniques to promote exploration of informative states. Additionally, an outcome-guided margin calibration mechanism is introduced to ensure that teacher supervision remains reliable and order-consistent with the outcome rewards during training.
Results
The experiments conducted across five domains and sixteen benchmarks reveal that Uni-OPD consistently outperforms traditional OPD methods and shows faster convergence compared to reinforcement learning approaches. The framework demonstrates versatility in handling various distillation scenarios, including single-teacher, multi-teacher, strong-to-weak, and cross-modal distillation.
Implications
The findings suggest that Uni-OPD can significantly enhance the efficiency of knowledge distillation in large language models (LLMs) and multimodal language models (MLLMs), potentially leading to improved performance in applications requiring complex reasoning and domain knowledge integration.
On Adaptivity in Zeroth-Order Optimization
Optimization
Large Language Models
Efficient ML
- Adaptive ZO methods like ZO-Adam do not outperform well-tuned ZO-SGD in high-dimensional settings.
- ZO gradients are isotropic and lack the coordinate-wise heterogeneity that adaptive methods exploit.
- MEAZO is proposed as a memory-efficient alternative that achieves global step size adaptation with minimal memory usage.
- MEAZO matches the performance of ZO-Adam while retaining the memory footprint of ZO-SGD.
Read more
On Adaptivity in Zeroth-Order Optimization
Summary
This paper investigates the effectiveness of adaptive zeroth-order (ZO) optimization methods for fine-tuning large language models (LLMs) under memory constraints. The authors challenge the prevailing belief that adaptive methods like ZO-Adam provide significant convergence advantages over well-tuned ZO-SGD, highlighting that these adaptive methods incur considerable memory overhead without substantial benefits in high-dimensional settings. They demonstrate that ZO gradients lack coordinate-wise heterogeneity, which limits the effectiveness of adaptive mechanisms. To address these issues, the authors propose MEAZO, a memory-efficient adaptive ZO optimizer that tracks only a single scalar for global step size adaptation. Theoretical convergence guarantees are provided for MEAZO, and empirical results show that it matches the performance of ZO-Adam while maintaining the memory efficiency of ZO-SGD. Additionally, MEAZO exhibits enhanced robustness to step size choices, making it a practical alternative for ZO fine-tuning in memory-constrained environments.
Methodology
The authors conducted a theoretical analysis of zeroth-order optimization methods, particularly focusing on the properties of ZO gradients in high dimensions. They introduced MEAZO, which simplifies the adaptive mechanism by tracking a single scalar for step size adaptation. The performance of MEAZO was evaluated through experiments on various LLMs and tasks, comparing it against existing adaptive and non-adaptive ZO optimizers.
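The flavor of a zeroth-order step governed by a single extra scalar can be sketched as below; the two-point estimator along a random direction is standard, but the specific scalar rule (an exponential moving average of the squared directional derivative) is an assumption standing in for MEAZO's actual update.

```python
import numpy as np

def zo_step(loss_fn, theta, state, lr=0.3, mu=1e-3, beta=0.9, rng=None):
    u = rng.normal(size=theta.shape)
    u /= np.linalg.norm(u)                                   # random unit direction
    g_hat = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)  # scalar estimate
    state = beta * state + (1 - beta) * g_hat ** 2           # the optimizer's only extra scalar
    step = lr / (np.sqrt(state) + 1e-8)                      # global step size, not per-coordinate
    return theta - step * g_hat * u, state

rng = np.random.default_rng(0)
theta, state = np.ones(50), 1.0
loss = lambda w: float(np.sum(w ** 2))
for _ in range(400):
    theta, state = zo_step(loss, theta, state, rng=rng)
print(round(loss(theta), 3))                                 # far below the initial value of 50.0
```

The point of the sketch is the memory budget: beyond the parameters themselves, the optimizer stores one scalar, in contrast to the full-size first and second moment buffers of ZO-Adam.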
Results
The experiments demonstrated that MEAZO achieves performance comparable to ZO-Adam while using the same memory as ZO-SGD. It also showed improved robustness to step size variations, particularly in grouped or block-structured optimization settings. The theoretical guarantees provided support the convergence properties of MEAZO under standard assumptions.
Implications
The findings suggest that for memory-constrained environments, MEAZO can serve as an effective and efficient alternative for fine-tuning large language models, potentially leading to broader adoption of zeroth-order optimization techniques in practical applications.
A geometric relation of the error introduced by sampling a language model's output distribution to its internal state
NLP
Large Language Models
Interpretability
- Introduces a geometric framework to analyze sampling errors in language models.
- Demonstrates that the curvature of token embeddings relates to the model's internal world representation.
- Uses chess as a controlled environment to evaluate model behavior and decision-making.
- Shows that the geometry of token space can reflect the model's internal representation of problems.
Read more
A geometric relation of the error introduced by sampling a language model's output distribution to its internal state
Summary
This paper explores the sensitivity of GPT-style language models to single-token changes during output generation, particularly when the predicted probability distribution is spread across multiple tokens. The author presents a geometric perspective on this sensitivity by deriving an so(n)-valued 1-form that depends solely on the geometry of the token embeddings. The curvature of this construct is shown to carry semantic significance, particularly in chess reasoning tasks, where it correlates with the model's internal world representation. The study uses off-the-shelf instruction-tuned models and relates the model's internal state to its learned world model through the embedding geometry, focusing on how that geometry reflects the model's decision-making process. The findings indicate that the geometry of token space can provide insight into how models represent problems internally, particularly in contexts where the output is ambiguous because several tokens compete closely.
Methodology
The study employs a geometric approach to derive a relationship between the sampling error in language models and their internal states. It utilizes off-the-shelf models (Qwen2.5 and Mistral-Small) and conducts experiments in a chess reasoning context, analyzing the model's hidden states and decision-making at critical generation points. The research includes the use of linear probes to interpret the learned world model without fine-tuning the models.
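The linear-probing step is standard and can be sketched as below, with random features and a synthetic binary label standing in for the model's hidden states and the board-state property being decoded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2000, 512))                 # stand-in for hidden states at chosen positions
w_true = rng.normal(size=512)
labels = (hidden @ w_true > 0).astype(int)            # stand-in for a board-state property

X_tr, X_te, y_tr, y_te = train_test_split(hidden, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(probe.score(X_te, y_te), 3))              # high accuracy means the property is linearly decodable
```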
Results
The experiments reveal that the curvature derived from the geometric relationship correlates with the model's performance on chess reasoning tasks. The transformations observed cluster by board region and respect the importance of chess pieces, indicating that the geometry of token embeddings is semantically meaningful and reflects the model's internal representation of the chess world.
Implications
The findings suggest that understanding the geometric properties of token embeddings can enhance interpretability in language models, providing insights into their decision-making processes. This could lead to improved model design and training strategies that account for the geometric relationships within token spaces, potentially enhancing performance in tasks requiring nuanced understanding.
ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC
Reinforcement Learning
Robotics
Optimization
- ELVIS combines recurrent state-space models with Gaussian-mixture MPPI for improved long-horizon planning.
- The framework adapts the effective return horizon in real-time, enhancing robustness against model errors.
- It employs uncertainty-aware exploration and exploitation strategies to improve planning reliability.
- ELVIS achieves state-of-the-art performance on benchmark visual tasks and demonstrates effective zero-shot transfer to real-world applications.
Read more
ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC
Summary
The paper introduces ELVIS, a novel latent model predictive controller (MPC) designed to enhance long-horizon planning in visual control tasks using model-based reinforcement learning (RL). Traditional approaches struggle with long rollouts due to branching futures and multi-modal action-value distributions, compounded by model errors and visual occlusions. ELVIS addresses these challenges by employing a Dreamer-style recurrent state space model (RSSM) and replacing the standard unimodal model predictive path integral (MPPI) with a Gaussian-mixture MPPI. This allows the system to maintain multiple coherent hypotheses over long horizons, thus avoiding mode averaging. Additionally, ELVIS stabilizes deep imagination through an ensemble of latent critics that provide an uncertainty-aware return, which adapts the planning process to balance bootstrapping and look-ahead strategies. The framework is evaluated across fourteen DeepMind Control Suite visual tasks, achieving state-of-the-art performance compared to existing methods like TD-MPC2 and DreamerV3. Furthermore, ELVIS demonstrates robust zero-shot transfer capabilities to a real-world sand spraying task with severe occlusions, improving surface quality metrics and showcasing its practical applicability beyond simulation.
Methodology
ELVIS utilizes a recurrent state-space model (RSSM) to capture latent belief under partial observability, combined with a Gaussian-mixture MPPI for long-horizon planning. The approach incorporates an ensemble of latent critics to provide an uncertainty-aware return, which modulates the planning process by adapting the effective return horizon based on real-time confidence thresholds.
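A stripped-down sketch of the mixture-of-Gaussians MPPI update, with a toy rollout cost in place of the learned latent dynamics and critics; component counts, temperature, and noise scale are illustrative.

```python
import numpy as np

def gm_mppi_step(means, std, rollout_cost, n_samples=64, temp=1.0, rng=np.random.default_rng(0)):
    """means: (K, H, A) per-component mean action sequences, one component per hypothesis."""
    K, H, A = means.shape
    new_means = np.empty_like(means)
    for k in range(K):                                     # keep hypotheses separate (no mode averaging)
        samples = means[k] + std * rng.normal(size=(n_samples, H, A))
        costs = np.array([rollout_cost(s) for s in samples])
        w = np.exp(-(costs - costs.min()) / temp)
        w /= w.sum()                                        # softmax over trajectory costs
        new_means[k] = np.tensordot(w, samples, axes=1)     # weighted average per component
    return new_means

def rollout_cost(actions, dt=0.1):                          # toy 1-D double integrator, two goal modes
    x, v = 0.0, 0.0
    for a in actions[:, 0]:
        v += dt * a; x += dt * v
    return min((x - 1.0) ** 2, (x + 1.0) ** 2) + 1e-3 * np.sum(actions ** 2)

means = np.zeros((2, 20, 1)); means[1] -= 0.2               # two initial hypotheses
print(gm_mppi_step(means, std=0.3, rollout_cost=rollout_cost).shape)  # (2, 20, 1)
```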
Results
The evaluation of ELVIS on fourteen DeepMind Control Suite visual tasks shows that it outperforms existing state-of-the-art methods, including TD-MPC2 and DreamerV3. Additionally, ELVIS successfully transfers to a real-world sand spraying task, improving surface quality metrics significantly, indicating its robustness and practical applicability.
Implications
The advancements presented in ELVIS could lead to more reliable and efficient visual control systems in robotics, particularly in environments with partial observability and occlusions. Its ability to adaptively manage uncertainty and improve long-horizon planning could enhance various applications, including autonomous navigation and manipulation tasks in complex real-world scenarios.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
Large Language Models
Efficient ML
Optimization
- OSAQ introduces an additive weight transformation to suppress outliers in low-bit quantization.
- The method exploits the low-rank properties of the Hessian matrix to identify a stable null space.
- OSAQ does not require inter-layer transformations, maintaining efficiency during inference.
- The approach is validated through extensive experiments, showing significant performance improvements over existing methods.
Read more
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
Summary
The paper introduces Outlier Self-Absorption Quantization (OSAQ), a novel method aimed at improving low-bit quantization of Large Language Models (LLMs) by addressing the challenge of systematic outliers in model weights. Traditional quantization methods, such as scaling and rotation, have shown limited effectiveness in mitigating the impact of these outliers, which can significantly degrade model performance, especially in low-bit settings. OSAQ leverages the low-rank consistency of the Hessian matrix associated with the task loss, identifying a stable null space that allows for an additive transformation of weights. This transformation suppresses outliers without altering the task loss and does not require inter-layer adjustments or introduce inference overhead. The authors demonstrate that OSAQ can be efficiently computed using a closed-form solution, avoiding the need for resource-intensive training or iterative processes. Experimental results reveal that OSAQ, when combined with existing methods like GPTQ, achieves substantial improvements in quantization performance, including a notable reduction in perplexity for 2-bit quantization scenarios. Overall, OSAQ presents a significant advancement in the field of model compression for LLMs, enabling more efficient deployment in resource-constrained environments.
Methodology
The methodology involves estimating the Hessian of the task loss with respect to model weights, extracting its null space through eigenvalue decomposition, and constructing an additive transformation of the weights by linearly combining vectors from this null space. The optimal coefficients for this transformation are derived using a closed-form solution, ensuring that the transformation effectively suppresses outliers without impacting task performance.
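The null-space idea can be illustrated with a small NumPy sketch; the coefficient choice below is a plain least-squares projection, not the paper's closed-form solution, and the Hessian is a random low-rank stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=64); w[3] = 12.0                       # a weight vector with an outlier entry
G = rng.normal(size=(8, 64))
H = G.T @ G                                                # low-rank Hessian estimate (rank <= 8)

eigvals, eigvecs = np.linalg.eigh(H)
null_basis = eigvecs[:, eigvals < 1e-8]                    # directions the task loss ignores

# pick null-space coefficients minimizing ||w + N c||^2, i.e. project -w onto the null space
c = null_basis.T @ (-w)
w_absorbed = w + null_basis @ c

print(np.abs(w).max(), np.abs(w_absorbed).max())           # outlier magnitude shrinks
print(np.linalg.norm(H @ (w_absorbed - w)))                # ~0: the move is invisible to the Hessian
```

The transformation is purely additive and applied offline, which is why the method incurs no extra work at inference time.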
Results
Extensive experiments demonstrate that OSAQ significantly enhances low-bit quantization performance. For instance, in 2-bit quantization, OSAQ integrated with GPTQ achieves over 40% lower perplexity compared to the standard GPTQ method, indicating a marked improvement in model efficiency and accuracy.
Implications
The findings suggest that OSAQ can facilitate the deployment of Large Language Models in environments with limited computational resources, making advanced AI capabilities more accessible. This method could be particularly beneficial in applications requiring real-time processing and low-latency responses, such as chatbots and interactive AI systems.
Continual Distillation of Teachers from Different Domains
Efficient ML
- Introduction of the Continual Distillation paradigm for training models on a sequence of teacher models.
- Identification of Unseen Knowledge Transfer (UKT) and Unseen Knowledge Forgetting (UKF) as critical challenges.
- Development of Self External Data Distillation (SE2D) to mitigate UKF while maximizing UKT.
- Empirical validation of SE2D's effectiveness in improving cross-domain generalization.
Read more
Continual Distillation of Teachers from Different Domains
Summary
This paper introduces a novel paradigm called Continual Distillation (CD), where a student model sequentially learns from a stream of teacher models trained on different datasets without retaining access to previous teachers. The authors identify two main challenges in this approach: the unavailability of teacher training data and the varying expertise of teachers. To address these challenges, the paper proposes the concept of Unseen Knowledge Transfer (UKT), which allows the student to learn from domains not present in the training data but known to the teacher, and Unseen Knowledge Forgetting (UKF), where knowledge from previous teachers is lost when training on new ones. The proposed method, Self External Data Distillation (SE2D), aims to balance UKT and UKF by preserving logits on external data, thereby stabilizing learning across heterogeneous teachers. Experimental results demonstrate that SE2D effectively reduces UKF and enhances cross-domain generalization across multiple benchmarks.
Methodology
The authors propose a method called Self External Data Distillation (SE2D) that focuses on preserving logits from external data during the distillation process. This method allows the student model to maintain performance on previously learned domains while acquiring new knowledge from unseen domains. The approach is validated through experiments on various benchmarks to assess its effectiveness in reducing knowledge forgetting and enhancing generalization.
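One plausible instantiation of the logit-preservation objective (not necessarily the paper's exact loss) is a KL term toward a snapshot of the student's own earlier logits on external data:

```python
import torch
import torch.nn.functional as F

def se2d_style_loss(student_logits, teacher_logits, student_logits_ext, snapshot_logits_ext, lam=1.0):
    # standard distillation toward the current teacher on its own domain
    distill = F.kl_div(F.log_softmax(student_logits, -1),
                       F.softmax(teacher_logits, -1), reduction="batchmean")
    # self-distillation on external data toward the student's earlier snapshot,
    # which limits forgetting of knowledge acquired from previous teachers
    preserve = F.kl_div(F.log_softmax(student_logits_ext, -1),
                        F.softmax(snapshot_logits_ext, -1), reduction="batchmean")
    return distill + lam * preserve

s, t = torch.randn(8, 10), torch.randn(8, 10)
s_ext, snap_ext = torch.randn(16, 10), torch.randn(16, 10)
print(se2d_style_loss(s, t, s_ext, snap_ext))
```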
Results
The experiments show that SE2D significantly reduces Unseen Knowledge Forgetting (UKF) and improves cross-domain generalization compared to traditional distillation methods. The results indicate that the proposed method effectively balances the trade-off between transferring unseen knowledge and retaining previously learned information.
Implications
The findings suggest that Continual Distillation could be a viable strategy for training models in scenarios where access to previous training data is limited. This approach has potential applications in developing more efficient and specialized models in various domains, particularly in settings where large-scale datasets are impractical to store or access.
Road Risk Monitor: A Deployable U.S. Road Incident Forecasting System with Live Weather and Road-Level Tiles
Time Series
- Development of a nationwide road incident forecasting system integrating multiple data sources.
- Implementation of a dual-scale modeling approach for improved prediction accuracy.
- Provision of a public codebase for reproducibility and local deployment.
- Achieved high performance metrics for both baseline and road-segment models.
Read more
Road Risk Monitor: A Deployable U.S. Road Incident Forecasting System with Live Weather and Road-Level Tiles
Summary
The paper presents the Road Risk Monitor, a comprehensive system designed for nationwide road incident forecasting in the U.S. This system integrates various data sources, including historical incident archives, live weather data, and national road geometry, to provide real-time predictions of road safety. The Road Risk Monitor employs a dual-scale modeling approach, combining a nationwide H3 baseline model trained on fatal crash data from FARS with a road-segment forecasting pipeline utilizing TIGER/Line geometry and US-Accidents events. The system is designed to be deployable and reproducible, offering a public web application and APIs for accessing predictions. The published codebase includes utilities for data processing, model training, and web serving, enabling users to regenerate the system locally. The results indicate that the baseline model achieves an AUROC of 0.894 and an average precision of 0.715 on a held-out dataset, while the road-segment model shows exceptional performance with an AUROC of 0.9999. The paper emphasizes the importance of an integrated service that can handle diverse data inputs and provide actionable insights for transportation agencies and public safety teams.
Methodology
The methodology involves creating a multi-source forecasting pipeline that combines historical and live data. The system uses an H3 baseline model for nationwide predictions and a road-segment model for localized forecasts. Data from various sources, including FARS, NOAA ISD-Lite, TIGER/Line, and US-Accidents, is processed to generate training tables and serving assets. The system is designed to adapt to live weather inputs and provide predictions through a FastAPI application.
Results
The baseline model achieved an AUROC of 0.894 and an average precision of 0.715 on a held-out year, while the road-segment model reported an AUROC of 0.9999 and an average precision of 0.9999 on an internal holdout dataset. The system successfully processed over 322,000 cleaned incidents and generated a comprehensive set of road segments and matched events across 49 states.
Implications
The Road Risk Monitor has significant implications for transportation agencies, logistics operators, insurers, and public safety teams by providing a reliable tool for predicting road incidents. Its ability to integrate live weather data and historical incident records can enhance decision-making processes related to road safety and infrastructure planning.
Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking
Theory
Large Language Models
- Geometric continuity in weight matrices is influenced by residual connections and symmetry-breaking nonlinearities.
- Activation functions and normalization layers have distinct roles in shaping geometric continuity.
- Continuity is projection-specific in transformers, with different layers exhibiting varying degrees of continuity.
- A nonlinear but rotation-preserving activation fails to maintain continuity, highlighting the importance of symmetry breaking.
Read more
Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking
Summary
This paper investigates the phenomenon of geometric continuity in deep neural networks, where weight matrices of adjacent layers exhibit similar structures, particularly in their principal singular vectors. The authors identify two key mechanisms that contribute to this continuity: the presence of residual connections, which enhance gradient coherence across layers, and symmetry-breaking nonlinearities that maintain a shared coordinate frame among layers. Through experiments on toy multi-layer perceptrons (MLPs) and small transformers, the study reveals that activation functions and normalization layers play distinct roles in shaping continuity. Specifically, while activation functions concentrate continuity in the leading singular direction, normalization distributes it across multiple directions. The findings indicate that continuity is projection-specific in transformers, with different layers developing continuity in either input or output space. The study concludes that symmetry breaking, rather than nonlinearity itself, is crucial for the emergence of geometric continuity in deep networks.
Methodology
The authors conducted ablation experiments on toy MLPs and small transformers, utilizing singular value decomposition (SVD) to analyze weight matrices and measure the cosine similarity of leading singular vectors between adjacent layers. They compared different configurations, including the presence and absence of residual connections and various activation functions.
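The measurement itself is easy to reproduce in miniature; the sketch below uses random matrices with a planted shared direction in place of trained weights.

```python
import numpy as np

def leading_direction_similarity(W1, W2):
    u1, _, vt1 = np.linalg.svd(W1, full_matrices=False)
    u2, _, vt2 = np.linalg.svd(W2, full_matrices=False)
    out_sim = abs(float(u1[:, 0] @ u2[:, 0]))   # output-space (left singular vector) continuity
    in_sim = abs(float(vt1[0] @ vt2[0]))        # input-space (right singular vector) continuity
    return out_sim, in_sim

rng = np.random.default_rng(0)
u = rng.normal(size=256); u /= np.linalg.norm(u)
v = rng.normal(size=256); v /= np.linalg.norm(v)
W1 = 5.0 * np.outer(u, v) + rng.normal(size=(256, 256)) / 16   # two "adjacent" layers sharing
W2 = 5.0 * np.outer(u, v) + rng.normal(size=(256, 256)) / 16   # a dominant direction
print(leading_direction_similarity(W1, W2))                     # high: shared leading direction
print(leading_direction_similarity(W1, rng.normal(size=(256, 256)) / 16))  # near 0 for unrelated weights
```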
Results
The experiments demonstrated that both residual connections and symmetry-breaking nonlinearities are necessary for maintaining geometric continuity. In transformers, the study found that layers responsible for reading from the residual stream developed input-space continuity, while those writing to it developed output-space continuity. The results also indicated that removing activation functions significantly reduced continuity in certain layers.
Implications
The insights from this study could inform the design of more efficient neural network architectures by leveraging geometric continuity for layer pruning and cross-layer parameter sharing, ultimately enhancing model performance and reducing computational costs.
Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics
Optimization
Theory
- EDL learns a transferable classification loss using synthetic data without real sample access.
- The framework employs a ranking-consistency objective to enforce meaningful loss penalties.
- An evolutionary strategy with chaotic mutation enhances the robustness and exploration of loss shape optimization.
- EDL can replace traditional loss functions like cross-entropy, yielding competitive accuracy.
Read more
Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics
Summary
This paper introduces the Evolutionary Dynamic Loss (EDL), a novel framework designed to learn a transferable classification loss in probability space using synthetic prediction-label pairs, without the need for real samples during the pretraining phase. The authors argue that traditional loss functions, such as cross-entropy, are fixed and may not adapt well to varying training conditions. EDL addresses this limitation by parameterizing the loss as a lightweight network and training it with a ranking-consistency objective that imposes larger penalties for more erroneous predictions. To effectively search for optimal loss shapes, the authors employ an evolutionary strategy enhanced by chaotic mutation, which improves exploration and robustness under noisy evaluations. Experimental results on the CIFAR-10 dataset with ResNet backbones demonstrate that EDL can replace cross-entropy, achieving competitive or improved accuracy. The study also shows that chaotic mutation leads to faster convergence and better performance metrics compared to traditional Gaussian mutation. Overall, EDL presents a distribution-free approach to loss learning that can be integrated into standard classifier training pipelines, offering a promising direction for enhancing model performance across diverse tasks.
Methodology
The methodology involves generating unlimited synthetic prediction-label pairs and training a parametric loss network using a ranking-consistency objective. An evolutionary strategy is used for loss shape optimization, incorporating chaotic mutation to enhance exploration and reduce premature convergence.
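A minimal sketch of an elitist evolution strategy whose mutations are driven by a logistic chaotic map rather than Gaussian noise; the objective, population size, and mutation scale are placeholders for the loss-network search.

```python
import numpy as np

def chaotic_es(objective, dim=10, pop=20, iters=150, sigma=0.3):
    rng = np.random.default_rng(0)
    x = rng.normal(size=dim)
    best = objective(x)
    chaos = rng.uniform(0.05, 0.95, size=(pop, dim))         # independent chaotic states
    for _ in range(iters):
        chaos = 4.0 * chaos * (1.0 - chaos)                   # logistic map in the chaotic regime (r = 4)
        candidates = x + sigma * (2.0 * chaos - 1.0)          # mutation noise in [-sigma, sigma]
        scores = np.array([objective(c) for c in candidates])
        if scores.min() < best:                               # elitist selection
            best, x = float(scores.min()), candidates[np.argmin(scores)]
    return x, best

sphere = lambda w: float(np.sum(w ** 2))
_, score = chaotic_es(sphere)
print(round(score, 3))                                        # well below the random start (around 10)
```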
Results
Experiments conducted on the CIFAR-10 dataset show that EDL can effectively replace cross-entropy loss, achieving competitive or improved accuracy. The use of chaotic mutation resulted in faster convergence and better performance metrics compared to standard Gaussian mutation.
Implications
The findings suggest that EDL can be a valuable tool for developing adaptable loss functions that improve model performance across various datasets and training conditions, potentially leading to more robust machine learning applications.
DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data
Theory
Optimization
Efficient ML
- DynaTab introduces dynamic feature ordering to improve model performance on high-dimensional tabular data.
- The model predicts when feature permutation will be beneficial based on dataset complexity.
- DynaTab integrates order-aware mechanisms such as positional embeddings and masked attention.
- The architecture shows significant performance gains compared to 45 state-of-the-art models.
Read more
DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data
Summary
The paper introduces DynaTab, a novel deep learning architecture designed to enhance the performance of models on high-dimensional tabular data by employing dynamic feature ordering (DFO). Traditional deep learning models struggle with tabular data due to its lack of inherent structure, which limits their effectiveness. DynaTab addresses this challenge by dynamically reordering features based on a lightweight criterion that quantifies dataset complexity and the potential benefits of feature permutation. The architecture integrates a combination of positional embeddings, importance-based gating, and masked attention layers, allowing it to adaptively optimize feature connectivity. The model is trained end-to-end using bespoke DFO and dispersion losses, demonstrating statistically significant improvements over 45 state-of-the-art baselines across 36 real-world datasets. The results suggest that DynaTab represents a promising new paradigm for deep learning on high-dimensional tabular data, mimicking the brain's neuroplasticity to enhance learning and decision-making processes.
Methodology
DynaTab employs a dynamic feature ordering algorithm inspired by neural rewiring principles. It utilizes a lightweight criterion to assess dataset complexity and the potential advantages of feature permutation. The model incorporates a combination of learned positional embeddings, importance-based gating, and masked attention layers, enabling it to adaptively reorder features and optimize inter-feature connectivity. The training is conducted end-to-end with specific losses tailored for dynamic feature ordering and dispersion.
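A rough sketch of importance-gated feature reordering ahead of an order-aware encoder; the hard argsort and the architecture sizes are simplifying assumptions, not DynaTab's actual mechanism.

```python
import torch
import torch.nn as nn

class DynamicFeatureOrdering(nn.Module):
    def __init__(self, n_features=32, dim=64):
        super().__init__()
        self.embed = nn.Linear(1, dim)                      # per-feature value embedding
        self.gate = nn.Parameter(torch.zeros(n_features))   # learned feature-importance scores
        self.pos = nn.Parameter(torch.randn(n_features, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):                                   # x: (B, n_features)
        order = torch.argsort(self.gate, descending=True)   # hard reordering (a simplification)
        tokens = self.embed(x[:, order].unsqueeze(-1))      # (B, n_features, dim)
        tokens = tokens * torch.sigmoid(self.gate[order]).view(1, -1, 1)  # importance-based gating
        return self.head(self.encoder(tokens + self.pos).mean(dim=1))

model = DynamicFeatureOrdering()
print(model(torch.randn(8, 32)).shape)                      # torch.Size([8, 1])
```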
Results
DynaTab achieved statistically significant improvements in performance across 36 high-dimensional tabular datasets when benchmarked against 45 existing state-of-the-art models. The results indicate that the dynamic feature ordering approach effectively enhances the model's ability to capture dependencies and reduce redundancy in high-dimensional data.
Implications
The findings suggest that DynaTab could be a valuable tool for practitioners dealing with high-dimensional tabular data in various domains, including finance, healthcare, and social sciences. Its ability to adaptively optimize feature ordering may lead to more robust and accurate predictive models in these fields.
Adaptive Data Compression and Reconstruction for Memory-Bounded EEG Continual Learning
Time Series
Efficient ML
Theory
- Introduction of ADaCoRe, a novel pipeline for memory-efficient EEG continual learning.
- Utilization of morphology-aware techniques to enhance data compression and reconstruction.
- Demonstrated significant performance gains over existing UICL methods under strict memory constraints.
- Ablation studies highlight the importance of each component in the proposed pipeline.
Read more
Adaptive Data Compression and Reconstruction for Memory-Bounded EEG Continual Learning
Summary
This paper addresses the challenges of analyzing Electroencephalography (EEG) signals, which are often hindered by noise and inter-subject variability, particularly in scenarios with limited labeled data. The author proposes a novel approach called Adaptive Data Compression and Reconstruction (ADaCoRe) aimed at enhancing Unsupervised Individual Continual Learning (UICL) for EEG applications. Unlike traditional methods that store full past samples, ADaCoRe employs a memory-efficient pipeline that leverages the structured morphology of EEG signals. The pipeline consists of four main components: saliency-driven keyframe protection to retain critical segments, rational polyphase compression for efficient downsampling, adjoint reconstruction for overwriting protected indices, and prototype-confidence selection for maintaining a compact buffer of exemplars. The proposed method is evaluated across three benchmarks, demonstrating significant performance improvements over existing baselines, particularly under tight memory constraints. The findings indicate that ADaCoRe not only preserves essential EEG morphologies during compression and reconstruction but also enhances model plasticity and stability.
Methodology
The methodology involves a four-component pipeline: (1) saliency-driven keyframe protection to retain critical EEG segments, (2) rational polyphase compression for efficient downsampling of non-keyframes, (3) adjoint reconstruction that overwrites protected indices, and (4) prototype-confidence selection to maintain a diverse and informative exemplar buffer. This approach is designed to adaptively compress EEG data while preserving its morphological characteristics.
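The first two stages can be sketched as below, with an energy-based saliency rule and a 2:3 polyphase ratio chosen purely for illustration rather than taken from the paper.

```python
import numpy as np
from scipy.signal import resample_poly

def compress_epoch(x, seg_len=100, keep_frac=0.2, up=2, down=3):
    segs = x[: len(x) // seg_len * seg_len].reshape(-1, seg_len)
    saliency = (segs ** 2).mean(axis=1)                       # simple energy-based saliency
    n_keep = max(1, int(keep_frac * len(segs)))
    protected = set(np.argsort(saliency)[-n_keep:].tolist())  # keyframes kept at full rate
    out = [seg if i in protected else resample_poly(seg, up, down)  # rational polyphase compression
           for i, seg in enumerate(segs)]
    return protected, out

rng = np.random.default_rng(0)
eeg = rng.normal(size=3000)
eeg[1200:1300] += 5 * np.sin(np.linspace(0, 20 * np.pi, 100))  # a salient burst in segment 12
protected, out = compress_epoch(eeg)
print(12 in protected, sorted({len(s) for s in out}))           # True; segment lengths [67, 100]
```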
Results
ADaCoRe consistently outperformed strong baselines, including the state-of-the-art method BrainUICL, achieving performance gains of at least +2.7% and +15.3% accuracy on the ISRUC and FACED datasets, respectively. The ablation studies provided insights into the trade-offs between compression and fidelity, confirming the effectiveness of the proposed components.
Implications
The findings suggest that ADaCoRe could significantly improve the adaptability of EEG analysis systems in real-world applications, such as brain-computer interfaces and sleep monitoring, particularly in scenarios where labeled data is scarce and memory resources are limited.
QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization
Optimization
- QUIVER optimizes the allocation of resources between objective evaluations and preference elicitation.
- The method adapts the selection of query modalities based on the difficulty of the optimization problem.
- QUIVER outperforms traditional single-modality baselines in terms of utility regret.
- The approach integrates a value-of-information perspective to unify objective exploration and preference learning.
Read more
QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization
Summary
The paper introduces QUIVER, a novel approach to interactive multi-objective optimization that addresses the budget allocation dilemma between expensive objective evaluations and preference elicitation. The study highlights the importance of efficiently allocating resources to explore the Pareto front while simultaneously understanding decision-maker preferences. QUIVER employs a surrogate-assisted evolutionary multi-objective optimizer that adaptively selects between objective evaluations and heterogeneous preference queries (pairwise preference statements and indifference adjustments) to maximize expected decision-quality improvement per unit cost. The methodology is evaluated on DTLZ and WFG benchmark problems, demonstrating that QUIVER achieves lower utility regret compared to traditional methods, particularly excelling in challenging scenarios. The adaptive selection of query modalities is shown to vary with problem difficulty, indicating a sophisticated understanding of cost-aware preference learning.
Methodology
QUIVER utilizes a surrogate-assisted evolutionary algorithm, specifically NSGA-II, and implements a value-of-information-based action selection rule that uses Monte Carlo estimation to determine the expected improvement in decision quality per unit cost. It treats objective evaluations and preference queries as a unified budgeted action space, allowing for adaptive decision-making throughout the optimization process.
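The action-selection rule can be sketched as an expected-improvement-per-cost comparison, with random placeholder estimators standing in for the surrogate- and preference-model-based Monte Carlo estimates.

```python
import numpy as np

def select_action(estimators, costs, n_mc=200, rng=np.random.default_rng(0)):
    """estimators: dict action -> callable returning one sampled regret reduction."""
    best, best_rate = None, -np.inf
    for action, sample in estimators.items():
        gain = np.mean([sample(rng) for _ in range(n_mc)])   # Monte Carlo expected improvement
        rate = gain / costs[action]                           # value of information per unit cost
        if rate > best_rate:
            best, best_rate = action, rate
    return best, best_rate

estimators = {
    "objective_eval": lambda rng: rng.exponential(0.10),      # placeholder improvement samplers
    "pairwise_query": lambda rng: rng.exponential(0.05),
    "indifference_query": lambda rng: rng.exponential(0.06),
}
costs = {"objective_eval": 5.0, "pairwise_query": 1.0, "indifference_query": 1.5}
print(select_action(estimators, costs))                       # the most cost-effective action and its rate
```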
Results
In empirical evaluations on DTLZ and WFG benchmarks, QUIVER achieved the lowest utility regret, with notable improvements of 25% over baseline methods. The adaptive selection mechanism demonstrated that QUIVER allocates approximately 80% of queries to pairwise statements on easier problems and shifts to 35% indifference adjustments on harder problems, showcasing its ability to adjust to varying problem complexities.
Implications
The findings suggest that QUIVER can significantly enhance the efficiency of multi-objective optimization in real-world applications where evaluations are costly. Its adaptive querying strategy can lead to better decision-making outcomes by effectively balancing exploration and preference elicitation, making it a valuable tool for fields such as engineering design and portfolio selection.
Disease Is a Spectral Perturbation
Theory
Interpretability
Multimodal
- Introduces the concept of a biomarker Hamiltonian to model disease transformation.
- Characterizes disease as a spectral perturbation of the healthy biomarker covariance structure.
- Derives optimal prognostic statistics based on eigenmode projections.
- Establishes a unified framework that connects various existing multiomics methods.
Read more
Disease Is a Spectral Perturbation
Summary
This paper introduces a novel framework for understanding disease transformation through the lens of biomarker covariance matrices. By defining a 'biomarker Hamiltonian' based on the covariance structure of healthy controls and disease states, the authors characterize disease as a perturbation of this structure. The eigenvectors of the Hamiltonian represent normal modes of biomarker coordination, while the eigenvalues quantify the energy associated with these modes. The paper formalizes the relationship between disease perturbations and spectral changes, demonstrating that the projection of a patient's biomarker covariance onto disease-discriminant eigenmodes serves as an optimal prognostic statistic. This approach provides a unified mathematical formalism that can be applied across various disease frameworks, including cancer and neurodegenerative disorders, and addresses the fragmentation in current multiomics trajectory modeling methods.
Methodology
The authors define the biomarker Hamiltonian as the covariance matrix of biomarker measurements and utilize spectral decomposition to analyze the eigenvalues and eigenvectors. They apply matrix perturbation theory to derive the effects of disease on the Hamiltonian and establish conditions for eigenbasis transfer across cohorts.
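A small NumPy sketch of the pipeline: diagonalize the healthy covariance, locate the eigenmode along which the disease covariance shifts most, and score patients by projecting onto it. The cohort construction and the squared-projection score are illustrative choices, not the paper's exact statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 20
healthy = rng.normal(size=(500, p))
disease = rng.normal(size=(400, p)); disease[:, 3] += 1.5 * disease[:, 7]  # perturbed coupling

H0 = np.cov(healthy, rowvar=False)                 # healthy biomarker "Hamiltonian"
H1 = np.cov(disease, rowvar=False)
evals, evecs = np.linalg.eigh(H0)                  # normal modes of healthy coordination

shift = np.diag(evecs.T @ (H1 - H0) @ evecs)       # per-mode spectral perturbation
k = int(np.argmax(np.abs(shift)))                  # most disease-discriminant eigenmode

def prognostic_score(x):                           # project a patient's profile onto mode k
    return float((x @ evecs[:, k]) ** 2)

print(np.mean([prognostic_score(x) for x in healthy]),
      np.mean([prognostic_score(x) for x in disease]))   # the disease cohort scores higher
```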
Results
The paper shows that the perturbation of the healthy covariance structure leads to well-characterized spectral signatures that can differentiate disease states. It proves the optimality of the prognostic score derived from the projection onto disease-discriminant eigenmodes, enhancing precision in disease prognosis.
Implications
This framework has significant implications for improving disease diagnosis and prognosis through a deeper understanding of biomarker interactions. It can facilitate the development of more accurate predictive models in clinical settings and enhance the interpretability of multiomics data.
Calibration of the underlying surface parameters for urban flood using latent variables and adjoint equation
Optimization
- Introduces a Bayesian framework for urban flood parameter calibration using latent variables.
- Utilizes the adjoint equation of the Urban Flood Dynamical System model for efficient optimization.
- Demonstrates rapid convergence and robustness to observation time intervals in calibration.
- Achieves significant accuracy in calibrating Manning's coefficient for urban roads.
Read more
Calibration of the underlying surface parameters for urban flood using latent variables and adjoint equation
Summary
This paper addresses the critical issue of calibrating urban underlying surface parameters for effective urban flood simulation. The authors formulate the calibration problem as an optimization task within a Bayesian framework, utilizing the maximum likelihood principle. They introduce latent variables inspired by machine learning to account for uncertainties in the calibration process, enhancing compatibility with traditional physical parameter calibration methods. To improve optimization efficiency, the authors derive the adjoint equation of the Urban Flood Dynamical System (UFDS) model, which serves as a surrogate model. This approach allows for the efficient computation of gradient information, while techniques such as parameter sharing and localization are employed to reduce computational complexity. The proposed method is validated through a simple case, demonstrating rapid convergence and insensitivity to observation time intervals. The calibration of Manning’s coefficient for urban roads yielded a maximum relative error of 13.88% and a minimum of 1.16%, showcasing the method's effectiveness in parameter estimation.
Methodology
The authors formulated the calibration problem as an optimization problem within a Bayesian framework, employing latent variables to capture uncertainties. They constructed the adjoint equation of the UFDS model to obtain gradient information, and implemented parameter sharing and localization techniques to enhance computational efficiency.
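As a rough illustration of adjoint-based calibration, the sketch below fits Manning's n in a toy single-cell runoff model: the forward pass is an explicit time-stepping scheme, the gradient comes from a hand-derived discrete adjoint recursion, and a bounded quasi-Newton solver performs the optimization. The toy dynamics, step size, and use of SciPy's L-BFGS-B are stand-ins for the UFDS model, the latent variables, and the parameter sharing/localization described in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def simulate(n, inflow, dt=1.0, S=0.01):
    """Toy single-cell runoff model with a Manning-style outflow q = sqrt(S)/n * h^(5/3)."""
    h = np.zeros(len(inflow) + 1)
    for t in range(len(inflow)):
        q = np.sqrt(S) / n * h[t] ** (5.0 / 3.0)
        h[t + 1] = h[t] + dt * (inflow[t] - q)
    return h

def loss_and_adjoint_grad(n, inflow, obs, dt=1.0, S=0.01):
    """Misfit J and dJ/dn obtained from the discrete adjoint (reverse) recursion."""
    h = simulate(n, inflow, dt, S)
    J = 0.5 * np.sum((h[1:] - obs) ** 2)
    T = len(inflow)
    lam = np.zeros(T + 1)                    # adjoint states lam[t] = dJ/dh[t]
    grad = 0.0
    for t in range(T - 1, -1, -1):
        lam[t + 1] += h[t + 1] - obs[t]      # direct misfit contribution at step t+1
        dq_dh = np.sqrt(S) / n * (5.0 / 3.0) * h[t] ** (2.0 / 3.0)
        dq_dn = -np.sqrt(S) / n ** 2 * h[t] ** (5.0 / 3.0)
        grad += lam[t + 1] * (-dt * dq_dn)   # sensitivity of h[t+1] to Manning's n
        lam[t] += lam[t + 1] * (1.0 - dt * dq_dh)   # propagate the adjoint backwards
    return J, grad

# Synthetic "observations" generated with a true Manning's n, then recovered by calibration
rng = np.random.default_rng(1)
inflow = 0.001 * (1.0 + rng.random(200))
obs = simulate(0.013, inflow)[1:]

def objective(x):
    J, g = loss_and_adjoint_grad(x[0], inflow, obs)
    return J, np.array([g])

res = minimize(objective, x0=[0.03], jac=True, bounds=[(0.005, 0.1)], method="L-BFGS-B")
print("calibrated Manning's n:", res.x[0])
```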
Results
The proposed calibration method showed quick convergence in a simple case and was insensitive to the observation time interval. In a practical application, the calibration of Manning’s coefficient resulted in a maximum relative error of 13.88% and a minimum of 1.16%.
Implications
This research has significant implications for urban flood management and simulation, providing a more accurate and efficient method for calibrating underlying surface parameters, which can enhance predictive capabilities and inform better urban planning and flood mitigation strategies.
Knowledge-Free Correlated Agreement for Incentivizing Federated Learning
Federated Learning
Theory
Efficient ML
- KFCA is a knowledge-free mechanism that incentivizes honest reporting in federated learning.
- It eliminates the label-flipping vulnerability present in existing methods like Correlated Agreement.
- KFCA supports real-time reward computation without the need for report aggregation.
- Empirical evaluations show significant reductions in reward computation costs compared to traditional methods.
Read more
Knowledge-Free Correlated Agreement for Incentivizing Federated Learning
Summary
This paper introduces Knowledge-Free Correlated Agreement (KFCA), a novel mechanism designed to incentivize client contributions in federated learning (FL) without the need for ground truth, public test sets, or distribution knowledge. The authors highlight the limitations of existing methods, particularly the Correlated Agreement (CA) mechanism, which is vulnerable to label-flipping attacks and requires extensive report aggregation. KFCA addresses these issues by providing a scoring rule that is strongly truthful under a categorical-world condition, ensuring that honest reporting maximizes expected rewards. The mechanism is evaluated in two contexts: federated large language model (LLM) adapter tuning and a real-world printed circuit board (PCB) inspection task. The results demonstrate that KFCA allows for efficient, real-time reward computation, making it suitable for decentralized and blockchain-based FL applications.
Methodology
The authors propose a multi-task peer-prediction (MTPP) mechanism that rewards clients based on the correlation of their reports without requiring ground truth. KFCA operates under a categorical-world condition, ensuring strong truthfulness and efficient reward computation. The methodology includes theoretical analysis and empirical evaluation in federated LLM tuning and PCB inspection tasks.
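For intuition about the multi-task peer-prediction family that KFCA belongs to, the sketch below computes a simplified correlated-agreement-style score: agreement with a random peer on the same tasks minus agreement on mismatched tasks, so uninformative reporting scores near zero in expectation. KFCA's knowledge-free scoring rule, its categorical-world condition, and its defence against label flipping are not reproduced here; this is only a generic baseline for orientation.

```python
import numpy as np

def peer_prediction_scores(reports: np.ndarray, rng=None) -> np.ndarray:
    """Simplified correlated-agreement-style scores for categorical client reports.

    reports[i, t] is client i's label for task t. Each client is paired with a
    random peer; the score is agreement on shared tasks minus agreement on
    randomly mismatched tasks, so constant or random reporting scores ~0.
    """
    rng = rng or np.random.default_rng(0)
    n_clients, n_tasks = reports.shape
    scores = np.zeros(n_clients)
    for i in range(n_clients):
        j = rng.choice([c for c in range(n_clients) if c != i])   # random peer
        bonus = np.mean(reports[i] == reports[j])                  # same-task agreement
        perm = rng.permutation(n_tasks)                            # mismatched tasks
        penalty = np.mean(reports[i] == reports[j, perm])
        scores[i] = bonus - penalty
    return scores

# Toy example: 4 mostly-honest clients and 1 label-flipping client on binary tasks
rng = np.random.default_rng(42)
truth = rng.integers(0, 2, size=50)
honest = np.array([np.where(rng.random(50) < 0.9, truth, 1 - truth) for _ in range(4)])
flipper = (1 - truth)[None, :]                                     # adversarial label flipping
print(peer_prediction_scores(np.vstack([honest, flipper]), rng))
```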
Results
KFCA achieves orders-of-magnitude lower reward computation costs than Shapley-value estimators. It demonstrates strong truthfulness: as long as fewer than 50% of clients deviate from honest reporting, honest effort maximizes a client's expected reward. The empirical results validate its feasibility in real-world applications.
Implications
The introduction of KFCA has significant implications for the future of federated learning, particularly in decentralized and blockchain environments. It provides a robust framework for incentivizing client participation without the need for extensive data verification, thus promoting broader adoption of FL in various domains.
Spatiotemporal Convolutions on EEG signal -- A Representation Learning Perspective on Efficient and Explainable EEG Classification with Convolutional Neural Nets
Time Series
Efficient ML
Interpretability
- 2D spatiotemporal convolutions significantly reduce training time for high-dimensional EEG classification tasks.
- The representational geometry differs between 1D and 2D CNN models, impacting the interpretability of learned features.
- The 2D models maintain classification performance while cutting training time, which matters for real-time EEG applications.
- Architectural design in CNNs should consider the unique characteristics of EEG data for better feature extraction.
Read more
Spatiotemporal Convolutions on EEG signal -- A Representation Learning Perspective on Efficient and Explainable EEG Classification with Convolutional Neural Nets
Summary
This paper explores the use of bi-dimensional (2D) spatiotemporal convolutions for classifying EEG signals, contrasting it with traditional one-dimensional (1D) convolutions that are commonly used in EEG analysis. The authors argue that while 1D convolutions are effective, they often concatenate spatial and temporal features without non-linear activation, potentially limiting their learning capacity. By employing 2D convolutions, the authors demonstrate a significant reduction in training time for high-dimensional EEG tasks while maintaining classification performance. The study investigates the representational geometry of the models, revealing that 1D and 2D CNNs produce markedly different internal representations despite similar spectral feature importance. This suggests that the architectural design of convolutional layers plays a crucial role in processing complex multivariate signals like EEG, emphasizing the need for tailored approaches in deep learning for neuroimaging applications.
Methodology
The authors conducted experiments comparing 1D CNNs, 2D CNNs, and a CNN+transformer hybrid model on both low-dimensional (3-channel) and high-dimensional (22-channel) EEG datasets for motor imagery classification tasks. They analyzed the training times, performance metrics, and representational similarities across the different models.
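A minimal sketch of the 2D spatiotemporal idea, assuming PyTorch and made-up layer sizes: the EEG trial is treated as an (electrodes x time) image, so each kernel mixes spatial and temporal structure through a shared non-linearity, unlike 1D pipelines that filter in time and space as separate, linearly concatenated stages. This is not the authors' architecture.

```python
import torch
import torch.nn as nn

class Tiny2DEEGNet(nn.Module):
    """Minimal 2D spatiotemporal CNN over an (electrodes x time) EEG 'image'."""
    def __init__(self, n_channels=22, n_samples=1000, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            # One kernel spans several electrodes AND several time points at once.
            nn.Conv2d(1, 16, kernel_size=(5, 25), padding=(2, 12)),
            nn.BatchNorm2d(16),
            nn.ELU(),
            nn.AvgPool2d(kernel_size=(2, 8)),
            nn.Conv2d(16, 32, kernel_size=(3, 11), padding=(1, 5)),
            nn.BatchNorm2d(32),
            nn.ELU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_channels, n_samples)
        z = self.features(x).flatten(1)
        return self.classifier(z)

model = Tiny2DEEGNet()
dummy = torch.randn(8, 1, 22, 1000)           # 8 trials, 22 electrodes, 1000 time points
print(model(dummy).shape)                      # -> torch.Size([8, 4])
```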
Results
The results indicated that 2D convolutions led to faster training times in high-dimensional tasks without sacrificing classification accuracy. The analysis of internal representations showed that while spectral features remained consistent, the geometrical representation of the data differed significantly between the 1D and 2D models, suggesting that the choice of convolutional architecture affects the learning process.
Implications
The findings highlight the importance of model architecture in EEG classification, suggesting that 2D convolutions can enhance efficiency and interpretability in brain-computer interface applications. This could lead to more effective real-time EEG analysis and better understanding of cognitive processes through improved feature representation.
Gated Subspace Inference for Transformer Acceleration
NLP
Large Language Models
Efficient ML
- GSI exploits the low effective rank of token activation manifolds for inference acceleration.
- The method achieves 3.0× to 10.5× speedups on linear-layer weight reads without retraining or architectural changes.
- A per-token gating mechanism ensures output distribution preservation.
- GSI extends previous work by covering all linear maps in transformers, not just MLP layers.
Read more
Gated Subspace Inference for Transformer Acceleration
Summary
This paper introduces Gated Subspace Inference (GSI), a novel method aimed at accelerating inference in transformer language models by leveraging the low effective rank of token activation manifolds at each layer. The approach decomposes activation vectors into subspace components and residuals, allowing for the computation of linear-layer outputs using a cached low-rank weight image, which reduces memory bandwidth usage. A per-token gating mechanism determines whether to compute the residual correction, ensuring that the output distribution remains within a controllable tolerance. The method was validated on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X, achieving speedups of 3.0× to 10.5× on linear-layer weight reads while maintaining perplexity ratios below 1.00 and top-1 token agreement above 98%. Importantly, GSI requires no retraining, architectural modifications, or approximations of the attention mechanism, and at the operating point (k = 256, ε = 0.05) on GPT-J 6B, it produces identical outputs to the baseline.
Methodology
GSI decomposes activation vectors into subspace components and residuals, computes outputs using cached low-rank weight images, and employs a gating mechanism to selectively compute residuals based on their contribution to the output. The method builds on previous work by extending the activation basis across all linear maps in transformers and initializing it from previous layers to exploit subspace coherence.
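The core bookkeeping described above can be sketched in a few lines: cache the low-rank weight image W·U, serve most tokens from the subspace coefficients, and fall back to a full-width residual correction only when the per-token gate fires. How the basis U is chosen and propagated across layers is the paper's contribution and is simplified away here.

```python
import numpy as np

class GatedSubspaceLinear:
    """Sketch of a gated low-rank path for y = W x (bias omitted).

    U is a (d_in x k) orthonormal basis for the token-activation subspace; the
    cached "weight image" W @ U covers the common case with far fewer weight
    reads, and a per-token gate decides when the residual correction is needed.
    """
    def __init__(self, W: np.ndarray, U: np.ndarray, eps: float = 0.05):
        self.W, self.U, self.eps = W, U, eps
        self.WU = W @ U                        # cached low-rank weight image, (d_out x k)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        c = self.U.T @ x                       # subspace coefficients
        y = self.WU @ c                        # cheap low-rank output
        r = x - self.U @ c                     # residual outside the subspace
        if np.linalg.norm(r) > self.eps * np.linalg.norm(x):
            y = y + self.W @ r                 # full-width correction only when gated on
        return y

# Toy check against the dense layer
rng = np.random.default_rng(0)
d_in, d_out, k = 512, 512, 64
W = rng.normal(size=(d_out, d_in))
U, _ = np.linalg.qr(rng.normal(size=(d_in, k)))
layer = GatedSubspaceLinear(W, U, eps=0.05)
x = U @ rng.normal(size=k) + 0.01 * rng.normal(size=d_in)   # activation mostly in the subspace
print(np.linalg.norm(layer(x) - W @ x) / np.linalg.norm(W @ x))
```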
Results
The GSI method demonstrated speedups of 3.0× to 10.5× in linear-layer weight reads across different model families while maintaining high accuracy, with perplexity ratios below 1.00 and top-1 token agreement above 98%. At specific operational parameters, the accelerated model produced outputs identical to the baseline model.
Implications
The GSI method has potential applications in deploying large language models more efficiently, particularly in environments with limited memory bandwidth. It can facilitate faster inference in real-time applications, making transformer models more accessible for practical use cases.
Beyond Activation Alignment: The Geometry of Neural Sensitivity
Theory
- Introduces a framework focusing on local decodable information for comparing neural representations.
- Defines second-moment local perturbation-discrimination tasks and summarizes them using expected projected pullback/Fisher metrics.
- Develops the Spectral Riemannian Alignment Score (S-RAS) for comparing neural representations.
- Empirically validates the framework across artificial and biological systems, including neural networks and mouse visual cortex data.
Read more
Beyond Activation Alignment: The Geometry of Neural Sensitivity
Summary
This paper introduces a novel framework for comparing neural representations that goes beyond traditional activation alignment methods such as Representational Similarity Analysis (RSA), Canonical Correlation Analysis (CCA), and Centered Kernel Alignment (CKA). While these methods assess the geometric similarity of neural representations, they do not account for how these representations respond to small perturbations in input stimuli. The authors propose a focus on local decodable information, which quantifies a representation's ability to discriminate small perturbations within a specified stimulus-coordinate subspace. By utilizing Fisher information and local representation geometry, the authors derive a second-moment family of local discrimination tasks, summarized by the expected projected pullback/Fisher metric. This leads to the development of the Spectral Riemannian Alignment Score (S-RAS), which allows for a comprehensive comparison of neural representations. The empirical validation of this framework demonstrates its effectiveness in recovering corresponding layers across independently trained neural networks, supporting transferable class-conditional probes, and revealing insights into the mouse visual cortex using data from the Allen Brain Observatory.
Methodology
The authors propose a framework that utilizes local decodable information and Fisher information to assess how neural representations respond to small perturbations. They define second-moment local perturbation-discrimination tasks and summarize these tasks using the expected projected pullback/Fisher metric. The comparison of these summaries is performed using a log-spectral distance on the manifold of symmetric positive definite matrices, resulting in the Spectral Riemannian Alignment Score (S-RAS).
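A small sketch of the ingredients, with the standard affine-invariant (log-spectral) distance on symmetric positive definite matrices standing in for the paper's exact construction: each representation contributes a pullback metric JᵀJ over shared stimulus coordinates, and two metrics are compared through the logarithms of their generalized eigenvalues. The expectation over stimuli and the precise S-RAS aggregation are omitted.

```python
import numpy as np
from scipy.linalg import eigh

def pullback_metric(jacobian: np.ndarray) -> np.ndarray:
    """Second-moment (pullback/Fisher-style) metric M = J^T J for a representation map."""
    return jacobian.T @ jacobian

def log_spectral_distance(M1: np.ndarray, M2: np.ndarray) -> float:
    """Affine-invariant log-spectral distance between SPD matrices via generalized eigenvalues."""
    lam = eigh(M2, M1, eigvals_only=True)      # eigenvalues of M1^{-1} M2
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

# Toy comparison: two "layers" mapping a 5-D stimulus coordinate space to features
rng = np.random.default_rng(0)
J_a = rng.normal(size=(50, 5))                 # sensitivity (Jacobian) of layer A
J_b = J_a @ (np.eye(5) + 0.05 * rng.normal(size=(5, 5)))   # nearly aligned sensitivity
J_c = rng.normal(size=(50, 5))                 # unrelated sensitivity structure

M_a, M_b, M_c = map(pullback_metric, (J_a, J_b, J_c))
print(log_spectral_distance(M_a, M_b), log_spectral_distance(M_a, M_c))  # small vs. large
```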
Results
The proposed framework successfully recovers corresponding layers across independently trained artificial neural networks and supports the use of transferable class-conditional probes. It also reveals controlled dissociations between standard and robust training methods and uncovers stimulus-coordinate family effects in mouse visual cortex data from the Allen Brain Observatory.
Implications
This framework has the potential to enhance our understanding of neural representations in both artificial and biological systems, providing insights into how different models process information and respond to perturbations. It could be applied in various fields, including computational neuroscience, machine learning model evaluation, and the development of more robust neural network architectures.
Pretrained Model Representations as Acquisition Signals for Active Learning of MLIPs
Efficient ML
- Pretrained model representations can serve as effective acquisition signals for active learning in MLIPs.
- The proposed finite-width NTK and activation kernel outperform traditional acquisition methods.
- Using pretrained models reduces the data required for training MLIPs by roughly 38% for energy targets and 28% for force targets on average.
- The latent space of pretrained models preserves chemically meaningful structures.
Read more
Pretrained Model Representations as Acquisition Signals for Active Learning of MLIPs
Summary
This paper addresses the challenges in training machine learning interatomic potentials (MLIPs) for reactive chemistry, particularly the high costs associated with quantum chemical labels and the scarcity of transition state configurations. The authors explore the potential of using pretrained model representations as acquisition signals in active learning (AL) to enhance the efficiency of MLIP training. They propose two novel acquisition signals derived from a pretrained MACE potential: a finite-width neural tangent kernel (NTK) and an activation kernel based on hidden latent space features. The study evaluates these methods against traditional approaches, including fixed-descriptor baselines and committee disagreement strategies, using benchmarks from reactive chemistry. The findings indicate that the pretrained model-based kernels significantly outperform existing methods, reducing the data required to achieve performance targets by an average of 38% for energy error and 28% for force error. The results suggest that pretrained models can effectively encode uncertainty-relevant structures, making them valuable for active learning in MLIP fine-tuning.
Methodology
The authors introduce two acquisition signals derived from a pretrained MACE potential: a finite-width neural tangent kernel (NTK) and an activation kernel based on hidden latent space features. They conduct evaluations primarily on the Transition1x dataset, comparing the performance of these methods against fixed-descriptor baselines, committee disagreement, and random acquisition strategies in a pool-based active learning setting.
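As a hedged illustration of using pretrained-feature kernels as acquisition signals, the sketch below builds an activation kernel from per-structure features and greedily acquires the pool points least covered by the labeled set under the kernel-induced distance. The features are random stand-ins, and the greedy max-min rule is a generic diversity heuristic rather than the paper's NTK- or activation-kernel criterion.

```python
import numpy as np

def activation_kernel(feats: np.ndarray) -> np.ndarray:
    """Kernel from a pretrained model's hidden features: K[i, j] = phi(x_i) . phi(x_j)."""
    return feats @ feats.T

def greedy_acquire(K: np.ndarray, labeled: list[int], budget: int) -> list[int]:
    """Greedy max-min acquisition: repeatedly pick the pool point least covered by
    the current labeled set, measured with the kernel-induced squared distance."""
    n = K.shape[0]
    d2 = np.full(n, np.inf)
    for i in labeled:
        d2 = np.minimum(d2, np.diag(K) - 2 * K[:, i] + K[i, i])
    picks = []
    for _ in range(budget):
        j = int(np.argmax(d2))                 # most "novel" configuration under the kernel
        picks.append(j)
        d2 = np.minimum(d2, np.diag(K) - 2 * K[:, j] + K[j, j])
    return picks

# Toy pool: pretend these are per-structure features from a pretrained MACE-like model
rng = np.random.default_rng(0)
pool_feats = rng.normal(size=(1000, 64))
K = activation_kernel(pool_feats)
print(greedy_acquire(K, labeled=[0, 1, 2], budget=5))
```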
Results
The pretrained model-based kernels consistently outperformed all other methods evaluated, achieving reductions in the number of acquisition rounds needed to reach performance targets by an average of 38.1% for energy RMSE and 28.3% for force RMSE. The study also demonstrated that the pretrained kernels provided better residual interpolation and uncertainty calibration compared to randomly initialized kernels and fixed descriptors.
Implications
The findings suggest that leveraging pretrained model representations can significantly enhance the efficiency of active learning in MLIP training, potentially leading to faster and more accurate simulations in reactive chemistry. This approach may also pave the way for further research into the use of pretrained models in other areas of machine learning and computational chemistry.
Most ReLU Networks Admit Identifiable Parameters
Theory
- Introduces a unified framework using weighted polyhedral complexes to study parameter identifiability.
- Establishes that most ReLU architectures with sufficient width have identifiable parameters.
- Settles the functional dimension for nearly all ReLU architectures as the number of parameters minus the number of hidden neurons.
- Demonstrates that minimal architectures can still have non-trivial parameter redundancies.
Read more
Most ReLU Networks Admit Identifiable Parameters
Summary
This paper investigates the identifiability of parameters in deep ReLU networks, focusing on the realization map that connects network parameters to the functions they compute. The authors introduce a framework based on weighted polyhedral complexes to analyze hidden redundancies beyond trivial symmetries like scaling and permutation. They demonstrate that for any architecture with input and hidden layers of width at least two, there exists an open set of identifiable parameters, establishing that the functional dimension equals the number of parameters minus the number of hidden neurons. The study also reveals that minimal functional representations can still exhibit non-trivial parameter redundancies and introduces a generic depth hierarchy, indicating that certain functions cannot be represented by shallower networks. Overall, the findings provide a comprehensive understanding of parameter identifiability and functional dimension in deep ReLU networks.
Methodology
The authors develop a unified framework based on weighted polyhedral complexes to analyze the realization map of ReLU networks. They investigate the structure of parameter fibers and the functional dimension by examining the Jacobian of the realization map and its rank over finite input sets.
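The rank computation can be probed numerically on a tiny network, assuming a generic (random Gaussian) parameter vector and PyTorch autograd: evaluate the realization map on a batch of probe inputs, take its Jacobian with respect to the flattened parameters, and compare the numerical rank to the number of parameters minus the number of hidden neurons. This is an illustrative check, not the authors' construction.

```python
import torch

torch.manual_seed(0)
widths = [2, 3, 3, 1]                               # input and hidden widths >= 2, one output
n_hidden = sum(widths[1:-1])

# Flat parameter layout: (W1, b1, W2, b2, W3, b3)
shapes = []
for d_in, d_out in zip(widths[:-1], widths[1:]):
    shapes += [(d_out, d_in), (d_out,)]
sizes = [int(torch.Size(s).numel()) for s in shapes]
n_params = sum(sizes)

X = torch.randn(200, widths[0])                     # fixed probe inputs

def realization(theta: torch.Tensor) -> torch.Tensor:
    """Network outputs on the probe inputs, as a function of the flat parameter vector."""
    chunks = list(torch.split(theta, sizes))
    h = X
    for i in range(0, len(chunks), 2):
        W = chunks[i].reshape(shapes[i])
        b = chunks[i + 1]
        h = h @ W.T + b
        if i < len(chunks) - 2:                     # ReLU on hidden layers only
            h = torch.relu(h)
    return h.flatten()

theta0 = torch.randn(n_params)
J = torch.autograd.functional.jacobian(realization, theta0)   # shape (200, n_params)
rank = torch.linalg.matrix_rank(J, atol=1e-5).item()
print(f"params = {n_params}, hidden neurons = {n_hidden}, Jacobian rank = {rank}")
# Per the paper's result, the generic value should be n_params - n_hidden (here 25 - 6 = 19).
```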
Results
The main results indicate that for deep ReLU networks with input and hidden layers of width at least two, there exists an open set of identifiable parameters. The functional dimension is shown to be the number of parameters minus the number of hidden neurons. Additionally, the authors construct examples of minimal parameters that exhibit non-trivial redundancies and establish a depth hierarchy for generic parameters.
Implications
These findings have significant implications for understanding the structure of deep learning models, particularly in optimizing architectures for better performance and interpretability. The results can inform the design of neural networks by clarifying the relationship between architecture, parameter identifiability, and functional capacity.
Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting
Time Series
- A single-layer transformer can match the performance of deeper models in time series forecasting.
- Expanding the dictionary size in sparse autoencoders yields minimal changes in forecasting performance.
- Targeted interventions on latent features produce negligible forecast perturbations.
- Superposition is not required for competitive performance in time series forecasting.
Read more
Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting
Summary
This paper investigates the internal representations of transformer models, specifically PatchTST, in the context of time series forecasting. The author questions whether the powerful representational mechanisms of transformers in NLP, particularly superposition, are relevant for time series data. By employing sparse autoencoders (SAEs) to analyze the model's intermediate representations, the study finds that a single-layer, narrow-dimensional transformer can achieve competitive forecasting performance comparable to deeper models across various benchmarks. The analysis reveals that expanding the dictionary size of the SAEs results in negligible performance changes, indicating that the representations are sparse and stable. Furthermore, targeted interventions on dominant latent features have minimal impact on forecasting accuracy. The findings suggest that superposition is not necessary for effective time series forecasting, providing a mechanistic explanation for the competitiveness of simpler linear models like DLinear. This challenges the assumption that complex feature composition is essential for success in time series tasks.
Methodology
The study uses sparse autoencoders (SAEs) to analyze the internal representations of the PatchTST transformer model. It compares the performance of a single-layer transformer against deeper configurations and examines the effects of dictionary size expansion and targeted causal interventions on latent features.
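A minimal sparse-autoencoder sketch of the kind used for this sort of analysis, assuming PyTorch and random stand-in activations in place of PatchTST's intermediate representations; the dictionary size, sparsity penalty, and training schedule are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE for probing intermediate activations: overcomplete dictionary + L1 sparsity."""
    def __init__(self, d_model: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))        # non-negative sparse codes
        return self.decoder(z), z

def train_sae(acts: torch.Tensor, dict_size: int, l1: float = 1e-3, steps: int = 2000):
    """Fit an SAE to a tensor of (n_tokens, d_model) activations collected from the model."""
    sae = SparseAutoencoder(acts.shape[1], dict_size)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(steps):
        idx = torch.randint(0, acts.shape[0], (256,))
        recon, z = sae(acts[idx])
        loss = ((recon - acts[idx]) ** 2).mean() + l1 * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# Stand-in activations; in the paper these would come from PatchTST's intermediate layers
acts = torch.randn(10_000, 128)
sae = train_sae(acts, dict_size=128 * 4)       # 4x overcomplete dictionary
_, codes = sae(acts[:512])
print("mean active features per token:", (codes > 1e-6).float().sum(dim=1).mean().item())
```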
Results
The analysis shows that a single-layer PatchTST achieves competitive forecasting performance across benchmarks. Expanding the dictionary size of SAEs does not significantly affect performance, and interventions on latent features have minimal impact on forecasts, indicating that the representations are sparse and stable.
Implications
The results suggest that time series forecasting does not require the complex representational capabilities of transformers as seen in NLP. This could lead to a reevaluation of the necessity for complex architectures in time series tasks and support the continued use of simpler models like DLinear.
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
Efficient ML
NLP
Large Language Models
- FAAST enables forward-only associative adaptation, avoiding backpropagation and iterative updates.
- The method constructs fast weights in closed form, enabling constant-cost inference without retaining an explicit memory of examples.
- FAAST achieves comparable or superior performance to backpropagation-based methods while drastically reducing adaptation time and memory usage.
- The approach is modular and can be integrated into existing neural networks, including large language models.
Read more
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
Summary
The paper introduces FAAST, a novel forward-only associative adaptation method designed to enhance the efficiency of adapting pretrained models without the high costs associated with backpropagation or memory-based learning. FAAST compiles labeled examples into fast weights through a single-pass analytical approach, enabling constant-time inference and decoupling task adaptation from the pretrained representations. This method reduces adaptation time by over 90% compared to backpropagation and achieves competitive performance against memory/context-based adaptation while saving up to 95% in memory usage. The authors demonstrate FAAST's effectiveness across various benchmarks in image classification and language modeling, showcasing its potential as a scalable solution for supervised task adaptation, particularly in resource-constrained environments.
Methodology
FAAST employs a forward-only associative learning framework that compiles key-value pairs from labeled examples into fast weights using a closed-form solution derived from linear regression. This approach allows for a single-pass learning process, eliminating the need for gradient descent and iterative updates, and enabling constant-cost inference.
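The closed-form step the summary describes is essentially a ridge-regression solve over key-value pairs, sketched below with frozen features as keys and one-hot labels as values. How FAAST constructs its keys and values and where the fast weights attach inside the pretrained network are not reproduced; this only shows the single-pass analytical solve.

```python
import numpy as np

def compile_fast_weights(K: np.ndarray, V: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Closed-form 'fast weights' mapping keys to values via ridge regression:
    W = (K^T K + lam I)^(-1) K^T V, computed in a single pass over the labeled examples."""
    d = K.shape[1]
    return np.linalg.solve(K.T @ K + lam * np.eye(d), K.T @ V)

# Toy test-time adaptation: keys are frozen backbone features, values are one-hot labels
rng = np.random.default_rng(0)
d_feat, n_classes, n_support = 64, 5, 100
W_true = rng.normal(size=(d_feat, n_classes))
K = rng.normal(size=(n_support, d_feat))                        # support-set features
V = np.eye(n_classes)[np.argmax(K @ W_true, axis=1)]            # one-hot labels
W_fast = compile_fast_weights(K, V)                             # no gradients, no iterations

query = rng.normal(size=(10, d_feat))
pred = np.argmax(query @ W_fast, axis=1)                        # constant-cost adapted inference
print("agreement with the generating labels:", np.mean(pred == np.argmax(query @ W_true, axis=1)))
```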
Results
FAAST matches or exceeds the performance of backpropagation-based adaptation in image classification tasks while reducing adaptation time by over 90%. In language modeling, it allows small models like GPT-2 to adapt at test time with over 93% savings in training and inference costs compared to memory/context-based methods. Overall, FAAST consistently outperforms zero-shot and few-shot baselines in various natural language processing tasks.
Implications
FAAST presents a highly efficient and scalable method for adapting pretrained models, making it particularly suitable for scenarios with limited computational resources. Its modular nature allows for easy integration into existing architectures, potentially enhancing the adaptability of large language models and other neural networks in real-time applications.