AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
38
Papers today
8h
Update frequency
7
Days of history
Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach
Time Series
Efficient ML
Audio & Speech
- R-DCNN offers a low-complexity solution for denoising periodic signals.
- The method requires only a single observation for training, enabling efficient generalization to other signals.
- R-DCNN achieves performance comparable to classical autoregressive methods and conventional DCNNs.
- The approach is particularly suited for resource-constrained environments like IoT devices.
Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach
Summary
This paper presents a novel approach for denoising periodic signals using a Dilated Convolutional Neural Network (DCNN) combined with a resampling technique, termed R-DCNN. The proposed method is designed for environments with strict computational and power constraints, making it suitable for applications in IoT devices and other low-power scenarios. Unlike traditional deep learning methods that require extensive computational resources and separate training for each signal observation, R-DCNN is trained using a single observation and can generalize to other signals through a lightweight resampling step that aligns time scales. This allows the same network weights to be reused across different signals with varying fundamental frequencies. The experiments conducted demonstrate that R-DCNN achieves performance comparable to state-of-the-art classical methods, such as autoregressive techniques, while significantly reducing computational complexity. The findings suggest that R-DCNN is an efficient alternative for periodic signal processing tasks, maintaining high accuracy in denoising and waveform estimation without the need for retraining on new observations.
Methodology
The methodology involves using a one-dimensional Dilated Convolutional Neural Network (DCNN) combined with a resampling technique. The model is trained on a single observation, and during inference, it applies a resampling step to align the time scales of different signals, allowing the reuse of the same network weights without retraining.
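The resampling-plus-dilated-convolution idea can be sketched in a few lines. The linear resampler and single convolution layer below are illustrative stand-ins, not the paper's exact network or interpolation scheme:

```python
import math

def resample_linear(x, new_len):
    """Linearly resample a signal to new_len samples. Aligning all signals
    to a canonical period length is what lets one set of network weights
    serve signals with different fundamental frequencies."""
    n = len(x)
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * x[lo] + frac * x[hi])
    return out

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1D convolution with a dilation factor, the building
    block a DCNN stacks to grow its receptive field cheaply."""
    k = len(kernel)
    half = (k - 1) * dilation // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(kernel[j] * padded[i + j * dilation] for j in range(k))
            for i in range(len(x))]
```

At inference, a new signal would be resampled to the training time scale, denoised by the fixed network, and resampled back.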
Results
The experimental results indicate that R-DCNN provides high accuracy in denoising periodic signals while significantly reducing computational complexity compared to existing methods, including deep learning DCNNs and classical autoregressive techniques.
Implications
The proposed R-DCNN method has significant implications for applications in various fields such as speech processing, medical diagnostics, and sonar, particularly in scenarios where computational resources are limited. Its efficiency makes it suitable for deployment in edge devices and IoT applications.
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
NLP
Large Language Models
Efficient ML
- Introduces a novel method for long-context modeling in LLMs using gist compression tokens.
- Proposes selective unfolding via Gist Sparse Attention (GSA) to enhance attention efficiency.
- Demonstrates significant performance improvements over existing compression and sparse attention methods.
- Enables multi-resolution context access with logarithmic complexity through recursive gist construction.
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
Summary
This paper addresses the challenge of scaling large language models (LLMs) to handle long contexts, which is hindered by the quadratic computational cost of full attention mechanisms. The authors propose a novel approach that integrates gist compression tokens as routing signals for sparse attention, allowing for an efficient and learnable method of context processing without requiring architectural changes. The proposed method, termed selective unfolding via Gist Sparse Attention (GSA), first compresses the input context into gist tokens, selects the most relevant gists based on their attention scores, and then restores the corresponding raw tokens for detailed attention. This coarse-to-fine mechanism effectively combines global representations with targeted access to fine-grained information. The authors also introduce a hierarchical extension of their framework, enabling multi-resolution context access with logarithmic decoding complexity. Empirical evaluations on LongBench and RAG benchmarks demonstrate that GSA consistently outperforms existing compression methods and inference-time sparse attention techniques across various compression ratios, showcasing its effectiveness in improving model performance while reducing computational costs.
Methodology
The authors developed an end-to-end learnable framework that utilizes interleaved gist compression tokens to summarize raw tokens. The method involves compressing the context into gist tokens, selecting relevant gists based on attention scores, and restoring raw tokens for detailed attention. This process is integrated directly into the training phase, allowing for optimization without external modules.
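The compress-route-unfold loop can be sketched with mean pooling standing in for the learned gist tokens, an illustrative simplification of the paper's end-to-end trained compression:

```python
import math

def gsa_select(tokens, query, block_size=4, top_k=2):
    """Coarse-to-fine sketch of selective unfolding: block summaries stand
    in for gist tokens; the top-scoring gists have their raw tokens
    restored for fine-grained attention."""
    # 1. Compress: one gist vector per contiguous block of raw tokens.
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    gists = [[sum(dim) / len(b) for dim in zip(*b)] for b in blocks]
    # 2. Route: score each gist against the query via softmax attention.
    scores = [sum(q * g for q, g in zip(query, gist)) for gist in gists]
    exps = [math.exp(s - max(scores)) for s in scores]
    attn = [e / sum(exps) for e in exps]
    # 3. Unfold: keep raw tokens only for the top-k gists.
    chosen = sorted(
        sorted(range(len(gists)), key=lambda i: attn[i], reverse=True)[:top_k])
    return chosen, [tok for i in chosen for tok in blocks[i]]
```

The hierarchical extension would apply this recursively, giving the logarithmic decoding complexity mentioned above.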
Results
The empirical results indicate that the proposed GSA method outperforms various compression baselines and inference-time sparse attention methods across compression ratios ranging from 8× to 32×. The evaluations on LongBench and RAG benchmarks highlight the effectiveness of GSA in improving model accuracy and efficiency.
Implications
The findings suggest that GSA can significantly enhance the performance of large language models in applications requiring long-context understanding, such as in-depth reasoning, code generation, and multi-turn interactions. The approach may lead to more efficient training and inference processes in real-world applications.
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Computer Vision
Large Language Models
Efficient ML
- Identifies sink tokens as a critical obstacle to fine-grained video understanding.
- Development of Sink-Token-aware Pruning (SToP) to effectively target and suppress sink tokens.
- SToP significantly improves performance on fine-grained tasks while allowing for substantial token pruning.
- Validation of SToP across diverse benchmarks, including hallucination evaluation and open-ended generation.
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Summary
This paper addresses the challenge of high inference latency in Video Large Language Models (Video LLMs) caused by the large number of visual tokens processed. Existing training-free visual token pruning methods have shown effectiveness in reducing computational costs but are primarily validated on coarse-grained tasks like Multiple-Choice Question Answering (MCQA). The authors identify a performance gap in fine-grained understanding tasks, particularly those requiring precise visual grounding, such as hallucination evaluation. They introduce the concept of 'sink tokens'—tokens that attract excessive attention but provide little semantic information—as a significant obstacle to fine-grained video understanding. To mitigate this issue, the authors propose Sink-Token-aware Pruning (SToP), a method that quantifies the sink tendency of tokens and integrates this score into existing pruning techniques. SToP is validated against state-of-the-art pruning methods and demonstrates significant performance improvements across various benchmarks, even with up to 90% of visual tokens pruned. The findings suggest that SToP enhances the model's ability to maintain fine-grained visual cues, crucial for tasks requiring detailed visual grounding.
Methodology
The authors conducted a systematic analysis of existing visual token pruning methods, identifying the impact of sink tokens on model performance. They proposed SToP, which calculates a sink score for each token to prioritize the pruning of sink tokens. This method was integrated with existing spatial and temporal pruning techniques and evaluated on various benchmarks to assess its effectiveness.
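The summary does not give SToP's exact sink score; a minimal hypothetical version scores each token by the attention mass it receives and prunes the heaviest sinks first:

```python
def sink_prune(attn, keep_ratio=0.5):
    """Hypothetical sink-score pruning: a token's sink score is the mean
    attention it receives (its column mass in the attention matrix).
    Tokens that soak up attention are pruned first; SToP's actual score
    and its integration with spatial/temporal pruning may differ."""
    n = len(attn)
    sink = [sum(attn[i][j] for i in range(n)) / n for j in range(n)]
    keep = max(1, int(n * keep_ratio))
    # Keep the tokens with the *lowest* sink scores.
    return sorted(sorted(range(n), key=lambda j: sink[j])[:keep])
```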
Results
The implementation of SToP led to significant performance boosts in fine-grained video understanding tasks, with improvements observed even when pruning up to 90% of visual tokens. The method outperformed existing pruning techniques, particularly in tasks sensitive to visual grounding, such as hallucination evaluation.
Implications
The findings suggest that SToP can enhance the efficiency of Video LLMs while maintaining their performance on complex tasks requiring fine-grained visual understanding. This has potential applications in real-world scenarios where detailed video analysis is necessary, such as in video summarization, content moderation, and interactive video applications.
Decoupled Travel Planning with Behavior Forest
NLP
Large Language Models
Optimization
- Introduces the Behavior Forest method to decouple travel planning tasks.
- Structures decision-making into parallel behavior trees for modular planning.
- Integrates LLMs for localized reasoning within behavior tree nodes.
- Demonstrates significant performance improvements over existing methods.
Decoupled Travel Planning with Behavior Forest
Summary
The paper introduces the Behavior Forest method for multi-constraint travel planning, addressing the limitations of existing approaches that entangle local and global constraints within a single decision space. Traditional methods struggle with the complexity of interdependent constraints, leading to inefficiencies in planning. The proposed Behavior Forest organizes the decision-making process into a forest of parallel behavior trees, where each tree focuses on a specific subtask. A global coordination mechanism is implemented to manage interactions among these trees, allowing for modular and coherent planning. Large language models (LLMs) are integrated as decision engines within the behavior tree nodes, enabling localized reasoning based on task-specific constraints. This decoupling of tasks and constraints enhances the efficiency of the planning process, reducing cognitive load on the LLM. Experimental results demonstrate that the Behavior Forest outperforms state-of-the-art methods by 6.67% on the TravelPlanner benchmark and by 11.82% at medium-level difficulty on the ChinaTravel benchmark, showcasing its effectiveness in complex multi-constraint travel planning.
Methodology
The Behavior Forest method organizes planning into a forest of parallel behavior trees, each responsible for a subtask. A global coordination mechanism orchestrates interactions among these trees. LLMs are embedded within the behavior tree nodes to perform localized reasoning and generate candidate subplans based on task-specific constraints.
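A minimal behavior-tree sketch of the forest structure (hypothetical node types; in the paper, leaf actions would invoke an LLM for localized reasoning, and the coordinator would also resolve cross-tree constraints):

```python
class Action:
    """Leaf node; fn is where an LLM call would perform localized reasoning."""
    def __init__(self, fn):
        self.fn = fn

    def tick(self, state):
        return self.fn(state)

class Sequence:
    """Behavior-tree sequence node: succeeds only if every child succeeds."""
    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        return all(child.tick(state) for child in self.children)

def behavior_forest(trees, state):
    """Tick each subtask's tree against a shared state and report which
    subtasks produced a valid subplan."""
    return {name: tree.tick(state) for name, tree in trees.items()}
```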
Results
The proposed method achieved a 6.67% improvement over state-of-the-art methods on the TravelPlanner benchmark and an 11.82% improvement at medium-level difficulty on the ChinaTravel benchmark, indicating enhanced performance in handling complex multi-constraint travel planning.
Implications
The Behavior Forest framework has potential applications in various multi-constraint planning scenarios beyond travel, such as logistics, resource management, and any domain requiring complex decision-making with interdependent constraints.
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
Large Language Models
Time Series
Interpretability
- Introduces a novel LLM-guided framework for sepsis early warning that enhances clinical interpretability.
- Combines spatiotemporal feature extraction with medical prompt engineering to improve prediction accuracy.
- Achieves superior AUC scores compared to traditional models, demonstrating effectiveness in pre-onset prediction tasks.
- Provides interpretable physiological trajectories that support clinical decision-making.
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
Summary
This paper addresses the critical challenge of timely and interpretable early warning for sepsis, a life-threatening condition characterized by organ dysfunction due to infection. Traditional data-driven models often yield accurate predictions but lack interpretability, which can hinder clinical decision-making. The authors propose a novel framework that utilizes a Large Language Model (LLM) to simulate physiological dynamics leading up to sepsis onset. The framework comprises three main components: a spatiotemporal feature extraction module that captures dynamic dependencies among vital signs, a Medical Prompt-as-Prefix module that incorporates clinical reasoning into the LLM, and an agent-based post-processing component that ensures predictions remain within physiologically plausible ranges. By simulating the evolution of key physiological indicators and subsequently classifying sepsis onset, the model provides transparent predictions that align with clinical judgment. Evaluated on the MIMIC-IV and eICU databases, the proposed method demonstrates superior performance with AUC scores ranging from 0.861 to 0.903 across various pre-onset prediction tasks, outperforming conventional deep learning and rule-based approaches. Importantly, the model offers interpretable trajectories and risk trends that can aid clinicians in early intervention and personalized decision-making in intensive care settings.
Methodology
The methodology involves a three-part framework: (1) a spatiotemporal feature extraction module to capture dynamic relationships among multivariate vital signs, (2) a Medical Prompt-as-Prefix module to embed clinical reasoning into the LLM, and (3) an agent-based post-processing component to ensure predictions are physiologically plausible. The model first simulates physiological indicators before classifying sepsis onset, enhancing interpretability.
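The agent-based post-processing step amounts to projecting simulated vitals back into physiologically plausible ranges; a sketch (the ranges below are illustrative, not taken from the paper):

```python
def clamp_physiological(preds, plausible):
    """Project each simulated physiological indicator into its plausible
    [low, high] range, in the spirit of the agent-based post-processing
    component."""
    return {name: min(max(value, plausible[name][0]), plausible[name][1])
            for name, value in preds.items()}
```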
Results
The proposed method achieved AUC scores between 0.861 and 0.903 across various pre-onset prediction tasks, significantly outperforming traditional deep learning and rule-based approaches. The model also provided interpretable trajectories and risk trends, which are valuable for clinical decision-making.
Implications
The framework has the potential to improve early intervention strategies for sepsis in intensive care settings, enabling personalized decision-making based on interpretable predictions. It can enhance clinician confidence in using predictive models for patient care.
Low-Rank Adaptation Redux for Large Models
Large Language Models
Optimization
Efficient ML
- LoRA is a leading method for parameter-efficient fine-tuning of large models, significantly reducing computational and memory costs.
- The paper categorizes advancements in LoRA into architectural design, optimization techniques, and applications across the model lifecycle.
- Signal processing principles can enhance the understanding and development of LoRA methods, offering a structured approach to fine-tuning.
- Emerging applications of LoRA extend beyond fine-tuning to include pre-training and deployment strategies.
Low-Rank Adaptation Redux for Large Models
Summary
This paper revisits Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method for large models, through the lens of signal processing (SP). LoRA has gained prominence due to its ability to adapt billion-parameter networks with minimal computational and memory overhead. The authors categorize recent advancements in LoRA into three axes: architectural design, efficient optimization, and applications. They discuss architectural innovations such as singular value decomposition (SVD)-based factorization and cross-layer tensorization, alongside optimization techniques like gauge-invariant optimization. The paper emphasizes that while LoRA has been widely adopted for fine-tuning, its principles can also inform pre-training and deployment strategies. By bridging SP methodologies with deep learning, the authors aim to provide a principled framework for designing efficient LoRA methods and outline future research directions that could benefit both fields.
Methodology
The authors review and categorize LoRA advancements based on architectural designs and optimization techniques, leveraging signal processing tools such as singular value decomposition (SVD) and matrix/tensor decompositions to inform their analysis.
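The core LoRA computation the survey builds on is standard and fits in a few lines (pure-Python matrices for self-containment; the naming of the two low-rank factors varies across papers):

```python
def matmul(X, Y):
    """Plain matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA forward pass: y = x(W + (alpha/r) * A @ B), computed as the
    frozen path xW plus a rank-r bypass (xA)B, so the merged weight is
    never materialized. A is d-by-r, B is r-by-k."""
    r = len(A[0])
    base = matmul(x, W)
    low = matmul(matmul(x, A), B)
    return [[b + (alpha / r) * l for b, l in zip(brow, lrow)]
            for brow, lrow in zip(base, low)]
```

Only A and B are trained, which is where the parameter savings come from; SVD-based variants discussed in the paper change how the factors are initialized and constrained.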
Results
The paper does not present empirical results but provides a comprehensive overview of the theoretical underpinnings and advancements in LoRA, suggesting that these methods can be effectively applied across various stages of model development and deployment.
Implications
The findings suggest that integrating signal processing techniques into the design of LoRA methods can lead to more efficient fine-tuning strategies for large models, potentially making advanced AI capabilities more accessible to a broader range of users and applications.
Transferable SCF-Acceleration through Solver-Aligned Initialization Learning
Optimization
Efficient ML
Theory
- SAIL replaces ground-state supervision with solver-aligned training for SCF initialization, improving convergence on larger molecules.
- The introduction of the Effective Relative Iteration Count (ERIC) provides a more accurate measure of convergence efficiency.
- SAIL achieves significant reductions in ERIC, outperforming previous state-of-the-art methods.
- The method is applicable to both Hamiltonian and density matrix models, enhancing their performance in practical scenarios.
Transferable SCF-Acceleration through Solver-Aligned Initialization Learning
Summary
This paper addresses the computational challenges associated with Kohn-Sham density functional theory (KS-DFT) calculations, particularly the dependency of solver iterations on the quality of initial guesses. Traditional machine learning methods that predict these initial guesses often fail when applied to larger molecules, leading to slower convergence. The authors propose a novel approach called Solver-Aligned Initialization Learning (SAIL), which differentiates through the self-consistent field (SCF) solver end-to-end, allowing for training on solver dynamics rather than ground-state references. This method is shown to significantly improve the performance of both Hamiltonian and density matrix models, particularly in extrapolating to larger molecular sizes. The authors introduce the Effective Relative Iteration Count (ERIC) as a new metric for assessing convergence efficiency, demonstrating that SAIL reduces ERIC by substantial margins across various functionals and molecular sizes. The results indicate that SAIL not only accelerates SCF calculations but also extends the applicability of machine learning in computational chemistry, particularly for larger drug-like molecules.
Methodology
The authors developed Solver-Aligned Initialization Learning (SAIL), which involves backpropagating through the SCF algorithm to optimize the initial guess for SCF calculations. This approach is label-free and relies solely on molecular geometries, addressing the limitations of traditional ground-state target training. The authors also introduced the Effective Relative Iteration Count (ERIC) to evaluate the performance of their method in terms of convergence speed.
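The summary does not give ERIC's exact formula; one plausible reading, ML-initialized SCF iterations relative to a conventional initial guess, averaged over molecules, looks like:

```python
def eric(iters_ml, iters_baseline):
    """Illustrative 'effective relative iteration count': the per-molecule
    ratio of SCF iterations with the learned initial guess to iterations
    with a standard guess, averaged over the test set. The paper's exact
    definition may normalize or weight iterations differently."""
    ratios = [m / b for m, b in zip(iters_ml, iters_baseline)]
    return sum(ratios) / len(ratios)
```

Under this reading, the reported 37% ERIC reduction for PBE means the learned guess cuts roughly a third of the solver iterations a standard guess would need.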
Results
SAIL demonstrated a reduction in ERIC by 37% for PBE, 33% for SCAN, and 27% for B3LYP on the QM40 dataset, which includes molecules up to four times larger than those seen during training. Additionally, on QMugs molecules that are ten times larger than the training size, SAIL achieved a 1.25× speedup in wall-time at the hybrid level of theory, significantly enhancing the efficiency of SCF calculations.
Implications
The findings suggest that SAIL can be a transformative approach in computational chemistry, allowing for faster and more efficient SCF calculations without sacrificing accuracy. This could lead to more rapid scientific discoveries and advancements in fields that rely on electronic structure methods, such as drug discovery and materials science.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
Large Language Models
Reinforcement Learning
NLP
- CAP is the first end-to-end trained prompt-driven unlearning framework for LLMs.
- The framework utilizes reinforcement learning to optimize prompts for targeted knowledge suppression.
- CAP achieves precise unlearning without the need for model parameter updates.
- Extensive experiments show CAP outperforms existing methods in forgetting rate and retention accuracy.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
Summary
The paper introduces the Controllable Alignment Prompting (CAP) framework, which addresses the challenges of unlearning sensitive information in large language models (LLMs) without modifying their parameters. Current unlearning methods are often computationally expensive and impractical for closed-source models. CAP proposes a novel approach that decouples unlearning into a learnable prompt optimization process using reinforcement learning. This allows for the generation of prompts that suppress specific knowledge while maintaining the model's overall capabilities. The framework enables reversible knowledge restoration through prompt revocation and demonstrates significant improvements over existing methods in terms of forgetting rate and retention accuracy across various LLMs. CAP's design ensures strong generalizability and controllability, making it a viable solution for privacy compliance and ethical safety in LLM applications.
Methodology
The CAP framework formulates unlearning as an inference-time control problem, employing a lightweight small language model (SLM) to generate input-conditioned control prefixes. These prefixes guide a frozen LLM to suppress targeted knowledge. The SLM is optimized through reinforcement learning, allowing for effective prompt generation based on direct downstream feedback.
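The inference-time control loop reduces to a small sketch (hypothetical slm/llm callables standing in for the trained prefix generator and the frozen model):

```python
def cap_generate(query, slm, llm, unlearn=True):
    """Inference-time unlearning sketch: a small model emits a control
    prefix that steers a frozen LLM away from targeted knowledge.
    Omitting the prefix ('prompt revocation') restores the original
    behavior, since the LLM's weights are never touched."""
    prefix = slm(query) if unlearn else ""
    return llm(prefix + query)
```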
Results
Experiments conducted on various LLMs, including LLaMA and GPT, demonstrate that CAP significantly improves both the forgetting rate of sensitive information and the retention accuracy of general knowledge, outperforming baseline methods. The framework's ability to transfer seamlessly across different models highlights its effectiveness and versatility.
Implications
The CAP framework offers a practical solution for implementing selective knowledge unlearning in LLMs, addressing regulatory and ethical concerns related to sensitive information retention. Its non-invasive nature makes it suitable for commercial applications where model weights are inaccessible, thus broadening the scope of LLM deployment in privacy-sensitive environments.
Differentially Private Model Merging
Theory
Efficient ML
Federated Learning
- Introduces two data-independent algorithms for merging private models: random selection and linear combination.
- Provides tailored privacy accounting using Rényi differential privacy and privacy loss distributions.
- Demonstrates the theoretical superiority of linear combination over random selection in terms of privacy/utility trade-off.
- Validates the proposed methods through empirical evaluations on various datasets.
Differentially Private Model Merging
Summary
This paper addresses the challenge of adapting machine learning models to varying differential privacy (DP) requirements during inference or deployment. The authors propose a novel approach to merge existing models trained on the same dataset with different privacy/utility trade-offs without requiring additional training. They introduce two post-processing techniques: random selection (RS) and linear combination (LC), which allow for the efficient generation of models that meet specific privacy constraints. The paper provides a comprehensive privacy accounting framework based on Rényi differential privacy and privacy loss distributions, demonstrating the theoretical superiority of the linear combination method over random selection. A case study on private mean estimation illustrates the effectiveness of the proposed methods, and empirical evaluations on synthetic and real-world datasets confirm that both RS and LC improve the privacy/utility trade-offs compared to naive privacy accounting methods.
Methodology
The authors propose two post-processing techniques for merging models: random selection (RS), which randomly outputs one of the models based on a probability distribution, and linear combination (LC), which deterministically combines models using a weighted sum. They develop algorithms to optimize the mixing coefficients for both methods to enhance utility while adhering to privacy constraints. Privacy accounting is performed using Rényi differential privacy and privacy loss distributions.
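Both post-processing techniques are simple to state. The sketch below works over flat parameter vectors; choosing the probabilities and coefficients so the merged model meets a target DP budget is the paper's contribution and is omitted here:

```python
import random

def random_selection(models, probs, rng=random.Random(0)):
    """RS: output one of the pretrained models, drawn according to probs.
    As pure post-processing of DP-trained models, this cannot degrade
    their privacy guarantees."""
    return rng.choices(models, weights=probs, k=1)[0]

def linear_combination(models, coeffs):
    """LC: deterministically merge parameter vectors with a weighted sum,
    coefficients chosen offline to meet the target privacy constraint."""
    return [sum(c * m[i] for c, m in zip(coeffs, models))
            for i in range(len(models[0]))]
```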
Results
The paper establishes that the linear combination method outperforms random selection in terms of privacy and utility trade-offs. Empirical results demonstrate that both merging techniques yield better privacy/utility outcomes compared to naive privacy accounting approaches across synthetic and real-world datasets.
Implications
The proposed model merging techniques can be applied in various machine learning scenarios where privacy requirements are dynamic, allowing for efficient model adaptation without retraining. This has significant implications for industries dealing with sensitive data, such as healthcare and finance, where privacy regulations frequently change.
A Deep U-Net Framework for Flood Hazard Mapping Using Hydraulic Simulations of the Wupper Catchment
Efficient ML
- Development of a deep learning surrogate model for flood prediction using U-Net architecture.
- The model provides a computationally efficient alternative to traditional hydraulic simulations.
- Testing was conducted using hydraulic simulations from the Wupper catchment, yielding comparable results.
- The framework aims to be generalizable across various topographies for broader application.
A Deep U-Net Framework for Flood Hazard Mapping Using Hydraulic Simulations of the Wupper Catchment
Summary
This paper addresses the urgent need for rapid and reliable flood prediction tools in light of increasing flood events globally. Traditional hydraulic simulations, while accurate, are computationally expensive and slow, making them unsuitable for real-time applications. The authors propose a deep-learning-based surrogate model using a U-Net architecture to predict maximum water levels efficiently across a grid based on hydraulic simulations from the Wupper catchment in North-Rhine Westphalia, Germany. The study optimizes the U-Net architecture, patch generation, and data handling to approximate hydraulic models effectively. The results indicate that the deep learning surrogate model can provide comparable accuracy to traditional methods while significantly reducing computation time, thus offering a viable alternative for real-time flood hazard mapping.
Methodology
The authors developed a deep learning surrogate model based on the U-Net architecture, optimizing patch generation and data handling to approximate hydraulic models. The model was trained using hydraulic simulation data from the Wupper catchment, focusing on predicting maximum water levels across a grid.
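Patch generation for surrogate training can be sketched as a sliding window over the catchment grid (illustrative only; the paper's pipeline likely adds overlap handling, masking, and normalization):

```python
def make_patches(grid, patch, stride):
    """Cut a 2D catchment grid into fixed-size patches so a U-Net sees
    uniform inputs regardless of the full domain's extent."""
    H, W = len(grid), len(grid[0])
    return [[row[c:c + patch] for row in grid[r:r + patch]]
            for r in range(0, H - patch + 1, stride)
            for c in range(0, W - patch + 1, stride)]
```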
Results
The deep learning surrogate model demonstrated comparable accuracy to traditional hydraulic simulations while achieving significantly faster computation times. This efficiency makes it suitable for real-time flood prediction and hazard mapping.
Implications
The proposed framework can enhance flood prediction capabilities, enabling quicker responses to flooding events. Its generalizability across different topographies suggests potential applications in various regions facing flood risks, contributing to improved disaster management and mitigation strategies.
Interpretable Quantile Regression by Optimal Decision Trees
Interpretability
- Introduces Quantile DL8.5 (QDL8.5) for optimal quantile regression trees.
- Provides predictions for the complete conditional distribution of a target variable.
- Enhances interpretability and robustness by learning multiple trees for different quantiles.
- Achieves high accuracy with minimal computational overhead compared to single tree learning.
Interpretable Quantile Regression by Optimal Decision Trees
Summary
This paper addresses the growing demand for interpretable and robust machine learning models by introducing a novel method for learning optimal quantile regression trees. The proposed method, named Quantile DL8.5 (QDL8.5), allows for predictions of the complete conditional distribution of a target variable without prior assumptions about its distribution. Unlike traditional decision trees that predict a single mean value, QDL8.5 learns a set of decision trees, each corresponding to different quantiles, thus providing a more comprehensive understanding of the target distribution. This approach enhances interpretability and robustness, particularly in applications where under- or overestimation is crucial, such as retail demand forecasting. The authors demonstrate that QDL8.5 achieves high accuracy and interpretability with minimal computational overhead compared to learning a single tree. The paper contributes to the field by extending the DL8.5 algorithm to perform quantile regression efficiently and providing a robust assessment of accuracy, execution time, and interpretability.
Methodology
The authors propose an extension of the DL8.5 algorithm to learn a set of optimal decision trees for multiple quantiles. This method allows for efficient exploration of the tree space while limiting the computational time increase associated with learning multiple trees. The approach focuses on quantile regression, enabling the model to predict values corresponding to specified quantiles, thus capturing the entire distribution of the target variable.
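Quantile regression replaces squared error with the pinball (quantile) loss, whose minimizer over a leaf's targets is the q-th conditional quantile rather than the mean; for reference:

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile q: underestimation costs q per unit,
    overestimation costs (1 - q) per unit, so high q penalizes
    underestimates more, pushing predictions toward the upper quantile."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)
```

In retail demand forecasting, for example, a q=0.9 tree would deliberately overstock relative to the median to avoid costly stockouts.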
Results
The QDL8.5 method demonstrates high accuracy in predictions while maintaining interpretability. The authors provide empirical evidence showing that the method can learn multiple quantile regression trees with virtually no additional computational cost compared to learning a single tree. The results indicate that QDL8.5 effectively addresses the challenges of choosing a single quantile and provides robust insights into the target distribution.
Implications
The findings of this paper have significant implications for fields requiring interpretable machine learning models, such as healthcare and business strategy. By providing a clearer understanding of the target variable's distribution, practitioners can make more informed decisions based on the model's predictions. The method's robustness to outliers and ability to model multiple quantiles can enhance trust in AI systems and improve decision-making processes.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Optimization
Theory
Efficient ML
- Introduction of GEM, a family of rational 2N-differentiable activation functions.
- Three variants of GEM are proposed: GEM, E-GEM, and SE-GEM, each addressing different optimization challenges.
- N-ablation study reveals N=1 is optimal for CNNs while N=2 is better for transformers.
- GEM outperforms GELU in specific benchmarks, achieving lower deficits and better performance metrics.
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Summary
This paper introduces a new family of activation functions called Geometric Monomial (GEM), designed to improve the optimization and performance of deep neural networks. The GEM functions are C^{2N}-smooth and follow a log-logistic cumulative distribution function (CDF), providing ReLU-like performance while utilizing purely rational arithmetic. The author presents three variants: the base GEM, E-GEM which allows for ε-parameterization to approximate ReLU, and SE-GEM which is a piecewise variant that mitigates dead neurons. An N-ablation study identifies N=1 as optimal for standard-depth networks, significantly reducing the GELU deficit on CIFAR-100 + ResNet-56. The paper also explores the trade-off between CNNs and transformers, suggesting different optimal values of N for each architecture. The performance of GEM is benchmarked against existing activation functions across various datasets, demonstrating competitive results, particularly in reducing the GELU deficit and achieving lower perplexity in language models.
Methodology
The paper employs a theoretical framework to define the GEM activation functions and conducts an N-ablation study to determine the optimal smoothness parameter N for different neural network architectures. Performance comparisons are made against existing activation functions (ReLU, GELU, Swish) across multiple benchmark datasets including MNIST, CIFAR-10/100, and language models like BERT and GPT-2. Metrics such as test accuracy, validation loss, and perplexity are used to evaluate performance.
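The qualitative shape of such an activation can be sketched in a few lines. The functional form below is only a guess at the family's general flavor (a ReLU-like gate built from purely rational arithmetic, here a log-logistic CDF on the positive half-line); the paper's exact GEM definition, its ε-parameterization, and the SE-GEM piecewise variant are not reproduced here.

```python
import numpy as np

def gem_like(x, N=1):
    """Hypothetical GEM-style activation: x gated by a log-logistic CDF.

    This sketch only illustrates the idea of a ReLU-like activation built
    from purely rational arithmetic (no exp/erf); the paper's actual GEM
    formula may differ.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = x > 0
    # Log-logistic CDF gate F(x) = 1 / (1 + x^(-2N)) for x > 0.
    out[pos] = x[pos] / (1.0 + x[pos] ** (-2 * N))
    return out
```

For large positive inputs the gate saturates toward 1, recovering the identity map like ReLU, while near zero the output vanishes smoothly; the smoothness order is controlled by N.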
Results
GEM achieves significant improvements in performance metrics: on CIFAR-100 + ResNet-56, the GELU deficit is reduced from 6.10% to 0.62% with E-GEM. SE-GEM surpasses GELU on CIFAR-10 + ResNet-56 (92.51% vs 92.44%). In language modeling, GEM achieves the lowest perplexity on GPT-2 (72.57 vs 73.76 for GELU) and the best validation loss on BERT-small (6.656).
Implications
The introduction of GEM activation functions could lead to more efficient training of deep neural networks, particularly in architectures where smoothness and gradient stability are critical. This could enhance performance in various applications, including computer vision and natural language processing.
Validating a Deep Learning Algorithm to Identify Patients with Glaucoma using Systemic Electronic Health Records
Efficient ML
- The GRA model effectively identifies high-risk glaucoma patients using systemic EHR data.
- The model achieved an AUROC of 0.883 and a PPV of 0.657, indicating strong predictive performance.
- Calibration of the model aligns with clinical risk assessments, enhancing its practical utility.
- Fine-tuning the model on local data improved its performance, demonstrating the importance of external validation.
Read more
Validating a Deep Learning Algorithm to Identify Patients with Glaucoma using Systemic Electronic Health Records
Summary
This study investigates the effectiveness of a glaucoma risk assessment (GRA) model trained on national data to identify patients at high risk of glaucoma using only systemic electronic health records (EHR) at an independent institution. The research involved a cross-sectional analysis of 20,636 patients from the Stanford Byers Eye Clinic, with 15% diagnosed with glaucoma. The pretrained GRA model was fine-tuned on the Stanford cohort and evaluated using various patient data inputs, including demographics, diagnoses, medications, and lab results. The best-performing model achieved an area under the receiver operating characteristic curve (AUROC) of 0.883 and a positive predictive value (PPV) of 0.657. The model's calibration aligned with clinical risk, showing that the highest prediction decile had a glaucoma diagnosis rate of 65.7%. Performance improved with additional training layers and data. The findings suggest that an EHR-only GRA model could facilitate scalable pre-screening for glaucoma, potentially enhancing early detection without the need for specialized imaging.
Methodology
The study utilized a cohort of 20,636 patients, extracting EHR data in the OMOP Common Data Model format. The GRA model was fine-tuned on this cohort, employing deep learning techniques, including autoencoders and a 1-dimensional convolutional neural network. Data preprocessing involved standardizing demographic and clinical features, with missing values imputed using multivariate imputation methods. The dataset was split into training, validation, and test sets for model evaluation.
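The headline metrics (AUROC 0.883, top-decile PPV 0.657) can be computed from model scores with plain numpy; the helper names below are ours, and the paper's actual evaluation code is not shown.

```python
import numpy as np

def auroc(y_true, scores):
    # Rank-based AUROC: probability that a random positive case
    # receives a higher score than a random negative case.
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def top_decile_ppv(y_true, scores):
    # PPV among the 10% of patients with the highest predicted risk,
    # matching the "highest prediction decile" reported in the study.
    k = max(1, len(scores) // 10)
    top = np.argsort(scores)[-k:]
    return y_true[top].mean()
```

A top-decile PPV of 0.657 then reads directly as: 65.7% of patients in the riskiest decile carried a glaucoma diagnosis.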
Results
The fine-tuned GRA model achieved an AUROC of 0.883 and a PPV of 0.657, indicating a strong ability to predict glaucoma risk. The highest prediction decile showed a glaucoma diagnosis rate of 65.7%, suggesting effective risk stratification. Performance improved with the addition of trainable layers and more data, demonstrating the model's adaptability.
Implications
The findings suggest that EHR-based glaucoma risk assessment models can enhance early detection and screening efficiency in primary care settings. This approach could reduce the burden of undiagnosed glaucoma and improve patient outcomes by prioritizing those at higher risk for further evaluation.
A temporal deep learning framework for calibration of low-cost air quality sensors
Time Series
- Introduces a deep learning framework for calibrating low-cost air quality sensors using LSTM networks.
- Captures temporal dependencies and environmental effects, improving calibration accuracy over traditional methods.
- Achieves regulatory compliance with significant reductions in uncertainty for calibrated pollutant measurements.
- Utilizes advanced feature engineering to enhance model generalization across different temporal contexts.
Read more
A temporal deep learning framework for calibration of low-cost air quality sensors
Summary
This paper addresses the calibration challenges faced by low-cost air quality sensors (LCS), which are essential for dense urban monitoring networks but often suffer from issues like sensor drift and environmental cross-sensitivity. The authors propose a novel deep learning framework utilizing Long Short-Term Memory (LSTM) networks to calibrate measurements of PM2.5, PM10, and NO2. The model is trained on co-located reference data from the OxAria network in Oxford, UK, and is designed to capture temporal dependencies and delayed environmental effects through sequence-based learning. This approach outperforms traditional Random Forest (RF) methods, which treat observations independently, by achieving higher R² values across training, validation, and test sets for all pollutants. The feature set includes time-lagged parameters, harmonic encodings, and interaction terms to enhance generalization on unseen temporal windows. Validation against the Equivalence Spreadsheet Tool 3.1 confirms regulatory compliance with expanded uncertainties of 22.11% for NO2, 12.42% for PM10, and 9.1% for PM2.5, demonstrating the effectiveness of the proposed calibration framework.
Methodology
The authors developed a Long Short-Term Memory (LSTM) network to calibrate measurements from low-cost air quality sensors. The model was trained on co-located reference data, incorporating a feature set that included time-lagged parameters, harmonic encodings, and interaction terms to improve generalization. The performance of the LSTM model was compared against a Random Forest baseline, which treats observations independently.
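A minimal sketch of that feature engineering, assuming hourly data and a 24-hour harmonic (the paper's exact lag set, harmonics, and interaction terms may differ):

```python
import numpy as np

def make_features(raw, lags=(1, 2, 3), period=24):
    """Build a feature matrix from a raw 1-D sensor series.

    Combines time-lagged copies of the signal with harmonic (sin/cos)
    encodings of time-of-day, in the spirit of the LSTM inputs described.
    """
    t = np.arange(len(raw))
    cols = [raw]
    for lag in lags:
        cols.append(np.roll(raw, lag))           # lagged copies of the signal
    cols.append(np.sin(2 * np.pi * t / period))  # harmonic encoding of time
    cols.append(np.cos(2 * np.pi * t / period))
    X = np.stack(cols, axis=1)
    return X[max(lags):]                         # drop rows with wrapped lags
```

Windows of such rows would then be fed to the LSTM so that delayed environmental effects (e.g., humidity acting on a particle sensor hours later) are visible to the model.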
Results
The LSTM model achieved higher R² values across training, validation, and test sets for PM2.5, PM10, and NO2 compared to the Random Forest baseline. Validation against the Equivalence Spreadsheet Tool 3.1 showed expanded uncertainties of 22.11% for NO2, 12.42% for PM10, and 9.1% for PM2.5, indicating successful calibration and regulatory compliance.
Implications
The proposed framework has significant implications for urban air quality monitoring, enabling more accurate and reliable data from low-cost sensors. This can facilitate better public health decisions and policies by providing real-time insights into air pollution levels. The approach also highlights the potential of deep learning techniques in environmental monitoring applications.
Conditional anomaly detection with soft harmonic functions
Graph Learning
- Introduction of a non-parametric method for conditional anomaly detection using soft harmonic functions.
- Regularization techniques to avoid misclassification of isolated and fringe points.
- Development of a compact computation method for building a backbone graph to facilitate label propagation.
- Demonstration of the method's efficacy on synthetic, UCI, and real-world datasets.
Read more
Conditional anomaly detection with soft harmonic functions
Summary
This paper addresses the problem of conditional anomaly detection (CAD), which focuses on identifying unusual data instances based on a subset of variables while considering the context provided by other variables. The authors propose a novel non-parametric approach using soft harmonic functions to estimate label confidence and detect anomalous mislabeling. The method incorporates regularization to mitigate the detection of isolated and fringe points, which are often problematic in traditional anomaly detection methods. The authors validate their approach on various synthetic datasets, UCI machine learning datasets, and a real-world electronic health record dataset, demonstrating its effectiveness in identifying unusual patient-management decisions. The paper highlights the importance of context in anomaly detection and introduces a graph-based method that respects the manifold structure of the data, improving upon existing local neighborhood techniques.
Methodology
The proposed method utilizes a similarity graph of instances to propagate labeling information and assess label consistency in the neighborhood of data points. It incorporates a specific regularization to handle isolated and fringe points effectively, while also employing a backbone graph to simplify computations.
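The label-propagation step can be illustrated with a simplified soft harmonic solver over a similarity graph; the paper's actual regularization and backbone-graph construction are more elaborate than this sketch, and the constant weights here are our assumptions.

```python
import numpy as np

def soft_harmonic(W, y_labeled, labeled_idx, gamma=0.1):
    """Simplified soft harmonic function on a similarity graph.

    Solves (L + C + gamma*I) f = C y, where L is the graph Laplacian,
    C softly anchors labeled nodes, and gamma pulls isolated/fringe
    points toward a neutral score (echoing the paper's regularization).
    Labels y are +1 / -1; low |f| flags low label confidence.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W      # combinatorial graph Laplacian
    b = np.zeros(n)
    b[labeled_idx] = y_labeled
    C = np.zeros(n)
    C[labeled_idx] = 10.0               # soft anchoring of labeled nodes
    f = np.linalg.solve(L + np.diag(C) + gamma * np.eye(n), C * b)
    return f
```

A labeled instance whose propagated score f disagrees strongly with its observed label would be flagged as a conditional anomaly.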
Results
The proposed approach outperformed several baseline methods in detecting unusual labels across synthetic datasets, UCI datasets, and a challenging real-world electronic health record dataset, showcasing its robustness and applicability in various contexts.
Implications
This work has significant implications for fields requiring anomaly detection in context-sensitive environments, such as healthcare, finance, and social networks. The method can enhance decision-making processes by accurately identifying unusual behaviors or outcomes based on contextual information.
The Sample Complexity of Multicalibration
Theory
- Establishes the sample complexity of multicalibration as Θ̃(ε^{-3}) for certain group sizes.
- Differentiates multicalibration from marginal calibration, which has a lower sample complexity of Θ̃(ε^{-2}).
- Demonstrates that mean-ECE multicalibration is equally difficult in both batch and online settings.
- Identifies a sharp threshold phenomenon in sample complexity when κ = 0.
Read more
The Sample Complexity of Multicalibration
Summary
This paper investigates the minimax sample complexity of multicalibration in a batch learning setting, where a learner must produce a predictor with a population multicalibration error, quantified by Expected Calibration Error (ECE), that does not exceed a specified threshold ε for a given set of groups. The authors establish that for a fixed κ > 0, when the number of groups satisfies |G| ≤ ε^{-κ}, a sample size of Θ̃(ε^{-3}) is both necessary and sufficient, up to polylogarithmic factors. This finding is significant as it differentiates the sample complexity of multicalibration from that of marginal calibration, which has a sample complexity of Θ̃(ε^{-2}). The paper also demonstrates that mean-ECE multicalibration is equally challenging in both batch and online settings, contrasting with marginal calibration, which is harder in online scenarios. For the case where κ = 0, the authors observe a threshold phenomenon where the sample complexity remains Θ̃(ε^{-2}). Furthermore, they provide matching upper and lower bounds for a weighted Lp multicalibration metric across all 1 ≤ p ≤ 2, achieving an optimal exponent of 3/p. The lower-bound framework is extended to a regular class of elicitable properties, leading to matching bounds for calibrating properties such as expectiles and bounded-density quantiles.
Methodology
The authors utilize a theoretical approach to derive lower and upper bounds on the sample complexity of multicalibration. They employ coding theory primitives and analyze hard instances involving compressed groups and staircase distributions. The upper bounds are achieved through an online-to-batch reduction technique, which allows for the construction of a randomized predictor that meets the required multicalibration error.
Results
The main results indicate that the sample complexity for multicalibration is Θ̃(ε^{-3}) in specific conditions, while for κ = 0 it remains Θ̃(ε^{-2}). The paper also provides matching bounds for a weighted Lp multicalibration metric and extends findings to various elicitable properties, confirming the robustness of the results across different contexts.
Implications
These findings have significant implications for the design of machine learning algorithms that require calibration across multiple groups, particularly in applications where fairness and accuracy are critical. Understanding the sample complexity can guide practitioners in determining the necessary data requirements for achieving reliable multicalibration in predictive models.
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
NLP
Large Language Models
Efficient ML
- Introduces the Recurrent Transformer architecture, enhancing effective depth and efficiency.
- Emulates both conventional Transformer behavior and token-to-token recurrent updates.
- Presents a tiling algorithm that reduces memory traffic and increases arithmetic intensity.
- Demonstrates improved performance in cross-entropy loss with fewer layers compared to traditional Transformers.
Read more
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Summary
The paper introduces the Recurrent Transformer (RT), an innovative architecture that enhances the effective depth of Transformers while maintaining efficient decoding. Traditional Transformers process tokens in parallel but are limited by their shallow temporal structure, where each layer only attends to key-value pairs from the previous layer. In contrast, the Recurrent Transformer allows each layer to compute key-value pairs based on its own activations, enabling layerwise recurrent memory. This design not only emulates conventional Transformer behavior but also facilitates token-to-token recurrent updates, thereby avoiding optimization instability. The authors present a tiling-based algorithm that significantly reduces high-bandwidth memory traffic during training, increasing effective arithmetic intensity. Experimental results demonstrate that Recurrent Transformers outperform parameter-matched Transformer baselines in cross-entropy loss while requiring fewer layers, indicating a beneficial trade-off between depth and width. The findings suggest that this architecture can lead to reduced memory footprint and lower inference latency, making it a promising advancement in sequence modeling.
Methodology
The Recurrent Transformer modifies the computation of key-value pairs within each layer, allowing them to be derived from the layer's own outputs rather than the previous layer's representations. This change facilitates recurrent memory while preserving autoregressive decoding costs. The authors also propose a tiling-based algorithm to optimize memory usage during training, which reorganizes computation to enhance efficiency.
Results
The Recurrent Transformer models showed improved cross-entropy loss on the C4 pretraining dataset compared to parameter-matched Transformer baselines. The architecture achieved these results with fewer layers, indicating that it can effectively trade depth for width, leading to reduced memory usage and faster inference times.
Implications
The Recurrent Transformer has the potential to improve the efficiency and performance of various sequence modeling tasks, particularly in natural language processing and other applications requiring deep temporal representations. Its design may also inspire further research into optimizing Transformer architectures for better resource utilization.
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Reinforcement Learning
Large Language Models
NLP
- Medium-frequency samples are identified as a major source of spurious reward signals.
- Group-relative advantage normalization amplifies these spurious signals during optimization.
- The DDRL framework introduces a balanced sampling strategy and debiased advantage estimation.
- Extensive experiments show significant performance improvements over existing TTRL methods.
Read more
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Summary
This paper addresses the challenges of spurious signal amplification in Test-Time Reinforcement Learning (TTRL) when applied to mathematical reasoning tasks. The authors identify that TTRL, which adapts models at inference time using pseudo-labeling, is susceptible to noise from spurious optimization signals. Through empirical analysis, they discover that responses with medium consistency create an ambiguity region that significantly contributes to reward noise. The study reveals that group-relative advantage estimation can amplify these spurious signals, leading to distorted learning dynamics. To counteract this issue, the authors propose a unified framework called Debiased and Denoised test-time Reinforcement Learning (DDRL). DDRL employs a frequency-based sampling strategy to filter out ambiguous samples while ensuring a balanced representation of positive and negative examples. It also incorporates a debiased advantage estimation method to eliminate biases introduced by group-relative policy optimization. Finally, DDRL features a consensus-based off-policy refinement stage that utilizes a rejection-sampled dataset for stable model updates. Experiments conducted on various large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines, showcasing its effectiveness in mitigating spurious signals.
Methodology
The authors developed the DDRL framework, which includes three main components: a balanced confidence-aware sampling strategy to filter ambiguous samples, a debiased advantage estimation approach that assigns fixed advantages to mitigate bias, and a consensus-based off-policy refinement stage that utilizes a rejection-sampled dataset for stable updates. This methodology aims to reduce the impact of spurious signals during the reinforcement learning process.
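A toy sketch of the pseudo-labeling reward and the debiased (fixed) advantage assignment; the function names and the specific advantage constants are our assumptions, and DDRL's frequency-based sampling and off-policy refinement stages are omitted.

```python
import numpy as np

def pseudo_rewards(answers):
    """TTRL-style pseudo-labeling: the majority answer among sampled
    responses is treated as the label, and each response is rewarded
    for matching it. Also returns the consistency (vote share), which
    DDRL uses to filter ambiguous, medium-consistency samples."""
    answers = np.asarray(answers)
    vals, counts = np.unique(answers, return_counts=True)
    majority = vals[np.argmax(counts)]
    return (answers == majority).astype(float), counts.max() / len(answers)

def debiased_advantages(rewards, a_pos=1.0, a_neg=-1.0):
    """Fixed advantages for correct/incorrect responses, replacing
    group-relative normalization (a simplification of DDRL's
    debiased estimator)."""
    r = np.asarray(rewards)
    return np.where(r > 0, a_pos, a_neg)
```

Under this scheme, a group's advantage magnitudes no longer depend on its empirical reward variance, which is the mechanism the paper identifies as amplifying spurious signals.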
Results
The experiments demonstrated that DDRL achieved significant improvements in performance, with relative gains of 15.3% on the Qwen2.5-MATH-1.5B model and 12.7% on the LLaMA-3.1-8B-Instruct model across various mathematical reasoning benchmarks, outperforming existing TTRL baselines.
Implications
The findings suggest that addressing spurious signals in reinforcement learning can lead to more robust and reliable models, particularly in applications involving mathematical reasoning and other structured tasks. The DDRL framework could be applied to enhance the performance of large language models in various domains where label noise and ambiguity are prevalent.
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
NLP
Large Language Models
Efficient ML
- Introduces sub-token routing for finer control in transformer efficiency.
- Presents a query-independent design that enhances language modeling quality.
- Develops a query-aware design that maintains downstream performance under reduced KV budgets.
- Demonstrates the complementary nature of token-level and sub-token-level routing.
Read more
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
Summary
This paper introduces a novel approach to enhance the efficiency of transformer models through sub-token routing within LoRA-adapted transformers. The authors argue that traditional methods of routing and compression at the token level are insufficient, as they overlook the internal non-uniformity of token representations. By focusing on sub-token routing, the paper presents two designs: a query-independent method for compression-aware language modeling and a query-aware method for downstream-task-preserving KV compression. The query-independent design combines routed subspace LoRA with value-group routing, improving the quality-compression tradeoff. The query-aware design employs a predictor-based selector that allocates a global retention budget based on query-conditioned relevance, effectively preserving downstream performance while reducing KV budgets. The paper also explores the relationship between token-level and sub-token-level routing, demonstrating that they can be combined for deeper KV compression without sacrificing task accuracy.
Methodology
The authors implement two routing designs in LoRA-adapted transformers: a query-independent design that uses routed subspace LoRA and value-group routing for language modeling, and a query-aware design that utilizes a predictor-based selector for allocating retention budgets based on query relevance. Experiments are conducted to evaluate the performance of these methods in terms of quality-compression tradeoffs and downstream task preservation.
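The budgeted retention idea can be illustrated with a plain dot-product relevance score standing in for the paper's learned predictor-based selector; this sketch also works at the whole-token level rather than the sub-token granularity the paper proposes.

```python
import numpy as np

def query_aware_keep(keys, values, query, budget):
    """Keep only the `budget` KV pairs most relevant to the query.

    Relevance here is a raw dot-product score; the paper instead
    allocates a global retention budget via a trained predictor.
    """
    scores = keys @ query                # query-conditioned relevance
    keep = np.argsort(scores)[-budget:]  # retain the top-`budget` entries
    keep.sort()                          # preserve original sequence order
    return keys[keep], values[keep]
```

Attention is then computed against the retained pairs only, shrinking the KV cache while keeping the entries the current query actually needs.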
Results
The query-independent design significantly improves the quality-compression tradeoff for language modeling tasks. The query-aware design successfully preserves downstream task performance while operating under reduced KV budgets. The analysis reveals that token-level and sub-token-level routing provide complementary benefits, allowing for deeper KV compression without loss of accuracy.
Implications
This research has potential applications in optimizing transformer models for various NLP tasks, particularly in scenarios where computational resources are limited. The findings could lead to more efficient model deployment in real-world applications, enhancing both performance and resource utilization.
Droplet-LNO: Physics-Informed Laplace Neural Operators for Accurate Prediction of Droplet Spreading Dynamics on Complex Surfaces
Theory
Efficient ML
Optimization
- Introduction of PI-LNO, a novel neural network architecture for droplet dynamics prediction.
- Significant reduction in computation time, achieving a ∼23,400× speedup over traditional CFD methods.
- Demonstrated superior performance with a mean R² score of 0.9009 compared to existing models.
- Incorporation of physics-informed constraints enhances model accuracy and physical interpretability.
Read more
Droplet-LNO: Physics-Informed Laplace Neural Operators for Accurate Prediction of Droplet Spreading Dynamics on Complex Surfaces
Summary
The paper introduces the Physics-Informed Laplace Operator Neural Network (PI-LNO), a novel architecture designed to predict the dynamics of liquid droplet spreading on complex surfaces. Traditional computational fluid dynamics (CFD) simulations are time-consuming, often requiring 18 to 24 hours for transient computations. PI-LNO addresses this challenge by utilizing Laplace integral transforms to model the exponential transient dynamics of droplet spreading. The authors conducted extensive benchmark studies comparing PI-LNO with five state-of-the-art methods, including UNet and Physics-Informed UNet. The model was trained on multi-surface CFD data, incorporating a physics-regularized composite loss function that combines data fidelity with physical constraints from Navier-Stokes equations and Cahn-Hilliard dynamics. The results demonstrate that PI-LNO significantly outperforms existing methods, achieving a mean R² score of 0.9009 across various spreading times, with absolute errors localized near contact-line regions. Additionally, the model shows rapid inference times, enabling real-time applications. The findings suggest that PI-LNO can serve as an effective surrogate model for parametric optimization and design in engineering systems where transient multiphase dynamics are critical.
Methodology
The authors developed the PI-LNO architecture, which leverages Laplace integral transforms to model droplet spreading dynamics. The model was trained on CFD data across varying contact angles, using a physics-regularized composite loss function that integrates data fidelity metrics with physical constraints from governing equations. Extensive comparative analyses were performed against other neural network architectures.
Results
PI-LNO achieved a mean R² score of 0.9009 across four intermediate spreading times, outperforming other models such as UNet and DeepONet. The model demonstrated absolute errors localized near contact lines and achieved rapid inference times of 2.8 ms, indicating a significant computational efficiency compared to traditional CFD methods.
Implications
The development of PI-LNO has potential applications in various fields, including inkjet printing, spray cooling, and biomedical microfluidics. Its ability to provide accurate predictions in real-time can facilitate accelerated design iterations and optimization processes in engineering systems that involve transient multiphase dynamics.
JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning
Computer Vision
Theory
Efficient ML
- JEPAMatch addresses class imbalance and convergence issues in semi-supervised learning.
- The method combines pseudo-labeling with geometric representation shaping in latent space.
- Extensive experiments show JEPAMatch outperforms existing methods on multiple datasets.
- The approach significantly accelerates convergence and reduces computational costs.
Read more
JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning
Summary
The paper introduces JEPAMatch, a novel approach to semi-supervised learning (SSL) that addresses the limitations of existing methods like FixMatch, particularly in handling class imbalance and convergence speed. Traditional SSL methods often suffer from biased models due to majority class dominance and incorrect pseudo-labels, which hinder the formation of clear decision boundaries. JEPAMatch shifts the focus from conventional output thresholding to the geometric shaping of representations in latent space, inspired by the Latent-Euclidean Joint-Embedding Predictive Architectures (LeJEPA). This new training objective combines a semi-supervised loss with a latent-space regularization term, promoting well-structured representations while leveraging pseudo-labeling. The authors validate their approach through extensive experiments on CIFAR-100, STL-10, and Tiny-ImageNet, demonstrating that JEPAMatch consistently outperforms existing baselines and accelerates convergence, thus reducing computational costs compared to standard FixMatch-based methods.
Methodology
JEPAMatch integrates a two-level learning process: a Curriculum Level for pseudo-label selection and a Representation Level for structuring the feature space. It employs a new training objective that combines semi-supervised loss with latent-space regularization derived from LeJEPA, promoting isotropic Gaussian structures in the representation space.
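The combined objective can be sketched as a cross-entropy term plus an isotropy penalty on the latent batch; this is a LeJEPA-inspired stand-in (zero mean, identity covariance), not the paper's exact regularizer, and the weighting constant is our assumption.

```python
import numpy as np

def isotropy_penalty(z):
    """Latent-space regularizer pushing a batch of embeddings toward an
    isotropic Gaussian: zero mean and identity covariance."""
    mu = z.mean(axis=0)
    zc = z - mu
    cov = (zc.T @ zc) / len(zc)
    d = cov.shape[0]
    return np.linalg.norm(cov - np.eye(d), "fro") ** 2 + float(mu @ mu)

def jepamatch_loss(ce_loss, z, lam=0.1):
    # Total objective: semi-supervised cross-entropy plus the
    # geometric shaping term on the representations.
    return ce_loss + lam * isotropy_penalty(z)
```

Because the penalty acts on representations rather than logits, it shapes the feature space even for unlabeled samples whose pseudo-labels are withheld by the curriculum.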
Results
The proposed JEPAMatch method consistently outperformed existing baseline methods on CIFAR-100, STL-10, and Tiny-ImageNet datasets. It demonstrated improved classification performance and significantly faster convergence rates, reducing the overall computational cost compared to traditional FixMatch-based pipelines.
Implications
JEPAMatch has potential applications in various domains where labeled data is scarce, such as medical imaging and natural language processing. Its ability to improve representation learning and accelerate convergence could lead to more efficient training processes in semi-supervised learning scenarios.
Synthetic Data in Education: Empirical Insights from Traditional Resampling and Deep Generative Models
Generative Models
- Synthetic data generation can mitigate data scarcity and privacy issues in education.
- Traditional resampling methods provide high utility but lack privacy protection.
- Deep learning models offer better privacy guarantees but at a significant utility cost.
- Variational Autoencoders are identified as offering the best balance between utility and privacy.
Read more
Synthetic Data in Education: Empirical Insights from Traditional Resampling and Deep Generative Models
Summary
This paper investigates the generation of synthetic data in educational contexts, focusing on the trade-offs between traditional resampling techniques and modern deep learning methods. The authors benchmark three resampling methods (SMOTE, Bootstrap, Random Oversampling) against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) using a dataset of 10,000 student performance records. The evaluation criteria include distributional fidelity, machine learning utility, and privacy preservation. The results indicate that while resampling methods achieve high utility scores, they fail to protect privacy, whereas deep learning models provide better privacy guarantees at the cost of utility. The Variational Autoencoder stands out as the best compromise, maintaining significant predictive performance while ensuring complete privacy protection. The study offers practical recommendations for practitioners on when to use each method based on privacy needs, establishing a foundational benchmark for synthetic data generation in learning analytics.
Methodology
The study systematically benchmarks traditional resampling techniques (SMOTE, Bootstrap, Random Oversampling) against deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) using a dataset of 10,000 student performance records. Metrics for evaluation include Kolmogorov-Smirnov distance, Jensen-Shannon divergence for distributional fidelity, Train-on-Synthetic-Test-on-Real scores for utility, and Distance to Closest Record for privacy preservation.
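The DCR privacy metric is straightforward to compute from the two tables; the aggregation and normalization behind the reported ≈0.00 vs ≈1.00 values may differ from this raw per-record version.

```python
import numpy as np

def distance_to_closest_record(synth, real):
    """DCR: for each synthetic row, the Euclidean distance to its
    nearest real record. Values near 0 mean the generator produced
    near-copies of real students (weak privacy)."""
    # Pairwise distances via broadcasting: (n_synth, n_real).
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    return d.min(axis=1)
```

Resampling methods like Random Oversampling emit exact copies, so their DCR collapses to zero, which is precisely the utility/privacy trade-off the study reports.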
Results
The findings reveal a trade-off: resampling methods achieve near-perfect utility (TSTR: 0.997) but fail to protect privacy (DCR ≈ 0.00), while deep learning models ensure strong privacy (DCR ≈ 1.00) at a significant utility cost. The Variational Autoencoder maintains 83.3% predictive performance while ensuring complete privacy protection.
Implications
The results provide a foundational benchmark for educators and researchers in selecting appropriate synthetic data generation methods based on their specific needs for utility and privacy. This guidance can enhance the development of educational technologies while respecting student privacy.
Dynamical Priors as a Training Objective in Reinforcement Learning
Reinforcement Learning
- DP-RL framework introduces an auxiliary loss to impose temporal structure in RL training.
- The approach does not modify the reward structure, environment, or policy architecture.
- Dynamical priors significantly alter decision trajectories, promoting temporally coherent behavior.
- The study demonstrates that training objectives can control the temporal geometry of decision-making.
Read more
Dynamical Priors as a Training Objective in Reinforcement Learning
Summary
This paper introduces Dynamical Prior Reinforcement Learning (DP-RL), a novel training framework that enhances standard reinforcement learning (RL) by incorporating an auxiliary loss derived from external state dynamics (ESD). The motivation behind DP-RL is to address the temporal incoherence often exhibited by RL policies, such as abrupt confidence shifts and oscillatory behavior, which can occur even when high rewards are achieved. By imposing a dynamical prior during training, DP-RL encourages action probabilities to evolve in a temporally structured manner without altering the reward, environment, or policy architecture. The study evaluates the effectiveness of DP-RL across three minimal environments—Drift, Threshold Hover, and Decision Window—each designed to highlight specific temporal decision-making challenges. The results demonstrate that the introduction of dynamical priors significantly alters decision trajectories in task-dependent ways, promoting smoother and more coherent decision-making processes. This work emphasizes that training objectives can be leveraged to control the temporal dynamics of policy learning, paving the way for RL agents that exhibit desired temporal properties while maintaining the flexibility of policy gradient methods.
Methodology
The study compares two types of agents: a standard REINFORCE agent and an ESD-guided RL agent, both trained under identical conditions. The ESD-derived auxiliary loss encourages temporally coherent evolution of action probabilities. Three minimal environments are designed to expose distinct failure modes of standard RL policies, focusing on how the temporal geometry of decisions changes with the introduction of the dynamical prior.
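One simple instance of such an auxiliary loss penalizes step-to-step changes in the action distribution; the paper's ESD-derived prior encodes richer target dynamics than this smoothness-only sketch, and the weighting is our assumption.

```python
import numpy as np

def temporal_prior_loss(action_probs, lam=1.0):
    """Auxiliary loss penalizing abrupt changes in the policy's action
    distribution across time steps, a minimal stand-in for a dynamical
    prior on how action probabilities should evolve.

    action_probs: (T, A) array of per-step action probabilities.
    """
    diffs = np.diff(action_probs, axis=0)  # step-to-step probability changes
    return lam * np.sum(diffs ** 2)
```

Added to the policy-gradient objective, this term leaves the reward untouched but discourages the oscillatory, abruptly shifting confidence trajectories the paper highlights.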
Results
The results indicate that the DP-RL framework effectively stabilizes timing and promotes gradual confidence buildup in decision-making. The evaluation metrics, including decision jerk, oscillation count, and timing variance, show that the dynamical prior influences the evolution of action probabilities in a manner that aligns with the task structure, leading to more coherent decision-making compared to standard RL approaches.
Implications
The findings suggest that RL agents can be designed with enhanced temporal properties, such as robustness to noise and resistance to premature commitment, without sacrificing performance. This could lead to more biologically inspired decision-making processes in artificial agents, improving their applicability in real-world scenarios.
Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair
Theory
- Theorem establishes a necessary geometric flaw in representations learned via ERM, termed the geometric blind spot.
- Introduces the Trajectory Deviation Index (TDI) to measure geometric distortion, revealing limitations of existing metrics.
- Confirms that the blind spot worsens with model scale and is amplified by task-specific fine-tuning.
- Proposes a minimal fix (PMH) that effectively reduces the blind spot while maintaining performance.
Read more
Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair
Summary
This paper presents a theoretical framework demonstrating that empirical risk minimization (ERM) in supervised learning imposes a geometric constraint on learned representations, termed the 'geometric blind spot.' The author proves that any encoder minimizing supervised loss must retain non-zero Jacobian sensitivity in directions that are correlated with labels during training but are considered nuisance at test time. This phenomenon is not merely a limitation of specific architectures or datasets but a fundamental characteristic of the supervised learning objective itself. The paper unifies several previously isolated empirical findings, including non-robust predictive features and the robustness-accuracy tradeoff, under this new theoretical lens. The author introduces the Trajectory Deviation Index (TDI) as a diagnostic tool to measure this geometric distortion, showing that common methods like adversarial training do not effectively address the blind spot. The findings indicate that the blind spot is present across various tasks and scales, worsening with model size and task-specific fine-tuning. A minimal repair method, PMH, is proposed, which significantly reduces the TDI without major architectural changes; the accompanying analysis shows that Gaussian noise is the unique perturbation that uniformly suppresses Jacobian sensitivity.
Methodology
The paper employs theoretical proofs to establish the geometric constraints imposed by ERM, introduces the Trajectory Deviation Index (TDI) for measuring Jacobian sensitivity, and conducts empirical evaluations across multiple vision tasks and language models to validate the findings. The PMH method is tested against standard adversarial training techniques to assess its effectiveness in mitigating the blind spot.
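The summary does not give the TDI formula, but the primitive it builds on — an encoder's Jacobian sensitivity along a chosen input direction — can be estimated with a central finite difference. A toy sketch (the helper name, the linear encoder, and the nuisance direction are all illustrative, not the paper's definitions):

```python
import numpy as np

def directional_sensitivity(encoder, x, direction, eps=1e-4):
    """Central finite-difference estimate of ||J(x) d|| for unit d:
    how strongly the encoder's output moves along input direction d.
    This per-direction sensitivity is the raw quantity a TDI-style
    diagnostic would aggregate."""
    d = direction / np.linalg.norm(direction)
    delta = encoder(x + eps * d) - encoder(x - eps * d)
    return np.linalg.norm(delta) / (2 * eps)

# Toy linear encoder: sensitive to input dim 0, blind to input dim 1.
W = np.array([[1.0, 0.0],
              [2.0, 0.0]])
enc = lambda x: W @ x
x0 = np.array([0.3, -0.7])

s_label    = directional_sensitivity(enc, x0, np.array([1.0, 0.0]))  # = sqrt(5)
s_nuisance = directional_sensitivity(enc, x0, np.array([0.0, 1.0]))  # = 0
```

For a linear map the finite difference is exact, so `s_label` equals the column norm of `W` and `s_nuisance` is zero; the paper's result says ERM-trained encoders cannot drive the nuisance-direction sensitivity to zero when that direction was label-correlated during training.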
Results
The results indicate that the TDI effectively captures the geometric blind spot, with PMH achieving the lowest TDI values across multiple tasks. The study shows that adversarial training can worsen the clean-input geometry, while PMH significantly reduces the TDI without substantial accuracy loss. The blind spot ratio was observed to worsen with increasing model size and task-specific fine-tuning, confirming the theoretical predictions.
Implications
The findings suggest that current supervised learning methods may inherently limit model robustness and generalization. The introduction of TDI provides a new diagnostic tool for evaluating model performance, and the proposed PMH method offers a potential pathway for improving model robustness without extensive architectural changes. This work has implications for the design of future machine learning models and training methodologies.
Spectral Embeddings Leak Graph Topology: Theory, Benchmark, and Adaptive Reconstruction
Graph Learning
Federated Learning
Theory
- Introduction of LoGraB, a benchmark for fragmented graph learning.
- Development of AFR, an adaptive method for reconstructing noisy spectral fragments.
- Establishment of the Spectral Leakage Proposition for polynomial-time graph recovery.
- Demonstration of AFR's effectiveness in maintaining performance under privacy constraints.
Read more
Spectral Embeddings Leak Graph Topology: Theory, Benchmark, and Adaptive Reconstruction
Summary
This paper addresses the challenges of Graph Neural Networks (GNNs) in real-world applications where graph data is often fragmented, noisy, and privacy-sensitive. The authors introduce a unified framework that includes the Local Graph Benchmark (LoGraB), which systematically decomposes standard datasets into fragmented benchmarks, allowing for controlled experimentation with parameters such as neighborhood radius, spectral quality, noise level, and coverage ratio. The paper also presents Adaptive Fidelity-driven Reconstruction (AFR), a method designed to recover graph topology from noisy spectral fragments. AFR employs a novel fidelity score to assess the quality of graph patches and utilizes techniques like RANSAC-Procrustes alignment and Bundle Adjustment for robust reconstruction. The theoretical contributions include the Spectral Leakage Proposition, which establishes conditions under which polynomial-time recovery of graph topology is feasible. Experimental results demonstrate that LoGraB effectively reveals the strengths and weaknesses of GNNs under fragmentation, and AFR achieves superior performance on multiple datasets while maintaining privacy under differential privacy constraints.
Methodology
The authors formalize the fragmented-graph learning setting and introduce LoGraB for systematic benchmarking. They propose AFR, which uses a fidelity score to guide the reconstruction of graph topology from noisy fragments, employing techniques such as RANSAC-Procrustes alignment and Bundle Adjustment.
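The alignment step inside AFR relies on orthogonal Procrustes, which has a closed-form SVD solution. A minimal version — without the RANSAC loop, fidelity weighting, or Bundle Adjustment described above — looks like this:

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal Procrustes: the rotation/reflection R minimizing
    ||A R - B||_F, obtained from the SVD of A^T B. This is the core
    primitive the RANSAC-Procrustes step wraps when stitching
    overlapping spectral fragments into a common basis."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))                  # one spectral fragment
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # unknown basis change
B = A @ Q                                         # same fragment, other basis
R = procrustes_align(A, B)                        # recovers Q
```

In the noiseless case `A @ R` matches `B` exactly; with noisy fragments, the RANSAC wrapper discards outlier correspondences before solving the same SVD problem on the inliers.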
Results
Experiments on nine benchmarks show that LoGraB effectively highlights the performance of GNNs under fragmentation. AFR achieves the best F1 score on 7 out of 9 datasets and retains 75% of its undefended F1 score under differential privacy conditions.
Implications
The findings suggest that GNNs can be effectively evaluated and improved in fragmented and privacy-sensitive environments. The methods developed can enhance the robustness of graph learning in real-world applications, such as federated learning and privacy-preserving systems.
Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2
Interpretability
- Adaptation of AttnLRP for explaining DNABERT-2 model predictions.
- Comparison of explanations from DNABERT-2 and a baseline CNN.
- Demonstration that Transformer-based models can yield biologically relevant insights.
- Evaluation of explanations using multiple metrics including sparsity and faithfulness.
Read more
Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2
Summary
This paper investigates the explainability of the Transformer-based genome language model DNABERT-2 by adapting the AttnLRP method, which is an extension of layer-wise relevance propagation tailored for attention mechanisms. The authors aim to determine whether post-hoc explanations derived from DNABERT-2 can provide biological insights similar to those obtained from convolutional neural networks (CNNs). They evaluate the adapted AttnLRP on genomic datasets, comparing the explanations generated by DNABERT-2 with those from a baseline CNN. The results indicate that AttnLRP produces reliable explanations that align with known biological patterns, suggesting that Transformer-based models can also facilitate hypothesis generation in genomics. This work enhances the understanding of explainability in genome language models and establishes a framework for comparing relevance attributions across different neural network architectures.
Methodology
The authors adapted the AttnLRP method to the DNABERT-2 model, applying it to genomic datasets for classification tasks. They evaluated the explanations using metrics for sparsity, complexity, similarity, faithfulness, and localization, and compared these results with explanations generated from a CNN using traditional LRP.
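The attention-specific rules of AttnLRP are beyond a short sketch, but its base ingredient — the epsilon-rule for redistributing relevance through a linear layer — fits in a few lines. A NumPy sketch (the function name and toy values are illustrative; the adaptations for DNABERT-2's attention layers are not shown):

```python
import numpy as np

def lrp_linear(a, W, R_out, eps=1e-6):
    """Epsilon-rule LRP through one linear layer z = a @ W:
    each input's relevance is its share of the contributions it made
    to every output, scaled by that output's relevance. The eps term
    stabilizes division by near-zero activations."""
    z = a @ W
    z = z + eps * np.sign(z)
    s = R_out / z          # per-output relevance per unit of activation
    return a * (W @ s)     # redistribute back to the inputs

a = np.array([1.0, 2.0])                       # input activations
W = np.array([[1.0, 0.0],
              [0.0, 3.0]])
R_in = lrp_linear(a, W, R_out=np.array([1.0, 6.0]))
```

The rule is conservative: up to the epsilon term, the input relevances sum to the output relevances (here 1 + 6 = 7), which is what makes layer-wise propagation through a deep network meaningful.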
Results
The adaptation of AttnLRP to DNABERT-2 yielded explanations that corresponded well with known biological patterns, demonstrating that the Transformer-based model can provide insights similar to those obtained from CNNs. The evaluation metrics indicated that the explanations were reliable and informative.
Implications
The findings suggest that Transformer-based genome language models like DNABERT-2 can be effectively utilized for generating biological insights, which may enhance understanding of genomic functions and regulatory mechanisms. This could have significant implications for diagnosis, treatment, and prevention of diseases.
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
Large Language Models
Federated Learning
NLP
- ProjRes is the first projection residuals-based passive MIA specifically designed for FedLLMs.
- The method achieves near 100% accuracy in membership inference, outperforming previous techniques.
- ProjRes operates without the need for shadow models or auxiliary classifiers, enhancing efficiency.
- The study reveals significant privacy vulnerabilities in FedLLMs, necessitating a reevaluation of their security measures.
Read more
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
Summary
This paper addresses the vulnerability of Federated Large Language Models (FedLLMs) to Membership Inference Attacks (MIAs), which can expose sensitive information even though these models are designed to keep training data private. The authors propose a novel attack method called ProjRes, which utilizes projection residuals derived from hidden embedding vectors to analyze the relationship between gradients and input data. Unlike traditional MIAs, ProjRes does not require shadow models or auxiliary classifiers, making it more efficient and robust. The study demonstrates that ProjRes achieves near 100% accuracy in identifying data membership across four benchmarks and four LLMs, significantly outperforming existing methods by up to 75.75%. The findings highlight a critical privacy vulnerability in FedLLMs, suggesting a need for a reassessment of their security assumptions and the potential for improved defenses against MIAs.
Methodology
The authors developed ProjRes, which analyzes hidden embedding vectors and their projection residuals in the gradient subspace to determine data membership. This approach circumvents the limitations of existing MIAs by eliminating the need for shadow models and auxiliary classifiers, focusing instead on the intrinsic relationships between gradients and inputs.
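The exact statistic is not given in this summary; one plausible reading of a "projection residual in the gradient subspace" is the norm of the part of an embedding that lies outside the span of a set of gradient directions. A toy NumPy sketch under that assumption:

```python
import numpy as np

def projection_residual(h, G):
    """Norm of the component of embedding h orthogonal to the subspace
    spanned by the columns of G (e.g. observed gradient directions).
    A hypothetical reading of the ProjRes statistic, not the paper's
    exact definition: members' embeddings are expected to leave a
    systematically different residual than non-members'."""
    Q, _ = np.linalg.qr(G)                 # orthonormal basis for span(G)
    return np.linalg.norm(h - Q @ (Q.T @ h))

G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])                 # gradients span the xy-plane
h = np.array([3.0, 4.0, 5.0])              # hidden embedding vector
res = projection_residual(h, G)            # only the z-component survives
```

Thresholding such a residual per example would give a passive attack needing neither shadow models nor auxiliary classifiers, consistent with the efficiency claim above.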
Results
Experiments conducted on four benchmarks and four different LLMs demonstrated that ProjRes achieves nearly perfect accuracy in membership inference, outperforming existing methods by as much as 75.75%. The method also proved effective against strong differential privacy defenses, indicating a robust capability to expose membership information.
Implications
The findings of this research underscore the need for enhanced security measures in FedLLMs to protect against MIAs. The proposed ProjRes method could serve as a basis for developing more effective defenses and improving the overall privacy of federated learning systems.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
Reinforcement Learning
Computer Vision
Multimodal
- Introduction of a Propose-then-Critic framework for GUI grounding.
- Utilization of a co-evolutionary reinforcement learning strategy to enhance model capabilities.
- Dynamic maturity-aware mechanism to balance prediction accuracy and diversity.
- Significant improvements in grounding accuracy and critic reliability.
Read more
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
Summary
This paper addresses the challenge of Graphical User Interface (GUI) grounding, which involves mapping natural language instructions to precise pixel coordinates on screens. Traditional methods often struggle with localization due to visually homogeneous elements and dense layouts. The authors propose a novel Propose-then-Critic framework that replaces static self-consistency strategies with a learnable selection mechanism. This framework utilizes a co-evolving reinforcement learning paradigm that dynamically balances the training objectives of a proposer (which generates candidate coordinates) and a critic (which evaluates these proposals). The approach enhances the robustness of the critic through diverse outputs from the proposer, while the critic's improved discrimination capabilities allow the proposer to explore spatially more extensively. The proposed method significantly improves grounding accuracy and critic reliability across multiple benchmarks, demonstrating a relative improvement of up to 17.2% in grounding capability. This work highlights the potential of using visual feedback and co-evolutionary strategies to enhance the performance of models in complex GUI environments.
Methodology
The authors developed a Propose-then-Critic framework that transforms the GUI grounding task into a Visual Perception Ranking paradigm. This involves generating diverse candidate coordinates, visualizing them, and then using a critic to rank and select the best candidate. A co-evolutionary reinforcement learning strategy is employed to optimize the proposer and critic simultaneously, with a focus on balancing accuracy and diversity through a maturity-aware mechanism.
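The control flow of the Propose-then-Critic loop can be sketched in a few lines. This is only the inference-time skeleton — both `propose` and `critic` are learned models in the paper, and the toy stand-ins here (a uniform proposer and a distance-based critic) are assumptions for illustration:

```python
import random

def propose_then_critic(propose, critic, instruction, k=8, seed=0):
    """Sample k candidate click coordinates from a stochastic proposer,
    score each with the critic, and return the top-ranked candidate —
    the learnable selection mechanism that replaces static
    self-consistency voting."""
    rng = random.Random(seed)
    candidates = [propose(instruction, rng) for _ in range(k)]
    return max(candidates, key=lambda xy: critic(instruction, xy))

# Toy screen: the true target is at (120, 40); this mock critic simply
# rewards proximity to it (a real critic only sees the rendered screen).
target = (120, 40)
propose = lambda _ins, rng: (rng.randint(0, 200), rng.randint(0, 100))
critic = lambda _ins, xy: -((xy[0] - target[0]) ** 2 + (xy[1] - target[1]) ** 2)

best = propose_then_critic(propose, critic, "click the save button")
```

The co-evolution described above amounts to training both callables jointly: a more diverse proposer gives the critic harder comparisons, and a sharper critic lets the proposer explore more widely without losing accuracy.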
Results
The proposed method achieved a relative improvement of up to 17.2% in grounding capability across six benchmarks, significantly enhancing both the accuracy of generated coordinates and the reliability of the critic's evaluations.
Implications
This research has implications for the development of more effective autonomous GUI agents, improving their ability to interpret and execute user instructions in complex digital environments. The framework could be applied to various applications requiring precise interaction with graphical interfaces, such as automated testing tools, virtual assistants, and accessibility technologies.
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
NLP
Large Language Models
Efficient ML
- Identifies three failure modes in numerical reasoning: reasoning inefficiency, data scarcity for logical supervision, and header dependency.
- Introduces operation sketches to focus models on contextual reasoning rather than surface-level patterns.
- Combines operation sketches with header anonymization and self-supervised learning in the TaNOS framework.
- Achieves superior performance on FinQA with significantly less training data compared to traditional methods.
Read more
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
Summary
The paper addresses the challenges of numerical reasoning in expert-domain tables, which often show high in-domain accuracy but struggle with domain shifts. The authors introduce TaNOS, a continual pre-training framework designed to enhance the robustness and transferability of numerical reasoning. TaNOS comprises three key components: header anonymization to mitigate lexical memorization, operation sketches that provide minimal structural cues for reasoning, and self-supervised learning that generates correctness-guaranteed program-question pairs from tables. By decoupling domain semantics from numerical operation structures, TaNOS significantly improves performance on tasks like FinQA, achieving 80.13% execution accuracy with only 10% of the training data, surpassing traditional supervised fine-tuning (SFT) methods. The framework also demonstrates minimal performance degradation across domain shifts, highlighting its potential for robust generalization in numerical reasoning tasks.
Methodology
The methodology involves a continual pre-training framework (TaNOS) that integrates three components: (1) header anonymization to reduce lexical memorization, (2) operation sketches that provide minimal structural cues for reasoning, and (3) self-supervised learning to create program-question pairs from tables without manual annotation. This approach aims to enhance the model's ability to generalize across different domains by focusing on structural reasoning rather than lexical patterns.
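Of the three components, header anonymization is the most mechanical: replace domain-specific headers with positional placeholders so the model cannot lean on header lexicon. A minimal sketch (the placeholder format `col_i` is an assumption, not necessarily the paper's scheme):

```python
def anonymize_headers(table):
    """Swap table headers for positional placeholders (col_0, col_1, ...)
    so a model must reason over cell values and structure rather than
    memorizing header vocabulary. Returns the anonymized table and the
    mapping needed to interpret answers afterwards."""
    headers, *rows = table
    mapping = {h: f"col_{i}" for i, h in enumerate(headers)}
    return [list(mapping.values()), *rows], mapping

table = [["Revenue", "Net Income"],
         [120.5, 30.2],
         [98.7, 25.1]]
anon, mapping = anonymize_headers(table)
```

After anonymization, an operation sketch such as "subtract `col_1` from `col_0`, then divide" carries the reasoning structure while the mapping restores domain semantics at answer time.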
Results
TaNOS applied to an 8B instruction-tuned model achieved 80.13% execution accuracy on FinQA with only 10% of the training data, outperforming the SFT baseline (73.97%) trained on full data. In domain-shift experiments, TaNOS maintained a performance gap of less than 2 percentage points, while SFT showed over a 10 percentage point gap, indicating improved robustness and generalization.
Implications
The findings suggest that the TaNOS framework can be effectively applied to various expert-domain numerical reasoning tasks, enhancing the robustness and transferability of models in real-world applications such as finance, engineering, and biology. This approach could lead to more reliable AI systems capable of handling diverse and complex data structures.
GFlowState: Visualizing the Training of Generative Flow Networks Beyond the Reward
Generative Models
Interpretability
- GFlowState enhances the interpretability of GFlowNets by visualizing training dynamics.
- The system provides multiple interactive views for analyzing sampling behavior and policy evolution.
- Case studies show GFlowState's effectiveness in debugging and assessing GFlowNets.
- The tool addresses the gap in existing visualization methods for GFlowNet training.
Read more
GFlowState: Visualizing the Training of Generative Flow Networks Beyond the Reward
Summary
The paper introduces GFlowState, a visual analytics system aimed at enhancing the interpretability of Generative Flow Networks (GFlowNets) during their training process. GFlowNets are designed to generate samples in proportion to a reward function, making them valuable in applications such as molecular and material discovery. However, understanding their training dynamics has been challenging due to the complexity of the underlying structures and the combinatorial nature of the sample spaces. GFlowState addresses this issue by providing multiple interactive visualization views that allow users to analyze sampling trajectories, compare generated samples to reference datasets, and examine training dynamics. The system includes a ranking of generated objects based on their rewards, a state projection visualization, a directed acyclic graph of generated trajectories, and a transition heatmap indicating sampling probabilities. Through case studies, the authors demonstrate how GFlowState can assist in debugging and evaluating GFlowNets, thereby improving their development and application in various domains.
Methodology
The authors developed GFlowState through an iterative feedback process with GFlowNet developers, identifying key analysis tasks necessary for understanding GFlowNet training. The system incorporates various visualization techniques to represent sampling trajectories, sample space comparisons, and training dynamics effectively.
Results
The evaluation of GFlowState through case studies demonstrated its utility in revealing insights about GFlowNet training, such as identifying underexplored regions and sources of training failure. The visualizations provided a clearer understanding of how GFlowNets explore the sample space and adjust sampling probabilities.
Implications
GFlowState has the potential to significantly improve the interpretability and debugging of GFlowNets, facilitating their application in scientific discovery and other fields requiring generative modeling. By making training dynamics observable, it can help accelerate the development of more effective generative models.
Transferable Physics-Informed Representations via Closed-Form Head Adaptation
Theory
Optimization
Efficient ML
- Introduction of Pi-PINN framework for transferable physics-informed representations.
- Closed-form head adaptation significantly reduces computational costs for adapting to new PDE instances.
- Improved generalization across PDE families through shared embedding learning.
- Empirical results show substantial speed and accuracy improvements over traditional PINNs.
Read more
Transferable Physics-Informed Representations via Closed-Form Head Adaptation
Summary
This paper introduces a novel approach called Pi-PINN (Pseudoinverse Physics-Informed Neural Networks) aimed at enhancing the generalization capabilities of physics-informed neural networks (PINNs) for solving partial differential equations (PDEs). Traditional PINNs struggle with generalization to new PDE instances due to a lack of training data and slow training processes. Pi-PINN addresses these limitations by learning transferable physics-informed representations in a shared embedding space, allowing for rapid adaptation to new PDEs through closed-form head adaptation using a least-squares-optimal pseudoinverse. The authors explore the integration of data-driven multi-task learning losses with physics-informed losses, leading to improved performance. Empirical results demonstrate that Pi-PINN achieves predictions 100–1000 times faster than conventional PINNs and exhibits 10–100 times lower relative error compared to typical data-driven models, even with minimal training samples. This work highlights the potential of transferable representations in enhancing the efficiency and applicability of PINNs across various scientific and engineering domains.
Methodology
The authors developed the Pi-PINN framework, which decouples learning into a shared embedding for transferable structure and a task-specific output head. This allows for efficient adaptation to new PDE instances using a closed-form linear solve, avoiding the need for costly gradient-based optimization. The study also investigates the synergy between multi-task learning objectives and physics-informed residual losses.
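The closed-form head adaptation is a plain least-squares solve: with the embedding frozen, the optimal linear head for a new PDE instance is given by a pseudoinverse. A toy sketch with a hand-built feature map standing in for the learned embedding:

```python
import numpy as np

def adapt_head(Phi, u_targets):
    """Closed-form head adaptation: given frozen embedding features
    Phi (n_points x n_features) evaluated at collocation points, and
    target solution values there, return the least-squares-optimal
    linear head via the pseudoinverse — no gradient descent needed."""
    return np.linalg.pinv(Phi) @ u_targets

# Stand-in embedding: features [1, x, x^2] at 20 collocation points.
x = np.linspace(0.0, 1.0, 20)
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)
u = 2.0 - 3.0 * x + 0.5 * x ** 2      # samples from a "new PDE instance"
w = adapt_head(Phi, u)                # recovers [2.0, -3.0, 0.5] exactly
```

Because the solve is a single linear-algebra call, adapting to a new instance costs far less than retraining, which is the source of the 100–1000x speedup claimed above; the quality of the shared embedding determines whether the new solution actually lies near its span.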
Results
Pi-PINN demonstrated 100–1000 times faster prediction and adaptation compared to conventional PINNs, while achieving 10–100 times lower relative error than typical data-driven models, even with only two training samples. The framework was validated on various PDE problems, including Poisson's equation, Helmholtz equation, and Burgers' equation.
Implications
The findings suggest that Pi-PINN can significantly enhance the efficiency and generalization of PINNs, making them more applicable in real-world scenarios where rapid adaptation to new PDE instances is required. This could have broad implications for scientific and engineering applications involving complex physical phenomena.
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Large Language Models
NLP
Generative Models
- The mcdok system is designed for multi-domain detection of machine-generated code.
- It adapts the existing mdok approach for better code understanding.
- The system is evaluated across three subtasks: binary detection, authorship detection, and hybrid code detection.
- Results show competitive performance, but significant room for improvement remains.
Read more
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Summary
The paper presents the mcdok system developed for SemEval-2026 Task 13, which focuses on the detection of machine-generated code across multiple programming languages. This task is particularly challenging due to the advancements in large language models (LLMs) that blur the lines between human-written and machine-generated code. The authors adapted their previous mdok approach, which was designed for text detection, to better suit the nuances of code detection. The task consists of three subtasks: binary detection of machine-generated code, multi-class authorship detection, and hybrid code detection. The mcdok system utilizes various base models optimized for code understanding, achieving competitive results in all subtasks, although there remains a significant gap to the top-performing systems. The methodology includes a parameter-efficient finetuning process using QLoRA and a careful selection of training data to balance class distributions. The results indicate that while the mcdok system performs well, further improvements are necessary to close the performance gap with leading systems.
Methodology
The mcdok system is based on the mdok framework, utilizing parameter-efficient finetuning via QLoRA with 4-bit quantization. The authors selected suitable base models for each subtask, including Gemma-3-27B-PT for binary detection, CodeGemma-7B for authorship detection, and Qwen2.5-Coder-14B for hybrid code detection. The training process involved careful data selection and balancing to ensure robust performance across different classes.
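The class-balancing step is not specified in detail; one common way to realize it is downsampling every class to the size of the smallest one. A sketch under that assumption (the function name and record format are illustrative, not the authors' pipeline):

```python
import random
from collections import defaultdict

def balance_by_downsampling(examples, label_key="label", seed=0):
    """Downsample each class to the size of the smallest class — one
    simple scheme matching the 'careful data selection and balancing'
    described for mcdok (the authors' exact scheme is not specified)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex[label_key]].append(ex)
    n = min(len(b) for b in buckets.values())
    balanced = [ex for b in buckets.values() for ex in rng.sample(b, n)]
    rng.shuffle(balanced)
    return balanced

# Imbalanced toy corpus: 30 human-written vs. 10 machine-generated snippets.
data = ([{"code": f"h{i}", "label": "human"} for i in range(30)]
        + [{"code": f"m{i}", "label": "machine"} for i in range(10)])
out = balance_by_downsampling(data)    # 10 of each class, shuffled
```

For the binary-detection subtask this keeps the finetuning signal from being dominated by the majority class; an alternative with the same intent would be upweighting the minority class in the loss.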
Results
The mcdok system achieved competitive results in all three subtasks of SemEval-2026 Task 13, indicating its effectiveness in detecting machine-generated code. However, the gap to the top-performing systems remains significant, suggesting that further refinements are needed.
Implications
The findings from this research have implications for improving the detection of machine-generated code, which is increasingly relevant in software development and cybersecurity. Enhanced detection systems could help in identifying code authenticity and authorship, thereby supporting better code quality and security practices.
A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models
Generative Models
Time Series
Computer Vision
- Introduces a scale-adaptive framework for joint spatiotemporal super-resolution using diffusion models.
- Allows for the reuse of the same architecture across different spatial and temporal super-resolution factors.
- Decomposes the SR task into deterministic and stochastic components to enhance performance.
- Demonstrates effectiveness on precipitation data, a challenging application in climate science.
Read more
A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models
Summary
This paper presents a novel scale-adaptive framework for joint spatiotemporal super-resolution (SR) using diffusion models, specifically targeting climate applications such as precipitation data. Traditional deep learning methods for video super-resolution typically focus on either spatial or temporal resolution enhancement, often requiring separate architectures for different super-resolution factors. The proposed framework addresses this limitation by allowing the same model architecture to be reused across various spatial and temporal scales. This is achieved by decomposing the spatiotemporal SR task into two components: a deterministic prediction of the conditional mean using attention mechanisms and a residual conditional diffusion model that incorporates an optional mass-conservation transform. The framework adapts to different SR factors by retuning three hyperparameters before retraining: the diffusion noise schedule amplitude, the temporal context length, and the mass-conservation function. The effectiveness of this approach is demonstrated through experiments on reanalysis precipitation data over France, showcasing the model's ability to span super-resolution factors from 1 to 25 in space and 1 to 6 in time, thus providing a reusable architecture for joint spatiotemporal super-resolution across scales.
Methodology
The methodology involves decomposing the spatiotemporal super-resolution task into a deterministic predictor and a diffusion model for residuals. The framework adapts to different super-resolution factors by retuning hyperparameters related to diffusion noise, temporal context, and mass conservation, allowing for a single architecture to be effectively applied across various scales.
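The mass-conservation idea admits a very small illustration: when upsampling a precipitation field, spread each coarse value over its fine cells so the total mass is unchanged. A sketch of one such transform (the paper's exact transform is a tunable component and is not specified here):

```python
import numpy as np

def mass_conserving_upsample(lr, factor):
    """Nearest-neighbour upsample that divides each coarse-cell value
    evenly over its factor x factor fine cells, so the summed
    'precipitation mass' of the field is preserved — one simple instance
    of the optional mass-conservation transform in the framework."""
    hr = np.repeat(np.repeat(lr, factor, axis=0), factor, axis=1)
    return hr / factor ** 2

lr = np.array([[4.0, 8.0],
               [0.0, 16.0]])            # coarse precipitation field
hr = mass_conserving_upsample(lr, 2)    # 4x4 field with the same total mass
```

The diffusion model then only has to sample plausible residual structure within each coarse cell; the constraint guarantees the ensemble members stay consistent with the low-resolution input, which matters for downstream uses like flood forecasting.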
Results
The proposed framework successfully demonstrated the ability to handle super-resolution factors ranging from 1 to 25 in space and 1 to 6 in time, producing realistic ensembles of precipitation data. This indicates that the model can effectively capture the complexities of spatiotemporal dynamics in environmental datasets.
Implications
The framework has significant implications for climate science, particularly in applications requiring high-resolution precipitation data for impact modeling, such as flood forecasting. Its adaptability allows for efficient model deployment across different datasets without the need for extensive redesign.
SGD at the Edge of Stability: The Stochastic Sharpness Gap
Optimization
Theory
- Introduces stochastic self-stabilization to explain sharpness suppression in mini-batch SGD.
- Derives a closed-form expression for the equilibrium sharpness gap that depends on batch size and gradient noise.
- Demonstrates that smaller batch sizes lead to flatter solutions in the loss landscape.
- Experimental validation on MLPs, CNNs, and ResNets shows quantitative agreement with theoretical predictions.
Read more
SGD at the Edge of Stability: The Stochastic Sharpness Gap
Summary
This paper investigates the behavior of mini-batch Stochastic Gradient Descent (SGD) in relation to the sharpness of neural network loss landscapes, particularly focusing on the phenomenon termed the Edge of Stability (EoS). Previous work established that full-batch Gradient Descent (GD) exhibits a self-stabilization mechanism that maintains sharpness at a threshold of 2/η, where η is the step size. However, for mini-batch SGD, sharpness stabilizes below this threshold, and the authors seek to explain this suppression. They introduce the concept of stochastic self-stabilization, which posits that gradient noise contributes additional variance to the dynamics of SGD, leading to a lower equilibrium sharpness. The authors derive a closed-form expression for the sharpness gap, demonstrating that it depends on batch size and gradient noise variance. Their theoretical findings are supported by experimental results on various neural network architectures, confirming the predictions regarding sharpness behavior during training.
Methodology
The authors extend the self-stabilization framework from full-batch GD to mini-batch SGD by analyzing the effects of gradient noise on sharpness dynamics. They define stochastic predicted dynamics and prove a coupling theorem that relates the true SGD trajectory to a constrained trajectory. The analysis involves deriving a closed-form expression for the sharpness gap based on the progressive sharpening rate, self-stabilization strength, and gradient noise variance.
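Sharpness here means the top eigenvalue of the loss Hessian, which is standardly estimated by power iteration. A toy sketch on an explicit diagonal Hessian (in practice one uses Hessian-vector products rather than the full matrix):

```python
import numpy as np

def sharpness(H, iters=200, seed=0):
    """Top Hessian eigenvalue via power iteration — the 'sharpness'
    whose equilibrium the paper characterizes. At the edge of stability,
    full-batch GD holds this value near 2/eta; mini-batch SGD settles
    below it by the batch-size- and noise-dependent gap derived above."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    for _ in range(iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return v @ H @ v                    # Rayleigh quotient at convergence

H = np.diag([0.5, 1.5, 3.0])            # toy loss Hessian
lam = sharpness(H)                      # top eigenvalue, 3.0
eta = 2.0 / lam                         # step size sitting exactly at 2/eta
```

Tracking this quantity along a training run is how the paper's predicted sharpness gaps are compared against the observed behavior of MLPs, CNNs, and ResNets.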
Results
The paper presents a closed-form expression for the equilibrium sharpness gap, showing that it is influenced by the step size, batch size, and gradient noise. Experimental results confirm that the predicted sharpness gaps align with observed behaviors in various neural network architectures, validating the theoretical framework.
Implications
The findings suggest that understanding the dynamics of sharpness in SGD can lead to improved optimization strategies for training neural networks, particularly in selecting appropriate batch sizes and step sizes to enhance generalization performance.
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
Time Series
- Temporal taskification is a structural component of evaluation in streaming CL.
- Different valid splits of the same data stream can lead to varying CL regimes and performance metrics.
- The proposed framework allows for efficient diagnosis of taskification robustness before model training.
- Shorter taskifications result in noisier patterns and greater sensitivity to boundary perturbations.
Read more
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
Summary
This paper investigates the impact of temporal taskification in Streaming Continual Learning (CL), arguing that the way continuous data streams are partitioned into discrete tasks significantly influences evaluation outcomes. The authors introduce a taskification-level framework that incorporates plasticity and stability profiles, profile distance, and Boundary-Profile Sensitivity (BPS) to assess the robustness of different task splits before training models. They conduct experiments on the CESNET-Timeseries24 dataset using various CL methods, revealing that different temporal splits lead to substantial variations in forecasting errors, forgetting rates, and backward transfer. The findings highlight that taskification is not merely a preprocessing step but a critical evaluation variable that can alter benchmark conclusions in CL.
Methodology
The authors developed a taskification-level framework that includes plasticity and stability profiles, profile distance, and Boundary-Profile Sensitivity (BPS) to evaluate the robustness of different temporal task splits. They conducted experiments on the CESNET-Timeseries24 dataset, applying continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting across various temporal splits (9, 30, and 44 days).
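The core idea, that the same stream admits many valid task splits and that per-task statistics move when boundaries move, can be sketched in a few lines. This is a hedged toy illustration: the function names are hypothetical, and the task mean stands in for the paper's richer plasticity/stability profiles:

```python
import numpy as np

# Toy sketch of taskification sensitivity (illustrative names, not the
# paper's API): partition a stream into tasks at given boundaries,
# compute a per-task "profile" (here simply the task mean), and measure
# how far the profile moves when the boundaries are perturbed.

def taskify(stream, boundaries):
    """Partition a 1-D stream into tasks at the given boundary indices."""
    edges = [0, *boundaries, len(stream)]
    return [stream[a:b] for a, b in zip(edges, edges[1:])]

def profile(tasks):
    """One summary statistic per task (a stand-in for richer profiles)."""
    return np.array([t.mean() for t in tasks])

def boundary_sensitivity(stream, boundaries, shift=5):
    """Profile distance between a split and its boundary-shifted variant."""
    base = profile(taskify(stream, boundaries))
    shifted = profile(taskify(stream, [b + shift for b in boundaries]))
    return float(np.abs(base - shifted).mean())

rng = np.random.default_rng(0)
# Piecewise-stationary toy stream: two regimes with different means.
stream = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])

sens = boundary_sensitivity(stream, [100], shift=5)
print(sens)
```

A nonzero sensitivity here mirrors the paper's point: even a small, equally "valid" boundary perturbation changes what each task looks like, before any model is trained.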
Results
The experiments demonstrated that variations in temporal taskification led to significant differences in forecasting error, forgetting, and backward transfer. Specifically, shorter taskifications were associated with noisier distribution-level patterns and higher BPS, indicating increased sensitivity to perturbations in task boundaries.
Implications
The findings suggest that researchers and practitioners in continual learning should carefully consider the taskification process when designing experiments and interpreting results. This work encourages the integration of taskification as a first-class evaluation variable in CL benchmarks, potentially leading to more reliable and valid conclusions in future studies.
Frequency-Forcing: From Scaling-as-Time to Soft Frequency Guidance
Generative Models
Computer Vision
- Frequency-Forcing combines hard and soft frequency guidance for improved image generation.
- The method utilizes a self-forcing signal derived from data, avoiding reliance on pretrained models.
- Frequency-Forcing consistently outperforms strong baselines in FID scores on ImageNet-256.
- The approach maintains compatibility with standard flow-matching architectures.
Read more
Frequency-Forcing: From Scaling-as-Time to Soft Frequency Guidance
Summary
This paper introduces Frequency-Forcing, a novel approach to image generation that combines the strengths of two existing paradigms: K-Flow and Latent Forcing. While K-Flow imposes a hard frequency constraint by treating frequency scaling as flow time, Latent Forcing offers a soft ordering mechanism by coupling pixel flow with an auxiliary semantic latent flow. Frequency-Forcing innovatively applies a soft frequency guidance mechanism by utilizing a self-forcing signal derived from a lightweight learnable wavelet packet transform, rather than relying on a heavy pretrained encoder. This method allows for a more adaptable and efficient generation process, maintaining the core flow coordinate while improving the quality of pixel generation. The authors demonstrate that Frequency-Forcing consistently enhances FID scores on ImageNet-256 compared to strong baselines, and it can be effectively combined with semantic streams for further improvements, showcasing its versatility as a path-preserving alternative to hard frequency flows.
Methodology
The methodology involves a novel frequency-guided generation process where a standard pixel flow is augmented by an auxiliary low-frequency stream derived from a learnable wavelet packet transform. This self-forcing signal enables the model to adaptively learn a basis that is better suited to the data statistics, thus improving the generation quality without the need for heavy external dependencies.
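To make the low-frequency stream concrete, the sketch below uses a fixed one-level Haar split as a simplified stand-in for the paper's *learnable* wavelet packet transform: the analysis step separates a signal into low- and high-frequency bands, and the low band is the kind of signal that can serve as soft frequency guidance:

```python
import numpy as np

# Minimal sketch, assuming a fixed Haar basis as a stand-in for the
# paper's learnable wavelet packet transform: one analysis level splits
# a signal into low- and high-frequency bands with perfect
# reconstruction; the low band plays the role of the guidance stream.

def haar_split(x):
    """One level of a Haar wavelet transform: returns (low, high) bands."""
    x = np.asarray(x, dtype=float)
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return low, high

def haar_merge(low, high):
    """Inverse transform: exact reconstruction from the two bands."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x

signal = np.array([1.0, 2.0, 3.0, 4.0])
low, high = haar_split(signal)
recon = haar_merge(low, high)
```

In the paper's setting the filter coefficients are learned rather than fixed, which is what lets the basis adapt to the data statistics.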
Results
The experimental results indicate that Frequency-Forcing achieves significant improvements in FID scores over existing pixel and latent-space baselines on the ImageNet-256 dataset. The method also demonstrates the ability to naturally integrate with semantic streams, yielding additional performance gains.
Implications
The findings suggest that Frequency-Forcing could be a valuable technique in generative modeling, particularly in applications requiring high-quality image synthesis. Its flexibility and efficiency make it suitable for various tasks in computer vision, potentially leading to advancements in areas such as image editing, restoration, and other generative applications.
Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation
Reinforcement Learning
Robotics
Interpretability
- MTRL networks share the vast majority of their weights across tasks, indicating effective knowledge transfer.
- Only a small fraction of network weights are task-specific, suggesting minimal specialization is needed for individual objectives.
- Context variables play a crucial role in enabling the network to differentiate between related tasks.
Read more
Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation
Summary
This paper addresses the challenges faced by autonomous underwater vehicles (AUVs) in performing multiple navigation tasks under dynamic and uncertain conditions. Traditional control methods struggle with these complexities, necessitating robust and interpretable control policies. The authors explore multi-task reinforcement learning (MTRL) as a solution, which utilizes shared representations to enhance adaptability across tasks. However, existing MTRL approaches lack transparency in their decision-making processes. To investigate this, the authors analyze a pretrained MTRL network in the HoloOcean simulator, focusing on task-specific subnetworks that facilitate navigation towards different marine species. Their findings reveal that only 1.5% of the network's weights are used to differentiate tasks, with 85% of these weights connecting context variables to the hidden layers. This highlights the critical role of context in task differentiation. The study contributes to understanding shared and specialized components in MTRL, paving the way for improved model editing, transfer learning, and continual learning for underwater monitoring.
Methodology
The authors employed a pretrained Double DQN value network for underwater navigation tasks. They pruned the network to identify task-specific subnetworks and analyzed the overlap between these subnetworks to understand shared and task-specific components. Initial experiments were conducted in MiniGrid to establish methodological feasibility before extending the analysis to underwater navigation tasks in the HoloOcean simulator.
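The overlap analysis can be illustrated with a toy version of the pipeline: magnitude-prune a weight matrix once per task to obtain binary subnetwork masks, then measure how much the surviving weights coincide. The pruning criterion and threshold below are simplifications, not the paper's exact procedure:

```python
import numpy as np

# Hedged illustration of subnetwork-overlap analysis: derive a binary
# mask per task by magnitude pruning, then compute the Jaccard overlap
# of the two subnetworks. A simplified stand-in for the paper's setup.

def task_mask(weights, keep_frac=0.2):
    """Binary mask keeping the largest-magnitude fraction of weights."""
    flat = np.abs(weights).ravel()
    k = max(1, int(keep_frac * flat.size))
    thresh = np.sort(flat)[-k]
    return np.abs(weights) >= thresh

def jaccard_overlap(mask_a, mask_b):
    """Shared surviving weights divided by the union of survivors."""
    shared = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return shared / union

rng = np.random.default_rng(0)
w_task_a = rng.normal(size=(8, 8))
w_task_b = w_task_a + 0.01 * rng.normal(size=(8, 8))  # near-identical tasks

jaccard = jaccard_overlap(task_mask(w_task_a), task_mask(w_task_b))
print(jaccard)
```

A high overlap between task masks corresponds to the paper's finding that most weights encode shared knowledge, with only a small task-specific remainder.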
Results
The analysis showed that MTRL networks share the vast majority of their weights across tasks, with only about 1.5% dedicated to task-specific functions, confirming effective knowledge sharing. Of those task-specific weights, 85% connect context variables to the hidden layers, underscoring the central role of context in task differentiation.
Implications
The findings have significant implications for the development of safer and more reliable autonomous navigation policies for underwater applications. Understanding the internal structure of MTRL networks can enhance interpretability, trust, and safety in real-world deployments of AUVs.
TabSHAP
Large Language Models
Interpretability
- Introduces TabSHAP, a model-agnostic interpretability framework for LLM-based tabular classifiers.
- Utilizes Jensen-Shannon divergence for distributional attribution, capturing shifts in model confidence.
- Implements feature-level atomic masking to maintain prompt syntax and semantic integrity.
- Demonstrates significantly higher faithfulness in feature attribution compared to existing methods.
Read more
TabSHAP
Summary
TabSHAP is a novel interpretability framework designed for Large Language Models (LLMs) fine-tuned on serialized tabular data, addressing the critical challenge of interpretability in high-stakes domains. Traditional methods often fail to provide faithful attributions for LLM-based tabular classifiers, relying instead on global proxies or simplistic probability shifts. TabSHAP introduces a Shapley-style sampled-coalition estimator that utilizes Jensen-Shannon divergence to measure the distributional impact of features, rather than merely tracking prediction changes. This approach allows for feature-level atomic masking, preserving the integrity of serialized prompts by removing entire key-value pairs instead of individual tokens. Additionally, TabSHAP employs a verbalizer-style class aggregation to ensure stable attribution scores across various token representations. Experimental validation on the Adult Income and Heart Disease datasets demonstrates that TabSHAP significantly outperforms random baselines and XGBoost proxies in terms of faithfulness, effectively isolating critical features that influence model decisions. The framework not only enhances the interpretability of LLMs in tabular contexts but also bridges the gap between their advanced capabilities and the interpretability requirements necessary for responsible deployment in sensitive applications.
Methodology
TabSHAP employs a Shapley-style sampled-coalition estimator to assess feature importance through Jensen-Shannon divergence, comparing the model's output probability distributions before and after feature removal. It masks features at the level of serialized key:value pairs to preserve prompt integrity and aggregates class probabilities using a verbalizer-style approach.
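The distributional-attribution step can be sketched directly: score each feature by the Jensen-Shannon divergence between the class distribution with the feature present and with its key:value pair atomically masked. The toy "model" below is a hypothetical stand-in for an LLM classifier; only the JS scoring and atomic masking mirror the described mechanism:

```python
import numpy as np

# Sketch of JS-divergence-based feature attribution. The toy_model is a
# made-up stand-in for an LLM tabular classifier; the scoring loop
# illustrates atomic key:value masking plus distributional comparison.

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two class distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def toy_model(features):
    """Stand-in classifier: confident only when 'income' is present."""
    return [0.9, 0.1] if "income" in features else [0.5, 0.5]

full = {"age": 39, "income": 50_000}
score = {}
for key in full:
    masked = {k: v for k, v in full.items() if k != key}  # atomic masking
    score[key] = js_divergence(toy_model(full), toy_model(masked))

# Masking 'income' shifts the whole distribution, so it scores highest.
print(score)
```

A full Shapley-style estimator would average such divergences over sampled feature coalitions rather than masking one feature at a time, but the per-coalition comparison is the same.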
Results
Experimental results on the Adult Income and Heart Disease benchmarks show that TabSHAP achieves higher faithfulness in feature attribution than random baselines and XGBoost proxies. The method effectively identifies critical features that lead to significant drops in model confidence when removed, highlighting the differences in decision logic between LLMs and traditional tree-based models.
Implications
TabSHAP's framework enhances the interpretability of LLMs in tabular data contexts, making it suitable for deployment in high-stakes fields such as healthcare and finance, where understanding model decisions is crucial for trust and accountability.