AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Reversal Q-Learning
Reinforcement Learning
Generative Models
Robotics
- RQL utilizes an expanded MDP framework to treat flow refinement steps as separate actions.
- The algorithm generates virtual on-policy trajectories by reversing flows, enabling off-policy learning.
- RQL effectively reduces the curse of horizon through bias-and-variance reduction techniques.
- Experimental results show RQL achieves superior performance in offline RL compared to existing methods.
Summary
This paper introduces Reversal Q-Learning (RQL), a novel off-policy reinforcement learning (RL) algorithm that leverages iterative generative modeling techniques, particularly flow matching, to enhance offline RL performance. The authors propose an expanded Markov decision process (MDP) framework that treats individual flow refinement steps as separate actions, allowing for effective off-policy learning using prior data. RQL addresses challenges associated with traditional flow-based RL methods, such as instability from backpropagation through time and inefficient use of learned value functions. By generating virtual on-policy trajectories through 'reversing' flows and applying a bias-and-variance reduction technique, RQL mitigates the curse of horizon problem in off-policy RL. The authors validate their approach through extensive experiments on 50 simulated robotic tasks, demonstrating that RQL outperforms existing state-of-the-art flow-based offline RL algorithms, particularly in long-horizon manipulation and locomotion tasks.
Methodology
The methodology involves creating an expanded MDP framework where flow refinement steps are treated as distinct actions. RQL generates virtual trajectories by reconstructing flow trajectories from existing state-action pairs in the dataset. This is achieved through reverse flows, allowing for unbiased return estimates. The algorithm then applies multi-step returns to these trajectories to reduce the effective horizon for value function learning.
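The multi-step return step can be illustrated with a generic n-step bootstrapped target. This is a minimal sketch of the standard n-step return, not the paper's exact estimator; the function name, the discount value, and the toy numbers are assumptions made for illustration.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus a discounted bootstrap value.

    G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V(s_n)
    """
    g = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Toy example (hypothetical numbers): three rewards, then a value estimate.
target = n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
print(target)  # 2.75
```

Using a longer reward sequence before bootstrapping means the value estimate is propagated across fewer backups, which is the usual motivation for multi-step targets in long-horizon tasks.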
Results
RQL was tested across 50 challenging simulated robotic tasks, where it consistently outperformed several strong off-policy flow-based RL baselines, achieving the best average offline RL performance. The results indicate that RQL is particularly robust in long-horizon manipulation and locomotion environments.
Implications
The development of RQL has significant implications for offline reinforcement learning, particularly in robotics and complex task environments. By improving the efficiency and stability of training flow policies, RQL can enhance the applicability of RL in real-world scenarios where data is limited and sample efficiency is crucial.
Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization
Theory
- Introduces the Christoffel function as a key tool for understanding outlier removal in robust learning.
- Establishes a margin-degree tradeoff that explains the logarithmic margin requirement for effective learning.
- Demonstrates that degree-2 certificates have a fundamental barrier in handling adversarial noise.
- Develops a degree-2t algorithm that improves robustness against noise while maintaining efficiency.
Summary
This paper addresses the challenge of robust halfspace learning in the presence of adversarial noise, specifically focusing on the limitations of low-degree moment-based certificates in outlier removal. The author introduces a novel perspective using the Christoffel function to characterize the boundaries of corruption that can be hidden from bounded-degree tests. The main contribution is the establishment of a resolution principle that links the Sum-of-Squares (SoS) degree of outlier-removal certificates to the Christoffel function of the clean marginal. This principle leads to three significant consequences: a margin-degree tradeoff indicating that certifying dense data incurs a logarithmic cost in terms of margin, a degree-2 outlier barrier demonstrating that degree-2 certificates are limited in their effectiveness, and the development of a degree-2t algorithm that can effectively manage the corruption rate. The findings highlight the inherent limitations of existing methods and provide a clearer understanding of the relationship between the degree of the certificate and the robustness of the learning process.
Methodology
The paper employs a theoretical framework based on the Christoffel function to analyze the effectiveness of low-degree moment-based certificates in robust halfspace learning. It utilizes a combination of mathematical analysis and algorithmic design to derive the margin-degree tradeoff and develop a new degree-2t algorithm for outlier removal.
Results
The analysis reveals that the maximum corruption mass that can be hidden from a degree-2t certificate is precisely characterized by the Christoffel function. The paper also establishes that certifying dense data requires a SoS degree that grows logarithmically with the desired error rate, and it identifies a degree-2 barrier that limits the effectiveness of certain outlier-removal strategies.
Implications
The findings suggest that existing robust learning methods may need to be re-evaluated in light of the identified barriers, and they provide a theoretical foundation for developing more effective algorithms that can handle higher levels of adversarial noise. This work could influence future research in robust machine learning and outlier detection.
Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning
Reinforcement Learning
Optimization
Theory
- Introduces a framework for fairness in multi-policy MORL, allowing for a diverse set of policies based on varying user preferences.
- Theoretical analysis confirms that fair policies for certain welfare functions remain in the convex coverage set (CCS).
- Demonstrates that non-stationary and stochastic policies can improve fairness over traditional stationary and deterministic approaches.
- Presents three scalable methods for learning fair policies using a single parameterized network.
Summary
This paper addresses the challenge of fairness in multi-objective reinforcement learning (MORL), where policies must balance optimality and equity across conflicting objectives. The authors propose a novel framework that enables the learning of a set of fair policies, overcoming the limitations of existing single-policy methods that cater to fixed user preferences. The theoretical analysis shows that for concave and piecewise-linear welfare functions, fair policies remain within the convex coverage set (CCS). The authors demonstrate that non-stationary and stochastic policies can enhance fairness compared to stationary and deterministic policies. They introduce three scalable methods: an extension of Envelope for fair stationary policies, a non-stationary method using state-augmented accrued rewards, and a novel stochastic policy learning approach. Extensive experiments across three domains validate the effectiveness of these methods, showing that they yield fairer solutions compared to existing MORL baselines.
Methodology
The authors develop a framework for multi-policy MORL that learns a set of Pareto-optimal policies. They provide theoretical insights into the behavior of fair policies under specific welfare functions and propose three algorithms: an extension of Envelope for stationary policies, a non-stationary method incorporating state-augmented accrued rewards, and a stochastic policy learning method. These methods leverage a single parameterized network to maintain scalability.
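One familiar example of a concave, piecewise-linear welfare function is the generalized Gini function, which weights the worst-off objective most heavily. The summary does not say which welfare functions the authors use, so this choice, the function name, and the weights below are assumptions for illustration only.

```python
def generalized_gini_welfare(returns, weights):
    """Weighted sum of objective returns sorted from worst to best.

    With non-increasing weights the worst-off objective counts most,
    which makes the function concave and piecewise linear in `returns`.
    """
    if len(returns) != len(weights):
        raise ValueError("returns and weights must have the same length")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    return sum(w * r for w, r in zip(weights, sorted(returns)))


# Two hypothetical policies with the same total return (12.0).
weights = [0.5, 0.3, 0.2]
balanced = generalized_gini_welfare([4.0, 4.0, 4.0], weights)
skewed = generalized_gini_welfare([10.0, 1.0, 1.0], weights)
print(balanced > skewed)  # True: the balanced policy is judged fairer
```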
Results
The proposed methods were validated through extensive experiments across three different domains. The results showed that the methods successfully learned a set of fair policies that accommodate different user preferences, outperforming existing MORL baselines in terms of fairness.
Implications
The findings suggest that the proposed framework can be applied in various domains where fairness is critical, such as healthcare, finance, and social decision-making, enabling more equitable outcomes across multiple objectives.
Beyond the Blood Draw: Explainable Machine Learning for Non-Invasive Dysglycemia Risk Screening
Interpretability
- Developed non-invasive ML models for dysglycemia risk screening using NHANES data.
- LightGBM model achieved the highest AUC of 0.820, outperforming established clinical risk scores.
- SHAP analysis identified key predictors such as age, race/ethnicity, and waist-to-height ratio.
- Demonstrated consistent model performance across demographic subgroups.
Summary
This paper addresses the challenge of diagnosing dysglycemia, which includes prediabetes and diabetes, through non-invasive methods. The authors developed and validated machine learning models that utilize a strictly non-invasive feature set, derived from the National Health and Nutrition Examination Survey (NHANES) data from 2017 to 2023, involving 14,352 participants. Six machine learning models were trained using stratified 5-fold cross-validation and compared against two established clinical risk scores: the Finnish Diabetes Risk Score (FINDRISC) and the American Diabetes Association Risk Test. The LightGBM model outperformed both clinical scores, achieving an area under the receiver operating characteristic curve (AUC) of 0.820, compared to 0.745 and 0.783 for FINDRISC and ADA Risk Test, respectively. SHAP analysis revealed that age, race/ethnicity, and waist-to-height ratio were the most significant predictors of dysglycemia risk. The study also assessed algorithmic fairness across various demographic groups, confirming consistent model performance across different strata. These findings suggest that explainable, laboratory-free dysglycemia screening could be effectively implemented in community settings and self-tracking health applications.
Methodology
The study used data from NHANES, which employs a stratified, multistage probability sampling design. Six machine learning models were trained on a strictly non-invasive feature set comprising self-reported data, basic anthropometry, and home blood pressure measurements. The models were validated using stratified 5-fold cross-validation, and SHAP was applied for model interpretability. Algorithmic fairness was evaluated across demographic subgroups.
Results
The LightGBM model achieved an AUC of 0.820 (95% CI: 0.806–0.835), outperforming the FINDRISC (0.745) and ADA Risk Test (0.783). Subgroup analyses showed consistent performance across demographic strata, with AUC values ranging from 0.735 to 0.832. SHAP analysis highlighted age, race/ethnicity, and waist-to-height ratio as significant predictors.
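For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the function name and the toy labels and scores are hypothetical, and the study itself presumably used a library implementation.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = dysglycemia, 0 = normal glycemia.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form is quadratic in the number of samples, so it is meant for explanation rather than for a cohort of this size.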
Implications
The findings suggest that non-invasive, explainable machine learning models can facilitate early dysglycemia detection, particularly in resource-limited settings. This approach could enhance community health initiatives and empower individuals to monitor their health through self-tracking applications.
Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects
Computer Vision
Efficient ML
Interpretability
- Introduction of a confusion-aware scoring function that improves difficulty ranking.
- Development of evaluation methods to separately assess scoring and pacing effects.
- Demonstration that confusion-aware curriculum ordering enhances data efficiency.
- Findings indicate that improving scoring alone is insufficient to overcome the limitations of the Transfer Teacher framework (TTF).
Summary
This paper addresses the challenges in curriculum learning (CL) by disentangling the effects of scoring and pacing in the Transfer Teacher framework (TTF). The authors introduce a confusion-aware difficulty scoring function that considers both the confidence in the correct class and the distribution of probabilities across incorrect classes. They validate this scoring function through two evaluation protocols: stage-wise test subsets and a pacing-isolated baseline. Experiments conducted on CIFAR-10 using ResNet-18 and VGG-16 demonstrate that while the confusion-aware scoring aligns with human intuition, merely improving the scoring function does not enhance accuracy over standard training. However, confusion-aware curriculum ordering shows significant data-efficiency benefits, outperforming random ordering by up to 8.7% in a 20% data regime. The findings suggest that TTF can be a promising method for data-efficient training despite its limitations in generalization.
Methodology
The authors propose a confusion-aware difficulty score that combines the model's confidence in the true class and the variance of the probability distribution over incorrect classes. They validate the scoring function using stage-wise test subsets and a pacing-isolated baseline. The methodology includes training teacher models (ResNet-18 and VGG-16) on CIFAR-10 and evaluating the effects of scoring and pacing on model performance.
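A score of this kind might look like the sketch below. The summary only says that the score combines true-class confidence with the variance of the probabilities over incorrect classes; the additive form, the `alpha` weight, and the function name are assumptions, not the paper's formula.

```python
from statistics import pvariance


def confusion_aware_difficulty(probs, true_class, alpha=1.0):
    """Hypothetical difficulty score: low confidence plus concentrated confusion.

    `probs` is a predicted class distribution with at least two classes.
    The first term grows as confidence in the true class drops; the second
    grows when the remaining mass piles onto a few incorrect classes.
    """
    p_true = probs[true_class]
    others = [p for i, p in enumerate(probs) if i != true_class]
    return (1.0 - p_true) + alpha * pvariance(others)


# Same confidence in the true class (0.6), different confusion patterns.
spread = confusion_aware_difficulty([0.6, 0.2, 0.2], true_class=0)
focused = confusion_aware_difficulty([0.6, 0.4, 0.0], true_class=0)
print(focused > spread)  # True under this assumed form
```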
Results
The confusion-aware scoring function produced interpretable difficulty rankings that aligned with human intuition. However, at full data, neither curriculum nor anti-curriculum ordering improved accuracy over standard training. In contrast, confusion-aware curriculum ordering led to consistent data-efficiency improvements, outperforming random ordering by up to 8.7% in the 20% data regime.
Implications
The findings suggest that curriculum learning can be enhanced by considering confusion in class predictions, which may lead to more effective training strategies in various machine learning applications. The proposed framework could be applied to improve data efficiency in training models across different domains.
ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training
Large Language Models
Efficient ML
- ReQAT addresses the accuracy degradation caused by 4-bit quantization in Large Reasoning Models (LRMs).
- The framework includes innovative techniques such as TAQ, SEM, and Q-FIT to enhance reasoning accuracy.
- ReQAT achieves higher accuracy than BF16 fine-tuning while maintaining the same training budget.
- Significant throughput improvements (up to 3.9×) are observed on NVIDIA hardware, facilitating efficient LRM inference.
Summary
The paper introduces ReQAT, a novel training framework designed to enhance the performance of Large Reasoning Models (LRMs) when using 4-bit floating-point (FP4) quantization. Traditional quantization methods often lead to significant accuracy degradation, particularly in low-entropy tokens like digits and operators, which are crucial for reasoning tasks. ReQAT addresses this issue through three innovative components: Trace-Aligned QAT (TAQ), which focuses training updates on critical low-entropy decisions; Selective Entropy Minimization (SEM), which enhances model confidence at these positions; and Q-FIT, a quantization-friendly initialization method that stabilizes key-value (KV) cache transformations. The authors demonstrate that ReQAT not only recovers but also surpasses the accuracy of full-precision fine-tuning (BF16) while achieving substantial throughput improvements on NVIDIA hardware. This makes ReQAT a promising approach for efficient deployment of LRMs in real-world applications.
Methodology
ReQAT employs a reasoning-centric training framework that includes Trace-Aligned QAT (TAQ) to focus on critical low-entropy decisions, Selective Entropy Minimization (SEM) to reinforce model confidence at low-entropy positions, and Q-FIT for quantization-friendly initialization of KV cache transformations. The framework is evaluated on various LRMs and benchmarks, including AIME-120, MATH-500, and GSM8K.
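To make "low-entropy positions" concrete, the sketch below computes the Shannon entropy of each next-token distribution and keeps the positions that fall under a threshold. How TAQ and SEM actually select positions is not described in the summary; the threshold value and function names here are assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def low_entropy_positions(step_probs, threshold=0.5):
    """Indices of positions whose entropy is below `threshold` (hypothetical cutoff)."""
    return [
        i for i, probs in enumerate(step_probs)
        if token_entropy(probs) < threshold
    ]


# Toy sequence: a near-certain token (e.g. a digit) and an uncertain one.
steps = [[0.97, 0.02, 0.01], [0.4, 0.3, 0.3]]
print(low_entropy_positions(steps))  # [0]
```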
Results
ReQAT consistently surpasses BF16 full fine-tuning accuracy, achieving 65.94% AIME accuracy under NVFP4 W4A4KV4 compared to 56.83% for the BF16 baseline. Additionally, it demonstrates a 3.1× throughput speedup on B200 and 3.9× on DGX Spark relative to BF16 baselines, indicating significant efficiency gains.
Implications
ReQAT presents a practical solution for deploying Large Reasoning Models in production environments, enabling efficient inference with high throughput while maintaining accuracy. This has potential applications in various fields requiring complex reasoning capabilities, such as natural language processing, automated decision-making systems, and real-time data analysis.
HawkesNest: A Multi-Axis Synthetic Benchmark for Spatiotemporal Pattern Complexity
Time Series
Theory
Generative Models
- Introduction of HawkesNest as a synthetic benchmark for spatiotemporal point process (STPP) models.
- Establishment of four axes of spatiotemporal complexity with deterministic indices.
- Demonstration of model sensitivity to complexity through diagnostic tests.
- Validation of monotonicity and near-orthogonality of complexity indices.
Summary
The paper introduces HawkesNest, a novel synthetic benchmark designed to evaluate spatiotemporal point process (STPP) models. Traditional evaluations often rely on real-world datasets, which obscure the underlying generative structures, making it challenging to diagnose model failures. HawkesNest addresses this by establishing a generator-aligned benchmark that systematically controls for four axes of spatiotemporal complexity: space-time entanglement, background heterogeneity, cross-type interaction, and domain topology. Each axis is associated with a deterministic index derived from the latent data-generating mechanism. The authors demonstrate that these indices exhibit monotonicity and near-orthogonality, allowing for effective diagnostic stress tests of STPP models. The paper illustrates the utility of HawkesNest by showing that existing Hawkes-family models degrade in performance under conditions of joint heterogeneity and entanglement complexity, while the AutoSTPP model shows sensitivity to increases in space-time entanglement. This work provides a structured approach to benchmarking STPP models, enhancing the understanding of their performance under controlled conditions.
Methodology
The authors developed a modular multivariate Hawkes backbone that allows for the composition of background, triggering, interaction, and domain components while maintaining global rate and stability constraints. They conducted controlled sweeps to validate the indices associated with the complexity axes and tested various STPP models against these benchmarks.
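As background, the simplest member of the Hawkes family is a univariate temporal process with an exponential triggering kernel, where each past event temporarily raises the event rate. The sketch below is that textbook case only, not the paper's multivariate spatiotemporal backbone; parameter names and defaults are assumptions.

```python
import math


def hawkes_intensity(t, history, mu=0.2, alpha=0.5, beta=1.0):
    """Conditional intensity of a univariate exponential-kernel Hawkes process.

    lambda(t) = mu + sum over past events t_i < t of
                alpha * beta * exp(-beta * (t - t_i))

    `alpha` is the branching ratio (expected offspring per event) and must
    stay below 1 for the process to be stable.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1) for stability")
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - ti))
        for ti in history
        if ti < t
    )
    return mu + excitation


# With no past events the intensity equals the background rate.
print(hawkes_intensity(1.0, []))  # 0.2
```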
Results
The study confirmed that the complexity indices are monotonic and nearly orthogonal, facilitating effective evaluation of STPP models. It was found that Hawkes-family baselines showed performance degradation under conditions of joint heterogeneity and entanglement complexity. Additionally, the AutoSTPP model was identified as sensitive to isolated increases in space-time entanglement.
Implications
HawkesNest provides a robust framework for evaluating STPP models, enabling researchers to better understand model limitations and improve their designs. This benchmark can be applied across various domains where spatiotemporal data is prevalent, such as crime analysis, epidemiology, and environmental monitoring.
Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation
Large Language Models
Optimization
Efficient ML
- Introduction of a stateful ReAct agent for autonomous experimentation using LLMs.
- Significant reduction in token consumption (90% for hyperparameter tuning, 52% for code optimization) compared to stateless designs.
- Utilization of a typed persistent state to carry experimental history across iterations.
- Implementation details provided for practitioners to replicate the stateful autoresearch agent.
Summary
This paper introduces a novel approach to autonomous experimentation using large language models (LLMs) by reformulating the autoresearch pattern into a stateful ReAct agent architecture. The traditional stateless design incurs significant token costs due to the need to reconstruct experimental context at each iteration, leading to inefficiencies as the number of experiments increases. The proposed stateful ReAct agent utilizes a typed persistent state to carry experimental history across iterations, allowing for more efficient interactions with the ML system. The architecture is implemented using LangGraph, which enables the agent to manage its state effectively through a tool-calling interface. The paper evaluates the performance of the stateful agent on two benchmarks: hyperparameter tuning and code performance optimization. Results show that the stateful agent consumes significantly fewer tokens (90% fewer for hyperparameter tuning and 52% fewer for code optimization) while maintaining comparable optimization quality. This structural token reduction is achieved by operating within a fixed-size conversation window, thus enabling longer experiment sequences without the need for prompt truncation or summarization. The paper provides detailed architectural insights so that practitioners can implement similar stateful autoresearch agents in their workflows.
Methodology
The methodology involves reformulating the autoresearch pattern into a stateful ReAct agent using LangGraph. The agent maintains a typed persistent state that carries forward experimental history, reasoning traces, and convergence tracking across iterations. The architecture includes a state graph with nodes for reasoning, tool execution, and convergence checking, allowing for efficient management of the experimental process.
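The idea of a typed persistent state can be sketched in plain Python without any agent framework. The field names, the patience-based convergence rule, and the assumption that higher scores are better are all hypothetical; the paper's actual LangGraph state schema is not given in the summary.

```python
from typing import List, TypedDict


class Trial(TypedDict):
    params: dict
    score: float


class ExperimentState(TypedDict):
    history: List[Trial]
    best_score: float
    steps_without_improvement: int


def new_state() -> ExperimentState:
    return {"history": [], "best_score": float("-inf"), "steps_without_improvement": 0}


def record_trial(state: ExperimentState, params: dict, score: float) -> ExperimentState:
    """Return a new state with the trial appended and convergence counters updated."""
    improved = score > state["best_score"]
    return {
        "history": state["history"] + [{"params": params, "score": score}],
        "best_score": max(score, state["best_score"]),
        "steps_without_improvement": 0 if improved else state["steps_without_improvement"] + 1,
    }


def has_converged(state: ExperimentState, patience: int = 3) -> bool:
    """Stop once `patience` consecutive trials fail to beat the best score."""
    return state["steps_without_improvement"] >= patience


state = record_trial(new_state(), {"lr": 0.1}, 0.80)
for lr in (0.2, 0.3, 0.4):
    state = record_trial(state, {"lr": lr}, 0.75)
print(has_converged(state))  # True
```

Because the history lives in the state object rather than in the prompt, each iteration can be prompted with a compact view of it instead of the full transcript.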
Results
The stateful ReAct agent demonstrated a 90% reduction in token usage during hyperparameter tuning (2,492 tokens vs. 24,465 tokens) and a 52% reduction in code optimization (627K tokens vs. 1,275K tokens), while achieving similar optimization quality in both tasks. The structural efficiency stems from the agent's ability to operate within a fixed-size conversation window, significantly lowering per-iteration costs.
Implications
The findings suggest that stateful architectures can greatly enhance the efficiency of LLMs in iterative experimentation, making them more practical for long-term research workflows. This approach could be applied to various domains requiring extensive experimentation, such as machine learning model development and optimization.
One-Step Generalization Ratio Guided Optimization for Domain Generalization
Optimization
Theory
Computer Vision
- The GENIE optimizer addresses parameter imbalance in domain generalization (DG).
- Incorporates One-Step Generalization Ratio (OSGR) to improve gradient alignment.
- Achieves higher generalization potential while retaining SGD's convergence rate.
- Empirically validated across five standard DG datasets, outperforming established optimizers.
Summary
This paper addresses the challenge of Domain Generalization (DG), which aims to train models that can generalize to unseen target domains without overfitting to domain-specific features. The authors introduce GENIE (Generalization-ENhancing Iterative Equalizer), a novel optimizer that utilizes the One-Step Generalization Ratio (OSGR) to evaluate each parameter's contribution to loss reduction and to assess gradient alignment. Unlike traditional gradient-based methods that may reinforce spurious correlations, GENIE dynamically equalizes OSGR through a preconditioning factor, preventing a small subset of parameters from dominating the optimization process. Theoretical analysis shows that GENIE balances convergence contribution and gradient alignment, achieving a higher OSGR while maintaining the convergence rate of Stochastic Gradient Descent (SGD). Empirical results demonstrate that GENIE outperforms existing optimizers across multiple DG datasets and enhances the performance of various DG and single-DG methods.
Methodology
The authors propose GENIE, which integrates OSGR to quantify the contribution of each parameter to generalization. By employing a preconditioning factor, GENIE dynamically adjusts parameter-wise updates to ensure balanced optimization and mitigate overfitting to spurious correlations. Theoretical analysis supports the optimizer's design, focusing on both convergence speed and gradient alignment.
Results
GENIE consistently outperformed established optimizers in empirical tests across five standard DG datasets. The optimizer demonstrated improved performance in existing DG and single-DG algorithms, validating its effectiveness and broad applicability.
Implications
The development of GENIE has significant implications for improving model robustness in real-world applications where data distributions shift. It provides a new perspective on optimizing neural networks for better generalization, which can be beneficial in various fields requiring domain adaptation.
Volterra Generative Models
Generative Models
- Introduction of Volterra generative models that utilize path-dependent noise for improved temporal correlation.
- Development of Gaussian-quadrature-based Markovian approximations for fractional Volterra kernels.
- Derivation of an augmented reverse-time dynamics that maintains data-dimensional learning.
- Identification of covariance degeneracies and introduction of a Gaussian-bridge reconstruction sampler.
Summary
This paper introduces Volterra generative models, a novel continuous-time score-based framework that incorporates path-dependent noise through fractional kernels, addressing the limitations of traditional score-based diffusion models that rely on memoryless Brownian perturbations. The authors construct finite-dimensional Markovian lifts using Gaussian quadrature and a hybrid finite-difference exponential approximation to manage the non-Markovian dynamics of the Volterra processes. They prove squared error bounds and derive an augmented linear-Gaussian forward process, allowing the learning to remain data-dimensional by utilizing residual states and analytic auxiliary Gaussian scores. The study identifies issues of covariance and reverse-time degeneracies due to shared Brownian factors and signed weights, leading to the development of stabilized conditioning and a Gaussian-bridge reconstruction sampler for larger lifts. Experimental results on MNIST and CIFAR-10 demonstrate that the proposed model can enhance score-based generation, particularly with persistent fractional perturbations and small Markovian lifts, while the bridge sampler offers stability for larger configurations.
Methodology
The authors replace traditional Brownian perturbations with path-dependent stochastic convolutions using fractional kernels. They construct finite-dimensional Markovian lifts through Gaussian quadrature and hybrid finite-difference methods, allowing for the approximation of non-Markovian dynamics. The framework includes deriving an augmented linear-Gaussian forward process and addressing issues of covariance degeneracies through stabilized conditioning.
Results
The study demonstrates that Volterra generative models can effectively improve score-based generation on benchmark datasets like MNIST and CIFAR-10. The introduction of persistent fractional perturbations and the Gaussian-bridge reconstruction sampler enhances stability and performance, particularly in larger model configurations.
Implications
The findings suggest that incorporating path-dependent noise in generative models can lead to better representation of complex data distributions, potentially benefiting applications in image generation and other domains requiring high-dimensional data modeling.
Stop the Sampler! Classifier-Based Adaptive Stopping for Sampling Kernels
Generative Models
Theory
Efficient ML
- Introduces a classifier-based approach for adaptive stopping in MCMC sampling.
- Establishes theoretical connections between learned stopping policies and target densities.
- Implements a multilevel training scheme to enhance exploration in complex sampling scenarios.
- Demonstrates significant improvements in trajectory lengths and mode coverage over standard MCMC.
Summary
This paper addresses the challenge of sampling from complex, unnormalized probability densities in Bayesian inference and probabilistic modeling. Traditional Markov chain Monte Carlo (MCMC) methods often face issues with slow mixing and high computational costs due to fixed trajectory lengths. The authors propose a novel framework that incorporates adaptive stopping into the sampling process by utilizing state-dependent neural classifiers trained to determine when a trajectory has reached a high-density region. By framing MCMC within the theory of non-acyclic generative flow networks (GFlowNets), the authors establish a theoretical connection between optimal classifiers and the target density through detailed balance conditions. They introduce a multilevel training scheme to enhance exploration in complex geometries. Experimental evaluations demonstrate that this approach significantly reduces average trajectory lengths while improving mode coverage and mixing compared to standard MCMC methods.
Methodology
The authors utilize GFlowNet methodology to train neural network classifiers that determine when to stop sampling trajectories in MCMC. They establish a theoretical framework linking stopping dynamics to the target density and develop practical algorithms that integrate learned stopping with adaptive sampling kernels. A multilevel training approach is introduced to improve exploration and training efficiency.
Results
The experimental results show that the proposed method yields substantially shorter sampling trajectories compared to standard MCMC methods. Additionally, it improves the ability to capture the geometry of complex target distributions and enhances mixing across modes through learned drift corrections.
Implications
This work has potential applications in various fields requiring efficient sampling from complex distributions, such as Bayesian statistics, probabilistic deep learning, and natural sciences. The adaptive stopping mechanism could lead to faster and more accurate inference in these domains.
CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models
NLP
Large Language Models
Theory
- Introduces CheckMIABench, a benchmark for evaluating membership inference attacks (MIAs) on large language models (LLMs).
- Addresses the issue of distribution shifts that undermine MIA evaluations.
- Demonstrates that existing MIA methods may have inflated performance due to data distribution differences.
- Provides a modular library for implementing MIAs, promoting further research.
Summary
This paper addresses the challenges of evaluating membership inference attacks (MIAs) on large language models (LLMs), which are crucial for assessing the privacy properties of these models. The authors highlight the difficulties in creating clean evaluation benchmarks due to distribution shifts between member and non-member data sets, which can lead to misleading results. They propose a new benchmark, CheckMIABench, that allows for principled evaluation of MIAs by utilizing intermediate checkpoints of open-source models and their training data. This approach ensures that both member and non-member sets are drawn from the same distribution, thus providing a more accurate assessment of MIA effectiveness. The authors apply their framework to evaluate several existing MIA methods on the Pythia and OLMo model families, revealing that many previously reported attack performances may be overstated due to the distributional differences rather than genuine privacy leakage. Additionally, they release a modular library for designing and implementing MIAs, facilitating further research in this area.
Methodology
The authors construct a benchmark by leveraging the insight that training data before and after a fixed point during training are drawn from the same distribution. They validate their dataset splits using 'blind baselines' and benchmark several published MIA methods against the Pythia and OLMo model families, ensuring that the evaluation is not confounded by distributional differences.
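The checkpoint-based split can be sketched as follows. This is only an illustration, assuming a hypothetical training log that records the step at which each example was first used; the names `split_by_checkpoint` and `first_seen_step` are invented here and are not taken from the CheckMIABench library.

```python
def split_by_checkpoint(first_seen_step, checkpoint_step):
    """Label examples as members or non-members of a given checkpoint.

    first_seen_step: dict mapping example id -> training step at which the
        example was first used (hypothetical log format).
    checkpoint_step: step at which the evaluated checkpoint was saved.
    """
    members, non_members = [], []
    for example_id, step in sorted(first_seen_step.items()):
        if step <= checkpoint_step:
            # Used at or before the checkpoint: the model has trained on it.
            members.append(example_id)
        else:
            # Used only later: same data stream, unseen by this checkpoint.
            non_members.append(example_id)
    return members, non_members


log = {"doc_a": 100, "doc_b": 250, "doc_c": 900, "doc_d": 1200}
members, non_members = split_by_checkpoint(log, checkpoint_step=500)
print(members, non_members)
```

Because both groups come from the same training stream, a gap in attack scores between them is harder to attribute to a difference in data distribution.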
Results
The evaluation of existing MIA methods on the new benchmark revealed more limited performance than previously reported, indicating that many attacks may not be as effective as claimed when evaluated under controlled conditions. The authors also found that simple supervised learning methods could outperform some of the more complex MIAs due to the distributional biases in the data.
Implications
The findings suggest that researchers need to be cautious when interpreting MIA results, as existing benchmarks may not accurately reflect the privacy risks posed by LLMs. The open-sourcing of the modular library also encourages further exploration and development of MIA techniques, which could lead to improved privacy protections in machine learning applications.
EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction
Time Series
- Introduction of EnvShip-Bench as a unified benchmark for vessel trajectory prediction.
- Standardized forecasting protocol with consistent observation and prediction settings.
- Creation of a quality-first compact subset for efficient experimentation.
- Inclusion of synchronized environmental and social-context extensions.
Summary
The paper introduces EnvShip-Bench, a comprehensive benchmark designed for short-term vessel trajectory prediction, addressing the limitations of existing maritime AIS resources. Current datasets suffer from inconsistent forecasting protocols, varying data quality, and a lack of contextual annotations, which complicate fair comparisons and context-aware modeling. EnvShip-Bench is constructed from large-scale raw AIS data sourced from the Danish Maritime Authority and NOAA, processed through a standardized pipeline. It features a unified forecasting protocol that includes 10 minutes of observation and prediction at a 20-second sampling interval in vessel-centric local metric coordinates. The benchmark comprises a large-scale core dataset, a quality-first compact subset for reproducible experimentation, and synchronized environmental and nearby-vessel context extensions. This structure supports various forecasting approaches, including trajectory-only, environment-aware, and interaction-aware models, all evaluated under a consistent framework. The extensive analysis of the benchmark statistics demonstrates its potential to enhance maritime trajectory forecasting research by providing a standardized and extensible foundation.
Methodology
The authors developed EnvShip-Bench by processing large-scale raw AIS data from the Danish Maritime Authority and NOAA through a common pipeline. They established a standardized forecasting protocol and created a layered benchmark that includes a core dataset, a compact subset, and context extensions, enabling systematic evaluation of various forecasting models.
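The forecasting protocol can be illustrated with a small windowing sketch. Several details here are assumptions rather than facts from the paper: that observation and prediction each span 10 minutes (30 points at 20-second spacing), that the local frame is centred on the last observed position, and that an equirectangular projection is accurate enough over such short distances. All function names are hypothetical.

```python
import math

SAMPLE_INTERVAL_S = 20                  # sampling interval from the protocol
WINDOW_S = 10 * 60                      # assumed: 10 minutes per window
STEPS = WINDOW_S // SAMPLE_INTERVAL_S   # 30 points per window
EARTH_RADIUS_M = 6_371_000.0


def to_local_metres(track, origin):
    """Project (lat, lon) in degrees to metres east/north of `origin`.

    Uses an equirectangular approximation (an assumption made here).
    """
    lat0, lon0 = map(math.radians, origin)
    points = []
    for lat, lon in track:
        east = (math.radians(lon) - lon0) * math.cos(lat0) * EARTH_RADIUS_M
        north = (math.radians(lat) - lat0) * EARTH_RADIUS_M
        points.append((east, north))
    return points


def make_sample(track):
    """Split 2 * STEPS resampled points into observation and target windows."""
    if len(track) < 2 * STEPS:
        raise ValueError("track too short for one sample")
    origin = track[STEPS - 1]           # last observed position becomes (0, 0)
    local = to_local_metres(track[: 2 * STEPS], origin)
    return local[:STEPS], local[STEPS:]


# Synthetic track heading due north, one point every 20 seconds.
track = [(55.0 + 0.0001 * i, 12.0) for i in range(60)]
observed, future = make_sample(track)
print(len(observed), len(future), observed[-1])
```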
Results
EnvShip-Bench provides a comprehensive dataset that covers diverse vessel categories and realistic operating conditions. The benchmark's statistics reveal a broad range of motion patterns and support for trajectory-only, environment-aware, and interaction-aware forecasting, demonstrating its effectiveness in facilitating fair comparisons and advancing research in maritime trajectory prediction.
Implications
EnvShip-Bench has significant implications for intelligent shipping, maritime surveillance, and navigation safety by providing a robust framework for developing and evaluating vessel trajectory prediction models. It encourages the adoption of standardized practices in the field, potentially leading to improved safety and efficiency in maritime operations.
Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes
Large Language Models
NLP
- AIPR provides a structured, evidence-grounded review and a numeric score without fine-tuning on prior reviews.
- The overall score effectively separates accepted from rejected submissions with an AUROC of 0.82.
- A simple prompt on the same model yields scores comparable to the full AIPR pipeline, highlighting the model's inherent validity.
- The study emphasizes the need for validation of automated scoring systems against human outcomes in peer review.
Summary
This paper investigates the validity of a large language model (LLM) system, AIPR, designed to assist in the peer review process by providing a first-pass manuscript score across five quality dimensions. Unlike previous studies that primarily evaluated the text generated by LLMs, this research focuses on the numeric score assigned by AIPR and its correlation with actual peer-review outcomes. The study analyzes 300 submissions from the ICLR 2026 conference, comparing the AIPR scores against public decision outcomes and reviewer ratings. The findings reveal that AIPR's overall score effectively distinguishes between accepted and rejected submissions, achieving an area under the receiver operating characteristic curve (AUROC) of 0.82. The score demonstrates a consistent increase across decision tiers and aligns closely with mean reviewer ratings. Notably, the model's performance is robust, as a simple one-paragraph prompt yields scores nearly as effective as the full AIPR pipeline, indicating that the model itself is a significant contributor to the score's validity. The paper emphasizes the importance of validating automated scoring systems against human outcomes, particularly in the context of peer review, which is often criticized for its inconsistency and bias.
Methodology
The study utilized a frozen pipeline of the AIPR system to evaluate 300 ICLR submissions, employing a pre-registered design to ensure unbiased results. The AIPR system generated scores based on a prompt without fine-tuning, and the outcomes were compared against public decision records and reviewer ratings from the conference.
Results
The AIPR system achieved an AUROC of 0.82, indicating strong performance in distinguishing between accepted and rejected papers. The overall score correlated well with reviewer ratings and showed a consistent trend across decision tiers. The reliability of the scoring was confirmed, with minimal variation across repeated evaluations.
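For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen accepted paper receives a higher score than a randomly chosen rejected one. The sketch below computes it by pairwise comparison; the scores and labels are made-up toy values, not data from the study.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy first-pass scores with accept (1) / reject (0) decisions.
scores = [7.5, 6.0, 8.0, 4.0, 5.5, 6.0]
accepted = [1, 1, 1, 0, 0, 0]
print(auroc(scores, accepted))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives context for the reported 0.82.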
Implications
The findings suggest that LLMs can provide valuable assistance in the peer review process by offering reliable first-pass assessments. This could help alleviate the burden on human reviewers and improve the efficiency of manuscript evaluation, particularly in high-volume submission environments.
Constrained Diffusion Models with Primal-Dual Inference
Optimization
Generative Models
Theory
- Constrained sampling is framed as a saddle-point problem in the Lagrangian dual domain.
- The PDI algorithm allows for joint evolution of primal denoising and dual ascent during sampling.
- A single dual-variable-conditioned score network generalizes across various Gibbs distributions.
- Convergence of dual iterates to optimal multipliers is established, with bounds on residual dual mismatch effects.
Summary
This paper introduces constrained diffusion models utilizing primal-dual inference (PDI) to sample from optimal distributions in entropy-regularized optimization problems with average constraints. The authors formalize constrained sampling in the Lagrangian dual domain, where the optimal distribution is represented as a Gibbs distribution indexed by the optimal dual variable. Unlike traditional methods that estimate the dual multiplier before sampling, PDI integrates the dual variable into the sampling state, allowing for joint inference of the primal distribution and the dual variable during the reverse diffusion process. Each denoising step employs the score field associated with the current dual multiplier and updates the multiplier based on constraint violations observed in the denoised samples. A single dual-conditioned score network is trained to adapt to various Gibbs distributions encountered during inference. The paper proves the convergence of the time-averaged dual variables to a neighborhood of the optimal multiplier and bounds the impact of residual dual mismatch on the terminal distribution. Empirical evaluations demonstrate PDI's effectiveness in constrained sampling tasks, outperforming previous dual-training methods in scenarios such as Gaussian mixtures, wireless resource allocation, and portfolio management.
Methodology
The authors develop the PDI algorithm, which interleaves dual ascent with reverse diffusion sampling. At each step, the algorithm uses the score field associated with the current dual variable to denoise samples and updates the dual variable based on estimated constraint violations. A dual-conditioned score network is trained to represent a family of Gibbs distributions, allowing for adaptability to different optimization instances.
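The interleaving of primal and dual updates can be illustrated on a toy problem. The sketch below is not the authors' algorithm: it swaps reverse diffusion for plain Langevin dynamics and replaces the trained dual-conditioned score network with an analytic score. The assumed problem is to minimise KL(p || N(0, 1)) subject to E_p[x] >= b, for which the Gibbs distribution at multiplier lam is N(lam, 1) and the optimal multiplier equals b.

```python
import math
import random


def primal_dual_langevin(b=1.5, n_particles=200, n_steps=1500,
                         step=0.05, dual_lr=0.02, seed=0):
    """Toy primal-dual sampler for min KL(p || N(0,1)) s.t. E_p[x] >= b.

    For multiplier lam the Gibbs distribution is N(lam, 1), so its score is
    lam - x. Returns the final particles and the time-averaged multiplier.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    lam, lam_sum, lam_count = 0.0, 0.0, 0
    noise_scale = math.sqrt(2.0 * step)
    for t in range(n_steps):
        # Primal update: one Langevin step using the score for the current lam.
        xs = [x + step * (lam - x) + noise_scale * rng.gauss(0.0, 1.0)
              for x in xs]
        # Dual ascent on the estimated constraint violation b - E[x].
        violation = b - sum(xs) / n_particles
        lam = max(0.0, lam + dual_lr * violation)
        if t >= n_steps // 2:           # average only after burn-in
            lam_sum += lam
            lam_count += 1
    return xs, lam_sum / lam_count


samples, lam_avg = primal_dual_langevin()
sample_mean = sum(samples) / len(samples)
print(round(lam_avg, 2), round(sample_mean, 2))
```

In this toy setting the time-averaged multiplier settles near b, which mirrors the kind of convergence statement the paper makes for its dual iterates.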
Results
PDI was empirically validated on various constrained sampling tasks, including a mixture of Gaussians, wireless resource allocation, and portfolio management. The results indicate that PDI outperforms traditional dual-training approaches, particularly in scenarios with shifted constraints, demonstrating robustness and adaptability.
Implications
The proposed PDI framework has significant implications for solving constrained optimization problems in various fields, such as telecommunications and finance, where probabilistic solutions are essential. It enhances the capability of diffusion models to handle constraints effectively, potentially leading to more efficient resource allocation and decision-making processes.
ReRAM-aware Model Finetuning addressing I-V Non-linearity and Retention Errors
Efficient ML
- Proposes a finetuning-based approach for DNN deployment on ReRAM, addressing non-idealities.
- Mitigates I-V non-linearity using a range-shrunk sinh transformation.
- Incorporates retention errors into a regularization loss during finetuning.
- Achieves similar accuracy to base models with minimal training overhead.
Summary
This paper addresses the challenges of deploying deep neural networks (DNNs) on ReRAM (resistive RAM), which stem from intrinsic non-idealities such as I-V non-linearity and retention errors. The authors propose a finetuning-based hardware-aware training algorithm that allows for robust DNN deployment on ReRAM with minimal computational overhead. The proposed method incorporates a range-shrunk sinh transformation to mitigate I-V non-linearity and integrates retention errors into a regularization loss during the finetuning process. The framework is evaluated on various models and tasks, including image classification and question-answering, demonstrating that it achieves comparable accuracy to base models while significantly reducing the training burden associated with hardware-aware training frameworks that typically require training from scratch. The results indicate that the finetuning method leads to less than 2% accuracy degradation for models like MobileNetV3 on ImageNet and a minimal drop in F1 score on the SQuAD v2 dataset.
Methodology
The authors developed a finetuning technique that adjusts pretrained models to compensate for ReRAM-specific non-idealities. This includes a differential ReRAM architecture and a weight-to-conductance mapping strategy to minimize variance. The training process incorporates input-transformation nonlinearity and retention-induced weight variance as a regularization loss, optimizing both task loss and regularization loss during model adaptation.
Results
The proposed method demonstrated iso-accuracy for larger models such as ResNet-18 and DeiT-Tiny, with less than 2% accuracy degradation for MobileNetV3 on ImageNet. For the SQuAD v2 dataset, the finetuning resulted in only a 1-point degradation in F1 score, indicating effective performance retention despite the hardware non-idealities.
Implications
This work has significant implications for the deployment of large-scale DNNs on edge hardware, particularly in applications requiring low-power and efficient inference. It enables the transition of foundational models to practical hardware implementations without extensive retraining, thus facilitating broader adoption of in-memory computing technologies.
A Comparative Study of Graph Neural Network Layer Selection for Interaction Modelling in Driving Trajectory Prediction
Graph Learning
Robotics
Time Series
- Evaluation of 19 graph convolutional layer types for trajectory prediction.
- Identification of five effective layer combinations, including ARMA and Chebyshev layers.
- Sum-based aggregation methods outperform mean-based methods.
- Multi-head attention mechanisms enhance interaction modeling.
Summary
This paper investigates the effectiveness of various Graph Neural Network (GNN) layers for predicting driving trajectories, a critical aspect of autonomous driving systems. The authors conduct a comparative study of 19 different graph convolutional layer types, focusing on their ability to model spatial interactions and temporal dynamics among road agents. The study identifies five superior layer combinations, notably including ARMA, Chebyshev, and topology-aware layers, which consistently outperform others in terms of prediction accuracy. The findings also reveal practical design principles for GNN architecture, such as the advantages of sum-based aggregation over mean-based methods, the benefits of multi-head attention mechanisms for richer interactions, and the importance of assigning different weights to various hop distances. This research not only fills a gap in the understanding of GNN architectures for trajectory prediction but also provides actionable insights for practitioners aiming to enhance the interpretability and effectiveness of their models.
Methodology
The authors build upon a previous architecture to evaluate 19 different graph convolutional layers, focusing on their spatial and temporal processing capabilities. They utilize a modified approach that maintains a centered map representation, reducing computational complexity while extending the prediction horizon. The architecture integrates map data processed through a ResNet-18 to create initial node representations, which are then updated through the selected GNN layers.
Results
The study highlights that certain layer combinations, particularly ARMA, Chebyshev, and topology-aware layers, yield superior performance in trajectory prediction tasks. The findings indicate that sum-based aggregation methods are more effective than mean-based methods, and the use of multi-head attention mechanisms significantly enriches the interaction modeling process. The proposed architecture allows for a longer prediction horizon with a shorter observation window, enhancing overall prediction accuracy.
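One common explanation for sum-based aggregation outperforming mean-based aggregation, offered here as background rather than as the authors' own argument, is that the mean discards how many neighbours contributed. The toy sketch below uses hypothetical two-dimensional agent features to show two scenes of different density that look identical under the mean but not under the sum.

```python
def aggregate(neighbour_features, mode):
    """Combine neighbour feature vectors by element-wise sum or mean."""
    dim = len(neighbour_features[0])
    totals = [sum(vec[d] for vec in neighbour_features) for d in range(dim)]
    if mode == "sum":
        return totals
    if mode == "mean":
        return [t / len(neighbour_features) for t in totals]
    raise ValueError(f"unknown mode: {mode}")


# Two scenes with identical per-agent features but different traffic density.
sparse_scene = [[1.0, 0.5]] * 2
dense_scene = [[1.0, 0.5]] * 6

print(aggregate(sparse_scene, "mean"), aggregate(dense_scene, "mean"))
print(aggregate(sparse_scene, "sum"), aggregate(dense_scene, "sum"))
```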
Implications
The insights from this study can guide the design of more effective and interpretable GNN-based trajectory prediction models, which are crucial for the development of safe and efficient autonomous driving systems. The findings may also benefit traffic safety assessments and adaptive control in Intelligent Transportation Systems (ITS).
Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias
Theory
Interpretability
- MKAN guarantees hard monotonicity across all parameter values, simplifying the training process.
- A novel representation-cost theorem provides a principled sizing rule for monotone encoders.
- Empirical results demonstrate MKAN's competitive performance on benchmark datasets, validating its effectiveness.
- MKAN uniquely combines hard monotonicity with per-edge functional transparency, enhancing interpretability.
Summary
This paper introduces Monotonic Kolmogorov-Arnold Networks (MKAN), which address the need for hard monotonicity in neural networks while maintaining functional transparency. The authors critique existing monotonic neural network architectures, particularly the limitations of MonoKAN, which only enforces monotonicity on a restricted parameter subset and requires complex training procedures. MKAN overcomes these challenges by ensuring monotonicity through exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation, allowing for standard gradient descent training. The paper presents a representation-cost theorem, demonstrating that any feature extractor inducing a ball-shaped semantic-neighborhood partition can be realized monotonically with a bounded increase in output dimensions. Empirical evaluations show that MKAN performs competitively against state-of-the-art monotone neural networks on various benchmarks, achieving higher Spearman alignment in recovering ground-truth factors compared to existing methods. This work highlights the advantages of incorporating monotonicity as an inductive bias in neural network design, particularly in applications where outputs are expected to respond monotonically to inputs.
Methodology
The MKAN architecture employs exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation to ensure monotonicity. The training process is conducted using standard unconstrained gradient descent, making it more straightforward than previous approaches. The representation-cost theorem is derived to establish the relationship between the number of output dimensions and the number of non-monotone coordinates in the feature extractor.
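How an exponential reparameterization can guarantee monotonicity for any parameter values is easiest to see on a single edge function. The sketch below is an assumption about the general idea, not the paper's exact construction: it builds non-decreasing spline coefficients as cumulative sums of `exp(theta)` and evaluates a degree-1 B-spline (piecewise-linear) on a uniform grid. Positive edge weights and the monotone base activation are omitted, and the function names are hypothetical.

```python
import math
import random


def monotone_coefficients(theta, offset=0.0):
    """Map unconstrained parameters to strictly increasing coefficients."""
    coeffs = [offset]
    for t in theta:
        coeffs.append(coeffs[-1] + math.exp(t))   # exp(t) > 0 for any real t
    return coeffs


def edge_function(x, coeffs, lo=0.0, hi=1.0):
    """Evaluate a degree-1 B-spline on a uniform grid over [lo, hi]."""
    x = min(max(x, lo), hi)                       # clamp to the grid
    pos = (x - lo) / (hi - lo) * (len(coeffs) - 1)
    i = min(int(pos), len(coeffs) - 2)
    frac = pos - i
    return (1.0 - frac) * coeffs[i] + frac * coeffs[i + 1]


rng = random.Random(0)
theta = [rng.uniform(-3.0, 3.0) for _ in range(8)]   # any real values work
coeffs = monotone_coefficients(theta)
values = [edge_function(k / 100, coeffs) for k in range(101)]
is_monotone = all(a <= b for a, b in zip(values, values[1:]))
print(is_monotone)
```

Because the constraint is built into the parameterization, `theta` can be updated by ordinary unconstrained gradient descent without any projection step.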
Results
MKAN was evaluated on the SMM/ICML-2024 benchmark suite, showing competitive performance against state-of-the-art monotone neural networks in both classification and regression tasks. In self-supervised settings, MKAN validated the 2N* prediction from the representation-cost theorem and achieved significantly higher Spearman alignment in recovering ground-truth factors compared to KAN, MLP, and linear baselines.
Implications
The findings suggest that incorporating monotonicity as an inductive bias can enhance the interpretability and reliability of neural networks in applications where monotonic relationships are expected, such as in economic and scientific domains. The representation-cost theorem also provides a framework for designing monotonic encoders in future research.
From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning
NLP
Large Language Models
Reinforcement Learning
- Introduces a hierarchical latent selection model to formalize compositional generalization in LLMs.
- Demonstrates the complementary roles of SFT and RL in developing reusable reasoning modules.
- Shows that RL can effectively extract and recombine atomic modules from compound reasoning traces.
- Finds that training on compound traces enhances generalization compared to isolated modules.
Summary
This paper investigates the compositional generalization capabilities of large language models (LLMs) through a novel hierarchical latent selection model. The authors argue that the success of post-training pipelines combining supervised fine-tuning (SFT) and reinforcement learning (RL) is driven by the ability to identify and reuse atomic reasoning modules. The proposed model distinguishes between atomic skills (local operations) and routing mechanisms (how information is selected and composed). The authors demonstrate that SFT provides the raw materials for reasoning traces, while RL effectively decomposes these traces into reusable modules. Through controlled experiments, they show that RL can extract atomic modules from compound reasoning traces and recombine them to solve new configurations, leading to improved generalization. The findings suggest that training on compound traces is more beneficial than training on isolated modules, and that an effective training protocol involves SFT ensuring coverage of all atomic modules while RL focuses on exploring novel compositions. This work contributes to a deeper understanding of how LLMs can achieve compositional generalization in reasoning tasks.
Methodology
The authors propose a hierarchical latent selection model that captures the structure of reasoning traces through discrete selection variables. They conduct controlled experiments using synthetic reasoning tasks to validate their theoretical framework, focusing on how RL can decompose compound traces into atomic skills and routing mechanisms.
Results
The experiments reveal that RL can match the accuracy of direct atomic supervision while generalizing to unseen compositions. The strongest out-of-distribution performance is achieved when SFT covers the atomic inventory and RL explores novel compositions, with minimal overlap between the two.
Implications
This research has implications for improving the reasoning capabilities of LLMs, particularly in tasks requiring compositional generalization. It suggests that effective training strategies can enhance the flexibility and robustness of language models in handling complex reasoning tasks.
Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis
Multimodal
- Foundation models can effectively extract representations from multimodal cancer data.
- Unimodal representations from images and omics data provide complementary predictive signals.
- Multimodal fusion strategies can enhance predictive performance, especially when no single modality is dominant.
- Conformal prediction demonstrates the trustworthiness of model predictions and uncertainty quantification.
Summary
This paper systematically evaluates the effectiveness of foundation models (FMs) in extracting representations from multimodal cancer data, specifically focusing on whole-slide images and transcriptomic profiles. The study uses two real-world commercial cohorts, IH-BC and IH-NSCLC, to benchmark the performance of various FMs across eight classification tasks. The authors first assess unimodal probing performance and find that image and omics representations provide complementary predictive signals. They then explore multimodal fusion strategies to determine if combining these modalities enhances predictive performance. The study also incorporates conformal prediction to evaluate the trustworthiness of the models, revealing that while FM representations perform competitively on out-of-distribution data, multimodal fusion is particularly beneficial when no single modality dominates the signal. The findings highlight the importance of uncertainty-aware inference in clinical applications, suggesting that even when point predictions fail, the true diagnosis can often be recovered within a prediction set.
Methodology
The methodology involves three main parts: unimodal probing of image and omics representations across classification tasks, multimodal learning comparisons using three fusion strategies, and trustworthiness evaluation through conformal prediction. The study utilizes a deidentified dataset from Flatiron Health and Caris Life Sciences, focusing on breast cancer and non-small cell lung cancer cases.
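The conformal prediction step can be sketched with the standard split conformal recipe for classification. This is a generic illustration and may differ from the exact variant used in the study; the calibration probabilities and labels below are invented toy values.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold for roughly (1 - alpha) marginal coverage."""
    # Nonconformity score: one minus the probability given to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))   # finite-sample corrected rank
    if rank > n:
        return 1.0          # too few calibration points: include every class
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return every class whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7], [0.55, 0.45],
             [0.1, 0.9], [0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]
cal_labels = [0, 1, 0, 1, 1, 1, 0, 1, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.15)

confident = prediction_set([0.9, 0.1], threshold)
ambiguous = prediction_set([0.52, 0.48], threshold)
print(threshold, confident, ambiguous)
```

For the ambiguous case the point prediction would be class 0, yet the set keeps both classes, which is the mechanism by which a true diagnosis can remain recoverable when the top prediction is wrong.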
Results
The results indicate that FM representations achieve competitive performance on out-of-distribution datasets. Multimodal fusion strategies improve performance on certain tasks, particularly when neither modality dominates the predictive signal. Conformal prediction shows that the majority of failed point predictions still allow for recovery of the true diagnosis within the prediction set, underscoring the value of uncertainty quantification.
Implications
The findings suggest that foundation models can be effectively utilized in clinical settings for cancer diagnosis, particularly when integrating multimodal data. The emphasis on uncertainty-aware inference could lead to more reliable clinical decision support systems, enhancing diagnostic accuracy and patient outcomes.
Simulation-Augmented Multi-Step Split Conformal Prediction for Aggregated Forecasts
Time Series
- SA-MSCP improves empirical coverage for aggregated forecasting tasks.
- The method uses block bootstrap to simulate future paths while preserving temporal dependence.
- Empirical evaluations show significant coverage gains over traditional methods.
- The approach is applicable to various temporal resolutions and aggregation units.
Summary
This paper presents SA-MSCP, a novel simulation-augmented multi-step split conformal prediction method aimed at improving uncertainty quantification in aggregated forecasting tasks, such as annual totals and year-over-year growth rates. The method addresses the challenges posed by temporal dependence and nonlinear transformations in aggregated data, which complicate the application of traditional conformal prediction methods. SA-MSCP utilizes expanding-window cross-validation to collect residuals and employs a block bootstrap technique to simulate future paths, thereby preserving local dependence in the residuals. The prediction intervals are constructed from empirical quantiles of the simulated paths. The empirical evaluation of SA-MSCP on the M4 dataset and a proprietary dataset of 2,000 series demonstrates significant improvements in empirical coverage compared to a baseline simulated-path approach. The results indicate that simulation-enhanced conformal calibration is a robust framework for uncertainty quantification in aggregated time-series forecasting.
Methodology
SA-MSCP employs a simulation-augmented multi-step split conformal prediction framework. It collects residuals through expanding-window cross-validation and simulates future paths using a block bootstrap method to maintain local dependence. Prediction intervals are derived from empirical quantiles of the simulated paths, allowing for uncertainty quantification across multiple forecast horizons.
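The simulation step can be sketched as below. This is a simplified illustration under stated assumptions: residuals are pooled rather than tracked per forecast horizon, the block length and path count are arbitrary, and the residual values are made up instead of coming from expanding-window cross-validation. Function names are hypothetical.

```python
import random


def block_bootstrap(residuals, horizon, block_len, rng):
    """Draw `horizon` residuals as contiguous blocks to keep local dependence."""
    path = []
    while len(path) < horizon:
        start = rng.randrange(len(residuals) - block_len + 1)
        path.extend(residuals[start:start + block_len])
    return path[:horizon]


def aggregate_interval(point_forecast, residuals, alpha=0.1,
                       n_paths=2000, block_len=3, seed=0):
    """Interval for the sum of a multi-step forecast from simulated paths."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_paths):
        noise = block_bootstrap(residuals, len(point_forecast), block_len, rng)
        # Simulated future path = point forecast plus resampled residuals.
        totals.append(sum(f + e for f, e in zip(point_forecast, noise)))
    totals.sort()
    lower = totals[int((alpha / 2) * (n_paths - 1))]
    upper = totals[int((1 - alpha / 2) * (n_paths - 1))]
    return lower, upper


# Made-up residuals standing in for cross-validation output.
residuals = [1.2, 0.8, -0.5, -1.1, 0.3, 0.9, -0.7, -0.2, 0.6, -1.4, 0.1, 0.4]
monthly_forecast = [100.0] * 12
low, high = aggregate_interval(monthly_forecast, residuals)
print(low, high)
```

Aggregating each simulated path before taking quantiles lets the interval reflect dependence between months, which summing per-month intervals would not capture.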
Results
The experiments reveal that SA-MSCP consistently outperforms a baseline simulated-path method in terms of empirical coverage across various targets, including monthly forecasts, aggregated annual totals, and year-over-year growth rates. The method achieves higher coverage rates, albeit with wider prediction intervals, indicating a trade-off between coverage and interval width.
Implications
The findings suggest that SA-MSCP can be effectively utilized in practical applications requiring reliable uncertainty quantification, such as in finance and retail forecasting. The method's ability to handle aggregated data and temporal dependence makes it a valuable tool for practitioners in these fields.
When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations
Computer Vision
Interpretability
- Introduces CAPS, a novel framework for interpretable OOD detection using representation perturbations.
- Utilizes Sparse Autoencoders to learn class-specific concept vectors for enhanced interpretability.
- Demonstrates effectiveness across multiple medical imaging domains, including endoscopy and histopathology.
- Establishes a connection between extracted features and clinically meaningful visual patterns.
Summary
This paper addresses the challenge of Out-of-Distribution (OOD) detection in deep neural networks, particularly in the context of medical imaging, where overgeneralization can lead to critical diagnostic errors. The authors propose a novel framework called Class-conditioned Activation PerturbationS (CAPS) that enhances interpretability in OOD detection by analyzing the stability of model predictions under class-specific perturbations. By utilizing Sparse Autoencoders (SAEs), the framework learns class-specific concept vectors that decompose dense representations into interpretable components. During inference, the model perturbs deeper-layer representations using these concept vectors and measures the stability of class logits. The hypothesis is that in-distribution samples will show low sensitivity to these perturbations, while OOD samples will exhibit significant deviations due to representational misalignment. The authors validate their approach across various medical imaging modalities, demonstrating its effectiveness and the clinical relevance of the extracted features.
Methodology
The authors employ Sparse Autoencoders (SAEs) to learn interpretable class-specific concept vectors from in-distribution data. The framework assesses the stability of model predictions by perturbing deeper-layer representations along these learned directions and measuring changes in class logits. This approach allows for a concept-conditioned analysis of OOD detection.
Results
The CAPS framework was evaluated on various medical imaging tasks, showing that in-distribution samples exhibited low sensitivity to perturbations, while OOD samples demonstrated significant deviations. The results indicate that the extracted features correspond to clinically relevant patterns, validating the framework's effectiveness in real-world applications.
Implications
The proposed method enhances the interpretability of OOD detection in safety-critical domains like healthcare, potentially improving trust and reliability in AI-driven diagnostic tools. It opens avenues for further research into interpretable machine learning methods that can be applied across various domains.
Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry
Computer Vision
Robotics
Interpretability
- Identifies and addresses the limitations of existing single-shot FPP methods in long-range applications.
- Introduces a novel architecture, PhiCalNet, which improves depth reconstruction by focusing on phase representation.
- Demonstrates the effectiveness of mechanistic interpretability and uncertainty quantification in diagnosing and repairing model errors.
- Achieves a significant reduction in mean absolute error (MAE) from 14.54 mm to 4.46 mm with the new architecture.
Summary
This paper addresses the challenges of long-range single-shot fringe projection profilometry (FPP), a technique for 3D reconstruction that has primarily been studied in close-range settings. The authors identify key issues such as the degradation of fringe signal-to-noise ratio at long distances, the ill-posed nature of single-shot reconstruction, and the lack of mechanistic interpretability in existing architectures. They propose a unified approach that employs mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) to diagnose and repair architectural shortcomings in FPP systems. Through systematic experiments on a synthetic benchmark, the authors demonstrate that existing architectures rely on shape priors rather than effective fringe-phase decoding. They introduce PhiCalNet, a new architecture that outputs a wrapped-phase representation, significantly improving accuracy. The study concludes with a detailed analysis of the performance improvements and the convergence of MI and UQ diagnostics, highlighting the importance of architectural choices in achieving reliable long-range FPP.
Methodology
The authors utilize mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) as diagnostic tools to identify architectural failures in existing FPP models. They conduct systematic ablation studies on a synthetic dataset of 15,600 fringe images to evaluate model performance. The proposed PhiCalNet architecture is designed to output wrapped-phase representations, which are then mapped to depth using a differentiable calibration layer.
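To make the wrapped-phase representation concrete, the sketch below shows how an angle is wrapped into [-π, π] and why a naive difference misbehaves near the ±π discontinuity. The helper names `wrap_phase` and `circular_diff` are hypothetical and are not taken from the paper; the calibration layer that maps phase to depth is not reproduced here.

```python
import math


def wrap_phase(phi: float) -> float:
    """Wrap an angle in radians into the interval [-pi, pi]."""
    return math.atan2(math.sin(phi), math.cos(phi))


def circular_diff(a: float, b: float) -> float:
    """Signed angular difference a - b, wrapped into [-pi, pi]."""
    return wrap_phase(a - b)


# Two phases that are close on the circle but sit on opposite sides of the wrap.
pred, target = math.pi - 0.05, -math.pi + 0.05

naive_error = abs(pred - target)                  # close to 2*pi
wrapped_error = abs(circular_diff(pred, target))  # about 0.1 rad
```

The gap between `naive_error` and `wrapped_error` illustrates why predictions near the wrap boundary need special care when a wrapped phase is the network output.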
Results
The study finds that the baseline UNet architecture achieves a mean absolute error (MAE) of 14.54 mm, primarily relying on shape priors. The introduction of PhiCalNet reduces the MAE to 4.46 mm, with the residual error concentrated at the ±π wrap discontinuity. Additionally, using the MI and UQ diagnostics on PhiCalNet to reject the 5% of object pixels with the highest uncertainty reduces root-mean-square error (RMSE) by 64%.
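The uncertainty-based rejection step can be sketched as follows. This is a minimal illustration on made-up toy numbers, assuming each pixel carries one error value and one uncertainty score; the function names are hypothetical and the conformal procedure that produces the uncertainties is not shown.

```python
import math


def rmse(errors):
    """Root-mean-square of a list of per-pixel errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def reject_most_uncertain(errors, uncertainties, reject_fraction=0.05):
    """Return the errors of the pixels kept after dropping the most uncertain ones."""
    n_reject = int(round(reject_fraction * len(errors)))
    # Indices sorted from lowest to highest uncertainty.
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    kept = order[: len(errors) - n_reject]
    return [errors[i] for i in kept]


# Toy data (invented for illustration): 19 well-behaved pixels, one uncertain outlier.
errors = [1.0] * 19 + [20.0]
uncertainties = [0.1] * 19 + [5.0]

rmse_all = rmse(errors)
rmse_kept = rmse(reject_most_uncertain(errors, uncertainties))
```

When large errors coincide with high uncertainty, as in this toy case, removing a small fraction of pixels lowers RMSE sharply.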
Implications
The findings suggest that improving architectural design in machine learning models can significantly enhance the reliability and accuracy of long-range 3D reconstruction techniques. This has potential applications in fields requiring high-precision metrology, such as robotics, surface inspection, and manufacturing process control.
StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling
Optimization
Reinforcement Learning
- Introduction of StarOR, a synergistic framework combining MCTS and Test-Time RL for optimization modeling.
- Decomposition of the modeling process into four hierarchical stages for improved policy adaptation.
- Implementation of an unsupervised multi-faceted reward system for evaluating formulation quality.
- Demonstration of state-of-the-art performance across multiple optimization benchmarks.
Summary
The paper introduces StarOR, a novel framework that combines Monte Carlo Tree Search (MCTS) with Test-Time Reinforcement Learning (RL) to enhance optimization modeling. Traditional methods in this domain often struggle with adapting to new problem distributions and can be brittle due to the hierarchical nature of optimization tasks, where early errors can propagate and invalidate entire models. StarOR addresses these challenges by decomposing the modeling process into four stages and employing a transient LoRA adapter updated via Group Relative Policy Optimization (GRPO) at each non-terminal node. This allows for instance-specific policy refinement using MCTS-generated siblings as local comparison sets. Additionally, an unsupervised multi-faceted reward system provides feedback for intermediate decisions without requiring ground-truth labels. The experimental results demonstrate that StarOR achieves state-of-the-art performance across five optimization benchmarks, outperforming existing methods and large language models (LLMs) even with a smaller backbone. The framework's ability to adaptively refine policies during the modeling process marks a significant advancement in automated optimization modeling, making it more robust and reliable.
Methodology
StarOR employs a synergistic search-and-adaptation framework that integrates MCTS with Test-Time RL. The modeling process is divided into four stages, with a transient LoRA adapter updated at each non-terminal node. The framework utilizes MCTS-generated siblings for local comparisons, facilitating instance-specific policy refinement. An unsupervised multi-faceted reward system is designed to provide feedback on intermediate decisions without the need for ground-truth labels.
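Using siblings as a local comparison set can be illustrated with a group-relative advantage: each sibling's reward is standardized against the rewards of the other children of the same node. The sketch below is a generic GRPO-style normalization, not StarOR's actual implementation; the reward values are invented and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sibling group (mean 0, roughly unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sibling formulations under one tree node.
sibling_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(sibling_rewards)
```

Siblings that score above the group mean receive positive advantages and are reinforced; the rest are pushed down, so no separately learned value baseline is needed.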
Results
StarOR consistently outperformed existing optimization modeling methods and frontier large language models across five benchmarks, producing robust and reliable formulations even with a 4B-parameter backbone.
Implications
The advancements presented in StarOR could significantly enhance automated optimization modeling across various industries, enabling non-experts to leverage complex operations research tools more effectively. The framework's ability to adaptively refine policies could lead to more accurate and efficient decision-making processes in real-world applications.
PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation
NLP
Large Language Models
Reinforcement Learning
- PowerOPD introduces bounded, sign-consistent rewards to stabilize on-policy distillation.
- The method significantly reduces sample inefficiency and training instability compared to standard OPD.
- PowerOPD achieves benchmark-averaged accuracy gains of up to +6.37 over vanilla OPD.
- The approach reduces wall-clock time by 59.2% and peak GPU memory by 23.1% compared to full-vocabulary OPD.
Summary
The paper introduces PowerOPD, a novel approach to on-policy distillation (OPD) for large language models that addresses significant training inefficiencies and instabilities associated with standard OPD methods. The authors observe that conventional OPD, which uses a reverse-KL objective based on student-sampled tokens, suffers from high-variance gradients and sample inefficiency, leading to poor performance compared to full-vocabulary OPD. They trace these issues to the unbounded log-ratio reward used in OPD, which causes extreme reward variance and instability during training. To mitigate these problems, PowerOPD employs a Box-Cox power transformation to create bounded, sign-consistent rewards that stabilize training dynamics. The proposed method significantly improves accuracy and efficiency across various benchmarks, demonstrating that larger values of the transformation parameter α enhance performance while reducing computational costs. The authors validate PowerOPD on six mathematical reasoning benchmarks and multiple teacher-student pairs, showing substantial gains in accuracy and efficiency over both vanilla OPD and full-vocabulary OPD.
Methodology
PowerOPD reformulates the reward structure in on-policy distillation by utilizing the Box-Cox power transformation to create bounded rewards. This approach ensures that the rewards remain stable and consistent in sign, addressing the issues of high variance and instability present in the standard log-ratio reward. The authors evaluate PowerOPD across multiple benchmarks and teacher-student configurations to assess its effectiveness.
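As a rough illustration, the textbook Box-Cox transform of a probability ratio r is (r^α − 1)/α, which tends to log r as α → 0. The sketch below assumes the transform is applied to the teacher-to-student probability ratio; that choice, and the function name, are assumptions rather than details confirmed by the summary. For α > 0 this form is bounded below by −1/α and keeps the sign of the log-ratio, but it only demonstrates the lower bound, not how the paper obtains a fully bounded reward.

```python
import math


def box_cox(ratio: float, alpha: float) -> float:
    """Textbook Box-Cox transform of a positive ratio; reduces to log at alpha = 0."""
    if ratio <= 0.0:
        raise ValueError("ratio must be positive")
    if abs(alpha) < 1e-12:
        return math.log(ratio)
    return (ratio ** alpha - 1.0) / alpha


# A token the teacher finds extremely unlikely relative to the student.
tiny_ratio = 1e-12
log_reward = math.log(tiny_ratio)             # about -27.6, unbounded as ratio -> 0
power_reward = box_cox(tiny_ratio, alpha=0.5)  # stays above -1 / 0.5 = -2
```

The comparison shows how a single low-probability token can dominate a log-ratio reward while the power-transformed reward stays in a fixed range.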
Results
PowerOPD demonstrates a +9.6 accuracy gain over vanilla OPD and matches the accuracy of full-vocabulary OPD with 10 times fewer training steps. Across six benchmarks, it achieves average gains of +6.37 in Avg@8 and +5.71 in Pass@8 compared to vanilla OPD, while also showing significant reductions in computational resource usage.
Implications
The findings suggest that PowerOPD can enhance the efficiency and effectiveness of training large language models, making it a valuable technique for improving model performance in various NLP tasks. Its ability to stabilize training dynamics could lead to broader applications in reinforcement learning and other areas where on-policy learning is critical.
Conservation Laws for Modern Neural Architectures
Theory
Optimization
- Development of a unified framework for characterizing conservation laws in modern neural architectures.
- Complete characterizations of conservation laws for various activation functions and architectures.
- Experimental validation of theoretical predictions regarding invariants in training dynamics.
- Insights into the implicit biases of training and their implications for optimization and convergence.
Summary
This paper addresses the understanding of gradient descent dynamics in over-parameterized neural networks through the lens of conservation laws. While previous work has explored these laws in simpler architectures, this study develops a unified framework to characterize conservation laws for modern neural architectures, including feedforward networks with various activation functions (GELU, SiLU, SwiGLU), multi-head attention mechanisms, and Mixture-of-Experts (MoE) models. The authors provide a comprehensive theoretical foundation and establish complete characterizations of conservation laws, which are quantities invariant along optimization trajectories. The paper also discusses the implications of these laws for training dynamics, optimization stability, and convergence. Experimental validation supports the theoretical findings, showcasing the relevance of conservation laws in understanding the implicit biases of contemporary neural networks. The work highlights the importance of these laws in guiding the design and optimization of modern architectures, ultimately contributing to the broader understanding of deep learning dynamics.
Methodology
The authors build upon existing frameworks for conservation laws, extending them to modern architectures. They formulate the problem of identifying conservation laws as a system of partial differential equations, providing a general solution strategy. The paper includes rigorous proofs and toy examples to illustrate the concepts before delving into more complex models.
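A classic toy case from earlier work on simpler architectures makes the idea of a conserved quantity concrete: for the scalar two-layer linear model f = w1·w2 trained on a squared loss, gradient flow keeps w1² − w2² constant. The sketch below is not one of the paper's new results; it uses discrete gradient descent with a small step size, so the quantity is only approximately conserved.

```python
def train(w1, w2, target=2.0, lr=1e-3, steps=5000):
    """Gradient descent on 0.5 * (w1 * w2 - target)**2 for a scalar two-layer linear model."""
    for _ in range(steps):
        err = w1 * w2 - target       # residual of the model output
        g1, g2 = err * w2, err * w1  # partial derivatives w.r.t. w1 and w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2


w1_0, w2_0 = 1.5, 0.5
w1_T, w2_T = train(w1_0, w2_0)

q_start = w1_0 ** 2 - w2_0 ** 2  # candidate invariant before training
q_end = w1_T ** 2 - w2_T ** 2    # same quantity after training
```

Both weights change substantially during training, yet `q_end` stays close to `q_start`; the small residual drift is second order in the learning rate.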
Results
The main results include the complete characterization of conservation laws for feedforward networks with GELU, SiLU, and SwiGLU activations, as well as a resolution of the open problem regarding multi-head attention. The findings reveal new invariants and establish a comprehensive understanding of conservation laws in contemporary neural architectures.
Implications
The implications of this work extend to the design and optimization of neural networks, offering insights that can enhance training efficiency and model performance. Understanding conservation laws can inform the development of new optimization methods and contribute to the theoretical foundation of deep learning.
When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning
Federated Learning
Computer Vision
Graph Learning
- Identifies limitations of generator-based FCIL under heterogeneous data streams, including modality coupling and error compounding.
- Introduces PRO, a generator-free framework for FCIL that uses projected rehearsal with class-level memories.
- Presents PRO-MAX, which enhances PRO with neighborhood-weighted memory alignment to adapt to representation drift.
- Demonstrates improved performance in heterogeneous environments compared to traditional replay methods.
Summary
This paper addresses the challenges of Federated Class-Incremental Learning (FCIL) in scenarios where clients experience heterogeneous data streams, leading to difficulties in knowledge retention and representation learning. Traditional methods often rely on generator-based replay, which can fail under diverse task conditions. The authors propose a novel framework called Projected Rehearsal Orchestration (PRO), which eliminates the need for synthetic input replay by utilizing compact class-level projected memories stored on the server. This allows clients to engage in balanced pseudo multi-task training, integrating current examples with old projected memories. To further enhance performance under significant representation drift, the authors introduce PRO-MAX, which incorporates neighborhood-weighted memory alignment. The proposed methods were evaluated across various benchmarks, including image, text, and graph data, demonstrating improved retention and utility in heterogeneous settings while remaining competitive in homogeneous scenarios. The findings indicate that merely increasing replay quantity does not address quality issues, and the proposed methods maintain better alignment with evolving representations.
Methodology
The authors developed the Projected Rehearsal Orchestration (PRO) framework, which maintains compact projected memories on the server and allows clients to perform balanced pseudo multi-task training. PRO-MAX extends this framework by introducing neighborhood-weighted memory alignment to adapt to changes in representation. The methods were evaluated across multiple benchmarks to assess their effectiveness in both homogeneous and heterogeneous FCIL scenarios.
Results
The experiments showed that both PRO and PRO-MAX significantly improved knowledge retention and overall utility in heterogeneous data streams compared to traditional generator-based replay methods. Even with expanded replay budgets, traditional methods degraded under supervision imbalance and stage misalignment, while the proposed methods maintained better alignment with evolving representations.
Implications
The findings suggest that the proposed methods can enhance the robustness of federated learning systems in real-world applications where data heterogeneity is prevalent. This could lead to more effective and reliable models in privacy-sensitive environments, such as healthcare and finance, where data cannot be centralized.
Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation
NLP
Large Language Models
Efficient ML
- Taylor-Calibrate offers a principled approach to initializing hybrid linear attention models, focusing on recurrent dynamics rather than just projection copying.
- The method utilizes Taylor-derived statistics from teacher attention to set key parameters for GDN students.
- Empirical evaluations show substantial improvements in zero-shot performance and training efficiency.
- The approach highlights the importance of proper initialization in the conversion process of pretrained models to hybrid architectures.
Summary
The paper introduces Taylor-Calibrate, a novel initialization method for hybrid linear attention models, specifically targeting Gated DeltaNet (GDN) students. Hybrid linear attention models aim to improve long-context inference by reducing the quadratic complexity of traditional softmax attention while maintaining performance. The authors highlight that converting a pretrained Transformer to a hybrid model often leads to poor initialization, which hampers the learning process. Taylor-Calibrate addresses this issue by using Taylor-guided statistics from teacher attention to set crucial parameters such as value projections, memory timescales, and gating dynamics. The method consists of two stages: first, it initializes GDN parameters based on teacher attention statistics; second, it aligns each converted layer to match the teacher's output. The results demonstrate that Taylor-Calibrate significantly enhances zero-shot performance and reduces the number of training tokens required for effective distillation compared to naive conversion methods.
Methodology
The authors propose a two-stage initialization method called Taylor-Calibrate. The first stage involves extracting statistics from teacher attention maps to initialize GDN parameters, including value scale, recurrent decay timescale, and write gates. The second stage applies a short layer-local alignment step to ensure that each converted GDN layer matches the output of the corresponding teacher layer.
Results
Taylor-Calibrate achieves up to an 88× improvement in zero-shot performance across various teacher settings and layer-selection policies. It also reduces the number of training tokens needed for effective model recovery by 4.9× to 9.2× compared to naive initialization methods.
Implications
The findings suggest that effective initialization is crucial for the successful conversion of pretrained models to hybrid architectures, potentially leading to more efficient and capable models for long-context tasks in natural language processing.
Physics-conforming Latent Twins
Theory
Efficient ML
Time Series
- Introduces a framework for learning surrogate models that conform to physical principles.
- Develops a constraint-transfer viewpoint linking physical structures in state and latent spaces.
- Proves structure-preservation bounds that enhance control over physical defects.
- Derives conditions for latent flow maps to preserve invariants and enforce dissipative structures.
Summary
The paper introduces the Physics-conforming Latent Twins framework, which aims to create surrogate models for time-dependent physical systems that not only interpolate training data accurately but also adhere to fundamental physical principles such as conservation laws and invariants. The authors build upon the Latent Twin formulation, which jointly learns an encoder, decoder, and latent flow map, while ensuring that the latent dynamics respect prescribed physical structures. A novel constraint-transfer viewpoint is developed to connect physical constraints in the original state space with compatible constraints in the latent space. The authors prove structure-preservation bounds that show how enforcing physical constraints in latent space enhances control over physical defects in decoded outputs. They derive algebraic conditions for latent flow maps that maintain linear and quadratic invariants and enforce dissipative inequalities. The framework is validated through numerical experiments on ordinary and partial differential equations, demonstrating improved constraint satisfaction, structural fidelity, and long-term behavior, while maintaining accurate surrogate predictions.
Methodology
The methodology involves a joint learning approach where an encoder, decoder, and latent flow map are trained together. The latent flow map is constrained to preserve or dissipate specific physical quantities, ensuring that the learned dynamics respect the underlying physical principles. The authors also derive algebraic conditions for the latent flow maps to maintain invariants and dissipative inequalities.
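For intuition, consider the simplest case of a linear latent flow map z → Az, which is an assumption made here for illustration. Standard linear algebra says the quadratic quantity ‖z‖² is preserved when AᵀA = I and is non-increasing when AᵀA ⪯ I. The sketch below checks both cases with a 2-D rotation and a contracted rotation; it does not reproduce the paper's conditions, which may be more general.

```python
import math


def flow_step(z, theta, rho=1.0):
    """Apply A = rho * R(theta) to a 2-D latent state z."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = z
    return (rho * (c * x - s * y), rho * (s * x + c * y))


def energy(z):
    """Quadratic quantity ||z||^2 tracked along the latent trajectory."""
    return z[0] ** 2 + z[1] ** 2


def rollout(z, theta, rho, steps):
    """Iterate the latent flow map for a fixed number of steps."""
    for _ in range(steps):
        z = flow_step(z, theta, rho)
    return z


z0 = (1.0, 2.0)
conservative = rollout(z0, theta=0.3, rho=1.0, steps=1000)  # A^T A = I
dissipative = rollout(z0, theta=0.3, rho=0.99, steps=1000)  # A^T A = 0.9801 * I
```

Building the constraint into the map itself, rather than penalizing violations in the loss, is what keeps the invariant from drifting over long rollouts.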
Results
The numerical experiments conducted on various ODE and PDE benchmarks showed that the Physics-conforming Latent Twins framework significantly improved constraint satisfaction and structural fidelity. The results indicated that the surrogate models maintained accurate predictions while demonstrating desirable long-time behavior, including stability in linear systems and appropriate dynamics in nonlinear systems.
Implications
The implications of this work extend to various fields where accurate modeling of physical systems is crucial, such as engineering, physics, and applied mathematics. The framework can enhance the reliability of surrogate models used in simulations and predictions, leading to better decision-making in complex physical scenarios.
From Drift to Coherence: Stabilizing Beliefs in LLMs
NLP
Large Language Models
Theory
- LLMs exhibit early-stage belief drift, violating the martingale property during initial predictions.
- Prompted predictive resampling (PPR) allows for the observation of belief stabilization over multiple resampling steps.
- A seed-answer prompting strategy accelerates the stabilization of predictive beliefs.
- A self-consistency loss can be used to fine-tune LLMs, amortizing early-stage belief drift.
Summary
This paper investigates the belief dynamics of large language models (LLMs) in the context of multiple-choice question answering, particularly focusing on the martingale property of predictive beliefs. The authors introduce a novel technique called prompted predictive resampling (PPR), which involves generating a sequence of answers to the same question to observe belief stabilization over time. Initial findings indicate that LLMs exhibit early-stage belief drift, violating the martingale property. However, after sufficient resampling, the beliefs converge to a coherent predictive distribution. To enhance this stabilization process, the authors propose a seed-answer prompting strategy and a self-consistency loss for fine-tuning. Experiments demonstrate that these methods significantly reduce belief drift and improve predictive coherence without compromising accuracy, suggesting that LLMs can be effectively aligned with Bayesian principles in practical applications.
Methodology
The authors employed a prompted predictive resampling (PPR) approach, where LLMs generate multiple answers to the same question in sequence. They analyzed the belief dynamics through empirical experiments on multiple-choice question answering benchmarks, introducing techniques to accelerate stabilization and fine-tune the models to reduce early-stage drift.
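The martingale property says that the expected next predictive belief equals the current one. A Pólya urn, a textbook Bayesian predictive model used here as a stand-in rather than the paper's LLM procedure, satisfies it exactly, as this small check with exact fractions shows. The function names are hypothetical.

```python
from fractions import Fraction


def predictive(a: int, b: int) -> Fraction:
    """Probability that the next answer is option A, given counts a (A) and b (B)."""
    return Fraction(a, a + b)


def expected_next_predictive(a: int, b: int) -> Fraction:
    """Average of the updated predictive over the two possible next answers."""
    p = predictive(a, b)
    return p * predictive(a + 1, b) + (1 - p) * predictive(a, b + 1)


current = predictive(3, 2)
expected = expected_next_predictive(3, 2)
```

Belief drift, in this vocabulary, means the analogous equality fails when the predictive distribution comes from a model's resampled answers.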
Results
The experiments showed that after sufficient resampling steps, the predictive distributions of LLMs stabilized into coherent sequences, demonstrating improved uncertainty calibration. The seed-answer prompting and self-consistency loss strategies were effective in reducing belief drift and enhancing predictive coherence, leading to better performance on QA benchmarks.
Implications
The findings suggest that LLMs can be aligned more closely with Bayesian inference principles, potentially improving their reliability and interpretability in various applications, particularly in contexts requiring coherent decision-making and uncertainty quantification.
An Exploratory Study of Blood Glucose Estimation from Photoplethysmography Signals using Machine Learning
Time Series
- Development of a non-invasive method for continuous glucose monitoring using PPG signals.
- Creation of a paired dataset from PPG and glucose measurements over two weeks.
- Application of machine learning techniques for feature extraction and prediction.
- Preliminary results suggest potential predictive capabilities of PPG signals for glucose estimation.
Summary
This paper addresses the pressing health issue of diabetes management by exploring non-invasive methods for continuous glucose monitoring (CGM) using Photoplethysmography (PPG) signals. Traditional CGM methods are invasive and can lead to complications, highlighting the need for alternative approaches. The authors propose a machine learning-based system that utilizes PPG signals collected from smartwatches alongside glucose levels recorded by a CGM device. They created a paired dataset from five volunteers over a two-week period, capturing continuous PPG data at a high sampling rate and glucose measurements at 15-minute intervals. The study involved preprocessing the data through cubic spline interpolation to align the sampling rates and feature extraction using a sliding window approach to derive 918 features from the PPG signals. Preliminary results indicate that predictive signals may exist, although further exploration with larger datasets is necessary to validate the findings. The dataset is made publicly accessible for future research.
Methodology
The study involved collecting PPG data from smartwatches and glucose levels from CGM devices simultaneously. Data preprocessing included cubic spline interpolation to align the sampling rates and feature extraction using a sliding window method to derive a fixed number of features from the PPG signals.
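The sliding-window step can be sketched as below. This minimal version computes four summary statistics per window on an invented toy signal; the window length, step size, and feature set are assumptions, and neither the 918 features nor the cubic spline alignment is reproduced.

```python
import statistics


def sliding_window_features(signal, window, step):
    """Summarize each window of the signal with a few basic statistics."""
    features = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        features.append({
            "mean": statistics.fmean(chunk),
            "std": statistics.pstdev(chunk),
            "min": min(chunk),
            "max": max(chunk),
        })
    return features


# Toy PPG-like samples (invented for illustration).
ppg = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
rows = sliding_window_features(ppg, window=4, step=2)
```

Each row is one fixed-length feature vector, which is what lets a standard regressor be trained on a signal of arbitrary duration.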
Results
Preliminary experimental results indicate that there are predictive signals within the PPG data that may correlate with blood glucose levels. However, the authors emphasize the need for further research with a larger dataset to confirm these findings.
Implications
This research could lead to the development of more accessible and user-friendly glucose monitoring systems, potentially improving diabetes management and patient outcomes through non-invasive technology.
Online LLM Selection via Constrained Bandits with Time-Varying Demand
Large Language Models
Reinforcement Learning
Optimization
- Introduces a constrained stochastic bandit framework for online LLM selection.
- Addresses both hard budget constraints and soft latency SLAs in model selection.
- Develops the COPAC-UCB algorithm, which balances performance and feasibility under uncertainty.
- Provides theoretical guarantees of sublinear regret and sublinear constraint violation.
Summary
This paper addresses the challenge of selecting appropriate Large Language Models (LLMs) in edge-cloud inference systems, where tasks vary in accuracy, latency, and cost. The authors propose a constrained stochastic bandit framework to dynamically select LLMs while adhering to hard and soft resource constraints. The hard constraint relates to budget limits on token-based inference costs, while the soft constraint pertains to latency service-level agreements (SLAs). The proposed algorithm, COPAC-UCB, utilizes confidence-bound estimates and demand predictions to optimize model selection under uncertainty. The study demonstrates that the algorithm achieves sublinear regret and sublinear constraint violation relative to an offline benchmark. Experimental results on synthetic workloads validate the effectiveness of the approach in dynamic environments, showcasing its robustness in managing resource constraints and adapting to time-varying task demands.
Methodology
The authors formulate the online LLM selection problem as a constrained stochastic multi-armed bandit problem, integrating both packing-type (hard budget) and covering-type (soft latency) constraints. The COPAC-UCB algorithm employs confidence-guided estimation, utilizing upper confidence bounds for rewards and covering constraints, and lower confidence bounds for packing constraints, along with Lagrangian regularization. A forecasting component is included to estimate cumulative task loads, enhancing responsiveness to demand variability.
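The optimistic treatment of each quantity can be sketched as follows: an upper confidence bound for the reward and the covering (SLA) term, a lower confidence bound for the packing (cost) term, combined in a Lagrangian score. This is an illustrative sketch, not the COPAC-UCB algorithm itself; the bonus formula, the statistics, the fixed multipliers, and all names are assumptions, and the multiplier updates and demand forecasting are omitted.

```python
import math


def bonus(t: int, pulls: int) -> float:
    """Hoeffding-style confidence radius after `pulls` observations at round t."""
    return math.sqrt(2.0 * math.log(t) / pulls)


def select_model(stats, t, lam_cost, lam_sla):
    """Pick the model with the highest optimistic Lagrangian score."""
    best_name, best_score = None, -math.inf
    for name, s in stats.items():
        b = bonus(t, s["pulls"])
        reward_ucb = min(1.0, s["reward"] + b)  # optimistic accuracy
        sla_ucb = min(1.0, s["sla_rate"] + b)   # optimistic SLA satisfaction (covering)
        cost_lcb = max(0.0, s["cost"] - b)      # optimistic cost (packing)
        score = reward_ucb + lam_sla * sla_ucb - lam_cost * cost_lcb
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical empirical means, all normalized to [0, 1].
stats = {
    "small": {"pulls": 1000, "reward": 0.6, "sla_rate": 0.9, "cost": 0.4},
    "large": {"pulls": 1000, "reward": 0.8, "sla_rate": 0.7, "cost": 0.9},
}
choice_unconstrained = select_model(stats, t=2000, lam_cost=0.0, lam_sla=0.0)
choice_budgeted = select_model(stats, t=2000, lam_cost=1.0, lam_sla=0.0)
```

With no cost pressure the more accurate model wins; raising the cost multiplier shifts the choice to the cheaper one.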
Results
The proposed COPAC-UCB algorithm achieves sublinear regret and sublinear constraint violation relative to an offline benchmark. Experimental results indicate that the algorithm effectively adapts to dynamic task demands and resource constraints, outperforming static selection strategies.
Implications
This research has significant implications for real-world applications of LLMs in edge-cloud systems, particularly in optimizing resource utilization and ensuring service quality under varying task demands. The findings can inform the design of adaptive inference systems that leverage multiple LLMs efficiently.
RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning
Time Series
- RECTOR introduces a novel self-supervised learning framework for EEG/sEEG data.
- The framework evolves static anatomical definitions into adaptive functional regions.
- It employs a unified approach with three complementary objectives for robust representation learning.
- RECTOR sets new benchmarks in EEG emotion recognition and sEEG task engagement classification.
Summary
The paper introduces RECTOR, a self-supervised framework designed to enhance representation learning from EEG/sEEG data for diagnosing affective and cognitive disorders. Traditional methods struggle with the dynamic nature of brain activity, often relying on fixed anatomical models that fail to adapt to functional changes during cognitive processing. RECTOR addresses these challenges through a hierarchical, block-sparse self-attention mechanism called RECTOR-SA, which evolves static anatomical definitions into adaptive functional regions. The framework employs Masked Topology and Representation Learning (MTRL), optimizing three objectives: Masked Predictive Modeling (MPM), Topological Structure Modeling (TSM), and Cross-View Consistency (CVC). This unified approach allows for robust feature reconstruction and improves the model's ability to generalize across different EEG/sEEG setups. The results demonstrate that RECTOR achieves state-of-the-art performance in EEG emotion recognition and sEEG task engagement classification, showcasing its robustness to missing channels and its potential for large-scale pre-training on heterogeneous data.
Methodology
RECTOR utilizes a hierarchical, block-sparse self-attention mechanism driven by Adaptive Functional Partitioning (AFP) to capture region-channel-temporal dynamics. It incorporates Masked Topology and Representation Learning (MTRL), which combines Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency into a single forward pass to enhance representation learning.
Results
RECTOR outperforms existing state-of-the-art methods in EEG emotion recognition and sEEG task engagement classification across various benchmarks, achieving superior performance while maintaining computational efficiency.
Implications
The findings suggest that RECTOR can serve as a powerful tool for clinical diagnosis of affective and cognitive disorders, providing objective biomarkers and enabling the development of personalized brain-computer interfaces (BCIs). Its robustness to data variability positions it well for large-scale applications in diverse clinical settings.
The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance
NLP
Large Language Models
Theory
- Model selection is critical for causal inference in pharmacovigilance.
- BioBERT outperforms other models in predictive accuracy and causal term identification.
- Domain-specific pre-training is a decisive factor for model success.
- Probability calibration improves ECE but can negatively affect accuracy.
Summary
This paper addresses the challenge of distinguishing causal adverse drug events (ADEs) from spurious correlations in pharmacovigilance using the InferBERT framework, which integrates transformer models with Do-calculus. The authors conduct a systematic evaluation of various classification models to determine their impact on causal inference within this framework. They assess simpler statistical models, the benefits of domain-specific pre-training, and the effects of scaling to large language models (LLMs). The study utilizes two pharmacovigilance benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM), evaluating four models: XGBoost, ALBERT, BioBERT, and Med-LLaMA. Through a rigorous comparative study involving 5-fold cross-validation over 20 runs, the authors measure predictive accuracy, Expected Calibration Error (ECE), and the concordance of causal terms with traditional methods. The findings reveal that BioBERT significantly outperforms other models in predictive accuracy and causal term identification, emphasizing the importance of domain-specific pre-training. The results suggest that investing in manageable-sized, domain-aware models is more effective than scaling model size for computational pharmacovigilance.
Methodology
The study employs a two-stage InferBERT process, where a classification model predicts clinical outcomes based on patient reports, followed by causal analysis using Do-calculus. Four models are evaluated: XGBoost, ALBERT, BioBERT, and Med-LLaMA, with performance assessed through predictive accuracy, probability calibration, and concordance with traditional pharmacovigilance methods using a 5-fold cross-validation design repeated over 20 runs.
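Expected Calibration Error (ECE), one of the metrics above, bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses equal-width bins and invented toy predictions; the bin count and function name are assumptions, not the paper's settings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins [k/n_bins, (k+1)/n_bins)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy predictions (invented): four at 0.95 confidence, two at 0.55.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

Here the high-confidence bin is overconfident by 0.20 and the other by 0.05, giving a weighted ECE of 0.15. ECE measures only this confidence-accuracy gap, so it can improve without any gain in accuracy.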
Results
BioBERT demonstrated statistically significant superiority in predictive accuracy across both datasets (p < 0.0001). The larger Med-LLaMA model consistently underperformed, ranking last. Domain-specific pre-training was identified as a crucial factor for success, while probability calibration improved ECE but had inconsistent effects on accuracy and causal discovery.
Implications
The findings underscore the importance of model selection and domain-specific pre-training in causal inference for pharmacovigilance, suggesting that smaller, specialized models may be more effective than larger, general-purpose models. This has implications for the design of future pharmacovigilance systems and the development of machine learning methodologies in healthcare.
AI for Social Good: An Investigation of the Causal Relationship Between Environmental Regulations and Their Effects on Air Pollution in London, UK
Time Series
- Development of an uncertainty-aware Bayesian deep learning framework for causal inference in air pollution regulation.
- Estimation of a 12.35% reduction in PM2.5 levels due to regulatory measures in London.
- Identification of stronger regulatory effects post-2013, with peak improvements in 2018-2019.
- Demonstration of how causal AI can support evidence-based environmental policy-making.
Summary
This study investigates the causal relationship between environmental regulations and air pollution levels, specifically PM2.5 concentrations, in London from 2010 to 2020. The authors develop a Bayesian deep learning framework that incorporates various data sources, including PM2.5 measurements, meteorological data, socioeconomic indicators, and regulatory status for 32 policy measures. By employing a Bayesian Long Short-Term Memory (LSTM) model, the framework captures temporal dependencies and adjusts for non-random policy implementation through a regulation status prediction branch. The study estimates the regulatory effects by comparing observed PM2.5 levels with counterfactual predictions under a hypothetical no-regulation scenario. The results indicate that London's air pollution regulations were associated with an average PM2.5 reduction of 1.88 μg/m³, representing a 12.35% relative reduction, particularly evident after 2013 and peaking in 2018 and 2019. The findings underscore the importance of sustained regulatory interventions for improving air quality and demonstrate how uncertainty-aware causal AI can enhance environmental governance and public health protection.
Methodology
The study employs a Bayesian LSTM model to analyze daily PM2.5 concentrations alongside meteorological and socioeconomic data. It incorporates a regulation status prediction branch to adjust for non-random policy implementation, allowing for counterfactual comparisons to estimate regulatory effects.
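The counterfactual comparison can be sketched as follows. The sketch assumes the relative reduction is measured against the mean of the no-regulation prediction; the summary does not state the exact formula, and the numbers below are hypothetical rather than taken from the study.

```python
def regulatory_effect(observed, counterfactual):
    """Average absolute and relative reduction versus a no-regulation scenario.

    observed:       measured PM2.5 values (μg/m³).
    counterfactual: model predictions for the same days without regulation.
    Returns (absolute_reduction, relative_reduction).
    """
    if len(observed) != len(counterfactual) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    mean_counterfactual = sum(counterfactual) / len(counterfactual)
    absolute = mean_counterfactual - mean_observed
    # Assumption: the baseline for the percentage is the counterfactual mean.
    relative = absolute / mean_counterfactual
    return absolute, relative


# Hypothetical daily values, for illustration only.
absolute, relative = regulatory_effect(
    observed=[13.0, 14.0, 12.0],
    counterfactual=[15.0, 16.0, 14.0],
)
```

In a Bayesian setting the same calculation would be repeated for each posterior sample of the counterfactual, giving a distribution over the effect instead of a single number.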
Results
The analysis reveals that London's air pollution regulations led to an average PM2.5 reduction of 1.88 μg/m³, translating to a 12.35% relative decrease. The regulatory effects were minimal before 2013 but became more pronounced from 2013 to 2017, with the strongest impacts observed in 2018 and 2019.
Implications
The findings provide a robust basis for policymakers to assess the effectiveness of air pollution regulations, advocating for sustained and cumulative regulatory efforts. The methodology can be applied to other urban settings for evaluating environmental policies and enhancing public health governance.
Scalar-Stepsize Nonuniform Monte Carlo Optimistic Policy Iteration: A Certified Counterexample
Reinforcement Learning
Theory
Optimization
- Establishes a certified counterexample to convergence under nonuniform state selection in scalar-stepsize Monte Carlo OPI.
- Demonstrates that nonuniform update frequencies can lead to nonconvergence in a specific MDP setup.
- Identifies the distortion of residual dynamics as a key factor in creating stable nonoptimal cycles.
- Utilizes computer-assisted methods to rigorously certify the counterexample.
Summary
This paper presents a certified counterexample to the convergence of the scalar-stepsize nonuniform Monte Carlo optimistic policy iteration (OPI) in the context of a three-state, two-action discounted Markov Decision Process (MDP). Building on Tsitsiklis' foundational work, which established convergence under uniform update structures, the author demonstrates that nonuniform update frequencies introduce significant challenges that can prevent convergence. The study constructs a specific MDP and a nonuniform update distribution that leads to a failure of the stochastic recursion to converge with positive probability. The analysis reveals that the nonuniform sampling distorts the residual dynamics, resulting in a stable nonoptimal cycle rather than convergence to the Bellman fixed point. The paper rigorously certifies this counterexample using a combination of computer-assisted orbit certificates and martingale trapping arguments, highlighting the geometric obstructions caused by nonuniform sampling. The findings underscore the importance of uniform sampling in ensuring convergence in Monte Carlo OPI and provide insights into the limitations of scalar-stepsize updates in asynchronous settings.
Methodology
The paper constructs a three-state, two-action discounted MDP and defines a nonuniform update distribution. It analyzes the limiting mean field dynamics and employs a differential inclusion approach to study the behavior of the policy-tie surfaces. The proof involves a combination of theoretical analysis and computer-assisted certification to demonstrate the nonconvergence of the stochastic recursion.
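For readers unfamiliar with the recursion being analyzed, here is a rough sketch of Monte Carlo optimistic policy iteration with a single global stepsize and nonuniform state selection. The MDP below is a hypothetical toy, not the paper's certified construction, and the 1/(t+1) stepsize schedule and truncated rollouts are simplifying assumptions.

```python
import random

GAMMA = 0.9

# Hypothetical 3-state, 2-action MDP (NOT the counterexample from the paper).
# P[s][a] is a list of (probability, next_state); R[s][a] is the reward.
P = [
    [[(1.0, 1)], [(0.5, 0), (0.5, 2)]],
    [[(1.0, 2)], [(1.0, 0)]],
    [[(1.0, 0)], [(0.5, 1), (0.5, 2)]],
]
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]]


def greedy_policy(V):
    """Policy that is greedy with respect to the current value estimate V."""
    policy = []
    for s in range(len(P)):
        q = [R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])
             for a in range(len(P[s]))]
        policy.append(max(range(len(q)), key=q.__getitem__))
    return policy


def rollout_return(s, policy, rng, horizon=200):
    """Truncated discounted return of one trajectory started at state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        g += discount * R[s][a]
        discount *= GAMMA
        u, acc = rng.random(), 0.0
        for p, s2 in P[s][a]:
            acc += p
            if u <= acc:
                s = s2
                break
    return g


def monte_carlo_opi(select_probs, n_iters=2000, seed=0):
    """Update one sampled state per iteration using a scalar (global) stepsize."""
    rng = random.Random(seed)
    V = [0.0] * len(P)
    for t in range(n_iters):
        s = rng.choices(range(len(P)), weights=select_probs)[0]  # nonuniform
        g = rollout_return(s, greedy_policy(V), rng)
        step = 1.0 / (t + 1)  # same stepsize regardless of which state was picked
        V[s] += step * (g - V[s])
    return V


values = monte_carlo_opi(select_probs=[0.7, 0.2, 0.1])
```

The point of the paper is that, for a carefully chosen MDP and selection distribution, a recursion of this shape can settle into a cycle instead of the optimal values; this toy example makes no such claim about its own behavior.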
Results
The main result is a certified negative answer to the question of whether scalar-stepsize nonuniform Monte Carlo OPI converges: with positive probability, the recursion fails to converge once update frequencies are nonuniform. The analysis reveals a diagonally scaled greedy-policy mean field that leads to an attracting cycle, preventing convergence to the optimal policy.
Implications
The findings have significant implications for the design and analysis of Monte Carlo methods in reinforcement learning, particularly in understanding the conditions under which convergence can be guaranteed. This work emphasizes the necessity of careful consideration of update structures in policy iteration algorithms and may influence future research on improving convergence properties in nonuniform settings.
Assessing Reliability of Symbol Detection in Concept Bottleneck Models
Interpretability
- High task accuracy in CBMs does not guarantee reliable symbol detection.
- Swapping independently trained concept detectors and classification heads reveals reliability issues.
- A reliability-aware training strategy mitigates information leakage and improves symbol detection reliability.
- Concept-level metrics and uncertainty estimates are crucial for assessing symbol reliability.
Summary
This paper investigates the reliability of symbol detection in Concept Bottleneck Models (CBMs), which are designed for explainable AI by utilizing human-interpretable symbols for predictions. The authors highlight that high accuracy in tasks does not necessarily indicate reliable symbol detection, as CBMs may inadvertently encode task-specific shortcuts. To assess this reliability, the authors propose a method of swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. They analyze the resulting performance degradation and use concept-level metrics and symbol-wise uncertainty estimates to identify concepts prone to spurious activations. The paper introduces a reliability-aware training strategy that optimizes a shared concept detector with multiple classification heads while penalizing reliance on unreliable symbols. Experimental results on the CUB-200-2011 dataset show that with full concept supervision, the detectors and heads can be interchanged with minimal accuracy loss. However, in scenarios with reduced concept supervision, while task accuracy remains high, the reliability of the symbol detection collapses. The proposed training strategy significantly improves reliability, effectively doubling swap accuracy in cases of information leakage.
Methodology
The authors employed a method of swapping independently trained concept detectors and classification heads to assess the reliability of symbol detection in CBMs. They analyzed performance degradation and used concept-level metrics and uncertainty estimates to identify unreliable concepts. A reliability-aware training strategy was proposed, optimizing a shared concept detector with multiple classification heads while penalizing reliance on unreliable symbols.
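The swap test described above can be sketched with toy components. Everything below (the two detectors, the two heads, and the data) is hypothetical and only illustrates why native accuracy can stay high while swapped accuracy drops when a detector hides task information in the wrong concept slot.

```python
def accuracy(detector, head, inputs, labels):
    """Task accuracy of a detector/head pair."""
    hits = sum(1 for x, y in zip(inputs, labels) if head(detector(x)) == y)
    return hits / len(labels)


def swap_report(detector_a, head_a, detector_b, head_b, inputs, labels):
    """Native and swapped accuracies for two independently trained models."""
    return {
        "detector_a+head_a": accuracy(detector_a, head_a, inputs, labels),
        "detector_b+head_b": accuracy(detector_b, head_b, inputs, labels),
        "detector_a+head_b": accuracy(detector_a, head_b, inputs, labels),
        "detector_b+head_a": accuracy(detector_b, head_a, inputs, labels),
    }


# Toy task: the label is 1 when the first feature is positive.
inputs = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
labels = [1, 1, 0, 0]

# Model A is faithful: concept 0 means "feature 0 > 0", concept 1 means "feature 1 > 0".
def detector_a(x):
    return (int(x[0] > 0), int(x[1] > 0))

def head_a(concepts):
    return concepts[0]

# Model B leaks: it stores the task-relevant signal in the slot meant for concept 1.
def detector_b(x):
    return (0, int(x[0] > 0))

def head_b(concepts):
    return concepts[1]

report = swap_report(detector_a, head_a, detector_b, head_b, inputs, labels)
```

Both native pairs solve the task, yet each swapped pair falls to chance, which is the signature of unreliable symbols that the swap procedure is designed to expose.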
Results
The study found that on the CUB-200-2011 dataset, concept detectors and classification heads could be interchanged with minimal accuracy loss under full supervision. However, in a controlled synthetic task with reduced concept supervision, task accuracy remained high while reliability metrics collapsed to chance levels. The proposed reliability-aware training strategy significantly improved swap accuracy, effectively doubling it in scenarios with information leakage.
Implications
The findings suggest that improving the reliability of symbol detection in CBMs can enhance the trustworthiness of explainable AI systems. This has potential applications in critical areas such as healthcare, finance, and autonomous systems, where understanding model decisions is essential.
Looped World Models
Reinforcement Learning
Efficient ML
Robotics
- Introduces Looped World Models (LoopWM) for efficient world modeling.
- Achieves up to 100× parameter efficiency over traditional models.
- Utilizes iterative refinement of latent states through shared transformer blocks.
- Establishes iterative latent depth as a new scaling axis for world simulation.
Summary
The paper introduces Looped World Models (LoopWM), a novel architecture for world modeling that addresses the limitations of current models in simulating long-horizon environments. Traditional world models often struggle with deep computation requirements, leading to high costs and compounding errors. LoopWM employs a parameter-shared transformer block to iteratively refine latent environment states, achieving up to 100× parameter efficiency compared to conventional methods. This approach allows for adaptive computation, scaling the model's depth according to the complexity of each prediction step. By establishing iterative latent depth as a new scaling axis, LoopWM presents a significant advancement in the field of world simulation, potentially enhancing the performance of reinforcement learning and embodied intelligence applications.
Methodology
LoopWM employs a looped architecture using parameter-sharing in transformer blocks to iteratively refine latent environment states. This method allows for adaptive computation, where the depth of the model can scale according to the complexity of the prediction task, thus optimizing resource usage and improving simulation fidelity.
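A minimal sketch of the looping idea, assuming a toy block in place of a transformer: one set of weights is applied repeatedly to a latent state, and the loop stops early once the state stops changing. The tanh block, the tolerance-based halting rule, and all names are assumptions for illustration; the summary does not describe LoopWM's actual halting mechanism.

```python
import math


def shared_block(h, obs, W):
    """One pass through the shared block: h <- tanh(W h + obs)."""
    n = len(h)
    return [math.tanh(sum(W[i][j] * h[j] for j in range(n)) + obs[i])
            for i in range(n)]


def looped_refine(obs, W, max_loops=16, tol=1e-4):
    """Refine a latent state by reusing the same weights W on every loop.

    Returns the final latent state and the number of loops actually used.
    """
    h = [0.0] * len(obs)
    loops_used = 0
    for k in range(1, max_loops + 1):
        new_h = shared_block(h, obs, W)
        delta = max(abs(a - b) for a, b in zip(new_h, h))
        h = new_h
        loops_used = k
        if delta < tol:  # assumed halting rule: stop when the update is tiny
            break
    return h, loops_used


W = [[0.1, 0.0], [0.0, 0.1]]
latent, loops_used = looped_refine(obs=[0.5, -0.3], W=W)
```

Because the weights are shared, the parameter count is independent of the number of loops, so depth becomes a compute knob that can vary per prediction step.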
Results
LoopWM achieved up to 100× parameter efficiency relative to conventional world models while maintaining simulation quality over extended horizons. The adaptive computation mechanism reduced the computational burden by scaling depth to the complexity of each prediction step.
Implications
LoopWM has the potential to advance the field of reinforcement learning by providing more efficient and accurate world models. Its architecture could be applied to various domains requiring long-horizon predictions, such as robotics, autonomous driving, and complex decision-making tasks.
SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs
NLP
Large Language Models
Efficient ML
- Introduction of a differentiable soft top-k routing mechanism for MoE models.
- Implementation of a learnable, globally constrained expert budget for adaptive expert allocation.
- SoftMoE achieves competitive or superior performance compared to standard sparse MoE while activating fewer experts.
- The model reveals structured expert allocation, with later layers requiring more experts.
Summary
The paper introduces SoftMoE, a novel approach to Mixture-of-Experts (MoE) architectures that addresses the limitations of traditional sparse MoE models, which rely on non-differentiable top-k routing. SoftMoE replaces the discrete routing mechanism with a truncated soft top-k LapSum relaxation, enabling gradient-based optimization of expert routing. This method allows for a learnable allocation of expert capacity across layers while imposing a global budget constraint on the number of active experts. The authors demonstrate that SoftMoE maintains compatibility with autoregressive language modeling and achieves competitive performance on language modeling tasks and downstream benchmarks, while activating significantly fewer experts on average. The learned expert allocation is shown to be non-uniform, with later layers utilizing more experts, providing insights into the computational needs of large language models.
Methodology
SoftMoE employs a truncated soft top-k mechanism based on the LapSum relaxation to replace hard expert selection. This allows for gradient-based optimization and adaptive allocation of expert capacity across layers. The model parameterizes the mean number of active experts per layer and enforces a global constraint on the total number of active experts, facilitating efficient computation.
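To make the idea of a soft top-k concrete, the sketch below assigns each expert a weight given by a Laplace CDF of its shifted score and picks the shift so that the weights sum to k. This follows the general spirit of Laplace-CDF relaxations but is an assumption-laden illustration: the bisection solver, the temperature `alpha`, and the function names are not from the paper, and the truncation step is not implemented.

```python
import math


def laplace_cdf(z):
    """CDF of a standard Laplace distribution."""
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def soft_top_k(scores, k, alpha=0.1):
    """Smooth selection weights in (0, 1) that sum to k.

    scores: router scores, one per expert.
    k:      target number of active experts; may be fractional, 0 < k < len(scores).
    alpha:  temperature; smaller values approach hard top-k.
    """
    n = len(scores)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of experts")
    lo = min(scores) - 50 * alpha   # threshold so low that nearly all weights are 1
    hi = max(scores) + 50 * alpha   # threshold so high that nearly all weights are 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        total = sum(laplace_cdf((s - mid) / alpha) for s in scores)
        if total > k:
            lo = mid    # too much mass selected: raise the threshold
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    return [laplace_cdf((s - b) / alpha) for s in scores]


weights = soft_top_k([2.0, 1.0, 0.5, -1.0], k=2)
```

Because every weight is a smooth function of the scores and of k, gradients can flow to both the router and a learnable per-layer budget, which is what hard top-k selection prevents.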
Results
SoftMoE consistently matches or outperforms standard sparse MoE models on language modeling tasks using datasets like C4 and OpenWebText. The model activates significantly fewer experts on average, demonstrating improved efficiency. The allocation of experts is highly structured, with later layers activating more experts, indicating a strategic distribution of computational resources.
Implications
The findings suggest that SoftMoE can enhance the efficiency of large language models by optimizing expert allocation, potentially leading to more scalable and effective models in natural language processing tasks. This approach could influence future research on adaptive computation in deep learning architectures.
Rethinking Groups in Critic-Free RLVR
Reinforcement Learning
Large Language Models
NLP
- Critic-free RL methods often rely on multiple rollouts, leading to data inefficiency and training instability.
- The authors propose negative token filtering to enhance single-rollout training stability.
- Empirical results show that group-free methods outperform group-based counterparts in specific tasks.
- The study reveals that grouping primarily serves to protect shared useful tokens from being over-penalized.
Summary
This paper addresses the inefficiencies in existing critic-free reinforcement learning (RL) methods for post-training large language models (LLMs), particularly focusing on the role of 'groups' in estimating value baselines for advantage computation. The authors argue that the primary function of these groups is not just to estimate baselines but to mitigate false penalties on negative samples. They introduce a novel strategy called negative token filtering, which allows for stable single-rollout training. This method retains only the lowest-probability tokens from negative samples, thus preventing over-penalization of useful tokens shared between positive and negative rollouts. The authors empirically validate their approach on two batch-level advantage computation methods, demonstrating that their group-free methods achieve comparable performance on reasoning tasks and superior performance on agentic tasks compared to traditional group-based RL techniques.
Methodology
The authors reverse-engineer the functional mechanism of rollout groups to understand their role in training stability. They propose a filtering strategy that retains only the 10% of tokens with the lowest probability in each negative sample, thereby reducing harmful updates during training. The effectiveness of this approach is tested on two batch-level advantage computation methods.
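The filtering step can be sketched as a per-token mask. The sketch assumes the 10% cut is applied within each negative rollout and that at least one token is always kept; both choices, and the function name, are assumptions rather than details confirmed by the summary.

```python
def negative_token_mask(token_logprobs, keep_fraction=0.10):
    """Mask that keeps only the lowest-probability tokens of a negative sample.

    token_logprobs: log-probabilities the policy assigned to each sampled token.
    keep_fraction:  share of tokens that still receive the negative update.
    Returns a list of booleans, True where the penalty is applied.
    """
    n = len(token_logprobs)
    if n == 0:
        return []
    n_keep = max(1, round(keep_fraction * n))
    # Indices sorted from least likely to most likely token.
    order = sorted(range(n), key=lambda i: token_logprobs[i])
    keep = set(order[:n_keep])
    return [i in keep for i in range(n)]


# Ten tokens from an incorrect rollout; only the token at index 2 was unlikely.
logprobs = [-0.1, -0.2, -5.0, -0.3, -0.1, -0.4, -0.2, -0.1, -0.3, -0.2]
mask = negative_token_mask(logprobs)
```

High-probability tokens in a wrong answer are often the same tokens that appear in correct answers, so masking them out avoids penalizing behavior the model should keep.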
Results
The proposed negative token filtering method leads to stable training with single rollouts, achieving performance comparable to group-based methods on reasoning tasks and superior performance on agentic tasks. The analysis indicates that grouping contributes to training stability mainly by protecting useful tokens shared between positive and negative rollouts from being over-penalized.
Implications
This work has significant implications for the design of reinforcement learning algorithms, particularly in the context of large language models. By improving training efficiency and stability, the proposed methods could enhance the performance of LLMs in various applications, including natural language understanding and generation.
PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation
Generative Models
Time Series
Theory
- PHINN combines persistent homology with flow matching for rare-event time series synthesis.
- Dynamic Betti curves serve as continuous conditioning signals for improved generative modeling.
- The framework demonstrates superior performance in topological fidelity and structural shape fidelity on benchmark datasets.
- Cross-domain meta-learning and a natural-language interface enhance usability and adaptability.
Summary
The paper introduces PHINN, a novel neural network framework designed for generating rare-event time series data, such as financial crises and infrastructure failures. Traditional generative models struggle with these rare events due to data scarcity and their inability to capture the underlying structural dynamics. PHINN leverages persistent homology to analyze the topological features of time series data, specifically focusing on the transitions in Betti numbers (β0, β1, β2) that characterize rare events. The framework employs dynamic Betti curves as continuous conditioning signals and incorporates a persistence landscape loss to ensure higher-order homology consistency. PHINN is capable of handling multivariate and multi-modal data through joint Vietoris–Rips filtrations and includes a natural-language interface for translating practitioner intent into Betti-curve targets. The model also features cross-domain meta-learning for efficient transfer across different rare-event datasets and retrieval-augmented topological memory for few-shot generation. The authors validate PHINN's performance across various benchmarks, demonstrating significant improvements in topological fidelity and structural shape fidelity compared to existing statistical and diffusion-based models.
Methodology
PHINN employs a conditional flow-matching framework that integrates dynamic Betti curves for conditioning. It utilizes a persistence landscape loss to enforce higher-order homology consistency and implements joint Vietoris–Rips filtrations for multivariate data. The model also features a learned translation layer for natural-language input and incorporates cross-domain meta-learning for efficient adaptation to different datasets.
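As background on the conditioning signal, the sketch below computes a Betti-0 curve, the number of connected components of a Vietoris–Rips complex as the scale grows, using union-find. It assumes the convention that two points are connected when their distance is at most the scale, and it covers only β0 over scale; the paper's dynamic curves and higher Betti numbers are not reproduced here.

```python
import math


def betti0_curve(points, scales):
    """Number of connected components at each scale (returned in ascending scale order)."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    curve, components, e = [], n, 0
    for eps in sorted(scales):
        # Add every edge whose length does not exceed the current scale.
        while e < len(edges) and edges[e][0] <= eps:
            _, i, j = edges[e]
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
                components -= 1
            e += 1
        curve.append(components)
    return curve


curve = betti0_curve([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], scales=[0.5, 1.0, 4.0])
```

For a time series, the points would typically come from a sliding-window or delay embedding, and recomputing such curves over successive windows gives a signal that changes as the structure of the series changes.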
Results
On benchmarks related to financial crises and multi-modal time series, PHINN outperformed all statistical and diffusion-based baselines, achieving a 41% to 63% reduction in topological error (β-RMSE) and an 84% improvement in transition accuracy. It matched the performance of practitioner-grade jump-diffusion models in statistical tail coverage while significantly exceeding them in structural shape fidelity.
Implications
PHINN's ability to generate structurally faithful scenarios for rare events can enhance decision-making in various domains, including finance, infrastructure management, and cybersecurity. Its topological approach provides a new perspective on modeling complex time series data, potentially leading to more robust predictive tools.
Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks
Theory
Optimization
Efficient ML
- Grokking in DNNs is linked to hysteresis in first-order L2 phase transitions.
- Metastable states trap models in low-accuracy phases until SGD noise facilitates escape.
- Escape times from metastable states follow Arrhenius kinetics, indicating sensitivity to hyperparameters.
- The number of metastable states corresponds to the number of learnable features in the model.
Summary
This paper investigates the phenomenon of grokking in deep neural networks (DNNs), characterized by a sudden transition to generalization after prolonged overfitting. The authors demonstrate that grokking can be explained through first-order phase transitions induced by L2 regularization, which leads to the existence of coexisting metastable states in the learning process. They show that these metastable states can trap models, preventing convergence until stochastic gradient descent (SGD) noise enables escape across energy barriers. By manipulating L2 regularization, the authors reproduce grokking behavior and establish that escape times follow Arrhenius scaling, highlighting the sensitivity of this process to hyperparameters such as learning rate and batch size. The findings suggest that the number of metastable states corresponds to the number of learnable features, with implications for understanding the dynamics of learning in DNNs and improving learning efficiency.
Methodology
The authors utilize deep linear networks as a minimal model to analytically explore the loss landscape and identify metastable states and energy barriers. They employ L2 regularization to engineer deliberate trapping of models in low-accuracy phases and analyze the escape dynamics using stochastic gradient descent (SGD) as a Langevin process.
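Arrhenius kinetics says the expected escape time grows exponentially in the ratio of barrier height to noise temperature. The sketch below pairs that law with the common heuristic that SGD's effective temperature scales with learning rate divided by batch size; that heuristic, the prefactor, and the scale constant are assumptions, not values from the paper.

```python
import math


def arrhenius_escape_time(barrier, temperature, prefactor=1.0):
    """Expected escape time: prefactor * exp(barrier / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return prefactor * math.exp(barrier / temperature)


def sgd_temperature(learning_rate, batch_size, scale=1.0):
    """Assumed effective noise temperature of SGD: proportional to lr / batch size."""
    return scale * learning_rate / batch_size


# Doubling the batch size halves the temperature and lengthens the wait exponentially.
t_small_batch = arrhenius_escape_time(0.001, sgd_temperature(0.1, 64))
t_large_batch = arrhenius_escape_time(0.001, sgd_temperature(0.1, 128))
```

The exponential dependence is why small changes to learning rate or batch size can shift the onset of generalization by orders of magnitude in training steps.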
Results
The study confirms that first-order L2 phase transitions produce multiple metastable states, with escape from these states governed by Arrhenius-type kinetics. The authors successfully reproduce the characteristics of grokking, including delayed convergence and sensitivity to initialization, across various experimental setups.
Implications
The insights gained from this research could lead to the development of more efficient learning algorithms by understanding and manipulating the dynamics of metastable states in DNNs. This could enhance generalization capabilities and reduce overfitting in complex models.
When Dynamics Models Read the Wrong Time Steps: Label-Free Event Credit Re-Anchoring for Robust Global Readouts
Time Series
Theory
Robotics
- Identification of temporal credit dilution in dynamics models, where models misallocate credit away from critical events.
- Introduction of Credit-in-Event as a method to measure credit allocation in pooled representations.
- Development of CREST, a label-free and training-free method for re-anchoring credit to transient events.
- Demonstration of improved robustness and accuracy in out-of-distribution scenarios using CREST.
Summary
This paper addresses a critical issue in learned dynamics models that predict global physical quantities from temporal sequences of measurements. The authors identify a phenomenon termed 'temporal credit dilution,' where models assign insufficient credit to the brief physical events that determine the target output, instead relying on smooth correlates that do not causally influence the predictions. This misallocation of credit is not detectable through standard training loss metrics, leading to poor performance under distribution shifts. To tackle this problem, the authors introduce 'Credit-in-Event,' a probe that quantifies how much credit is assigned to event steps during the pooling process. They propose a novel method called CREST (Credit RE-anchoring through Sparse Transient readout), which is both training-free and label-free, allowing for the estimation of a transient event core from learned features. This method contrasts event features against background information without requiring additional labels. The effectiveness of CREST is demonstrated across various simulated systems and datasets, showing a significant reduction in out-of-distribution error and restoration of event credit, thereby confirming the importance of correctly anchoring credit to transient events.
Methodology
The authors propose a new interface-level probe, Credit-in-Event, to analyze credit allocation in pooled representations. They develop CREST, which estimates a transient event core from learned features and contrasts it against background information without requiring additional labels. The methodology is validated through experiments on simulated gear and impact systems, recurrent and attention encoders, and a public bearing vibration dataset.
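One plausible reading of the Credit-in-Event probe is the share of total pooling credit that lands on event time steps. The sketch below implements that reading; the use of absolute credit values and the function name are assumptions, since the summary does not give the exact definition.

```python
def credit_in_event(credits, event_mask):
    """Fraction of total absolute credit assigned to event time steps.

    credits:    per-time-step credit (for example pooling or attention weights).
    event_mask: True for time steps inside the transient event.
    """
    if len(credits) != len(event_mask):
        raise ValueError("credits and event_mask must have the same length")
    total = sum(abs(c) for c in credits)
    if total == 0:
        return 0.0
    in_event = sum(abs(c) for c, is_event in zip(credits, event_mask) if is_event)
    return in_event / total


# Mean pooling over 100 steps with a 5-step event: the event gets 5% of the credit.
uniform_credit = [1.0] * 100
mask = [40 <= t < 45 for t in range(100)]
diluted = credit_in_event(uniform_credit, mask)
```

The example shows the dilution effect directly: with uniform pooling, a brief event that fully determines the target still receives only a small fraction of the credit.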
Results
CREST significantly reduces out-of-distribution error and restores event credit across various datasets and models. The ablation studies indicate that the improvements are specifically due to the re-anchoring of credit to transient events, rather than other common techniques like stable-step selection or receptive-field shrinking.
Implications
The findings suggest that addressing temporal credit dilution can enhance the robustness of dynamics models in real-world applications, particularly in scenarios where the operating conditions differ from training conditions. This could have significant implications for fields such as predictive maintenance, fault detection, and safety assessments in engineering systems.
Edge Flow: A Tractable and Predictive Continuous-Time Model for Gradient Descent at the Edge of Stability
Optimization
Theory
- Edge Flow is a new model for understanding gradient descent dynamics at the edge of stability.
- The model decomposes GD dynamics into three components: center, oscillation direction, and oscillation magnitude.
- It requires minimal computational resources, needing only two gradient evaluations and one Hessian-vector product per iteration.
- Edge Flow effectively captures the oscillation of sharpness and the dynamics of GD, outperforming previous continuous-time models.
Summary
The paper introduces Edge Flow, a continuous-time model designed to capture the dynamics of gradient descent (GD) in deep learning when operating at the edge of stability (EoS). This regime is characterized by the largest eigenvalue of the loss Hessian being close to the stability threshold, which complicates traditional analysis methods. Edge Flow consists of three coupled ordinary differential equations that model the dynamics of GD by decomposing it into a center, an oscillation direction, and an oscillation magnitude. The center follows a modified gradient flow, while the direction tracks a top eigenvector of the Hessian. The oscillation magnitude is adjusted based on the sharpness of the loss landscape, which stabilizes through a self-regulating feedback loop. The model is shown to require only two gradient evaluations and one Hessian-vector product per iteration, making it computationally efficient. Empirical results demonstrate that Edge Flow accurately tracks GD dynamics and resolves sharpness oscillations, providing a framework for understanding and mitigating instabilities in the EoS regime.
Methodology
The methodology involves formulating a system of three coupled ordinary differential equations to represent the dynamics of gradient descent at the edge of stability. The model decomposes the dynamics into components that track the center of the iterates, the direction of oscillation, and the magnitude of oscillation, incorporating a self-stabilization feedback mechanism.
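The stability threshold that defines this regime is easiest to see on a one-dimensional quadratic, where gradient descent with learning rate lr is stable only if the curvature (sharpness) stays below 2 / lr. The sketch below illustrates that threshold only; it is not an implementation of the Edge Flow equations.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Gradient descent on f(x) = 0.5 * sharpness * x**2.

    Each step multiplies x by (1 - lr * sharpness), so iterates shrink when
    sharpness < 2 / lr and grow while flipping sign when sharpness > 2 / lr.
    """
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x -= lr * sharpness * x   # the gradient of f at x is sharpness * x
        trajectory.append(x)
    return trajectory


lr = 0.1                      # stability threshold is 2 / lr = 20
below = gd_on_quadratic(sharpness=19.0, lr=lr)   # oscillates and decays
above = gd_on_quadratic(sharpness=21.0, lr=lr)   # oscillates and grows
```

In deep networks the sharpness is not fixed: it drifts up toward the threshold, oscillations start, and the oscillations in turn push sharpness back down, which is the feedback loop the three-component decomposition is meant to capture.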
Results
Empirical evaluations show that Edge Flow tracks the dynamics of gradient descent with high fidelity, particularly during the oscillation of sharpness at the onset of edge of stability. The model demonstrates improved performance in predicting GD behavior compared to existing continuous-time models, effectively capturing the oscillatory nature of the sharpness and the overall descent dynamics.
Implications
The findings suggest that Edge Flow can be used as a tool for better understanding and controlling the behavior of gradient descent in deep learning, particularly in regimes where traditional methods fail. This could lead to more stable training processes and improved performance in neural network optimization.
Sign-Rank, Index, and List Replicability: Connections and Separations
Theory
- Establishes a strong separation between sign rank and Z2-index.
- Demonstrates that Z2-index is upper-bounded by a linear function of list replicability.
- Introduces upper bounds for list replicability based on height and minimum star number.
- Proves a composition result for list replicability in product classes.
Summary
This paper investigates the relationships between three complexity measures in learning theory: sign rank, Z2-index, and list replicability. The sign rank of a binary concept class indicates the smallest dimension for its representation using points and halfspaces. The authors establish a connection between the Z2-index and list replicability, demonstrating that the Z2-index is upper-bounded by a linear function of the list replicability number. This finding leads to a significant separation between sign rank and Z2-index, addressing a question posed by previous researchers. The study further explores list replicability, presenting upper bounds based on combinatorial measures such as height and minimum star number. A key composition result is also proven, indicating that the list replicability number of the product of two concept classes is bounded by the sum of their individual list replicability numbers. Overall, the paper provides new insights into the structure of these complexity measures and their interrelations, enhancing the understanding of lower bounds in sign rank.
Methodology
The authors utilize a combination of topological and combinatorial techniques to analyze the relationships between sign rank, Z2-index, and list replicability. They establish bounds and separations through theoretical proofs and by leveraging existing results in learning theory.
Results
The main results include a strong separation between sign rank and Z2-index, the establishment of upper bounds for list replicability, and a fundamental composition theorem for list replicability in product classes. These findings contribute to a deeper understanding of the complexity measures in learning theory.
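In symbols, the two structural statements can be written as below. The notation is assumed for illustration: LR denotes the list replicability number, ind the Z2-index, and a, b unspecified constants, since the summary does not give their values.

```latex
% Composition result for product classes (as described in the summary):
\mathrm{LR}(\mathcal{C}_1 \times \mathcal{C}_2) \;\le\; \mathrm{LR}(\mathcal{C}_1) + \mathrm{LR}(\mathcal{C}_2)

% Z2-index bounded by a linear function of list replicability
% (a and b are unspecified constants, not taken from the paper):
\mathrm{ind}_{\mathbb{Z}_2}(\mathcal{C}) \;\le\; a \cdot \mathrm{LR}(\mathcal{C}) + b
```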
Implications
The results have significant implications for learning theory, particularly in understanding the limitations and capabilities of different complexity measures. They may influence future research directions in developing algorithms and frameworks for analyzing binary concept classes and their representations.
Topological Flow Matching
Generative Models
Graph Learning
Theory
- Introduction of Topological Flow Matching (TFM) as a generalization of Flow Matching (FM).
- Incorporation of topological information through a Laplacian-derived drift in the reference process.
- TFM maintains a stable, simulation-free objective and deterministic sample paths.
- Demonstrated effectiveness on diverse datasets, including brain fMRI, ocean currents, and traffic flows.
Summary
The paper introduces Topological Flow Matching (TFM), a novel approach that enhances the traditional Flow Matching (FM) framework by incorporating topological information relevant to structured data domains, such as fMRI scans, ocean currents, and traffic flows. The authors argue that conventional FM overlooks the rich topological features of these domains by treating them as points in Euclidean space. TFM addresses this limitation by interpreting FM as a solution to a degenerate Schrödinger bridge problem and augmenting the reference process with a Laplacian-derived drift. This modification retains the desirable properties of FM, including a stable, simulation-free objective and deterministic sample paths, making TFM a seamless replacement for standard FM. The effectiveness of TFM is demonstrated through evaluations on various structured datasets, showing significant performance improvements over both FM and existing topological Schrödinger bridge methods.
Methodology
The authors leverage the relationship between Flow Matching and the Schrödinger bridge problem to develop TFM. They augment the reference process with a Laplacian-derived drift to incorporate topological information, allowing for the modeling of distributions over signals defined on finite graphs and simplicial complexes. The methodology ensures that TFM retains the computational advantages of FM while enhancing its applicability to structured domains.
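As a rough illustration of what a Laplacian-derived drift does to a signal on a graph, the sketch below builds the combinatorial Laplacian of a small graph and integrates the deterministic flow dx/dt = -c L x. The heat-diffusion form of the drift, the constant `c`, the Euler integrator, and all function names are assumptions made for illustration; the summary does not give the paper's exact reference process.

```python
def graph_laplacian(n, edges):
    """Combinatorial Laplacian L = D - A of an undirected, unweighted graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def laplacian_drift(L, x, c=1.0):
    """Assumed drift -c * L x: pulls each node's value toward its neighbours."""
    n = len(x)
    return [-c * sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]


def euler_flow(L, x0, c=1.0, steps=100, t_end=1.0):
    """Integrate dx/dt = -c L x with explicit Euler steps (a deterministic path)."""
    dt = t_end / steps
    x = list(x0)
    for _ in range(steps):
        drift = laplacian_drift(L, x, c)
        x = [xi + dt * di for xi, di in zip(x, drift)]
    return x


# Path graph 0 - 1 - 2 with a unit spike on node 0.
L = graph_laplacian(3, [(0, 1), (1, 2)])
x1 = euler_flow(L, [1.0, 0.0, 0.0])
```

Under this assumed drift, the spike on node 0 spreads along the edges while the total mass of the signal is preserved, and a constant signal is left unchanged. This is the sense in which the graph structure, rather than plain Euclidean geometry, shapes the path.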
Results
TFM was evaluated on various structured datasets, including brain fMRI data, ocean currents, seismic events, and traffic flows. The results indicated that TFM outperformed both standard Flow Matching and existing topological Schrödinger bridge approaches, demonstrating its effectiveness in capturing the underlying topological structure of the data.
Implications
The introduction of TFM has significant implications for generative modeling in fields where data is structured, such as neuroscience, environmental science, and urban studies. By effectively utilizing topological features, TFM can lead to better modeling and understanding of complex phenomena represented in structured datasets.
pFedUL: Layer-Aware Federated Unlearning for Personalized Federated Learning
Federated Learning
- pFedUL addresses the unique challenges of federated unlearning in personalized federated learning settings.
- The framework incorporates layer-aware strategies to balance unlearning completeness and personalization preservation.
- New metrics (PPS and CFI) are introduced to evaluate unlearning quality in pFL contexts.
- Experimental results indicate pFedUL's effectiveness in maintaining high personalized accuracy while achieving efficient unlearning.
Summary
The paper introduces pFedUL, a framework for federated unlearning (FU) tailored to personalized federated learning (pFL). Existing FU methods primarily target the FedAvg paradigm and do not account for the pFL architecture, which separates shared global layers from client-specific personalized layers. This separation makes it difficult to achieve complete unlearning while preserving personalization for the remaining clients. The authors formalize unlearning in the pFL context and propose a layer-aware selective unlearning approach with three components: (1) gradient-based layer-wise contribution attribution to assess the influence of a target client's data on both shared and personalized parameters, (2) adaptive selective unlearning that employs different forgetting strategies for each layer type, and (3) a lightweight recalibration protocol to help remaining clients restore their personalization with minimal overhead. The authors also introduce two evaluation metrics, Personalization Preservation Score (PPS) and Cross-client Fairness Index (CFI), to assess unlearning quality in pFL. Experimental results show that pFedUL achieves unlearning effectiveness comparable to full retraining while maintaining an average personalized accuracy of 97.3% for remaining clients. Compared with six state-of-the-art FU methods adapted for pFL, pFedUL preserves personalization better, improving PPS by 6.3% on average and achieving an 8.4× speedup across various architectures and datasets.
Methodology
The methodology involves a layer-aware selective unlearning framework that includes gradient-based contribution attribution, adaptive selective unlearning strategies for different layer types, and a recalibration protocol for remaining clients. The authors also introduce new metrics to evaluate the effectiveness of their approach.
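To make the idea of gradient-based layer-wise contribution attribution concrete, here is a minimal sketch that scores each layer by the target client's share of the total gradient norm across clients. The scoring rule, the layer names, and the function names are hypothetical choices for illustration; the summary does not specify the attribution formula pFedUL actually uses.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(v * v for v in vec))


def layer_contributions(target_grads, all_client_grads):
    """Hypothetical attribution: per layer, the target client's share of the
    summed gradient norms over all clients (a value between 0 and 1)."""
    scores = {}
    for layer, grad in target_grads.items():
        total = sum(l2_norm(client[layer]) for client in all_client_grads)
        scores[layer] = l2_norm(grad) / total if total > 0 else 0.0
    return scores


# Toy gradients for two clients; layer names are illustrative only.
clients = [
    {"shared.conv": [3.0, 4.0], "personal.head": [0.0, 1.0]},
    {"shared.conv": [0.0, 5.0], "personal.head": [3.0, 4.0]},
]
target = clients[0]  # the client whose data should be unlearned
scores = layer_contributions(target, clients)
```

A score like this could then guide which layers receive a stronger forgetting strategy and which are left mostly untouched, although the thresholds and per-layer strategies would have to come from the paper itself.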
Results
pFedUL achieves unlearning effectiveness comparable to full retraining while maintaining an average personalized accuracy of 97.3% for remaining clients. It outperforms six state-of-the-art FU methods in personalization preservation, with an average improvement of 6.3% in PPS and an 8.4× speedup across tested architectures and datasets.
Implications
The findings suggest that pFedUL can be effectively utilized in applications requiring compliance with data protection regulations, such as healthcare and finance, where personalized models are critical. The framework's ability to efficiently unlearn data while preserving model performance could enhance user privacy and regulatory compliance in federated learning systems.
Benchmarking Instance-Dependent Label Noise with Controlled Corruptions
Computer Vision
Theory
- CILN framework allows for controlled generation of instance-dependent label noise through input corruptions.
- The benchmarks created with CILN provide clearer insights into the sources and severity of label noise.
- Corruption-mediated IDN can reveal failure modes in existing noisy-label learning methods.
- The study emphasizes the significance of noise structure over noise rate in evaluating algorithm performance.
Summary
This paper introduces CILN (Corruption-Induced Label Noise), a framework for generating benchmarks of instance-dependent label noise (IDN) through controlled input corruptions. Unlike existing methods that rely on imperfect annotators or classifiers to create noise, CILN explicitly manipulates the input data to produce label uncertainty. The authors construct 90 benchmark settings using datasets such as CIFAR-10, MNIST, and Adult, varying corruption types and severity levels. The results indicate that CILN-generated benchmarks exhibit genuine instance-dependent noise and represent human uncertainty more realistically than previous synthetic benchmarks. The study also shows that corruption-mediated IDN can expose failure modes in popular noisy-label learning methods, which highlights the role of noise structure in benchmark difficulty and algorithm performance. By making the ambiguity generation process explicit, CILN offers a tool for systematically studying how different corruption mechanisms affect learning behavior.
Methodology
The authors developed the CILN framework, which applies controlled corruptions to clean instances to create noisy labels. A diverse voter pool evaluates the corrupted instances, producing label distributions that reflect the ambiguity introduced by the corruptions. The framework was tested across various corruption families and severity levels on multiple datasets.
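The step from voter decisions to a per-instance label distribution can be sketched as follows. The votes are hard-coded placeholders standing in for the voter pool's predictions on one corrupted instance, and the function names and the sampling of a single noisy label are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def label_distribution(votes, num_classes):
    """Empirical distribution over classes from a list of voter labels."""
    counts = Counter(votes)
    n = len(votes)
    return [counts.get(k, 0) / n for k in range(num_classes)]


def sample_noisy_label(dist, rng):
    """Draw one noisy label according to the instance's label distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


# Placeholder votes from five hypothetical voters on one corrupted image:
# three say class 3, two say class 5.
votes = [3, 3, 5, 3, 5]
dist = label_distribution(votes, num_classes=10)

rng = random.Random(0)  # fixed seed so the draw is reproducible
noisy = sample_noisy_label(dist, rng)
```

Because the distribution is computed per instance, a heavily corrupted input where voters disagree gets a flatter distribution than a lightly corrupted one, which is what makes the resulting noise instance-dependent.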
Results
The benchmarks generated using CILN demonstrated genuine instance-dependent noise and provided diverse confusion structures. On CIFAR-10, the label distributions from CILN were closer to human uncertainty than those of existing benchmarks. The study also found that popular noisy-label learning methods such as Co-Teaching and DivideMix exhibited failure modes under corruption-mediated IDN that were not apparent under traditional rater-fallibility noise.
Implications
CILN provides a robust framework for evaluating and improving noisy-label learning methods by allowing researchers to systematically study the impact of different types of label noise. This can lead to the development of more resilient machine learning models capable of handling real-world label noise more effectively.