AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 24
Update frequency: 8h
Days of history: 7
Balanced LoRA: Removing Parameter Invariance to Accelerate Convergence
NLP
Large Language Models
Efficient ML
- BaLoRA improves convergence rates by enforcing balanced low-rank adapters during optimization.
- Theoretical analysis shows that balanced minimizers have optimal conditioning, leading to faster convergence.
- Empirical results demonstrate that BaLoRA outperforms standard LoRA and matches or exceeds state-of-the-art LoRA variants.
- The method is computationally efficient and compatible with existing fine-tuning frameworks.
Summary
This paper introduces Balanced Low-Rank Adaptation (BaLoRA), an extension of the widely used Low-Rank Adaptation (LoRA) method for fine-tuning large language models. The authors observe that LoRA is inherently overparameterized: many pairs of low-rank factors yield the same adapted weight matrix but have different condition numbers. This variation in conditioning affects the convergence rate of optimization. BaLoRA addresses the issue by projecting the low-rank adapters onto a balanced manifold during training, which improves the conditioning of the loss landscape while leaving the adapted matrix unchanged. The authors provide both theoretical and empirical evidence that BaLoRA converges faster than standard LoRA and achieves superior performance across various fine-tuning tasks. The method is computationally lightweight and integrates into existing fine-tuning pipelines, making it a practical choice for researchers and practitioners in the field.
Methodology
The authors analyze the asymptotic behavior of LoRA's training dynamics and establish bounds on the convergence rate by examining the conditioning of the loss landscape. They introduce BaLoRA, which incorporates a projection step onto a balanced manifold after each optimization step, ensuring that the low-rank adapters maintain optimal conditioning throughout training.
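The balancing idea can be illustrated with a small NumPy sketch. It assumes "balanced" means the two factors have equal Gram matrices (B^T B = A A^T) and uses a standard QR-plus-SVD construction to reach such a pair without changing the product B A. The function name `balance_factors` and this particular construction are illustrative assumptions; the paper's actual projection step may differ.

```python
import numpy as np


def balance_factors(B, A):
    """Return (B_bal, A_bal) with the same product as (B, A) and equal Gram matrices.

    Assumes B has shape (d, r) and A has shape (r, k) with d >= r and k >= r.
    Illustrative only; not taken from the paper.
    """
    # Thin QR of each factor so the remaining work happens on r x r matrices.
    Q_b, R_b = np.linalg.qr(B)       # B = Q_b @ R_b
    Q_a, R_a = np.linalg.qr(A.T)     # A.T = Q_a @ R_a, so A = R_a.T @ Q_a.T
    # B @ A = Q_b @ (R_b @ R_a.T) @ Q_a.T; take the SVD of the small core.
    U, s, Vt = np.linalg.svd(R_b @ R_a.T)
    sqrt_s = np.sqrt(s)
    # Split the singular values evenly between the two new factors.
    B_bal = Q_b @ (U * sqrt_s)                 # scales the columns of U
    A_bal = (sqrt_s[:, None] * Vt) @ Q_a.T     # scales the rows of Vt
    return B_bal, A_bal


# A deliberately unbalanced pair: B is large, A is small.
rng = np.random.default_rng(0)
B = rng.normal(size=(6, 2)) * 10.0
A = rng.normal(size=(2, 5)) * 0.1
B_bal, A_bal = balance_factors(B, A)
print(np.allclose(B_bal @ A_bal, B @ A))                  # adapted matrix unchanged
print(np.allclose(B_bal.T @ B_bal, A_bal @ A_bal.T))      # Gram matrices now equal
```

In this construction both Gram matrices equal the diagonal matrix of singular values of B A, which is one way to remove the rescaling freedom the summary describes.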
Results
BaLoRA consistently outperformed standard LoRA in various experiments involving large language models and datasets, demonstrating faster convergence and improved performance metrics. The authors also reformulated BaLoRA iterations as an intrinsic optimization scheme, providing a clearer geometric interpretation of the algorithm.
Implications
The findings suggest that BaLoRA can significantly enhance the efficiency of fine-tuning large language models, making it a valuable tool for researchers and practitioners in natural language processing and related fields. Its compatibility with existing frameworks allows for easy adoption in practical applications.
A Unifying View of Variational Generative Wasserstein Flows
Generative Models
Optimization
Theory
- Introduction of Generative Wasserstein Flows (GWF) as a unified framework for generative modeling.
- Derivation of various generative methods as instances of parametric JKO schemes for f-divergences.
- Extension of the JKO framework to Integral Probability Metrics and squared Maximum Mean Discrepancy.
- Empirical analysis of JKO regularization effects on generative model training.
Summary
This paper presents a unified theoretical framework for generative modeling based on Wasserstein gradient flows, termed Generative Wasserstein Flows (GWF). The authors demonstrate that a wide class of existing generative methods can be derived from parametric Jordan–Kinderlehrer–Otto (JKO) schemes for f-divergence objectives. They establish equivalences between various recently proposed algorithms and extend the framework to Integral Probability Metrics and squared Maximum Mean Discrepancy, leading to new JKO-based generative algorithms. The paper also empirically studies the impact of JKO regularization across a range of objectives and analyzes parametric Wasserstein flows, where the dynamics are constrained to distributions induced by parameterized maps. This work aims to clarify the connections between different generative modeling approaches, including GANs, and provides a comprehensive understanding of their underlying geometric structures.
Methodology
The authors utilize Wasserstein gradient flows and the Jordan–Kinderlehrer–Otto (JKO) scheme to derive generative algorithms from f-divergence minimization. They extend their framework to include Integral Probability Metrics and squared Maximum Mean Discrepancy, and conduct empirical studies to assess the impact of JKO regularization on generative models.
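For orientation, the standard JKO step and a parametric variant can be written as follows. The first line is the classical scheme; the second restricts the search to distributions pushed forward by a parameterized map. The symbols \mathcal{F}, \tau, \mu, and T_\theta are notation chosen here, not taken from the paper, and \mathcal{F} stands for whichever objective is being minimized (for example an f-divergence to the data distribution).

```latex
% Classical JKO step with step size \tau > 0
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho}\;
  \mathcal{F}(\rho) + \frac{1}{2\tau}\, W_2^2(\rho, \rho_k)

% Parametric variant: \rho_\theta = (T_\theta)_{\#}\mu for a parameterized map T_\theta
\theta_{k+1} \in \operatorname*{arg\,min}_{\theta}\;
  \mathcal{F}(\rho_\theta) + \frac{1}{2\tau}\, W_2^2(\rho_\theta, \rho_{\theta_k})
```

The Wasserstein term acts as a proximal regularizer that keeps each update close to the previous distribution, which is the "JKO regularization" whose effect the paper studies empirically.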
Results
The paper establishes theoretical equivalences between various generative modeling methods and demonstrates that the proposed GWF framework can encompass and clarify these connections. Empirical results indicate that JKO regularization positively influences the training of generative models across multiple objectives.
Implications
This work provides a comprehensive understanding of generative modeling techniques, potentially guiding the design of more effective generative algorithms. The unification of various methods under the GWF framework may lead to improved performance and faster sampling in generative tasks.
What changes after deployment? A survey on On-device Learning in TinyML
Efficient ML
- ODL enables machine learning models to adapt to distribution changes post-deployment directly on devices.
- The survey categorizes distribution changes into three regimes: single-change, concept drift, and continual learning.
- There is a significant gap between theoretical benchmarks and real-world applications in ODL.
- Understanding the nature of distribution changes is crucial for developing effective ODL solutions.
Summary
This paper presents a comprehensive survey of On-device Learning (ODL) in the context of Tiny Machine Learning (TinyML), focusing on the challenges posed by distribution changes after deployment. Traditional machine learning models, once deployed on microcontroller-class devices, often fail to perform effectively due to shifts in data distribution that occur in real-world scenarios. ODL aims to address this issue by enabling learning processes to occur directly on the device, allowing models to adapt to new data distributions. The authors categorize the existing literature into three distinct distribution change regimes: single-change, concept drift, and continual learning. Each regime presents unique challenges and requirements for the learning algorithms and hardware used. The survey analyzes approximately 70 ODL works, highlighting the persistent gap between methodological benchmarks and practical deployment scenarios. By emphasizing the importance of understanding distribution changes, the paper provides a structured framework for evaluating and comparing ODL solutions, ultimately contributing to the advancement of adaptive TinyML systems.
Methodology
The authors conducted a systematic survey of the existing literature on ODL in TinyML, categorizing works based on the type of distribution change they address. They analyzed approximately 70 studies, focusing on how different change types influence applications, hardware, and solution structures.
Results
The survey identified three main distribution change regimes and highlighted the varying demands each regime places on applications and learning algorithms. It also revealed a gap between the ideal performance of ODL methods in controlled settings and their effectiveness in real-world deployments.
Implications
The findings suggest that future research in TinyML should prioritize understanding and addressing distribution changes to enhance the adaptability and performance of on-device learning systems. This could lead to more robust applications in areas such as wearables, industrial sensors, and other embedded systems.
Fixed Universal Transformers
Theory
- Introduces the notion of universal transformers that can simulate any transformer in a class via input embeddings.
- Provides explicit constructions of sparse universal transformers and shows that randomly initialized transformers are almost surely universal.
- Establishes lower bounds on the embedding dimensions required for universality, particularly for transformers with multiple heads.
- Empirical evaluations demonstrate the effectiveness of universal transformers in specific algorithmic tasks.
Summary
This paper introduces the concept of universal transformers, which are fixed transformers capable of simulating any transformer within a specific class through appropriate input embeddings. The authors draw an analogy to universal Turing machines, where the input embedding serves as a program that encodes the parameters of the target transformer while keeping the internal parameters of the universal transformer fixed. The paper presents explicit sparse constructions that achieve universality when the embedding dimension is sufficiently large and demonstrates that randomly initialized transformers are almost surely universal. Empirical validation is conducted on tasks such as parenthesis balancing and multi-hop reasoning, suggesting that a significant portion of a transformer's expressive power may derive from its input representation rather than its learned weights.
Methodology
The authors formalize the concept of universal transformers and provide explicit constructions with fixed parameters. They analyze the conditions under which these transformers can simulate target transformers and establish theoretical lower bounds. Empirical evaluations are conducted on specific tasks to validate the theoretical claims.
Results
The paper shows that a fixed universal transformer can simulate any target transformer with appropriate embeddings, achieving universality under certain conditions. The empirical results indicate high accuracy in tasks like parenthesis balancing and multi-hop reasoning, supporting the theoretical findings.
Implications
The findings suggest that universal transformers can significantly enhance model reprogramming techniques and expand the potential for transfer learning in deep learning applications. The emphasis on input representation may lead to new approaches in designing transformer architectures.
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Graph Learning
Time Series
- GC-MoE introduces a dual-pathway router that combines static topology features with dynamic input representations for expert selection.
- The framework leverages frozen pretrained experts, allowing for low-parameter training while utilizing a diverse set of models.
- An optional output refinement layer can enhance performance at minimal additional parameter cost.
- The study includes an ablation analysis to evaluate the effectiveness of lightweight extensions and their interaction with routing mechanisms.
Summary
The paper presents GC-MoE, a framework for spatio-temporal forecasting on sensor graphs, particularly for traffic prediction. Traditional approaches often apply a single backbone architecture uniformly across all nodes, which may not capture the distinct dynamics exhibited by different road segments. GC-MoE addresses this limitation with a graph-conditioned mixture-of-experts strategy, in which each node is assigned a personalized combination of frozen forecasting experts based on the graph's topology and recent traffic inputs. The framework integrates multiple pretrained spatio-temporal graph neural network (GNN) experts with a lightweight, input-aware routing mechanism that adapts to current traffic conditions. The authors also explore an optional output refinement layer and node-adaptive ST-LoRA adapters for further performance enhancement. Experimental results across four standard benchmarks show that GC-MoE significantly improves mean absolute error (MAE) over a zero-parameter ensemble baseline while remaining competitive in root mean square error (RMSE) and mean absolute percentage error (MAPE). It does so while training only approximately 17,000 parameters on top of 1.5 million frozen expert weights.
Methodology
GC-MoE employs a modular framework that pretrains multiple diverse spatio-temporal GNN experts, freezes them, and learns a routing mechanism that assigns expert weights based on both static and dynamic inputs. The routing mechanism is designed to adapt to current traffic conditions, enhancing the model's ability to handle varying dynamics across different nodes.
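A per-node router of this kind might look roughly like the sketch below. It assumes the two pathways produce expert logits that are simply added before a softmax, and that the frozen experts' forecasts are mixed as a convex combination. The names `route_node` and `mix_experts`, the additive combination, and the linear pathways are all illustrative assumptions rather than the paper's design.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_node(static_feat, dynamic_feat, W_static, W_dynamic):
    """Gate weights over experts for one node.

    W_static[e] and W_dynamic[e] are the (hypothetical) linear weights that
    expert e applies to the static topology features and the recent inputs.
    """
    logits = []
    for w_s, w_d in zip(W_static, W_dynamic):
        z_static = sum(a * b for a, b in zip(w_s, static_feat))
        z_dynamic = sum(a * b for a, b in zip(w_d, dynamic_feat))
        logits.append(z_static + z_dynamic)
    return softmax(logits)


def mix_experts(gates, expert_forecasts):
    """Convex combination of frozen expert forecasts for one node."""
    horizon = len(expert_forecasts[0])
    return [
        sum(g * forecast[t] for g, forecast in zip(gates, expert_forecasts))
        for t in range(horizon)
    ]


# Two experts, one node, a two-step forecast horizon.
gates = route_node(
    static_feat=[1.0, 0.0],
    dynamic_feat=[0.5],
    W_static=[[0.2, 0.0], [0.0, 0.3]],
    W_dynamic=[[1.0], [-1.0]],
)
print(gates, mix_experts(gates, [[10.0, 20.0], [30.0, 40.0]]))
```

Only the router weights would be trained in such a setup; the expert forecasts are treated as fixed inputs, which is what keeps the trainable parameter count small.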
Results
The experimental evaluation on four benchmarks (PEMS04, PEMS07, METR-LA, and PEMS-BAY) shows that GC-MoE outperforms a zero-parameter ensemble baseline in terms of MAE, while also achieving competitive RMSE and MAPE scores. The model trains only about 17,000 parameters on top of 1.5 million frozen expert weights.
Implications
The findings suggest that personalized expert selection based on graph topology and traffic conditions can significantly improve traffic forecasting accuracy. This approach could be applied to other domains where heterogeneous dynamics exist, enhancing predictive performance in urban analytics and beyond.
Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects
Theory
- Chem-PerturBridge integrates a vast amount of transcriptomic data from diverse sources, providing a unified resource for small-molecule perturbation studies.
- The study reveals that while fine-grained logFC agreement across datasets is weak, the direction of logFC is more consistent.
- Embeddings pretrained on Chem-PerturBridge significantly improve performance in compound representation learning compared to existing methods.
- The resource supports both diagnostic evaluations of cross-dataset agreement and model-oriented reuse of heterogeneous data.
Summary
The paper introduces Chem-PerturBridge, a comprehensive and harmonized resource designed to facilitate the training and evaluation of small-molecule transcriptomic perturbation models. It integrates over 37,000 compounds, 136 cellular contexts, and 1.25 million transcriptomic samples across various assay types, including bulk RNA-seq and single-cell data. The authors address the fragmentation of existing transcriptomic resources, which differ in technologies, metadata conventions, and preprocessing methods, making it challenging to compare and utilize these datasets effectively. Chem-PerturBridge standardizes compound identifiers, cellular contexts, doses, and other metadata, allowing for a more coherent analysis of perturbation effects. The study evaluates the agreement of matched conditions across datasets and finds that while fine-grained log fold change (logFC) rankings show weak agreement, the direction of logFC is more stable. Additionally, the resource is tested for its utility in pretraining models for compound representation learning, demonstrating that embeddings trained on Chem-PerturBridge outperform those trained on other datasets. This work not only provides a valuable resource for researchers but also highlights the importance of harmonization in transcriptomic data for improving predictive modeling.
Methodology
The authors constructed Chem-PerturBridge by harmonizing multiple datasets, standardizing metadata, and performing differential gene expression analysis to create condition-level perturbation effects. They evaluated dataset agreement through matched-condition benchmarks and tested the utility of the resource for pretraining models in compound representation learning.
Results
The analysis showed that matched same-compound conditions exhibited weak agreement in logFC rankings across datasets, while the direction of logFC was more stable. Models trained on Chem-PerturBridge outperformed or matched those trained on other datasets in various evaluations, indicating the resource's effectiveness for improving predictive modeling.
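The difference between ranking agreement and direction agreement can be made concrete with a small sketch. It assumes two logFC vectors for the same compound and genes from two datasets, uses Spearman correlation as the rank-agreement measure, and uses the fraction of genes with matching sign as the direction measure. These metric choices, the function names, and the toy numbers are illustrative assumptions; the paper's benchmarks may use different statistics.

```python
def ranks(values):
    """1-based ascending ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def sign_agreement(x, y):
    """Fraction of genes whose logFC has the same sign in both datasets."""
    return sum((a > 0) == (b > 0) for a, b in zip(x, y)) / len(x)


# Made-up logFC values for five genes under the same compound in two datasets.
logfc_a = [2.0, 1.5, -0.5, 0.3, -1.2]
logfc_b = [0.4, 1.9, -0.1, 1.0, -0.3]
print(spearman_rho(logfc_a, logfc_b), sign_agreement(logfc_a, logfc_b))
```

In this toy case every gene moves in the same direction in both datasets even though the magnitudes, and therefore the rankings, differ.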
Implications
Chem-PerturBridge provides a critical tool for researchers in pharmacology and systems biology, enabling better integration and analysis of transcriptomic data across different assays. It can enhance the development of predictive models for drug response and toxicity, ultimately aiding in therapeutic discovery.
Spatio-temporal stochastic graph-based learning for infectious disease forecasting
Graph Learning
Time Series
- Introduces a spatio-temporal stochastic graph-based model for infectious disease forecasting.
- Addresses the limitations of traditional models by incorporating stochastic processes.
- Demonstrates improved forecasting accuracy using real-world datasets for COVID-19 and chickenpox.
- Shows the model's adaptability to various geographical scales and population sizes.
Summary
This paper presents a novel spatio-temporal stochastic graph-based architecture aimed at improving the forecasting of infectious disease cases, specifically COVID-19 and chickenpox. The authors highlight the limitations of existing spatio-temporal models, which often overlook stochastic processes and fail to account for the variability inherent in real-world disease spread across large geographical networks. The proposed model integrates a stochastic formulation and uncertainty approximation, allowing it to adapt to both large and small population networks. The authors validate their approach using real-world datasets, demonstrating enhanced predictive performance for COVID-19 in the US and chickenpox in Hungary. The results indicate that the model can effectively capture epidemic progression, although it exhibits a one-step delay in predictions and reduced sensitivity to high-frequency variability. This work emphasizes the importance of incorporating stochastic elements into epidemic forecasting models to better reflect the complexities of disease transmission.
Methodology
The authors developed a spatio-temporal stochastic graph-based learning model that organizes temporal epidemic data as features of graph nodes across geographical networks. The model incorporates stochastic outcomes and uses ensemble methods to estimate uncertainty, simulating multiple potential prediction trajectories.
Results
The proposed model outperformed four benchmark spatio-temporal graph-based models, achieving competitive weekly forecasting performance for all 3,218 US counties and 20 Hungarian counties. The model effectively represented epidemic progression relative to baselines, albeit with a one-step prediction delay.
Implications
This research has significant implications for public health planning and response, as it provides a more accurate tool for forecasting infectious disease spread. By integrating stochastic elements, the model can better inform decision-making processes related to epidemic management and resource allocation.
TASER: Task-Aware Stein Regularisation for Geometry-Driven Robustness
Theory
- TASER introduces a geometry-aware regularisation framework that penalises model sensitivity based on the data distribution.
- The method provides a principled alternative to isotropic gradient regularisation by aligning sensitivity with the structure of the data.
- Theoretical insights link Stein residual minimisation to reduced sensitivity under distributional perturbations.
- TASER enhances adversarial robustness by controlling sensitivity in directions that diverge from high-density regions.
Summary
The paper introduces TASER (Task-Aware Stein Regularisation), a novel training-time regularisation framework aimed at enhancing the robustness of deep neural networks against distribution shifts and adversarial perturbations. Traditional regularisation methods often treat input sensitivity uniformly, which can lead to vulnerabilities in directions that deviate from high-density regions of the data distribution. TASER addresses this by penalising pointwise Stein residuals, which are derived from Langevin Stein operators, thereby promoting a geometry-aware smoothness that aligns model sensitivity with the underlying data structure. The authors establish a theoretical connection between Stein regularisation and reduced first-order sensitivity to distributional shifts, demonstrating that TASER can effectively suppress sensitivity in directions that lead away from high-density areas. The method is scalable, architecture-agnostic, and can be integrated with existing training frameworks, including adversarial training. Experimental results on CIFAR-10 show that TASER significantly improves adversarial robustness without causing a statistically significant drop in clean accuracy.
Methodology
TASER employs pointwise Stein residuals derived from Langevin Stein operators to impose geometry-aware constraints on model sensitivity. The total loss function combines the task-specific loss with a regularisation term that penalises the Stein residuals, effectively shaping the sensitivity of the model according to the data distribution. The method requires access to input gradients and an estimate of the score field, which can be obtained from modern score-matching techniques.
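As background, the Langevin Stein operator and a generic penalized objective can be written as below. The first line is the standard operator acting on a vector field g, together with the Stein identity it satisfies under suitable regularity and boundary conditions. The second is only a schematic form of a Stein-residual penalty: the test function g_\theta, the weight \lambda, and the squared-residual form are assumptions made here, since the summary does not specify them.

```latex
% Langevin Stein operator on a vector field g, with score s_p = \nabla_x \log p
(\mathcal{A}_p g)(x) = s_p(x)^{\top} g(x) + \nabla_x \cdot g(x),
\qquad \mathbb{E}_{x \sim p}\big[(\mathcal{A}_p g)(x)\big] = 0

% Schematic regularised objective (g_\theta and \lambda are assumptions)
\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta)
  + \lambda\, \mathbb{E}_{x}\Big[\big((\mathcal{A}_{\hat{p}}\, g_\theta)(x)\big)^2\Big]
```

Here \hat{p} denotes the estimated data density whose score comes from a score-matching model, which is why the method needs both input gradients and a score estimate.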
Results
In experiments conducted on CIFAR-10, TASER consistently outperformed established training methods in terms of adversarial robustness, while maintaining comparable clean accuracy. The results indicate that the geometry-aware regularisation effectively reduces sensitivity to adversarial perturbations without compromising the model's performance on clean data.
Implications
The introduction of TASER has significant implications for the development of more robust machine learning models, particularly in applications where adversarial attacks and distribution shifts are prevalent. By integrating geometry-aware regularisation into training pipelines, practitioners can enhance model stability and reliability in real-world scenarios.
Bounded Behavioral Indistinguishability for Black-Box LLM Distillation
Large Language Models
NLP
Theory
- Introduction of bounded behavioral indistinguishability for black-box LLM distillation.
- Development of an empirical evaluation methodology combining various tests to assess behavioral indistinguishability.
- Demonstration that LoRA distillation improves semantic similarity but does not fully eliminate distinguishability.
- Identification of residual behavioral artifacts in style, format, and domain-specific prompts.
Summary
This paper introduces the concept of bounded behavioral indistinguishability for black-box LLM distillation, emphasizing that mere output similarity between a teacher model and a student model does not guarantee behavioral indistinguishability. The author formalizes this concept as (ϵ, q, t, A)-behavioral indistinguishability, where ϵ represents the distinguishing advantage, q the oracle query budget, t the computational budget, and A the adversary class. The methodology involves evaluating teacher-student pairs from Qwen and Llama using a controlled behavioral probe suite of 5,000 prompts. The study finds that while LoRA distillation improves semantic similarity, it does not eliminate behavioral differences, as evidenced by adversarial evaluations. The results indicate that learned discriminators still retain some distinguishing advantage, particularly in areas such as style, format, and domain-specific prompts. The paper concludes that while semantic fidelity is important, it is insufficient for ensuring indistinguishability in black-box LLM distillation, necessitating a more comprehensive evaluation approach that includes adversarial and category-aware assessments.
Methodology
The methodology involves formalizing bounded behavioral indistinguishability and employing a suite of 5,000 controlled prompts to evaluate teacher-student pairs. The evaluation combines learned discriminators, semantic similarity metrics, category-wise probes, policy-level measurements, and pairwise teacher-identification judges to assess behavioral indistinguishability.
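A distinguishing advantage for a learned discriminator can be estimated along the following lines. The sketch uses the common definition of advantage as the absolute gap between how often the discriminator says "teacher" on teacher outputs and on student outputs. The function names and this exact definition are assumptions; the paper's formal notion also bounds the query budget q, compute budget t, and adversary class A, none of which are modeled here.

```python
def distinguishing_advantage(preds_on_teacher, preds_on_student):
    """Estimate |P[D=1 | teacher output] - P[D=1 | student output]|.

    Each list holds 0/1 discriminator decisions, where 1 means "teacher".
    """
    p_teacher = sum(preds_on_teacher) / len(preds_on_teacher)
    p_student = sum(preds_on_student) / len(preds_on_student)
    return abs(p_teacher - p_student)


def within_epsilon(advantage, epsilon):
    """True if the measured advantage does not exceed the tolerance epsilon."""
    return advantage <= epsilon


# Toy decisions on four teacher outputs and four student outputs.
adv = distinguishing_advantage([1, 1, 1, 0], [1, 0, 0, 0])
print(adv, within_epsilon(adv, 0.1))
```

An advantage of zero means the discriminator does no better than guessing, while any positive value indicates residual behavioral differences of the kind the paper reports.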
Results
LoRA distillation increased semantic similarity scores for Qwen from 0.788 to 0.862 and for Llama from 0.814 to 0.874. However, adversarial evaluations revealed that learned discriminators still maintained a non-zero advantage, indicating residual behavioral differences. The distinguishing advantage for Qwen dropped from 0.158 for the base student to 0.081 after LoRA distillation, so distillation roughly halved the advantage without eliminating it.
Implications
The findings suggest that while distillation techniques can enhance the performance of smaller models, they must be evaluated through a lens that considers behavioral indistinguishability to ensure that critical behavioral characteristics are preserved. This has implications for the deployment of LLMs in sensitive applications where behavioral fidelity is crucial.
Survival Reinforcement Learning: Toward Scalable Self-Supervised RL
Reinforcement Learning
Robotics
- Introduction of Survival Reinforcement Learning (SRL) as a scalable self-supervised RL method.
- SRL maximizes dwell time at goals, addressing limitations of existing contrastive methods.
- Demonstrated superior performance of SRL on long-horizon locomotion tasks compared to state-of-the-art CRL.
- Empirical evidence supports the effectiveness of classification-based objectives in scaling RL.
Summary
This paper introduces Survival Reinforcement Learning (SRL), a novel online classification-based approach that enhances the survival value learning framework by maximizing the agent's dwell time at target goals. While previous self-supervised Contrastive Reinforcement Learning (CRL) has demonstrated impressive scaling capabilities, it struggles with long-horizon goal-conditioned planning due to the uniformity-tolerance dilemma associated with contrastive losses. SRL circumvents the structural limitations of CRL and addresses the undesirable 'bang-bang' control solutions typical of survival frameworks. Through extensive evaluations on various robotic benchmarks, SRL achieves competitive performance on manipulation tasks and significantly outperforms CRL by 2x to 8x on stable, long-horizon locomotion tasks. The findings suggest that classification-based methods could be pivotal in advancing scalable reinforcement learning.
Methodology
The authors extend the survival learning framework to develop SRL, which focuses on maximizing the agent's time spent at goal states. The methodology involves classifying state-action pairs based on their trajectories towards goals and employing a dwell time at goal formulation to stabilize the agent's position after reaching the goal. The architecture is built upon previous work that emphasizes depth-scaling behavior, and the performance is evaluated across various robotic environments.
Results
SRL achieves competitive results on challenging goal-reaching tasks, particularly excelling in AntMaze environments. It matches the performance of scaled CRL on manipulation tasks and significantly outperforms it on long-horizon locomotion tasks, demonstrating a 2x to 8x improvement. These results highlight the potential of classification-based methods in enhancing the scalability of reinforcement learning.
Implications
The development of SRL suggests new pathways for scalable self-supervised reinforcement learning, potentially impacting various applications in robotics and autonomous systems. The findings advocate for the integration of classification-based objectives in RL frameworks, which could lead to more robust and efficient learning algorithms.
Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments
Reinforcement Learning
Optimization
Theory
- Identification of 'zero collapse' as a failure mode in policy gradient methods due to discontinuous reward landscapes.
- Mechanistic explanation of how flat zero-reward regions lead to vanishing gradient signals and sample inefficiency.
- Empirical demonstration of zero collapse across multiple policy gradient methods.
- Proposed mitigation strategies to enhance stability and learning speed in reinforcement learning.
Summary
This paper investigates a critical failure mode of policy gradient methods in reinforcement learning, termed 'zero collapse', which occurs in environments characterized by discontinuous reward landscapes. The authors focus on bidding in repeated auctions as a case study, where rewards are structured in a thresholded manner. In such environments, agents receive no reward until their actions exceed a certain threshold, leading to large flat regions of zero reward separated by sharp transitions to high-reward areas. The authors demonstrate that policy gradient methods, driven by stochastic exploration, can overshoot optimal regions and become trapped in these flat zero-reward areas, resulting in ineffective learning dynamics. The paper provides a mechanistic explanation for this phenomenon, highlighting the interaction between policy stochasticity and step size, and empirically validates the occurrence of zero collapse across various policy gradient methods, including REINFORCE and actor-critic variants. The authors propose practical strategies to mitigate this issue, such as improved initialization schemes and architectural choices, and introduce a formal framework for reinforcement learning in auction environments, emphasizing the unique structural properties of these settings.
Methodology
The authors conducted theoretical analysis and empirical experiments to explore the zero collapse phenomenon. They examined the interaction between policy stochasticity and step size, and tested various policy gradient methods, including REINFORCE and actor-critic approaches, in environments with discontinuous reward structures. They also proposed practical strategies to mitigate the identified issues.
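The vanishing-gradient mechanism can be reproduced in a toy setting. The sketch below assumes a one-dimensional Gaussian bidding policy and a thresholded reward that is zero below a winning bid and `value - bid` above it, then computes the score-function (REINFORCE) gradient with respect to the policy mean. The reward shape, the constants, and the function names are illustrative assumptions, not the paper's environment.

```python
import random


def auction_reward(bid, value=1.0, threshold=0.6):
    """Thresholded reward: zero below the winning threshold, value - bid above it."""
    return value - bid if bid >= threshold else 0.0


def reinforce_grad_mean(mu, sigma, n_samples, rng):
    """Score-function estimate of d E[reward] / d mu for a Gaussian policy N(mu, sigma^2)."""
    grad = 0.0
    for _ in range(n_samples):
        bid = rng.gauss(mu, sigma)
        # d/d mu of log N(bid; mu, sigma^2) is (bid - mu) / sigma^2.
        grad += auction_reward(bid) * (bid - mu) / sigma ** 2
    return grad / n_samples


rng = random.Random(0)
# Deep inside the flat region: every sampled bid loses, so every term is zero.
flat = reinforce_grad_mean(mu=0.1, sigma=0.02, n_samples=1000, rng=rng)
# Near the threshold: some bids win, so the estimate carries signal.
near = reinforce_grad_mean(mu=0.7, sigma=0.1, n_samples=1000, rng=rng)
print(flat, near)
```

Once the mean sits far below the threshold relative to the exploration noise, the estimator returns no learning signal at all, which is consistent with the highly sample-inefficient recovery the summary describes.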
Results
The study found that policy gradient methods are susceptible to zero collapse, particularly in environments with discontinuous rewards. The empirical results showed that once policies enter flat zero-reward regions, recovery is highly sample-inefficient, leading to stalled learning. The proposed mitigation strategies improved the stability and learning speed of the agents in these challenging environments.
Implications
The findings have significant implications for the design of reinforcement learning algorithms, particularly in applications involving auction environments and other decision-making scenarios with discontinuous rewards. The proposed strategies can help improve the robustness and efficiency of learning in such settings, potentially enhancing performance in real-world applications like digital advertising.
Scalable Inference-Time Annealing with Surrogate Likelihood Estimators
Generative Models
Efficient ML
- Introduction of SITA, a scalable method for inference-time annealing in molecular sampling.
- Utilization of surrogate likelihood estimators to bypass expensive divergence calculations.
- Demonstration of state-of-the-art performance on alanine dipeptide and alanine tripeptide.
- Integration of a BoltzNCE-style surrogate into a temperature annealing framework.
Summary
This paper addresses the challenge of efficiently sampling the Boltzmann distribution of molecular configurations, a fundamental task in computational chemistry and biophysics. Traditional methods like Markov Chain Monte Carlo (MCMC) and molecular dynamics (MD) simulations are often computationally expensive and struggle with high-dimensional systems. The authors propose Scalable Inference-Time Annealing (SITA), a novel approach that retrains flow-based models to generate samples at progressively lower temperatures using an energy-based model for fast surrogate likelihoods. This method eliminates the need for costly divergence computations typically required in existing importance sampling techniques. SITA is empirically validated on benchmark systems, specifically alanine dipeptide and alanine tripeptide, demonstrating state-of-the-art performance while avoiding the computational overhead associated with traditional methods. The authors provide their implementation in a publicly available code repository.
Methodology
SITA combines flow-based generative models with surrogate likelihood estimators to facilitate efficient inference-time annealing. The method involves generating proposals from a high-temperature Boltzmann distribution and using these to train an energy-based model. Importance-weighted resampling with learned surrogate likelihoods allows for sampling at lower temperatures without the need for complex Jacobian computations.
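As a rough illustration of importance-weighted resampling toward a lower temperature, the sketch below uses self-normalized weights of the form exp(-beta * E(x)) / q(x). It is not the authors' implementation: `energy_fn` and `surrogate_log_q` are hypothetical callables, and the toy example substitutes an exact Gaussian log-density where SITA would use a learned surrogate likelihood.

```python
import numpy as np

def importance_resample(samples, energy_fn, surrogate_log_q, beta_target, rng):
    """Resample proposals toward the Boltzmann density exp(-beta_target * E(x))."""
    # Unnormalized log-weights: log target density minus (surrogate) log proposal density.
    log_w = -beta_target * energy_fn(samples) - surrogate_log_q(samples)
    log_w = log_w - log_w.max()  # stabilize the exponentials
    weights = np.exp(log_w)
    weights = weights / weights.sum()
    idx = rng.choice(len(samples), size=len(samples), replace=True, p=weights)
    return samples[idx], weights

# Toy example: harmonic energy, proposals drawn at a higher temperature (beta = 0.5).
rng = np.random.default_rng(0)
beta_high, beta_low = 0.5, 1.0
energy = lambda x: 0.5 * x**2
proposal_std = np.sqrt(1.0 / beta_high)
proposals = rng.normal(0.0, proposal_std, size=5000)
# Stand-in for the learned surrogate: the exact Gaussian log-density of the proposals.
log_q = lambda x: -0.5 * (x / proposal_std) ** 2 - np.log(proposal_std * np.sqrt(2 * np.pi))
resampled, weights = importance_resample(proposals, energy, log_q, beta_low, rng)
print(proposals.var(), resampled.var())
```

The resampled set should be more concentrated than the proposals, since the lower-temperature target puts less mass on high-energy configurations.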
Results
SITA achieves state-of-the-art results in sampling efficiency and accuracy on alanine dipeptide and alanine tripeptide, outperforming existing methods while avoiding the computational burdens associated with divergence evaluations in traditional importance sampling techniques.
Implications
The proposed method has significant implications for computational chemistry and biophysics, enabling faster and more efficient sampling of molecular configurations. This could enhance the ability to analyze complex molecular systems and facilitate high-throughput studies in drug discovery and materials science.
Convergence of Steepest Descent and Adam under Non-Uniform Smoothness
Optimization
Theory
- Generalizes non-uniform smoothness assumptions for better modeling of loss landscapes.
- Establishes convergence rates for steepest descent and adaptive methods like Adam and RMSProp.
- Demonstrates that Sign GD converges faster than traditional gradient descent for logistic regression.
- Shows that RMSProp and Adam can achieve linear convergence rates for certain neural networks.
Summary
This paper investigates the convergence properties of first-order optimization methods, specifically steepest descent and adaptive methods like Adam and RMSProp, under a generalized non-uniform smoothness (NS) assumption. The authors extend the NS assumption to include objectives where the curvature is an affine function of the objective value, applicable to various machine learning problems such as logistic regression and certain neural networks. They establish convergence rates for steepest descent and diagonal variants of RMSProp and Adam, demonstrating that under their assumptions, these methods can achieve linear convergence rates without requiring convexity or bounded gradient conditions. The results indicate that Sign GD can outperform traditional gradient descent in specific scenarios, and that RMSProp and Adam can converge linearly with constant step sizes for two-layer neural networks. The paper also presents a lower bound showing that these methods are faster than other adaptive methods like AdaGrad and AMSGrad, highlighting their efficiency in practical applications.
Methodology
The authors derive convergence guarantees based on the (H0, H1)-NS and non-uniform Łojasiewicz (NL) assumptions. They analyze the structural properties of functions satisfying these assumptions and apply them to derive convergence rates for various optimization methods, including steepest descent and its normalized variants. The resulting rates are dimension-free, making the results broadly applicable.
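For readers unfamiliar with Sign GD (steepest descent with respect to the infinity norm), here is a minimal sketch on a logistic regression objective. The dataset, step size, and function names are illustrative assumptions. The example does not reproduce the paper's rates; it only shows the update rule.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Mean logistic loss with labels y in {-1, +1}."""
    margins = y * (X @ w)
    return float(np.mean(np.log1p(np.exp(-margins))))

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss with respect to w."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))
    return (X * coeff[:, None]).mean(axis=0)

def sign_gd(w, X, y, lr=0.1, steps=50):
    """Sign GD: each coordinate moves by a fixed amount against the gradient's sign."""
    for _ in range(steps):
        w = w - lr * np.sign(logistic_grad(w, X, y))
    return w

# Small linearly separable toy dataset (hypothetical).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w0 = np.zeros(2)
w_final = sign_gd(w0, X, y)
print(logistic_loss(w0, X, y), logistic_loss(w_final, X, y))
```

Because the update ignores gradient magnitude, progress per step does not shrink as the gradient vanishes, which is one intuition for why sign-based methods can behave well on objectives whose curvature decays with the loss.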
Results
The paper establishes that steepest descent methods can achieve dimension-free linear convergence rates under the proposed assumptions. For logistic regression and softmax policy gradient objectives, Sign GD is shown to converge faster than traditional GD. Additionally, RMSProp and Adam are proven to converge linearly with constant step sizes for a class of two-layer neural networks, outperforming other adaptive methods like AdaGrad and AMSGrad.
Implications
The findings suggest that adopting the generalized non-uniform smoothness assumptions can lead to more efficient optimization strategies in machine learning tasks, particularly in scenarios involving logistic regression and neural networks. The results may influence the design of optimization algorithms in practice, promoting the use of adaptive methods that leverage the properties identified in this research.
Parallel Tempering Initial Sampling in Inference-Time Reward Alignment
Generative Models
- PATHS improves initialization for inference-time reward alignment in generative models.
- The method utilizes parallel tempering to explore complex reward landscapes effectively.
- Periodic Metropolis swaps between chains enhance the sampling of high-reward states.
- Experiments show consistent performance gains over existing SMC-based methods.
Summary
This paper introduces PATHS (PArallel Tempering for High-complexity reward Sampling), a novel initialization method for inference-time reward alignment in generative models. The authors identify limitations in existing Sequential Monte Carlo (SMC) methods, which often initialize particles from standard priors, leading to poor performance in complex reward landscapes characterized by rare high-reward regions and multi-modal distributions. PATHS addresses these issues by employing parallel tempering, which maintains multiple sampling chains at different temperatures. This approach allows for efficient exploration of the reward landscape through periodic Metropolis swaps, enabling the transfer of high-reward states from exploratory chains to more stable chains. The authors demonstrate that PATHS significantly improves the sampling of rare, high-reward regions, enhancing the alignment quality in tasks such as layout-to-image generation and quantity-aware generation. Experimental results show that PATHS consistently outperforms prior methods, highlighting the importance of robust initialization and cross-mode exploration in complex reward settings.
Methodology
The proposed PATHS method leverages parallel tempering to run multiple sampling chains at varying temperatures. Higher-temperature chains explore the reward landscape more freely, while lower-temperature chains focus on stable reward-aware posteriors. Metropolis swaps are periodically performed to exchange high-reward states between chains, facilitating better exploration and initialization.
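The swap step can be sketched with the standard parallel tempering acceptance rule. The sketch assumes each chain targets a density proportional to prior(x) * exp(beta * reward(x)), so a low temperature corresponds to a large `beta`. The function names and the list-based state layout are hypothetical, not taken from the paper.

```python
import math
import random

def swap_acceptance(beta_i, beta_j, reward_i, reward_j):
    """Metropolis acceptance probability for exchanging the states of chains i and j."""
    log_ratio = (beta_i - beta_j) * (reward_j - reward_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def maybe_swap(states, rewards, betas, i, j, rng):
    """Swap the states (and cached rewards) of chains i and j with the Metropolis probability."""
    if rng.random() < swap_acceptance(betas[i], betas[j], rewards[i], rewards[j]):
        states[i], states[j] = states[j], states[i]
        rewards[i], rewards[j] = rewards[j], rewards[i]
        return True
    return False

# Chain 0 is cold (beta = 1.0); chain 1 is hot (beta = 0.1) and has found a higher reward.
states, rewards, betas = ["x_cold", "x_hot"], [0.2, 0.9], [1.0, 0.1]
swapped = maybe_swap(states, rewards, betas, 0, 1, random.Random(0))
print(swapped, states, rewards)
```

Under this rule, a high-reward state discovered by a hot chain is always accepted into a colder chain, while the reverse move is accepted only with probability exp((beta_i - beta_j) * (reward_j - reward_i)).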
Results
PATHS was evaluated on layout-to-image and quantity-aware generation tasks, demonstrating significant improvements in alignment quality compared to existing methods like TDS, DAS, and Ψ-Sampler. The results indicate that PATHS effectively addresses the challenges posed by rare and multi-modal reward landscapes.
Implications
The findings suggest that robust initialization and exploration strategies are crucial for effective inference-time reward alignment in generative models. This work could lead to advancements in various applications requiring high-quality generative outputs aligned with user-specified rewards.
Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation
Time Series
- The paper benchmarks five UQ methodologies for TGT prediction in engine health management.
- A unified experimental framework is used for hyperparameter selection and performance evaluation.
- Distinct trade-offs in interval coverage, width, and stability are identified among the methods.
- The results provide practical guidance for selecting UQ methods in real-world applications.
Summary
This paper addresses the critical need for accurate turbine gas temperature (TGT) predictions and robust uncertainty quantification (UQ) methodologies in the context of engine health management (EHM). The authors benchmark five prominent UQ approaches—Delta method, Bayesian Monte Carlo Dropout, Bootstrap method, Lower–Upper Bound Estimation, and Mean–Variance Estimation—within a unified experimental framework. The study employs cross-validation for hyperparameter tuning and multiple metrics, including Coverage Probability, Normalized Mean Prediction Interval Width, and Coverage Width-based Criterion, to evaluate the performance of each method. The experiments utilize a representative dataset of turbine gas temperatures, revealing distinct trade-offs among the methods in terms of interval coverage, width, and stability. The findings serve as a practical guide for selecting and tuning prediction interval methods, enhancing the interpretability and precision of TGT predictions in real-world applications, particularly in aerospace operations where safety and reliability are paramount.
Methodology
The authors implemented five UQ methods within a unified framework that included cross-validation for hyperparameter selection and repeated train-test splits for robustness. They evaluated the methods using metrics such as Coverage Probability, Normalized Mean Prediction Interval Width, and Coverage Width-based Criterion to assess the reliability and sharpness of prediction intervals.
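The three interval metrics named above can be sketched as follows. The coverage-width criterion has several variants in the literature; the form below (width multiplied by an exponential penalty when coverage falls short of the nominal level) is one common choice and is an assumption here, as are the `nominal` and `eta` defaults.

```python
import numpy as np

def coverage_probability(y, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    return float(np.mean((y >= lower) & (y <= upper)))

def normalized_mean_width(y, lower, upper):
    """Mean interval width divided by the range of the targets."""
    return float(np.mean(upper - lower) / (y.max() - y.min()))

def coverage_width_criterion(y, lower, upper, nominal=0.95, eta=50.0):
    """Width-based score that is penalized only when coverage is below the nominal level."""
    picp = coverage_probability(y, lower, upper)
    nmpiw = normalized_mean_width(y, lower, upper)
    penalty = np.exp(-eta * (picp - nominal)) if picp < nominal else 0.0
    return float(nmpiw * (1.0 + penalty))

# Toy example with hypothetical temperature values and symmetric intervals.
y = np.array([1.0, 2.0, 3.0, 4.0])
lower, upper = y - 0.5, y + 0.5
print(coverage_probability(y, lower, upper),
      normalized_mean_width(y, lower, upper),
      coverage_width_criterion(y, lower, upper))
```

Coverage and width pull in opposite directions, so a single combined score such as this one is mainly useful for ranking methods at a fixed nominal level.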
Results
The experiments demonstrated that each UQ method exhibited unique strengths and weaknesses regarding interval coverage, width, and stability. The findings highlighted the necessity of selecting appropriate UQ methodologies based on specific operational contexts and requirements.
Implications
The study's findings have significant implications for engine health management, particularly in aerospace, where accurate TGT predictions and uncertainty quantification are crucial for ensuring safety and reliability. The insights can guide practitioners in making informed maintenance decisions and risk assessments.
DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning
Reinforcement Learning
Large Language Models
Efficient ML
- Identifies intra-prompt long tails as a significant source of inefficiency in RL for LLMs.
- Introduces DARTS, a novel framework for active distribution shaping to improve rollout efficiency.
- Employs a dual-end length sampling strategy and adaptive redundancy allocation to optimize trajectory selection.
- Demonstrates significant acceleration in RL training processes without degrading model performance.
Summary
This paper addresses the inefficiencies in Reinforcement Learning (RL) for Large Language Models (LLMs) caused by long-tail response length distributions. While previous approaches have focused on scheduling to mitigate the impact of long tails, this work identifies the root cause of inefficiency as the distribution itself. The authors characterize the long-tail distribution at a finer granularity, revealing intra-prompt long tails that often consist of verbose and ineffective responses. To tackle this issue, they propose DARTS (Distribution-Aware Active Rollout Trajectory Shaping), a novel paradigm that actively shapes the rollout distribution towards conciseness and certainty. DARTS employs a distribution-aware trajectory sampling mechanism that selects trajectories from a redundant exploration space and an adaptive redundancy allocation scheme to optimize both shaping effectiveness and system efficiency. The proposed method significantly accelerates the RL training process without compromising model performance, achieving up to 1.77× acceleration over state-of-the-art systems.
Methodology
The authors developed DARTS, which includes a distribution-aware trajectory sampling mechanism that selects optimal trajectories from a redundant exploration space. They also implemented a variance-based adaptive redundancy allocation scheme to balance shaping effectiveness with system efficiency. Additionally, system-level optimizations such as variance-guided tail pruning and a token-level streaming pipeline were introduced to enhance performance.
Results
Experiments showed that DARTS can accelerate the RL training process by up to 1.77× compared to existing state-of-the-art systems, while maintaining model performance. The distribution shaping effectively reduced the overhead caused by long-tail distributions, leading to improved computational resource utilization.
Implications
The findings suggest that addressing the distribution characteristics of rollout trajectories can lead to more efficient RL training processes, which is crucial for the development of advanced LLMs. This approach could be applied to various RL tasks, enhancing the performance of models in complex reasoning and decision-making scenarios.
Improving Selective Classification with Pairwise Queries for Binary Classification
NLP
Large Language Models
Theory
- Selective classification can waste expert resources if confidence estimates are unreliable.
- Pairwise queries provide a more accurate measure of sample quality than confidence estimates.
- The proposed method improves accuracy on non-rejected samples while reducing costs.
- Theoretical conditions for the effectiveness of pairwise queries are established.
Summary
This paper addresses the challenges of selective classification in binary classification tasks, particularly when using large language models (LLMs). Selective classification allows models to predict labels for samples they are confident about while abstaining from uncertain predictions, which are then labeled by experts at a cost. The authors identify that the confidence estimates from models can often be inconsistent with actual predictions, leading to high error rates on non-rejected samples. To mitigate this issue, the authors propose utilizing pairwise queries, where the model is asked to compare two unlabeled samples and determine which is closer to a specific label. This method is shown to be more reliable than using raw confidence estimates. Theoretical foundations are established to demonstrate the conditions under which pairwise queries outperform traditional confidence estimates. Extensive experiments on synthetic and real datasets confirm that the proposed approach yields a better accuracy-cost tradeoff compared to existing methods that rely solely on confidence estimates.
Methodology
The authors propose a pairwise query approach for selective classification in binary classification tasks. They establish theoretical conditions under which this method outperforms traditional confidence-based approaches. The methodology involves sending pairs of unlabeled data points to the model and asking which of the two is closer to the target label, thereby leveraging the model's comparative judgment rather than relying on potentially flawed confidence scores.
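One way such comparisons could feed a selective classifier is sketched below. This is an illustrative assumption rather than the authors' algorithm: samples are ranked by a hypothetical pairwise oracle `closer_to_positive(a, b)`, the top of the ranking is labeled positive, the bottom negative, and a band of `n_reject` samples in the middle is deferred to an expert.

```python
from functools import cmp_to_key

def selective_classify(samples, closer_to_positive, n_reject):
    """Rank samples with pairwise queries, then abstain on the middle of the ranking."""
    def compare(a, b):
        # Negative result places a before b, i.e. a is judged closer to the positive label.
        return -1 if closer_to_positive(a, b) else 1

    ranked = sorted(samples, key=cmp_to_key(compare))
    start = (len(ranked) - n_reject) // 2  # assumes the class boundary lies mid-ranking
    decisions = []
    for rank, sample in enumerate(ranked):
        if rank < start:
            decisions.append((sample, 1))
        elif rank < start + n_reject:
            decisions.append((sample, None))  # deferred to an expert
        else:
            decisions.append((sample, 0))
    return decisions

# Toy oracle: larger numbers are closer to the positive label.
decisions = selective_classify([0.9, 0.1, 0.5, 0.7, 0.3], lambda a, b: a > b, n_reject=1)
print(decisions)
```

The sketch places the rejection band at the midpoint of the ranking, which only makes sense for roughly balanced classes; any real use would need a principled way to locate the boundary.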
Results
The experiments conducted on one synthetic dataset and four real-world binary classification datasets show that the pairwise query method significantly improves the accuracy-cost tradeoff compared to methods that use raw confidence estimates. The results indicate that the proposed approach effectively reduces errors on non-rejected samples, validating the theoretical claims made by the authors.
Implications
The findings suggest that pairwise queries can enhance selective classification strategies, particularly in applications involving large language models where traditional confidence measures may fail. This approach could be beneficial in domains such as healthcare, finance, and any area where expert labeling is costly and selective classification is critical.
Automating Formal Verification with Reinforcement Learning and Recursive Inference
Reinforcement Learning
Large Language Models
Theory
- Introduces RLVR to improve LLM generation of verified programs and proofs.
- Achieves significant increases in verified rewards and pass rates through structured training.
- Identifies and addresses issues of specification hacking in model training.
- Develops a verifier-guided inference scaffold that enhances proof generation.
Summary
This thesis addresses the challenges of automating formal verification with large language models (LLMs), particularly the generation of verified programs and proofs. The author proposes a novel approach that combines reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search. The study begins by training open-source models in Dafny using RLVR, which significantly improves the verified reward from 2.2% to 58.1%. However, issues such as specification hacking were identified, where models exploit weak formal specifications. To mitigate this, the author filters out underspecified tasks and employs multi-turn RLVR, resulting in an improved verified pass rate from 9.7% to 31.1%. Additionally, a verifier-guided inference scaffold in Lean is developed, treating proof generation as a structured search over subgoals, leading to an increase in the pass rate from 46.2% to 69.2% on a pilot set. The study also introduces Dalek-Bench, a benchmark derived from the Rust curve25519-dalek verification project, although preliminary results indicate the need for stronger evaluation methods. Overall, the findings suggest that formal verifiers can enhance LLM performance when utilized as sources of reward and feedback, emphasizing the importance of clean data and robust specifications.
Methodology
The methodology involves training models using reinforcement learning techniques, specifically RLVR, to optimize the generation of verified programs. The author employs Group Relative Policy Optimization (GRPO) and filters tasks to eliminate vulnerabilities. Additionally, a verifier-guided inference scaffold is created to facilitate structured proof generation.
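The group-relative part of GRPO can be sketched briefly: several completions are sampled for the same task, each receives a reward, and advantages are computed by standardizing rewards within the group. The binary `verifier_reward` below is a hypothetical stand-in for a Dafny verification check; the thesis's actual reward design may differ.

```python
import numpy as np

def verifier_reward(verified: bool) -> float:
    """Hypothetical binary reward: 1.0 if the verifier accepts the program, else 0.0."""
    return 1.0 if verified else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one task; two pass verification.
rewards = [verifier_reward(v) for v in (True, False, False, True)]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every completion in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is one reason task filtering and reward design matter in this setting.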
Results
The initial experiments showed an increase in verified rewards from 2.2% to 58.1%, and after refining the task set, the verified pass rate improved from 9.7% to 31.1%. The verifier-guided scaffold improved the pass rate from 46.2% to 69.2% on a pilot set, and the new benchmark Dalek-Bench was established, although results indicated room for improvement.
Implications
The findings suggest that integrating formal verification processes with LLMs can significantly enhance their ability to generate correct and verified outputs. This has potential applications in fields requiring high assurance in software correctness, such as cybersecurity and critical systems development.
Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
Theory
- Padded transformers are robust to changes in attention type, model width, and uniformity.
- Numeric precision and model depth are the main factors affecting expressivity.
- Polynomially padded L-uniform constant-precision transformers are equivalent to L-uniform AC0.
- Increasing width or precision beyond logarithmic levels does not enhance expressivity.
Summary
This paper investigates the expressivity of padded transformers, which utilize filler symbols in their input to enhance computational capabilities. The authors analyze how various architectural choices, such as attention type, model width, and uniformity, affect the expressivity of these transformers. They find that padded transformers exhibit surprising robustness to these changes, with numeric precision and model depth emerging as the primary factors influencing expressivity. The study establishes that polynomially padded L-uniform constant-precision transformers are equivalent to L-uniform AC0, while those with growing precision achieve L-uniform TC0, independent of width. Additionally, the paper reveals that increasing width or precision beyond logarithmic levels does not enhance expressivity. The authors also demonstrate that looping mechanisms allow for sequential processing akin to circuits, leading to significant expressivity results. Overall, the findings suggest that certain architectural choices may not significantly impact expressivity, simplifying theoretical analyses and potentially guiding practical implementations.
Methodology
The authors conducted a comprehensive analysis of padded transformers by examining various architectural configurations, including attention types (softmax and average hard attention), numeric precision, model width, and uniformity. They established theoretical equivalences to boolean circuit classes and explored the implications of these configurations on expressivity through mathematical proofs and comparisons.
Results
The study found that the expressivity of padded transformers is stable across different configurations: constant-precision transformers are limited to AC0, while growing-precision transformers achieve TC0. Log-precision padded transformers are therefore at least as expressive as constant-precision ones, and expressivity is not significantly affected by changes in attention type or model width once logarithmic precision is reached.
Implications
The findings have significant implications for both theoretical research and practical applications of transformers. They suggest that researchers can focus on simpler models for analysis without losing expressivity insights. Practitioners may also benefit from understanding which architectural choices are critical for performance, potentially leading to more efficient transformer designs in real-world applications.
CoMem: Context Management with A Decoupled Long-Context Model
NLP
Large Language Models
Efficient ML
- CoMem decouples memory management from reasoning, allowing for specialized models for efficient history compression.
- The k-step-off asynchronous pipeline significantly reduces decoding overhead by overlapping memory summarization with agent execution.
- A novel reward-driven training methodology aligns the memory model to ensure effective decision-making.
- CoMem achieves a 1.4× latency improvement over traditional long-context solutions while preserving performance.
Summary
The paper presents CoMem, a novel framework designed to enhance context management in agentic models, particularly for long-horizon tasks. Traditional methods often face significant decoding overhead due to the need for summarization of extensive interaction histories, which adversely affects response latency. CoMem addresses this by decoupling memory management from the primary agent workflow, allowing these processes to operate in parallel. The authors propose a k-step-off asynchronous pipeline that overlaps memory summarization with agent inference, effectively masking the latency associated with context processing. To ensure the memory model captures essential statistics for decision-making, a reward-driven training strategy is introduced. Theoretical analysis indicates that CoMem achieves a superior efficiency-effectiveness trade-off compared to coupled architectures. Experimental results on SWE-Bench-Verified demonstrate that CoMem can reduce latency by 1.4× compared to standard long-context solutions while maintaining competitive performance. This framework not only enhances the efficiency of context processing but also scales favorably with increased system throughput, paving the way for independent optimization of agent reasoning and memory compression.
Methodology
The authors developed CoMem by creating an asynchronous pipeline that allows memory summarization to occur in parallel with agent inference. They employed a reward-driven training strategy to align the memory model with the agent's decision-making needs, ensuring that the compressed memory captures sufficient statistics for effective reasoning.
Results
Extensive experiments on SWE-Bench-Verified showed that CoMem provides a 1.4× reduction in latency compared to vanilla long-context models while maintaining competitive performance levels. The results indicate that the framework scales well with increased system throughput.
Implications
CoMem's approach to decoupling memory management from reasoning can significantly enhance the efficiency of agentic systems, particularly in applications requiring long-context processing. This framework could lead to improved user experiences in real-time systems and facilitate the development of more sophisticated autonomous agents capable of handling complex tasks.
Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity
Generative Models
Interpretability
Optimization
- DensityFlow provides a novel approach to generating robust counterfactual explanations by focusing on high-density data regions.
- The framework utilizes Neural ODEs and a density score learned via Noise Contrastive Estimation to guide counterfactual generation.
- A local proxy distillation mechanism enhances efficiency in black-box settings by minimizing redundant queries.
- Experimental results show significant improvements in robustness and validity compared to traditional ensemble methods.
Summary
This paper addresses the challenge of generating reliable counterfactual explanations (CEs) in machine learning models, particularly in low-density regions where classifiers exhibit high variance. The authors introduce DensityFlow, a generative framework that constructs robust CEs by focusing on high-confidence data manifolds. The framework employs a continuous-time dynamics model parameterized by Neural ODE, guided by a differentiable density score learned through Noise Contrastive Estimation. This approach effectively avoids uncertain low-density areas during counterfactual generation. Additionally, for black-box models, a local proxy distillation mechanism is proposed to align a lightweight surrogate model with the target model, optimizing the generation process with minimal queries. Experimental results demonstrate that DensityFlow outperforms existing ensemble-based methods in terms of validity and query efficiency, confirming its effectiveness in generating robust counterfactuals under model multiplicity.
Methodology
The authors propose DensityFlow, a generative framework that models counterfactual generation as continuous-time dynamics using Neural ODEs. A differentiable density score is learned through Noise Contrastive Estimation, which helps navigate the high-density regions of the data manifold. For black-box models, a local proxy distillation strategy is employed to align a lightweight surrogate model with the target model during the counterfactual generation process.
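To make the density-score idea concrete, here is a minimal sketch of the binary Noise Contrastive Estimation objective, assuming equal numbers of data and noise samples. The argument names are hypothetical; in DensityFlow the scores would come from a differentiable network rather than fixed arrays.

```python
import numpy as np

def nce_loss(score_data, score_noise, log_noise_at_data, log_noise_at_noise):
    """Binary NCE loss: classify data against noise using the logit score(x) - log p_noise(x)."""
    logit_data = score_data - log_noise_at_data
    logit_noise = score_noise - log_noise_at_noise
    loss_data = np.mean(np.logaddexp(0.0, -logit_data))   # -log sigmoid(logit) on data
    loss_noise = np.mean(np.logaddexp(0.0, logit_noise))  # -log(1 - sigmoid(logit)) on noise
    return float(loss_data + loss_noise)

# If the score equals the noise log-density everywhere, the classifier is at chance level.
zeros = np.zeros(8)
chance_loss = nce_loss(zeros, zeros, zeros, zeros)
print(chance_loss)
```

Minimizing this loss pushes the score toward the log-density of the data, which is what allows it to act as a guide toward high-density regions during counterfactual generation.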
Results
The experiments conducted on synthetic and real-world datasets indicate that DensityFlow achieves state-of-the-art performance in terms of robustness and validity of counterfactual explanations while significantly reducing the number of queries required compared to existing ensemble-based approaches.
Implications
The findings suggest that DensityFlow can enhance the interpretability and reliability of machine learning models, particularly in high-stakes decision-making scenarios. Its ability to generate robust counterfactuals efficiently could be beneficial in fields such as healthcare, finance, and any domain requiring transparent algorithmic recourse.
Calibrated Preference Learning: The Case of Label Ranking
Theory
Reinforcement Learning
- Introduces calibration notions specifically for probabilistic label ranking, extending beyond multi-class classification.
- Establishes a theoretical framework showing the relationships between different calibration notions.
- Empirically evaluates the calibration properties of popular label ranking models, revealing significant calibration issues.
- Finds a strong correlation between calibration and benchmark accuracy in RLHF reward models.
Summary
This paper addresses the issue of calibration in probabilistic label ranking (ProLR), which has not been formally studied despite its importance for reliable decision-making. Calibration ensures that predicted probabilities align with true outcome frequencies, a concept well-explored in classification and regression but lacking in label ranking. The authors introduce a hierarchy of calibration notions that encompass full rankings, sub-rankings, and top-k rankings, proving that full-rank calibration implies the others but not vice versa. They empirically demonstrate that popular label ranking models often exhibit poor calibration, highlighting significant differences between sub-ranking and top-k metrics. The study also applies its calibration framework to reinforcement learning from human feedback (RLHF) reward models, revealing a strong correlation between calibration and benchmark accuracy, indicating that calibration captures a meaningful quality dimension beyond mere top-1 accuracy. These findings underscore the need for further research into the effects of miscalibration and the development of correction methods.
Methodology
The authors develop a hierarchy of calibration notions for ProLR, theoretically investigating the relationships between these notions. They conduct empirical evaluations of popular label ranking models, assessing their calibration properties and comparing sub-ranking and full-ranking metrics. The framework is applied to RLHF reward models to analyze the correlation between calibration and accuracy.
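One intuition for the hierarchy, offered here as an illustration rather than the paper's proof, is that top-k predictions are marginals of the distribution over full rankings. The sketch below computes such a marginal from a hypothetical dictionary that maps full rankings (tuples of labels) to probabilities.

```python
from itertools import permutations

def top_k_marginal(ranking_probs, k):
    """Marginalize a distribution over full rankings to a distribution over top-k prefixes."""
    marginal = {}
    for ranking, prob in ranking_probs.items():
        prefix = ranking[:k]
        marginal[prefix] = marginal.get(prefix, 0.0) + prob
    return marginal

# Uniform distribution over all rankings of three labels (hypothetical example).
ranking_probs = {perm: 1.0 / 6.0 for perm in permutations("abc")}
top1 = top_k_marginal(ranking_probs, k=1)
print(top1)
```

Because the coarser predictions are functions of the full-ranking prediction, it is plausible that calibration at the full-ranking level carries over to them, while the converse can fail since many full-ranking distributions share the same marginals.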
Results
The study finds that popular label ranking models are often poorly calibrated, with substantial differences observed between sub-ranking and top-k calibration metrics. The empirical analysis shows that calibration correlates strongly with benchmark accuracy, suggesting that it captures a significant quality dimension beyond just top-1 accuracy.
Implications
The findings suggest that improving calibration in label ranking models could enhance their reliability and effectiveness in applications such as reinforcement learning from human feedback. Understanding miscalibration's downstream effects could lead to better decision-making processes in various machine learning applications.
idSCD: Identifying Training Datasets through Semantic Correlation Descriptors
NLP
Theory
Interpretability
- Introduces a semantic approach to dataset-level membership inference, moving beyond behavioral evidence.
- Develops Semantic Correlation Descriptors (SCDs) to capture and compare semantic correlation structures across datasets.
- Proposes a practical membership score that does not require leave-one-dataset-out models.
- Achieves superior performance compared to existing black-box and white-box methods in various experimental settings.
Summary
The paper introduces a novel approach for identifying training datasets based on the semantic correlation structures that models learn during training. The authors propose Semantic Correlation Descriptors (SCDs) as a method to capture these structures, which can reveal dataset-specific traces in a model's behavior. Unlike traditional methods that rely on behavioral evidence such as confidence scores or prediction margins, the SCD approach focuses on the internal semantic associations learned by the model. The authors demonstrate that SCDs can effectively distinguish between matching and non-matching dataset pairs in a controlled setting. They also propose a practical membership score that utilizes SCDs to determine if a target dataset was part of the training mixture of a model, without the need for leave-one-dataset-out models. The effectiveness of this approach is validated across three diverse experimental settings: natural language inference, emotion classification, and medical text classification, showing significant improvements over existing methods.
Methodology
The authors developed Semantic Correlation Descriptors (SCDs) to summarize the semantic correlation structures learned by models. They conducted a controlled leave-one-dataset-out diagnostic to validate the effectiveness of SCDs in recovering dataset-specific changes. A practical membership score was then proposed, which only requires the model's SCD and the standalone SCD of the target dataset to assess membership.
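The summary does not give the paper's actual SCD construction or membership score, so the following is only a toy illustration of the general idea of comparing two correlation-structure descriptors. It assumes, purely for illustration, that a descriptor is the set of pairwise Pearson correlations between feature columns and that the score is the cosine similarity between two descriptors. The names `correlation_descriptor`, `descriptor_similarity`, and `make_features` are hypothetical and do not come from the paper.

```python
import numpy as np


def correlation_descriptor(features):
    """Hypothetical stand-in for an SCD: the upper-triangular entries of the
    Pearson correlation matrix between feature columns."""
    features = np.asarray(features, dtype=float)
    corr = np.corrcoef(features, rowvar=False)
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return corr[rows, cols]


def descriptor_similarity(desc_a, desc_b):
    """Hypothetical membership score: cosine similarity of two descriptors."""
    denom = np.linalg.norm(desc_a) * np.linalg.norm(desc_b)
    return float(desc_a @ desc_b / denom)


def make_features(rng, n_samples, linked_pair, sign=1.0):
    """Toy data with three columns; the two columns in `linked_pair` are
    strongly correlated (or anti-correlated when sign is negative)."""
    x = rng.normal(size=(n_samples, 3))
    i, j = linked_pair
    x[:, j] = sign * x[:, i] + 0.1 * rng.normal(size=n_samples)
    return x


rng = np.random.default_rng(0)
model_side = make_features(rng, 2000, (0, 1))             # stands in for the model's descriptor source
member = make_features(rng, 2000, (0, 1))                 # same correlation structure
non_member = make_features(rng, 2000, (0, 2), sign=-1.0)  # different structure

model_desc = correlation_descriptor(model_side)
score_member = descriptor_similarity(model_desc, correlation_descriptor(member))
score_non_member = descriptor_similarity(model_desc, correlation_descriptor(non_member))
print(score_member, score_non_member)
```

In this toy setup the dataset that shares the model-side correlation structure receives the higher score. A real descriptor would be built from a model's learned semantic associations rather than from raw synthetic features.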
Results
The idSCD classifier, based on the proposed membership score, achieved the highest average performance and lowest standard deviation across three experimental settings, outperforming black-box baselines (RMIA, Attack-P, LiRA) and the white-box SIF baseline. The largest relative gain in ROC-AUC exceeded 60% when dataset groups exhibited distinct semantic characteristics.
Implications
This work has significant implications for model auditing, privacy, and accountability in machine learning. By enabling the identification of training datasets, it can help mitigate issues related to benchmark contamination and enhance reproducibility in research.
Learning Multi-Agent Coordination via Sheaf-ADMM
Optimization
Graph Learning
Robotics
- Introduces Sheaf-ADMM for multi-agent coordination with limited local views.
- Utilizes cellular sheaf theory to define inter-agent constraints for heterogeneous consensus.
- Demonstrates improved performance on tasks like maze pathfinding, image classification, and Sudoku.
- Enhances robustness to distribution shifts in MNIST classification compared to standard CNNs.
Summary
This paper introduces Sheaf-ADMM, a differentiable optimization framework for multi-agent coordination in settings where each agent has only a limited local view of the input. The framework decomposes the input into overlapping local views, and each agent solves a convex subproblem parameterized by a neural encoder. Agents coordinate through the Alternating Direction Method of Multipliers (ADMM), with inter-agent constraints defined by a cellular sheaf. The sheaf structure lets agents agree on specific aspects of their solutions rather than on the full solution, which yields a heterogeneous global consensus. The authors evaluate Sheaf-ADMM on maze pathfinding, image classification, and Sudoku, showing that agents learn to coordinate even when no single local view is sufficient. The method is more robust to distribution shifts in MNIST classification than standard CNNs and achieves higher Sudoku solve rates than parameter-matched MPNN baselines. The ADMM structure also allows the primal, consensus, and dual state variables to be analyzed separately, offering insights into coordination dynamics that traditional message-passing architectures do not expose.
Methodology
The methodology involves formulating coordination as a constrained optimization problem solved using ADMM. Each agent independently solves local subproblems parameterized by a neural network encoder, followed by a consensus step that projects their proposals towards global consistency. The entire process is differentiable, allowing for backpropagation through the optimization trajectory.
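The local-solve, consensus-projection, and dual-update loop described above can be sketched with plain NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: each agent's convex subproblem is replaced by a fixed quadratic `0.5 * ||x_i - a_i||^2` (the paper uses a neural encoder), the sheaf is represented by one pair of restriction matrices per edge, and the function names `coboundary` and `sheaf_admm` are hypothetical.

```python
import numpy as np


def coboundary(n_agents, dim, edges):
    """Stack one block row per edge encoding F_src x_src - F_dst x_dst = 0.
    Each edge is (src, dst, F_src, F_dst) with restriction matrices F."""
    rows = []
    for src, dst, f_src, f_dst in edges:
        block = np.zeros((f_src.shape[0], n_agents * dim))
        block[:, src * dim:(src + 1) * dim] = f_src
        block[:, dst * dim:(dst + 1) * dim] = -f_dst
        rows.append(block)
    return np.vstack(rows)


def sheaf_admm(targets, delta, rho=1.0, n_iters=100):
    """Scaled-form ADMM for: min sum_i 0.5*||x_i - a_i||^2  s.t.  delta @ x = 0."""
    targets = np.asarray(targets, dtype=float)
    n_agents, dim = targets.shape
    a = targets.reshape(-1)
    # Orthogonal projector onto ker(delta), the set of globally consistent states.
    proj = np.eye(n_agents * dim) - np.linalg.pinv(delta) @ delta
    x = np.zeros_like(a)
    z = np.zeros_like(a)  # consensus variable
    u = np.zeros_like(a)  # scaled dual variable
    for _ in range(n_iters):
        x = (a + rho * (z - u)) / (1.0 + rho)  # closed-form local solves
        z = proj @ (x + u)                     # consensus projection
        u = u + x - z                          # dual update
    shape = (n_agents, dim)
    return x.reshape(shape), z.reshape(shape), u.reshape(shape)


# Three agents on a path, each holding a 2-D state. Agents 0 and 1 must agree
# only on coordinate 0; agents 1 and 2 must agree only on coordinate 1.
pick_first = np.array([[1.0, 0.0]])
pick_second = np.array([[0.0, 1.0]])
edges = [(0, 1, pick_first, pick_first), (1, 2, pick_second, pick_second)]
delta = coboundary(3, 2, edges)
targets = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
x, z, u = sheaf_admm(targets, delta)
print(np.round(x, 3))
```

Because the restriction maps select different coordinates on each edge, the agents converge to states that agree only where the sheaf requires it, which is what distinguishes this from ordinary consensus in which all agents share one vector. Making the loop differentiable, as the paper describes, would require an autodiff framework and is outside this sketch.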
Results
Evaluated on maze pathfinding, image classification (MNIST), and Sudoku, Sheaf-ADMM agents coordinate to produce correct global outputs despite limited local views. The method is more robust to distribution shifts than standard CNNs and achieves significantly higher Sudoku solve rates than parameter-matched MPNN baselines.
Implications
The findings suggest that Sheaf-ADMM can be applied to various multi-agent systems where coordination is essential, particularly in environments with limited local information. The framework's interpretability and distinct state variable structure may also facilitate further research into multi-agent dynamics and optimization.