AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 44 · Update frequency: 8h · Days of history: 7
OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
Optimization
- OpenG2G is an open-source library for simulating AI datacenter-grid coordination.
- The platform allows for the comparison of various control strategies and their impacts on both AI and grid performance.
- Realistic simulations demonstrate the potential for AI datacenters to provide power flexibility to the grid.
- The modular architecture supports easy integration of new AI workloads and grid configurations.
Summary
The paper presents OpenG2G, an open-source simulation platform designed to address the challenges posed by the increasing energy demands of AI datacenters on the electricity grid. As AI workloads grow, they create significant capacity and reliability issues for the grid, leading to delays in datacenter interconnections and bottlenecks in AI development. OpenG2G enables users to explore various control strategies for coordinating AI datacenters with grid operations, allowing for real-time adjustments in power consumption based on grid signals. The platform features a modular architecture that integrates a datacenter backend based on real AI service measurements, a grid backend utilizing high-fidelity simulators, and a flexible controller interface. This setup allows for the implementation and comparison of different control paradigms, including classical and learning-based approaches. The authors demonstrate OpenG2G's capabilities through realistic scenarios involving AI inference workloads, revealing how different AI model choices impact datacenter flexibility and coordination outcomes. The findings suggest that effective coordination can lead to favorable trade-offs between AI performance metrics and grid operational requirements.
Methodology
The authors developed OpenG2G by creating a simulation framework that includes a datacenter backend driven by real-world AI service data, a grid backend utilizing traditional grid simulators, and a generic controller interface. This architecture allows users to implement various control strategies and assess their effectiveness in coordinating AI workloads with grid operations.
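To illustrate what a pluggable control strategy might look like, here is a minimal sketch of a classical proportional feedback controller against a hypothetical controller interface; the names (GridSignal, act) are illustrative assumptions, not OpenG2G's actual API.

```python
from dataclasses import dataclass

@dataclass
class GridSignal:
    frequency_hz: float          # grid state reported by the grid backend
    nominal_hz: float = 60.0

class ProportionalController:
    """Classical feedback: shed datacenter load when grid frequency sags."""
    def __init__(self, base_power_mw: float, gain_mw_per_hz: float):
        self.base_power_mw = base_power_mw
        self.gain = gain_mw_per_hz

    def act(self, signal: GridSignal) -> float:
        # Positive error means the grid is stressed; cap power proportionally.
        error = signal.nominal_hz - signal.frequency_hz
        return max(0.0, self.base_power_mw - self.gain * error)
```

A learning-based controller would expose the same act interface, which is what makes side-by-side comparison of control paradigms straightforward.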
Results
The simulations conducted using OpenG2G revealed that different AI model and deployment choices significantly affect the datacenter's power flexibility and coordination outcomes. The platform successfully demonstrated the ability to implement and compare classical feedback controllers and learning-based controllers, showcasing the trade-offs between AI performance metrics (like throughput and latency) and grid operational metrics (like voltage stability).
Implications
OpenG2G has the potential to inform design decisions for future AI datacenter projects by providing insights into how to effectively coordinate AI workloads with grid operations. This could lead to more efficient energy use, reduced delays in datacenter deployment, and improved reliability of electricity grids in the face of growing AI demands.
QuadraSHAP: Stable and Scalable Shapley Values for Product Games via Gauss-Legendre Quadrature
Efficient ML
Interpretability
Theory
- QuadraSHAP provides a stable and scalable method for computing Shapley values in product games.
- The method utilizes a Gauss-Legendre quadrature scheme to achieve high precision with fewer nodes.
- Numerical stability is enhanced through log-space evaluation, reducing overflow and underflow issues.
- QuadraSHAP matches the performance of existing methods while significantly improving computational efficiency.
Summary
This paper presents QuadraSHAP, a novel method for efficiently computing Shapley values in product games, which are cooperative games where the coalition value is a product of individual player contributions. The authors derive an exact one-dimensional integral representation for the Shapley value, allowing for the application of Gauss-Legendre quadrature to compute these values. This approach significantly reduces the computational complexity associated with traditional methods, which often require evaluating exponentially many feature coalitions. QuadraSHAP achieves high precision with a relatively small number of quadrature nodes, making it suitable for high-dimensional problems. The method is implemented in a numerically stable manner using log-space evaluations and is optimized for parallel computation, resulting in a total work complexity of O(d·m·q) and parallel time of O(log d). Experimental results demonstrate that QuadraSHAP outperforms existing methods in terms of speed and stability, particularly in large ensemble sizes and high-dimensional settings.
Methodology
The authors reformulate the Shapley value for product games as an integral of a polynomial over the interval [0, 1]. They employ Gauss-Legendre quadrature for efficient computation, which allows for exact results when the number of quadrature nodes is sufficiently large. The implementation is optimized for numerical stability using log-space evaluations and is designed for parallel processing using associative scan primitives.
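To make the quadrature idea concrete, here is a minimal NumPy sketch under the standard random-arrival integral representation of the Shapley value, specialized to product games; the paper's exact parametrization and stabilization details may differ, and positive player contributions are assumed so the log-space product is well defined.

```python
import numpy as np

def product_game_shapley(p, num_nodes=None):
    """Sketch: Shapley values of a product game v(S) = prod_{j in S} p_j.

    Uses the random-arrival integral phi_i = (p_i - 1) * I_i with
    I_i = int_0^1 prod_{j != i} (1 - t + t * p_j) dt. The integrand is a
    degree-(d-1) polynomial, so Gauss-Legendre with about d/2 nodes is exact.
    Assumes all p_j > 0 so the log-space evaluation is well defined."""
    p = np.asarray(p, dtype=float)
    d = p.size
    m = num_nodes or d // 2 + 1
    x, w = np.polynomial.legendre.leggauss(m)   # nodes/weights on [-1, 1]
    t, w = 0.5 * (x + 1.0), 0.5 * w             # rescale to [0, 1]
    log_f = np.log(1.0 - t[:, None] + t[:, None] * p[None, :])   # (m, d)
    # Leave-one-out product in log-space: divide out factor j by subtracting.
    loo = np.exp(log_f.sum(axis=1, keepdims=True) - log_f)       # (m, d)
    return (p - 1.0) * (w[:, None] * loo).sum(axis=0)
```

For example, product_game_shapley([2.0, 3.0]) returns [2.0, 3.0], whose sum equals v(N) - v(empty) = 6 - 1, satisfying the efficiency axiom. The polynomial integrand is why a fixed, small number of nodes suffices regardless of how many coalitions the naive sum would enumerate.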
Results
QuadraSHAP is shown to be the fastest numerically stable method for computing Shapley values across various configurations. It outperforms existing methods, such as Linear TreeSHAP, by 3-5 times in large ensemble sizes while maintaining stability in high-dimensional scenarios. The method also reduces computational costs significantly compared to previous approaches for product-kernel methods.
Implications
The development of QuadraSHAP has significant implications for machine learning explainability, particularly in scenarios involving complex models with multiplicative structures. Its efficiency and stability make it a valuable tool for practitioners seeking to interpret model predictions in high-dimensional settings, enhancing the transparency and trustworthiness of AI systems.
FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing
Federated Learning
Optimization
Theory
- Introduction of FedFrozen, a two-stage federated optimization framework.
- First analysis of federated linear attention by decomposing it into query/key and value blocks.
- The warm-up phase allows for learning a stable attention kernel, while the frozen phase optimizes the value block.
- Demonstrated improvements in stability and effectiveness of Transformer models in federated learning.
Summary
The paper introduces FedFrozen, a novel two-stage federated optimization framework designed to address the challenges of heterogeneous clients in federated learning, particularly the issue of client drift caused by inconsistent local updates. The authors leverage insights from Transformer-based architectures, which have shown robustness in federated settings. FedFrozen operates in two stages: the first stage involves a warm-up training period where the full attention module is trained, allowing the query/key block to establish a stable attention kernel. In the second stage, this query/key block is frozen, and only the value block is optimized, thus reducing the impact of client drift. The methodology includes a theoretical analysis of the two stages, revealing that the warm-up phase performs inexact descent on a regularized kernel-profile objective, while the frozen phase focuses on optimizing the value block under a fixed attention kernel. The authors validate their approach through simulations and real-data experiments, demonstrating improved stability and effectiveness of Transformer models in heterogeneous federated learning scenarios.
Methodology
FedFrozen employs a two-stage approach: first, a warm-up training phase where the full attention module is trained to establish a useful shared kernel, followed by a second phase where the query/key block is frozen and only the value block is updated. The authors provide a theoretical framework to analyze the optimization behavior during both stages.
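In code, the two-stage schedule reduces to toggling requires_grad on the attention sub-blocks between rounds. The sketch below assumes PyTorch-style parameter names containing q_proj/k_proj; actual module naming varies by model.

```python
import torch.nn as nn

def apply_fedfrozen_stage(model: nn.Module, round_idx: int, warmup_rounds: int):
    # Stage 1 (warm-up): everything trains, letting the attention kernel form.
    # Stage 2 (frozen): query/key projections are fixed so the shared kernel
    # stays constant across clients, while the value block continues to train.
    frozen = round_idx >= warmup_rounds
    for name, param in model.named_parameters():
        is_query_key = "q_proj" in name or "k_proj" in name
        param.requires_grad = not (frozen and is_query_key)
```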
Results
The experiments conducted show that FedFrozen significantly enhances the stability and performance of Transformer models in heterogeneous federated learning environments. The simulations support the theoretical predictions regarding bias-drift behavior, confirming the effectiveness of the proposed method.
Implications
FedFrozen has the potential to improve federated learning applications in various domains where data is distributed across heterogeneous clients, such as healthcare, finance, and mobile applications. Its focus on optimizing attention mechanisms can lead to more robust models in real-world scenarios.
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
Reinforcement Learning
Optimization
Robotics
- Establishes a formal link between recurrent policies and the Pontryagin minimum principle (PMP).
- Introduces Neural Co-state Policies (NCP) to structure hidden states in RNNs.
- Proposes a co-state loss to align training with optimal control dynamics.
- Demonstrates improved performance and robustness in partially observable tasks.
Summary
This paper addresses the challenge of partial observability in reinforcement learning (RL) by establishing a formal connection between recurrent neural networks (RNNs) and the Pontryagin minimum principle (PMP) from optimal control theory. The authors introduce Neural Co-state Policies (NCP), which align the hidden states of RNNs with the co-state dynamics dictated by PMP. This alignment allows for the interpretation of the readout layer as performing Hamiltonian minimization. To facilitate this structure, the authors propose a co-state loss derived from the Hamilton-Jacobi-Bellman (HJB) equation, which helps guide the training of the recurrent architectures. The empirical evaluation on partially observable tasks from the DeepMind Control Suite demonstrates that NCPs either match or exceed the performance of traditional recurrent policies trained with proximal policy optimization (PPO). Additionally, the structured internal dynamics of NCPs show robustness against out-of-distribution sensor masking, suggesting a more reliable approach to continuous control tasks. Overall, this work bridges classical optimal control with deep RL, providing a theoretically grounded framework for designing robust policies.
Methodology
The authors develop Neural Co-state Policies (NCP) by mapping the hidden states of standard recurrent architectures (like CT-RNNs and GRUs) to optimal co-states as defined by the Pontryagin minimum principle. They introduce a co-state loss derived from the Hamilton-Jacobi-Bellman equation to guide the training process, ensuring that the hidden states reflect optimality conditions. The approach is empirically tested on continuous control tasks from the DeepMind Control Suite.
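The "readout as Hamiltonian minimization" idea has a familiar closed form in the control-affine, quadratic-cost special case, sketched below; this is a textbook PMP identity offered for intuition, not necessarily the learned readout the authors use.

```python
import torch

def hamiltonian_minimizing_action(costate, g_x, R):
    """For dynamics x' = f(x) + g(x) a with running cost l(x) + 0.5 * a^T R a,
    the Hamiltonian H = l + lambda^T (f + g a) + 0.5 * a^T R a is minimized at
    a* = -R^{-1} g(x)^T lambda, where lambda is the co-state (in NCP terms,
    the RNN's structured hidden state)."""
    return -torch.linalg.solve(R, g_x.transpose(-1, -2) @ costate)
```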
Results
The application of the co-state loss to CT-RNN and GRU architectures resulted in performance that matches or exceeds that of recurrent policies trained with proximal policy optimization (PPO). The structured internal dynamics of NCPs also exhibited robustness to zero-shot out-of-distribution sensor masking, indicating a significant improvement in handling partial observability.
Implications
This work has implications for the design of intelligent agents capable of robust performance in real-world environments characterized by partial observability. By providing a principled framework that connects deep reinforcement learning with optimal control theory, it opens avenues for developing more interpretable and reliable continuous control policies.
Hybrid Quantum-Classical GANs for the Generation of Adversarial Network Flows
Generative Models
- Introduction of a hybrid quantum-classical GAN framework for generating adversarial network traffic.
- Utilization of variational quantum circuits to enhance the expressiveness of latent representations.
- Evaluation of generated traffic against classical IDS models to test evasion capabilities.
- Highlighting the implications of quantum computing in improving attack flow generation.
Summary
This paper introduces a novel hybrid quantum-classical generative adversarial network (QC-GAN) framework aimed at generating synthetic network traffic flows that mimic malicious activity. Traditional GANs face challenges such as mode collapse, high computational demands, and the necessity for large datasets. The proposed QC-GAN addresses these issues by utilizing a variational quantum generator to create latent representations encoded as quantum states, which enhances expressiveness and reduces computational overhead. A classical discriminator is trained on real-world datasets (UNSW-NB15) alongside the generated traffic to differentiate between real and fake flows. The iterative training process involves the generator minimizing the discriminator's ability to classify correctly, while the discriminator seeks to maximize its accuracy. The threat model assumes an attacker with limited quantum computing capability facing a purely classical discriminator, reflecting realistic near-term scenarios. The generated flows are evaluated against classical intrusion detection systems (IDS) using models like random forests and convolutional neural networks to assess their effectiveness in evading detection. Additionally, the impact of hardware-based noise on these attacks is examined, emphasizing the need for quantum-resilient defense mechanisms in IDS.
Methodology
The methodology involves using a variational quantum generator to create synthetic network traffic flows, which are then evaluated against a classical discriminator trained on real-world datasets. The generator and discriminator are engaged in an adversarial training process to optimize their respective performances.
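The adversarial loop itself is a standard GAN minimax step; the quantum part lives entirely inside the generator. In the hedged sketch below, quantum_generator is a stand-in for the variational quantum circuit (any callable mapping latent vectors to flow features differentiably, e.g. through a quantum SDK's PyTorch interface), and the discriminator is any classical model emitting one logit per flow.

```python
import torch
import torch.nn as nn

def qc_gan_step(quantum_generator, discriminator, real_flows, opt_g, opt_d,
                z_dim: int = 8):
    """One adversarial round of the hybrid setup (names are illustrative)."""
    bce = nn.BCEWithLogitsLoss()
    z = torch.randn(real_flows.size(0), z_dim)
    fake = quantum_generator(z)

    # Discriminator step: maximize accuracy on real vs. generated flows.
    opt_d.zero_grad()
    d_loss = (bce(discriminator(real_flows), torch.ones(len(real_flows), 1))
              + bce(discriminator(fake.detach()), torch.zeros(len(fake), 1)))
    d_loss.backward()
    opt_d.step()

    # Generator step: minimize the discriminator's ability to spot fakes.
    opt_g.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(len(fake), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```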
Results
The results demonstrate that the QC-GAN can effectively generate adversarial network flows that successfully evade detection by classical IDS models. The study also reveals that the expressiveness of quantum-generated traffic can lead to more sophisticated attack patterns, challenging existing detection systems.
Implications
The findings suggest significant potential for quantum machine learning in cybersecurity, particularly in generating advanced attack flows that can stress-test classical intrusion detection systems. This highlights the necessity for developing quantum-resilient defense strategies to counteract evolving threats in network security.
Two-Stage Learned Decomposition for Scalable Routing on Multigraphs
Optimization
Reinforcement Learning
Graph Learning
- Introduces Node-Edge Policy Factorization (NEPF) for scalable routing on multigraphs.
- Utilizes a pre-encoding edge aggregation scheme to reduce memory and computational costs.
- Employs a non-autoregressive architecture for efficient edge selection.
- Demonstrates superior performance in solution quality and speed compared to existing methods.
Summary
This paper addresses the challenges of applying neural methods to Vehicle Routing Problems (VRPs) on multigraphs, where multiple edges represent different travel options. Traditional approaches struggle with scalability due to the dense representation of multigraphs. The authors propose a Node-Edge Policy Factorization (NEPF) approach that decomposes the routing policy into two stages: a node permutation stage and an edge selection stage. This decomposition is enabled by a pre-encoding edge aggregation scheme that reduces the memory footprint and a non-autoregressive architecture for edge selection. The proposed method is trained using hierarchical reinforcement learning, allowing for joint optimization of both stages. Experimental results across six VRP variants show that NEPF not only matches but often outperforms state-of-the-art methods in solution quality while being significantly faster in both training and inference. This work represents a significant step towards scalable neural routing solutions for complex multigraph scenarios.
Methodology
The authors propose a two-stage decomposition of the routing problem into a node permutation stage and an edge selection stage. They introduce a pre-encoding edge aggregation mechanism to summarize parallel edges into a compact representation, avoiding the need for a dense multigraph during encoding. A lightweight non-autoregressive architecture is used for the edge selection stage, and both stages are trained jointly using hierarchical reinforcement learning.
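The pre-encoding aggregation step can be pictured as collapsing each bundle of parallel edges into fixed-size summary statistics before the encoder runs, so memory scales with node pairs rather than edge multiplicity. The min/mean/max summary below is an illustrative assumption; the paper's learned aggregation is likely richer.

```python
import numpy as np

def aggregate_parallel_edges(edges, num_nodes, num_feats):
    """Sketch: summarize the parallel edges between each node pair (u, v)
    into a compact vector so the encoder never sees the dense multigraph.
    edges: iterable of (u, v, feat) with feat a length-num_feats array."""
    buckets = {}
    for u, v, feat in edges:
        buckets.setdefault((u, v), []).append(np.asarray(feat, dtype=float))
    summary = np.zeros((num_nodes, num_nodes, 3 * num_feats))
    for (u, v), feats in buckets.items():
        f = np.stack(feats)                       # (num_parallel, num_feats)
        summary[u, v] = np.concatenate([f.min(0), f.mean(0), f.max(0)])
    return summary
```

The edge-selection stage can then recover a concrete edge for each consecutive node pair in the permutation, which is where the non-autoregressive head comes in.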
Results
The NEPF approach matches or outperforms state-of-the-art methods across six VRP variants in terms of solution quality. Additionally, it demonstrates significant improvements in training and inference speed, making it a more efficient alternative for routing on multigraphs.
Implications
This work has the potential to enhance the efficiency of routing solutions in real-world transportation networks, where multigraph representations are common. The NEPF framework could serve as a foundational model for future research in routing problems and may facilitate the development of more generalizable routing models.
On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning
Theory
Graph Learning
Large Language Models
- Identification of catastrophic model collapse in causal reasoning fine-tuning with a 100% occurrence rate.
- Introduction of a semantic loss function with graph-based constraints to prevent model collapse.
- Achieved significant performance improvements on causal reasoning tasks compared to collapsed baselines.
- Comprehensive evaluation across 200,000+ samples validating the necessity of semantic loss for stable predictions.
Summary
This paper addresses a critical issue in the fine-tuning of transformer models for causal reasoning tasks, specifically the phenomenon of catastrophic model collapse. The authors demonstrate that standard fine-tuning methods lead to models that predict trivial outputs ('Yes' or 'No') regardless of the input structure, resulting in misleadingly high accuracy rates. They identify a 100% collapse rate when fine-tuning the Gemma 270M model on transitivity and d-separation tasks without incorporating a semantic loss function. To mitigate this issue, the authors propose a novel semantic loss function that integrates graph-based logical constraints and employs dynamic lambda scheduling. Their approach significantly improves model performance, achieving 70.4% accuracy on transitivity tasks and 68.6% on d-separation tasks, representing a 42.7% improvement over collapsed baselines. The paper includes comprehensive evaluations on over 200,000 samples and adversarial testing, confirming that models utilizing the semantic loss function maintain stable, context-dependent predictions, while collapsed models exhibit severe performance degradation.
Methodology
The authors developed a semantic loss function that incorporates logical constraints based on causal graphs and utilizes dynamic lambda scheduling to balance stability and learning during fine-tuning. They conducted extensive experiments to evaluate the performance of their approach against standard fine-tuning methods.
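A minimal sketch of the combined objective, assuming a linear ramp for the dynamic lambda schedule and a generic per-sample violation score for the graph-based constraints; the paper's exact schedule and penalty form may differ.

```python
def semantic_total_loss(ce_loss, constraint_violations, step, total_steps,
                        lam_max=1.0):
    """Sketch: blend the task loss with a graph-constraint penalty.
    constraint_violations: tensor of per-sample violations of graph-derived
    logical constraints (e.g., answering 'Yes' on a pair with no directed
    path in the causal graph). The linear ramp lets the model fit the task
    before the semantic penalty dominates."""
    lam = lam_max * min(1.0, step / max(1, total_steps))
    return ce_loss + lam * constraint_violations.mean()
```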
Results
The proposed semantic loss function led to a 42.7% improvement in accuracy on causal reasoning tasks, achieving 70.4% accuracy on transitivity and 68.6% on d-separation tasks. In adversarial evaluations, semantic models achieved a stable 67-70% accuracy, while collapsed models were erratic, with accuracies between 43% and 71% that merely track the label distribution rather than genuine reasoning. The findings were validated through benchmarking on over 200,000 evaluation samples.
Implications
This research highlights the importance of incorporating semantic constraints in fine-tuning transformer models for causal reasoning, suggesting that such approaches could enhance the reliability and robustness of AI systems in understanding complex causal relationships.
Attribution-Guided Continual Learning for Large Language Models
Large Language Models
NLP
- Introduces an attribution-guided framework for continual learning in LLMs.
- Estimates task-specific parameter importance to modulate gradient updates.
- Addresses the limitations of existing methods that lack semantic awareness.
- Demonstrates superior performance in retaining knowledge from previous tasks.
Summary
This paper addresses the challenge of catastrophic forgetting in large language models (LLMs) during continual learning, where the model's performance on previously learned tasks deteriorates after learning new tasks. The authors propose an innovative framework called attribution-guided continual fine-tuning, which estimates task-specific, element-wise parameter importance in each Transformer layer. By modulating gradients based on these importance scores, the method ensures that parameters critical to earlier tasks receive smaller updates, while allowing less important parameters to adapt to new tasks. The authors demonstrate that existing methods, such as data replay, parameter freezing, and regularization, fail to account for the semantic distribution of knowledge across model parameters. Their approach provides a mechanistic understanding of parameter importance, leading to improved retention of old task performance while maintaining competitive results on new tasks. Experiments on continual learning benchmarks show that the proposed method consistently outperforms baseline approaches, highlighting its effectiveness in mitigating forgetting in LLMs.
Methodology
The proposed framework utilizes an attribution-based importance estimation procedure to identify critical parameters for previously learned tasks. It selectively constrains updates to these important parameters during the fine-tuning process, allowing less critical parameters to adapt to new tasks. This approach is evaluated under both full supervised fine-tuning and LoRA-based adaptation settings.
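Mechanically, the gradient modulation amounts to an element-wise damping applied between backward() and the optimizer step. The sketch below assumes a precomputed importance tensor per parameter (e.g., from an attribution pass over old-task data) and a 1/(1 + alpha*I) damping rule, which is one plausible choice rather than the paper's exact formula.

```python
import torch

@torch.no_grad()
def modulate_gradients(model, importance, alpha=5.0):
    """Sketch: damp updates to parameters important for earlier tasks.
    importance: dict mapping parameter name -> element-wise importance tensor.
    Important elements get tiny updates; unimportant ones adapt freely."""
    for name, param in model.named_parameters():
        if param.grad is not None and name in importance:
            param.grad *= 1.0 / (1.0 + alpha * importance[name])
```

Called as modulate_gradients(model, importance) after loss.backward() and before optimizer.step(), this nearly freezes the most critical parameters while leaving the rest free to learn the new task.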
Results
The experiments conducted on continual learning benchmarks reveal that the attribution-guided continual fine-tuning framework consistently outperforms existing methods, achieving better retention of performance on old tasks while still maintaining competitive performance on new tasks.
Implications
The findings suggest that incorporating parameter importance into the continual learning process can significantly enhance the adaptability of LLMs to new tasks without sacrificing previously acquired knowledge. This has potential applications in various domains where LLMs are deployed for sequential task learning.
Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs
Reinforcement Learning
Robotics
Large Language Models
- Introduction of the Masking Reward Behavior Tree (MRBT) for compositional tasks.
- Automated design of reactive and modular rewards and action masks using LLMs.
- Successful generation and refinement of MRBTs leading to improved training efficiency.
- Demonstrated advantages of MRBTs in terms of transferability and modularity.
Summary
This paper presents a novel approach to enhancing the learning efficiency of autonomous agents in compositional tasks by utilizing a Masking Reward Behavior Tree (MRBT) framework. The authors argue that decomposing complex tasks into simpler subtasks can significantly improve reinforcement learning (RL) outcomes, particularly when effective reward shaping and action masking are employed. While previous methods have leveraged large language models (LLMs) for automating these processes, they often fall short in addressing reactivity to subtask failures and modularity across varying task objects. The MRBT framework is introduced as a symbolic structure that provides a reactive and modular reward and action mask function. The authors develop an MRBT template and logical specifications to ensure correctness in task execution. An automated pipeline is created to generate MRBTs using LLMs, verify them with an SMT-solver, and integrate them into a neurosymbolic RL loop for training agents. Experimental results demonstrate that the MRBTs consistently outperform baseline methods, achieving higher training efficiency and task success rates. The paper highlights the advantages of MRBTs, including transferability, modularity, and verifiability, making them a promising solution for complex robotic tasks.
Methodology
The authors developed a Masking Reward Behavior Tree (MRBT) template for sequential object-interaction subtasks, which outputs rewards and action masks based on executed leaf behaviors. They utilized an LLM to generate MRBTs that are robust to varying task objects, and an SMT-solver to verify the correctness of the logical specifications. The MRBTs were integrated into a neurosymbolic RL training loop to optimize agent performance across a defined task space.
Results
The experiments showed that the MRBTs consistently improved training efficiency and task success rates compared to baseline methods and MRBTs without action masking. In the most complex task space, agents achieved an average success rate of over 80%, compared to less than 70% for baseline approaches. The MRBTs also demonstrated advantages in transferability and modularity, particularly in more complex tasks.
Implications
The findings suggest that MRBTs can significantly enhance the performance of autonomous agents in robotic applications, particularly in tasks requiring complex object interactions. The automated design process for rewards and action masks could streamline the development of RL systems in various domains, including robotics and automation.
Normalized Architectures are Natively 4-Bit
Large Language Models
Efficient ML
Optimization
- nGPT architecture is natively robust to 4-bit quantization, eliminating the need for costly quantization interventions.
- Robustness arises from effective signal accumulation rather than noise suppression, enhancing SNR per layer.
- Training dynamics under the hypersphere constraint promote distributed alignments across dimensions, ensuring signal coherence.
- Empirical validation shows nGPT maintains stability and lower relative error compared to standard transformers across diverse architectures.
Summary
This paper presents nGPT, a novel transformer architecture that constrains weights and hidden representations to the unit hypersphere, enabling robust training of large language models at 4-bit precision without the need for traditional quantization interventions. The authors demonstrate that this architectural choice enhances the model's ability to maintain signal coherence during low-precision arithmetic, leading to a higher effective signal-to-noise ratio (SNR) and a flatter loss landscape. The study validates the approach across various model sizes, including a 1.2B dense model and hybrid Mamba-Transformer Mixture-of-Experts (MoE) models with up to 30B parameters. The findings suggest a paradigm shift in low-precision model design, emphasizing the importance of architecture in achieving quantization robustness, rather than relying solely on post-hoc corrections.
Methodology
The authors conducted a structural analysis of the nGPT architecture, focusing on the effects of the hypersphere constraint on signal accumulation and quantization noise. They validated their findings through experiments on various model configurations, including dense and hybrid MoE architectures, using NVIDIA's NVFP4 quantization format.
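One way to maintain the hypersphere constraint during training is to project weight rows back to unit norm after each optimizer step, as sketched below; the actual nGPT recipe also normalizes hidden representations and introduces learned scaling factors, which this sketch omits.

```python
import torch

@torch.no_grad()
def renormalize_to_hypersphere(model):
    """Sketch: re-project weight matrix rows onto the unit hypersphere after
    an optimizer step (biases and scalar gains are skipped)."""
    for param in model.parameters():
        if param.dim() >= 2:
            param.div_(param.norm(dim=-1, keepdim=True).clamp_min(1e-12))
```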
Results
The results indicate that nGPT achieves stable end-to-end NVFP4 training without requiring traditional fixes like randomized Hadamard transforms or dynamic per-tensor scaling. The architecture demonstrated a higher effective SNR and maintained lower relative error compared to standard transformer models across multiple configurations.
Implications
The findings suggest that future model designs should prioritize architectural features that enhance quantization robustness, potentially leading to more efficient training and deployment of large language models at lower precision levels. This could significantly impact the scalability and performance of AI systems in resource-constrained environments.
A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers
Theory
Efficient ML
Time Series
- Introduction of a context-conditioned Flux Neural Operator using recurrent Vision Transformers.
- Formulation of an in-context flux-learning problem for parametric conservation laws.
- Demonstration of improved autoregressive stability and out-of-distribution robustness.
- Ability to infer latent numerical flux operators from short observed trajectories.
Summary
This paper presents a novel architecture that enhances the Flux Neural Operator (Flux NO) by incorporating context through a recurrent Vision Transformer (ViT). The proposed model functions as a hypernetwork that captures solution dynamics over a finite temporal window, encoding them with the ViT to generate parameters for a context-conditioned neural operator. This approach allows the model to solve conservation laws without needing explicit knowledge of the governing equations or partial differential equation (PDE) coefficients. The authors demonstrate that their method maintains the robustness and generalization capabilities of Flux NO while providing reliable numerical solutions across various conservative systems, including those with previously unseen fluxes. The architecture is particularly beneficial for long-time predictions and autoregressive stability, outperforming standard neural operators in terms of out-of-distribution robustness.
Methodology
The authors propose a hypernetwork architecture that utilizes a recurrent Vision Transformer to encode solution dynamics from short trajectories. This encoded context is then used to condition a neural operator that predicts numerical fluxes, ensuring that the model adheres to the conservative structure necessary for solving conservation laws. The methodology integrates classical finite volume methods with modern neural operator techniques, allowing for effective adaptation to various conservation problems.
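The conservative structure the model enforces is the classical finite-volume flux-difference update; here is that update with the learned, context-conditioned operator abstracted as interface_flux.

```python
import numpy as np

def conservative_step(u, interface_flux, dt, dx):
    """Sketch of the enforced finite-volume update:
    u_i^{n+1} = u_i^n - (dt/dx) * (F_{i+1/2} - F_{i-1/2}).
    `interface_flux` maps cell averages u (shape (N,)) to numerical fluxes at
    the N+1 cell interfaces; in the paper that role is played by the
    context-conditioned flux neural operator."""
    F = interface_flux(u)                    # shape (N + 1,)
    return u - (dt / dx) * (F[1:] - F[:-1])
```

Because each cell update is a pure flux difference, interior fluxes telescope and the total quantity changes only through the boundary fluxes, which is what guarantees discrete conservation regardless of how the flux network behaves.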
Results
The proposed model shows significant improvements in robustness and generalization compared to traditional neural operators. Experiments on one-dimensional conservation-law benchmarks and a diffusive Burgers-type problem indicate that the context-conditioned Flux NO can reliably produce accurate numerical solutions, even for unseen flux functions. The model's autoregressive stability is enhanced by enforcing a conservative flux-difference update.
Implications
This work has potential implications for scientific computing and numerical simulations, particularly in fields where conservation laws are critical, such as fluid dynamics and material science. The ability to adaptively learn from context without explicit retraining could streamline the modeling of complex systems and enhance predictive capabilities in real-world applications.
COPYCOP: Ownership Verification for Graph Neural Networks
Graph Learning
- COPYCOP is the first fingerprinting method for GNNs that is robust against a wide range of adversarial transformations.
- The algorithm uses stationary points of the embedding function as fingerprints, which are invariant to transformations.
- COPYCOP is architecture-agnostic, allowing it to detect surrogates regardless of differences in model architecture or parameters.
- Extensive experiments validate the effectiveness of COPYCOP across multiple datasets and GNN architectures.
Summary
The paper introduces COPYCOP, a novel algorithm designed to verify the ownership of Graph Neural Networks (GNNs) by detecting surrogate models that may have been trained to mimic the embeddings of a victim GNN. The challenge arises from the fact that adversaries can train surrogate GNNs using the outputs of the victim GNN, applying various transformations such as rotation, scaling, or changes in architecture, which can obscure the relationship between the two models. COPYCOP is unique in its ability to identify these copycat GNNs despite such transformations, providing theoretical guarantees for its effectiveness. The authors demonstrate the robustness of COPYCOP through extensive experiments on 14 datasets and 5 different GNN architectures, showing that it can accurately detect surrogates even under a wide range of adversarial attacks. The method is architecture-agnostic and employs stationary points of the embedding function as fingerprints, ensuring that the detection remains invariant to transformations applied by adversaries. This work addresses a critical gap in existing methods, which are often vulnerable to simple cosmetic changes in embeddings.
Methodology
COPYCOP employs a fingerprinting approach that utilizes stationary points of the embedding function from the victim GNN. The algorithm samples these stationary points and tests whether they remain stationary for a candidate GNN, allowing for the determination of whether the candidate is a surrogate model. This method is designed to be robust against various transformations that an adversary might apply to the embeddings.
Results
The experiments conducted on 14 datasets and 5 GNN architectures demonstrate that COPYCOP is highly accurate and robust against a variety of adversarial attacks, including model extraction and embedding transformations. The results indicate that COPYCOP can reliably identify surrogate models, maintaining high confidence in its detections.
Implications
COPYCOP has significant implications for the security and integrity of GNNs in applications where ownership verification is crucial, such as in Embeddings-as-a-Service models. By providing a reliable method for detecting surrogate models, it enhances the protection of intellectual property in machine learning models and promotes trust in GNN deployments.
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
NLP
Large Language Models
Theory
- BERT learns shortcut solutions that impair generalization and forward transfer in continual learning.
- ALBERT exhibits a more effective algorithmic solution, leading to better performance in continual learning tasks.
- Both models fail in tasks requiring compositional reasoning across experiences, but ALBERT can be improved with specific training strategies.
- Architectural choices significantly influence the performance of Transformer models in continual learning settings.
Summary
This paper investigates the ability of Transformer models, specifically BERT and ALBERT, to perform continual compositional reasoning, which is essential for learning new experiences by leveraging previously acquired knowledge. The authors expand the Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting, termed 'continual LEGO'. They find that BERT tends to learn shortcut solutions that hinder its generalization capabilities and forward transfer to new experiences. In contrast, ALBERT demonstrates a more effective 'For loop-esque' solution that enhances its CL performance. However, both models struggle with tasks requiring composition across experiences. The study reveals that while ALBERT's performance can be improved through training strategies that combine data from different experiences, BERT's entrenched shortcut solutions limit its adaptability. The findings highlight the importance of Transformer architecture in determining the models' learning strategies and their implications for continual learning.
Methodology
The authors developed a continual learning extension of the LEGO framework to systematically analyze the performance of BERT and ALBERT models in compositional reasoning tasks. They conducted experiments to evaluate how architectural choices affect generalization accuracy and forward transfer, and they analyzed the emergent attention patterns to identify the nature of the solutions learned by each model.
Results
The study found that BERT's shortcut solutions limit its ability to generalize and adapt to new experiences, while ALBERT's recurrent architecture allows for better continual learning performance. However, both models struggled with tasks requiring composition across experiences. ALBERT's performance could be improved by training strategies that combine data across experiences, unlike BERT, which remained hindered by its initial training.
Implications
The findings suggest that Transformer architectures can significantly influence the learning strategies employed by models, impacting their ability to perform continual learning and compositional reasoning. This research could inform the design of future models and training strategies to enhance their adaptability and generalization capabilities in dynamic environments.
Crafting Reversible SFT Behaviors in Large Language Models
NLP
Large Language Models
Optimization
- Introduces the concept of sparse behavioral carriers for SFT-induced behaviors in LLMs.
- Presents Loss-Constrained Dual Descent (LCDD) for constructing these carriers.
- Demonstrates the effectiveness of SFT-Eraser for behavior reversal without modifying model weights.
- Provides evidence that the sparse structure is crucial for causal necessity of behaviors.
Summary
This paper addresses the challenge of controlling behaviors induced by supervised fine-tuning (SFT) in large language models (LLMs). Traditional methods for behavior interpretation do not guarantee causal necessity, making it difficult to selectively control these behaviors at inference time. The authors propose a novel approach to construct sparse, mechanistically necessary subnetworks, termed 'carriers', that encapsulate SFT-induced behaviors. They introduce two key methods: Loss-Constrained Dual Descent (LCDD), which optimizes routing masks and model weights to create these carriers, and SFT-Eraser, a soft prompt that reverses SFT-induced behaviors through activation matching. The results demonstrate that the constructed carriers preserve target behaviors while allowing for effective reversal, providing evidence that these carriers are causally necessary for the behaviors. This work opens new avenues for systematically localizing and controlling SFT-induced behaviors in deployed models.
Methodology
The authors developed Loss-Constrained Dual Descent (LCDD) to create sparse carriers by optimizing routing masks and model weights under a utility budget. They also introduced SFT-Eraser, which utilizes activation matching to reverse SFT-induced behaviors, allowing for targeted suppression of these behaviors at inference time.
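Read as a primal-dual scheme, the "loss-constrained" part suggests a dual multiplier that tightens whenever the utility budget is violated. The sketch below shows only such a dual step, under assumed names and formulation; the paper's LCDD additionally optimizes routing masks and model weights in the primal.

```python
def lcdd_dual_update(lam: float, task_loss, budget: float, eta: float = 0.01):
    """Dual ascent on the utility constraint task_loss <= budget: raise the
    multiplier while the constraint is violated, relax it (down to zero)
    otherwise. A paired primal step would then descend on
    sparsity_penalty(mask) + lam * task_loss."""
    return max(0.0, lam + eta * (float(task_loss) - budget))
```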
Results
The experiments showed that LCDD successfully constructs sparse carriers that maintain the desired behaviors while enabling effective reversal through SFT-Eraser. The results confirmed that the sparse structure is essential for causal control, as similar trigger optimizations failed on standard SFT models.
Implications
This research has significant implications for the deployment of large language models, allowing for more modular control over SFT-induced behaviors. It provides a framework for behavior auditing and selective suppression, enhancing the safety and reliability of LLMs in real-world applications.
CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification
Federated Learning
- CLAD combines Clustered Federated Learning with a Dual-Mode Micro-Architecture for enhanced anomaly detection and attack classification.
- The framework effectively utilizes both labeled and unlabeled data, maximizing the learning potential from diverse IoT devices.
- Dynamic clustering of devices improves model accuracy by preserving distinct operational patterns.
- CLAD demonstrates significant performance improvements over state-of-the-art methods, particularly in environments with high proportions of unlabeled data.
Summary
The paper presents CLAD, a novel federated learning framework designed to enhance security in IoT and IIoT environments by addressing the challenges of device heterogeneity and label scarcity. Traditional centralized Intrusion Detection Systems (IDS) struggle with the diverse behaviors of IoT devices and the vast amounts of unlabeled data. CLAD integrates Clustered Federated Learning (CFL) with a Dual-Mode Micro-Architecture (DM2A) to enable simultaneous unsupervised anomaly detection and supervised attack classification. The DM2A features a shared encoder that branches into two tasks, allowing the framework to utilize both labeled and unlabeled data effectively. By dynamically clustering devices with similar traffic patterns, CLAD prevents model divergence and ensures that all operational patterns are preserved. Extensive evaluations show that CLAD significantly outperforms existing methods, achieving a 30% relative improvement in detection performance in scenarios with 80% unlabeled clients while reducing communication costs by half.
Methodology
The methodology involves a Dual-Mode Micro-Architecture that features a shared encoder with two branches for unsupervised anomaly detection and supervised attack classification. The framework employs Clustered Federated Learning to group devices with similar traffic patterns, allowing for specialized model training and improved performance. This hybrid approach enables the integration of both labeled and unlabeled data, facilitating a more comprehensive learning process.
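The Dual-Mode Micro-Architecture can be sketched as one encoder with two heads, where unlabeled clients train the reconstruction branch (anomaly score = reconstruction error) and labeled clients train the classifier; layer sizes below are illustrative.

```python
import torch.nn as nn

class DualModeMicroArchitecture(nn.Module):
    """Sketch of DM2A: a shared encoder feeding two task branches."""
    def __init__(self, in_dim=64, hidden=32, num_attacks=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, in_dim)         # unlabeled clients
        self.classifier = nn.Linear(hidden, num_attacks)  # labeled clients

    def forward(self, x, labeled: bool):
        z = self.encoder(x)
        return self.classifier(z) if labeled else self.decoder(z)
```

Because both branches update the shared encoder, unlabeled traffic still shapes the representation that the attack classifier consumes.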
Results
The evaluations indicate that CLAD achieves a 30% relative improvement in detection performance in scenarios with 80% unlabeled clients, while also reducing communication costs by 50% compared to existing methods.
Implications
CLAD's approach can significantly enhance the security of IoT and IIoT systems by providing a robust and efficient means of detecting anomalies and classifying attacks, making it particularly valuable in environments where labeled data is scarce. This framework could be applied in various sectors, including smart cities, industrial automation, and critical infrastructure protection.
Training Transformers for KV Cache Compressibility
NLP
Large Language Models
Efficient ML
- Introduces the concept of KV compressibility as a property of learned representations.
- Proposes KV-Compression Aware Training (KV-CAT) to guide transformers towards compressible representations during training.
- Demonstrates that KV-CAT improves the effectiveness of post-hoc KV cache compression methods.
- Empirical evaluations show enhanced performance across various long-context tasks.
Summary
This paper addresses the challenge of Key-Value (KV) cache compressibility in long-context language modeling, which is crucial for efficient memory usage and decoding in transformer models. The authors introduce the concept of KV compressibility, emphasizing that it is a property of the learned representations rather than the context itself. They propose a novel training approach called KV-Compression Aware Training (KV-CAT), which encourages transformers to develop internal representations that are more amenable to post-hoc compression. This is achieved through a training-time KV sparsification policy that masks certain KV slots, compelling the model to optimize its representation for compressibility while maintaining performance. The empirical results demonstrate that KV-CAT significantly enhances the quality-budget tradeoff of downstream compression methods, improving performance in tasks such as retrieval, long-context question answering, and perplexity evaluation of compressed outputs.
Methodology
The authors formalize KV compressibility and propose KV-CAT, which involves a continued pretraining procedure that incorporates a KV sparsification policy. This policy masks a fraction of KV slots during training, compelling the model to learn compressed representations. The training objective combines self-distillation and NTP loss to ensure that the model retains performance while adapting to the compression constraints.
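The core training-time intervention is a sparsification policy over KV slots. The sketch below uses uniform random masking of pre-softmax attention logits as the simplest such policy; the paper's policy and where it hooks into attention are assumptions here.

```python
import torch

def sparsify_kv(attn_scores, keep_frac=0.5):
    """Sketch of a training-time KV sparsification policy: hide a random
    fraction of key/value slots so the model learns representations that
    survive post-hoc cache compression.
    attn_scores: (..., T_q, T_k) pre-softmax attention logits."""
    T_k = attn_scores.size(-1)
    drop = torch.rand(T_k, device=attn_scores.device) > keep_frac
    drop[-1] = False   # keep at least the newest slot visible
    # (a causal model would also need each query's own slot left unmasked)
    return attn_scores.masked_fill(drop, float("-inf"))
```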
Results
The implementation of KV-CAT on Qwen2.5 models showed significant improvements in the quality-budget tradeoff for KV cache compression. The evaluation metrics included suffix perplexity under prefix compression, retrieval accuracy from compressed prefixes, and performance on long-context question answering tasks. The results indicated that models trained with KV-CAT outperformed those without it across all measured axes.
Implications
The findings suggest that training transformer models with a focus on KV cache compressibility can lead to more efficient long-context language models, which are crucial for applications requiring extensive context processing, such as document understanding and conversational agents. This approach could pave the way for more scalable and cost-effective deployment of large language models in real-world scenarios.
Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning
Theory
- Introduction of the Geometric Forgetting Hypothesis, highlighting the loss of geometric information in deep neural operators.
- Demonstration of systematic geometric information decay through layer-wise geometric probing in spectral and attention-based operators.
- Identification of a structural limitation in transformer-based models termed the Geometric Shortcut, which leads to feature collapse when geometry is injected too late.
- Proposal of a Geometry Memory Injection mechanism that restores geometric information flow with minimal architectural changes.
Summary
This paper investigates the limitations of neural operators in handling irregular geometries, proposing the Geometric Forgetting Hypothesis. The authors argue that the depth of neural operator architectures leads to a systematic loss of geometric information due to the Markovian structure of operator layers and their reliance on global mixing mechanisms. Through layer-wise geometric probing, they demonstrate that both spectral and attention-based operators experience a decay in geometric fidelity, which negatively impacts accuracy, stability, and generalization. To address this issue, the authors introduce a lightweight Geometry Memory Injection mechanism that reintroduces geometric constraints at intermediate depths, effectively mitigating the forgetting phenomenon. Their findings reveal a structural requirement for geometric retention in transformer-based operators, highlighting the necessity of early geometric integration in the design of neural operators.
Methodology
The authors employed layer-wise geometric probing and spectral analysis to investigate the propagation of geometric information through deep neural operators. They introduced a Geometry Memory Injection mechanism to counteract geometric forgetting and conducted control studies to differentiate between intrinsic and extrinsic geometric memory.
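A lightweight injection layer consistent with this description would re-fuse a geometry encoding into hidden states at intermediate depth. The gated additive form below is one plausible instantiation, initialized as a no-op so it cannot destabilize a pretrained operator.

```python
import torch
import torch.nn as nn

class GeometryMemoryInjection(nn.Module):
    """Sketch: re-inject an encoding of the (fixed) geometry into hidden
    states at an intermediate operator layer, so depth cannot wash it out."""
    def __init__(self, geom_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(geom_dim, hidden_dim)
        self.gate = nn.Parameter(torch.zeros(1))   # starts as identity

    def forward(self, h, geom_enc):
        # h: (batch, points, hidden); geom_enc: (batch, points, geom_dim)
        return h + torch.tanh(self.gate) * self.proj(geom_enc)
```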
Results
The study found that neural operators progressively lose geometric fidelity as depth increases, leading to decreased accuracy and generalization. The Geometry Memory Injection mechanism effectively mitigated this forgetting, demonstrating that early integration of geometric information is crucial for maintaining performance in transformer-based operators.
Implications
The findings suggest that neural operator architectures need to explicitly incorporate geometric information throughout their layers to enhance performance on irregular geometries. This has implications for the design of future neural operators and their applications in solving complex physical systems modeled by partial differential equations.
SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation
Reinforcement Learning
Optimization
Theory
- SNAPO integrates a neural policy into a differentiable simulator for optimal control.
- It computes exact gradients and sensitivities in a single adjoint pass, significantly improving efficiency.
- Demonstrated effectiveness in three diverse domains, with rapid training and large speedups in sensitivity computation.
Summary
The paper introduces SNAPO (Smooth Neural Adjoint Policy Optimization), a novel framework designed for optimal control problems that require sequential decision-making under uncertainty. Traditional methods like dynamic programming struggle with high-dimensional state spaces, while black-box reinforcement learning techniques are slow and do not provide sensitivity information. SNAPO addresses these challenges by embedding a neural policy within a known differentiable simulator, allowing for the replacement of hard constraints with smooth approximations. This enables the computation of exact gradients of the objective with respect to all policy parameters and inputs in a single adjoint pass. The authors demonstrate SNAPO's effectiveness across three domains: natural gas storage, pension fund asset-liability management, and pharmaceutical manufacturing. In each case, SNAPO achieves rapid training times and produces multiple sensitivities at no additional cost, showcasing its efficiency and scalability compared to existing methods.
Methodology
SNAPO utilizes a differentiable simulation framework where a neural network policy is embedded within the simulation. It replaces hard constraints with smooth approximations and employs reverse-mode automatic differentiation to compute gradients and sensitivities simultaneously during a single backward pass.
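The key enabling trick is replacing hard constraints with smooth surrogates so reverse-mode autodiff can carry exact gradients through the whole rollout. A standard softplus-based smooth clamp is sketched below; SNAPO's particular smoothing may differ.

```python
import torch
import torch.nn.functional as F

def smooth_clamp(x, lo, hi, beta=10.0):
    """Smooth surrogate for clamp(x, lo, hi) that keeps gradients nonzero
    everywhere. As beta grows it approaches the hard clamp:
    x << lo -> lo, lo << x << hi -> x, x >> hi -> hi."""
    return lo + F.softplus(x - lo, beta=beta) - F.softplus(x - hi, beta=beta)
```

With every constraint smoothed this way, a single objective.backward() through the simulator populates gradients for all policy parameters and inputs at once, which is where the one-pass sensitivity claim comes from.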
Results
SNAPO was tested in three domains: natural gas storage (training in under a minute with 365 sensitivities), pension fund management (6.5x-200x speedup in sensitivity computation), and pharmaceutical manufacturing (20 regulatory sensitivities computed in 74.5 milliseconds). In all cases, SNAPO matched or exceeded existing baselines while producing all required sensitivities at a constant cost.
Implications
SNAPO has the potential to revolutionize optimal control in various industries by providing a faster, more efficient method for training policies and computing sensitivities. This could lead to improved decision-making processes in fields such as finance, energy management, and manufacturing.
Perceive, Route and Modulate: Dynamic Pattern Recalibration for Time Series Forecasting
Time Series
- Introduction of Dynamic Pattern Recalibration (DPR) to address static pattern responses in time series forecasting.
- DPR is a backbone-agnostic mechanism that enhances various forecasting architectures with minimal parameter overhead.
- The 'Perceive-Route-Modulate' pipeline allows for continuous token-level recalibration of features.
- DPRNet, a minimalist model based on DPR, achieves competitive performance across multiple benchmarks.
Summary
This paper addresses the challenge of time series forecasting in the presence of continuously shifting local temporal patterns. Traditional deep forecasting models utilize fixed weight matrices that apply uniformly across all temporal tokens, leading to a static pattern response that compromises performance during varying local dynamics. The authors propose a novel mechanism called Dynamic Pattern Recalibration (DPR), which enables token-level recalibration of features to adapt to changing dynamics. DPR operates through a lightweight 'Perceive-Route-Modulate' pipeline, allowing for the computation of a soft-routing distribution over a learned basis of adaptive response patterns. This results in a time-aware modulation vector that recalibrates hidden states effectively. The DPR mechanism is designed to be backbone-agnostic, enhancing forecasting performance across various architectures with minimal overhead. The authors validate their approach through a minimalist instantiation, DPRNet, which achieves competitive results across twelve benchmarks, demonstrating the effectiveness of dynamic recalibration compared to traditional parameter scaling methods.
Methodology
The methodology involves a three-step process: 'Perceive' uses multi-scale convolutions to sense local dynamics; 'Route' computes a soft-routing distribution over a learned basis of adaptive response patterns; and 'Modulate' applies token-level recalibration through a residual Hadamard product. This approach allows for dynamic adjustment of feature sensitivity to local temporal changes without altering the backbone architecture.
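A compact PyTorch rendering of the Perceive-Route-Modulate pipeline follows; the two convolution scales, basis size, and fusion details are illustrative choices consistent with the description, not the authors' exact layer specification.

```python
import torch
import torch.nn as nn

class DynamicPatternRecalibration(nn.Module):
    """Sketch of DPR: perceive local dynamics, softly route over a learned
    basis of response patterns, then modulate tokens residually."""
    def __init__(self, d_model, num_patterns=8):
        super().__init__()
        # Perceive: multi-scale depthwise convolutions over time.
        self.conv_s = nn.Conv1d(d_model, d_model, 3, padding=1, groups=d_model)
        self.conv_l = nn.Conv1d(d_model, d_model, 7, padding=3, groups=d_model)
        # Route: soft assignment over a learned basis of response patterns.
        self.router = nn.Linear(2 * d_model, num_patterns)
        self.basis = nn.Parameter(torch.randn(num_patterns, d_model) * 0.02)

    def forward(self, h):                        # h: (batch, time, d_model)
        x = h.transpose(1, 2)                    # (batch, d_model, time)
        local = torch.cat([self.conv_s(x), self.conv_l(x)], dim=1).transpose(1, 2)
        weights = torch.softmax(self.router(local), dim=-1)   # (B, T, P)
        mod = weights @ self.basis               # time-aware modulation vector
        return h * (1.0 + mod)                   # residual Hadamard recalibration
```

Because the module only rescales hidden states, it can wrap any backbone's token representations, which matches the backbone-agnostic claim.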
Results
DPR significantly improves forecasting performance across seven mainstream architectures (including attention, convolution, MLP, and GNN) with minimal overhead. The minimalist model, DPRNet, demonstrates competitive performance across twelve benchmarks, validating the effectiveness of dynamic recalibration against traditional parameter scaling methods.
Implications
The findings suggest that dynamic recalibration can enhance the adaptability of forecasting models in various applications, including finance, energy, and climate prediction, where local dynamics frequently change. This approach could lead to more robust decision-making systems in volatile environments.
Uncertainty Estimation via Hyperspherical Confidence Mapping
Theory
Efficient ML
Interpretability
- HCM provides a novel, geometric interpretation of uncertainty in neural network predictions.
- The method is sampling-free and distribution-free, making it efficient for real-time applications.
- HCM shows superior performance in calibration and confidence-error alignment compared to existing methods.
- The framework is applicable to both regression and classification tasks, enhancing its versatility.
Summary
This paper introduces Hyperspherical Confidence Mapping (HCM), a novel framework for uncertainty estimation in neural networks that is both sampling-free and distribution-free. HCM decomposes model outputs into a magnitude and a normalized direction vector constrained to a unit hypersphere. This geometric approach interprets uncertainty as the degree of violation of this constraint, providing deterministic and interpretable estimates applicable to both regression and classification tasks. The authors demonstrate that HCM outperforms existing methods such as ensemble and evidential approaches in terms of calibration and confidence-error alignment while significantly reducing inference costs. Extensive experiments across various benchmarks and real-world applications, particularly in semiconductor manufacturing, validate the effectiveness and scalability of HCM, positioning it as a versatile alternative to traditional uncertainty estimation techniques.
Methodology
HCM reformulates the prediction problem by decomposing the output into a magnitude term and a unit-norm direction vector. This approach treats the prediction as a constrained optimization problem, where the violation of the unit-norm constraint is interpreted as a measure of uncertainty. The method integrates seamlessly into end-to-end training, allowing for efficient and interpretable uncertainty estimates without relying on distributional assumptions or repeated sampling.
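Our reading of the decomposition, sketched below with hypothetical head names: the model emits a magnitude and an unconstrained direction vector, the unit-norm violation of the raw direction serves as the uncertainty score, and the normalized direction scaled by the magnitude gives the point prediction. This is an interpretation of the description, not the authors' exact formulation.

```python
import torch

def hcm_predict(raw_direction, magnitude):
    """Sketch: deterministic prediction plus a geometric uncertainty score.
    raw_direction: (..., k) unconstrained direction head output.
    magnitude: (..., 1) magnitude head output."""
    norm = raw_direction.norm(dim=-1, keepdim=True)
    prediction = magnitude * raw_direction / norm.clamp_min(1e-12)
    uncertainty = (norm.squeeze(-1) - 1.0).abs()   # unit-norm violation
    return prediction, uncertainty
```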
Results
The experiments conducted show that HCM matches or exceeds the performance of ensemble and evidential methods in terms of uncertainty calibration and confidence-error alignment. It also demonstrates lower inference costs, making it suitable for real-time applications. The method's effectiveness is validated across diverse benchmarks and real-world tasks, particularly in semiconductor manufacturing.
Implications
HCM's ability to provide reliable uncertainty estimates without heavy computational requirements makes it suitable for high-stakes applications in fields such as autonomous driving, healthcare, and manufacturing. Its geometric approach could inspire further research into uncertainty estimation techniques and their applications in various domains.
INEUS: Iterative Neural Solver for High-Dimensional PIDEs
Theory
Efficient ML
- INEUS effectively addresses the curse of dimensionality in high-dimensional PIDEs.
- The method reformulates PIDE solving into recursive regression problems, enhancing efficiency.
- INEUS combines the strengths of PINNs and Feynman-Kac methods for better handling of nonlocal terms.
- A contraction-based convergence proof is established for linear PIDEs.
Read more
INEUS: Iterative Neural Solver for High-Dimensional PIDEs
Summary
The paper introduces INEUS, an innovative meshfree iterative neural solver designed for partial integro-differential equations (PIDEs). Traditional numerical methods face significant challenges in high-dimensional settings due to the curse of dimensionality, which INEUS aims to address. Unlike existing methods, INEUS reformulates the solution of PIDEs into a series of recursive regression problems, effectively handling nonlocal terms without the need for explicit numerical integration. This approach combines the global approximation strengths of Physics-Informed Neural Networks (PINNs) with the efficiency of Feynman-Kac and backward stochastic differential equation methods. The authors provide a contraction-based convergence proof for linear PIDEs, demonstrating that INEUS can achieve accurate and scalable solutions across various high-dimensional examples. The numerical experiments conducted show that INEUS outperforms traditional PINNs and deep BSDE methods, particularly in terms of computational efficiency and solution accuracy.
Methodology
INEUS employs an iterative approach to solve PIDEs by reformulating the problem into a series of recursive regression tasks. It utilizes single-jump sampling to avoid the explicit evaluation of nonlocal jump integrals, thus streamlining the computation. The method is built on the principles of PINNs but enhances efficiency by reducing the differentiation burden associated with traditional residual-based methods.
Results
The authors present numerical results showing that INEUS provides accurate solutions for both linear and nonlinear PIDEs. The method demonstrates scalability in high-dimensional settings and outperforms traditional PINNs and deep BSDE methods in terms of computational efficiency and accuracy.
Implications
INEUS has the potential to significantly improve the computational efficiency of solving high-dimensional PIDEs, making it applicable in various fields such as finance, engineering, and natural sciences where such equations are prevalent. Its ability to provide global solutions efficiently could lead to advancements in real-time applications and complex modeling scenarios.
Structure-Preserving Gaussian Processes Via Discrete Euler-Lagrange Equations
Robotics
Efficient ML
Theory
- Introduction of Lagrangian Gaussian Processes (LGPs) for learning dynamics models.
- Preservation of the geometric structure of the Lagrange-d'Alembert principle for energy consistency.
- Ability to learn from discrete position data without requiring velocity or momentum measurements.
- Demonstrated data efficiency and generalization in synthetic and real-world applications.
Read more
Structure-Preserving Gaussian Processes Via Discrete Euler-Lagrange Equations
Summary
This paper introduces Lagrangian Gaussian Processes (LGPs) aimed at enhancing the probabilistic and data-efficient learning of dynamics through discrete forced Euler-Lagrange equations. The authors emphasize the preservation of the geometric structure dictated by the Lagrange-d'Alembert principle, which is crucial for maintaining physical consistency in the absence of external forces. This approach mitigates issues related to energy drift, enabling stable long-term predictions. A significant innovation is the use of linear operators for Gaussian process conditioning derived from discrete forced Euler-Lagrange equations, allowing the learning of dynamics solely from discrete position snapshots without requiring velocity or momentum data. This is particularly beneficial in practical scenarios such as motion capture and visual servoing. The paper validates the effectiveness of LGPs through various synthetic and real-world experiments, including a soft robot exhibiting hysteresis, demonstrating their ability to learn physically consistent dynamics and quantify uncertainty from sparse positional data.
Methodology
The authors developed two schemes for LGPs: a discrete version and a continuous version. Both schemes utilize discrete forced Euler-Lagrange equations to construct linear operators for Gaussian process conditioning. The methodology allows for the incorporation of the Lagrange-d'Alembert principle, facilitating the learning of dynamics from position data alone. The approach is validated through experiments on various systems, including a controlled double pendulum and a soft robot.
Results
The experimental results indicate that LGPs successfully learn physically consistent dynamics models, providing stable long-term predictions and effective uncertainty quantification from sparse positional data. The methods demonstrated superior data efficiency and generalization capabilities compared to existing approaches.
Implications
The proposed LGPs have significant implications for fields requiring robust dynamics modeling, such as robotics, motion capture, and control systems. By enabling learning from limited data while maintaining physical consistency, these methods can enhance model-based planning and control in complex real-world applications.
When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy
Computer Vision
- HACE is a drop-in replacement for standard cross-entropy that incorporates class hierarchy into the loss function.
- The method combines prediction aggregation and ancestral label smoothing to effectively utilize hierarchical information.
- HACE outperforms standard cross-entropy in 15 out of 18 architecture-dataset pairs, with an average accuracy gain of 4.66%.
- In linear probing, HACE achieves a mean improvement of 2.18% over competing methods across all datasets.
Read more
When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy
Summary
This paper introduces Hierarchy-Aware Cross-Entropy (HACE), a novel loss function designed to enhance image classification by incorporating class hierarchies into the training process. Traditional cross-entropy loss treats all misclassifications equally, disregarding the semantic relationships between classes. HACE addresses this by combining two key components: prediction aggregation, which ensures that the confidence of parent nodes reflects the cumulative probability of their child nodes, and ancestral label smoothing, which spreads the ground-truth signal across the hierarchy from the true class to the root. The authors evaluate HACE on three datasets (CIFAR-100, FGVC Aircraft, and NABirds) using both end-to-end training across six different architectures and linear probing on frozen DINOv2-Large features. The results demonstrate that HACE significantly improves classification accuracy over standard cross-entropy in most configurations, suggesting that incorporating hierarchical information into the loss function is a promising direction for improving image classification performance.
Methodology
The authors propose HACE, which integrates prediction aggregation and ancestral label smoothing. Prediction aggregation propagates probability mass upward through the class hierarchy, while ancestral label smoothing creates a soft ground-truth distribution that reflects the hierarchical structure. The effectiveness of HACE is evaluated through experiments on three datasets in two training regimes: end-to-end training with various architectures and linear probing on frozen features.
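To make the two components concrete, here is a toy sketch on a three-leaf hierarchy; the tree, the geometric decay parameter, and the helper names are illustrative assumptions. In HACE these two quantities would feed a cross-entropy-style loss between aggregated predictions and the smoothed target.

```python
import torch

# Toy hierarchy (assumed): each node maps to its parent, None at the root.
parents = {"sparrow": "bird", "eagle": "bird", "trout": "fish",
           "bird": "animal", "fish": "animal", "animal": None}
leaves = ["sparrow", "eagle", "trout"]

def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = parents[node]
    return chain  # [leaf, ..., root]

def aggregate(leaf_probs):
    """Prediction aggregation: each node's confidence is the total
    probability mass of the leaves beneath it."""
    node_p = {}
    for i, leaf in enumerate(leaves):
        for node in ancestors(leaf):
            node_p[node] = node_p.get(node, 0.0) + float(leaf_probs[i])
    return node_p

def ancestral_smoothing(true_leaf, decay=0.5):
    """Ancestral label smoothing: spread ground-truth mass from the true
    class up to the root (geometric decay is an assumed choice)."""
    chain = ancestors(true_leaf)
    w = torch.tensor([decay ** k for k in range(len(chain))])
    return dict(zip(chain, (w / w.sum()).tolist()))

leaf_probs = torch.softmax(torch.tensor([2.0, 1.0, 0.1]), dim=0)
print(aggregate(leaf_probs))            # 'bird' accumulates sparrow + eagle mass
print(ancestral_smoothing("sparrow"))   # soft target over sparrow -> bird -> animal
```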
Results
HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture-dataset pairs, achieving a mean gain of 4.66%. In linear probing, HACE outperforms all competing methods on the three datasets, with a mean improvement of 2.18% over the next best baseline. The results indicate that applying standard label smoothing prior to ancestral smoothing further enhances performance.
Implications
The findings suggest that incorporating hierarchical information into loss functions can lead to significant improvements in image classification tasks. This approach could be beneficial in various applications where class hierarchies are known, such as biological classification, object recognition, and any domain where semantic relationships between classes exist.
Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
Optimization
Theory
Large Language Models
- Weight decay is proven to be essential for satisfying Villani's differential growth conditions in Transformer models.
- The paper introduces empirical diagnostics to visualize the relationship between weight decay and curvature at infinity.
- Explicit convergence rates for Langevin-based optimizers are derived, linking them to weight decay practices.
- A reproducible experimental suite is provided for evaluating functional-analytic properties in large Transformer models.
Read more
Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
Summary
This paper investigates the role of weight decay as a regularizer in Transformer models, providing a rigorous functional-analytic characterization of the standard Transformer objective, which combines cross-entropy loss with L2 regularization. The authors prove that this regularized loss satisfies Villani's criteria for coercive energy functions, establishing that the loss is infinitely differentiable, grows quadratically, has Gaussian-integrable tails, and meets specific differential growth conditions. They derive explicit log-Sobolev and Poincaré constants that connect the regularization strength to convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds. To validate their theoretical findings, the authors introduce a scalable diagnostic tool to estimate the Villani diagnostic efficiently in large models. Experiments conducted on the GPT-Neo-125M model across datasets like Penn Treebank and WikiText-103 confirm the predicted quadratic growth of the diagnostic, spectral inflation of the Hessian, and exponential convergence behavior, demonstrating that weight decay enhances generalization and establishes the necessary mathematical conditions for effective optimization in deep learning.
Methodology
The authors employ functional-analytic techniques to characterize the loss landscape of Transformers, proving the conditions under which the loss function behaves as a Villani function. They also introduce a diagnostic tool for empirical validation and conduct experiments on large-scale language models to support their theoretical claims.
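For reference, the analyzed objective is ordinary cross-entropy plus an explicit L2 penalty; a minimal sketch follows. Note that decoupled weight decay (as in AdamW) is not identical to this explicit penalty, so this block illustrates the analyzed loss rather than any particular optimizer.

```python
import torch
import torch.nn.functional as F

def regularized_loss(model, logits, targets, weight_decay=0.01):
    """Cross-entropy plus explicit L2 regularization: the objective whose
    coercivity (quadratic growth at infinity) the paper analyzes."""
    ce = F.cross_entropy(logits, targets)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return ce + 0.5 * weight_decay * l2

model = torch.nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
print(regularized_loss(model, model(x), y))
```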
Results
The results demonstrate that the regularized loss function meets the necessary criteria for coercivity and integrability, leading to improved convergence rates for optimization algorithms. Empirical findings confirm the theoretical predictions regarding the growth of the Villani diagnostic and the behavior of the Hessian spectrum in large models.
Implications
The findings suggest that weight decay not only serves as a regularization technique but also plays a critical role in shaping the optimization landscape of Transformers, potentially leading to more efficient training and better generalization in large language models.
Towards Metric-Faithful Neural Graph Matching
Graph Learning
Theory
Optimization
- Introduces a theoretical framework linking encoder geometry to GED estimation quality.
- Demonstrates that bi-Lipschitz encoders yield improved GED surrogates and ranking stability.
- Establishes that node-level bi-Lipschitz geometry affects downstream alignment objectives.
- Empirically validates the framework using FSW-GNN in various neural GED architectures.
Read more
Towards Metric-Faithful Neural Graph Matching
Summary
This paper addresses the challenge of estimating Graph Edit Distance (GED), a fundamental metric for structural graph similarity that is NP-hard to compute. The authors propose a theoretical framework that connects the geometry of graph encoders to the quality of GED estimation in neural graph matching architectures. They categorize existing methods into two classes: graph similarity predictors and matching-based estimators, highlighting the importance of encoder geometry in maintaining metric fidelity. The study demonstrates that bi-Lipschitz encoders, specifically the FSW-GNN, can significantly improve the stability and accuracy of GED surrogates. By replacing standard encoders with geometry-aware variants, the authors show enhanced performance across various benchmarks, establishing a clear link between encoder design and GED estimation quality. The findings suggest that careful consideration of encoder geometry can lead to more reliable neural graph matching systems.
Methodology
The authors develop a geometric framework that analyzes the impact of encoder geometry on GED estimation. They focus on two classes of neural GED estimators: graph similarity predictors and matching-based methods. The study employs the FSW-GNN, a bi-Lipschitz encoder, as a drop-in replacement in existing architectures to empirically assess improvements in GED prediction and ranking metrics.
Results
The results indicate that geometry-aware variants of neural GED architectures significantly outperform standard models in GED prediction and ranking metrics across multiple datasets. The theoretical insights are supported by empirical evidence showing that better encoder geometry leads to improved conditioning of surrogate quantities used in GED estimation.
Implications
The findings suggest that incorporating geometric considerations into the design of graph encoders can enhance the performance of neural graph matching systems, making them more reliable for applications in molecular retrieval, program analysis, and structured search.
Verifier-Backed Hard Problem Generation for Mathematical Reasoning
Large Language Models
Generative Models
Theory
- Introduction of a three-party self-play framework for problem generation that includes a verifier.
- Demonstrated significant performance improvements in generating valid and challenging mathematical problems.
- Evaluation of both Hard and Soft verifiers to validate problem correctness and difficulty.
- VHG outperforms existing baseline methods, including state-of-the-art models, in various mathematical benchmarks.
Read more
Verifier-Backed Hard Problem Generation for Mathematical Reasoning
Summary
This paper presents a novel framework called Verifier-Backed Hard Problem Generation (VHG) aimed at enhancing the generation of valid and challenging mathematical problems using Large Language Models (LLMs). Traditional problem generation methods often rely on human expertise or simplistic self-play paradigms, which can lead to the creation of invalid problems. VHG introduces a three-party self-play mechanism that incorporates an independent verifier to assess the validity and difficulty of generated problems. The framework consists of a setter that proposes problem-solution pairs, a solver that evaluates the difficulty, and a verifier that confirms the correctness of the pairs. The authors explore two types of verifiers: a Hard symbolic verifier and a Soft LLM-based verifier. Through extensive experiments on indefinite integral tasks and general mathematical reasoning tasks, VHG demonstrates significant improvements over existing methods, producing valid and challenging problems that enhance the training of solvers. The results indicate that VHG not only boosts performance across various benchmarks but also allows weaker models to generate problems that challenge stronger models, thereby contributing to the advancement of autonomous scientific research.
Methodology
The VHG framework employs a three-party self-play approach where a setter generates problem-reference pairs, a solver evaluates their difficulty, and a verifier checks their correctness. Two types of verifiers are utilized: Hard verifiers for symbolic verification and Soft verifiers based on LLMs. The framework aims to eliminate invalid problem generation by ensuring that rewards for the setter are based on true problem difficulty as assessed by the solver.
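A schematic of one self-play round, using sympy differentiation as a stand-in Hard verifier for indefinite-integral pairs; the setter, solver, and reward rule below are illustrative placeholders, not the paper's training loop.

```python
import sympy as sp

x = sp.symbols("x")

def hard_verifier(integrand, reference):
    """Hard symbolic check for an indefinite-integral pair: valid iff
    differentiating the reference solution recovers the integrand."""
    return sp.simplify(sp.diff(reference, x) - integrand) == 0

def self_play_round(setter, solver, verifier):
    problem, reference = setter()              # setter proposes a pair
    if not verifier(problem, reference):       # verifier gates validity
        return {"valid": False, "setter_reward": 0.0}
    solved = solver(problem)                   # solver attempts the problem
    # Setter earns reward only for valid problems the solver fails on,
    # so difficulty is measured against verified problems.
    return {"valid": True, "setter_reward": 0.0 if solved else 1.0}

# Illustrative stand-ins for the three parties:
setter = lambda: (x * sp.exp(x), (x - 1) * sp.exp(x))
solver = lambda p: sp.diff(sp.integrate(p, x), x).equals(p)
print(self_play_round(setter, solver, hard_verifier))
# -> {'valid': True, 'setter_reward': 0.0} (sympy solves this easy problem)
```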
Results
VHG significantly improves pass@1 accuracy on indefinite integral tasks by margins of 16.9%, 16.6%, and 21.4% across different benchmarks. In general mathematical reasoning tasks, it raises overall pass@1 accuracy from 56.8% to 69.0%, outperforming all baseline methods. The framework also demonstrates that even weaker models can generate problems that challenge larger models.
Implications
The VHG framework has the potential to revolutionize the training of LLMs by providing a continuous stream of valid and challenging problems, thus enhancing their capabilities in scientific research and mathematical reasoning. It reduces reliance on human experts and static datasets, paving the way for more autonomous AI systems.
PACE: Prune-And-Compress Ensemble Models
Efficient ML
Interpretability
Optimization
- PACE combines pruning and compression techniques to enhance ensemble models.
- The framework allows for active generation of new learners to improve diversity.
- Pruning is performed on an enriched ensemble, allowing for better performance.
- The method provides rigorous control over faithfulness to the original ensemble.
Read more
PACE: Prune-And-Compress Ensemble Models
Summary
The paper introduces PACE, a novel framework that integrates pruning and compression techniques for ensemble models to enhance their deployment efficiency and interpretability. Ensemble models, while achieving state-of-the-art performance, often rely on a large number of weak learners, leading to challenges in deployment, interpretability, and robustness verification. PACE addresses these issues through a two-phase strategy: first, it actively generates new learners to enrich the ensemble's diversity, and then it prunes redundant learners from this enriched ensemble. This approach allows for fine control over the faithfulness of the model to the original ensemble, ensuring that the predictive behavior is preserved. The authors demonstrate that PACE outperforms existing pruning and compression methods, providing a more efficient and interpretable model while maintaining strong predictive performance.
Methodology
PACE employs a two-phase strategy: first, it uses a column generation approach to actively generate new learners that enhance the diversity of the ensemble. Once no more relevant learners can be generated, it performs pruning on the enriched ensemble. The framework allows for a controlled trade-off between faithfulness and compression, targeting only meaningful regions for preserving predictive behavior.
Results
Experimental results indicate that PACE achieves stronger compression and pruning performance than existing methods, while also providing finer control over faithfulness guarantees. The framework demonstrates improved efficiency in terms of memory footprint and inference time, making it suitable for deployment in resource-constrained environments.
Implications
The PACE framework has significant implications for the deployment of ensemble models in real-world applications, particularly in scenarios where interpretability and resource efficiency are critical. It can be applied in various domains requiring robust predictive performance without the overhead of large ensembles, such as healthcare, finance, and autonomous systems.
Distributionally-Robust Learning to Optimize
Optimization
Theory
- Introduction of the DR-L2O framework that combines worst-case analysis with data-driven optimization.
- Establishment of a continuous trade-off between classical L2O and worst-case optimal design via a Wasserstein radius.
- Development of a scalable solution method using stochastic gradient descent with implicit differentiation.
- Proof of out-of-sample guarantees for the learned algorithms, ensuring robustness and performance.
Read more
Distributionally-Robust Learning to Optimize
Summary
This paper introduces a distributionally robust framework for learning hyperparameters in first-order methods for convex optimization, termed Distributionally-Robust Learning to Optimize (DR-L2O). The authors aim to minimize a Wasserstein distributionally robust version of the performance estimation problem (PEP) over algorithm parameters, such as step sizes, using a dataset of problem instances. The framework bridges the gap between classical learning to optimize (L2O) and worst-case optimal algorithm design, allowing for a continuous trade-off between data-driven and worst-case approaches through a Wasserstein radius. The authors employ stochastic gradient descent to solve the resulting optimization problem, differentiating through the solution of an inner semidefinite program at each iteration. They establish high-probability bounds indicating that the true risk of the learned algorithm is bounded by the in-sample L2O optimum plus a diminishing slack, and is no worse than the worst-case PEP bound. Through numerical experiments on various benchmarks, including unconstrained quadratic minimization, LASSO, and linear programming, the learned algorithms demonstrate strong out-of-sample performance and certifiable robustness, outperforming both worst-case optimal and standard L2O baselines.
Methodology
The authors formulate the DR-L2O problem by minimizing the worst-case expected loss over a Wasserstein ambiguity set centered at the empirical distribution. They solve this problem using stochastic gradient descent, where each iteration involves solving an inner semidefinite program and differentiating through its solution using implicit differentiation of the KKT conditions.
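In symbols, the outer problem has the standard Wasserstein-DRO form; the notation below is assumed for illustration (θ the learned algorithm parameters, ζ a problem instance, ℓ_PEP the performance-estimation loss).

```latex
% Notation (assumed): \theta = learned algorithm parameters (e.g., step sizes),
% \hat{P}_N = empirical distribution over N problem instances \zeta,
% \rho = Wasserstein radius, \ell_{PEP} = performance-estimation loss.
\min_{\theta}\ \sup_{Q \,:\, W(Q, \hat{P}_N) \le \rho}\
  \mathbb{E}_{\zeta \sim Q}\big[\, \ell_{\mathrm{PEP}}(\theta; \zeta) \,\big]
% \rho = 0 recovers classical L2O (empirical risk minimization);
% \rho \to \infty approaches worst-case optimal algorithm design.
```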
Results
The proposed DR-L2O framework achieves strong out-of-sample performance and certifiable robustness on various benchmarks, outperforming both worst-case optimal algorithms and traditional L2O methods. The results indicate that the learned algorithms maintain a balance between empirical performance and robustness, as evidenced by high-probability bounds on true risk.
Implications
The DR-L2O framework has significant implications for optimizing hyperparameters in machine learning applications, particularly in scenarios where robustness against distributional shifts is critical. This approach can enhance the reliability of optimization algorithms in practical settings, reducing the need for manual hyperparameter tuning.
When Does ℓ2-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the ℓ1 Implicit Bias
Theory
Optimization
- Benign overfitting in ℓ2-Boosting is characterized by a logarithmic decay of excess variance under isotropic noise.
- The risk under spiked-isotropic designs converges to zero at a slower logarithmic rate compared to ℓ2 geometries.
- A deterministic early stopping rule is proposed to prevent noise interpolation and achieve optimal prediction rates.
Read more
When Does ℓ2-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the ℓ1 Implicit Bias
Summary
This paper investigates the phenomenon of benign overfitting in ℓ2-Boosting, particularly focusing on the ℓ1 implicit bias of greedy ensembles. The authors address the challenges in analyzing the high-dimensional risk associated with ℓ2-Boosting due to the non-linear coupling of coordinate selection thresholds. By employing the Convex Gaussian Minimax Theorem and asymptotic expansions of truncated Gaussian moments, they derive the behavior of the ℓ1 interpolant. The study reveals that under an isotropic pure-noise model, benign overfitting fails at a linear rate, with excess variance decaying logarithmically. The authors also explore the asymptotic risk under spiked-isotropic designs, demonstrating that while risk converges to zero when the tail dimensions are significantly larger than the sample size, the decay rate is slower than that observed in ℓ2 geometries. To address the slow convergence, they propose a tuning-free early stopping rule that achieves minimax-optimal empirical prediction rates for ℓ1-bounded signals.
Methodology
The authors utilize the Convex Gaussian Minimax Theorem alongside asymptotic expansions of double-sided truncated Gaussian moments to analyze the high-dimensional risk of ℓ2-Boosting. They also examine the non-smooth subdifferential dynamics of the boosting flow to derive a tuning-free early stopping criterion.
Results
The study establishes that under an isotropic pure-noise model, the excess variance decays at a logarithmic rate Θ(σ²/log(p/n)). For spiked-isotropic designs, the risk converges to zero at a logarithmic rate Θ(σ²/log(r²/n)), which is slower than the linear decay in ℓ2 geometries. The proposed early stopping rule effectively recovers the Lasso basic inequality and achieves minimax-optimal prediction rates.
Implications
The findings suggest that while ℓ2-Boosting can exhibit benign overfitting, careful consideration of the implicit bias and stopping criteria is crucial for optimal performance in high-dimensional settings. This has potential applications in various machine learning tasks where overfitting is a concern.
CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning
Large Language Models
NLP
Efficient ML
- CRAFT is the first framework to perform routing, adaptation, and merging entirely in representation space without modifying model weights.
- It employs deterministic routing based on KL divergence, eliminating the need for learned gating mechanisms.
- The framework effectively controls forgetting during adaptation by measuring divergence from existing knowledge.
- Empirical results show significant improvements in performance and reduced forgetting rates compared to state-of-the-art methods.
Read more
CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning
Summary
The paper introduces CRAFT, a novel continual learning framework designed to mitigate catastrophic forgetting in large language models (LLMs) during fine-tuning. Unlike traditional methods that update model weights, CRAFT operates by learning low-rank interventions on hidden representations. The framework consists of three main stages: first, it routes tasks to groups of similar tasks based on output-distribution divergence; second, it fine-tunes the model using Kullback-Leibler (KL) divergence to control forgetting; and third, it merges interventions into a shared representation. This approach unifies routing, regularization, and merging through a single KL-based objective, leading to improved performance and reduced forgetting across various benchmarks and model scales. The results demonstrate that CRAFT outperforms existing LoRA-based methods while being robust to task ordering, suggesting a scalable and principled approach to continual learning in LLMs.
Methodology
CRAFT operates in three stages: (1) routing tasks to similar groups based on KL divergence, (2) fine-tuning the model while controlling forgetting through KL divergence against the group's prior state, and (3) merging updated interventions into a shared representation using the same KL signal. This approach allows for the management of representation-level interventions rather than direct weight updates.
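A minimal sketch of the deterministic routing stage, assuming tasks are compared via the KL divergence between output distributions on a shared probe batch; the threshold and the new-group rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kl_route(new_logits, group_logits, threshold=0.5):
    """Deterministic routing sketch: compare the model's output distribution
    on a shared probe batch against each stored group, pick the closest in
    KL, and open a new group if nothing is close enough."""
    log_p = F.log_softmax(new_logits, dim=-1)
    best, best_kl = None, float("inf")
    for gid, logits in group_logits.items():
        q = F.softmax(logits, dim=-1)
        # F.kl_div(log_p, q) computes KL(q || softmax(new_logits))
        kl = F.kl_div(log_p, q, reduction="batchmean").item()
        if kl < best_kl:
            best, best_kl = gid, kl
    return (best if best_kl < threshold else "new_group"), best_kl

groups = {"g0": torch.randn(32, 10), "g1": torch.randn(32, 10)}
print(kl_route(torch.randn(32, 10), groups))
```

Because the comparison is a fixed divergence computation, no gating network has to be trained, which is the sense in which the routing is deterministic.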
Results
CRAFT demonstrated up to an 8.03% improvement in accuracy and a 6.49% reduction in forgetting rate compared to leading continual learning methods, while utilizing 3.75 times fewer trainable intervention parameters. The framework was validated across multiple benchmarks and three different LLMs.
Implications
The findings suggest that CRAFT could be applied to enhance continual learning in various applications involving LLMs, potentially leading to more efficient and effective models that can learn continuously without significant performance degradation.
Soft Deterministic Policy Gradient with Gaussian Smoothing
Reinforcement Learning
Robotics
Theory
- Introduction of Soft-DPG to overcome limitations of standard DPG in non-smooth environments.
- Development of a smoothed Bellman equation that ensures well-defined policy gradients.
- Establishment of analytical upper bounds for approximation errors related to the smoothing parameter.
- Implementation of Soft DDPG, a practical deep reinforcement learning algorithm.
Read more
Soft Deterministic Policy Gradient with Gaussian Smoothing
Summary
This paper addresses the limitations of the deterministic policy gradient (DPG) method in reinforcement learning, particularly when dealing with non-smooth action-value functions that arise in environments with sparse or discrete rewards. The authors propose a new framework called Soft Deterministic Policy Gradient (Soft-DPG), which is based on a smoothed Bellman equation using Gaussian smoothing. This approach allows for the derivation of a policy gradient that does not depend explicitly on the action gradients of the critic, thus ensuring stability and well-defined gradients even in challenging environments. The authors also establish analytical upper bounds for approximation errors in the action-value and state-value functions, providing formal guarantees on the bias introduced by the smoothing parameter. They implement this theoretical framework into a practical algorithm, Soft Deep Deterministic Policy Gradient (Soft DDPG), and validate its performance through extensive experiments on standard continuous control benchmarks and their discretized-reward variants, demonstrating competitive stability and improved performance in environments with irregular reward structures.
Methodology
The authors derive the Soft-DPG framework from a smoothed Bellman equation using Gaussian smoothing, which allows for the formulation of a new policy gradient theorem. They provide theoretical analysis to establish upper bounds on approximation errors and implement the framework into a deep reinforcement learning algorithm called Soft DDPG, which is tested on various benchmarks.
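The key property, that the smoothed objective admits gradients without differentiating the critic, can be seen from the Gaussian smoothing identity ∇_a E[Q(s, a + σε)] = E[ε Q(s, a + σε)]/σ. Below is a numpy sketch of that identity on a deliberately non-smooth critic; this is the general smoothing estimator, not the authors' full algorithm.

```python
import numpy as np

def smoothed_q_and_grad(q, a, sigma=0.1, n=4096, rng=None):
    """Monte Carlo view of Gaussian smoothing: Q_sigma(a) = E[Q(a + sigma*eps)],
    with grad_a Q_sigma(a) = E[eps * Q(a + sigma*eps)] / sigma -- no gradient
    of the critic itself is needed (a zeroth-order estimator)."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n, a.size))
    vals = np.array([q(a + sigma * e) for e in eps])
    return vals.mean(), (eps * vals[:, None]).mean(0) / sigma

# Non-smooth critic: a step reward, where plain DPG's gradient is zero
# almost everywhere and undefined at the discontinuity.
q = lambda a: float(a[0] > 0.0)
a = np.array([0.02])
qs, g = smoothed_q_and_grad(q, a)
print(qs, g)  # smoothed value ~0.58 and a well-defined, nonzero gradient
```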
Results
The empirical evaluations indicate that Soft DDPG performs competitively in dense-reward settings and shows significant improvements in environments with discretized rewards, where traditional DDPG struggles due to sensitivity to irregular critic landscapes.
Implications
The proposed Soft-DPG framework and its implementation in Soft DDPG can enhance the robustness and stability of reinforcement learning algorithms in practical applications, particularly in robotics and autonomous systems where reward structures are often sparse or non-smooth.
Directional Consistency as a Complementary Optimization Signal: The GONO Framework
Optimization
- Directional consistency and loss convergence can be decoupled, revealing limitations in existing optimizers.
- GONO adapts the momentum coefficient based on directional alignment, improving optimization performance.
- The framework achieves perfect oscillation detection and competitive results on standard datasets.
- The study introduces a theoretically grounded approach to enhance optimizer design by considering directional signals.
Read more
Directional Consistency as a Complementary Optimization Signal: The GONO Framework
Summary
This paper introduces the GONO (Gradient-Oriented Norm-Adaptive Optimizer) framework, which addresses a critical gap in deep learning optimization by decoupling directional consistency from loss convergence. The author identifies that existing optimizers like Adam and SGD do not effectively utilize the temporal consistency of gradient directions, leading to scenarios where optimizers exhibit high directional consistency while the loss remains stagnant. GONO adapts the momentum coefficient of Adam based on the consecutive cosine similarity of gradients, enhancing momentum when directional consistency is high and reducing it during oscillations. The paper provides theoretical guarantees that GONO maintains the same convergence rate as Adam while demonstrating superior oscillation detection capabilities and competitive performance on benchmark datasets like MNIST and CIFAR-10. The findings suggest that incorporating directional alignment as a first-class signal can improve optimization strategies in deep learning.
Methodology
The GONO framework modifies the momentum coefficient of the Adam optimizer based on the consecutive cosine similarity of gradients (cc_t). This adaptation allows for increased momentum during stable gradient directions and decreased momentum during oscillations. The paper includes empirical validation across multiple datasets and architectures, alongside theoretical proofs of convergence rates.
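A minimal numpy sketch of the idea; the exact mapping from cc_t to the momentum coefficient below (a linear rescaling) is an assumption, not necessarily the paper's schedule.

```python
import numpy as np

def gono_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style step whose momentum coefficient is scaled by the cosine
    similarity cc_t between consecutive gradients (illustrative sketch)."""
    t = state["t"] + 1
    cc = 0.0
    if state["g_prev"] is not None:
        denom = np.linalg.norm(grad) * np.linalg.norm(state["g_prev"]) + 1e-12
        cc = float(grad @ state["g_prev"] / denom)
    b1 = beta1 * (1.0 + cc) / 2.0   # aligned -> keep momentum; oscillating -> damp
    m = b1 * state["m"] + (1.0 - b1) * grad
    v = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)  # standard Adam bias correction
    v_hat = v / (1.0 - beta2 ** t)
    state.update(g_prev=grad.copy(), m=m, v=v, t=t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), state

w = np.zeros(3)
state = {"g_prev": None, "m": np.zeros(3), "v": np.zeros(3), "t": 0}
for _ in range(100):
    grad = 2.0 * (w - 0.5)          # gradient of ||w - 0.5||^2
    w, state = gono_step(w, grad, state)
print(w)                            # w drifts toward the optimum 0.5
```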
Results
GONO achieves an F1 score of 1.00 in oscillation detection, significantly outperforming traditional gradient norm methods (F1 = 0.45). It maintains competitive performance on MNIST (98.15%) and CIFAR-10 (43.14%), demonstrating its effectiveness as an optimizer.
Implications
The findings suggest that incorporating directional consistency as an optimization signal can lead to more effective training strategies in deep learning, potentially improving convergence rates and model performance across various applications.
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
NLP
Large Language Models
- Correction suppression is a prevalent issue in LLMs, with suppression rates between 19% and 90%.
- Models exhibit a 'knowing but not correcting' behavior, recognizing errors internally but failing to correct them due to task context.
- Two effective training-free interventions, CDS and DPA, significantly enhance factual correction rates.
- The study introduces 'factual strictness' as a new dimension of model reliability.
Read more
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Summary
This paper investigates a phenomenon termed 'correction suppression' in large language models (LLMs), where models that can accurately correct false claims in isolation fail to do so when these claims are embedded in task-oriented requests. The authors construct a benchmark of 300 false premises to evaluate this phenomenon across eight models, revealing suppression rates ranging from 19% to 90%. The study identifies that suppression is not due to a lack of knowledge; rather, the models recognize the errors internally but are diverted by task context, leading to compliance instead of correction. This is characterized as 'knowing but not correcting.' To address this issue, the authors propose two training-free interventions: Correction Direction Steering (CDS), which injects correction guidance at critical layers, and Dynamic Payload Amplification (DPA), which enhances the representation of error-related tokens. Both methods demonstrate significant improvements in correction rates while maintaining reasoning capabilities, highlighting the need for models to uphold factual accuracy against contextual pressures.
Methodology
The authors constructed a benchmark of 300 false premises and evaluated eight LLMs under isolated and contextualized conditions. Mechanistic analyses were conducted to compare hidden-state representations, prediction uncertainty, and attention patterns. Two training-free interventions, CDS and DPA, were proposed and tested for their effectiveness in improving correction rates.
Results
The results showed that suppression rates varied significantly across models, with four models exceeding 80%. CDS improved correction rates on Qwen3.5-9B from 0% to 58.2%, outperforming previous methods. DPA achieved competitive correction rates while preserving reasoning capabilities, demonstrating both methods' effectiveness with minimal calibration data and negligible inference overhead.
Implications
The findings suggest that LLMs need to be designed to maintain factual accuracy even under task-oriented pressures. The proposed interventions could be integrated into existing models to enhance their reliability in real-world applications, particularly in knowledge-intensive tasks.
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Reinforcement Learning
Large Language Models
Optimization
- LPO provides a unified geometric perspective on group-based RL algorithms, revealing implicit target-projections.
- The framework allows for explicit target-projection, leading to improved optimization stability and response diversity.
- LPO demonstrates higher accuracy in training performance compared to traditional policy gradient baselines.
- The decoupled structure of LPO supports flexible divergence selection, enhancing its applicability.
Read more
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Summary
This paper introduces Listwise Policy Optimization (LPO), a novel framework for reinforcement learning with verifiable rewards (RLVR) aimed at enhancing the performance of large language models (LLMs). The authors identify that existing group-based policy gradient methods implicitly perform target-projections on a response simplex, which can obscure the optimization process. LPO explicitly defines this target-projection, allowing for a more stable and effective optimization strategy. By constraining the proximal RL objective to the sampled responses, LPO enables exact divergence minimization, leading to bounded, zero-sum, and self-correcting gradients. The framework is evaluated across various reasoning tasks, demonstrating significant improvements over traditional policy gradient methods in terms of training performance, optimization stability, and response diversity. The findings suggest that explicit target-projection can enhance the understanding and effectiveness of RLVR in LLMs.
Methodology
The authors develop LPO by constraining the proximal RL objective to the response simplex and optimizing the policy through divergence minimization. This approach allows for exact computation of target distributions and projections, facilitating a clear separation between target construction and optimization.
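A toy sketch of divergence minimization restricted to one prompt's sampled responses; the reward-softmax target here is an assumed stand-in for the paper's target-projection, but it already exhibits the bounded, zero-sum gradient structure described above.

```python
import torch

def lpo_style_loss(logits, rewards, tau=1.0):
    """Listwise sketch: renormalize the policy over the G sampled responses
    of one prompt (the response simplex) and minimize an exact KL to a
    reward-derived target (the softmax target is an assumed choice)."""
    q = torch.softmax(rewards / tau, dim=0)   # target on the response simplex
    logp = torch.log_softmax(logits, dim=0)   # policy restricted to the group
    return torch.sum(q * (q.log() - logp))    # exact KL(target || policy)

logits = torch.randn(8, requires_grad=True)   # scores of 8 sampled responses
rewards = torch.tensor([1., 0., 0., 1., 0., 0., 0., 1.])
loss = lpo_style_loss(logits, rewards)
loss.backward()
# The gradient on the logits is softmax(logits) - q: bounded and zero-sum.
print(logits.grad.sum())  # ~0
```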
Results
LPO consistently outperforms typical policy gradient baselines across diverse reasoning tasks, achieving higher expected Pass@1 and Pass@k accuracy. The method also maintains stable optimization trajectories and preserves response diversity, demonstrating the effectiveness of explicit target-projection.
Implications
The findings suggest that explicit target-projection can significantly enhance the training and performance of LLMs in complex reasoning tasks, potentially leading to more robust and versatile applications in natural language processing and AI-driven problem-solving.
Attributions All the Way Down? The Metagame of Interpretability
Interpretability
NLP
Multimodal
- Introduction of the METAGAME framework for quantifying second-order interactions in model explanations.
- Development of meta-attributions that generalize first-order attribution methods to capture directional influences.
- Theoretical proof of hierarchical decomposition of attributions into directional interactions.
- Empirical demonstration of the METAGAME's effectiveness across various machine learning interpretability applications.
Read more
Attributions All the Way Down? The Metagame of Interpretability
Summary
This paper introduces the METAGAME, a novel conceptual framework designed to quantify second-order interaction effects of model explanations in machine learning. The authors propose a method for measuring the directional influence of one feature on the attribution of another, termed meta-attribution, by treating the attribution method as a cooperative game and calculating its Shapley value. Theoretical proofs establish that attributions can be hierarchically decomposed into these meta-attributions, providing a directional extension of existing interaction indices. The METAGAME framework is empirically validated across various interpretability applications, including quantifying interactions in language models, explaining cross-modal similarities in vision-language encoders, and interpreting concepts in multimodal diffusion transformers. This work addresses the limitations of traditional attribution methods that only capture first-order influences, thereby enhancing the understanding of complex interactions within models.
Methodology
The authors employ cooperative game theory to conceptualize attribution methods as games, calculating Shapley values to derive meta-attributions. They theoretically prove the hierarchical decomposition of attributions into directional components and validate their framework through empirical applications in language models and multimodal systems.
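A tiny exact-Shapley sketch of the idea: the "game" whose Shapley value is computed is itself an attribution function, so the resulting values measure how other features move feature 0's attribution. The toy attribution function below is invented for illustration; real attribution methods would stand in its place.

```python
from itertools import combinations
from math import factorial

def shapley(value, players):
    """Exact Shapley values of a set function `value` over `players`."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        rest = [q for q in players if q != p]
        for k in range(n):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))
    return phi

# Meta-attribution sketch: how does the presence of feature i change the
# attribution assigned to feature 0? Here attribution_of_j is a toy stand-in
# for a first-order attribution method's output.
def attribution_of_j(S, j=0):
    base = 1.0 if j in S else 0.0
    synergy = 0.5 if {0, 1} <= S else 0.0   # features 0 and 1 interact
    return base + synergy

meta = shapley(lambda S: attribution_of_j(S | {0}), players=[1, 2])
print(meta)  # {1: 0.5, 2: 0.0}: feature 1 carries the influence on feature 0
```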
Results
The METAGAME framework successfully quantifies second-order interactions, revealing how individual features influence each other's attributions. The empirical applications demonstrate its utility in diverse contexts, providing deeper insights into model behavior and improving interpretability.
Implications
The findings suggest that the METAGAME framework can significantly enhance the interpretability of complex machine learning models, making it easier to understand feature interactions and their contributions to model predictions. This has potential applications in fields requiring transparent AI, such as healthcare, finance, and autonomous systems.
Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning
Computer Vision
NLP
Efficient ML
- Frozen pretrained representations contain sufficient geometric structure for effective label-free OOD detection.
- Two complementary detection methods (global Mahalanobis and local ReSCOPED) were evaluated across diverse tasks.
- Performance of both detection methods improves with better representation quality, reducing the importance of detector choice.
- The study provides empirical evidence supporting the use of frozen models for OOD detection without fine-tuning.
Read more
Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning
Summary
This paper investigates the effectiveness of label-free out-of-distribution (OOD) detection using frozen pretrained models without the need for fine-tuning. The authors challenge the conventional belief that effective OOD detection requires class-conditional modeling or specialized models. They demonstrate that the geometric structure encoded in frozen representations from modern pretrained models is sufficient for accurate OOD detection. The study evaluates two label-free detection methods: a global Mahalanobis estimator and ReSCOPED, a local typicality estimator. The authors analyze 59 backbone-task pairings across vision and language tasks, revealing that both detection methods benefit from improved representation quality. As the scale of the pretrained models increases, the performance gap between the two detection methods diminishes, indicating that the choice of detector becomes less critical. The findings suggest that leveraging the geometry of frozen representations can facilitate efficient OOD detection directly on pretrained models, making it easier to deploy these models in real-world applications without requiring additional labels or fine-tuning.
Methodology
The authors employed two label-free OOD detection methods: a global Mahalanobis distance estimator that models the distribution of frozen representations using unlabeled in-distribution data, and ReSCOPED, a local typicality estimator that utilizes diffusion-based score and curvature information. They conducted experiments across 59 backbone-task pairings in vision and language domains to assess the performance of these methods.
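The global estimator is straightforward to sketch: fit one Gaussian to unlabeled in-distribution features and score new points by Mahalanobis distance. The features below are random stand-ins for frozen backbone embeddings, and ReSCOPED's diffusion-based scoring is not shown.

```python
import numpy as np

def fit_mahalanobis(train_feats):
    """Fit a single Gaussian to frozen, unlabeled in-distribution features."""
    mu = train_feats.mean(0)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)

def ood_score(feats, mu, prec):
    """Larger squared Mahalanobis distance => more likely out-of-distribution."""
    d = feats - mu
    return np.einsum("ij,jk,ik->i", d, prec, d)

rng = np.random.default_rng(0)
ind = rng.normal(0, 1, (500, 8))   # stand-in frozen in-distribution features
ood = rng.normal(3, 1, (500, 8))   # shifted "OOD" features
mu, prec = fit_mahalanobis(ind)
print(ood_score(ind, mu, prec).mean(), ood_score(ood, mu, prec).mean())
```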
Results
The study found that stronger representations significantly enhance OOD detection performance, as measured by the area under the receiver operating characteristic (AUROC) curve. The performance gap between the global Mahalanobis and local ReSCOPED methods decreased as representation quality improved, indicating that both methods converge in effectiveness with better pretrained models.
Implications
The findings suggest that pretrained models can be deployed for OOD detection without the need for additional labels or fine-tuning, facilitating their use in real-world applications where labeled data is scarce. This could lead to more robust AI systems capable of handling distribution shifts in various tasks.
Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Generative Models
- TARDIS framework refines synthetic tabular data at inference time using a pre-trained backbone.
- Introduces Bidirectional Chamfer Refinement (BCR) to minimize the distance between synthetic and real samples.
- Achieves a median +8.6% improvement in downstream task performance over real data models across 15 datasets.
- Demonstrates that inference-time refinement can effectively close the synthetic-real performance gap.
Read more
Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Summary
This paper introduces TARDIS (Tabular generation through Refinement, Distillation, and Inference-time Sampling), a novel framework for refining synthetic tabular data at inference time, aiming to close the performance gap between synthetic and real data in downstream tasks. Traditional approaches have focused on improving generator architectures during training, but TARDIS operates on a frozen pre-trained backbone, refining outputs through a process that minimizes the symmetric Chamfer functional between synthetic and real samples. The framework employs a Tree-structured Parzen Estimator to optimize hyperparameters and utilizes a Bidirectional Chamfer Refinement (BCR) principle, which combines continuous score-level perturbations during reverse diffusion with discrete post-generation filtering. Empirical evaluations across 15 diverse datasets demonstrate that TARDIS achieves a median improvement of +8.6% in downstream task performance compared to models trained on real data, while also surpassing the underlying TabDiff backbone in all cases. This indicates that the synthetic-real gap can be effectively addressed at inference time, rather than solely through training-time enhancements.
Methodology
The TARDIS framework consists of three stages: oversampling a noise pool, applying score-level Chamfer perturbation during reverse diffusion, and ranking the candidate samples through a Chamfer Sampler. Hyperparameters are tuned using a Tree-structured Parzen Estimator, and the framework employs a Bidirectional Chamfer Refinement principle to optimize both fidelity and coverage in the generated samples.
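A minimal numpy sketch of the Chamfer machinery: the symmetric Chamfer functional between sample sets, plus the discrete post-generation filtering half of BCR. The score-level perturbation inside reverse diffusion is omitted, and the function names are illustrative.

```python
import numpy as np

def symmetric_chamfer(X, Y):
    """Symmetric Chamfer functional between two point sets (rows = samples)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean() + d2.min(0).mean()

def chamfer_filter(candidates, real, keep):
    """Discrete post-generation step: keep the candidates nearest to the real
    data by nearest-neighbor distance (one half of the BCR idea)."""
    d2 = ((candidates[:, None, :] - real[None, :, :]) ** 2).sum(-1).min(1)
    return candidates[np.argsort(d2)[:keep]]

rng = np.random.default_rng(0)
real = rng.normal(0, 1, (200, 5))
cand = rng.normal(0.5, 1.5, (400, 5))   # oversampled synthetic pool
kept = chamfer_filter(cand, real, keep=200)
print(symmetric_chamfer(cand, real), symmetric_chamfer(kept, real))
```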
Results
TARDIS outperformed real-data utility on 11 out of 15 datasets, achieving a median improvement of +8.6% in downstream tasks. It also improved upon the TabDiff backbone across all datasets, with a mean improvement of +12.9%, while maintaining comparable manifold fidelity, diversity, and sample-level privacy.
Implications
The findings suggest that inference-time refinement can be a viable strategy for enhancing the utility of synthetic data in machine learning applications, particularly in scenarios where access to real data is limited. This approach could lead to more effective use of synthetic data in various domains, including healthcare, finance, and any field reliant on tabular data.
MinMax Recurrent Neural Cascades
Theory
Efficient ML
Time Series
- MinMax RNCs can express all regular languages and have favorable theoretical properties.
- They can be evaluated in parallel with logarithmic runtime concerning input length.
- MinMax RNCs maintain bounded states and gradients, avoiding issues of vanishing or exploding gradients.
- Empirical results show superior performance on synthetic tasks compared to state-of-the-art models.
Read more
MinMax Recurrent Neural Cascades
Summary
This paper introduces MinMax Recurrent Neural Cascades (RNCs), a novel architecture that leverages MinMax algebra to create recurrent neural networks that are resilient to the issues of vanishing and exploding gradients. The authors demonstrate that MinMax RNCs possess significant theoretical advantages, including the ability to express all regular languages, efficient parallel evaluation, and bounded state and activation values. The MinMax recurrence is defined as xt = (At β xtβ1) β bt, where standard addition and multiplication are replaced with max and min operations, respectively. The paper outlines several key properties of MinMax RNCs, such as their expressivity, complexity, stability, and gradient behavior. Empirical evaluations show that these models can perfectly solve various synthetic tasks and outperform existing state-of-the-art recurrent neural networks. Additionally, a MinMax RNC with 127M parameters was trained for next-token prediction, achieving competitive performance, thus indicating the architecture's potential for real-world applications.
Methodology
The authors develop MinMax RNCs by cascading layers of neurons that utilize MinMax recurrence. The recurrence is defined using MinMax algebra, allowing for efficient evaluation and maintaining bounded states. Theoretical properties are established through theorems, while empirical performance is assessed through experiments on synthetic tasks and next-token prediction.
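One recurrence step is easy to sketch in numpy, using the operator reading reconstructed above (⊕ = max, ⊗ = min); boundedness of the states follows directly, since max and min never leave the range of their inputs.

```python
import numpy as np

def minmax_step(A, x_prev, b):
    """One step of x_t = (A_t ⊗ x_{t-1}) ⊕ b_t: inside the matrix-vector
    product, multiplication becomes elementwise min and the summation over j
    becomes max; the bias is then folded in with max."""
    prod = np.max(np.minimum(A, x_prev[None, :]), axis=1)  # max_j min(A_ij, x_j)
    return np.maximum(prod, b)

rng = np.random.default_rng(0)
A, b = rng.uniform(-1, 1, (4, 4)), rng.uniform(-1, 1, 4)
x = rng.uniform(-1, 1, 4)
for _ in range(1000):
    x = minmax_step(A, x, b)
print(x)  # states stay bounded by the range of A, b, and x_0
```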
Results
MinMax RNCs are shown to perfectly solve star-free tasks and outperform existing models like Mamba, mLSTM, and sLSTM. The architecture's next-token prediction model, with 127M parameters, achieves a loss comparable to GPT-2, indicating its effectiveness and scalability.
Implications
The findings suggest that MinMax RNCs could be a viable alternative to traditional recurrent architectures and transformers, particularly in applications requiring robust sequence processing without the drawbacks of gradient issues. Their competitive performance in real-world tasks opens avenues for further exploration in various domains.
Online Bayesian Calibration under Gradual and Abrupt System Changes
Theory
Optimization
Time Series
- Introduces Bayesian Recursive Projected Calibration (BRPC) for online Bayesian calibration.
- Separates parameter updates from discrepancy modeling to improve identifiability.
- Integrates a restart mechanism to handle abrupt regime shifts effectively.
- Demonstrates improved calibration accuracy and robustness through empirical evaluations.
Read more
Online Bayesian Calibration under Gradual and Abrupt System Changes
Summary
This paper addresses the challenges of Bayesian model calibration in the context of digital twins and computer experiments, particularly under nonstationary conditions where systems may experience gradual drifts and abrupt regime shifts. The authors propose a novel framework called Bayesian Recursive Projected Calibration (BRPC) that enhances traditional Bayesian calibration methods by enabling online updates of calibration parameters and systematic bias modeling. BRPC employs a two-stage update strategy: first, it updates model parameters to align with evolving system dynamics, and second, it updates a Gaussian process-based discrepancy model conditionally based on these parameters. This separation improves identifiability and allows for effective adaptation to gradual changes. To manage abrupt changes, BRPC incorporates a restart mechanism that detects regime shifts and resets the calibration process, thus mitigating bias from previous data. Theoretical guarantees are provided for the performance of both the calibration updates and the restart mechanisms. Empirical evaluations on synthetic data and plant simulation benchmarks demonstrate that BRPC significantly improves calibration accuracy under gradual changes and enhances robustness and predictive performance during abrupt shifts compared to existing methods like sliding-window Bayesian calibration and data assimilation.
Methodology
The methodology involves a two-stage update process where calibration parameters are first updated to align with the evolving system dynamics, followed by a conditional update of a Gaussian process-based discrepancy model. The restart mechanism is employed to detect abrupt changes and reset the calibration process, ensuring minimal bias from previous data.
Results
The empirical results indicate that BRPC outperforms traditional methods in terms of calibration accuracy during gradual changes and shows enhanced robustness and predictive performance under abrupt regime shifts, as evidenced by tests on synthetic and plant simulation benchmarks.
Implications
The proposed BRPC framework has significant implications for real-time applications in digital twins and other fields where systems are subject to nonstationary behavior. It allows for more accurate modeling and prediction in dynamic environments, which is crucial for decision-making and operational efficiency.
Nationwide EHR-Based Chronic Rhinosinusitis Prediction Using Demographic-Stratified Models
Interpretability
- Utilized nationwide EHR data to enhance CRS diagnosis prediction.
- Developed a hybrid feature-selection method to condense clinical codes.
- Implemented demographic-stratified models to capture variations in disease presentation.
- Achieved an AUC of 0.8461, demonstrating improved predictive performance.
Read more
Nationwide EHR-Based Chronic Rhinosinusitis Prediction Using Demographic-Stratified Models
Summary
Chronic rhinosinusitis (CRS) is a prevalent inflammatory disorder that poses significant healthcare challenges due to its heterogeneous nature and symptom overlap with other conditions. This study addresses the limitations of previous predictive models that often relied on single-institution cohorts, which lack generalizability. By utilizing nationwide longitudinal electronic health record (EHR) data from the All of Us Research Program, the authors aimed to predict CRS diagnoses based on two years of pre-diagnostic history. A hybrid feature-selection pipeline was developed to reduce over 110,000 clinical codes to 100 interpretable features, addressing the issues of feature sparsity and dimensionality. The authors also implemented demographic-stratified models tailored to six adult sex and life-stage subgroups, enhancing the model's ability to capture demographic heterogeneity. The resulting framework achieved an overall area under the curve (AUC) of 0.8461, indicating improved discrimination compared to baseline models. This research highlights the potential of EHR data in supporting population-representative CRS risk stratification, which could inform earlier triage and referral prioritization in primary care settings.
Methodology
The study employed a hybrid feature-selection pipeline combining statistical prevalence-based screening and model-based importance ranking to reduce a vast number of clinical codes into a manageable set of 100 features. Demographic stratification was applied to train independent predictive models for six subgroups based on sex and age, allowing for tailored hyperparameter tuning and improved risk prediction.
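A schematic of the two-stage pipeline on a synthetic sparse code matrix; the prevalence floor, the random-forest importance ranker, and all thresholds below are assumptions standing in for the paper's choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hybrid_select(X, y, min_prevalence=0.01, top_k=100):
    """Two-stage sketch: (1) drop binary clinical-code columns below a
    prevalence floor, (2) rank survivors by model-based importance and
    keep the top_k."""
    prevalence = X.mean(0)
    keep = np.where(prevalence >= min_prevalence)[0]      # statistical screen
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X[:, keep], y)
    order = np.argsort(rf.feature_importances_)[::-1]     # importance ranking
    return keep[order[:top_k]]

rng = np.random.default_rng(0)
X = (rng.random((1000, 500)) < 0.02).astype(float)        # sparse code matrix
y = (X[:, 0] + rng.random(1000) > 0.98).astype(int)       # outcome tied to code 0
print(hybrid_select(X, y, top_k=10)[:5])
```

Demographic stratification would then repeat this fit independently per subgroup, which is what lets each stratified model tune to its own population.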
Results
The proposed framework achieved an overall AUC of 0.8461, which is an improvement of 0.0168 over the best baseline model. This indicates that the demographic-specific modeling approach significantly enhances the ability to predict CRS risk across different population segments.
Implications
The findings suggest that routinely collected EHR data can be effectively utilized for population-level CRS risk stratification, potentially leading to earlier diagnosis and improved patient management in primary care settings, especially where diagnostic imaging is not readily available.
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Large Language Models
Optimization
Efficient ML
- Introduction of sparse prefix caching for optimizing LLM serving.
- Formalization of the caching problem as a one-sided weighted k-median problem.
- Demonstration of improved performance over existing heuristics on real-world datasets.
- The method allows for exact output preservation without altering recurrent computations.
Read more
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Summary
This paper addresses the challenge of optimizing latency in autoregressive large language model (LLM) serving through a novel approach called sparse prefix caching. Traditional prefix caching methods rely on dense per-token key/value reuse, which is inefficient for recurrent layers that can resume from a single stored state. The authors propose a method that stores exact recurrent states at sparse checkpoint positions, allowing for efficient recomputation of the remaining suffix upon a cache hit. They formalize this approach as a checkpoint placement problem, yielding an exact O(NM) dynamic programming solution. The method is particularly effective in scenarios where requests share a common prefix, such as in conversational contexts or when querying related documents. The authors demonstrate that their distribution-aware placement strategy outperforms existing heuristics on real-world datasets, achieving significant reductions in recomputation and improved performance, especially under low checkpoint budgets. The paper also provides theoretical guarantees and empirical validation of the proposed method, highlighting its potential for enhancing the efficiency of hybrid and recurrent LLM serving.
Methodology
The authors develop a dynamic programming approach to determine optimal checkpoint placements based on the distribution of overlap depths in future requests. They analyze the problem as a one-sided weighted k-median problem and provide theoretical results regarding optimal placement under uniform overlap conditions. Empirical validation is conducted using datasets like QuALITY and System Prompts to compare their method against fixed-budget baselines.
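The sketch below shows one way the checkpoint-placement problem can be written as a dynamic program over overlap-depth probabilities, using prefix sums for O(1) segment costs. This direct version runs in O(N²·M); the paper's exact O(NM) algorithm presumably exploits further structure of the one-sided cost, and all names here are illustrative.

```python
import numpy as np

def place_checkpoints(p, M):
    """Choose M checkpoint depths minimizing expected recomputation.

    p[d] = probability that a future request overlaps the cached prefix to
    depth d (1-indexed; p has length N+1, p[0] unused; assumes M <= N).
    A request of depth d resumes from the deepest checkpoint c <= d and
    recomputes d - c tokens; with no checkpoint below d it recomputes all d.
    """
    N = len(p) - 1
    # Prefix sums: P[i] = sum_{d<=i} p[d], Q[i] = sum_{d<=i} d * p[d].
    P = np.concatenate([[0.0], np.cumsum(p[1:])])
    Q = np.concatenate([[0.0], np.cumsum(np.arange(1, N + 1) * p[1:])])

    def seg(c, e):
        # Expected cost of depths strictly between c and e, all served by a
        # checkpoint at c (c = 0 means no checkpoint: full recomputation).
        return (Q[e - 1] - Q[c]) - c * (P[e - 1] - P[c])

    INF = float("inf")
    # dp[m][i]: min cost of depths below i with m checkpoints, the last at i.
    dp = np.full((M + 1, N + 1), INF)
    arg = np.zeros((M + 1, N + 1), dtype=int)
    for i in range(1, N + 1):
        dp[1][i] = seg(0, i)
    for m in range(2, M + 1):
        for i in range(m, N + 1):
            for c in range(m - 1, i):
                cand = dp[m - 1][c] + seg(c, i)
                if cand < dp[m][i]:
                    dp[m][i], arg[m][i] = cand, c

    # Add the tail cost of depths above the last checkpoint, then backtrack.
    tail = lambda i: (Q[N] - Q[i]) - i * (P[N] - P[i])
    best = min(range(M, N + 1), key=lambda i: dp[M][i] + tail(i))
    chosen, m, i = [], M, best
    while m >= 1:
        chosen.append(i)
        i, m = arg[m][i], m - 1
    return sorted(chosen)
```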
Results
The proposed sparse prefix caching method consistently outperforms standard heuristics, achieving better Pareto frontiers in terms of recomputation savings. It demonstrates significant wall-clock speedups in real-world scenarios, particularly when the overlap distribution among requests is non-uniform. The method uses fewer checkpoints while maintaining or improving performance compared to existing caching strategies.
Implications
This work has significant implications for the efficiency of LLM serving, particularly in applications involving conversational AI and document retrieval. By optimizing how recurrent states are cached and reused, the proposed method can lead to faster response times and reduced computational overhead, making it suitable for real-time applications.
SPADE: Faster Drug Discovery by Learning from Sparse Data
Optimization
Efficient ML
- SPADE addresses the inefficiencies in drug discovery for novel proteins by focusing on sparse data.
- The algorithm requires only an average of 40 tests to identify 10 high-quality ligands.
- SPADE outperforms traditional methods, requiring 7% to 32% fewer ligand tests than state-of-the-art baselines.
- A new dataset of 1.5 million entries was created to support the evaluation of the proposed method.
Read more
SPADE: Faster Drug Discovery by Learning from Sparse Data
Summary
The paper introduces SPADE, a novel algorithm designed to enhance the efficiency of drug discovery by addressing the challenges posed by sparse data, particularly in the context of novel proteins. Traditional drug discovery methods often fail to identify high-quality ligands due to the vast search space and the extreme sparsity of relevant data. SPADE reformulates the drug discovery process as a sparse, sequential race-to-k problem, aiming to identify a set of high-affinity ligands using minimal experimental tests. Instead of estimating the binding affinity (pIC50) for each candidate ligand, SPADE focuses on predicting whether a ligand can outperform the current best candidate. This approach reduces the complexity of the task and improves sample efficiency. The authors developed a robust classifier that minimizes expected loss over a Gaussian centered at positive examples, allowing for effective learning even with limited data. A large-scale dataset of 1.5 million entries was created to evaluate the algorithm. Empirical results demonstrate that SPADE outperforms existing deep learning and Bayesian optimization methods, achieving a 7% to 32% reduction in the number of tests required to identify high-quality ligands and a tenfold increase in scoring speed.
Methodology
SPADE employs a classification-based selection strategy that predicts whether a ligand can outperform the current best candidate rather than estimating binding affinities directly. It utilizes a robust classifier designed to minimize expected loss in the context of sparse data. The method is evaluated using a large dataset derived from PubChem, focusing on the race-to-k problem in drug discovery.
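To make the race-to-k loop concrete, here is a schematic selection loop in which the classifier target is "reaches or beats the current best pIC50" rather than the affinity value itself. This is a hedged sketch: a plain logistic regression stands in for the paper's robust Gaussian-smoothed classifier, and `assay`, `target`, and the ligand descriptors are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def race_to_k(features, assay, target, k=10, n_init=5, seed=0):
    """Sequentially test ligands until k of them reach the target pIC50.

    features: (n_candidates, d) ligand descriptors (placeholder).
    assay(i): runs one expensive test, returns the measured pIC50 of i.
    """
    rng = np.random.default_rng(seed)
    n = len(features)
    measured = {int(i): assay(int(i))
                for i in rng.choice(n, size=n_init, replace=False)}
    while sum(v >= target for v in measured.values()) < k:
        best = max(measured.values())
        idx = list(measured)
        # Binary target: does this ligand reach or beat the current best?
        X = features[idx]
        y = np.array([measured[i] >= best for i in idx], dtype=int)
        untested = [i for i in range(n) if i not in measured]
        if 0 < y.sum() < len(y):
            clf = LogisticRegression(max_iter=1000).fit(X, y)
            scores = clf.predict_proba(features[untested])[:, 1]
            pick = untested[int(np.argmax(scores))]
        else:
            pick = int(rng.choice(untested))  # degenerate labels: explore
        measured[pick] = assay(pick)
    return [i for i, v in measured.items() if v >= target]
```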
Results
SPADE demonstrated superior performance across 100 proteins, requiring 7% to 32% fewer ligand tests to reach target pIC50 values compared to state-of-the-art methods. Additionally, it achieved a tenfold speedup in scoring candidate ligands, showcasing its efficiency in the drug discovery process.
Implications
The findings suggest that SPADE could significantly accelerate the drug discovery process, particularly for novel proteins where data is scarce. This could lead to faster identification of potential therapeutics and reduce the costs associated with experimental testing.
In-Context Black-Box Optimization with Unreliable Feedback
Optimization
- FICBO integrates auxiliary feedback into the optimization process, improving query selection.
- A structured feedback prior models the reliability of the available feedback sources.
- Empirical evaluations show FICBO's superiority over classical and amortized optimization baselines.
- The model enhances interpretability by providing insights into how it assesses feedback reliability.
Read more
In-Context Black-Box Optimization with Unreliable Feedback
Summary
This paper introduces Feedback-informed In-Context Black-Box Optimization (FICBO), a novel framework designed to optimize black-box functions in the presence of unreliable auxiliary feedback. Traditional black-box optimization methods often rely on expert feedback, which can be biased or misleading. FICBO addresses this challenge by conditioning a pretrained optimizer on both historical optimization data and auxiliary feedback during the optimization process. The authors propose a structured feedback prior that models the variability in feedback reliability and relevance, allowing the optimizer to adaptively assess the quality of feedback at test time. The framework employs a transformer model that learns to predict outcomes and select queries based on both the optimization history and the auxiliary feedback. The empirical results demonstrate that FICBO outperforms existing methods on both synthetic and real-world tasks, effectively leveraging informative feedback while maintaining robustness against unreliable sources. The study also provides insights into the model's interpretability and decision-making processes, highlighting its ability to discern the reliability of feedback sources during optimization.
Methodology
The authors developed FICBO by pretraining a transformer model on a distribution of synthetic tasks with varying feedback reliability. The model learns to condition its predictions and query selections on both historical optimization data and auxiliary feedback. During deployment, it evaluates the reliability of feedback sources in real-time, allowing for adaptive optimization strategies.
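A sketch of how one such synthetic pretraining task might be generated is shown below: each auxiliary feedback source draws a latent reliability from a prior and ranks candidate points by a reliability-weighted mixture of the true objective and noise. The transformer itself is omitted, and every distribution here is an illustrative assumption rather than the paper's specification.

```python
import numpy as np

def sample_task(n_points=64, n_sources=3, seed=0):
    """Generate one synthetic optimization task with auxiliary feedback.

    Each source has a latent reliability drawn from a prior; a reliable
    source ranks points by the true objective, an unreliable one by noise.
    The optimizer is pretrained on many such tasks to infer reliability
    in context. All distributions below are placeholders.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=(n_points, 2))

    # Hidden objective: an RBF mixture, standing in for a smooth random draw.
    centers = rng.uniform(-1, 1, size=(4, 2))
    weights = rng.normal(size=4)
    f = sum(w * np.exp(-np.sum((x - c) ** 2, axis=1) / 0.1)
            for w, c in zip(weights, centers))

    feedback = []
    for _ in range(n_sources):
        reliability = rng.beta(2, 2)          # structured prior over sources
        noise = rng.normal(size=n_points)
        signal = reliability * f + (1 - reliability) * noise
        feedback.append(np.argsort(-signal))  # the source's preference ranking
    return x, f, feedback
```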
Results
FICBO demonstrated significant improvements in optimization performance across synthetic and real-world tasks compared to classical Bayesian optimization methods and other amortized optimizers. The model effectively utilized informative feedback while remaining robust to misleading sources, leading to better query selections and overall optimization outcomes.
Implications
The findings suggest that FICBO can be widely applied in fields requiring black-box optimization, such as materials science, biology, and engineering, where expert feedback is often available but may not always be reliable. The framework's ability to adaptively assess feedback reliability could enhance the efficiency and effectiveness of optimization processes in these domains.
When Graph Language Models Go Beyond Memorization
Graph Learning
Generative Models
Large Language Models
- Introduces a calibrated diagnostic protocol to evaluate GLMs, overcoming limitations of aggregate fidelity metrics.
- Demonstrates scale-dependent structural learning, with models transitioning from memorization to structural alignment as dataset size increases.
- Identifies a persistent deficit in learning rare graph patterns, indicating a critical gap in current autoregressive graph generation methods.
- Empirical evidence supports that GLMs can implicitly learn structural regularities without explicit pattern enumeration.
Read more
When Graph Language Models Go Beyond Memorization
Summary
This paper investigates whether graph language models (GLMs) learn structural regularities or merely memorize training graphs. The authors introduce a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three levels of frequency stratification to distinguish between memorization and structural alignment. The findings reveal that GLMs can acquire structural regularities beyond memorization, particularly in high-frequency patterns. Empirical results show that although GLMs achieve high subgraph-rank correlation on benchmark datasets, a memorization bootstrap often matches or exceeds this alignment, so high aggregate fidelity alone does not demonstrate structural learning. At small scales, models exhibit indistinguishable fidelity from verbatim recall, but as the scale increases, verbatim memorization declines sharply while rank correlation remains high. The study also highlights a persistent deficit in capturing rare patterns, which only marginally improves with increased model capacity. The robustness of these findings is confirmed across different graph serializations.
Methodology
The authors developed a diagnostic protocol that integrates frequent subgraph mining using gSpan, a bootstrap baseline for resampling training graphs, and a three-level frequency stratification (Head, Torso, Tail) to analyze subgraph statistics. This approach allows for a clear distinction between structural learning and memorization.
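The snippet below sketches the alignment statistic, the bootstrap baseline, and the Head/Torso/Tail split. gSpan mining is abstracted behind a user-supplied `count_subgraphs` callable, and the quantile thresholds are placeholders, so this is a schematic of the protocol rather than the authors' code.

```python
import numpy as np
from scipy.stats import spearmanr

def alignment(train_counts, gen_counts, patterns):
    """Spearman rank correlation between subgraph-frequency spectra."""
    a = [train_counts.get(p, 0) for p in patterns]
    b = [gen_counts.get(p, 0) for p in patterns]
    return spearmanr(a, b).correlation

def bootstrap_baseline(train_graphs, count_subgraphs, patterns,
                       n_boot=100, seed=0):
    """Verbatim-recall baseline: resample training graphs with replacement
    and measure how well pure memorization would already align."""
    rng = np.random.default_rng(seed)
    full = count_subgraphs(train_graphs)
    stats = []
    for _ in range(n_boot):
        rows = rng.integers(len(train_graphs), size=len(train_graphs))
        sample = [train_graphs[i] for i in rows]
        stats.append(alignment(full, count_subgraphs(sample), patterns))
    return float(np.mean(stats)), np.percentile(stats, [2.5, 97.5])

def stratify(train_counts, head_q=0.9, tail_q=0.5):
    """Split mined patterns into Head / Torso / Tail by training frequency."""
    freqs = np.array(list(train_counts.values()))
    hi, lo = np.quantile(freqs, head_q), np.quantile(freqs, tail_q)
    head = {p for p, c in train_counts.items() if c >= hi}
    tail = {p for p, c in train_counts.items() if c < lo}
    torso = set(train_counts) - head - tail
    return head, torso, tail
```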
Results
The study found that GLMs can achieve high subgraph-rank correlation on benchmark datasets, with the memorization bootstrap often matching or exceeding model alignment. At larger scales, models show a significant drop in verbatim memorization while maintaining high rank correlation. However, they consistently struggle to capture rare patterns, indicating a limitation in their generative capabilities.
Implications
The findings suggest that while GLMs can learn structural patterns effectively at scale, there remains a critical need to enhance their ability to generate rare graph structures. This has implications for applications in drug discovery, material design, and social network analysis, where understanding and generating diverse graph structures is essential.