AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 48
Update frequency: 8h
Days of history: 7
Beyond LoRA: Is Sparsity-Induced Adaptation Better?
Efficient ML
Theory
Optimization
- Introduction of sparsity-induced adaptations (cLA and c3LA) to enhance LoRA variants.
- Derivation of generalization error bounds for the proposed methods.
- Empirical evaluation shows reductions of up to 10% in training time and up to 15% in peak GPU memory usage.
- Sparsity in LoRA adaptations maintains competitive performance across various tasks.
Summary
This paper investigates the effectiveness of low-rank adaptation (LoRA) and its variants in the context of parameter-efficient fine-tuning (PEFT) for pre-trained models. The authors introduce new methods that induce sparsity within existing LoRA variants, specifically Cheap LoRA (cLA) and the chained circulant variant (c3LA). These methods aim to enhance the efficiency of adaptation while maintaining competitive performance. The authors derive information-theoretic generalization error bounds for these adaptations, marking a significant contribution to the theoretical understanding of LoRA variants. Empirical evaluations across 11 fine-tuning methods, 10 pre-trained models, and 14 datasets reveal that the proposed sparse adaptations can reduce training time by up to 10% and peak GPU memory usage by up to 15%, while still achieving performance comparable to parameter-matched baselines. The findings suggest that structured column-space restrictions in LoRA-based methods can yield effective adaptations, providing a more principled approach to cost-effective model fine-tuning.
Methodology
The authors conducted a theoretical analysis of LoRA variants, deriving information-theoretic generalization error bounds. They also ran empirical evaluations of 11 fine-tuning methods on 10 pre-trained models across 14 datasets, using loss landscapes and spectral analysis to assess performance and generalization.
Results
The proposed methods (cLA and c3LA) demonstrated competitive performance compared to traditional LoRA adaptations and parameter-matched baselines, achieving up to a 10% reduction in training time and up to a 15% reduction in peak GPU memory usage. The theoretical insights provided a clearer understanding of generalization in the context of PEFT.
Implications
The findings suggest that adopting sparsity in LoRA adaptations can lead to more efficient fine-tuning of large models, making it feasible to deploy them in resource-constrained environments. This work may influence future research in parameter-efficient fine-tuning and model adaptation strategies.
Controlled Dynamics Attractor Transformer
Graph Learning
Theory
Optimization
- CDAT integrates energy-based modeling with transformer architectures for improved stability and robustness.
- The framework employs a mixture von Mises–Fisher attention energy and Hopfield refinement energy.
- CANN-inspired modulation introduces a control interface for steering and stabilizing inference dynamics.
- CDAT achieves state-of-the-art performance in graph anomaly detection and classification.
Summary
The paper introduces the Controlled Dynamics Attractor Transformer (CDAT), a novel framework that integrates energy-based modeling with transformer architectures to enhance representation learning and inference. Traditional transformer models, while effective, struggle with stability and robustness in the presence of noise and complex relational dependencies. CDAT addresses these issues by combining a mixture von Mises–Fisher attention energy with Hopfield refinement energy, incorporating a CANN-inspired excitation-inhibition modulation. This approach creates a topology-constrained dynamical system that encodes relational structures among tokens, allowing for more stable and interpretable inference dynamics. The authors provide a formal dissipation analysis to establish the controlled dynamics of CDAT. The framework demonstrates state-of-the-art performance in graph anomaly detection and graph classification tasks, showcasing its effectiveness in handling complex data structures while maintaining robustness against spurious states.
Methodology
CDAT combines two energy components: a mixture von Mises–Fisher attention for coarse alignment and Hopfield refinement for fine-grained retrieval. It incorporates a CANN-inspired modulation mechanism that stabilizes attractor dynamics against spurious states. The authors conduct a constructive dissipation analysis to formalize the controlled dynamics of the system, allowing for a more interpretable and robust inference process.
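As a rough illustration of the "Hopfield refinement" idea, the sketch below runs one retrieval step of the standard modern Hopfield update (a softmax-weighted average of stored patterns). This is the generic textbook form and not CDAT's actual energy, which also involves the von Mises–Fisher term and the CANN-inspired modulation. The patterns, the query, and the `beta` value are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(query, patterns, beta):
    """One retrieval step: softmax-weighted average of stored patterns."""
    scores = [beta * sum(q * p for q, p in zip(query, pat)) for pat in patterns]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * pat[i] for w, pat in zip(weights, patterns)) for i in range(dim)]

patterns = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical stored patterns
noisy_query = [0.9, 0.2]              # corrupted version of the first pattern
refined = hopfield_update(noisy_query, patterns, beta=8.0)
```

With a sufficiently large `beta`, the refined vector moves close to the stored pattern most similar to the query, which is the sense in which such an update performs fine-grained retrieval.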
Results
CDAT achieves state-of-the-art results across multiple benchmarks in graph anomaly detection and graph classification, outperforming traditional transformer models and demonstrating enhanced robustness against noise and adversarial perturbations.
Implications
The CDAT framework has potential applications in various domains requiring robust representation learning, particularly in graph-based tasks. Its ability to maintain stability in complex relational structures could lead to advancements in areas such as social network analysis, fraud detection, and any field where relational data is prevalent.
Taming Curvature: Architecture Warm-Up for Stable Transformer Training
Optimization
Large Language Models
Theory
- Introduces a fast online estimator for preconditioned Hessian eigenvalues to track curvature during training.
- Demonstrates a correlation between training instabilities and surges in preconditioned curvature.
- Proposes an architecture warm-up strategy to control curvature by gradually increasing network depth.
- Shows that the proposed method reduces instabilities compared to state-of-the-art techniques.
Summary
The paper addresses the challenges of training large-scale Transformers, which often experience instability characterized by loss spikes and divergence. The authors leverage the Edge of Stability (EoS) theory to understand and mitigate these issues through curvature control. They introduce a novel online estimator for the largest preconditioned Hessian eigenvalue, utilizing a warm-started power iteration method with Hessian-vector products. This method allows for efficient curvature tracking at the billion-parameter scale, revealing that training instabilities correlate with increases in preconditioned curvature, which also grows with network depth. To stabilize training, the authors propose an 'architecture warm-up' strategy, where the depth of the network is progressively increased to manage curvature effectively. This approach is shown to enhance training stability without compromising convergence speed, outperforming existing stabilization techniques. Extensive experiments validate the effectiveness of curvature tracking and the stability gains achieved through this method.
Methodology
The authors developed an efficient online curvature estimation method using warm-started power iteration with Hessian-vector products. They implemented an architecture warm-up strategy that involves progressively unfreezing Transformer layers to control curvature during the training process.
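To make the curvature estimator more concrete, here is a minimal sketch of warm-started power iteration driven by Hessian-vector products. It assumes a fixed 2×2 symmetric matrix in place of a network Hessian and leaves out the preconditioner entirely. The names `hvp` and `top_eigenvalue` and the matrix values are illustrative, not taken from the paper.

```python
import math

# Hypothetical stand-in for a Hessian: a small symmetric matrix.
H = [[4.0, 1.0],
     [1.0, 3.0]]

def hvp(v):
    """Hessian-vector product; a real setup would use autodiff instead."""
    return [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(H))]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenvalue(v0, steps):
    """Run power iteration from v0; return (Rayleigh quotient, final vector)."""
    v = normalize(v0)
    for _ in range(steps):
        v = normalize(hvp(v))
    hv = hvp(v)
    return sum(a * b for a, b in zip(v, hv)), v

# Cold start: many iterations from an arbitrary vector.
lam, v = top_eigenvalue([1.0, 0.0], steps=50)

# Warm start: reuse the previous eigenvector, so a couple of steps suffice
# when the Hessian has changed only slightly since the last estimate.
lam_warm, v = top_eigenvalue(v, steps=2)
```

The warm start is what keeps the per-step cost low: each refresh needs only a few Hessian-vector products instead of a full run from a random vector.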
Results
The proposed curvature tracking method allows for accurate and efficient monitoring of preconditioned curvature, leading to reduced training instabilities. The architecture warm-up strategy demonstrated significant improvements in stability across various large Transformer settings, outperforming existing stabilization techniques without slowing down convergence.
Implications
The findings suggest that better control of curvature can lead to more stable training of large-scale models, which is crucial for reducing computational costs and improving reproducibility in machine learning applications. This approach can be integrated into existing training frameworks, enhancing the robustness of large language models and other Transformer-based architectures.
Privacy from Symmetry: Orthogonally Equivariant Transformers for LLM Inference
NLP
Large Language Models
- Introduces CONJFORMER, a transformer variant that enhances privacy during LLM inference.
- Employs orthogonal obfuscation to prevent embedding inversion attacks.
- Demonstrates significant reduction in token recovery rates while maintaining model performance.
- Highlights the effectiveness of architectural symmetry in privacy preservation.
Summary
This paper addresses the privacy concerns associated with running large language models (LLMs) in cloud environments, where sensitive user data must be transmitted to third-party providers. The authors propose CONJFORMER, which uses an orthogonal obfuscation procedure to protect client-side embeddings during inference. By multiplying embeddings with a secret orthogonal matrix before transmission, the method ensures that the server only processes rotated representations, thus preventing the recovery of original tokens through nearest-neighbor search. The architecture of CONJFORMER is designed to be equivariant to orthogonal transformations, allowing for correct inference without revealing unrotated hidden states. Experimental results show that this approach reduces the token recovery rate from over 35% to at most 1.3%, while increasing perplexity by only 0.4% after fine-tuning. This work highlights the potential of architectural symmetry as a practical defense mechanism for privacy-preserving LLM inference, avoiding the need for noise injection or complex cryptographic solutions.
Methodology
The authors developed CONJFORMER, which modifies the transformer architecture to be equivariant to orthogonal transformations. This involves using a lightweight normalization technique (scalar RMSNorm) and blockwise orthogonal conjugation of linear weights. The client-side embeddings are multiplied by a secret orthogonal matrix before being sent to the server, ensuring that the server processes only the obfuscated representations.
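The equivariance property can be checked on a toy linear layer. The sketch below assumes a single 2×2 weight conjugated by one rotation matrix, whereas the paper describes blockwise conjugation and a scalar RMSNorm that are not modeled here. All values (`theta`, `W`, `x`) are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Secret orthogonal matrix Q (a 2-D rotation here; the angle is arbitrary).
theta = 0.7
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

W = [[2.0, -1.0],
     [0.5,  3.0]]          # hypothetical server-side linear weight
x = [1.0, 2.0]             # hypothetical client embedding

W_conj = matmul(matmul(Q, W), transpose(Q))   # conjugated weight Q W Q^T
x_rot = matvec(Q, x)                          # what the client transmits

server_out = matvec(W_conj, x_rot)            # computed on rotated inputs only
expected = matvec(Q, matvec(W, x))            # rotation of the plain output

# The client can undo the rotation with Q^T to recover W x.
recovered = matvec(transpose(Q), server_out)
```

Because Q^T Q is the identity, the conjugated layer maps the rotated input to the rotated output, so the server never needs the unrotated embedding.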
Results
Experiments conducted on GPT-2 and Llama 3.2 1B models fine-tuned on PubMed showed that the proposed orthogonal obfuscation method reduced token recovery from over 35% to at most 1.3% in top-10 predictions, while only increasing perplexity by 0.4% after fine-tuning, indicating a successful balance between privacy and model performance.
Implications
The findings suggest that enforcing symmetry at the architectural level can provide a viable solution for privacy-preserving LLM inference, making it easier for organizations to deploy LLMs without compromising sensitive information. This approach could be particularly beneficial in sectors where data privacy is paramount, such as healthcare and finance.
Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit
Theory
- Verification of mathematical statements is not solely based on taste but on structural conditions.
- Sound coverage by verifiers allows for the assertion of valid statements while covering unseen valuable mathematics.
- A phase transition occurs in trivia generation, where the count of trivia, rather than its rate, impacts the coverage of valuable mathematics.
- An infinite stream of trivial outputs is a provable necessity for generating valuable mathematical insights.
Summary
This paper addresses the challenge of generating valuable mathematical statements using AI systems that are coupled with proof assistants. The authors propose a model of nested language generation in the limit, where a formal language F is verified by a membership oracle (proof checker) and contains an unknown valuable language H. The paper identifies a core C of literature that has a specific density α, revealing that outputs can be categorized as valuable, trivial, or hallucinations. The authors explore four main questions: the nature of verification as not being taste-dependent, the sound coverage provided by verifiers, the phase transition in trivia generation, and the necessity of generating trivial statements to uncover valuable mathematics. The findings indicate that while a perfect verifier cannot replace the need for discernment in mathematical discovery, an infinite stream of trivial outputs is essential for accessing unrecorded valuable mathematics. The paper concludes that the 'flood' of trivial outputs is necessary to achieve a 'harvest' of valuable mathematical insights.
Methodology
The authors develop a theoretical framework for nested language generation in the limit, utilizing a membership oracle to verify formal statements. They analyze the relationships between verification, value, and the generation of trivial outputs through mathematical proofs and theoretical constructs.
Results
The study establishes that generators producing a finite number of trivia achieve optimal coverage of valuable mathematics at a density of α/2, while those allowing infinite trivia can reach a coverage of 1 - α/2. The results demonstrate that the transition in coverage is determined by the count of trivia generated, not the rate, emphasizing the necessity of trivial outputs for accessing valuable mathematics.
Implications
The findings suggest that AI systems for mathematical discovery must accept a steady output of valid but trivial statements in order to uncover valuable ones. This has implications for the design of future AI proof assistants and their role in mathematical research, and it underlines that discernment remains necessary in mathematical generation.
Decompose Sparsely Where You Should, Absorb Densely Where You Should Not
NLP
Large Language Models
Interpretability
- Introduces a parallel low-rank linear bottleneck to improve sparse autoencoder performance.
- Demonstrates that a significant portion of activation content is dense and computationally necessary.
- Finds that removing the dense component drastically increases cross-entropy, indicating its importance.
- Suggests that current practices in training SAEs may overlook critical dense components, leading to inefficiencies.
Summary
This paper challenges the conventional approach of sparse autoencoders (SAEs) that assumes all activation content can be effectively decomposed into sparse representations. The authors propose that some activation content contains a low-rank, dense component that is crucial for model performance but unsuitable for sparse representation. To address this, they introduce a rank-r linear bottleneck that runs parallel to standard SAEs, allowing for the absorption of dense structures before sparse reconstruction. This method is tested on the Gemma-2-2B model, demonstrating a significant reduction in dense latent counts and improved performance in sparse probing tasks. The findings suggest that the absorbed dense component is not only structurally identifiable but also causally necessary for model performance, indicating that the standard practice of training SAEs to reconstruct the entire residual stream may be suboptimal. The authors argue for a reevaluation of sparsity-based interpretability methods in light of their findings.
Methodology
The authors implemented a rank-r linear bottleneck in parallel with standard sparse autoencoders, maintaining functional separation through constraints on linearity, low rank, and gradient isolation. This approach allows the model to absorb dense activation content before it is processed by the sparse dictionary, without altering the existing architecture or training procedures.
Results
The introduction of the bottleneck resulted in up to an 84% reduction in dense latent counts and improved performance in sparse probing tasks. Causal analysis revealed that the dense component is structurally identifiable and necessary for maintaining low cross-entropy, with its removal leading to a 7.5× increase in cross-entropy compared to a 2.8× increase from removing similar PCA directions.
Implications
The findings suggest that dense components in activation streams are crucial for model performance and that current sparse encoding methods may not capture all relevant information. This could lead to more effective training strategies and improved interpretability in machine learning models, particularly in NLP tasks.
Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability
Interpretability
Graph Learning
- The study applies multiple post-hoc explainable AI (XAI) techniques to a black-box DTI prediction model.
- It highlights the significance of bridge nodes and edges in linking drug and protein features.
- The research demonstrates that interpretability can reveal critical insights into model decision-making processes.
- Results indicate that explainability can guide the design of novel therapeutics by uncovering underlying mechanisms.
Summary
This paper addresses the challenge of interpretability in drug-target interaction (DTI) and drug-target affinity (DTA) prediction models, which often function as 'black boxes' despite achieving high performance metrics. The authors conduct an interpretability audit of the BridgeDPI architecture across three datasets: Gao, Human, and C. elegans. They employ a combination of gradient-based attribution methods (including integrated gradients, saliency, layer-wise relevance propagation, SmoothGrad, and SmoothGrad-IG) alongside feature-wise occlusion ablation to mitigate single-explainer bias. The study reveals that explainability serves as a model criticism tool, uncovering modality dominance, dataset-dependent effects, and chemistry-consistent motifs. The findings indicate that while these analyses do not replace structural or experimental validation, they can generate testable hypotheses for further exploration in computational drug discovery. The research emphasizes the importance of interpretability in DTI/DTA models, suggesting that understanding model predictions can enhance drug discovery processes.
Methodology
The authors utilized a multi-layer BridgeDPI model and applied various XAI techniques, including integrated gradients, layer-wise relevance propagation, saliency maps, SmoothGrad, and perturbation-based methods. They conducted feature-wise occlusion ablation and employed strict intersection consensus across methods to ensure robustness and reduce bias in their interpretability analysis.
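Of the attribution methods listed, integrated gradients is easy to illustrate on a toy function. The sketch below assumes a two-feature function with an analytic gradient instead of the BridgeDPI model, and it approximates the path integral with a midpoint Riemann sum. The function `f`, its gradient, and the inputs are hypothetical.

```python
def f(x):
    """Hypothetical toy model: f(x) = x0^2 + 3*x1."""
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    """Analytic gradient of f; a real model would use autodiff."""
    return [2.0 * x[0], 3.0]

def integrated_gradients(x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        total = [t + gi for t, gi in zip(total, g)]
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
```

The attributions sum to `f(x) - f(baseline)`, the completeness property that makes integrated gradients convenient to sanity-check against other explainers.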
Results
The results showed that explainability is most effective when viewed as model criticism, revealing insights into modality dominance and dataset-dependent effects. The analyses highlighted significant patterns and relationships in the data, including the role of bridge nodes and edges, and identified chemistry-consistent motifs where different methods agreed.
Implications
The findings suggest that enhancing interpretability in DTI/DTA models can lead to better understanding and validation of predictions, ultimately aiding in drug discovery. The study proposes that these insights can generate testable hypotheses for future research and improve the reliability and applicability of DTI models in real-world settings.
A Bifurcation Theory Framework for Gradient Descent on the Edge of Stability
Optimization
Theory
- Development of a bifurcation theory framework for gradient descent applicable to overparameterized networks.
- Decomposition of EoS dynamics into normal and tangent components, leading to convergence proofs.
- Identification of the first Lyapunov coefficient as a key stability criterion for EoS training.
- Unification of prior results, including the product-stability condition, under the new framework.
Summary
This paper addresses the Edge of Stability (EoS) phenomenon in gradient descent, where training occurs with sharpness exceeding the classical convergence threshold yet still results in loss reduction over extended timescales. Previous analyses have been limited to low-dimensional loss landscapes, failing to capture the complexities of modern deep learning. The author introduces a bifurcation theory framework that applies to overparameterized neural networks, decomposing training dynamics into normal and tangent components relative to the manifold of minimizers. The study reveals that stable EoS training is linked to a flip bifurcation in the normal direction, influenced by the first Lyapunov coefficient. The paper proves convergence to the minimizing manifold under mild conditions, establishing a connection to prior work on product-stability. This framework not only enhances understanding of EoS dynamics but also unifies previous minimalist analyses within a broader theoretical context.
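The "classical convergence threshold" mentioned above can be seen in the one-dimensional quadratic case. This is a standard textbook illustration, not the paper's high-dimensional analysis; η denotes the step size and λ the curvature (sharpness).

```latex
L(x) = \tfrac{\lambda}{2} x^2, \qquad
x_{t+1} = x_t - \eta \nabla L(x_t) = (1 - \eta\lambda)\, x_t, \qquad
|1 - \eta\lambda| < 1 \iff 0 < \lambda < \tfrac{2}{\eta}.
```

At λ = 2/η the multiplier 1 − ηλ equals −1, so the iterate changes sign at every step. A multiplier of −1 is the condition under which a one-dimensional map undergoes a flip (period-doubling) bifurcation.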
Methodology
The author employs bifurcation theory to analyze the dynamics of gradient descent at the edge of stability. The training dynamics are decomposed into components normal and tangent to the manifold of minimizers, allowing for the examination of stability through the first Lyapunov coefficient. The framework is applied to overparameterized neural networks, and convergence is proven under specific spectral and geometric assumptions.
Results
The paper demonstrates that stable EoS training results from a flip bifurcation in the normal direction, with convergence to the minimizing manifold established under mild conditions. The first Lyapunov coefficient serves as a critical indicator of stability, and the product-stability condition is shown to be a special case within this broader framework.
Implications
This work provides a deeper theoretical understanding of the EoS phenomenon in deep learning, potentially guiding the design of more effective training algorithms for overparameterized models. It also bridges previous minimalist analyses with a comprehensive bifurcation framework, offering insights that could influence future research in optimization and neural network training dynamics.
Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation
Large Language Models
Optimization
Efficient ML
- Introduction of a stateful ReAct agent for autonomous experimentation.
- Significant reduction in token consumption compared to stateless designs.
- Maintains experimental history and reasoning across iterations.
- Demonstrated effectiveness in hyperparameter tuning and code optimization tasks.
Summary
This paper introduces a novel approach to autonomous experimentation using large language models (LLMs) by reformulating the autoresearch pattern into a stateful ReAct agent. The traditional stateless design of autoresearch incurs significant token costs due to the need to reconstruct experimental context at each iteration. The proposed stateful ReAct agent utilizes a typed persistent state to maintain experimental history across iterations, significantly reducing token consumption. Two benchmarks are evaluated: hyperparameter tuning and code performance optimization. The stateful agent demonstrates a 90% reduction in token usage for hyperparameter tuning and a 52% reduction for code optimization, while maintaining comparable optimization quality. The architecture is detailed sufficiently for practitioners to implement their own stateful autoresearch agents, highlighting the structural advantages of using a fixed-size conversation window to manage state efficiently.
Methodology
The methodology involves reformulating the autoresearch pattern into a stateful ReAct agent using LangGraph. The agent maintains a typed persistent state that carries experimental history, allowing it to operate with a fixed-size conversation window, thus reducing the per-iteration token cost from O(n) to O(1). The architecture includes a state graph with nodes for reasoning, tool execution, and convergence checking.
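The idea of a typed persistent state with a fixed-size window can be sketched without any agent framework. The example below does not use LangGraph. The `ExperimentState` fields, the window size, and the toy scoring rule are all hypothetical, and `context_size` is only a crude proxy for prompt length.

```python
from collections import deque
from typing import Deque, Dict, Tuple, TypedDict

class ExperimentState(TypedDict):
    """Hypothetical typed state carried between iterations."""
    best_score: float
    best_config: Dict[str, float]
    recent: Deque[Tuple[Dict[str, float], float]]  # bounded window of trials

def new_state(window: int) -> ExperimentState:
    return {"best_score": float("-inf"), "best_config": {},
            "recent": deque(maxlen=window)}

def record_trial(state: ExperimentState, config: Dict[str, float], score: float) -> None:
    """Update the state in place; old trials fall out of the window."""
    state["recent"].append((config, score))
    if score > state["best_score"]:
        state["best_score"] = score
        state["best_config"] = config

def context_size(state: ExperimentState) -> int:
    """Proxy for prompt length: number of trial records sent to the model."""
    return len(state["recent"]) + 1  # window plus the best-so-far record

state = new_state(window=3)
sizes = []
for step in range(10):
    lr = 0.1 / (step + 1)
    record_trial(state, {"lr": lr}, score=-abs(lr - 0.02))
    sizes.append(context_size(state))
```

The context size stops growing once the window is full, while the best configuration found so far is still retained. A stateless design that re-sends every past trial would instead grow with the iteration count.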
Results
The stateful ReAct agent achieved a 90% reduction in token usage during hyperparameter tuning (2,492 tokens vs. 24,465 tokens) and a 52% reduction in code optimization (627K tokens vs. 1,275K tokens), while achieving similar optimization quality in both tasks.
Implications
The findings suggest that stateful agents can significantly enhance the efficiency of autonomous experimentation in machine learning, making it feasible to conduct longer experiment sequences without the limitations imposed by token costs. This approach could be applied to various domains requiring iterative experimentation and optimization.
Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators
Computer Vision
- Introduces a geometric perspective on anomaly detection using projection operators.
- Addresses the limitations of boundary-based methods in high-dimensional data settings.
- Defines anomalies based on projection residuals, improving detection accuracy.
- Unifies understanding of reconstruction-based methods through projection quality.
Summary
This paper addresses the limitations of traditional anomaly detection methods that rely on estimating probability densities or learning decision boundaries, which assume that normal data occupies a non-zero volume in ambient space. The author argues that this assumption is misaligned with high-dimensional data, such as images, where normal samples are concentrated near low-dimensional manifolds. To overcome this mismatch, the paper introduces a geometric perspective by proposing a projection operator that maps data onto the manifold of normal samples. Anomalies are identified based on the projection residual, which quantifies how much a sample deviates from the manifold. This approach not only resolves issues associated with boundary-based methods but also provides a unified interpretation of reconstruction-based methods, explaining their performance in terms of projection quality. The proposed method enhances generalization by decoupling anomaly detection from probabilistic modeling, thus reducing misclassification of rare normal samples. Empirical results demonstrate that projection-aligned methods outperform traditional boundary-based and existing reconstruction-based approaches, highlighting their effectiveness in structural anomaly detection.
Methodology
The methodology involves learning a projection operator that maps input samples onto the manifold of normal data. Anomalous samples are identified by measuring the projection residual, which indicates the deviation from the manifold. The approach leverages spatial correlations in image data to model the global geometry without explicit geometric representation, allowing for effective anomaly localization and detection.
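A linear toy case shows how a projection residual separates "off-manifold" from "rare but on-manifold" samples. The sketch assumes the normal data lie on a line through the origin, a deliberately simple stand-in for the learned, nonlinear projection operator described above. The direction and the sample points are hypothetical.

```python
import math

def project_onto_line(x, direction):
    """Orthogonal projection of x onto the line spanned by `direction`."""
    norm_sq = sum(d * d for d in direction)
    coeff = sum(a * d for a, d in zip(x, direction)) / norm_sq
    return [coeff * d for d in direction]

def residual_score(x, direction):
    """Anomaly score: distance between x and its projection."""
    p = project_onto_line(x, direction)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))

direction = [1.0, 1.0]          # toy "manifold" of normal data
normal_sample = [2.0, 2.0]      # lies on the manifold
rare_normal = [50.0, 50.0]      # far from typical data but still on it
anomaly = [2.0, 0.0]            # off the manifold

scores = {
    "normal": residual_score(normal_sample, direction),
    "rare_normal": residual_score(rare_normal, direction),
    "anomaly": residual_score(anomaly, direction),
}
```

The rare normal sample gets the same zero score as the typical one because the score depends only on distance from the manifold, not on how densely that region is populated.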
Results
The proposed projection-aligned methods demonstrated strong performance, outperforming traditional boundary-based methods and improving upon existing reconstruction-based approaches. The results indicate that the new approach effectively addresses the misclassification of rare normal samples and enhances generalization capabilities.
Implications
The findings suggest that adopting a geometric perspective in anomaly detection can lead to more robust algorithms, particularly in applications involving high-dimensional data such as industrial inspection and medical imaging. The insights gained from this study could inform the design of future anomaly detection systems that are more aligned with the intrinsic structure of the data.
Graph Diffusion Residuals for Control-Function Instrumental Variables
Graph Learning
Theory
- A-IHF is a deterministic graph-diffusion method for extracting residuals in control-function IV estimators.
- The method effectively identifies treatment jumps and preserves relevant residual information for causal inference.
- Theoretical analysis provides insights into error decomposition and performance guarantees.
- A-IHF outperforms traditional methods in synthetic benchmarks and real-world applications.
Summary
This paper addresses the challenge of extracting high-quality residuals in control-function instrumental variable (IV) estimators, which are crucial for causal inference when treatment is endogenous. The authors introduce a novel method called Adaptive Anisotropic Instrumental Heat Flow (A-IHF), which operates on a graph of first-stage features to extract residuals effectively. A-IHF utilizes pilot diffusion to identify significant treatment jumps and suppresses conductance across these jumps to compute the generated control. The method incorporates a selection rule based on graph generalized cross-validation and other criteria to ensure the quality of the residuals. The authors provide a theoretical framework that decomposes error into structural leakage, residual attenuation, and treatment variation, establishing finite-sample bounds and graph-admissibility rates. Through extensive experiments on synthetic benchmarks and real IV applications, A-IHF demonstrates superior performance, achieving the lowest average structural-response mean squared error (MSE) compared to various baseline methods, particularly when the underlying graph captures piecewise-smooth structures.
Methodology
The authors propose A-IHF, which treats treatment as a signal on a graph constructed from first-stage features. The method employs pilot diffusion to detect treatment jumps, suppresses conductance across these jumps, and computes the generated control using a sparse graph resolvent. The selection of first-stage features is guided by a combination of graph generalized cross-validation, roughness, and residualized-treatment relevance.
Results
In experiments across 54 synthetic benchmark cells, A-IHF achieved the lowest average structural-response MSE, outperforming the best non-A-IHF baseline in 32 cells. The performance was particularly strong when the graph accurately represented piecewise-smooth first-stage structures.
Implications
The findings suggest that A-IHF can significantly improve causal inference in settings where treatment is endogenous, making it a valuable tool for researchers in econometrics and related fields. The method's reliance on graph structures also opens avenues for further exploration in graph-based learning and causal analysis.
The Weight Norm Sets the Grokking Timescale: A Causal Delay Law
Theory
Interpretability
- The weight norm causally controls the timescale of grokking, reconciling conflicting theories.
- A matched-counterfactual clamp shows that grokking can occur at any weight norm, with time to grok following an exponential delay law.
- The delay law exhibits a shared exponent across different tasks, indicating a scaling relationship.
- Normalization techniques, such as LayerNorm, alter the influence of weight norm on grokking.
Summary
This paper investigates the phenomenon of 'grokking' (the delayed generalization of neural networks after fitting training data) by focusing on the role of weight norm in controlling the timescale of grokking. The authors reconcile two opposing views: one that posits a critical weight norm for grokking and another that suggests the norm is not a key variable. They demonstrate that, under free dynamics, a network first memorizes by increasing its weight norm and subsequently relaxes under weight decay, leading to generalization as the norm approaches a concentrated value, denoted as ‖W‖_c. Through causal interventions, they show that grokking occurs at any norm held during training, with the time to grok following an exponential delay law. This delay is consistent across different tasks and architectures, indicating a scaling law with a shared exponent. The findings suggest that while the weight norm influences grokking timescale, it does not act as a strict threshold, thus providing a causal framework that connects the weight-norm and circuit accounts of grokking. The paper also discusses the implications of normalization techniques on the weight-norm dynamics and presents evidence of task generality across different types of tasks.
Methodology
The authors conducted experiments using a matched-counterfactual clamp to hold the weight norm at various multiples of a critical value throughout training. They analyzed the grokking timescale across different modular arithmetic tasks and architectures, measuring the relationship between weight norm and grokking time using statistical methods to establish causal relationships.
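One way to picture such a clamp is as a uniform rescaling of the parameters after each optimizer step, so that their global L2 norm stays at a chosen multiple of the critical value. The sketch below illustrates only that idea; `clamp_weight_norm`, `critical_norm`, and `multiple` are hypothetical names, and the paper's actual matched-counterfactual procedure may differ.

```python
import math

def global_norm(params):
    """L2 norm over all parameter tensors (here: flat lists of floats)."""
    return math.sqrt(sum(w * w for tensor in params for w in tensor))

def clamp_weight_norm(params, critical_norm, multiple):
    """Rescale all parameters so the global norm equals multiple * critical_norm.

    Assumption: the clamp is a uniform rescaling applied after each update,
    which changes the norm but not the direction of the weight vector.
    """
    target = multiple * critical_norm
    current = global_norm(params)
    if current == 0.0:
        return [list(tensor) for tensor in params]  # nothing to rescale
    scale = target / current
    return [[w * scale for w in tensor] for tensor in params]

# Toy parameters with global norm 13, clamped to 1.5x a critical norm of 2.
params = [[3.0, 4.0], [0.0, 12.0]]
clamped = clamp_weight_norm(params, critical_norm=2.0, multiple=1.5)
```

In a training loop this call would follow each optimizer step, so the optimizer still chooses the direction of the update while the norm is held fixed.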
Results
Under free dynamics, grokking coincides with the weight norm reaching a concentrated value. When the norm is clamped, grokking still occurs at any held value, with the time to grok following an exponential delay law. This relationship was consistent across multiple tasks and architectures, with a shared exponent of approximately 7.5. Holding the weight norm above the critical value delays grokking rather than preventing it, and the weight norm has a larger effect on grokking time than the learning rate.
Implications
The findings have implications for understanding the dynamics of neural network training and generalization, suggesting that controlling weight norm could be a strategy for optimizing learning processes. The results may inform the design of neural architectures and training protocols to enhance generalization capabilities in various machine learning applications.
High-Dimensional Random Projection for Activation Steering in Language Models
NLP
Large Language Models
Theory
- HiDRA captures richer discriminative signals in high-dimensional space, overcoming limitations of traditional DiM methods.
- Theoretical analysis supports the existence of second-order signals under the superposition hypothesis.
- HiDRA is a training-free, plug-in solution that enhances existing activation steering frameworks.
- Experimental evaluations show consistent performance improvements across diverse LLMs and tasks.
Summary
This paper introduces HiDRA (High-Dimensional Random Projection for Activation Steering), a novel approach to activation steering in large language models (LLMs). Traditional methods based on difference-in-means (DiM) are limited as they only capture mean differences between class activations, neglecting the richer discriminative signals present in the nonlinear feature subspace. HiDRA addresses this limitation by projecting activations into a high-dimensional space where nonlinear relationships can be better captured. The authors provide a theoretical foundation demonstrating that second-order discriminative signals exist under the superposition hypothesis, allowing HiDRA to outperform existing methods in steering model behavior. The approach is training-free and integrates seamlessly with current activation steering techniques, enabling effective behavioral control without significant computational overhead. Experimental results across various LLMs and benchmarks indicate that HiDRA consistently achieves superior performance in steering tasks, such as jailbreaking and truthfulness, while maintaining the general capabilities of the models.
Methodology
HiDRA employs a high-dimensional random projection technique to map model activations into a nonlinear feature space. Steering is performed in this projected space, allowing for the capture of complex discriminative signals. The modified activations are then projected back to the original space for model inference. The approach is based on theoretical insights regarding the superposition hypothesis and the limitations of linear mean shifts in traditional steering methods.
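For reference, the difference-in-means (DiM) baseline that HiDRA is contrasted with can be sketched in a few lines. This shows the baseline only, not HiDRA itself; the function names and the toy activations are illustrative, not taken from the paper.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dim_steering_vector(pos_acts, neg_acts):
    """Difference-in-means direction: mean(positive) - mean(negative)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def steer(activation, direction, alpha):
    """Linear mean shift: add alpha * direction to a single activation."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Classes that differ in their means give a usable direction.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [0.0, 3.0]]
v = dim_steering_vector(pos, neg)
steered = steer([0.5, 0.5], v, alpha=2.0)

# Classes with equal means but different spread give a zero direction,
# which is the kind of signal a mean-only method cannot see.
pos_spread = [[1.0, 0.0], [-1.0, 0.0]]
neg_spread = [[0.0, 1.0], [0.0, -1.0]]
v_blind = dim_steering_vector(pos_spread, neg_spread)
```

The second example is the motivating case: the two classes are clearly separable by their second-order structure, yet the mean difference carries no information.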
Results
The experiments demonstrate that HiDRA significantly outperforms baseline activation steering methods in tasks such as jailbreaking, truthfulness assessment, and multiple-choice question answering. The results indicate improved behavioral control without incurring substantial computational costs, showcasing the effectiveness of the proposed method.
Implications
HiDRA has the potential to enhance the control and interpretability of large language models in various applications, including text generation, reasoning, and interactive systems. By enabling more nuanced behavioral adjustments, it could improve user experience and model reliability in real-world scenarios.
Compressed Computation is (probably) not Computation in Superposition
Theory
Interpretability
- The CC model's performance advantage is primarily due to a noisy mixing matrix rather than true computation in superposition.
- Performance scales with the magnitude of the mixing matrix, indicating its critical role in the model's success.
- A new SNMF baseline derived from the mixing matrix can replicate the qualitative loss profile of the CC model.
- The learned neuron directions concentrate in the subspace associated with the top eigenvalues of the mixing matrix.
Summary
This paper investigates the Compressed Computation (CC) model proposed by Braun et al. (2025) to determine if it exemplifies computation in superposition. The CC model claims to compute 100 ReLU functions using only 50 neurons, achieving a performance that exceeds expectations based on its architecture. The authors reveal that this performance is largely due to a noisy residual stream that mixes inputs, which acts as an unintended mixing matrix in the labels. By separating the training objective into the ReLU term and the mixing term, they demonstrate that the performance gains scale with the magnitude of the mixing matrix and disappear when it is removed. The learned neuron directions predominantly align with the eigenvectors associated with the top 50 eigenvalues of the mixing matrix, indicating that the mixing term is crucial for the model's success. Additionally, a semi-non-negative matrix factorization (SNMF) baseline derived from the mixing matrix replicates the qualitative loss profile of the CC model and outperforms previous baselines, although it does not fully match the trained model's performance. The findings suggest that the CC model is probably not a valid example of computation in superposition.
Methodology
The authors analyze the CC model by reformulating it as a one-layer MLP trained on a modified target that includes both the ReLU function and a mixing matrix. They conduct experiments varying the mixing matrix and assess the model's performance against naive baselines. They also introduce a semi-non-negative matrix factorization (SNMF) approach to further investigate the mixing matrix's influence.
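The split of the objective into a ReLU term and a mixing term can be illustrated with a toy target. The sketch assumes the modified target has the additive form ReLU(x) + Mx for a mixing matrix M; that exact form, and all names below, are assumptions made for illustration rather than details confirmed by the summary.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cc_target(x, M):
    """Assumed label: ReLU term plus mixing term, y = ReLU(x) + M x."""
    return [r + m for r, m in zip(relu(x), matvec(M, x))]

x = [1.0, -2.0]
no_mixing = [[0.0, 0.0], [0.0, 0.0]]
small_mixing = [[0.1, 0.0], [0.0, 0.1]]

y_clean = cc_target(x, no_mixing)     # pure ReLU labels
y_mixed = cc_target(x, small_mixing)  # labels contaminated by the mixing term
```

With M set to zero the target reduces to plain ReLU, which is the setting in which the summary reports that the CC model no longer beats naive baselines.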
Results
The study finds that the CC model only outperforms naive baselines when a non-zero mixing matrix is present. The performance advantage scales with the magnitude of the mixing matrix and disappears when the matrix is removed, linking the model's success to the noise this matrix introduces. The SNMF baseline reproduces the qualitative loss profile but does not match the trained model's performance.
Implications
These findings suggest that the CC model may not be a suitable framework for understanding computation in superposition, prompting a reevaluation of how such models are interpreted and applied in machine learning contexts.
Unsupervised Learning for Missing Modalities in Multimodal Learning
Multimodal
- UL4M4 supports arbitrary numbers of modalities and missing patterns at the sample level.
- The framework decouples the imputation process from downstream tasks, enhancing generalizability.
- It achieves state-of-the-art performance in multimodal sentiment analysis under severe missing conditions.
- The methodology includes a novel partial-modality distance metric and modality-specific normalization.
Summary
This paper introduces UL4M4, a novel framework designed to address the missing-modality challenge in multimodal learning. The framework employs unsupervised learning to impute missing feature embeddings in a task-independent manner, allowing it to handle any number of modalities and arbitrary missing patterns. Key innovations include modality-specific normalization and a new partial-modality distance metric that facilitate effective clustering of incomplete observations while maintaining scale-invariance. The imputation process is lightweight and utilizes frozen encoders, making it easy to integrate with various fusion and prediction architectures. Extensive experiments demonstrate UL4M4's robustness, achieving F1-Micro scores consistently above 0.7 on challenging benchmarks, even with over 50% of modality slots missing. This performance surpasses existing state-of-the-art methods, showcasing UL4M4's effectiveness in real-world scenarios where data incompleteness is common.
Methodology
The UL4M4 framework employs an unsupervised learning stage to impute missing feature embeddings, utilizing modality-specific normalization and a partial-modality distance metric for clustering incomplete observations. The imputation process is iterative and greedy, leveraging cluster centers from the unsupervised stage to guide the filling of missing modalities during training or inference.
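A distance between incomplete observations can be sketched as follows. The summary does not give the formulas for the normalization or the metric, so this example makes two assumptions: each modality is normalized to unit L2 norm, and the distance is averaged over the modalities that both samples have. The modality names are likewise hypothetical.

```python
import math

def normalize(vec):
    """Assumed modality-specific normalization: scale to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec] if n > 0 else list(vec)

def partial_distance(a, b):
    """Mean squared distance over modalities observed in both samples.

    a, b: dict mapping modality name -> embedding list, or None if missing.
    Returns infinity when the samples share no observed modality.
    """
    shared = [m for m in a if a[m] is not None and b.get(m) is not None]
    if not shared:
        return float("inf")
    total = 0.0
    for m in shared:
        u, v = normalize(a[m]), normalize(b[m])
        total += sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)
    return total / len(shared)

sample_a = {"text": [1.0, 0.0], "audio": [0.0, 2.0], "video": None}
sample_b = {"text": [2.0, 0.0], "audio": None, "video": [1.0, 1.0]}
sample_c = {"text": None, "audio": None, "video": [1.0, 0.0]}

d_ab = partial_distance(sample_a, sample_b)  # compared on "text" only
d_ac = partial_distance(sample_a, sample_c)  # no shared modality
```

Because the text embeddings of `sample_a` and `sample_b` differ only by scale, their distance is zero, which is the scale-invariance property the summary mentions.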
Results
UL4M4 consistently achieved F1-Micro scores above 0.7 on challenging multimodal sentiment analysis benchmarks, even when more than 50% of modality slots were missing. The results demonstrated stability across various cluster sizes and significantly outperformed existing state-of-the-art baselines.
Implications
The UL4M4 framework can be applied in real-world multimodal applications where data incompleteness is prevalent, such as sentiment analysis, emotion recognition, and classification tasks. Its flexibility and robustness make it suitable for various domains that rely on multimodal data.
Federated Learning for Feature Generalization with Convex Constraints
Federated Learning
- FedCONST introduces convex constraints to enhance feature generalization in Federated Learning.
- The method adaptively modulates updates based on the global model's parameter strength.
- FedCONST effectively mitigates overfitting and improves generalization across diverse FL environments.
- The approach is validated through theoretical analysis and extensive empirical testing.
Summary
This paper addresses the challenges of generalization in Federated Learning (FL) due to heterogeneous client data, which often leads to local models overfitting their specific data distributions. The authors propose a novel approach called FedCONST, which adaptively modulates update magnitudes based on the global model's parameter strength. This method employs linear convex constraints to stabilize training and preserve locally learned generalization capabilities during model aggregation. By ensuring that well-learned features remain close during local updates while emphasizing under-learned features, FedCONST effectively aligns local and global objectives. The authors validate their approach through a Gradient Signal to Noise Ratio (GSNR) analysis, demonstrating that FedCONST enhances feature transferability and robustness. Experimental results show that FedCONST significantly outperforms existing FL methods across various models and datasets, achieving state-of-the-art performance while maintaining high computational and communication efficiency.
Methodology
FedCONST applies client-consistent convex constraints derived from the global model's weight magnitudes to stabilize training and enhance feature generalization. The method focuses on retaining well-learned features while emphasizing under-learned ones, ensuring that local updates align with global objectives without distorting generalization capabilities during aggregation.
Results
The experiments demonstrate that FedCONST significantly outperforms existing FL methods across various datasets and model architectures, achieving state-of-the-art performance in both cross-device and cross-silo settings. The method also shows improved stability in local training and reduced gradient variance, contributing to better overall model performance.
Implications
The findings suggest that incorporating convex constraints in Federated Learning can lead to more robust models that generalize better across heterogeneous data distributions. This approach has potential applications in various fields where data privacy is crucial, such as healthcare and finance, allowing for collaborative learning without compromising sensitive information.
Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens
NLP
Large Language Models
Generative Models
- DiffusionGemma exhibits a partial left-to-right commit bias that is granularity-dependent.
- Token-level commit order correlates only moderately with position and shows significant sub-block disorder.
- The model's performance is comparable to its autoregressive counterpart in certain tasks.
- Commit confidence can predict correctness in specific regimes but not universally.
Summary
This paper investigates the token commitment order in the diffusion language model DiffusionGemma, challenging the common perception that such models operate in a strictly parallel or autoregressive manner. The authors conduct an inference-only interpretability study on the google/diffusiongemma-26B-A4B-it model, which is a masked discrete diffusion mixture-of-experts model. By instrumenting the model's denoising loop, they record the positions of committed tokens and their associated entropy during inference. Through a comprehensive analysis of 686 prompts across various regimes, the study reveals a partial left-to-right commit bias that is granularity-dependent, with token-level commitment order showing moderate correlation (Kendall τb ≈0.43–0.60). The results indicate that while there is a tendency towards sequential commitment, it is not strictly block-autoregressive, as evidenced by significant sub-block disorder and large commit batches. The findings also highlight that the commitment order varies by task regime, with structured outputs like JSON exhibiting near order-independence. Additionally, the study assesses the predictive power of commit confidence on task correctness, noting that it is effective for mathematical tasks but not for factual recall. The authors also document the methodological challenges in measuring decoding order, contributing valuable insights for future research in this area.
Methodology
The authors instrumented the model's denoising loop to observe token commitment during inference, using the EntropyBoundSampler to record which canvas positions were committed and their per-position entropy. They analyzed 686 prompts across six regimes, employing a prompt-clustered bootstrap approach to derive insights about the commitment order and its dependencies.
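The order statistic reported for this study, Kendall's τb, compares each token's position with the denoising step at which it was committed. Below is a minimal pure-Python version of the standard τb formula; the commit traces are made-up toy data, not measurements from the paper.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D count concordant and discordant pairs; Tx and Ty count pairs
    tied only in x or only in y. Pairs tied in both are ignored.
    """
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

positions = [0, 1, 2, 3, 4, 5]
strict_left_to_right = [0, 1, 2, 3, 4, 5]  # one token per step, in order
batched_commits = [0, 0, 1, 1, 2, 2]       # two tokens committed per step

tau_strict = kendall_tau_b(positions, strict_left_to_right)
tau_batched = kendall_tau_b(positions, batched_commits)
```

Tokens committed in the same step count as ties, so even a trace whose batches proceed left to right scores below 1. This is one reason batch size matters when interpreting the reported correlations.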
Results
The analysis revealed a moderate left-to-right commit bias, with Kendall τb ranging from 0.43 to 0.60 across regimes. The observed order correlation was significantly below that expected of a purely block-sequential model, indicating genuine sub-block disorder. The model's task accuracy was comparable to that of its autoregressive sibling, and commit confidence was predictive of correctness in mathematical tasks but not in factual recall.
Implications
The findings suggest that diffusion models may not operate as strictly parallel or autoregressive systems, which could influence future model design and evaluation. The insights into token commitment order and its variability across tasks can inform the development of more effective decoding strategies and improve interpretability in generative models.
Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts
Theory
Optimization
Efficient ML
- Formalizes embedding model routing as an adversarial contextual linear bandit problem.
- Introduces a log-quadratic policy class for efficient online learning in model routing.
- Develops the Hypentropy Policy Gradient (HPG) algorithm that adapts to low-rank structures.
- Achieves O(s√MT) linearized policy regret, avoiding the curse of dimensionality.
Summary
This paper addresses the challenge of dynamically routing queries to multiple embedding models in modern recommendation systems, which is critical for optimizing information retrieval. The authors formalize this problem as an adversarial contextual linear bandit with low-rank experts, where queries serve as contexts, items as actions, and embedding models as experts. They identify that traditional regret measures are inadequate due to structural misspecification and statistical intractability. To overcome these challenges, they introduce a log-quadratic policy class that captures query-dependent routing while allowing efficient online learning. The authors propose the Hypentropy Policy Gradient (HPG) algorithm, which adapts to the unknown low-rank structure of embedding models and achieves a linearized policy regret of O(s√MT), where s is the intrinsic rank of the experts, M is the number of models, and T is the number of rounds. Additionally, they provide a computationally efficient and parameter-free implementation of HPG, significantly reducing the complexity associated with traditional methods. The paper concludes with numerical experiments that validate the effectiveness of their approach on real-world datasets.
Methodology
The authors frame the routing problem as a T-round adversarial contextual linear bandit with low-rank experts. They propose a new log-quadratic policy class to better capture the underlying structures of embedding models and develop the HPG algorithm, which utilizes online mirror descent techniques to efficiently adapt to the low-rank structure of the models.
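The summary states that HPG builds on online mirror descent. Assuming the "Hypentropy" in its name refers to the hyperbolic-entropy regularizer, whose mirror map is arcsinh(x/β), a single unconstrained mirror-descent step looks like the sketch below. This is not the authors' algorithm: the policy parameterization, the projection step, and the parameter-free tuning are all omitted, and `eta`, `beta`, and the function name are illustrative.

```python
import math

def hypentropy_md_step(x, grad, eta, beta):
    """One unconstrained mirror-descent step under the hypentropy mirror map.

    Map to the dual space with arcsinh(x / beta), take a gradient step of
    size eta there, then map back with beta * sinh(.).
    """
    return [
        beta * math.sinh(math.asinh(xi / beta) - eta * gi)
        for xi, gi in zip(x, grad)
    ]

# A zero gradient leaves the weights unchanged (up to rounding).
unchanged = hypentropy_md_step([0.5, -0.2], [0.0, 0.0], eta=0.1, beta=1.0)

# Starting from zero, a positive gradient moves the weight negative,
# so weights can cross zero under this update.
moved = hypentropy_md_step([0.0], [1.0], eta=1.0, beta=1.0)
```

The parameter β interpolates between regimes: for large β the update behaves like plain gradient descent, while for small β it behaves more like a multiplicative update on the magnitude of each weight.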
Results
The proposed HPG algorithm demonstrates a linearized policy regret of O(s√MT), significantly improving upon traditional methods that suffer from higher dimensionality dependencies. The implementation is shown to be computationally efficient, with a projection step that can be executed in O(d²qM) time, making it practical for real-world applications.
Implications
The findings have significant implications for the design of recommendation systems and search engines, particularly in improving the efficiency and effectiveness of query routing to specialized embedding models. This work can enhance user experience by providing more relevant recommendations and can be applied in various domains such as e-commerce, content retrieval, and personalized services.
Uncertainty Estimation and Generalization Bounds for Modern Deep Learning
Theory
- Introduction of the Deep Variational Implicit Process (DVIP) for scalable Bayesian modeling.
- Development of two methods (VaLLA and FMGP) for uncertainty calibration in deterministic networks.
- Establishment of a unified probabilistic framework linking diversity, smoothness, and stochasticity in generalization.
- Derivation of PAC-Chernoff bounds that explain double-descent behavior.
Summary
This dissertation explores the intersection of Bayesian principles and modern deep learning, focusing on uncertainty estimation and generalization bounds. It introduces the Deep Variational Implicit Process (DVIP), a scalable Bayesian framework that extends implicit processes to deep architectures, enabling efficient variational inference and modeling of non-Gaussian priors. Additionally, two post-hoc methods, Variational Linearized Laplace Approximation (VaLLA) and Fixed-Mean Gaussian Process (FMGP), are proposed to enhance uncertainty calibration in pretrained networks. The theoretical contributions address the question of why large, over-parameterized neural networks generalize well, developing a unified probabilistic framework that connects diversity, smoothness, and stochasticity. This framework formalizes how ensemble diversity reduces generalization error and interprets smoothness in the loss landscape as a factor in empirical loss concentration. The PAC-Chernoff bounds derived provide insights into double-descent behavior, while stochasticity in optimization is analyzed as a regularization mechanism. Overall, the thesis argues that reliable generalization and calibrated uncertainty can be achieved through a Bayesian lens, offering both practical tools and theoretical insights into deep learning.
Methodology
The methodology includes the development of the DVIP framework for Bayesian inference in deep learning, alongside the introduction of VaLLA and FMGP for uncertainty estimation. The theoretical analysis employs PAC-Bayesian and large-deviation theory to explore generalization mechanisms in neural networks.
Results
The DVIP framework demonstrates competitive performance with deep Gaussian processes at a lower computational cost. The proposed methods (VaLLA and FMGP) yield well-calibrated predictions on large-scale tasks. The theoretical framework provides a coherent explanation for generalization in over-parameterized networks, supported by empirical results.
Implications
The findings suggest that integrating Bayesian principles into deep learning can enhance model reliability and performance, particularly in applications requiring uncertainty quantification. This work may influence future research directions in Bayesian deep learning and generalization theory.
Smoothing Dark Areas in Molecular Latent Diffusion
Generative Models
Graph Learning
Optimization
- Identification of dark areas in molecular latent space that lead to invalid molecule generation.
- Introduction of TopVAE, which incorporates topology and chemical constraints during training.
- Demonstrated improvements in off-posterior robustness and molecular generation quality.
- Achieved significant reductions in FCD3D metrics on QM9 and GEOM-Drugs datasets.
Summary
This paper addresses the challenges of latent diffusion models in 3D molecular generation, particularly the issue of 'dark areas' in the latent space that lead to the generation of chemically invalid or disconnected molecules. The authors propose TopVAE, a topology-optimized Variational Autoencoder (VAE) that internalizes structural and chemical constraints during training. This approach aims to create a smoother and more navigable latent space, enhancing the robustness of molecular generation. The paper identifies dark areas as regions in the latent space that, while reachable during diffusion sampling, result in invalid molecular outputs. TopVAE incorporates three main components: TopoBridge for ensuring connectivity, inherent chemical constraint learning through unrolled primal-dual optimization, and Advantage-Gated Constraint Learning (AGCL) for selectively feeding correction signals into the decoder. The results demonstrate that TopVAE significantly reduces the frequency of dark areas and improves the stability and connectivity of generated molecules, achieving state-of-the-art performance on benchmark datasets.
Methodology
The authors developed TopVAE, which includes a topology-first multi-stage decoder (TopoBridge) to ensure connectivity, and employs unrolled primal-dual optimization for chemical constraint learning. Additionally, Advantage-Gated Constraint Learning (AGCL) is used to selectively inject constraint-based corrections into the training process, allowing for inference without chemical corrections.
Results
TopVAE, when combined with a standard diffusion model (DiT), achieved a 77% reduction in FCD3D on the QM9 dataset and a 52% reduction on the GEOM-Drugs dataset. Furthermore, it produced 1.29 times more stable and connected molecules in zero-shot scaffold inpainting tasks, indicating a significant improvement in the quality of molecular generation.
Implications
The findings suggest that internalizing structural and chemical constraints during training can lead to more reliable molecular generation methods. This has potential applications in drug discovery and materials science, where generating valid molecular structures is crucial.
HAPI-EP: Towards Hybrid, Adaptive, and Predictive Digital Twins of Cardiac Electrophysiology
Theory
Generative Models
Optimization
- HAPI addresses the challenges of dynamic adaptation and predictive capability in digital twins for cardiac electrophysiology.
- The framework combines mechanistic models with neural networks to create a hybrid model that is interpretable and efficient.
- HAPI enables rapid adaptation to live data through few-shot learning and meta-learning techniques.
- The proposed approach ensures theoretical identifiability, enhancing the predictive performance of the digital twin.
Summary
The paper presents HAPI, a novel framework aimed at developing hybrid, adaptive, and predictive digital twins (DTs) for cardiac electrophysiology. The authors identify key challenges in creating DTs that can dynamically adapt to live patient data and maintain predictive capabilities. They propose a hybrid model that combines mechanistic and data-driven approaches, allowing for rapid adaptation through few-shot learning. HAPI integrates a physics-informed gray-box model with neural components to enhance interpretability and predictive performance. The framework's adaptability is achieved via feedforward meta-learners that optimize both mechanistic and neural parameters, ensuring theoretical identifiability. The authors demonstrate HAPI's effectiveness through a proof-of-concept study using a hybrid monodomain model, showcasing its strong predictive capabilities across both synthetic and real-world data scenarios. This work highlights the potential of HAPI to improve personalized cardiac care by enabling real-time updates and predictions based on individual patient data.
Methodology
The authors developed the HAPI framework, which constructs a hybrid model integrating a mechanistic backbone with a neural component. This model is adapted using few-shot learning techniques facilitated by feedforward meta-learners, which optimize the parameters of both the mechanistic and neural components. The framework emphasizes theoretical identifiability to ensure robust predictive capabilities.
Results
HAPI was validated through a proof-of-concept study in cardiac electrophysiology, demonstrating strong predictive and out-of-distribution capabilities. The hybrid model showed effective adaptation to live data, outperforming traditional models in terms of both speed and accuracy in predictions.
Implications
The HAPI framework has significant implications for personalized medicine, particularly in cardiac care, by enabling real-time updates and predictions based on individual patient data. This could lead to improved risk stratification, diagnosis, and treatment planning for patients with cardiac conditions.
Hierarchical ODE: Learning Continuous-Time Physical Prototypes for Early Link Failure Detection
Time Series
- Introduces a hierarchical ODE clustering network for time series prototype learning.
- Effectively disentangles smooth trends from stochastic noise in signal data.
- Autonomously determines the number of prototypes without rigid constraints.
- Validated on early link failure detection with irregularly sampled time series.
Summary
This paper addresses the challenges of time series prototype learning in the context of early link failure detection, particularly in environments with stochastic noise and observational ambiguity. The authors propose a hierarchical ordinary differential equation (ODE) clustering network that models latent state evolution as a continuous integral curve. This approach allows for the effective disentanglement of smooth feature trends from stochastic noise, overcoming the limitations of discrete architectures that struggle with continuous dynamics. The hierarchical mechanism autonomously determines the number of prototypes, adapting to the data without rigid prior constraints. The method is validated on the task of early link failure detection using irregularly sampled time series data, demonstrating its ability to extract underlying physical prototypes and enhance failure detection robustness. The results indicate that the proposed framework significantly improves the identification of distinct degradation patterns, thereby optimizing resource allocation in wireless communication networks.
Methodology
The authors utilize a hierarchical ordinary differential equation framework to model the continuous evolution of latent states as integral curves. This method enforces temporal continuity and allows for the separation of persistent degradation trends from transient noise. The hierarchical mechanism adaptively identifies the number of prototypes based on the data characteristics.
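The appeal of a continuous-time latent model for irregular sampling is that the state can be evaluated at whatever timestamps the observations happen to have. The sketch below integrates a scalar ODE with fixed-size Euler sub-steps between irregular timestamps. The linear decay dynamics are a hypothetical stand-in for a learned vector field, and nothing here reproduces the paper's hierarchical clustering.

```python
import math

def euler_integrate(f, z0, times, max_step=0.01):
    """Integrate dz/dt = f(z) and return the state at each requested time.

    times must be increasing; the gaps between them may be irregular.
    Each gap is split into equal Euler sub-steps no longer than max_step.
    """
    states = [z0]
    z, t = z0, times[0]
    for t_next in times[1:]:
        n_steps = max(1, math.ceil((t_next - t) / max_step))
        h = (t_next - t) / n_steps
        for _ in range(n_steps):
            z = z + h * f(z)
        states.append(z)
        t = t_next
    return states

def decay(z):
    """Hypothetical smooth degradation trend: dz/dt = -0.5 * z."""
    return -0.5 * z

irregular_times = [0.0, 0.3, 0.35, 1.2, 2.0]
trajectory = euler_integrate(decay, 1.0, irregular_times)
```

The returned trajectory is smooth by construction, so any residual between it and the noisy observations can be attributed to the stochastic component rather than the trend.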
Results
The proposed method demonstrated superior performance in early link failure detection compared to traditional discrete models. It effectively extracted physical prototypes from noisy time series data, leading to improved accuracy in predicting imminent link failures and optimizing resource reallocation in wireless communication networks.
Implications
The findings suggest that the hierarchical ODE approach can be applied to various domains requiring robust time series analysis, particularly in scenarios where continuous dynamics are critical for decision-making. This could enhance the reliability of systems in telecommunications and other fields reliant on real-time data interpretation.
Causal-Privacy Audit Workflow for Synthetic and Distilled Data in Dropout Support
Theory
- Introduces CaP-Eval, a workflow for auditing synthetic and distilled data in dropout support.
- Demonstrates that DPGNet and distilled data better preserve financial-status treatment effects compared to other methods.
- Highlights the importance of joint audits for predictive utility, privacy, and causal fidelity in educational data.
- Identifies the need for careful consideration of data generation methods in institutional decision-making processes.
Summary
This paper addresses the challenges of using synthetic and distilled student data for dropout support in higher education, emphasizing the need for a causal-privacy audit to ensure that generated data maintains both predictive utility and the integrity of financial-status evidence critical for institutional decision-making. The authors introduce CaP-Eval, a comprehensive workflow designed to evaluate the causal fidelity and privacy risks of various data generation methods, including distilled data, adversarial synthetic data, statistical synthetic data, and DPGNet. By applying this workflow to a rich dataset that includes demographic, socioeconomic, and academic variables, the study assesses how well these generated datasets can replicate the treatment-effect structures of original data. The findings indicate that DPGNet and distilled data outperform other methods in preserving the original financial-status treatment effects, with specific epsilon levels yielding varying degrees of fidelity. The paper concludes that a thorough joint audit of predictive utility, privacy orientation, and causal fidelity is essential before utilizing generated student data for decision-making in educational contexts.
Methodology
The study employs a decision-facing causal-privacy audit workflow (CaP-Eval) to evaluate various synthetic data generation methods against a fixed estimand and timing-aware adjustment design. It compares original, distilled, adversarial synthetic, statistical synthetic, and DPGNet datasets based on predictive utility, treatment-effect fidelity, and privacy governance metrics.
Results
The results reveal that DPGNet and distilled data maintain the original financial-status treatment-effect structure more reliably than adversarial and Gaussian Copula baselines. DPGNet showed full direction and rank agreement across different epsilon levels, with epsilon = 10 yielding the smallest deviations from original data. Distilled data exhibited strong local training-record proximity, while TabularGNet and Gaussian Copula showed moderate and compressed effects, respectively.
Implications
The findings suggest that institutions should adopt comprehensive audits of generated datasets to ensure they support accurate decision-making regarding student support and financial aid. This approach can enhance the reliability of learning analytics in higher education, ultimately improving student retention strategies.
Small LLMs: Pruning vs. Training from Scratch
Large Language Models
Efficient ML
NLP
- Pruning provides a strong initialization advantage over random initialization for small LLMs.
- The advantage of pruning diminishes as the pruning ratio increases and with extended training.
- When given a full token budget, training from scratch can match or exceed the performance of coarser pruning methods.
- Fine-grained pruning retains more knowledge transfer from the parent model compared to coarser methods.
Small LLMs: Pruning vs. Training from Scratch
Summary
This paper investigates the effectiveness of pruning large language models (LLMs) compared to training smaller models from scratch. The authors prune the Llama-3.1-8B model at various ratios (0.5–0.8) using six methods that vary in granularity (depth, width, and sparse pruning). They conduct experiments under two controlled settings: (1) equal training token budget, where pruned models are compared against randomly initialized models, and (2) equal total token budget, where the full token budget of the pruning pipeline is allocated to training from scratch. The findings reveal that pruned models consistently outperform randomly initialized ones under the same training token budget, although this advantage diminishes at higher pruning ratios. When training from scratch receives the full token budget, fine-grained pruning still maintains an edge, suggesting that knowledge transfer from the parent model is significant, while coarser pruning can be matched or surpassed. The study concludes that pruning is a more efficient strategy for creating small models when the training token budget is limited, while training from scratch can be competitive when resources are abundant.
Methodology
The authors employed a comparative analysis of six pruning methods across different granularities (depth, width, and sparse) on the Llama-3.1-8B model. They conducted experiments under two settings: one with an equal training token budget and another with an equal total token budget, measuring performance through validation loss and accuracy metrics.
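The three granularities named above can be illustrated on toy weight matrices. This is a minimal sketch, not the paper's pruning methods: the selection rules (drop the last layers, drop rows with the smallest L2 norm, zero the smallest-magnitude weights) and all function names are assumptions chosen for illustration, and the sketch ignores the downstream reshaping that real width pruning requires.

```python
import numpy as np

def depth_prune(layers, ratio):
    """Depth pruning: drop whole layers (here, simply the last ones; an assumption)."""
    keep = max(1, round(len(layers) * (1 - ratio)))
    return layers[:keep]

def width_prune(weight, ratio):
    """Width pruning: drop whole output neurons (rows) with the smallest L2 norm."""
    keep = max(1, round(weight.shape[0] * (1 - ratio)))
    strongest = np.argsort(np.linalg.norm(weight, axis=1))[-keep:]
    return weight[np.sort(strongest)]

def sparse_prune(weight, ratio):
    """Sparse pruning: zero individual small-magnitude weights; the shape is unchanged."""
    k = round(weight.size * ratio)
    if k == 0:
        return weight.copy()
    threshold = np.sort(np.abs(weight), axis=None)[k - 1]
    return np.where(np.abs(weight) <= threshold, 0.0, weight)

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in range(4)]  # toy stand-in for a model
ratio = 0.5

print("layers kept:", len(depth_prune(layers, ratio)))
print("width-pruned shape:", width_prune(layers[0], ratio).shape)
print("nonzero weights after sparse pruning:", np.count_nonzero(sparse_prune(layers[0], ratio)))
```

Depth and width pruning shrink the architecture itself, whereas sparse pruning keeps every tensor shape and only zeroes entries, which is why it counts as the finest granularity of the three.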
Results
The results indicate that pruned models consistently outperform randomly initialized models under the same training token budget, with diminishing returns at higher pruning ratios. When the full token budget is utilized, pruned models still show superior performance, particularly with fine-grained pruning methods, while coarser pruning approaches can be matched or surpassed by training from scratch.
Implications
The findings suggest that pruning can be an effective strategy for developing smaller, efficient language models, particularly in resource-constrained environments. This has implications for deploying LLMs in applications where computational efficiency and accessibility are critical.
Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems
Large Language Models
Graph Learning
Optimization
- GTBP introduces a graph-based framework for context adaptation in multi-LLM systems.
- The method addresses credit assignment issues by propagating local target outputs rather than relying on gradients.
- Theoretical analysis confirms the stability of prompt updates and the ability to decrease task-level objectives.
- Empirical validation shows GTBP outperforms existing methods on benchmark datasets.
Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems
Summary
This paper presents Graph-based Target Back-Propagation (GTBP), a framework for context adaptation in multi-LLM agentic systems. Existing methods suffer from inaccurate credit assignment and lack convergence guarantees. The authors address both problems with a graph-based approach that propagates local target outputs backward through a directed acyclic graph representing the workflow. GTBP uses the discrepancies between target outputs and actual outputs to guide a stage-wise prompt update mechanism, so prompts are adapted without modifying model weights. The theoretical analysis shows that GTBP's prompt updates stabilize over iterations and that a capable LLM optimizer can reduce the overall task-level objective. Empirical results show that GTBP consistently outperforms strong baseline methods across three benchmark datasets at similar computational cost, highlighting its effectiveness in automating prompt adaptation in complex multi-LLM systems.
Methodology
GTBP operates by modeling agentic workflows as directed acyclic graphs, where local target outputs are inferred from final outputs. It uses target-output discrepancies to update prompts in a stage-wise manner, avoiding the need for gradient-based credit assignment. The framework is theoretically analyzed for stability and optimization capabilities.
Results
GTBP demonstrated superior performance compared to strong baseline methods across three benchmark datasets, achieving better prompt optimization while maintaining similar computational costs.
Implications
The proposed GTBP framework can significantly enhance the efficiency of context adaptation in multi-LLM systems, making it a practical alternative to manual prompt engineering. This has potential applications in various domains requiring complex multi-step problem-solving using LLMs.
Beyond Defensive Reporting: Machine Learning for Active Anti-Money Laundering Control in Insurance
Theory
Optimization
- First empirical study of machine learning for money laundering detection in insurance claims.
- Shift from passive reporting to active prevention in anti-money laundering strategies.
- Incorporation of insurance fraud labels improves detection of laundering cases.
- Introduction of the Budget-Weighted Capture Rate metric for evaluating model performance.
Beyond Defensive Reporting: Machine Learning for Active Anti-Money Laundering Control in Insurance
Summary
This paper addresses the challenge of detecting money laundering in insurance claims, a significant issue for insurers due to the risks of fraudulent payouts and regulatory repercussions. The authors propose a shift from passive reporting to active prevention by leveraging machine learning techniques. They utilize production data from a major Norwegian insurer, training gradient-boosted decision tree models to identify claims that have been flagged for suspected money laundering. The study also explores the potential of using insurance fraud labels as auxiliary training signals, given the behavioral similarities between fraud and laundering. A novel metric, the Budget-Weighted Capture Rate, is introduced to evaluate the effectiveness of the models in capturing laundering cases when only a small fraction of claims can be manually reviewed. The findings reveal that integrating fraud-related investigation labels significantly enhances laundering detection, with the best model capturing nearly two-thirds of laundering cases within the top 2-6% of claims selected for investigation. This research represents the first empirical study applying machine learning to money laundering detection in the insurance sector, emphasizing the importance of proactive measures in combating financial crime.
Methodology
The authors trained gradient-boosted decision tree models on a dataset of insurance claims from Fremtind Insurance, focusing on claims later reported for suspected money laundering. They compared various learning setups, including binary and multiclass models, and utilized fraud investigation labels to enhance model training under conditions of extreme class imbalance.
Results
The study found that incorporating fraud-related labels significantly improved the detection of money laundering cases. The optimal model achieved a capture rate of nearly 66% for laundering cases within the top 2-6% of claims selected for further investigation, demonstrating the effectiveness of the proposed machine learning approach.
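The exact definition of the Budget-Weighted Capture Rate is not given in this summary. The sketch below therefore shows only the underlying idea, the share of laundering cases captured when a fixed fraction of claims can be reviewed, plus a hypothetical weighted average across budgets. The function names, the weighting scheme, and the toy data are all assumptions.

```python
import numpy as np

def capture_rate(scores, labels, budget):
    """Share of positive cases found in the top `budget` fraction of claims by score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_review = max(1, int(round(budget * len(scores))))
    top = np.argsort(-scores, kind="stable")[:n_review]
    return labels[top].sum() / labels.sum()

def weighted_capture_rate(scores, labels, budgets, weights):
    """Hypothetical aggregate: weighted mean of capture rates over several budgets."""
    weights = np.asarray(weights, dtype=float)
    rates = np.array([capture_rate(scores, labels, b) for b in budgets])
    return float(np.dot(weights, rates) / weights.sum())

# Toy data: 100 claims ranked by model score, 3 of them true laundering cases.
scores = np.linspace(1.0, 0.0, 100)
labels = np.zeros(100, dtype=int)
labels[[0, 3, 50]] = 1

print(capture_rate(scores, labels, 0.02))   # reviewing the top 2% of claims
print(capture_rate(scores, labels, 0.06))   # reviewing the top 6% of claims
print(weighted_capture_rate(scores, labels, [0.02, 0.06], [1.0, 1.0]))
```

A metric of this shape rewards models that rank true cases near the top, which matters more than overall accuracy when positives are extremely rare and reviewer capacity is fixed.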
Implications
The findings suggest that insurers can enhance their anti-money laundering efforts by adopting machine learning techniques for proactive claim assessment. This approach not only improves detection rates but also helps mitigate reputational and regulatory risks associated with fraudulent payouts. The research could inform the development of more effective AML strategies in the insurance sector and potentially in other financial domains.
Utility-Constrained Policy Optimization
Reinforcement Learning
Optimization
Robotics
- Introduces a modification to UCMDPs that allows for practical solutions and optimal Markov policies.
- Presents utility-constrained policies (UCP), a Lagrangian deep RL algorithm for UCMDPs.
- Demonstrates strong performance of UCP in Safety Gymnasium benchmarks, outperforming existing methods.
- Allows for post-training adjustment of constraint limits, enhancing policy flexibility.
Utility-Constrained Policy Optimization
Summary
This paper addresses the limitations of constrained Markov decision processes (CMDPs) in reinforcement learning (RL), particularly their inability to incorporate risk-sensitive constraints. The authors build on utility-constrained MDPs (UCMDPs), proposing a modification that makes them practically solvable and integrates risk-sensitive constraints without requiring constraint limits to be fixed before training. This flexibility enhances policy adaptability and performance. The resulting algorithm, utility-constrained policies (UCP), is a Lagrangian deep RL approach for solving UCMDPs. The authors validate their approach through extensive experiments on the Safety Gymnasium benchmark, demonstrating that UCP consistently matches or outperforms existing baselines while adhering to risk-sensitive constraints. The findings suggest that incorporating risk-sensitive constraints can lead to more reliable and effective RL policies, particularly in safety-critical applications.
Methodology
The authors modify the existing framework of utility-constrained MDPs to create a practical solution for incorporating risk-sensitive constraints. They develop a Lagrangian deep reinforcement learning algorithm called utility-constrained policies (UCP) that allows for dynamic adjustment of constraint limits during policy execution. This approach leverages dynamic programming techniques to solve the modified UCMDPs effectively.
Results
The experimental results indicate that UCP achieves comparable or superior performance to existing baselines in various Safety Gymnasium tasks. The agent demonstrates a significant reduction in the tail of the cost distribution, effectively managing risk-sensitive constraints while maintaining high episodic returns. The empirical analysis shows that UCP with risk-sensitive constraints outperforms its risk-neutral counterpart, particularly in terms of constraint adherence.
Implications
The proposed framework and methodology have significant implications for developing RL agents in safety-critical applications where risk management is essential. By allowing for risk-sensitive constraints, the approach can lead to more reliable and effective decision-making in environments where catastrophic failures must be minimized.
Benchmarking Instance-Dependent Label Noise with Controlled Corruptions
Computer Vision
Theory
- CILN framework allows for explicit control over the source and severity of instance-dependent label noise.
- The benchmarks created using CILN demonstrate realistic noise patterns that better reflect human uncertainty.
- Corruption-mediated IDN can expose failure modes in existing noisy-label learning methods.
- The study emphasizes the importance of noise structure in evaluating machine learning algorithms.
Benchmarking Instance-Dependent Label Noise with Controlled Corruptions
Summary
This paper introduces CILN (Corruption-Induced Label Noise), a novel framework for generating benchmarks that simulate instance-dependent label noise (IDN) through controlled input corruptions. Unlike existing methods that rely on imperfect annotators or classifiers to create noise, CILN explicitly manipulates the input data using various corruption techniques, such as noise, blur, and geometric distortions. This approach allows for the generation of benchmark datasets where the source and severity of label noise are both transparent and adjustable. The authors constructed 90 benchmark settings across datasets like CIFAR-10, MNIST, and Adult, demonstrating that these benchmarks exhibit realistic IDN and can reveal the limitations of popular noisy-label learning algorithms, such as Co-Teaching and DivideMix. The findings indicate that the structure of noise significantly influences the performance of learning algorithms, emphasizing the importance of understanding the mechanisms behind label noise rather than merely its rate. CILN thus provides a valuable tool for researchers to systematically study the effects of different types of input corruption on label uncertainty and classifier performance.
Methodology
The authors developed the CILN framework, which applies controlled corruptions to clean input data to generate noisy labels. A diverse voter pool evaluates the corrupted instances, producing soft label distributions that reflect the uncertainty introduced by the corruptions. The framework was tested on multiple datasets, including CIFAR-10 and MNIST, across various corruption families and severity levels.
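The corrupt-then-vote pipeline described above can be sketched on synthetic data. This is a toy illustration under stated assumptions, not the CILN implementation: the two Gaussian blobs stand in for real inputs, additive Gaussian noise is the only corruption family, and the voter pool consists of hypothetical nearest-centroid classifiers fit on bootstrap resamples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean two-class data: well-separated Gaussian blobs (a stand-in for real inputs).
n = 200
y_clean = np.repeat([0, 1], n // 2)
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])
x_clean = centers[y_clean] + rng.normal(size=(n, 2))

def corrupt(x, severity, rng):
    """One corruption family (additive Gaussian noise) with adjustable severity."""
    return x + severity * rng.normal(size=x.shape)

def fit_voter(x, y, rng):
    """A voter is a nearest-centroid classifier fit on a bootstrap resample."""
    idx = rng.integers(0, len(x), size=len(x))
    xb, yb = x[idx], y[idx]
    return np.stack([xb[yb == c].mean(axis=0) for c in (0, 1)])

def soft_labels(x, voters):
    """Fraction of voters choosing each class for every (possibly corrupted) input."""
    votes = np.stack([
        np.argmin(np.linalg.norm(x[:, None, :] - cents[None, :, :], axis=2), axis=1)
        for cents in voters
    ])  # shape: (n_voters, n_samples)
    return np.stack([(votes == c).mean(axis=0) for c in (0, 1)], axis=1)

voters = [fit_voter(x_clean, y_clean, rng) for _ in range(15)]

for severity in (0.0, 2.0, 8.0):
    probs = soft_labels(corrupt(x_clean, severity, rng), voters)
    noise_rate = np.mean(probs.argmax(axis=1) != y_clean)
    print(f"severity={severity}: label noise rate={noise_rate:.2f}")
```

Because a label flips only when the corruption pushes that particular input across the voters' decision boundaries, the resulting noise depends on the instance, and the severity parameter controls how much of it appears.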
Results
The experiments revealed that the benchmarks created with CILN exhibit genuine instance-dependent noise and diverse confusion structures. Notably, the label distributions produced on CIFAR-10 were found to be closer to human uncertainty compared to existing synthetic IDN benchmarks. Furthermore, the study identified specific failure modes in popular noisy-label learning methods that were not apparent under traditional rater-fallibility noise conditions.
Implications
CILN provides a robust framework for evaluating and improving noisy-label learning algorithms by allowing researchers to systematically investigate how different types of input corruption affect model performance. This could lead to the development of more resilient machine learning systems capable of handling real-world label noise more effectively.
How Post-Training Shapes Biological Reasoning Models
Multimodal
- Post-training stages uniquely influence model generalization in biological reasoning.
- CPT aligns models with biological language, improving downstream performance.
- SFT enhances in-domain performance but can lead to over-specialization and decline in out-of-domain performance.
- Reinforcement learning can recover generalization when applied to strong SFT checkpoints.
How Post-Training Shapes Biological Reasoning Models
Summary
This paper investigates the effects of post-training on biological reasoning models that integrate language models with multimodal biological data, such as DNA, RNA, and proteins. The authors conduct a controlled study involving over 100 models to understand how different post-training stages—continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL)—affect model performance in both in-domain (ID) and out-of-domain (OOD) contexts. The findings reveal that each post-training stage uniquely influences generalization, with CPT aligning models with biological language, SFT enhancing ID performance but leading to OOD performance decline, and RL improving OOD performance when applied to strong SFT checkpoints. The study emphasizes that biological reasoning does not improve linearly with added supervision or compute, and optimal performance is achieved through strategic allocation of post-training resources. The paper provides insights into training dynamics and offers practical design principles for post-training under limited computational budgets.
Methodology
The authors trained and evaluated over 100 biological reasoning models across genomics, transcriptomics, and protein tasks. They systematically varied the backbone architecture, continued pre-training, supervised fine-tuning, and reinforcement learning to isolate their effects on model performance in both in-domain and out-of-domain settings.
Results
The study found that CPT improves downstream adaptation, SFT consistently increases ID performance but narrows OOD robustness, and RL enhances OOD performance when initialized from strong SFT checkpoints. The results indicate that performance is contingent on the composition of training stages rather than a simple increase in supervision or compute.
Implications
The findings suggest that careful design of post-training strategies is crucial for developing robust biological reasoning models. This has implications for future research in biological modeling and could inform the development of more effective training protocols in other domains where generalization is critical.
Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models
Audio & Speech
- Introduces a cough regression benchmark evaluating five respiratory acoustic foundation models across multiple health targets.
- Demonstrates that the MLP-small head outperforms the mean-predictor baseline on all tasks and linear probing in 23 of 30 model-task combinations.
- Finds that generative pretraining provides an advantage in age regression tasks.
- Highlights the asymmetry in cross-dataset transfer: models trained on larger datasets generalize better to smaller clinical populations.
Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models
Summary
This paper addresses the underexplored capability of respiratory acoustic foundation models (FMs) to predict continuous health metrics from cough audio, such as age, BMI, and disease probability. The authors introduce a comprehensive cough regression benchmark that evaluates five different FMs (OPERA-CT, OPERA-CE, OPERA-GT, HEAR, M2D+RESP) across six health targets using three distinct datasets. The study employs subject-disjoint protocols and compares various regression heads, including linear and multi-layer perceptron (MLP) architectures. Results indicate that the MLP-small architecture outperforms the mean-predictor baseline across all tasks and linear probing in 23 out of 30 model-task combinations. The findings reveal a dataset-size and head-capacity trade-off, with HEAR achieving the best performance in age regression on the Coswara dataset. The paper also highlights the generative pretraining advantage of OPERA-GT over OPERA-CT and discusses the asymmetry in cross-dataset transfer, where larger datasets generalize better to smaller clinical populations. Additionally, the analysis shows that HEAR and M2D+RESP can achieve near-full performance with only 50 samples, while OPERA models require 400 samples, suggesting that the diversity of the pretraining corpus significantly influences low-data regression performance.
Methodology
The authors designed a benchmark that evaluates five frozen foundation models on three cough datasets, employing a unified pipeline for audio processing. They utilized subject-disjoint splits for training and testing, and compared different regression heads, including linear and MLP architectures, to assess performance across various health metrics.
Results
The MLP-small architecture outperformed the mean-predictor baseline in all tasks and linear probing in 23 of 30 model-task combinations. HEAR achieved the lowest mean absolute error (MAE) for age regression on the Coswara dataset. OPERA-GT consistently outperformed OPERA-CT across datasets, and cross-dataset transfer showed significant asymmetry, with larger datasets generalizing better to smaller clinical populations.
Implications
The findings suggest that cough audio can serve as a valuable tool for estimating health metrics in settings where traditional measurements are unavailable, particularly in low- and middle-income countries. The study also emphasizes the importance of model architecture and pretraining diversity in achieving reliable performance in clinical applications.
Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance
Graph Learning
Theory
Efficient ML
- Persistent Laplacians offer a richer geometric representation than persistent homology but face challenges in high-dimensional data.
- The proposed compact spectral representation consists of Betti numbers, the spectral gap, and analytic torsion, which together capture the essential predictive signals.
- The new representation outperforms traditional methods in some cases while reducing computational complexity and noise.
- Analytic torsion serves as a mathematically grounded feature that simplifies the transition from raw spectral data to predictive performance.
Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance
Summary
This paper addresses the challenges associated with utilizing persistent Laplacians (PL) in machine learning due to high dimensionality and varying lengths of spectra across filtration scales. The authors propose a compact spectral representation that distills the persistent Laplacian into three key invariants: Betti numbers, the spectral gap, and analytic torsion. By applying this reduced feature space to benchmark datasets such as MNIST, QM-3D, and SKEMPI WT, the authors demonstrate that these invariants effectively capture the essential predictive signals of the full spectrum. In some instances, this approach outperforms traditional methods while significantly reducing computational overhead and mitigating noise from higher-frequency eigenvalues. The findings suggest that these mathematical invariants provide a principled and fixed-length interface between spectral geometry and topological learning, enhancing the applicability of persistent Laplacians in various machine learning tasks.
Methodology
The authors developed a compact representation of persistent Laplacians by extracting three invariants: Betti numbers, spectral gap, and analytic torsion. They applied this representation to various benchmark datasets and utilized standard machine learning models such as Random Forests and Gradient Boosting Regressors to evaluate predictive performance.
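To make the three invariants concrete, the sketch below reduces the spectrum of a single ordinary graph Laplacian to a fixed-length summary. It rests on assumptions: real persistent Laplacians are defined per dimension and per pair of filtration values, and analytic torsion combines log-determinants across dimensions. Here the log pseudo-determinant of one Laplacian serves as a torsion-like stand-in, and `spectral_summary` is a hypothetical name.

```python
import numpy as np

def spectral_summary(laplacian, tol=1e-9):
    """Reduce a Laplacian spectrum to (Betti number, spectral gap, log pseudo-determinant)."""
    eig = np.linalg.eigvalsh(laplacian)
    positive = eig[eig > tol]
    betti = int(np.sum(eig <= tol))                    # multiplicity of the zero eigenvalue
    gap = float(positive.min()) if positive.size else 0.0  # smallest nonzero eigenvalue
    log_det = float(np.log(positive).sum())            # log of the product of nonzero eigenvalues
    return betti, gap, log_det

# Graph: a path on nodes 0-1-2 plus an isolated node 3 (two connected components).
laplacian = np.array([
    [ 1.0, -1.0,  0.0, 0.0],
    [-1.0,  2.0, -1.0, 0.0],
    [ 0.0, -1.0,  1.0, 0.0],
    [ 0.0,  0.0,  0.0, 0.0],
])

betti, gap, log_det = spectral_summary(laplacian)
print(betti, gap, log_det)
```

The spectrum here is {0, 0, 1, 3}, so the summary reports two components, a gap of 1, and a log pseudo-determinant of log 3. Whatever the size of the spectrum, the output has the same length, which is the property that makes such invariants usable as features for standard models.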
Results
The results indicate that the proposed compact representation captures the essential predictive signals of the full persistent Laplacian spectrum. In benchmark tests, this approach not only reduced computational overhead but also outperformed traditional methods in certain scenarios, demonstrating the effectiveness of the selected invariants.
Implications
The findings suggest that incorporating analytic torsion and spectral gap into machine learning models can enhance the performance of topological data analysis. This approach could be particularly beneficial in fields requiring the analysis of complex geometric and topological structures, such as biomolecular prediction and graph learning.
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
NLP
Large Language Models
- MyPCBench addresses the gap in existing benchmarks by simulating a personal computing environment with a coherent user identity.
- The benchmark includes 184 tasks that reflect real-world requests, enhancing the relevance of the evaluation.
- The best-performing model, Claude Opus 4.6, solves only 55.4% of tasks, indicating significant room for improvement in multi-application interactions.
- The study emphasizes the importance of personalization in evaluating agent performance, particularly in complex, multi-turn tasks.
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
Summary
The paper introduces MyPCBench, a novel benchmark designed to evaluate computer-use agents as personal assistants in a realistic Linux desktop environment. Unlike existing benchmarks that operate in impersonal settings, MyPCBench incorporates a canonical persona, Michael Scott, and simulates a full desktop experience with 17 real-world web applications. The benchmark consists of 184 tasks inspired by real user requests from the OpenClaw community, allowing for a more authentic assessment of agent capabilities. The authors benchmark six models, including Claude Opus 4.6, which achieves the highest success rate of 55.4% across tasks. The study reveals that model failures are prevalent in tasks requiring interaction across multiple applications, highlighting the challenges of personalization in agent performance. The authors provide a reproducible environment, task set, and evaluation framework, contributing significantly to the field of personalized computer-use agents.
Methodology
The authors developed a reproducible Linux desktop environment populated with 17 simulated web applications and a detailed persona. They defined 184 tasks based on real user requests and benchmarked six different models using a uniform computer+bash tool surface. A rubric-grading system was employed to evaluate model performance, allowing for partial credit on tasks.
Results
The benchmarking results showed that the highest-performing model, Claude Opus 4.6, fully solved 55.4% of the tasks, with only 36% success on tasks involving seven or more applications. Other models performed significantly worse, with GPT-5.5 achieving just 4.5% success on complex tasks.
Implications
MyPCBench has the potential to advance the development of personalized computer-use agents by providing a realistic evaluation framework. It encourages the design of models that can better handle complex, multi-application tasks, ultimately improving user experience in personal assistant technologies.
Not all Jensen-Shannon Divergence Estimators are Equal
Theory
- Different estimation protocols yield substantially different Jensen-Shannon divergence values.
- Marginal estimators underestimate divergence by ignoring joint dependencies.
- Classifier-based estimators are sensitive to model choice and data characteristics.
- A closed-form posterior correction is proposed to address prior shift in classifier-based estimations.
Not all Jensen-Shannon Divergence Estimators are Equal
Summary
This paper investigates the estimation of Jensen-Shannon divergence (DJS), a common metric for evaluating the fidelity of synthetic tabular data. The authors highlight that the empirical estimation of DJS is highly dependent on the chosen estimator family, sampling protocol, and data characteristics, leading to potentially incomparable results across studies. They categorize the estimation methods into marginal-based and classifier-based approaches, revealing that marginal estimators often underestimate divergence by ignoring joint dependencies, while classifier-based estimators are sensitive to model choice and data regimes. The authors conduct a systematic empirical analysis to demonstrate these behaviors and propose a closed-form posterior correction to address prior shift issues in classifier-based estimations. They emphasize the necessity of explicitly specifying the estimation protocol to ensure meaningful comparisons and provide practical guidelines along with an open-source tool for estimator-aware DJS evaluation.
Methodology
The authors perform a systematic empirical study comparing marginal and classifier-based Jensen-Shannon divergence estimators across various controlled settings, including dependence structure, sample size, dimensionality, and class imbalance. They analyze the performance of these estimators on synthetic data generated by both Variational Autoencoders and Generative Adversarial Networks, and derive a closed-form correction for classifier-based estimations.
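A small worked example shows why marginal estimators can miss divergence that lives in the joint structure. It is a sketch on hand-built discrete distributions, not the paper's experimental setup: the "real" table has two perfectly correlated binary features, the "synthetic" table has the same marginals but independent features, and the helper name `jsd` is an assumption.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = np.array([[0.5, 0.0],
                 [0.0, 0.5]])        # two binary features, perfectly correlated
synth = np.full((2, 2), 0.25)        # same marginals, features independent

joint = jsd(real, synth)
marginal = np.mean([jsd(real.sum(axis=1), synth.sum(axis=1)),
                    jsd(real.sum(axis=0), synth.sum(axis=0))])

print("joint JSD:", joint)
print("mean per-feature JSD:", marginal)
```

The per-feature estimate is exactly zero because both tables have identical marginals, while the joint divergence is 0.75 · ln(4/3), roughly 0.216 nats. Two reports quoting "the" Jensen-Shannon divergence for this pair could therefore disagree entirely depending on the protocol.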
Results
The study reveals that empirical Jensen-Shannon divergence values are inherently protocol-dependent, with marginal estimators showing significant underestimation of divergence due to their disregard for joint dependencies. Classifier-based estimators exhibit variability based on model and data regimes, leading to inconsistent evaluation outcomes. The proposed posterior correction effectively restores consistency under prior shift conditions.
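The paper's closed-form correction is not reproduced in this summary. As an illustration of the general idea, the sketch below applies the standard Bayes-rule rebalancing of a classifier posterior from its training prior to a balanced prior, then plugs it into the variational identity JSD = ln 2 + ½·E_P[ln d(x)] + ½·E_Q[ln(1 − d(x))], where d(x) is the balanced-prior probability that x is real. Exact densities replace a trained classifier, and all names are assumptions.

```python
import numpy as np

def correct_posterior(d, train_prior, target_prior=0.5):
    """Re-weight P(real | x) from the training class prior to the target prior."""
    num = d * target_prior / train_prior
    den = num + (1.0 - d) * (1.0 - target_prior) / (1.0 - train_prior)
    return num / den

def jsd_from_posterior(d, p, q):
    """Classifier-based JSD (nats); valid when d is the balanced-prior posterior."""
    return float(np.log(2.0) + 0.5 * np.sum(p * np.log(d)) + 0.5 * np.sum(q * np.log(1.0 - d)))

def jsd_direct(p, q):
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

p = np.array([0.6, 0.3, 0.1])   # "real" distribution over three categories
q = np.array([0.2, 0.3, 0.5])   # "synthetic" distribution
train_prior = 0.8               # 80% of the classifier's training rows were real

# Posterior an ideal classifier would learn from the imbalanced training set.
d_shifted = train_prior * p / (train_prior * p + (1.0 - train_prior) * q)

exact = jsd_direct(p, q)
biased = jsd_from_posterior(d_shifted, p, q)
corrected = jsd_from_posterior(correct_posterior(d_shifted, train_prior), p, q)

print("exact:", exact)
print("uncorrected:", biased)
print("corrected:", corrected)
```

With an 80/20 class split the uncorrected estimate is badly off (it even turns negative in this example), while the rebalanced posterior recovers the exact value. With a real classifier the expectations become sample averages and estimation error remains, so this demonstrates only the prior-shift component.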
Implications
The findings underscore the importance of careful selection and specification of divergence estimation protocols in synthetic data evaluation, particularly in privacy-preserving contexts. The guidelines and open-source tool provided can enhance the reliability of divergence assessments in various machine learning applications.
Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport
Optimization
Interpretability
- Introduces two inverse optimal transport models for estimating urban access costs.
- Demonstrates the application of the framework to large-scale school choice data in the Philippines.
- Estimates a subsidy-equivalent distance metric that quantifies perceived travel cost offsets.
- Highlights the spatial variability of subsidy impacts on accessibility.
Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport
Summary
This paper addresses the challenge of understanding urban access costs in mixed public-private service networks, specifically through the lens of school choice in the Philippines. The author investigates how households navigate trade-offs between distance, price, and institutional access when selecting schools, particularly in the context of a national education subsidy aimed at alleviating congestion in public schools. By treating school-to-school enrollment flows as an entropic optimal transport plan, the study employs two complementary inverse optimal transport models: a distance-banded piecewise model and a neural cost model. The framework is applied to a dataset comprising 283,016 learner trips across 23,820 observed flows, allowing for the estimation of a subsidy-equivalent distance metric, which quantifies the perceived travel cost offset by the subsidy. The findings reveal that the impact of subsidies on accessibility is spatially variable, suggesting that uniform subsidies may lead to unequal accessibility gains depending on the proximity of subsidized facilities to demand. This work not only contributes to urban planning metrics but also demonstrates the potential for administrative origin-destination data to inform subsidy design and urban service allocation.
Methodology
The study utilizes inverse optimal transport to recover latent cost functions from observed school-to-school enrollment flows. It employs two models: a distance-banded piecewise model with explicit subsidy terms and a neural cost model trained via a differentiable Sinkhorn forward pass. The models are calibrated using a large dataset from the Philippine Department of Education.
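The forward pass that such models differentiate through can be sketched with plain Sinkhorn iterations. This is a minimal NumPy illustration under assumptions: the cost is a hypothetical linear function of distance with coefficient `theta`, the regularization `eps` and the toy marginals are made up, and the inverse step (adjusting the cost parameters until the plan matches observed flows) is not shown.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.5, n_iter=500):
    """Entropic optimal transport plan between marginals a (origins) and b (destinations)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (kernel @ v)       # rescale rows to match origin masses
        v = b / (kernel.T @ u)     # rescale columns to match destination masses
    return u[:, None] * kernel * v[None, :]

# Toy setting: 3 origins, 2 destinations, distances in km (all values hypothetical).
dist_km = np.array([[1.0, 4.0],
                    [3.0, 1.0],
                    [5.0, 2.0]])
theta = 0.3                          # hypothetical perceived cost per km
a = np.array([0.5, 0.3, 0.2])        # share of learners at each origin
b = np.array([0.6, 0.4])             # share of enrollment at each destination

plan = sinkhorn_plan(theta * dist_km, a, b)
print(plan.round(3))
print(plan.sum(axis=1), plan.sum(axis=0))
```

Every operation in the loop is differentiable, so in an autodiff framework the same iteration lets gradients of a flow-matching loss reach the cost parameters. That is the role a differentiable Sinkhorn forward pass plays in inverse optimal transport.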
Results
The piecewise model achieved a reduction in the cost function error from 5.00 to 3.29, while the neural model further improved the fit, achieving an error of 2.84. The estimated subsidy-equivalent distances indicate how many kilometers of perceived travel cost are offset by a 1,000-peso subsidy, with values varying across distance bands. For instance, a 1,000-peso increase in subsidy offsets approximately 6.07 km of perceived travel cost in the 5–15 km band.
Implications
The findings suggest that urban subsidies have a significant spatial footprint, affecting accessibility differently based on the location of subsidized facilities relative to demand. This insight can inform more effective subsidy calibration, facility placement, and urban service allocation strategies, potentially extending to other sectors such as healthcare and transit.
Provably Safe, Yet Scalable Reinforcement Learning
Reinforcement Learning
Robotics
Theory
- Introduction of the PS2-RL framework for scalable and provably safe RL.
- Utilization of a learned backup policy to create implicit control-invariant sets.
- Development of a novel safe-arrival value function for optimal policy training.
- Implementation of a control-invariant layer for efficient end-to-end training.
Summary
The paper introduces the PS2-RL framework, a novel two-phase architecture for safe reinforcement learning (RL) that aims to optimize rewards while ensuring safety constraints are met. Traditional approaches often rely on soft-constrained policy optimization, which lacks formal safety guarantees, or on explicit certificate functions that are computationally intensive and scale poorly with state dimensions. The PS2-RL framework addresses these limitations by leveraging a learned backup policy to create an implicit control-invariant set online, thus avoiding the need for explicit invariant set computation. In the first phase, the backup policy is trained using a safe-arrival value function, which defines the optimal policy for constructing the invariant set. The second phase involves training an RL policy through a differentiable projection layer that enforces safety guarantees derived from the backup policy. This approach allows for high-dimensional, input-constrained systems to be managed effectively without sacrificing performance. The authors provide theoretical guarantees for the framework and demonstrate its effectiveness in robotic control tasks, achieving 100% safety during both training and deployment while outperforming existing methods.
Methodology
The PS2-RL framework consists of two phases: Phase I focuses on training a backup policy using a safe-arrival value function to define optimal behavior for reaching a target safely. Phase II involves training the final RL policy through a differentiable projection layer that enforces safety constraints derived from the backup policy, allowing for scalable and efficient training without the need for explicit invariant set synthesis.
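The role of the backup policy can be illustrated with a toy safety filter. The sketch below is an assumption-laden simplification, not the PS2-RL method: it uses a one-dimensional integrator, a hand-written `backup_policy` in place of a learned one, and a hard switch in place of the paper's differentiable projection layer. All function names are hypothetical.

```python
def step(x, u, dt=0.1):
    """Toy single-integrator dynamics: x_next = x + u * dt."""
    return x + u * dt


def backup_policy(x):
    """Hand-written stand-in for a learned backup policy: steer toward 0."""
    if x > 0:
        return -1.0
    if x < 0:
        return 1.0
    return 0.0


def is_safe(x, limit=1.0):
    """State constraint |x| <= limit."""
    return abs(x) <= limit


def backup_keeps_safe(x, horizon=20):
    """Roll the backup policy forward and check every visited state."""
    for _ in range(horizon):
        if not is_safe(x):
            return False
        x = step(x, backup_policy(x))
    return is_safe(x)


def safety_filter(x, u_rl):
    """Accept the RL action only if the backup policy can keep the
    resulting state safe; otherwise fall back to the backup action."""
    u = max(-1.0, min(1.0, u_rl))  # input constraint |u| <= 1
    if backup_keeps_safe(step(x, u)):
        return u
    return backup_policy(x)
```

The set of states from which the backup rollout stays safe is never computed explicitly; it is only queried online, which is the sense in which the invariant set is implicit.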
Results
The PS2-RL framework was evaluated on robotic control tasks, including unicycle lane-keeping and powerloop tracking for a 10-dimensional quadrotor. PS2-RL achieved 100% safety during both training and deployment while also exceeding the performance of all baseline methods.
Implications
The PS2-RL framework has significant implications for the deployment of reinforcement learning in real-world applications, particularly in robotics, where safety is paramount. Its ability to ensure safety without excessive conservatism opens avenues for more reliable and efficient control systems in complex environments.
Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model
Theory
Efficient ML
- Neural networks can achieve optimal sample complexity for learning Gaussian single-index models.
- The proposed algorithm adapts to various loss and activation functions, enhancing its applicability.
- A novel weight perturbation technique is introduced to handle k-sparse signals effectively.
- The results demonstrate that neural networks can match SQ lower bounds, addressing open problems in the field.
Summary
This paper investigates whether neural networks trained via gradient-based methods can achieve the optimal computational-statistical tradeoff when learning Gaussian single-index models. The authors highlight a known sample complexity lower bound for polynomial-time algorithms under the statistical query (SQ) framework, which requires Ω(d^{s⋆/2} ∨ d) samples, where s⋆ represents the generative exponent of the model. The study proposes a unified gradient-based algorithm for training a two-layer neural network that adapts to various loss and activation functions. The algorithm is shown to learn a feature representation closely aligned with the unknown signal θ⋆, achieving a sample complexity of Õ(d^{s⋆/2} ∨ d), which matches the SQ lower bound up to a polylogarithmic factor for generative exponents s⋆ ≥ 1. Additionally, the authors extend their approach to k-sparse settings, introducing a weight perturbation technique that leverages sparsity and attains a sample complexity of order k^{s⋆} up to polylogarithmic factors, matching the corresponding SQ lower bound of order Ω̃(k^{s⋆}). The findings suggest that neural networks can effectively bridge the computational-statistical gap in learning single-index models, especially when sparsity is present.
Methodology
The authors propose a unified gradient-based algorithm for training a two-layer neural network, utilizing techniques such as label transformation and landscape smoothing. The algorithm is designed to learn feature representations that align with the unknown signal θ⋆, and it incorporates a weight perturbation method to exploit sparsity in the signal.
Results
The proposed algorithm achieves a sample complexity of Õ(d^{s⋆/2} ∨ d), matching the SQ lower bound up to a polylogarithmic factor for generative exponents s⋆ ≥ 1. In k-sparse settings, the method achieves a sample complexity of order k^{s⋆} up to polylogarithmic factors, matching the corresponding SQ lower bound of order Ω̃(k^{s⋆}).
Implications
The findings suggest that neural networks can effectively learn complex statistical models while maintaining computational efficiency. This has implications for various applications in statistics and machine learning, particularly in scenarios where sample efficiency is critical. The weight perturbation technique may also inspire new approaches in related fields, such as sparse tensor PCA.
D2H-AD: A Hybrid Model Utilizing Hyperdimensional Computing for Advanced Anomaly Detection
Efficient ML
Interpretability
- D2H-AD utilizes Hyperdimensional Computing for improved anomaly detection.
- The hybrid model combines distance-based similarity with density-aware encoding.
- Ablation studies show significant performance improvements over traditional methods.
- D2H-AD is lightweight and interpretable, suitable for edge AI applications.
Summary
The paper presents D2H-AD, a novel anomaly detection framework that leverages Hyperdimensional Computing (HDC) to enhance the identification of outliers in various applications such as healthcare, cybersecurity, and IoT. Traditional machine learning methods often struggle with the need for large labeled datasets and high computational costs, particularly in edge devices. D2H-AD addresses these challenges by combining distance-based similarity and density-aware encoding in a hybrid model, which significantly improves anomaly detection accuracy. The authors conducted ablation experiments demonstrating that hyperdimensional encoding alone can yield a 5.4% increase in ROC-AUC compared to traditional Euclidean feature space methods. D2H-AD outperformed five established baseline methods across multiple datasets, achieving superior F1 scores and ROC-AUC metrics while maintaining robustness against class imbalance and noise. The framework is designed to be lightweight and interpretable, making it suitable for resource-constrained environments. Additionally, D2H-AD provides feature-level interpretability through hypervector decoding, which is crucial for applications requiring transparency in decision-making.
Methodology
The authors developed D2H-AD by integrating hyperdimensional encoding with distance-based similarity and density-aware scoring. They conducted extensive evaluations on five benchmark datasets, comparing D2H-AD's performance against five established anomaly detection techniques, including HDAD, ODHD, One-Class SVM, Isolation Forest, and Autoencoders.
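To make the hyperdimensional encoding step concrete, here is a minimal sketch of distance-based scoring in hypervector space. It is not the D2H-AD model: the random Gaussian projection, the dimension of 1000, the majority-vote prototype, and all function names are assumptions for illustration, and the density-aware component described above is omitted.

```python
import math
import random


def make_projection(n_features, dim=1000, seed=0):
    """Fixed random projection used to lift features into a high dimension."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(dim)]


def encode(proj, x):
    """Map a feature vector to a bipolar (+1 / -1) hypervector."""
    return [1 if sum(p * xi for p, xi in zip(row, x)) >= 0 else -1 for row in proj]


def bundle(hypervectors):
    """Element-wise majority vote, giving a prototype of the normal class."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hypervectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def anomaly_score(proj, prototype, x):
    """Higher means less similar to the normal prototype."""
    return 1.0 - cosine(prototype, encode(proj, x))
```

A typical use would bundle the encodings of known-normal samples into `prototype` and flag new samples whose score exceeds a threshold chosen on validation data.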
Results
D2H-AD consistently outperformed all baseline methods across all datasets, achieving higher F1 scores and ROC-AUC metrics. The framework demonstrated robustness against class imbalance and noise, with a notable 5.4% increase in ROC-AUC when using hyperdimensional encoding compared to traditional methods.
Implications
The findings suggest that D2H-AD has significant potential for deployment in resource-constrained environments such as IoT and embedded systems. Its lightweight design and interpretability make it a promising solution for secure and efficient anomaly detection in dynamic applications.
Curvature-Guided Geometric Representation for Protein-Ligand Binding Affinity Prediction
Graph Learning
- Introduction of RicciBind, a novel framework for PLA prediction.
- Utilization of Ricci curvature for enhanced molecular structure representation.
- Optimal transport mechanism for aligning protein-ligand clusters.
- Significant improvements in predictive performance and interpretability.
Summary
This paper addresses the challenge of predicting protein-ligand binding affinity (PLA), a critical task in drug discovery. Traditional methods have struggled to effectively model the complex interactions between proteins and ligands due to limitations in characterizing local geometric organization and global cross-molecular interactions. The authors propose RicciBind, a novel geometric representation framework that integrates curvature-guided hierarchical structure learning with optimal transport (OT)-based cross-domain alignment. RicciBind utilizes Ricci curvature to enhance the representation of molecular structures by capturing local interaction tightness and organizing atomic interactions into curvature-aware hierarchical representations. Additionally, an OT-based cluster matching mechanism aligns protein and ligand clusters across heterogeneous domains, facilitating globally consistent correspondences and revealing higher-order interaction patterns. The combination of curvature-guided structure encoding and OT-driven alignment significantly improves the accuracy and interpretability of binding affinity predictions. Extensive experiments demonstrate that RicciBind outperforms existing methods on various PLA benchmarks and virtual screening tasks, with ablation studies confirming the importance of Ricci curvature in enhancing molecular interaction representations.
Methodology
The methodology involves a geometric representation framework that combines curvature-guided hierarchical structure learning with optimal transport-based alignment. Ricci curvature is employed to capture local interaction patterns, while an optimal transport mechanism aligns molecular clusters across different domains, enhancing the model's ability to understand complex interactions.
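As a rough illustration of how an edge curvature can flag tightly packed versus bridging interactions, the sketch below computes a combinatorial (augmented Forman-type) curvature on an unweighted graph. The summary does not say which Ricci curvature RicciBind uses, so this choice, like the function name, is an assumption; it was picked because it needs nothing beyond node degrees and triangle counts.

```python
def augmented_forman_curvature(edges):
    """Curvature of each edge (u, v) in an unweighted, undirected graph:

        4 - deg(u) - deg(v) + 3 * (number of triangles containing the edge)

    Edges inside dense clusters score higher; bridges between clusters
    score lower.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    curvature = {}
    for u, v in edges:
        # Common neighbours of u and v close a triangle over the edge.
        triangles = len(adjacency[u] & adjacency[v])
        curvature[(u, v)] = 4 - len(adjacency[u]) - len(adjacency[v]) + 3 * triangles
    return curvature
```

On two triangles joined by a single bridge, the bridge edge receives the lowest value, which is the kind of signal a curvature-aware hierarchy could use to separate clusters.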
Results
RicciBind achieved superior predictive performance compared to existing methods on various protein-ligand binding affinity benchmarks and virtual screening tasks. The model demonstrated improved generalization capabilities, and ablation studies highlighted the critical role of Ricci curvature in enhancing the representation of molecular interactions.
Implications
The findings suggest that incorporating geometric descriptors like Ricci curvature can significantly enhance the modeling of molecular interactions, potentially leading to more accurate predictions in drug discovery. This approach could facilitate the identification of promising therapeutic candidates more efficiently.
Attention-Based Estimation of the Individual Treatment Benefit Probability under Dose Variation
Theory
- Introduces a framework for estimating IPTB under dose variation, moving beyond binary treatment settings.
- Utilizes attention mechanisms for effective aggregation of treatment effects from covariate-similar patient comparisons.
- Demonstrates superior performance of the proposed method over traditional kernel regression approaches.
- Provides a foundation for personalized dose selection based on individual-level treatment benefit probabilities.
Summary
This paper addresses the estimation of the Individual Probability of Treatment Benefit (IPTB) for patients undergoing dose-varying treatments, a significant advancement over traditional binary treatment models. The authors propose a novel framework, Dose-AIPTB, which utilizes attention mechanisms to estimate IPTB with ordinal outcomes under discrete dose assignments. By reformulating the IPTB estimation as a binary classification problem, the method constructs pseudo-labels from pairwise comparisons of covariate-similar patients, allowing for the aggregation of treatment effects across multiple discrete doses. The framework is validated through numerical experiments on both real-world and synthetic datasets, demonstrating that attention-based aggregation consistently outperforms traditional Nadaraya-Watson kernel regression methods. The findings suggest that this approach can enhance personalized medicine by providing a more accurate assessment of treatment benefits at varying doses, thereby facilitating better clinical decision-making.
Methodology
The proposed method, Dose-AIPTB, reformulates IPTB estimation as a binary classification problem. It constructs pseudo-labels from pairwise comparisons of patients with similar covariates and aggregates these using attention mechanisms or Nadaraya-Watson kernel regression. This approach accommodates multiple discrete dose levels, allowing for a more nuanced understanding of treatment effects.
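The two aggregation rules can be compared side by side. In the sketch below, Nadaraya-Watson regression with a Gaussian kernel is written as a softmax over negative squared distances, which makes its kinship with attention visible; a learned attention layer would replace that fixed score with learned query and key projections. The function names, the bandwidth, and the plain dot-product score are assumptions for illustration, not details from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nadaraya_watson(query, covariates, pseudo_labels, bandwidth=0.5):
    """Gaussian-kernel weighted average of pseudo-labels around `query`."""
    scores = [
        -sum((q - x) ** 2 for q, x in zip(query, row)) / (2 * bandwidth ** 2)
        for row in covariates
    ]
    weights = softmax(scores)
    return sum(w * y for w, y in zip(weights, pseudo_labels))


def dot_product_attention(query, keys, pseudo_labels):
    """Scaled dot-product attention over the same pseudo-labels."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return sum(w * y for w, y in zip(weights, pseudo_labels))
```

With binary pseudo-labels, either function returns a value in [0, 1] that can be read as an estimated probability of benefit for the query patient.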
Results
Numerical experiments indicate that the attention-based aggregation method consistently outperforms traditional kernel regression methods in estimating IPTB across various datasets and conditions, including covariate shift and heterogeneous outcomes.
Implications
The framework has significant implications for personalized medicine, as it allows clinicians to better assess the probability of treatment benefits for individual patients based on their specific characteristics and treatment doses. This could lead to more tailored treatment plans and improved patient outcomes.
Multi-Fidelity SINDy: Sparse Discovery of Nonlinear Dynamical Systems with Fidelity-Weighted Measurements
Time Series
Theory
Interpretability
- Introduction of a multi-fidelity SINDy framework that combines low and high-fidelity data.
- Statistical justification for a covariance-aware weighting strategy in regression.
- Validation on benchmark systems shows improved model recovery despite noise.
- Demonstrates that low-fidelity data can enhance model performance when high-fidelity data is scarce.
Summary
This paper presents a novel approach to the Sparse Identification of Nonlinear Dynamics (SINDy) framework, addressing the challenge of discovering nonlinear dynamical systems from data with heterogeneous fidelity and measurement noise. The authors propose a multi-fidelity SINDy method that integrates Ensemble SINDy and Weak SINDy within a weighted regression framework based on generalized least squares. This approach allows for effective model identification even when data quality varies significantly. The methodology is validated through various benchmark systems, including ordinary and partial differential equations, demonstrating its robustness against heteroscedastic noise. The results indicate that utilizing low-cost, low-fidelity measurements can enhance model recovery, sometimes achieving performance comparable to or better than models trained solely on high-fidelity data. This work highlights the potential of multi-fidelity data integration in improving the accuracy and interpretability of dynamical system models.
Methodology
The authors extend the SINDy framework by integrating Ensemble SINDy and Weak SINDy, employing a weighted regression approach derived from generalized least squares to account for varying noise levels in measurements. This allows for the effective combination of data from different fidelity sources while preserving the integrity of high-fidelity observations.
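The weighting idea can be sketched with a small sequentially thresholded weighted least-squares routine. This is a simplified illustration under stated assumptions: the noise covariance is taken to be diagonal so that each sample gets a weight of 1/σ², the ensembling and weak-form steps are left out, and all names (`solve`, `weighted_fit`, `weighted_stlsq`) are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def weighted_fit(theta, y, w, active):
    """Minimise sum_i w[i] * (y[i] - theta[i] . xi)^2 over the active terms."""
    rows = range(len(y))
    gram = [[sum(w[i] * theta[i][a] * theta[i][b] for i in rows) for b in active]
            for a in active]
    rhs = [sum(w[i] * theta[i][a] * y[i] for i in rows) for a in active]
    return solve(gram, rhs)


def weighted_stlsq(theta, y, w, threshold=0.1, max_iters=10):
    """Refit on the surviving library terms until the support stops changing."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    xi = [0.0] * n_terms
    for _ in range(max_iters):
        if not active:
            break
        coef = weighted_fit(theta, y, w, active)
        xi = [0.0] * n_terms
        kept = []
        for a, c in zip(active, coef):
            if abs(c) >= threshold:  # drop small coefficients
                xi[a] = c
                kept.append(a)
        if kept == active:
            break
        active = kept
    return xi
```

Here `theta` holds the candidate library evaluated at each sample, `y` the measured derivatives, and `w` the per-sample weights, so precise high-fidelity samples dominate the fit while noisy low-fidelity samples still contribute.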
Results
The proposed multi-fidelity SINDy method was validated on several benchmark systems, including ordinary and partial differential equations. The results showed that the method effectively mitigates the impact of heteroscedastic noise and that low-fidelity measurements can significantly improve model recovery, achieving results comparable to those obtained from high-fidelity data alone.
Implications
This research has significant implications for fields requiring accurate modeling of dynamical systems, such as engineering, physics, and control systems. The ability to leverage low-fidelity data can reduce costs and time associated with high-fidelity simulations while maintaining model interpretability, which is crucial for applications like digital twins and real-time forecasting.
When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs
NLP
Large Language Models
Interpretability
- Bilingual contrasts can be represented by stable, approximately linear directions in representation space.
- A covariance-adjusted inner product reduces overlap between language directions, allowing for meaningful interpretation of residual structures.
- Languages within the same family exhibit a simplex-like geometric organization, indicating hierarchical relationships.
- Additive interventions reveal systematic effects at the logit level but limited control at the generation level, highlighting challenges in multilingual steering.
Summary
This paper investigates the internal representations of multilingual large language models (LLMs) to understand how language identity is represented and whether it behaves as an independent variable. The authors apply causal-geometric analysis to study 28 bilingual contrasts across three LLMs, revealing that language concepts can be represented as stable linear directions that are largely separable under a covariance-adjusted inner product. They find that while languages exhibit structured dependencies reflecting linguistic similarity, they also show a simplex-like geometric organization within language families. The study highlights the implications of these findings for the interpretability of multilingual models, particularly in terms of predicting and controlling cross-lingual effects during model deployment. The results suggest that while languages are largely separable, they are not independent, indicating that residual structures can influence how effects propagate across languages, which is critical for ensuring reliable behavior in multilingual applications.
Methodology
The authors employed a geometry-based interpretability framework, adapting a causal inner product to analyze multilingual representations. They examined 28 bilingual contrasts across three public LLMs (Qwen3-4B, Mistral-7B-v0.3, Llama-3-8B) to assess the stability and independence of language representations.
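A covariance-adjusted inner product can be illustrated in two dimensions. The sketch assumes the common inverse-covariance form ⟨u, v⟩ = uᵀ Σ⁻¹ v; the summary does not spell out the exact definition, so this form, the toy covariance matrix, and the function names are assumptions.

```python
def inverse_2x2(matrix):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def euclidean_inner(u, v):
    """Plain dot product."""
    return sum(x * y for x, y in zip(u, v))


def adjusted_inner(u, v, cov_inv):
    """Covariance-adjusted inner product u^T * cov_inv * v."""
    return sum(
        u[i] * cov_inv[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )


# Two toy "language directions" that overlap under the plain dot product.
covariance = [[2.0, 1.0], [1.0, 1.0]]
direction_a = [1.0, 0.0]
direction_b = [1.0, 1.0]
overlap_plain = euclidean_inner(direction_a, direction_b)
overlap_adjusted = adjusted_inner(direction_a, direction_b, inverse_2x2(covariance))
```

In this toy case the two directions have a nonzero plain dot product but become orthogonal once the correlation between coordinates is factored out, which mirrors the reduced overlap between language directions reported above.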
Results
The analysis demonstrated that many bilingual contrasts have stable linear representations that are largely separable, with structured deviations reflecting linguistic similarity. The covariance-adjusted inner product effectively reduced overlap between language directions, and the study identified a simplex-like organization among languages within the same family. The findings also indicated that while languages are largely separable, they exhibit residual structures that can lead to cross-lingual effects.
Implications
These findings have significant implications for the trustworthy deployment of multilingual LLMs, as understanding the separability and structured dependencies of language representations can help mitigate unintended cross-lingual effects during monitoring and intervention. This research contributes to the broader field of interpretability in AI, particularly in multilingual contexts.
Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs
Graph Learning
Multimodal
- CoMAG reframes MAG representation learning as task-adaptive context construction and modality-preserving alignment.
- The framework integrates reliable context learning and modality-specific hop trajectories for improved performance.
- CoMAG achieves state-of-the-art results across multiple graph-level and modality-level tasks.
- The approach retains modality-specific evidence while enabling effective cross-modal alignment.
Summary
This paper introduces CoMAG (Context-aware Modality-Topology Co-Alignment), a novel framework for learning from Multimodal Attributed Graphs (MAGs). MAGs represent real-world entities by integrating graph topology with diverse semantic attributes, such as text and images. Existing methods often struggle with task-agnostic context propagation and over-compressed cross-modal fusion, which can hinder performance across various tasks. CoMAG addresses these challenges by employing Reliable Context Learning to estimate edge reliability based on multimodal semantic consistency and selecting context components tailored to specific tasks. Additionally, it utilizes Modality-preserving Hop-token Alignment, which maintains modality-specific multi-hop trajectories and aligns them through shared-private representation decoupling. The framework is designed to support both graph-centric tasks (like node classification and link prediction) and modality-centric tasks (such as cross-modal retrieval and generation) within a unified learning pipeline. The authors provide theoretical analyses of stable propagation and modality-collapse control, and extensive experiments on nine OpenMAG datasets demonstrate that CoMAG outperforms existing methods, achieving state-of-the-art results in various tasks while maintaining efficient edge-linear complexity.
Methodology
CoMAG employs two main components: Reliable Context Learning, which assesses edge reliability and selects context components based on task requirements, and Modality-preserving Hop-token Alignment, which maintains distinct propagation paths for each modality while aligning them through a shared-private representation framework.
Results
Experimental evaluations on nine OpenMAG datasets show that CoMAG achieves the best reported performance among various baselines, indicating its effectiveness in enhancing structural prediction, cross-modal matching, and graph-conditioned generation.
Implications
The findings suggest that CoMAG can be applied in various domains where multimodal data is present, such as e-commerce and social networks, facilitating better understanding and interaction with complex data structures.
A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning
Reinforcement Learning
Theory
- Introduces a unified causal-origin taxonomy for distributional shifts in reinforcement learning.
- Links ID/OOD generalization and non-stationarity under a common framework.
- Decomposes the generative interaction process in RL to analyze the impact of various components on distributional shifts.
- Distinguishes between internal and external shifts, and categorizes shifts into explicit, implicit, and hybrid types.
Summary
This paper addresses the issue of distributional shifts in reinforcement learning (RL), which occur when the operating conditions of RL systems differ from those encountered during training. The authors propose a unified causal-origin taxonomy that characterizes the sources of these shifts, linking them to both in-distribution (ID) and out-of-distribution (OOD) generalization, as well as non-stationary environments. By reformulating distributional shifts in terms of the generative interaction process within a Partially Observable Markov Decision Process (POMDP) framework, the authors decompose the interaction into structural components such as state distribution, observation process, policy, reward, and transition dynamics. This decomposition allows for a clearer understanding of how changes in these components lead to distributional shifts. The taxonomy distinguishes between internal (agent-driven) and external (environment-driven) shifts and introduces a shifted-time boundary perspective that categorizes shifts as explicit, implicit, or hybrid. Furthermore, the authors present an evaluation framework to quantify the impact of shifts and the adaptation of agents through performance metrics. This work provides a systematic foundation for analyzing robustness in RL under distributional shifts, bridging the gap between supervised learning and RL.
Methodology
The authors adapt the classical dataset-shift principle from supervised learning to reinforcement learning by reformulating distributional shifts in terms of the generative interaction process. They utilize a Partially Observable Markov Decision Process (POMDP) framework to model realistic environments and decompose the interaction into structural components, analyzing how changes in these components affect the data-generating mechanism.
Results
The proposed taxonomy effectively distinguishes between different types of distributional shifts and provides a structured understanding of how these shifts occur in RL. The evaluation framework introduced allows for quantifying the impact of shifts on agent performance, facilitating a better understanding of agent adaptation in varying conditions.
Implications
This work lays a foundational framework for future research on robustness in reinforcement learning, guiding the development of more resilient RL systems capable of adapting to distributional shifts. It also connects insights from supervised learning to RL, potentially influencing methodologies for handling shifts in both domains.
pFedUL: Layer-Aware Federated Unlearning for Personalized Federated Learning
Federated Learning
- pFedUL addresses the unique challenges of federated unlearning in personalized federated learning settings.
- The framework includes a layer-aware approach that distinguishes between shared and personalized model components.
- New metrics (PPS and CFI) are introduced to evaluate unlearning quality in pFL.
- Experimental results show pFedUL achieves high personalized accuracy while effectively removing client data contributions.
Summary
The paper introduces pFedUL, a novel framework for federated unlearning (FU) tailored for personalized federated learning (pFL). Traditional FU methods primarily focus on the FedAvg paradigm, which does not account for the unique structure of pFL models that separate shared global layers from client-specific personalized layers. This separation creates challenges for unlearning, as it necessitates balancing the removal of a client's data influence from shared layers while preserving the personalized performance for remaining clients. The authors propose a layer-aware selective unlearning approach that includes three key components: (1) a gradient-based method for attributing contributions to shared and personalized parameters, (2) adaptive selective unlearning strategies that differentiate between layer types, and (3) a recalibration protocol to help remaining clients restore their personalization with minimal overhead. Additionally, the paper introduces two new metrics, the Personalization Preservation Score (PPS) and the Cross-client Fairness Index (CFI), to evaluate the effectiveness of unlearning in the pFL context. Experimental results demonstrate that pFedUL achieves unlearning effectiveness comparable to full retraining while maintaining an average of 97.3% personalized accuracy for remaining clients. Compared to six state-of-the-art FU methods adapted for pFL, pFedUL shows superior performance in personalization preservation, with an average improvement of 6.3% in PPS and an 8.4× speedup across various architectures and datasets.
Methodology
The authors developed a layer-aware selective unlearning framework that includes gradient-based contribution attribution for shared and personalized layers, adaptive selective unlearning strategies tailored to different layer types, and a lightweight recalibration protocol for remaining clients. They also introduced new evaluation metrics to assess the quality of unlearning in the context of personalized federated learning.
Results
pFedUL demonstrated unlearning effectiveness comparable to full retraining while achieving an average of 97.3% personalized accuracy for remaining clients. It consistently outperformed six state-of-the-art FU methods adapted for pFL, improving the Personalization Preservation Score (PPS) by an average of 6.3% and achieving an 8.4× speedup across tested architectures and datasets.
Implications
The proposed pFedUL framework has significant implications for applications requiring compliance with data protection regulations, such as GDPR, while maintaining personalized model performance in federated learning settings. It can be particularly beneficial in sensitive domains like healthcare and finance, where data privacy is paramount.
Neural Slack Variables for Shape Constraints
Optimization
Theory
- Introduction of neural slack variables as a novel method for enforcing shape constraints in neural networks.
- Demonstration of the 'constraint drifting' failure mode in traditional penalty and primal-dual methods.
- Achieved zero violations in monotonicity and convexity test cases, outperforming existing methods.
- Enabled arbitrage-free learning of volatility surfaces, addressing a critical challenge in quantitative finance.
Summary
This paper addresses the challenge of enforcing functional inequality constraints, such as monotonicity and convexity, in neural networks, which is crucial for various industrial and scientific applications. Traditional methods like penalty and primal-dual approaches often lead to fragile satisfaction of these constraints, as they provide gradients only at violated locations, resulting in 'constraint drifting.' The authors introduce a novel approach called 'neural slack variables,' which involves coupling a primary neural network with an auxiliary network that learns non-negative slack variables. This auxiliary network serves as a valid target for the primary network's constraint quantities, effectively converting the constraint enforcement into a regression problem. The proposed method achieves zero measured violations in monotonicity and convexity test cases, outperforming existing penalty and primal-dual methods that leave residual violations. Additionally, the approach facilitates arbitrage-free learning of volatility surfaces in quantitative finance, addressing a significant industrial challenge. The paper demonstrates that neural slack variables not only stabilize constraint satisfaction but also transfer regularity from the auxiliary network to the primary network's constraint profile, enhancing the overall model performance.
Methodology
The authors propose a dual-network architecture where the primary network's constraint quantities are coupled with an auxiliary network that outputs non-negative slack variables. These networks are trained jointly using a quadratic matching loss in constraint space, ensuring that the primary network's outputs remain feasible throughout the training process.
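The quadratic matching loss can be sketched for a monotonicity constraint, where the constraint quantity is the derivative of the primary model. Everything specific here is an assumption for illustration: plain callables stand in for the two networks, a softplus keeps the slack non-negative, and a finite difference replaces automatic differentiation.

```python
import math


def softplus(z):
    """Smooth, numerically stable map from any real number to a positive one."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def derivative(f, x, h=1e-4):
    """Central finite difference, standing in for automatic differentiation."""
    return (f(x + h) - f(x - h)) / (2 * h)


def matching_loss(primary, auxiliary, grid):
    """Mean squared gap between the constraint quantity and its slack target.

    primary   : callable x -> f(x); monotonicity asks for f'(x) >= 0
    auxiliary : callable x -> raw slack output, any real number
    grid      : points at which the constraint is checked
    """
    total = 0.0
    for x in grid:
        g = derivative(primary, x)       # constraint quantity
        s = softplus(auxiliary(x))       # non-negative slack target
        total += (g - s) ** 2
    return total / len(grid)
```

Because the target is non-negative everywhere, the loss pulls the primary model's derivative toward a feasible value at every grid point, not only where the constraint is already violated. In training, this term would be added to the usual data-fitting loss and both models updated jointly.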
Results
The experiments show that neural slack variables maintain stable constraint profiles without drifting into violations, achieving zero measured violations in dense-grid monotonicity and convexity test cases. This is a significant improvement over traditional penalty and primal-dual methods, which often leave residual violations.
Implications
The proposed method has significant implications for various fields requiring strict adherence to functional constraints, such as control systems and quantitative finance. By ensuring that learned models respect structural laws, it enhances the reliability and applicability of neural networks in practical scenarios.
Continuous Cross-Domain Traffic State Prediction via Memory-Augmented Graph Liquid Time-Constant Networks
Graph Learning
Time Series
- Introduction of MA-GLTC framework for continuous cross-domain traffic prediction.
- Utilization of spatio-temporal units (STUs) for fine-grained knowledge alignment.
- Development of GLTC for modeling traffic dynamics with adaptive time constants.
- Implementation of Memory-based Transfer Storage (MTS) for continual adaptation.
Summary
This paper addresses the challenges of traffic state prediction in intelligent transportation systems, particularly in regions with limited traffic observations. The authors propose a novel framework called Memory-Augmented Graph Liquid Time-Constant Network (MA-GLTC) to enhance cross-domain traffic prediction. The framework introduces spatio-temporal units (STUs) to facilitate fine-grained knowledge transfer between domains. A key innovation is the Graph Liquid Time-Constant Network (GLTC), which models traffic evolution in continuous time while incorporating graph-coupled dynamics. This allows for adaptive time constants and neighborhood-aware feedback, improving the model's ability to handle varying traffic patterns. Additionally, the Memory-based Transfer Storage (MTS) mechanism is designed to retain source-domain knowledge and adapt to unseen target-domain patterns. Experimental results on five public traffic datasets show that MA-GLTC significantly outperforms existing methods in both short-term and long-term prediction tasks, demonstrating its effectiveness in addressing the limitations of current cross-domain traffic prediction approaches.
Methodology
The methodology involves constructing spatio-temporal units (STUs) to break down traffic networks into transferable segments. The Graph Liquid Time-Constant Network (GLTC) is employed to model traffic evolution in continuous time, incorporating graph-coupled dynamics. The Memory-based Transfer Storage (MTS) mechanism is integrated to manage knowledge transfer and adaptation to new traffic patterns.
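As a rough illustration of graph-coupled liquid time-constant dynamics, the sketch below applies one explicit Euler step of the standard liquid time-constant ODE, dx/dt = -(1/tau + f) x + f A, at every node. The summary does not give the paper's exact GLTC equations, so the neighbor-mean coupling inside the gate f, the scalar weights, and every name here are assumptions rather than the authors' model.

```python
import math

def sigmoid(z):
    # Plain logistic function; adequate for the small inputs used here.
    return 1.0 / (1.0 + math.exp(-z))

def gltc_euler_step(x, u, neighbors, tau=1.0, A=1.0,
                    w_in=1.0, w_nb=1.0, b=0.0, dt=0.1):
    # x: per-node hidden state, u: per-node input,
    # neighbors: list of neighbor index lists (hypothetical graph encoding).
    new_x = []
    for i, x_i in enumerate(x):
        nbrs = neighbors[i]
        nb_mean = sum(x[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        # Gate depends on the node's input and its neighbors' states
        # (assumed form of "neighborhood-aware feedback").
        f_i = sigmoid(w_in * u[i] + w_nb * nb_mean + b)
        # Liquid time-constant ODE: the gate scales both decay and drive.
        dx = -(1.0 / tau + f_i) * x_i + f_i * A
        new_x.append(x_i + dt * dx)
    return new_x

def effective_time_constant(f_i, tau=1.0):
    # The gate shortens the time constant: tau / (1 + tau * f).
    return tau / (1.0 + tau * f_i)

# Three sensors on a path graph: 0 - 1 - 2.
neighbors = [[1], [0, 2], [1]]
inputs = [1.0, 0.0, -1.0]
state = [0.0, 0.0, 0.0]
for _ in range(50):
    state = gltc_euler_step(state, inputs, neighbors)
```

Because the gate multiplies the decay term, each node's effective time constant changes with its input and its neighbors' states, which is the sense in which the time constants are adaptive. A practical model would use learned weight matrices and a more careful ODE solver than this fixed-step update.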
Results
The MA-GLTC framework consistently outperformed inner-domain and cross-domain baselines across five public traffic datasets. Relative to the second-best method, it reduced average prediction error by 3.02%, 0.33%, 8.92%, 10.09%, and 2.11%, demonstrating its effectiveness in both short-term and long-term traffic state prediction.
Implications
The findings suggest that MA-GLTC can significantly enhance traffic prediction in low-resource regions, potentially improving traffic management and smart city initiatives. The framework's ability to adapt to new patterns may also facilitate better decision-making in dynamic urban environments.
How to Score Experts for One-Shot MoE Expert Pruning: A Unified Formulation and Selection Principle
NLP
Large Language Models
Efficient ML
- Introduces a unified formulation for one-shot MoE expert pruning based on routing frequency, gate weighting, and activation strength.
- Establishes a selection principle for pruning criteria that varies between task-agnostic and task-specific scenarios.
- Presents two new task-agnostic criteria, MAN and MSAN, which show superior performance compared to existing methods.
- Demonstrates the effectiveness of the proposed criteria across four MoE models and 16 diverse benchmarks.
Summary
This paper addresses the challenge of one-shot expert pruning in Mixture-of-Experts (MoE) language models, which reduce computation by activating only a subset of experts for each token. Existing pruning criteria, while effective, are often heuristic and lack universality across different tasks. The authors propose a unified formulation that identifies three core components influencing expert importance: routing frequency, gate weighting, and activation strength. This formulation leads to a selection principle for pruning criteria, suggesting that task-agnostic pruning should prioritize routed-token-averaged, gate-free activation-based criteria, while task-specific pruning can leverage routing-frequency and gate-weight information. Additionally, the authors introduce two new task-agnostic criteria, Mean Activation Norm (MAN) and Mean Squared Activation Norm (MSAN), which consistently outperform existing methods across various benchmarks. The study emphasizes the importance of selecting appropriate pruning criteria based on deployment objectives, thereby providing a systematic approach to expert pruning in MoE models.
Methodology
The authors analyze the damage caused by single-expert pruning to identify key components affecting expert importance. They develop a unified scoring formulation that incorporates these components and derive a selection principle for pruning criteria. They also introduce two new criteria, MAN and MSAN, and evaluate their performance across multiple models and benchmarks.
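One plausible reading of the two criteria, sketched below, scores each expert by the mean L2 norm (MAN) or mean squared L2 norm (MSAN) of its outputs over the tokens routed to it, without gate weights. The exact definitions are not given in this summary, so these formulas, the zero score for never-routed experts, and all names are assumptions for illustration only.

```python
import math

def man_score(outputs):
    # Mean Activation Norm (assumed definition): average L2 norm of the
    # expert's outputs over its routed tokens, with no gate weighting.
    if not outputs:
        return 0.0  # assumption: a never-routed expert scores zero
    norms = [math.sqrt(sum(c * c for c in y)) for y in outputs]
    return sum(norms) / len(norms)

def msan_score(outputs):
    # Mean Squared Activation Norm (assumed definition): average squared
    # L2 norm over routed tokens.
    if not outputs:
        return 0.0
    return sum(sum(c * c for c in y) for y in outputs) / len(outputs)

def experts_to_prune(routed_outputs, n_prune, score_fn=man_score):
    # Rank experts by score and return the n_prune lowest-scoring ones.
    scores = {e: score_fn(outs) for e, outs in routed_outputs.items()}
    return sorted(scores, key=scores.get)[:n_prune]

# Toy calibration data: expert id -> output vectors for its routed tokens.
routed_outputs = {
    "e0": [[3.0, 4.0], [0.0, 5.0]],
    "e1": [[0.3, 0.4]],
    "e2": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
}
pruned_by_man = experts_to_prune(routed_outputs, 1, man_score)
pruned_by_msan = experts_to_prune(routed_outputs, 1, msan_score)
```

Averaging over routed tokens keeps the score independent of how often an expert is selected, which matches the summary's description of task-agnostic criteria as routed-token-averaged and gate-free. A task-specific variant would additionally fold in routing frequency and gate weights.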
Results
The proposed criteria, MAN and MSAN, achieve top-two average ranks across four MoE models and 16 benchmarks, improving average performance by up to 8.8 percentage points over the strongest baseline. The results indicate that the new criteria provide more balanced performance in task-agnostic settings.
Implications
The findings suggest that a systematic approach to expert pruning can significantly enhance the efficiency of MoE models, making them more suitable for deployment in resource-constrained environments. The proposed criteria could be applied in various applications requiring efficient memory usage without compromising performance.
LapidaryEngine: Fully Conversational Crystal Generation
Generative Models
NLP
Large Language Models
- LapidaryEngine enables fully conversational interaction for crystal generation.
- It introduces a pivot representation for bidirectional translation between text and crystal structures.
- The model allows iterative refinement based on user feedback, enhancing usability for non-experts.
- Democratizes materials design by allowing vague natural language prompts.
Summary
The paper introduces LapidaryEngine, a novel model designed for fully conversational crystal generation from natural language prompts. Traditional text-to-crystal models are limited by requiring structured input formats and supporting only one-directional generation (text to crystal). LapidaryEngine overcomes these challenges by enabling users to provide free-form natural language requests and facilitating iterative refinement through a dialogue-like interaction. The key innovation is the introduction of a pivot representation, which serves as an intermediate form for bidirectional translation between text and crystal structures, even in the absence of direct paired datasets. This allows for robust interpretation of user feedback and precise control over generated structures. The model is demonstrated across various tasks, including insulator discovery and stability optimization, showcasing its ability to align generated materials with user intent interactively.
Methodology
LapidaryEngine employs a generative model that utilizes a pivot representation to facilitate the conversion between natural language descriptions and crystal structures. This model allows for iterative feedback and refinement, enabling a conversational approach to materials design.
Results
The model successfully generated crystal structures that align with user specifications through natural language prompts. It demonstrated capabilities in discovering new insulating materials and optimizing their properties, showcasing its effectiveness in interactive materials design.
Implications
LapidaryEngine has the potential to democratize materials science by enabling non-experts to design bespoke materials through intuitive interactions. This could lead to accelerated discovery and innovation in material design, impacting various fields such as electronics, energy storage, and pharmaceuticals.