AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 64
Update frequency: 8h
Days of history: 7
Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses
Theory
Optimization
- Conventional evaluation methods overestimate the robustness of Dummy Classes-based defenses.
- DAWA introduces a dual-targeting approach that simultaneously attacks both true and dummy labels.
- Extensive experiments show a significant reduction in measured robustness of Dummy Classes defenses.
- The study highlights the need for continuous evolution in adversarial robustness evaluation methodologies.
Summary
The paper addresses the challenge of adversarial robustness evaluation, particularly in the context of Dummy Classes-based defenses that introduce a 'dummy' class to mislead adversarial attacks. The authors identify that conventional evaluation methods, such as AutoAttack, significantly overestimate the robustness of these defenses because they primarily focus on misleading the true class label, inadvertently aligning with the defense mechanism. To counter this, the authors propose a novel evaluation method called Dummy-Aware Weighted Attack (DAWA), which targets both the true and dummy labels during adversarial example synthesis with adaptive weighting. Extensive experiments demonstrate that DAWA effectively undermines the Dummy Classes-based defenses, reducing their measured robustness from 58.61% to 29.52% on the CIFAR-10 dataset under ℓ∞ perturbation. This work emphasizes the necessity for evolving robustness assessment methodologies to keep pace with emerging defense strategies.
Methodology
The authors developed DAWA, which adaptively weights the importance of the true and dummy labels during adversarial example synthesis. This approach allows for a more effective attack strategy that simultaneously targets both labels, breaking the illusion of safety provided by dummy classes.
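As a rough illustration of the dual-targeting idea, the sketch below runs a PGD-style attack whose objective pushes the prediction away from both the true label and the dummy ("safe sink") class. The adaptive weight used here (the model's current dummy-class probability) and the step sizes are assumptions for illustration, not DAWA's published rule.

```python
import torch
import torch.nn.functional as F

def dawa_style_attack(model, x, y_true, dummy_idx, eps=8/255, alpha=2/255, steps=10):
    """PGD-style attack that pushes the prediction away from BOTH the true
    class and the dummy ("safe sink") class. Illustrative sketch only: the
    adaptive weight below is an assumption, not DAWA's exact weighting rule."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    dummy_tgt = torch.full_like(y_true, dummy_idx)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        # Weight the dummy term by how strongly the model currently "hides"
        # in the dummy class.
        w = logits.softmax(dim=-1)[:, dummy_idx].detach()
        loss_true = F.cross_entropy(logits, y_true, reduction="none")
        loss_dummy = F.cross_entropy(logits, dummy_tgt, reduction="none")
        loss = (loss_true + w * loss_dummy).mean()   # maximise both terms
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```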
Results
DAWA was shown to reduce the robustness of a leading Dummy Classes-based defense from 58.61% to 29.52% on CIFAR-10 under ℓ∞ perturbation (ϵ = 8/255), demonstrating its effectiveness in evaluating the true robustness of models employing dummy classes.
Implications
The findings suggest that current evaluation strategies may be inadequate for assessing new defense paradigms, necessitating the development of more sophisticated attack methodologies. This has implications for the design of future adversarial defenses and the assessment of their effectiveness.
Curvature-Guided LoRA: Steering in the pretrained NTK subspace
NLP
Efficient ML
Optimization
- Introduction of the prediction alignment problem focusing on model outputs.
- Development of Curvature-Guided LoRA (CG-LoRA) leveraging curvature information for low-rank updates.
- Demonstration of improved performance and faster convergence compared to existing LoRA variants.
- Emphasis on the significance of function-space alignment in parameter-efficient fine-tuning.
Summary
This paper addresses the limitations of parameter-efficient fine-tuning methods, particularly Low-Rank Adaptation (LoRA), which often do not achieve the performance of full fine-tuning. The authors introduce the prediction alignment problem, which focuses on aligning model outputs rather than parameter updates. They propose a curvature-aware, second-order approach that utilizes local curvature information to guide low-rank updates, leading to a method called Curvature-Guided LoRA (CG-LoRA). This approach allows for the selection and scaling of adaptation directions based on curvature, resulting in improved performance and faster convergence on natural language understanding benchmarks. The authors emphasize the importance of function-space alignment and second-order information in enhancing parameter-efficient adaptation, demonstrating that CG-LoRA can achieve results comparable to full fine-tuning without the computational overhead of explicit second-order matrix construction.
Methodology
The authors propose a second-order formulation that incorporates local curvature information to guide the adaptation directions in LoRA. This involves a Newton-like initialization of low-rank adapters, where updates are scaled based on curvature, allowing for effective alignment with the outputs of full fine-tuning. The method is computationally efficient, avoiding the explicit formation of large curvature matrices by using structured approximations.
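A minimal sketch of the curvature-guided initialization idea, assuming a diagonal empirical-Fisher curvature proxy and a rank-r SVD of the resulting Newton-like step; the paper's structured approximation and scaling rule may differ.

```python
import torch

def cg_lora_init(grad_W, r=8, damping=1e-4):
    """Curvature-guided LoRA initialisation (illustrative sketch). Assumes a
    diagonal curvature proxy (squared gradients, i.e. an empirical-Fisher
    diagonal) computed on a calibration batch. Returns factors B (out x r)
    and A (r x in) whose product is a rank-r approximation of a Newton-like step."""
    h_diag = grad_W.pow(2) + damping            # diagonal curvature proxy
    newton_step = -grad_W / h_diag              # curvature-scaled update direction
    U, S, Vh = torch.linalg.svd(newton_step, full_matrices=False)
    B = U[:, :r] * S[:r].sqrt()                 # out_dim x r
    A = S[:r].sqrt().unsqueeze(1) * Vh[:r]      # r x in_dim
    return B, A

# Usage: take grad_W from one pass on the frozen model, initialise the LoRA
# adapter with (B, A), then fine-tune only B and A so that W_eff = W + B @ A.
```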
Results
Preliminary experiments on standard natural language understanding benchmarks indicate that CG-LoRA outperforms existing LoRA variants in terms of both convergence speed and final performance, highlighting the effectiveness of curvature-aware updates in achieving better prediction alignment.
Implications
The findings suggest that incorporating curvature information can significantly enhance the performance of parameter-efficient fine-tuning methods, making them more competitive with full fine-tuning approaches. This has potential applications in various domains where large pretrained models are adapted for specific tasks, particularly in natural language processing.
Lie Generator Networks for Nonlinear Partial Differential Equations
Theory
Interpretability
Time Series
- Introduction of LGN-KM, a neural operator for nonlinear PDEs that lifts dynamics into a linear space.
- Structured decomposition of the Koopman generator enhances stability and interpretability.
- Successful application on Navier–Stokes turbulence, recovering known physical properties from data alone.
- Demonstration of gauge invariance across different flow regimes.
Summary
The paper introduces the Lie Generator Network – Koopman (LGN-KM), a novel neural operator designed to analyze nonlinear dynamical systems governed by partial differential equations (PDEs). Unlike traditional methods that rely on linearity, LGN-KM lifts nonlinear dynamics into a linear latent space, where the continuous-time Koopman generator (Lk) is learned through a structured decomposition into a skew-symmetric component (S) and a positive-definite diagonal component (Dk). This decomposition not only ensures stability but also enhances interpretability by providing direct spectral access to the learned dynamics. The authors demonstrate the effectiveness of LGN-KM on two-dimensional Navier–Stokes turbulence, where the generator successfully recovers known dissipation scaling and a complete multi-branch dispersion relation from trajectory data without any physics supervision. The model exhibits gauge invariance across different flow regimes and guarantees long-horizon stability, enabling continuous-time evaluations and efficient transfer across viscosity models. The architectural constraints of LGN-KM, while improving interpretability and stability, may trade off against one-step accuracy.
Methodology
The LGN-KM architecture employs a spectral encoder to lift input nonlinear data into a linear latent space, followed by a structured generator that decomposes the Koopman generator into skew-symmetric and positive-definite components. The model is trained on trajectory data to learn the dynamics without physics supervision, leveraging Fourier transforms for modal analysis.
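The stability argument rests on the structure of the generator itself; a minimal sketch of that structural constraint (skew-symmetric part minus a positive diagonal, so the spectrum has non-positive real part) is shown below. The parameterization and sign conventions are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredKoopmanGenerator(nn.Module):
    """Latent generator L = S - D with S skew-symmetric and D a positive
    diagonal, so Re(eig(L)) <= 0 and long-horizon rollouts stay stable.
    Minimal sketch of the structural idea only."""
    def __init__(self, dim):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(dim, dim) * 0.01)
        self.raw_diag = nn.Parameter(torch.zeros(dim))

    def generator(self):
        S = self.raw - self.raw.T                    # skew-symmetric part
        D = torch.diag(F.softplus(self.raw_diag))    # positive diagonal (dissipation)
        return S - D

    def forward(self, z, t):
        # Continuous-time evolution of the latent state z by exp(L * t).
        L = self.generator()
        return z @ torch.matrix_exp(L * t).T
```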
Results
The LGN-KM successfully recovers the known dissipation scaling and a complete multi-branch dispersion relation from trajectory data of 2D Navier–Stokes turbulence. The model demonstrates gauge invariance across different training conditions and ensures long-horizon stability, outperforming unconstrained operators.
Implications
The findings suggest that LGN-KM can be a powerful tool for analyzing complex nonlinear systems, making linear analysis techniques applicable to nonlinear dynamics. This could have significant implications in fields such as fluid dynamics, control systems, and any domain where understanding the underlying dynamics is crucial.
An Explicit Surrogate for Gaussian Mixture Flow Matching with Wasserstein Gap Bounds
Optimization
Generative Models
Theory
- Development of a closed-form surrogate for Gaussian mixture transport using affine flow dynamics.
- Establishment of second-order agreement between the surrogate and the exact Gaussian Wasserstein cost under local commuting assumptions.
- Derivation of an explicit cubic bound on the surrogate-Wasserstein gap in local commuting regimes.
- Introduction of a path-splitting strategy for improved error control in nonlocal transport scenarios.
Summary
This paper addresses the problem of training-free flow matching between two Gaussian mixture models (GMMs) by introducing an explicit surrogate for the kinetic transport cost. The authors develop a method that constructs component-wise Gaussian paths using affine velocity fields that satisfy the continuity equation, leading to a closed-form expression for the pairwise kinetic transport cost. Unlike the exact Gaussian Wasserstein cost, which requires complex matrix square-root computations, the proposed surrogate is derived from the kinetic energy of the flow and offers a simpler analytic form. The paper provides a thorough analysis of the accuracy of this surrogate, proving second-order agreement in a local commuting regime and deriving an explicit cubic error bound. To extend the applicability of the surrogate to nonlocal regimes, a path-splitting strategy is introduced, allowing for localized covariance evolution and piecewise error control. The authors compare the surrogate with the exact Gaussian Wasserstein geodesic, presenting a practical regime map that indicates when to use the surrogate versus the exact method. The main contributions include the development of a training-free surrogate for Gaussian component transport, explicit accuracy guarantees, and a detailed analysis of the efficiency-accuracy trade-off between the surrogate and exact methods.
Methodology
The authors construct a time-dependent density and velocity field to connect two GMMs, utilizing affine velocity fields that satisfy the continuity equation. They analyze the surrogate's accuracy through mathematical proofs and derive bounds for both local and nonlocal regimes. The methodology includes a comparison with the exact Gaussian Wasserstein geodesic and the introduction of a path-splitting strategy for enhanced error management.
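For reference, the exact Gaussian squared Wasserstein cost that the surrogate is benchmarked against (and whose matrix square roots it is designed to avoid) can be computed as below for a single pair of mixture components; the surrogate's own closed form is not reproduced here.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m0, S0, m1, S1):
    """Exact squared 2-Wasserstein distance between N(m0, S0) and N(m1, S1):
    ||m0 - m1||^2 + tr(S0 + S1 - 2 (S0^{1/2} S1 S0^{1/2})^{1/2})."""
    root_S0 = sqrtm(S0)
    cross = sqrtm(root_S0 @ S1 @ root_S0)
    bures = np.trace(S0 + S1 - 2 * np.real(cross))
    return float(np.sum((m0 - m1) ** 2) + bures)

# Example: two 2-D Gaussian components of a mixture.
m0, S0 = np.zeros(2), np.eye(2)
m1, S1 = np.ones(2), np.diag([2.0, 0.5])
print(gaussian_w2_squared(m0, S0, m1, S1))
```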
Results
The paper demonstrates that the proposed surrogate closely approximates the exact Gaussian squared Wasserstein cost with a second-order agreement in local commuting scenarios. An explicit cubic error bound is established, and the path-splitting strategy allows for effective handling of nonlocal transport, providing a clear regime map for selecting between the surrogate and exact methods based on accuracy and efficiency.
Implications
The findings have significant implications for optimal transport applications, particularly in generative modeling and machine learning, where efficient and accurate transport of probability distributions is crucial. The proposed methods can enhance computational efficiency in scenarios involving complex endpoint distributions.
Beta-Scheduling: Momentum from Critical Damping as a Diagnostic and Correction Tool for Neural Network Training
Optimization
Interpretability
Efficient ML
- Introduces a diagnostic pipeline that connects damping regimes, error-specific gradient attribution, and surgical layer correction.
- Successfully identifies and corrects errors in specific layers of neural networks without full retraining.
- Demonstrates cross-optimizer invariance in identifying problematic layers, suggesting architectural issues rather than optimizer artifacts.
- Achieves significant computational savings (an 82% reduction) and a net improvement of +22 corrected errors compared to full retraining.
Summary
This paper presents a novel diagnostic pipeline for neural networks that addresses the challenge of identifying and correcting errors in specific layers without the need for full retraining. The pipeline integrates three key concepts: the damped harmonic oscillator model of stochastic gradient descent (SGD) with momentum, error-specific gradient attribution focused on misclassified images, and a surgical correction approach that targets only the identified problematic layers using physics-derived momentum. The approach is validated on a ResNet-18 model trained on the CIFAR-10 dataset, where it successfully identifies three out of seven layer groups as sources of error. Surgical corrections applied to these layers resulted in the fixing of 62 errors, achieving a net improvement of +22 with an 82% reduction in computational cost compared to full retraining. Notably, the diagnostic method demonstrated cross-optimizer invariance, identifying the same problematic layers in both SGD and Adam models, indicating that the issues are architectural rather than artifacts of the optimization process. Additionally, a zero-parameter momentum schedule derived from the critical damping condition was found to accelerate convergence significantly, achieving 90% accuracy 1.9 times faster. The findings suggest that targeted layer-level interventions can enhance the efficiency of model repairs, with implications for knowledge editing and fine-tuning in large language models.
Methodology
The methodology involves a diagnostic pipeline that classifies training epochs using a damped harmonic oscillator model, computes gradient attribution on misclassified images to identify error sources, and applies surgical corrections to the identified layers using a physics-derived momentum schedule. The effectiveness of the pipeline is tested on a ResNet-18 model with the CIFAR-10 dataset.
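For a single quadratic mode with curvature λ and learning rate η, the critical-damping condition for heavy-ball momentum gives β = (1 − √(ηλ))². The sketch below turns this into a per-step schedule using a secant curvature estimate, which is an assumption about how curvature would be tracked, not the paper's estimator.

```python
import torch

def critical_damping_beta(lr, curvature):
    """Heavy-ball momentum at critical damping for a quadratic mode with
    curvature lambda: beta = (1 - sqrt(lr * lambda))^2, i.e. the repeated-root
    condition of z^2 - (1 + beta - lr*lambda) z + beta = 0."""
    s = min(max(lr * curvature, 0.0), 1.0)     # clamp so beta stays in [0, 1]
    return (1.0 - s ** 0.5) ** 2

def estimate_curvature(param_prev, param_now, grad_prev, grad_now, eps=1e-12):
    """Secant curvature estimate along the trajectory (illustrative choice)."""
    dg = (grad_now - grad_prev).norm()
    dx = (param_now - param_prev).norm()
    return (dg / (dx + eps)).item()

# Per-step schedule: beta_t = critical_damping_beta(lr, estimate_curvature(...)).
```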
Results
The pipeline identified three problematic layer groups out of seven, leading to the correction of 62 errors. The best variant of the surgical correction achieved a net improvement of +22 with an 82% reduction in computational resources compared to full retraining. The diagnostic measures showed 100% overlap in identified layers across different optimization methods, indicating robustness.
Implications
The findings suggest that the proposed diagnostic pipeline can be applied to efficiently repair specific failure modes in neural networks, potentially influencing practices in knowledge editing and representation engineering in large language models. This could lead to more efficient training and fine-tuning processes in various applications.
Variational Graph Neural Networks for Uncertainty Quantification in Inverse Problems
Graph Learning
- Introduces a hybrid architecture combining graph neural networks with variational inference for uncertainty quantification.
- Addresses limitations of traditional deterministic methods in inverse problems by providing measures of confidence in predictions.
- Demonstrates high precision in recovering physical parameters and estimating loads with associated confidence intervals.
- Validates methodology through practical applications in solid mechanics, showcasing its effectiveness in real-world scenarios.
Summary
This paper introduces a novel architecture called Variational Graph Neural Networks (VGNNs) aimed at addressing uncertainty quantification in inverse problems, particularly in the context of computational mechanics. Traditional deterministic methods often fail to provide reliable measures of confidence in their predictions, especially when dealing with noisy data or non-unique solutions. The proposed VGNN architecture integrates variational layers into the decoder, allowing for the modeling of weight probability distributions without the computational burden associated with full Bayesian networks. The methodology is validated through two solid mechanics cases: identifying the elastic modulus in a 2D elastic problem and locating loads on a 3D hyperelastic beam using only displacement fields as input. The results demonstrate that VGNNs not only accurately recover physical parameters but also provide confidence intervals that align with the underlying physics, thus enhancing the reliability of predictions in critical applications such as Digital Twins in engineering and medicine.
Methodology
The authors developed a VGNN architecture that incorporates variational layers to model the probability distribution of weights. This approach allows for the quantification of epistemic (model) and aleatoric (data) uncertainties while maintaining computational efficiency. The architecture was tested on two solid mechanics problems, utilizing only displacement fields as input data.
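A minimal sketch of the variational-decoder idea: a linear layer with Gaussian weights sampled via the reparameterization trick, and Monte Carlo forward passes yielding a predictive mean and spread. Layer sizes and priors are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VariationalLinear(nn.Module):
    """Linear layer with a Gaussian distribution over weights (mean plus
    log-variance), sampled with the reparameterisation trick."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(d_out, d_in) * 0.05)
        self.w_logvar = nn.Parameter(torch.full((d_out, d_in), -6.0))
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        w = self.w_mu + torch.randn_like(self.w_mu) * (0.5 * self.w_logvar).exp()
        return x @ w.T + self.bias

def predict_with_uncertainty(decoder, z, n_samples=50):
    """Monte Carlo forward passes give a predictive mean and a confidence band."""
    samples = torch.stack([decoder(z) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)
```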
Results
The VGNN model successfully identified the elastic modulus in a 2D elastic problem and accurately located and quantified loads on a 3D hyperelastic beam. The model provided confidence intervals consistent with the physics of the problems, demonstrating its capability to deliver reliable predictions.
Implications
The proposed VGNN architecture has significant implications for fields requiring reliable uncertainty quantification, such as engineering simulations and precision medicine. Its ability to provide confidence measures enhances the utility of machine learning in critical applications like Digital Twins, where accurate and trustworthy predictions are essential.
Mitigating Forgetting in Continual Learning with Selective Gradient Projection
Optimization
Theory
Efficient ML
- Introduction of Selective Forgetting-Aware Optimization (SFAO) to mitigate catastrophic forgetting.
- Utilization of cosine similarity and per-layer gating for controlled gradient updates.
- Achieves a 90% reduction in memory cost while maintaining competitive accuracy.
- Demonstrates improved performance on continual learning benchmarks, especially with MNIST datasets.
Summary
The paper addresses the challenge of catastrophic forgetting in neural networks, particularly in continual learning scenarios where models must adapt to new tasks without losing previously acquired knowledge. The authors propose a novel method called Selective Forgetting-Aware Optimization (SFAO), which employs a dynamic mechanism to regulate gradient updates based on cosine similarity and per-layer gating. This approach allows for controlled forgetting while maintaining a balance between plasticity (the ability to learn new tasks) and stability (the retention of old knowledge). SFAO selectively projects, accepts, or discards updates using a tunable mechanism that significantly reduces memory costs by 90% compared to existing methods. Experimental results demonstrate that SFAO achieves competitive accuracy and improved performance on standard continual learning benchmarks, particularly on the MNIST dataset, making it suitable for resource-constrained environments.
Methodology
The SFAO method employs a per-layer gating rule that determines whether to accept, project, or discard gradient updates based on their cosine similarity with previously stored gradients. This is achieved through a Monte Carlo approximation that samples a subset of past gradients, allowing for efficient computation. The method balances the need for stability in previously learned tasks with the adaptability required for new tasks, thus addressing the issue of gradient-induced interference.
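A per-layer gating rule in this spirit might look like the sketch below, where a sampled memory of past gradients supplies the reference direction; the thresholds and the exact projection are illustrative assumptions rather than SFAO's published rule.

```python
import torch
import torch.nn.functional as F

def sfao_gate(grad_new, past_grads, accept_thr=0.0, discard_thr=-0.9):
    """Per-layer gating sketch: compare the new gradient with a Monte Carlo
    sample of stored past gradients via cosine similarity, then accept,
    project out the conflicting component, or discard."""
    g = grad_new.flatten()
    ref = torch.stack([p.flatten() for p in past_grads]).mean(0)  # sampled memory
    cos = F.cosine_similarity(g, ref, dim=0)
    if cos >= accept_thr:
        return grad_new                                # aligned: accept as-is
    if cos <= discard_thr:
        return torch.zeros_like(grad_new)              # strongly conflicting: discard
    # In between: remove the component that points against old-task gradients.
    proj = (g @ ref) / (ref @ ref + 1e-12) * ref
    return (g - proj).view_as(grad_new)
```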
Results
SFAO was evaluated on standard continual learning benchmarks, showing that it not only maintains competitive accuracy but also significantly reduces memory usage by 90%. The method demonstrated improved performance in terms of reduced forgetting on the MNIST dataset, indicating its effectiveness in managing the trade-off between learning new information and retaining old knowledge.
Implications
The findings suggest that SFAO can be effectively applied in dynamic environments where models need to continuously learn and adapt, such as in autonomous driving, medical diagnostics, and cybersecurity. Its low memory requirements make it particularly suitable for deployment in resource-constrained scenarios, enhancing the reliability and efficiency of continual learning systems.
Foundations of Polar Linear Algebra
Theory
Efficient ML
Interpretability
- Introduction of Polar Linear Algebra as a structured framework for operator learning.
- Demonstrated effectiveness on the MNIST benchmark, showing reliable training of polar operators.
- Imposing self-adjoint-inspired spectral constraints improves training stability and convergence.
- Reduction in parameter count and computational complexity while enhancing interpretability.
Summary
This paper introduces Polar Linear Algebra, a novel framework for operator learning that integrates polar geometry with linear and periodic components. The framework is structured around radial-angular operators, allowing for a spectral analysis of their properties. The author demonstrates the feasibility of this approach through experiments on the MNIST dataset, showing that polar operators can be effectively trained. The results indicate that imposing spectral constraints inspired by self-adjointness enhances stability and convergence during training. Additionally, the proposed framework reduces the number of parameters and computational complexity while providing a clearer interpretation through decoupled spectral modes. This shift from spatial to spectral domains enables the decomposition of problems into orthogonal eigenmodes, facilitating independent computational pipelines and enhancing model parallelization. The work presents a fresh perspective on operator learning, particularly beneficial for tasks where spectral structure and parallel execution are critical.
Methodology
The paper employs a spectral perspective to define and analyze polar operators, utilizing polar geometry to combine radial and angular components. The framework is evaluated through experiments on the MNIST dataset, focusing on the training of polar and fully spectral operators under various constraints.
Results
The experiments on MNIST revealed that polar operators can be trained effectively, with improved stability and convergence when spectral constraints are applied. The approach also resulted in a significant reduction in parameters and computational complexity, while providing a more interpretable model structure.
Implications
The findings suggest that Polar Linear Algebra could be applied to various machine learning tasks that benefit from spectral analysis and parallel execution, potentially leading to more efficient and interpretable models in fields such as computer vision and beyond.
Target-Aligned Reinforcement Learning
Reinforcement Learning
Theory
Optimization
- TARL mitigates the stability-recency tradeoff by prioritizing updates based on alignment between target and online network estimates.
- A novel offline-online target alignment metric is introduced to quantify the agreement between value estimates.
- Theoretical analysis indicates that learning on aligned transitions acts as a variance reduction mechanism, improving learning efficiency.
- Empirical results show consistent improvements over standard RL algorithms in various environments.
Summary
The paper introduces Target-Aligned Reinforcement Learning (TARL), a novel framework designed to address the stability-recency tradeoff inherent in traditional reinforcement learning (RL) algorithms that utilize target networks. While target networks stabilize training by providing lagged estimates, this lag can lead to stale target values that do not align with the current policy, hindering convergence speed. TARL focuses on transitions where the estimates from the online and target networks are highly aligned, allowing for updates that leverage recent information without sacrificing stability. The authors provide a theoretical analysis showing that this alignment can accelerate convergence and present empirical results demonstrating that TARL consistently outperforms standard RL algorithms across various benchmark environments. The framework can be integrated into existing RL algorithms, enhancing their efficiency and effectiveness.
Methodology
The authors propose TARL, which integrates a target alignment metric into standard RL algorithms. The framework emphasizes updates based on the alignment of estimates from the online and target networks, allowing for more effective learning by filtering out misaligned updates. Theoretical analysis supports the efficacy of this approach, while empirical studies validate its performance across different RL tasks.
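One way to realize alignment-filtered updates is sketched below: transitions whose online and target next-state values disagree beyond a tolerance are masked out of the TD loss. The specific alignment metric (relative value gap) and threshold are assumptions for illustration, standing in for the paper's offline-online alignment measure.

```python
import torch
import torch.nn.functional as F

def tarl_td_loss(q_online, q_target, batch, gamma=0.99, align_thr=0.1):
    """TD loss restricted to transitions on which the online and target
    networks agree about the next-state value (illustrative sketch)."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        v_online = q_online(s_next).max(dim=-1).values
        v_target = q_target(s_next).max(dim=-1).values
        align = (v_online - v_target).abs() / (v_target.abs() + 1e-6)
        mask = (align < align_thr).float()            # keep aligned transitions only
        target = r + gamma * (1 - done) * v_target
    q_sa = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    per_sample = F.smooth_l1_loss(q_sa, target, reduction="none")
    return (mask * per_sample).sum() / mask.sum().clamp(min=1.0)
```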
Results
The empirical results demonstrate that TARL significantly improves convergence speed and overall performance compared to traditional RL algorithms that rely on standard target networks. The framework shows consistent advantages across various benchmark environments, confirming its effectiveness in addressing the stability-recency tradeoff.
Implications
TARL has the potential to enhance the performance of a wide range of RL applications, particularly in environments where rapid adaptation to changing dynamics is crucial. By improving the efficiency of learning processes, TARL could lead to more robust and effective RL agents in real-world scenarios.
From Density Matrices to Phase Transitions in Deep Learning: Spectral Early Warnings and Interpretability
Theory
Interpretability
Optimization
- Introduction of the 2-datapoint reduced density matrix (2RDM) for studying phase transitions in neural networks.
- Derivation of spectral diagnostics: spectral heat capacity for early warning of second-order transitions and participation ratio for dimensionality assessment.
- Top eigenvectors of the 2RDM offer mechanistic insights into the nature of transitions.
- Validation of the framework across multiple distinct settings.
Summary
This paper addresses the challenge of predicting and understanding emergent capabilities in AI models during training by introducing the 2-datapoint reduced density matrix (2RDM). This computationally efficient observable allows for the study of phase transitions in neural networks. By analyzing the eigenvalue statistics of the 2RDM, the authors derive two key signals: the spectral heat capacity (SHC), which serves as an early warning for second-order phase transitions, and the participation ratio, which indicates the dimensionality of the model's reorganization. The top eigenvectors of the 2RDM provide interpretable insights into the nature of these transitions. The framework is validated across various settings, including deep linear networks and emergent misalignment, and establishes connections with existing theoretical frameworks, enhancing the understanding of phase transitions in deep learning.
Methodology
The authors introduce the 2RDM, which captures the covariance of per-sample losses over a selected probe set during training. They derive spectral diagnostics from the 2RDM and validate the framework through empirical studies across different neural network settings.
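A sketch of the diagnostics, assuming the 2RDM is formed from the covariance of probe-set losses over a training window and normalized to unit trace. The participation ratio follows the standard definition, while the spectral-heat-capacity proxy (entropy change across windows) is an assumption about the paper's exact construction.

```python
import numpy as np

def two_rdm_diagnostics(loss_history):
    """loss_history: array of shape (T, P) with per-sample losses on a fixed
    probe set over a training window. Returns the participation ratio and the
    spectral entropy of the unit-trace loss covariance."""
    cov = np.cov(loss_history, rowvar=False)          # P x P covariance
    rho = cov / np.trace(cov)                         # unit-trace "density matrix"
    lam = np.clip(np.linalg.eigvalsh(rho), 1e-12, None)
    lam = lam / lam.sum()
    participation_ratio = 1.0 / np.sum(lam ** 2)
    spectral_entropy = -np.sum(lam * np.log(lam))
    return participation_ratio, spectral_entropy

def entropy_change(window_a, window_b):
    """Heat-capacity-style early-warning proxy (assumed form): the change in
    spectral entropy between consecutive training windows."""
    return two_rdm_diagnostics(window_b)[1] - two_rdm_diagnostics(window_a)[1]
```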
Results
The study demonstrates that the 2RDM effectively captures phase transitions during training, providing early warning signals for second-order transitions and concurrent detection for first-order transitions. The framework is validated in four distinct scenarios, showcasing its applicability and interpretability.
Implications
This work has significant implications for understanding and controlling model training in deep learning, offering a new tool for researchers to predict emergent capabilities and manage phase transitions effectively.
Reward-Based Online LLM Routing via NeuralUCB
Large Language Models
NLP
Reinforcement Learning
- NeuralUCB is proposed as a novel approach for cost-aware LLM routing.
- The method outperforms random and min-cost baselines in utility reward.
- Achieves lower inference costs while maintaining competitive rewards compared to max-quality models.
- The study highlights challenges in action discrimination and exploration.
Summary
This paper explores the application of NeuralUCB for cost-aware routing of large language models (LLMs). The authors identify the need for efficient routing methods that balance the trade-off between model quality and inference cost. They categorize existing routing approaches into supervised and partial-feedback methods, each with distinct advantages and limitations. The proposed NeuralUCB-based routing policy is implemented and evaluated using the RouterBench benchmark in a simulated online environment. The results demonstrate that the NeuralUCB method consistently outperforms both random and min-cost baselines in terms of utility reward, achieving lower inference costs compared to a max-quality reference while maintaining competitive rewards. The study emphasizes the potential of NeuralUCB for effective LLM routing while acknowledging challenges related to action discrimination and exploration in the context of sparse feedback.
Methodology
The authors frame the problem of LLM routing as a contextual bandit problem, employing NeuralUCB to learn a non-linear routing policy. They utilize a UtilityNet architecture to predict utility rewards based on context-action pairs, incorporating feedback signals for model quality and inference cost. The routing policy is designed to maximize long-term expected utility by selecting the most appropriate model for each query based on the predicted utility.
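A compressed sketch of the routing loop, using a diagonal approximation to the NeuralUCB confidence matrix; the UtilityNet layout, the one-hot arm encoding, and the bonus scale β are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class UtilityNet(nn.Module):
    """Predicts the utility (quality reward minus cost penalty) of routing a
    query context to a candidate model; layer sizes are illustrative."""
    def __init__(self, d_ctx, n_models, hidden=64):
        super().__init__()
        self.n_models = n_models
        self.net = nn.Sequential(nn.Linear(d_ctx + n_models, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, ctx, arm):
        onehot = nn.functional.one_hot(arm, self.n_models).float()
        return self.net(torch.cat([ctx, onehot], dim=-1)).squeeze(-1)

def neural_ucb_select(model, ctx, z_diag, beta=1.0):
    """NeuralUCB-style arm choice with a diagonal confidence matrix z_diag
    (running sum of squared gradients, initialised to ones): score each arm by
    predicted utility plus a gradient-based exploration bonus."""
    scores = []
    for a in range(model.n_models):
        u = model(ctx, torch.tensor([a])).sum()
        grads = torch.autograd.grad(u, list(model.parameters()))
        flat = torch.cat([g.flatten() for g in grads]).detach()
        bonus = beta * torch.sqrt((flat ** 2 / z_diag).sum())
        scores.append(u.item() + bonus.item())
    return int(torch.tensor(scores).argmax())
# After observing the realised utility of the chosen model, z_diag is updated
# with that arm's squared gradient and UtilityNet is refit on the reward log.
```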
Results
Experimental evaluations reveal that the NeuralUCB-based routing policy consistently yields higher utility rewards than random and min-cost baselines. It also demonstrates a significant reduction in inference costs compared to max-quality references while still achieving competitive reward levels. These findings validate the effectiveness of NeuralUCB in optimizing LLM routing.
Implications
The findings suggest that NeuralUCB can be effectively applied to optimize the selection of LLMs in real-time applications, potentially leading to more efficient resource utilization in AI systems. This approach could enhance user experience by providing high-quality responses at lower costs, making it suitable for various applications in natural language processing.
HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling
Robotics
Computer Vision
Graph Learning
- HCLSM integrates object-centric decomposition, hierarchical temporal dynamics, and causal reasoning in a single differentiable architecture.
- The two-stage training protocol enhances model performance by first specializing object slots before predicting dynamics.
- The model achieves state-of-the-art performance on the PushT benchmark with significant improvements in prediction accuracy and speed.
- A custom Triton kernel optimizes the selective state space model scan, drastically reducing computational time.
Summary
The paper introduces HCLSM, a novel world model architecture designed to enhance object-centric world modeling by addressing the limitations of existing models that rely on flat latent representations. HCLSM operates on three core principles: object-centric decomposition using slot attention with spatial broadcast decoding; hierarchical temporal dynamics through a three-level engine that combines selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals; and causal structure learning via graph neural network interactions. The authors propose a two-stage training protocol that first focuses on spatial reconstruction to encourage slot specialization before transitioning to dynamics prediction. The model, comprising 68 million parameters, was trained on the PushT robotic manipulation benchmark, achieving a next-state prediction loss of 0.008 MSE, with significant improvements in spatial decomposition and learned event boundaries. Additionally, a custom Triton kernel was developed, resulting in a 38× speedup over traditional PyTorch implementations. The authors provide an open-source codebase, training infrastructure, and evaluation suite to facilitate further research.
Methodology
HCLSM employs a five-layer architecture that combines object slots, a hierarchical temporal structure (selective state space models, sparse transformers, and compressed transformers), and causal interaction modeling through graph neural networks. The training is conducted in two stages: first focusing on spatial reconstruction to ensure slot specialization, followed by dynamics prediction.
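A skeleton of the two-stage protocol, assuming a hypothetical model interface with encode/decode/predict methods; the optimizer, losses, and epoch counts are illustrative and not taken from the paper.

```python
import torch.nn.functional as F

def train_two_stage(model, loader, opt, recon_epochs=10, dyn_epochs=40):
    """Two-stage protocol sketch: (1) reconstruction only, to encourage slot
    specialisation; (2) next-state (dynamics) prediction in slot space.
    model.encode / model.decode / model.predict are a hypothetical interface."""
    # Stage 1: spatial reconstruction only.
    for _ in range(recon_epochs):
        for obs, _next_obs in loader:
            slots = model.encode(obs)
            loss = F.mse_loss(model.decode(slots), obs)
            opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: dynamics prediction.
    for _ in range(dyn_epochs):
        for obs, next_obs in loader:
            slots = model.encode(obs)
            pred_next = model.predict(slots)
            loss = F.mse_loss(model.decode(pred_next), next_obs)
            opt.zero_grad(); loss.backward(); opt.step()
```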
Results
The HCLSM model achieved a next-state prediction loss of 0.008 MSE on the PushT benchmark, with a spatial decomposition loss of 0.0075. The implementation also demonstrated a 38× speedup in processing time due to the custom Triton kernel, significantly enhancing efficiency.
Implications
The advancements presented in HCLSM have the potential to improve robotic manipulation and other applications requiring object-centric reasoning and prediction. The model's ability to learn causal relationships and temporal dynamics could enhance decision-making processes in autonomous systems.
On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry
Theory
- Develops an asymptotic theory for self-supervised pre-training using two-stage M-estimation.
- Addresses the challenge of group symmetry in pre-training estimators through Riemannian geometry.
- Establishes a link between pre-training representations and downstream predictors via orbit-invariance.
- Applies theoretical results to case studies, showing substantial improvements over prior work.
Summary
This paper addresses the theoretical underpinnings of self-supervised pre-training in machine learning, particularly focusing on the interaction between pre-training and fine-tuning phases. The authors develop an asymptotic theory using two-stage M-estimation to analyze how pre-training on unlabeled data influences downstream task performance. A significant challenge in this area is the group symmetry inherent in pre-training estimators, which complicates the application of traditional M-estimation techniques. The authors utilize Riemannian geometry to explore the intrinsic parameters of the pre-training representation and establish a connection to downstream predictors through the concept of orbit-invariance. This framework allows for a precise characterization of the limiting distribution of downstream test risk. The paper further applies its findings to various case studies, including spectral pre-training and Gaussian mixture models, demonstrating improvements over existing theoretical bounds in specific scenarios. Overall, this work aims to provide a sharper, instance-optimal understanding of the benefits of self-supervised pre-training.
Methodology
The authors employ two-stage M-estimation and tools from Riemannian geometry to analyze the asymptotic behavior of pre-training representations. They focus on establishing the asymptotic normality of these representations and their relationship to downstream regression tasks, particularly through the lens of orbit-invariance.
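Schematically, and with notation that is ours rather than the paper's, the two stages can be written as:

```latex
% Schematic two-stage M-estimation setup (our notation, not the paper's):
\hat\theta_n \in \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} m\!\left(Z_i;\theta\right)
  \qquad \text{(pre-training on unlabeled } Z_i\text{)}
\\[4pt]
\hat\beta_m \in \arg\min_{\beta}\; \frac{1}{m}\sum_{j=1}^{m}
  \ell\!\left(Y_j,\; g_\beta\!\big(f_{\hat\theta_n}(X_j)\big)\right)
  \qquad \text{(fine-tuning on labeled pairs)}
```

Here orbit-invariance amounts to the downstream predictor depending on the pre-trained representation only through its group orbit, which is what allows the limiting distribution of the downstream test risk to be characterized despite the symmetry of the pre-training objective.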
Results
The main results include a precise characterization of the limiting distribution of downstream test risk based on pre-training representations. The authors demonstrate that their theoretical framework leads to significant improvements in understanding the role of pre-training in various models, including spectral pre-training and Gaussian mixture models, compared to existing literature.
Implications
This work has implications for the design of self-supervised learning frameworks, providing a deeper theoretical understanding that can guide the development of more effective pre-training strategies. It may also influence how practitioners approach the balance between pre-training and fine-tuning in machine learning applications.
Preconditioned Attention: Enhancing Efficiency in Transformers
Efficient ML
Optimization
Theory
- Standard attention mechanisms in Transformers can produce ill-conditioned matrices, negatively impacting training efficiency.
- Preconditioned attention introduces a conditioning matrix to improve the condition number of attention matrices.
- The method is theoretically grounded and empirically validated across multiple transformer applications.
- Preconditioned attention is compatible with various existing attention mechanisms, enhancing their performance.
Summary
This paper addresses the inefficiencies in standard attention mechanisms used in Transformers, which often lead to ill-conditioned matrices that hinder gradient-based optimization. The author introduces a novel approach called preconditioned attention, which incorporates a conditioning matrix into each attention head. The theoretical framework developed demonstrates that this method significantly reduces the condition number of attention matrices, thereby improving their conditioning and facilitating more efficient training. The effectiveness of preconditioned attention is validated across various applications, including image classification, object detection, instance segmentation, long sequence modeling, and language modeling. The proposed method serves as a versatile drop-in replacement for existing attention mechanisms, consistently yielding superior results across diverse tasks.
Methodology
The paper presents a theoretical analysis of the self-attention matrix's condition number and proposes a preconditioner that modifies the self-attention matrix to reduce its condition number. This preconditioner is dynamically computed based on the queries, keys, and values of the self-attention mechanism, allowing for improved optimization during training.
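The ill-conditioning claim is easy to probe directly; the diagnostic below computes per-head condition numbers of post-softmax attention matrices. It illustrates the problem being addressed, not the paper's preconditioner, which is not reproduced here.

```python
import torch

def attention_condition_numbers(Q, K, eps=1e-12):
    """Condition number of the post-softmax attention matrix for each head
    (ratio of largest to smallest singular value). Diagnostic only."""
    d = Q.shape[-1]
    A = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)  # (heads, n, n)
    s = torch.linalg.svdvals(A)
    return s[..., 0] / (s[..., -1] + eps)

# Example: random queries/keys for 8 heads over 64 tokens.
Q = torch.randn(8, 64, 32)
K = torch.randn(8, 64, 32)
print(attention_condition_numbers(Q, K))
```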
Results
Empirical evaluations demonstrate that preconditioned attention outperforms standard attention mechanisms across a range of tasks, including image classification and language modeling, leading to more efficient training and better overall performance.
Implications
The introduction of preconditioned attention has the potential to enhance the training efficiency of Transformers in various applications, making it a valuable tool for researchers and practitioners in machine learning. Its versatility allows it to be integrated into existing models without significant modifications, promoting broader adoption in the field.
Kernel Dynamics under Path Entropy Maximization
Theory
- The kernel function is treated as a dynamical variable, influencing the optimization landscape of inference.
- Fixed points of the dynamics correspond to self-consistent kernels that reinforce their own distinction structures.
- Kernel change incurs a thermodynamic cost, establishing a link between information theory and thermodynamics.
- The framework connects various domains, including biology, learning, and craft mastery, through structured correspondences.
Summary
This paper introduces a variational framework that treats the kernel function as a dynamical variable subject to path entropy maximization, known as Maximum Caliber (MaxCal). The kernel function is viewed as the foundational object that encodes the distinctions an agent can represent, influencing the geometry of probability space. The author formulates fixed-point conditions for self-consistent kernels and proposes the renormalization group (RG) flow as a structured special case. The evolution of the neural tangent kernel (NTK) during deep network training is suggested as an empirical instantiation of this framework. The paper establishes that the work required for kernel change is thermodynamically bounded, linking mutual information to the cost of conceptual change. The framework is positioned within the context of assembly theory and the MaxCal literature, distinguishing formal results from conjectural interpretations, and poses six open questions for future exploration.
Methodology
The paper employs a variational approach to maximize path entropy over trajectories in kernel space, deriving fixed-point conditions and stability criteria. It integrates concepts from information geometry, thermodynamics, and the renormalization group to analyze kernel dynamics.
Results
The study establishes a framework for understanding how kernels evolve over time, identifying self-consistent kernels as stable fixed points. It also provides a thermodynamic interpretation of kernel changes, linking them to mutual information and the work required for conceptual shifts.
Implications
This framework has potential applications in understanding the dynamics of learning systems, biological evolution, and the development of scientific paradigms. It may also inform the design of adaptive algorithms in machine learning and cognitive systems.
Capturing Multivariate Dependencies of EV Charging Events: From Parametric Copulas to Neural Density Estimation
Time Series
- Introduces Vine copulas and CODINE for modeling EV charging events.
- Demonstrates superior performance in capturing multivariate dependencies compared to traditional methods.
- Evaluates models on diverse real-world datasets, enhancing generalizability.
- Preserves tail behaviors and correlation structures effectively.
Summary
This paper addresses the challenge of accurately modeling electric vehicle (EV) charging events, which is crucial for grid reliability and smart charging design. Traditional statistical methods often fail to capture the complex, non-linear dependencies among variables such as arrival times, durations, and energy demand. The authors introduce a novel approach that combines Vine copulas and a Copula Density Neural Estimation framework (CODINE) to model these dependencies in the EV domain. The study evaluates these models across three diverse datasets from Slovakia, Norway, and Scotland, demonstrating that the proposed methods outperform traditional parametric models and remain competitive with advanced benchmarks like conditional Gaussian Mixture Model Networks. The results indicate that Vine copulas and CODINE effectively preserve tail behaviors and correlation structures, providing a robust framework for synthetic charging event generation across various infrastructure contexts.
Methodology
The authors employed Vine copulas and the CODINE framework to model the joint dependencies of EV charging events. They validated their approach using three datasets, focusing on variables such as arrival time, charging duration, and energy consumed. The models were trained on a chronological split of the data, ensuring robust evaluation.
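For orientation, the sketch below fits a plain Gaussian copula to session features (marginals via empirical CDFs, dependence via normal-score correlation); this is a simplified baseline for the copula idea, not the vine-copula or CODINE models evaluated in the paper.

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(samples):
    """Separate marginals from dependence: rank-transform to pseudo-observations,
    map to normal scores, and estimate their correlation matrix."""
    n, _ = samples.shape
    u = stats.rankdata(samples, axis=0) / (n + 1)          # pseudo-observations in (0,1)
    z = stats.norm.ppf(u)                                   # normal scores
    return np.corrcoef(z, rowvar=False)                     # dependence structure

def sample_gaussian_copula(corr, marginal_samples, n_new, seed=0):
    """Draw new joint samples: correlated normals -> uniforms -> empirical
    marginal quantiles of the original data (e.g. arrival time, duration, energy)."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_new)
    u = stats.norm.cdf(z)
    cols = [np.quantile(marginal_samples[:, j], u[:, j]) for j in range(corr.shape[0])]
    return np.stack(cols, axis=1)
```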
Results
The study found that the Vine copulas and CODINE framework outperformed established parametric models and benchmarks in preserving the joint dependence structure of EV charging events. The models demonstrated enhanced performance in capturing tail behaviors and correlation structures, making them suitable for synthetic event generation.
Implications
The findings suggest that using advanced copula-based methods can significantly improve the modeling of EV charging events, which is essential for effective grid management and the development of smart charging strategies. This could lead to better load management and reliability in power systems as EV adoption increases.
Monodense Deep Neural Model for Determining Item Price Elasticity
Optimization
Theory
Time Series
- Proposes a novel framework for estimating item price elasticity using large-scale transactional data.
- Introduces the Monodense deep neural network, which combines various neural network layers for improved performance.
- Demonstrates the ability to model price elasticity without requiring control/treatment groups, making it scalable for millions of items.
- Shows superior performance of the proposed model compared to traditional econometric and machine learning methods.
Summary
This paper addresses the challenge of estimating item price elasticity, which quantifies consumer demand responsiveness to price changes. Traditional econometric models often fail to capture complex demand patterns and require costly control/treatment experiments, making them impractical for large-scale retailers. The authors propose a novel Monodense Deep Neural Model (DLM) that leverages large-scale transactional data to estimate price elasticity without the need for treatment groups. The framework incorporates a hybrid neural network architecture that combines embedding, dense, and Monodense layers, allowing for the modeling of intricate relationships between price and demand. The proposed method is evaluated on multi-category retail data, demonstrating its scalability and effectiveness compared to traditional econometric and machine learning approaches. The results indicate that the Monodense DLM outperforms existing methods in estimating price elasticity, providing valuable insights for businesses to optimize pricing strategies and enhance revenue management.
Methodology
The authors developed a framework that creates a rich feature set from aggregated monthly transaction data, including price, inventory, competitor pricing, and other contextual signals. This data is then fed into the Monodense deep neural network to predict demand based on price changes, allowing for the estimation of price elasticity without the need for control groups.
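Once any differentiable demand model is trained, a point elasticity can be read off with automatic differentiation, as in the sketch below; the demand-model interface is hypothetical and the Monodense architecture itself is not reproduced here.

```python
import torch

def price_elasticity(demand_model, features, price_index):
    """Point price elasticity from a differentiable demand model:
    epsilon = (dQ/dP) * (P / Q), evaluated at the given feature rows."""
    x = features.clone().requires_grad_(True)
    q = demand_model(x)                                   # predicted demand, shape (B, 1)
    dq_dx = torch.autograd.grad(q.sum(), x)[0]
    p = features[:, price_index]
    return dq_dx[:, price_index] * p / q.squeeze(-1).clamp(min=1e-6)
```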
Results
The experimental results indicate that the Monodense DLM significantly outperforms traditional econometric models and other machine learning methods in estimating price elasticity. The model effectively captures the non-linear relationships between price and demand, providing accurate elasticity estimates across a diverse range of products.
Implications
The findings suggest that retailers can utilize the proposed framework to optimize pricing strategies, enhance revenue management, and better understand consumer behavior in response to price changes. This approach can be particularly beneficial in competitive markets where accurate elasticity insights are crucial for maximizing profitability.
Concept frustration: Aligning human concepts and machine representations
Interpretability
- Introduces the concept of 'concept frustration' to describe contradictions in known concepts due to unobserved concepts.
- Develops task-aligned similarity measures for detecting concept frustration in machine learning models.
- Demonstrates that frustration can degrade both performance and interpretability in concept-based models.
- Provides a closed-form expression for Bayes-optimal classifier accuracy, highlighting the impact of frustration.
Summary
This paper addresses the challenge of aligning human-interpretable concepts with the internal representations of machine learning systems, particularly in high-stakes domains like medicine and criminal justice. The authors introduce a geometric framework to compare supervised human concepts with unsupervised representations from foundation model embeddings. They define 'concept frustration' as a phenomenon where unobserved concepts create contradictions among known concepts, indicating an incomplete ontology. The authors propose task-aligned similarity measures to detect this frustration, demonstrating that it can be identified in task-aligned geometry, unlike conventional Euclidean methods. They derive a closed-form expression for the accuracy of concept-based classifiers under a linear-Gaussian generative model, breaking down predictive signals into known and unknown contributions. Experiments on synthetic and real data reveal that incorporating frustrating concepts into interpretable models can reorganize concept representations, enhancing alignment between human and machine reasoning. This work provides a framework for diagnosing incomplete ontologies and improving the interpretability of AI systems.
Methodology
The authors developed a geometric framework for comparing supervised and unsupervised concept representations, focusing on task-aligned similarity measures. They derived a closed-form expression for classifier accuracy under a linear-Gaussian model and conducted experiments on synthetic and real-world language and vision tasks to validate their findings.
Results
The experiments confirmed that concept frustration is detectable in foundation model representations. Incorporating frustrating concepts into models reorganized the geometry of learned representations, leading to better alignment between human and machine reasoning. The proposed framework effectively diagnosed incomplete concept ontologies.
Implications
This research has significant implications for the development of interpretable AI systems, particularly in high-risk applications where understanding model reasoning is crucial. It provides a systematic approach to identifying and resolving conceptual inconsistencies, enhancing the reliability and transparency of AI systems.
PRISM: PRIor from corpus Statistics for topic Modeling
NLP
- PRISM enhances LDA by using corpus-intrinsic statistics for initialization.
- The method improves topic coherence and interpretability without relying on external knowledge.
- Empirical results show PRISM's effectiveness across diverse datasets, including text and biological data.
- The approach is particularly beneficial in domains with limited external resources.
Summary
The paper introduces PRISM, a novel method for topic modeling that enhances the Latent Dirichlet Allocation (LDA) framework by deriving a Dirichlet parameter from word co-occurrence statistics within the corpus. Unlike traditional methods that rely on external knowledge, PRISM focuses on corpus-intrinsic initialization, making it particularly useful in emerging or underexplored domains where external resources may be limited. The authors demonstrate that PRISM improves topic coherence and interpretability across various datasets, including text corpora and single-cell RNA sequencing data. The results indicate that PRISM can rival or exceed the performance of models that depend on external knowledge, highlighting the importance of corpus-driven approaches in resource-constrained settings. The methodology involves constructing a second-order word-similarity graph, obtaining diffusion-map embeddings, and clustering them to estimate a data-driven topic-word Dirichlet prior for LDA, while maintaining the original generative process of LDA.
Methodology
PRISM constructs a second-order word-similarity graph from the corpus, applies diffusion-map embeddings, and clusters these embeddings to derive a topic-word Dirichlet prior for LDA. This process allows for a data-driven initialization that enhances the model's performance while preserving the original generative structure of LDA.
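A compact sketch of that pipeline under illustrative hyperparameters (embedding dimension, prior boost); the resulting matrix would be supplied as an asymmetric topic-word prior to an LDA implementation (e.g. as an eta matrix). The details are assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def prism_style_prior(cooc, n_topics, n_dims=16, boost=5.0, base=0.01):
    """Corpus-driven topic-word prior: second-order word similarity ->
    diffusion-map-style embedding -> clustering -> asymmetric Dirichlet prior."""
    sim = cosine_similarity(cooc)                       # second-order similarity graph
    P = sim / sim.sum(axis=1, keepdims=True)            # row-stochastic transition matrix
    vals, vecs = np.linalg.eigh((P + P.T) / 2)          # symmetrised spectral embedding
    emb = vecs[:, -n_dims:] * vals[-n_dims:]            # diffusion-map-style coordinates
    labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(emb)
    V = cooc.shape[0]
    eta = np.full((n_topics, V), base)                  # topic-word Dirichlet prior
    for w, k in enumerate(labels):
        eta[k, w] += boost                              # words of cluster k favoured in topic k
    return eta
```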
Results
Experiments conducted on five text corpora and a single-cell RNA sequencing dataset demonstrate that PRISM significantly enhances topic coherence and interpretability, often matching or surpassing the performance of models that utilize external knowledge sources.
Implications
The findings suggest that corpus-driven initialization methods like PRISM can be crucial for effective topic modeling in fields where external knowledge is scarce or unreliable. This has potential applications in various domains, including social sciences, biology, and any emerging fields where data-driven insights are essential.
Stochastic Dimension Implicit Functional Projections for Exact Integral Conservation in High-Dimensional PINNs
Theory
Optimization
Efficient ML
- Introduces the Stochastic Dimension Implicit Functional Projection (SDIFP) framework for enforcing exact conservation laws in PINNs.
- Bypasses the need for deterministic quadrature and spatial grid dependencies, enhancing scalability in high dimensions.
- Implements a doubly-stochastic unbiased gradient estimator (DS-UGE) to reduce memory complexity during optimization.
- Maintains O(1) point-wise inference efficiency while ensuring mathematical regularity of solutions.
Summary
This paper addresses the challenges of enforcing exact macroscopic conservation laws in high-dimensional physics-informed neural networks (PINNs), particularly focusing on total mass and energy conservation. Traditional methods for enforcing these laws often rely on explicit discrete projections that require deterministic quadrature over uniform grids, which do not scale well with increasing spatial dimensions and compromise the mesh-free nature of PINNs. The proposed Stochastic Dimension Implicit Functional Projection (SDIFP) framework offers a novel approach by applying a global affine transformation directly to the continuous network output, allowing for closed-form algebraic solutions to non-convex integral constraints through detached Monte Carlo quadrature. This method effectively mitigates the issues associated with spatial grid dependencies and high-order differential operator evaluations, which can lead to significant memory overhead. Additionally, the introduction of a doubly-stochastic unbiased gradient estimator (DS-UGE) decouples the spatial mini-batch sampling from the stochastic subsampling of differential operator terms, reducing the optimization memory complexity. The SDIFP framework not only preserves the mathematical regularity of the approximated solutions but also maintains efficient point-wise inference, providing a scalable and mesh-free solution for exactly conservative high-dimensional PDEs.
Methodology
The SDIFP framework applies a global affine transformation to the continuous output of the neural network, allowing for closed-form solutions to integral constraints using detached Monte Carlo quadrature. It employs a doubly-stochastic unbiased gradient estimator to optimize memory usage during backpropagation, decoupling spatial mini-batch sampling from differential operator subsampling.
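For the simplest case of a total-mass constraint, the affine projection reduces to a rescaling by a detached Monte Carlo integral estimate, as sketched below; the handling of more general, non-convex integral constraints described in the paper is not reproduced.

```python
import torch

def mass_conserving_output(net, x_query, x_mc, volume, target_mass):
    """Global-affine projection sketch for a total-mass constraint: rescale the
    raw network output so that the Monte Carlo estimate of its integral equals
    the prescribed mass. The MC estimate is detached, as in the summary."""
    u_mc = net(x_mc)                                     # samples used for quadrature
    mass_est = volume * u_mc.mean()                      # MC estimate of the integral of u
    scale = target_mass / mass_est.detach().clamp(min=1e-8)
    return scale * net(x_query)                          # rescaled MC integral matches target_mass
```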
Results
The SDIFP framework successfully mitigates overfitting to mini-batch sampling variance, preserves the regularity of approximated solutions, and maintains efficient inference. It demonstrates improved scalability and performance in solving high-dimensional PDEs while ensuring adherence to conservation laws.
Implications
The proposed framework has significant implications for computational physics and engineering, enabling more accurate and stable simulations of complex systems governed by partial differential equations. It can be applied in various fields requiring high-dimensional modeling and simulation, such as fluid dynamics, material science, and climate modeling.
Label-efficient Training Updates for Malware Detection over Time
Efficient ML
- Proposes a model-agnostic framework combining active learning and semi-supervised learning for malware detection.
- Demonstrates a reduction in manual labeling costs by up to 90% while maintaining detection performance.
- Introduces a feature-level drift analysis methodology to understand feature stability and its impact on performance.
- Evaluates a comprehensive set of AL and SSL techniques across multiple platforms (Android and Windows).
Summary
This paper addresses the challenges of malware detection using machine learning (ML) in the context of distribution drift, where both benign and malicious software evolve over time. Traditional ML models, when trained under static conditions, suffer performance degradation as they encounter new data distributions. The authors propose a model-agnostic framework that integrates active learning (AL) and semi-supervised learning (SSL) techniques to reduce the costs associated with manual labeling while maintaining detection performance. The study systematically evaluates eight AL and two SSL techniques across Android and Windows malware datasets, revealing that their combination can reduce manual annotation costs by up to 90% while achieving performance comparable to full-labeling retraining. Additionally, the authors introduce a feature-level drift analysis methodology that assesses feature stability over time, correlating it with detector performance. This work provides valuable insights into the behavior of AL and SSL under distribution drift, offering practical guidance for developing effective malware detectors that can adapt over time.
Methodology
The authors conducted a systematic evaluation of eight active learning (AL) techniques and two semi-supervised learning (SSL) techniques, both in isolation and in combination, to assess their effectiveness in adapting malware detectors to distribution drift. They also developed a feature-level drift analysis to measure the stability of features over time and its correlation with detection performance.
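As a rough illustration of how such a labeling loop can be combined, the sketch below mixes uncertainty-based querying with confidence-based pseudo-labeling on synthetic data; the classifier, thresholds, and query budget are placeholders rather than the specific AL and SSL techniques evaluated in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a featurized malware dataset (not the paper's data).
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:200] = True                      # small seed of analyst-labeled samples
y_work = y.copy()                         # working labels (oracle labels + pseudo-labels)

clf = LogisticRegression(max_iter=1000)
for _ in range(5):
    clf.fit(X[labeled], y_work[labeled])
    pool = np.flatnonzero(~labeled)
    proba = clf.predict_proba(X[pool])[:, 1]

    # Active learning: query the analyst (oracle) on the most uncertain pool samples.
    query = pool[np.argsort(np.abs(proba - 0.5))[:50]]
    y_work[query] = y[query]              # oracle returns the true label
    labeled[query] = True

    # Semi-supervised learning: adopt high-confidence predictions as pseudo-labels.
    confident = np.abs(proba - 0.5) > 0.49
    pseudo = pool[confident]
    y_work[pseudo] = (proba[confident] > 0.5).astype(int)
    labeled[pseudo] = True

print("fraction of data used for training:", labeled.mean())
```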
Results
The combination of AL and SSL techniques achieved comparable detection performance to full retraining while significantly reducing labeling costs by up to 90%. The feature-level drift analysis provided insights into how feature stability affects the performance of malware detectors over time.
Implications
This research has significant implications for the development of cost-effective and adaptive malware detection systems. By reducing the reliance on manual labeling, it enables more efficient updates to detection models in real-world scenarios, ultimately enhancing cybersecurity measures against evolving malware threats.
Tucker Attention: A generalization of approximate attention mechanisms
NLP
Large Language Models
Efficient ML
- Tucker Attention generalizes existing approximate attention mechanisms, providing a more interpretable framework.
- It significantly reduces the number of parameters required for self-attention while maintaining performance.
- The method encompasses existing techniques like GQA and MLA, offering insights into their low-rank structures.
- Tucker Attention is compatible with advanced attention techniques such as flash-attention and RoPE.
Read more
Tucker Attention: A generalization of approximate attention mechanisms
Summary
The paper introduces Tucker Attention, a novel approach that generalizes existing approximate attention mechanisms to reduce the memory footprint of self-attention in multi-headed attention (MHA) architectures. The authors critique current methods like Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA) for their unconventional low-rank approximations and propose a more interpretable framework for understanding attention weight representations. Tucker Attention employs a low-rank Tucker decomposition of the attention weights, allowing for a significant reduction in the number of parameters while maintaining competitive performance metrics. The method is shown to encompass existing approaches (GQA, MLA, MHA) as special cases and is compatible with advanced techniques like flash-attention and rotary position embeddings (RoPE). The authors provide insights into the actual ranks achieved by these attention mechanisms and suggest simplifications for MLA, ultimately demonstrating that Tucker Attention can achieve comparable validation metrics with an order of magnitude fewer parameters in large language models (LLMs) and vision transformers (ViTs).
Methodology
The authors propose a generalized view of attention weights by analyzing pre-softmax and post-softmax attention weights as standalone tensor objects. They utilize a low-rank Tucker decomposition to represent these tensors, allowing for parameter-efficient attention mechanisms. The methodology includes theoretical insights into the ranks of existing attention mechanisms and their approximations.
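For intuition, the sketch below computes a generic truncated higher-order SVD, one standard way to obtain a Tucker approximation, for a stack of toy attention weight matrices and compares parameter counts. The tensor shape and ranks are illustrative, and this is not the paper's exact parameterization.

```python
import numpy as np

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: return the core tensor and per-mode factor matrices."""
    factors = []
    for mode, r in enumerate(ranks):
        unfolded = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):       # contract each mode with U^T
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

def tucker_reconstruct(core, factors):
    T = core
    for mode, U in enumerate(factors):
        T = np.moveaxis(np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T

# Toy "attention weight" tensor: (heads, d_model, d_head), e.g. stacked query projections.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 512, 64))
core, factors = tucker_hosvd(W, ranks=(4, 128, 32))

full_params = W.size
tucker_params = core.size + sum(U.size for U in factors)
err = np.linalg.norm(W - tucker_reconstruct(core, factors)) / np.linalg.norm(W)
print(f"params: {full_params} -> {tucker_params}, relative error {err:.3f}")
```

For a random tensor the approximation error is large; the paper's point is that trained attention weights exhibit low-rank structure, so such a factorization can preserve performance while cutting parameters.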
Results
Tucker Attention demonstrates an order of magnitude reduction in parameters compared to GQA and MLA while achieving comparable validation metrics in large language models and vision transformers. The analysis reveals that MHA is the most efficient factorization under the assumption of full rank, while the proposed Tucker Attention effectively captures low-rank structures.
Implications
The findings suggest that Tucker Attention could lead to more efficient transformer architectures, reducing memory costs during training and inference. This has potential applications in deploying large models in resource-constrained environments, enhancing the scalability of transformer-based systems.
A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management
Graph Learning
Theory
Efficient ML
- Introduces a separation theory for entity resolution using MPNNs on typed entity-attribute graphs.
- Establishes necessary and sufficient conditions for various entity resolution tasks.
- Demonstrates a complexity gap between detecting single and multiple shared attributes.
- Proposes a minimal-architecture principle for selecting MPNN adaptations based on task requirements.
Read more
A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management
Summary
This paper addresses the challenge of entity resolution, which involves identifying database records that refer to the same real-world entity, by modeling it on bipartite graphs that connect entity nodes to their attribute values. The author critiques the common approach of using message-passing neural networks (MPNNs) with all architectural extensions, which can lead to unnecessary computational overhead. The central question posed is about identifying the most efficient MPNN architecture for specific matching criteria. The author develops a four-theorem separation theory on typed entity-attribute graphs, introducing co-reference predicates that capture the pattern of shared attribute values between entities. The paper establishes both lower and upper bounds for these predicates, demonstrating that certain adaptations, such as ego IDs, are necessary for tasks like cycle detection and verifying multiple shared attributes. The findings reveal a significant complexity gap between detecting any shared attribute and multiple shared attributes, with the latter requiring more complex architectures. The results provide a minimal-architecture principle, allowing practitioners to select the most efficient adaptation set for their specific needs, supported by computational validation of the theoretical predictions.
Methodology
The author formulates a separation theory consisting of four theorems that analyze typed entity-attribute graphs. This involves defining co-reference predicates and cycle participation predicates, proving both necessity and sufficiency results for various adaptations of MPNNs. The methodology includes constructing graph pairs to demonstrate indistinguishability and developing explicit MPNNs of minimal depth for accurate computation.
Results
The study finds that detecting any shared attribute can be achieved with a simpler architecture, while detecting multiple shared attributes requires more complex adaptations, specifically ego IDs and deeper MPNNs. The results are summarized in a table that outlines the minimal architecture required for different predicates and graph classes, confirming that no simpler architecture suffices for the specified tasks.
Implications
The findings have significant implications for practitioners in data integration and master data management, as they provide guidelines for selecting the most efficient MPNN architectures tailored to specific entity resolution tasks, potentially improving computational efficiency and accuracy in large-scale databases.
ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
NLP
Large Language Models
Reinforcement Learning
- ERPO improves reasoning in LLMs by focusing on token-level dynamics instead of sequence-level advantages.
- Critical Decision Pivots (CDPs) are identified as crucial points where the model's reasoning is most sensitive to perturbations.
- The methodology includes Entropy-aware Gating, Bucket-based Implicit Normalization, and Result-anchored Advantage Synthesis.
- Extensive experiments show ERPO significantly enhances reasoning accuracy and produces more concise derivation paths compared to GRPO.
Read more
ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
Summary
The paper introduces Entropy-Regulated Policy Optimization (ERPO), a novel approach to enhance the reasoning capabilities of large language models (LLMs) through reinforcement learning from verifiable rewards (RLVR). Traditional methods like Group Relative Policy Optimization (GRPO) apply uniform advantages across sequences, which neglects the varying importance of tokens during reasoning processes. This oversight can lead to premature entropy collapse and the generation of redundant reasoning paths. The authors identify Critical Decision Pivots (CDPs) as key moments in reasoning where the model's trajectory is sensitive to changes, necessitating diverse exploration. ERPO addresses these challenges by focusing on token-level dynamics rather than sequence-level advantages. It incorporates three main components: Entropy-aware Gating to enhance exploration at CDPs, Bucket-based Implicit Normalization to reduce bias, and Result-anchored Advantage Synthesis to re-weight token signals based on outcomes. Experiments on mathematical benchmarks demonstrate that ERPO significantly outperforms GRPO, improving both reasoning accuracy and the quality of derivation paths, thus setting a new standard for efficiency and accuracy in large reasoning models.
Methodology
The authors conducted a systematic empirical analysis to identify Critical Decision Pivots (CDPs) and their impact on reasoning accuracy. They proposed ERPO, which integrates three components to refine credit assignment in reasoning tasks, focusing on token-level signals and enhancing exploration at critical moments.
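One plausible reading of the entropy-aware gating component is sketched below: per-token advantages are up-weighted wherever predictive entropy exceeds a threshold. The gating rule and hyperparameters are assumptions for illustration, not the paper's formulation.

```python
import torch

def entropy_gated_advantages(logits, seq_advantage, tau=2.0, boost=1.5):
    """Up-weight the learning signal at high-entropy tokens (candidate decision pivots).

    logits: (T, V) per-token logits; seq_advantage: scalar sequence-level advantage.
    tau and boost are illustrative hyperparameters, not values from the paper.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)   # (T,)
    gate = torch.where(entropy > tau,
                       torch.full_like(entropy, boost),
                       torch.ones_like(entropy))
    return seq_advantage * gate                                       # (T,) token-level advantages

logits = torch.randn(32, 50_000)
print(entropy_gated_advantages(logits, seq_advantage=torch.tensor(0.7)).shape)
```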
Results
ERPO was tested on competitive mathematical benchmarks, showing significant improvements over GRPO in reasoning accuracy and the robustness of derivation paths. The authors' perturbation analysis also showed a 35.2% drop in accuracy when high-entropy tokens were perturbed, underscoring the importance of these tokens in the reasoning process.
Implications
The findings from this research could lead to more effective training methods for large language models, particularly in complex reasoning tasks. By improving the understanding of token dynamics and their impact on model performance, ERPO may enhance applications in areas requiring high-level reasoning, such as automated theorem proving and advanced problem-solving.
Derived Fields Preserve Fine-Scale Detail in Budgeted Neural Simulators
Optimization
Theory
Efficient ML
- Introduces Derived-Field Optimization (DerivOpt) for state design in neural simulators.
- Demonstrates that primitive and derived fields retain detail differently under fixed storage budgets.
- Shows that fine-scale fidelity can be significantly improved by optimizing the choice of carried fields.
- Empirical results indicate that carried-state design is a critical factor in neural simulation performance.
Read more
Derived Fields Preserve Fine-Scale Detail in Budgeted Neural Simulators
Summary
This paper addresses the challenge of preserving fine-scale details in neural simulations of time-dependent partial differential equations (PDEs) under fixed storage budgets. The authors introduce Derived-Field Optimization (DerivOpt), a novel framework that optimally selects which physical fields to carry and how to allocate storage across them. The study reveals that primitive and derived fields exhibit different levels of detail retention when subjected to the same operator, indicating that the choice of carried state is crucial for maintaining fine-scale fidelity. Through empirical evaluation across various PDE benchmarks, DerivOpt demonstrates significant improvements in both mean rollout normalized root mean square error (nRMSE) and fine-scale fidelity compared to existing methods. The findings suggest that carried-state design should be prioritized alongside architecture and training strategies in budget-constrained neural simulations.
Methodology
The authors analyze the retention of fine-scale details in primitive and derived fields using the periodic incompressible Navier-Stokes equations as a testbed. They develop the DerivOpt framework, which evaluates various combinations of physical fields and storage allocations through a closed-form design score. The framework is empirically validated across a comprehensive set of PDE benchmarks, including advection, Burgers, and diffusion problems.
Results
DerivOpt significantly outperforms traditional primitive-only methods, yielding the best pooled mean rollout nRMSE and enhancing fine-scale fidelity across various PDE scenarios. The improvements are evident even before the rollout learning phase begins, highlighting the importance of the carried state in the simulation process.
Implications
The findings imply that optimizing the carried state in neural simulations can lead to better performance in applications requiring high fidelity in fine-scale structures, such as fluid dynamics and other engineering simulations. This approach could influence future designs of neural simulators and their training methodologies.
Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification
Graph Learning
Efficient ML
- Introduces a multi-granularity granular-ball graph coarsening algorithm.
- Achieves linear time complexity in the graph coarsening process.
- Enhances training efficiency and scalability of GCNs for large-scale datasets.
- Demonstrates superior performance in node classification tasks compared to existing methods.
Read more
Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification
Summary
This paper presents a novel framework called the Efficient and Scalable Granular-ball Graph Coarsening Method (GB-CGNN) aimed at improving the efficiency and scalability of Graph Convolutional Networks (GCNs) for large-scale graph node classification tasks. Traditional GCNs face significant computational challenges when dealing with large datasets, particularly due to the exponential growth of neighborhood expansion with increasing convolutional layers. While existing methods such as sampling and graph coarsening have been developed to mitigate these issues, they often overlook multi-granularity information and suffer from high time complexity. The proposed method introduces a multi-granularity granular-ball graph coarsening algorithm that coarsens the original graph into subgraphs with linear time complexity, significantly reducing the scale of the graph. These subgraphs are then randomly sampled to create minibatches for GCN training. Experimental results demonstrate that the GB-CGNN framework outperforms existing methods in terms of training efficiency and classification accuracy across multiple datasets, showcasing its potential for large-scale applications.
Methodology
The methodology involves a two-stage process: first, the use of the METIS algorithm to generate initial granular-balls during the coarse-grained initialization stage, followed by an adaptive splitting mechanism to refine these granular-balls into high-quality supernodes. The GCN is then trained using stochastic gradient descent (SGD) on the coarsened graph structure.
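The coarsening step itself can be written compactly. The sketch below assumes a node-to-supernode assignment is already available (for example from METIS or the adaptive splitting stage) and builds the coarse adjacency and averaged features with sparse matrix products.

```python
import numpy as np
import scipy.sparse as sp

def coarsen(adj, features, assignment):
    """Collapse nodes into supernodes: A_c = P^T A P, X_c = mean of member features."""
    n, k = adj.shape[0], assignment.max() + 1
    P = sp.csr_matrix((np.ones(n), (np.arange(n), assignment)), shape=(n, k))
    adj_coarse = (P.T @ adj @ P).tocsr()          # edge weight = number of edges between clusters
    sizes = np.asarray(P.sum(axis=0)).ravel()
    feat_coarse = (P.T @ features) / sizes[:, None]
    return adj_coarse, feat_coarse

# Toy graph: 6 nodes with an (assumed precomputed) partition into two supernodes.
rows, cols = [0, 1, 2, 3, 4, 0], [1, 2, 0, 4, 5, 3]
adj = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))
adj = adj + adj.T
X = np.random.randn(6, 8)
assignment = np.array([0, 0, 0, 1, 1, 1])
A_c, X_c = coarsen(adj, X, assignment)
print(A_c.toarray(), X_c.shape)
```

The GCN is then trained on the coarse graph (or minibatches of supernode subgraphs), which is where the efficiency gain comes from.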
Results
The experimental results indicate that the proposed GB-CGNN method significantly improves the training efficiency and classification accuracy of GCNs on large-scale graph datasets, outperforming existing sampling and coarsening techniques.
Implications
The findings suggest that the GB-CGNN framework can be effectively applied to various large-scale graph-related tasks, including social network analysis, recommendation systems, and other domains requiring efficient graph processing.
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
Large Language Models
Efficient ML
- HISA provides a two-stage hierarchical indexing approach that significantly reduces indexing complexity.
- The method achieves 2-4× speedup in kernel-level benchmarks without compromising selection quality.
- HISA is a drop-in replacement for existing indexers, requiring no retraining or architectural modifications.
- Empirical results show that HISA closely matches the performance of traditional sparse attention mechanisms.
Read more
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
Summary
The paper introduces HISA (Hierarchical Indexed Sparse Attention), a novel indexing method designed to optimize the token-level sparse attention mechanism used in large language models (LLMs). Traditional sparse attention approaches, such as DeepSeek Sparse Attention (DSA), face a significant bottleneck due to the quadratic complexity of indexing, which becomes prohibitive as context lengths increase. HISA addresses this issue by implementing a two-stage hierarchical search process. The first stage involves a block-level coarse filter that scores pooled block representatives to eliminate irrelevant regions, while the second stage refines the selection by scoring tokens only within the surviving candidate blocks. This method maintains the original token-level top-k selection pattern without requiring additional training or architectural changes. The authors demonstrate that HISA achieves substantial speed improvements (2× at 32K context and 4× at 128K) while preserving selection quality, as evidenced by a mean Intersection over Union (IoU) greater than 99% compared to the original DSA. The paper validates HISA's effectiveness through kernel-level benchmarks and real-world tasks, showcasing its potential to enhance the efficiency of LLMs operating over long contexts.
Methodology
HISA employs a two-stage hierarchical indexing strategy: first, it uses a block-level coarse filter to score pooled block representatives and prune irrelevant regions; second, it refines the selection by scoring tokens only within the surviving blocks using the original DSA scoring mechanism. This reduces the per-query indexing cost from O(L) to O(L/B + mB), where L is the prefix length, B is the block size, and m is the number of retained blocks.
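The two-stage selection pattern can be sketched directly; the block size, number of retained blocks, and mean pooling below are illustrative choices, not the kernel-level implementation.

```python
import torch

def hierarchical_topk(q, K, block_size=64, num_blocks=4, k=128):
    """Two-stage selection: coarse block filter, then exact scoring inside kept blocks."""
    L, d = K.shape
    nb = L // block_size
    blocks = K[: nb * block_size].view(nb, block_size, d)

    # Stage 1: score pooled block representatives and keep the top blocks.
    block_repr = blocks.mean(dim=1)                       # (nb, d)
    block_scores = block_repr @ q                         # (nb,)
    keep = block_scores.topk(min(num_blocks, nb)).indices

    # Stage 2: exact token scores only inside the surviving blocks.
    cand_tokens = (keep[:, None] * block_size
                   + torch.arange(block_size)).reshape(-1)
    token_scores = K[cand_tokens] @ q
    top = token_scores.topk(min(k, len(cand_tokens))).indices
    return cand_tokens[top]                               # indices into the full prefix

q, K = torch.randn(128), torch.randn(32_768, 128)
print(hierarchical_topk(q, K).shape)
```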
Results
HISA demonstrates a 2× speedup at 32K context lengths and a 4× speedup at 128K context lengths in kernel-level benchmarks. Additionally, it maintains a mean IoU greater than 99% with the original DSA, indicating that the token selection quality is preserved. The method was validated on Needle-in-a-Haystack and LongBench benchmarks, showing comparable performance to traditional sparse attention methods.
Implications
The development of HISA has significant implications for the efficiency of large language models, particularly as they scale to handle longer context lengths. By optimizing the indexing process, HISA can reduce computational costs and improve response times, making it suitable for applications requiring real-time processing of extensive textual data.
Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL
Reinforcement Learning
Robotics
Multimodal
- ROVED combines vision-language embeddings with targeted oracle feedback for efficient preference-based reinforcement learning.
- The framework reduces oracle queries by up to 80% while maintaining performance comparable to oracle-only methods.
- A parameter-efficient fine-tuning method enhances the quality of VLE-generated preferences.
- The adapted VLE demonstrates strong cross-task generalization, yielding up to 90% cumulative annotation savings.
Read more
Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL
Summary
This paper introduces ROVED, a novel framework aimed at enhancing preference-based reinforcement learning (PbRL) by integrating lightweight vision-language embeddings (VLE) with selective oracle feedback. The authors highlight the challenges of obtaining high-quality oracle feedback, which is often costly and time-consuming, thus limiting the scalability of PbRL methods. ROVED addresses this by using VLEs to generate segment-level preferences while relying on oracle feedback only for uncertain cases, identified through a filtering mechanism. This hybrid approach not only reduces the dependency on extensive oracle feedback but also improves the VLE's performance over time through a parameter-efficient fine-tuning method. The framework was evaluated on robotic manipulation tasks, demonstrating that ROVED can match or exceed the performance of traditional oracle-only methods while significantly reducing the number of required oracle queries by up to 80%. Additionally, the fine-tuned VLE showed strong generalization capabilities across tasks, leading to cumulative annotation savings of up to 90%. Overall, ROVED presents a promising direction for efficient and practical PbRL by combining the scalability of VLEs with the precision of selective oracle supervision.
Methodology
ROVED employs a hybrid strategy where vision-language embeddings generate segment-level preferences, and oracle feedback is solicited only for uncertain cases. It incorporates a parameter-efficient fine-tuning scheme that combines unsupervised dynamics-aware objectives with sparse oracle feedback, alongside a confidence-aware training mechanism to minimize unnecessary oracle queries.
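A minimal sketch of the gating idea, assuming precomputed embeddings for the two trajectory segments and the task description: prefer the segment whose embedding is closer to the task, and fall back to the oracle when the margin is small. The cosine similarity measure and margin are placeholders, not the paper's criterion.

```python
import numpy as np

def vle_or_oracle_preference(emb_a, emb_b, emb_task, oracle, margin=0.05):
    """Return (preference, used_oracle): 0 prefers segment A, 1 prefers segment B."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    score_a, score_b = cos(emb_a, emb_task), cos(emb_b, emb_task)
    if abs(score_a - score_b) < margin:          # uncertain case: defer to the oracle
        return oracle(), True
    return (0 if score_a > score_b else 1), False

# Toy usage with random embeddings and a dummy oracle (all placeholders).
rng = np.random.default_rng(0)
pref, asked = vle_or_oracle_preference(rng.normal(size=512), rng.normal(size=512),
                                       rng.normal(size=512), oracle=lambda: 1)
print(pref, asked)
```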
Results
The evaluation of ROVED on robotic manipulation tasks from Meta-World showed that it achieves oracle-level performance while reducing annotation costs by 50-80%. The fine-tuned VLE exhibited generalization across related tasks, resulting in cumulative annotation savings of 75-90%.
Implications
The findings suggest that combining scalable vision-language embeddings with selective oracle feedback can significantly enhance the efficiency and practicality of preference-based reinforcement learning, making it more accessible for real-world applications in robotics and beyond.
Deep Learning-Based Anomaly Detection in Spacecraft Telemetry on Edge Devices
Time Series
Optimization
Efficient ML
- Three deep learning approaches for anomaly detection in spacecraft telemetry were evaluated.
- The forecasting & threshold method outperformed other approaches with a CEF0.5 of 92.7%.
- Neural architecture optimization significantly reduced model size and computational requirements.
- Optimized models can operate within the stringent constraints of space-grade hardware.
Read more
Deep Learning-Based Anomaly Detection in Spacecraft Telemetry on Edge Devices
Summary
This paper addresses the critical need for anomaly detection in spacecraft telemetry, which is essential for mission safety. The authors explore three deep learning approaches: forecasting & threshold, direct classification, and image classification, optimizing them for deployment on edge devices with limited computational resources. Using the European Space Agency Anomaly Dataset, the study demonstrates that the forecasting & threshold method achieves the highest detection performance with a Corrected Event-wise F0.5-score (CEF0.5) of 92.7%. Through multi-objective neural architecture optimization, the authors significantly reduce the computational requirements of the models while maintaining high accuracy. The optimized forecasting & threshold model retains 88.8% of its original detection performance while decreasing RAM usage by 97.1% to just 59 KB and operations by 99.4%. The findings indicate that on-board anomaly detection is feasible even on highly constrained hardware, enabling near-instantaneous detection and response to critical events, thus enhancing mission safety and operational efficiency.
Methodology
The authors employed three distinct deep learning approaches for anomaly detection: forecasting & threshold, direct classification, and image classification. They utilized multi-objective neural architecture optimization to systematically reduce model size and computational requirements while preserving detection accuracy, specifically targeting deployment on resource-constrained edge devices.
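The forecasting & threshold pattern is easy to sketch on synthetic telemetry: a small model predicts the next sample from a trailing window, and points whose forecast residual exceeds a calibrated threshold are flagged. The model, window length, and threshold rule below are illustrative, not the paper's optimized architecture.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic telemetry channel with an injected anomaly (placeholder data).
rng = np.random.default_rng(0)
t = np.arange(2000)
signal = np.sin(2 * np.pi * t / 100) + 0.05 * rng.standard_normal(len(t))
signal[1500:1520] += 1.5                           # anomalous excursion

def windows(x, w):
    return np.stack([x[i:i + w] for i in range(len(x) - w)]), x[w:]

w = 32
X, y = windows(signal, w)
split = 1000
model = Ridge().fit(X[:split], y[:split])          # forecast next sample from the last w samples

residual = np.abs(model.predict(X[split:]) - y[split:])
threshold = residual[:200].mean() + 5 * residual[:200].std()   # calibrated on nominal data
anomalies = np.flatnonzero(residual > threshold) + split + w
print("flagged indices:", anomalies[:10])
```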
Results
The forecasting & threshold approach achieved a CEF0.5 score of 92.7%, outperforming the other methods. The optimized model maintained 88.8% of its original detection performance while reducing RAM usage by 97.1% to 59 KB and operations by 99.4%. The optimized models require only 0.36-6.25% of CubeSat RAM, making them viable for on-board deployment.
Implications
The research demonstrates that sophisticated anomaly detection systems can be effectively deployed on spacecraft with limited computational resources, allowing for real-time monitoring and response to anomalies. This capability is crucial for enhancing mission safety and operational efficiency in space missions.
Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance
Optimization
Efficient ML
- Introduction of µCUTLASS, a compact DSL for GPU kernel optimization.
- Implementation of Speed-of-Light (SOL) guidance to improve optimization efficiency.
- Demonstrated significant speedups in kernel performance over traditional methods.
- Reduction in token costs while maintaining high performance.
Read more
Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance
Summary
This paper addresses the inefficiencies in optimizing GPU kernels using Large Language Model (LLM) agents by proposing two key design principles: a compact domain-specific language (DSL) and Speed-of-Light (SOL) guidance. The authors observe that the abstraction level at which agents operate significantly affects their performance. If the abstraction is too low, agents waste time on trivial details; if too high, they may overlook critical optimization opportunities. To tackle these issues, the authors introduce µCUTLASS, a DSL that allows LLMs to reason at a higher level while maintaining essential optimization levers. Additionally, SOL guidance provides performance bounds to help agents prioritize optimization efforts effectively. The implementation of these principles leads to significant improvements in kernel optimization efficiency, as demonstrated through experiments on 59 KernelBench problems, achieving speedups of up to 1.68× compared to traditional methods. The proposed approach not only enhances the performance of weaker models but also reduces the token cost of optimization, making it a promising solution for the evolving landscape of GPU kernel development.
Methodology
The authors developed a domain-specific language (µCUTLASS) that abstracts GPU kernel optimization tasks, allowing LLMs to focus on high-level decisions. They also implemented SOL guidance to provide performance estimates and steer the optimization process, thereby improving the efficiency of the search for optimal kernel configurations.
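A speed-of-light bound in the roofline sense can be computed from a few hardware numbers; the sketch below does this for a GEMM with placeholder peak-throughput and bandwidth figures. This is the kind of bound an agent could use to decide when further optimization is unlikely to pay off.

```python
def speed_of_light_us(m, n, k, peak_tflops=300.0, mem_bw_gbs=2000.0, bytes_per_el=2):
    """Lower bound on GEMM runtime: max of compute-bound and memory-bound times."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)   # read A, B; write C (ideal reuse)
    t_compute = flops / (peak_tflops * 1e12)
    t_memory = bytes_moved / (mem_bw_gbs * 1e9)
    return max(t_compute, t_memory) * 1e6                  # microseconds

sol = speed_of_light_us(4096, 4096, 4096)
measured = 600.0                                           # hypothetical measured kernel time (us)
print(f"SOL = {sol:.1f} us, kernel at {sol / measured:.0%} of speed of light")
# An agent can stop once the kernel reaches, say, 90% of the SOL bound (sol / measured >= 0.9).
```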
Results
The experiments showed that switching from low-level code generation to DSL code using GPT-5-mini resulted in a 1.27× speedup over PyTorch, which increased to 1.56× with SOL guidance. The approach allowed weaker models to outperform stronger baseline agents while saving 19–43% of tokens, achieving a maximum efficiency gain of 1.68×.
Implications
The findings suggest that using a DSL and performance-guided optimization can significantly enhance the efficiency of GPU kernel development, making it more accessible for developers and potentially accelerating the adoption of advanced GPU features in machine learning applications.
From Astronomy to Astrology: Testing the Illusion of Zodiac-Based Personality Prediction with Machine Learning
Theory
- Astrology lacks a credible causal mechanism and has not demonstrated predictive validity.
- A synthetic dataset was created to test zodiac-based personality predictions using machine learning.
- Machine learning classifiers showed performance indistinguishable from random chance.
- The success of astrology is attributed to cognitive biases and the overlap of common personality traits.
Read more
From Astronomy to Astrology: Testing the Illusion of Zodiac-Based Personality Prediction with Machine Learning
Summary
This paper investigates the validity of zodiac-based personality predictions using machine learning techniques. Despite astrology's cultural significance and its influence on social decisions, the authors argue that it lacks a scientifically valid predictive foundation. They create a synthetic dataset where individuals are assigned zodiac signs and personality traits from a pool of 100 common descriptors. The study employs Logistic Regression, Random Forest, and neural network classifiers to analyze whether zodiac signs can reliably predict personality traits. The findings reveal that the predictive performance of these models is no better than random chance, suggesting that the perceived success of astrology is due to cognitive biases and the universality of personality traits rather than any actual predictive capability. The paper concludes that zodiac systems serve more as narrative frameworks than as reliable predictors of human behavior, highlighting the psychological and social needs they fulfill despite their lack of empirical support.
Methodology
The authors constructed a synthetic dataset assigning zodiac signs to individuals along with personality traits drawn from a pool of 100 descriptors. They trained various classifiers (Logistic Regression, Random Forest, neural networks) to predict personality traits based on zodiac features and nuisance covariates, comparing results to shuffled-label controls.
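The shuffled-label control pattern looks roughly like the sketch below: one-hot zodiac features, traits drawn independently of sign, and cross-validated logistic regression compared against a permuted-label baseline. The dataset here is a toy stand-in, not the paper's descriptor pool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
zodiac = rng.integers(0, 12, size=n)                 # 12 signs, uniformly assigned
trait = rng.integers(0, 10, size=n)                  # traits drawn independently of sign

X = np.eye(12)[zodiac]                               # one-hot encoding of the sign
clf = LogisticRegression(max_iter=1000)

acc_real = cross_val_score(clf, X, trait, cv=5).mean()
acc_shuffled = cross_val_score(clf, X, rng.permutation(trait), cv=5).mean()
print(f"real labels: {acc_real:.3f}, shuffled labels: {acc_shuffled:.3f}, chance = {1/10:.3f}")
```

When the real-label accuracy matches the shuffled-label accuracy and chance level, the features carry no predictive signal, which is the paper's reported outcome.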
Results
The classifiers consistently performed at or near random expectation, indicating that zodiac signs do not encode any predictive signal regarding personality traits. Shuffled-label controls yielded similar accuracies, reinforcing the conclusion that astrology's perceived effectiveness is illusory.
Implications
The findings suggest that zodiac-based personality assessments do not provide reliable insights into human behavior, emphasizing the need for critical evaluation of such systems in cultural contexts. This could influence how astrology is perceived and utilized in social decision-making.
ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control
Large Language Models
NLP
Generative Models
- ATLAS-RTC intercepts token generation at the logit level, allowing for real-time corrections.
- The system employs a graduated intervention policy to ensure structured outputs without modifying model weights.
- Significant improvements in JSON schema satisfaction (from 56.7% to 76.7%) and tool call reliability (from 28.3% to 58.3%) were achieved.
- The paper provides an honest characterization of failure modes and conditions under which ATLAS-RTC may degrade.
Read more
ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control
Summary
The paper introduces ATLAS-RTC, a novel token-level runtime controller designed to enhance the output quality of large language model (LLM) agents by addressing issues that arise during the generation of structured outputs, such as malformed tool calls and JSON schemas. Traditional methods of output governance, including prompt engineering and post-hoc validation, fail to intercept errors during generation, leading to increased latency and costs due to retry loops. ATLAS-RTC operates at the logit distribution level, allowing it to intervene in real-time during the token generation process. It employs a graduated intervention strategy that includes logit biasing, temperature modulation, token masking, and mid-step rollbacks, all while maintaining the integrity of the model's weights. The system is evaluated in two scenarios: one focusing on JSON schema satisfaction under ambiguous prompts and another on the reliability of agent tool calls. Results show significant improvements in both areas, demonstrating ATLAS-RTC's effectiveness in ensuring well-formed outputs and reducing the likelihood of failures. The paper also discusses the limitations of the system, particularly in handling complex schemas and the added latency in certain scenarios, providing a comprehensive view of where runtime control is beneficial.
Methodology
The methodology involves developing a token-level runtime controller that operates at the logit distribution during the generation process. It utilizes a formal output contract specification and a graduated intervention ladder to manage structural drift and ensure compliance with output contracts. The system is evaluated through empirical testing in two specific scenarios to assess its performance and reliability.
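The lowest rung of such an intervention ladder, masking structurally invalid tokens and biasing a corrective one at the logit level, can be sketched as follows; the toy vocabulary and allowed set are illustrative, not the system's contract specification.

```python
import torch

def constrain_logits(logits, allowed_ids, bias_id=None, bias=5.0):
    """Mask disallowed tokens and optionally bias a preferred corrective token."""
    masked = torch.full_like(logits, float("-inf"))
    masked[allowed_ids] = logits[allowed_ids]
    if bias_id is not None:
        masked[bias_id] = masked[bias_id] + bias
    return masked

# Toy example: at this position only '{', '"' or whitespace keeps the JSON well-formed.
vocab = {"{": 0, '"': 1, " ": 2, "hello": 3, "}": 4}
logits = torch.tensor([0.2, 0.1, 0.0, 3.0, 1.0])       # model favors the invalid token "hello"
allowed = torch.tensor([vocab["{"], vocab['"'], vocab[" "]])
probs = torch.softmax(constrain_logits(logits, allowed, bias_id=vocab["{"]), dim=-1)
print(probs)
```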
Results
ATLAS-RTC improved first-attempt JSON schema satisfaction from 56.7% to 76.7%, with a peak improvement of 40 percentage points on the hardest schema task. Additionally, it enhanced agent tool call reliability from 28.3% to 58.3%, with one tool recovering from 0% to 90% success rate. The paper also identifies specific failure modes and conditions that affect performance.
Implications
The implications of this work extend to the deployment of LLMs in production environments, where ensuring the correctness of structured outputs is critical. ATLAS-RTC can significantly reduce the costs associated with retries and failures, making LLMs more reliable for applications requiring structured data generation, such as API interactions and automated reporting.
Key-Embedded Privacy for Decentralized AI in Biomedical Omics
Federated Learning
- Introduction of INFL, a lightweight federated learning method for biomedical applications.
- Integration of a secret key into the model architecture to enhance privacy.
- Demonstrated effectiveness across diverse biomedical omics tasks.
- Maintains model utility while ensuring strong privacy controls.
Read more
Key-Embedded Privacy for Decentralized AI in Biomedical Omics
Summary
This paper addresses the pressing need for privacy-preserving methods in decentralized AI applications within biomedical omics, where data privacy concerns hinder the sharing of sensitive data necessary for effective AI model training. The authors propose a novel lightweight federated learning method named INFL (Implicit Neural Representations Federated Learning), which integrates a secret key directly into the model architecture to enhance privacy while allowing for seamless aggregation of model updates across heterogeneous data sources. This approach is designed to overcome the limitations of traditional cryptographic methods and differential privacy, which often incur significant overhead or degrade model performance. The authors validate INFL across various biomedical tasks, including classification in bulk proteomics, regression in single-cell transcriptomics, and clustering in multi-omics datasets. The results indicate that INFL not only maintains strong privacy controls but also achieves high utility, making it suitable for real-world biomedical applications.
Methodology
The authors developed INFL, which employs Implicit Neural Representations and incorporates plug-and-play, coordinate-conditioned modules into client models. This architecture embeds a secret key, facilitating secure model updates without transferring raw data, thus preserving data sovereignty and compliance with regulations.
Results
INFL was tested on various biomedical omics tasks, showing strong performance in cohort-scale classification, regression for perturbation prediction, and clustering. The method successfully balanced privacy and utility, outperforming traditional privacy-preserving techniques in terms of model accuracy and efficiency.
Implications
The proposed method has significant implications for advancing decentralized AI in biomedical research, enabling cross-institutional collaborations without compromising data privacy. This could lead to more robust and representative AI models in clinical settings, ultimately enhancing patient care and research outcomes.
The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training
Theory
- The Spectral Edge Thesis provides a new mathematical framework for understanding phase transitions in neural network training.
- Empirical evidence shows that gap dynamics in the Gram matrix are crucial for predicting grokking events.
- The framework is architecture-agnostic and relies on NTK eigenvalues and Hessian curvatures.
- The study confirms 19 out of 20 predictions, highlighting the robustness of the proposed framework.
Read more
The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training
Summary
This paper introduces the Spectral Edge Thesis, a mathematical framework that elucidates the dynamics of phase transitions during neural network training. The author empirically demonstrates that gap dynamics in the rolling-window Gram matrix are precursors to grokking events, with a notable distinction between scenarios with and without weight decay. The framework is validated across various model families, confirming 19 out of 20 quantitative predictions. The study reveals that the number of simultaneously active modes is small and varies with the optimizer used. The framework is architecture-agnostic, relying on NTK eigenvalues and Hessian curvatures to characterize phase transitions and feature circuit formation. Theoretical results derived from three axioms include the characterization of the gap position, dynamics of gap changes, and a coupled system of ordinary differential equations governing signal strengths. The findings align with existing theories in the field, suggesting that the spectral gap of the Gram matrix is a critical object for understanding neural network behavior during training.
Methodology
The author develops a mathematical framework based on three axioms, deriving key theoretical results related to gap dynamics and phase transitions. Empirical validation is conducted across various neural network models, analyzing the rolling-window Gram matrix and its spectral properties.
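The central diagnostic can be sketched generically: build a Gram matrix over a rolling window of per-step vectors (for instance flattened gradients or logits) and track the gap between its leading eigenvalues. The shapes and the choice of vector below are assumptions for illustration.

```python
import numpy as np

def rolling_gram_gap(vectors, window=64):
    """For each step, the top-eigenvalue gap of the Gram matrix over the trailing window."""
    gaps = []
    for t in range(window, len(vectors)):
        V = vectors[t - window:t]                  # (window, dim)
        G = V @ V.T / V.shape[1]                   # Gram matrix of recent steps
        eig = np.sort(np.linalg.eigvalsh(G))[::-1]
        gaps.append(eig[0] - eig[1])               # gap between the leading eigenvalues
    return np.array(gaps)

rng = np.random.default_rng(0)
vectors = rng.standard_normal((512, 256))          # stand-in for per-step training signals
print(rolling_gram_gap(vectors)[:5])
```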
Results
The study confirms that gap dynamics precede grokking events in all tested scenarios with weight decay, while no such events were observed without it. The number of active modes during training is found to be small and optimizer-dependent. The framework's predictions align closely with empirical observations, reinforcing its validity.
Implications
The findings suggest that understanding the spectral gap of the Gram matrix can lead to better insights into neural network training dynamics, potentially guiding the design of more effective training strategies and architectures. This could enhance feature learning and improve model performance in various applications.
Causality-inspired Federated Learning for Dynamic Spatio-Temporal Graphs
Graph Learning
Federated Learning
Time Series
- SC-FSGL addresses the challenges of representation entanglement and negative transfer in Federated Learning for dynamic graphs.
- The framework introduces a Conditional Separation Module to decouple transferable causal knowledge from client-specific noise.
- A Causal Codebook is utilized to promote knowledge sharing and consistency across clients through contrastive learning.
- Experiments show significant performance improvements over existing state-of-the-art methods on various datasets.
Read more
Causality-inspired Federated Learning for Dynamic Spatio-Temporal Graphs
Summary
The paper presents a novel framework, SC-FSGL, for Federated Learning over Dynamic Spatio-Temporal Graphs (FSTGs), addressing the limitations of existing methods that primarily focus on static graphs. The authors argue that traditional Federated Graph Learning (FGL) approaches fail to account for the spatial and temporal heterogeneity present in real-world data, leading to issues such as representation entanglement and negative transfer. To overcome these challenges, SC-FSGL introduces a Conditional Separation Module that utilizes client-conditioned masks to disentangle transferable causal knowledge from client-specific noise. Additionally, a Causal Codebook is proposed to cluster causal prototypes and align local representations through contrastive learning, enhancing cross-client consistency. Experimental results on five diverse datasets demonstrate that SC-FSGL significantly outperforms state-of-the-art methods, showcasing its effectiveness in improving generalization performance in federated settings.
Methodology
The proposed SC-FSGL framework employs a Conditional Separation Module that simulates soft interventions using client-conditioned masks to extract invariant spatio-temporal causal factors. It also incorporates a Causal Codebook for clustering causal prototypes and aligning local representations via contrastive learning, facilitating knowledge sharing among clients with heterogeneous data.
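One plausible form of the separation idea, assuming a client-conditioned soft mask that splits a shared representation into a transferable part and a client-specific part, is sketched below; the architecture is illustrative, not the paper's module.

```python
import torch
import torch.nn as nn

class ConditionalSeparation(nn.Module):
    """Split a representation with a mask conditioned on a client embedding (illustrative)."""
    def __init__(self, dim, n_clients):
        super().__init__()
        self.client_emb = nn.Embedding(n_clients, dim)
        self.mask_net = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, h, client_id):
        c = self.client_emb(client_id)
        m = self.mask_net(torch.cat([h, c], dim=-1))   # soft mask in [0, 1]
        return m * h, (1 - m) * h                      # transferable part, client-specific part

h = torch.randn(4, 64)
causal, noise = ConditionalSeparation(64, n_clients=10)(h, torch.tensor([0, 1, 2, 3]))
print(causal.shape, noise.shape)
```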
Results
The experimental evaluation on five diverse Spatio-Temporal Graph datasets indicates that SC-FSGL outperforms existing methods, demonstrating enhanced generalization and reduced negative transfer in federated learning scenarios.
Implications
The findings suggest that incorporating causal reasoning into Federated Learning can significantly improve model performance in dynamic environments, making SC-FSGL applicable in various fields such as traffic forecasting, sensor networks, and mobility analytics where data privacy is a concern.
Mind the Gap: A Framework for Assessing Pitfalls in Multimodal Active Learning
Multimodal
- Identification of three critical pitfalls in multimodal active learning: missing modalities, modality imbalance, and varying interaction structures.
- Development of a benchmarking framework using synthetic datasets to isolate and analyze the effects of these pitfalls.
- Empirical comparison of unimodal and multimodal query strategies, demonstrating that existing methods do not adequately address the identified challenges.
- Findings indicate that models often rely on a single modality, leading to imbalanced representations.
Read more
Mind the Gap: A Framework for Assessing Pitfalls in Multimodal Active Learning
Summary
This paper addresses the challenges of active learning (AL) in multimodal settings, where models integrate information from diverse sources such as images, text, and tabular data. The authors identify three key pitfalls that complicate multimodal AL: missing modalities, modality imbalance, and varying interaction structures among modalities. To systematically evaluate these challenges, they introduce a benchmarking framework that utilizes synthetic datasets designed to isolate each pitfall, allowing for a clearer analysis of their effects on AL performance. The framework is validated with real-world datasets, revealing that existing unimodal and multimodal query strategies often fail to address these pitfalls effectively. The study finds that models tend to develop imbalanced representations, relying heavily on one modality while neglecting others, and that multimodal strategies do not consistently outperform unimodal ones. The authors emphasize the need for modality-aware query strategies to improve performance in multimodal AL settings.
Methodology
The authors developed a benchmarking framework that includes synthetic datasets designed to isolate the identified pitfalls in multimodal active learning. They conducted extensive empirical comparisons of unimodal and multimodal query strategies, both on synthetic and real-world datasets, to evaluate the performance of these strategies under the influence of the identified pitfalls.
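A toy construction in the spirit of such a benchmark, with each pitfall isolated on synthetic two-modality data, might look like the following; dimensions, missingness rates, and the interaction rule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
mod_a = rng.standard_normal((n, 10))     # e.g. image features
mod_b = rng.standard_normal((n, 10))     # e.g. tabular features

# Pitfall: modality imbalance, the label depends only on modality A.
y_imbalanced = (mod_a[:, 0] > 0).astype(int)

# Pitfall: interaction structure, the label needs both modalities (XOR-style rule).
y_interaction = ((mod_a[:, 0] > 0) ^ (mod_b[:, 0] > 0)).astype(int)

# Pitfall: missing modalities, modality B is absent for 30% of samples.
missing_b = rng.random(n) < 0.3
mod_b_observed = np.where(missing_b[:, None], np.nan, mod_b)

print(y_imbalanced.mean(), y_interaction.mean(), missing_b.mean())
```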
Results
The results indicate that all identified pitfalls significantly affect learning behavior, with models predominantly relying on one modality while largely ignoring others. Existing query methods were found to be ineffective in mitigating this issue, and multimodal query strategies did not consistently outperform unimodal strategies.
Implications
The findings highlight the limitations of current active learning methods in multimodal contexts and suggest a pressing need for the development of new, modality-aware query strategies. This research could have significant implications for fields that rely on multimodal data, such as biomedical applications, where effective labeling is crucial yet challenging.
Learning to Select Visual In-Context Demonstrations
Multimodal
Reinforcement Learning
Computer Vision
- Introduction of LSD, a framework that reformulates demonstration selection as a sequential decision-making problem.
- Utilization of a Dueling DQN agent to learn optimal demonstration sets that maximize MLLM performance.
- Identification of a task-dependent dichotomy in visual ICL, highlighting the effectiveness of kNN for subjective tasks and LSD for objective tasks.
- Comprehensive evaluation across five visual regression benchmarks demonstrating the superiority of the proposed method.
Read more
Learning to Select Visual In-Context Demonstrations
Summary
This paper addresses the challenge of selecting high-quality visual demonstrations for Multimodal Large Language Models (MLLMs) in the context of in-context learning (ICL). The authors identify that the traditional k-Nearest Neighbor (kNN) approach for demonstration selection is often suboptimal, particularly for complex factual regression tasks, as it tends to select redundant examples that do not adequately cover the task's output range. To overcome this limitation, the authors propose a novel framework called Learning to Select Demonstrations (LSD), which reframes demonstration selection as a sequential decision-making problem. They employ a Reinforcement Learning (RL) agent, specifically a Dueling DQN with a query-centric Transformer Decoder, to learn a policy that maximizes MLLM performance by balancing visual relevance and diversity in the selected demonstrations. The study evaluates LSD across five visual regression benchmarks and reveals a critical dichotomy: while kNN is effective for subjective tasks, LSD significantly outperforms it on objective tasks, demonstrating the necessity of learned selection strategies for optimal performance in visual ICL.
Methodology
The authors developed the LSD framework, which employs a Dueling DQN agent to sequentially select demonstrations for visual regression tasks. The agent uses a query-centric Transformer Decoder to learn a policy that optimally balances visual relevance and diversity, moving beyond traditional similarity-based selection methods.
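The dueling value/advantage decomposition named above is standard and can be sketched as follows; this is the generic head, not the paper's query-centric Transformer decoder.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Standard dueling decomposition: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                       nn.Linear(128, n_actions))

    def forward(self, features):
        v = self.value(features)                       # (B, 1)
        a = self.advantage(features)                   # (B, n_actions)
        return v + a - a.mean(dim=-1, keepdim=True)    # (B, n_actions)

# Features would come from the state encoder; here they are random placeholders.
q_values = DuelingHead(feat_dim=256, n_actions=50)(torch.randn(4, 256))
print(q_values.shape)
```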
Results
The evaluation of LSD across five benchmarks (UTKFace, AVA, SCUT-FBP5500, KonIQ-10k, and KADID-10k) showed that LSD significantly outperforms kNN on objective factual regression tasks, while kNN remains effective for subjective preference tasks. This highlights the necessity of learned selection strategies in achieving state-of-the-art performance in visual ICL.
Implications
The findings suggest that for effective visual ICL, especially in objective tasks, employing learned selection strategies can enhance model performance. This has potential applications in various domains requiring accurate visual regression and could influence future research on demonstration selection in multimodal learning contexts.
On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
Theory
- Long-range dependency in integer multiplication is a mirage, not an intrinsic property.
- Representing integers in a 2D grid allows multiplication to be performed with local operations.
- A neural cellular automaton with minimal parameters can achieve high generalization in multiplication tasks.
- Existing architectures like Transformers struggle with multiplication due to their reliance on 1D representations.
Read more
On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
Summary
This paper challenges the prevailing notion that integer multiplication is inherently difficult for neural networks due to long-range dependencies (LRD) caused by carry chains. The author argues that this difficulty is a mirage, resulting from the choice of computational spacetime rather than an intrinsic property of multiplication itself. By formalizing the concepts of computational spacetime and mirage, the paper demonstrates that when two n-bit binary integers are represented as a 2D outer-product grid, the long multiplication process can be reduced to local operations within a 3x3 neighborhood. A neural cellular automaton with only 321 parameters is shown to achieve perfect length generalization up to 683 times the training range, while five alternative architectures, including Transformers, fail under the same representation. The findings suggest that tasks diagnosed as requiring long-range dependency should be re-evaluated to determine if such dependencies are intrinsic or merely artifacts of the chosen computational framework.
Methodology
The author introduces the concepts of computational spacetime and mirage, providing a constructive proof through a 2D representation of integer multiplication. The performance of a neural cellular automaton is compared against traditional architectures like Transformers to demonstrate the effectiveness of the new representation.
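A worked sketch of the 2D representation: the outer product of the two bit vectors gives a grid of partial products, and long multiplication reduces to column sums followed by carry propagation. The carry loop below is written globally for brevity, whereas the paper argues the update can be kept local to a 3x3 neighborhood.

```python
import numpy as np

def multiply_via_grid(a: int, b: int, n_bits: int = 8) -> int:
    """Multiply two integers through the outer-product grid of their bits (LSB first)."""
    bits_a = np.array([(a >> i) & 1 for i in range(n_bits)])
    bits_b = np.array([(b >> i) & 1 for i in range(n_bits)])
    grid = np.outer(bits_a, bits_b)                      # partial products a_i * b_j

    # Column sums: cell (i, j) contributes to output bit i + j.
    col_sums = np.zeros(2 * n_bits, dtype=int)
    for i in range(n_bits):
        for j in range(n_bits):
            col_sums[i + j] += grid[i, j]

    # Carry propagation (global here; the paper localizes this step).
    result, carry = 0, 0
    for k in range(2 * n_bits):
        total = col_sums[k] + carry
        result |= (total & 1) << k
        carry = total >> 1
    return result

assert multiply_via_grid(13, 11) == 143
print(multiply_via_grid(201, 57), 201 * 57)
```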
Results
The neural cellular automaton achieved perfect accuracy in multiplication tasks with only 321 parameters and demonstrated length generalization up to 683 times the training range. In contrast, five alternative architectures, including Transformers, failed to perform effectively under the same representation.
Implications
The findings suggest that the design of neural network architectures should consider the representation of data and the computational framework to avoid misdiagnosing the nature of dependencies in tasks. This could lead to more effective models for problems traditionally viewed as requiring long-range dependencies.
ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning
Reinforcement Learning
Generative Models
Graph Learning
- ARCS generates complete analog circuit designs in milliseconds, significantly faster than traditional methods.
- Achieves 99.9% simulation validity with only 8 SPICE evaluations, a substantial reduction compared to genetic algorithms.
- Introduces Group Relative Policy Optimization (GRPO) to improve reinforcement learning for multi-topology circuit design.
- Utilizes grammar-constrained decoding to ensure 100% structural validity of generated circuits.
Read more
ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning
Summary
The paper introduces ARCS, a novel system for the rapid generation of analog circuits that produces complete, SPICE-simulatable designs in milliseconds, significantly faster than traditional search-based methods. ARCS employs a hybrid pipeline that integrates two learned generators—a graph Variational Autoencoder (VAE) and a flow-matching model—alongside SPICE-based ranking to achieve a remarkable 99.9% simulation validity with only 8 SPICE evaluations, which is 40 times fewer than genetic algorithms. The system utilizes a topology-aware Graph Transformer for single-model inference, achieving 85% simulation validity in just 97 milliseconds, which is over 600 times faster than random search. A key innovation is the adaptation of Group Relative Policy Optimization (GRPO) to address the limitations of the REINFORCE algorithm in multi-topology reinforcement learning, enhancing simulation validity by 9.6 percentage points with significantly fewer training steps. Additionally, grammar-constrained decoding ensures structural validity of the generated circuits. While ARCS does not yet match the quality of search-based optimization, its speed facilitates rapid prototyping and design-space exploration, demonstrating the potential for complementary use with genetic algorithms.
Methodology
ARCS employs a hybrid approach combining a graph VAE and a flow-matching model, enhanced by SPICE-based ranking. It utilizes a topology-aware Graph Transformer for inference and adapts GRPO for reinforcement learning to improve circuit design across multiple topologies. Grammar-constrained decoding is implemented to ensure structural validity.
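The group-relative advantage computation that GRPO is built around is simple to state; the sketch below shows the standard within-group standardization of rewards, not the paper's multi-topology adaptation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within a group of sampled candidates."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# E.g., six candidate circuits sampled for one spec, each scored by a SPICE-based reward.
print(group_relative_advantages([0.2, 0.9, 0.4, 0.0, 0.7, 0.5]).round(2))
```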
Results
ARCS achieves 99.9% simulation validity with only 8 SPICE evaluations and 85% simulation validity in 97 ms for single-model inference. The adaptation of GRPO leads to a 9.6 percentage point improvement in simulation validity over REINFORCE with 10 times fewer training steps. Although ARCS's per-design quality is lower than search-based methods, it recovers 96.6% of genetic algorithm quality with 49% fewer simulations.
Implications
The rapid generation capability of ARCS can significantly streamline the analog circuit design process, enabling faster prototyping and exploration of design spaces. Its integration with traditional optimization methods may enhance overall design quality and efficiency.
MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations
Time Series
Generative Models
- MR-CDM effectively handles variable-length time series without fixed input windows.
- The framework incorporates multi-scale trend decomposition to model temporal patterns at different resolutions.
- Experiments show MR-CDM significantly outperforms existing state-of-the-art models in forecasting accuracy.
- The proposed method enhances the robustness of forecasts across heterogeneous sequence lengths and temporal scales.
Read more
MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations
Summary
The paper introduces MR-CDM, a novel framework for time series forecasting that addresses the limitations of existing models in handling fixed-length inputs and multi-scale temporal patterns. Traditional forecasting methods struggle with variable-length time series and often fail to capture complex dynamics due to their reliance on fixed input windows. MR-CDM combines hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. This innovative approach allows the model to effectively capture trends, seasonality, and noise across different temporal resolutions. The authors validate the effectiveness of MR-CDM through experiments on four real-world datasets, demonstrating significant improvements over state-of-the-art baselines such as CSDI and Informer, with reductions in Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) by approximately 6-10%. The framework retains the generative strengths of diffusion models while enhancing robustness across heterogeneous sequence lengths and temporal scales, making it a promising solution for time series forecasting in various domains.
Methodology
The MR-CDM framework employs a five-stage pipeline that includes hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length sequences, and a conditionally guided diffusion process. This combination allows for the effective modeling of complex temporal dependencies and multi-scale dynamics in time series data.
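A hierarchical trend decomposition can be approximated with moving averages at increasing window sizes, as sketched below; the window choices are arbitrary, and this omits the adaptive embedding and conditional diffusion stages.

```python
import numpy as np

def multiresolution_decompose(x, windows=(4, 16, 64)):
    """Split a series into trends at several temporal scales plus a residual."""
    components, residual = [], x.astype(float).copy()
    for w in sorted(windows, reverse=True):            # coarsest scale first
        kernel = np.ones(w) / w
        trend = np.convolve(residual, kernel, mode="same")
        components.append(trend)
        residual = residual - trend
    return components, residual                        # sum(components) + residual == x

t = np.arange(512)
x = 0.01 * t + np.sin(2 * np.pi * t / 64) + 0.1 * np.random.default_rng(0).standard_normal(512)
components, residual = multiresolution_decompose(x)
print([round(float(c.std()), 3) for c in components], round(float(residual.std()), 3))
```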
Results
The experimental results indicate that MR-CDM outperforms traditional and diffusion-based forecasting models, achieving significant reductions in MAE and RMSE by approximately 6-10% across four real-world datasets, demonstrating its effectiveness in generating accurate time series forecasts.
Implications
The MR-CDM framework has potential applications in various fields requiring time series forecasting, such as finance, healthcare, and transportation. Its ability to handle variable-length inputs and capture multi-scale dynamics makes it suitable for real-world scenarios where data is often heterogeneous and complex.
Quality-Controlled Active Learning via Gaussian Processes for Robust Structure-Property Learning in Autonomous Microscopy
Optimization
Efficient ML
Robotics
- Introduces ActiveQC, a gated active learning framework that prioritizes high-quality data acquisition.
- Combines curiosity-driven sampling with physics-informed quality control to mitigate the effects of noisy data.
- Demonstrates superior performance over traditional active learning methods in structure-property learning tasks.
- Successfully applied in real-time autonomous microscopy experiments, validating its practical utility.
Read more
Quality-Controlled Active Learning via Gaussian Processes for Robust Structure-Property Learning in Autonomous Microscopy
Summary
This paper addresses the challenges of low-quality, noisy data in autonomous experimental systems used for materials research, particularly in structure-property learning tasks like Image-to-Spectrum (Im2Spec) and Spectrum-to-Image (Spec2Im) translations. The authors propose a novel gated active learning framework, termed ActiveQC, which integrates curiosity-driven sampling with a physics-informed quality control filter based on Simple Harmonic Oscillator model fits. This framework enables the automatic exclusion of low-fidelity data during data acquisition, enhancing the reliability of the learning process. Evaluations on a dataset of band-excitation piezoresponse spectroscopy (BEPS) data from PbTiO3 thin films demonstrate that ActiveQC outperforms traditional methods such as random sampling, standard active learning, and multitask learning strategies. The gated approach effectively mitigates the impact of noise during training and acquisition, leading to improved predictions in both forward and inverse tasks. Furthermore, the framework was successfully deployed in real-time experiments on BiFeO3 thin films, showcasing its practical applicability in autonomous microscopy. Overall, this work promotes a shift towards hybrid autonomy in self-driving labs, emphasizing the importance of integrating physics-informed quality assessment with active decision-making for more reliable scientific discovery.
Methodology
The proposed methodology involves a gated active learning framework that utilizes a physics-informed quality control filter based on Simple Harmonic Oscillator model fits. This approach allows the system to exclude low-fidelity data during acquisition by integrating curiosity-driven sampling techniques, which guide the exploration towards high-quality samples while avoiding misleading noisy data.
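A minimal sketch of the gated loop, under assumptions not taken from the paper (a scikit-learn Gaussian process, a generic SHO amplitude form, and an R² acceptance threshold): new measurements are proposed where the GP is most uncertain, but only those that an SHO fit explains well are added to the training set.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def sho_amplitude(w, A, w0, Q):
    """Simple-harmonic-oscillator amplitude response (illustrative form)."""
    return A * w0**2 / np.sqrt((w0**2 - w**2)**2 + (w0 * w / Q)**2)

def passes_quality_gate(freqs, response, r2_min=0.9):
    """Accept a measurement only if an SHO fit explains it well enough."""
    try:
        popt, _ = curve_fit(sho_amplitude, freqs, response,
                            p0=[response.max(), freqs[np.argmax(response)], 50.0],
                            maxfev=5000)
    except RuntimeError:
        return False
    pred = sho_amplitude(freqs, *popt)
    ss_res = np.sum((response - pred) ** 2)
    ss_tot = np.sum((response - response.mean()) ** 2) + 1e-12
    return 1.0 - ss_res / ss_tot >= r2_min

def next_measurement(gp, candidates, measured_mask):
    """Curiosity-driven choice: highest GP uncertainty among unmeasured points."""
    _, std = gp.predict(candidates, return_std=True)
    std = std.copy()
    std[measured_mask] = -np.inf
    return int(np.argmax(std))

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
# Loop (schematic): pick next point, measure its spectrum, keep it only if
# passes_quality_gate(...) is True, then refit gp on the accepted data.
```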
Results
The ActiveQC framework significantly outperformed random sampling, standard active learning, and multitask learning strategies in evaluations on BEPS data from PbTiO3 thin films. It effectively handled noise during training and acquisition, leading to more reliable predictions in both Im2Spec and Spec2Im tasks. The framework's deployment in real-time experiments on BiFeO3 thin films further confirmed its effectiveness in practical applications.
Implications
The findings of this research suggest that integrating physics-informed quality assessment with active learning can enhance the robustness and efficiency of data acquisition in autonomous experimental systems. This approach has the potential to accelerate scientific discovery in materials research by enabling more reliable structure-property learning, ultimately contributing to the development of advanced materials with desired properties.
Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes
Reinforcement Learning
Theory
Optimization
- Introduces an optimistic actor-critic framework for linear MDPs with parametric policies.
- Utilizes logit-matching regression for the actor and Langevin Monte Carlo for the critic.
- Achieves state-of-the-art sample complexity in both on-policy and off-policy settings.
- Demonstrates practical applicability through experiments in linear MDPs and Atari environments.
Read more
Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes
Summary
This paper addresses the limitations of existing actor-critic methods in reinforcement learning, particularly in the context of linear Markov Decision Processes (MDPs). The authors propose an optimistic actor-critic framework that utilizes parametric log-linear policies, which are more computationally efficient than traditional methods that rely on natural policy gradients (NPG) and implicit policies. The proposed framework includes a tractable logit-matching regression objective for the actor and employs approximate Thompson sampling via Langevin Monte Carlo for the critic to derive optimistic value estimates. The authors demonstrate that their algorithm achieves state-of-the-art sample complexity of Õ(ϵ⁻⁴) in the on-policy setting and Õ(ϵ⁻²) in the off-policy setting, aligning theoretical performance with practical applicability. The paper also includes experimental validation of the proposed methods in both linear MDPs and more complex environments such as Atari games, showcasing the effectiveness of the approach in real-world scenarios.
Methodology
The authors develop an optimistic actor-critic framework that employs parametric log-linear policies. The actor uses a logit-matching regression objective, while the critic estimates values using approximate Thompson sampling via Langevin Monte Carlo. The paper provides a theoretical analysis of the sample complexity for both on-policy and off-policy settings.
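For intuition, a bare-bones (unadjusted) Langevin sampler for a linear critic might look like the following; the step size, Gaussian prior, and the way the sampled critic induces optimism are simplified relative to the paper.

```python
import numpy as np

def langevin_sample_critic(Phi, targets, n_steps=500, step=1e-3, prior_prec=1.0, rng=None):
    """Approximate Thompson sampling of linear-critic weights w via unadjusted
    Langevin dynamics on a Gaussian-prior least-squares posterior.

    Phi:     (n, d) feature matrix of visited state-action pairs
    targets: (n,)   regression targets (e.g., bootstrapped returns)
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ w - targets) + prior_prec * w  # negative log-posterior gradient
        w = w - step * grad + np.sqrt(2.0 * step) * rng.standard_normal(d)
    return w  # one posterior sample; acting greedily w.r.t. it yields optimism on average

# Usage: w_sample = langevin_sample_critic(features, returns); q_hat = new_features @ w_sample
```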
Results
The proposed algorithm achieves sample complexities of Õ(ϵ⁻⁴) for on-policy and Õ(ϵ⁻²) for off-policy settings, which are competitive with existing methods. Experimental results validate the effectiveness of the approach in both linear MDPs and Atari environments, demonstrating its practical utility.
Implications
The findings suggest that the optimistic actor-critic framework can enhance the efficiency and effectiveness of reinforcement learning algorithms, particularly in environments where exploration is critical. This could lead to improved performance in various applications, including robotics and game playing.
Hybrid Quantum-Classical Spatiotemporal Forecasting for 3D Cloud Fields
Time Series
- Introduction of QENO, a hybrid quantum-classical framework for 3D cloud forecasting.
- Utilization of a topology-aware quantum enhancement block to model nonlocal interactions.
- Development of a dynamic fusion temporal unit that integrates quantum features with classical memory.
- Demonstrated superior performance over existing forecasting models in terms of accuracy and structural fidelity.
Read more
Hybrid Quantum-Classical Spatiotemporal Forecasting for 3D Cloud Fields
Summary
This paper presents QENO, a novel hybrid quantum-inspired spatiotemporal forecasting framework designed to improve the accuracy of three-dimensional (3D) cloud field predictions. Traditional forecasting methods struggle with the complex dynamics of cloud evolution due to their reliance on locality-biased representations, which fail to capture fine cloud structures. The proposed QENO framework integrates a classical spatiotemporal encoder for latent representation with a topology-aware quantum enhancement block that models nonlocal interactions. Additionally, it employs a dynamic fusion temporal unit that combines quantum-derived features with recurrent memory to better capture multiscale cloud dynamics. Experimental evaluations on CMA-MESO 3D cloud fields demonstrate that QENO significantly outperforms existing models such as ConvLSTM and PredRNN++ in various metrics, including mean squared error (MSE) and structural similarity index (SSIM). The results suggest that incorporating quantum-inspired techniques can enhance the modeling of complex atmospheric phenomena, paving the way for improved weather prediction and atmospheric analysis.
Methodology
The QENO framework consists of four main components: a classical spatiotemporal encoder for generating compact latent representations, a quantum enhancement block that captures nonlocal couplings in the latent space, a dynamic fusion temporal unit that integrates quantum features with recurrent memory, and a decoder for reconstructing future cloud volumes. The quantum block employs topology-aware entanglement patterns to enhance feature extraction.
Results
QENO achieved an MSE of 0.2038, an RMSE of 0.4514, and an SSIM of 0.6291 on the CMA-MESO 3D cloud fields dataset, outperforming several baseline models including ConvLSTM, PredRNN++, and Earthformer across multiple evaluation metrics.
Implications
The findings suggest that hybrid quantum-classical approaches can effectively address the challenges of 3D cloud forecasting, potentially leading to more accurate weather predictions and better understanding of atmospheric processes. This could have significant implications for meteorology, climate science, and remote sensing applications.
Symbolic Density Estimation: A Decompositional Approach
Theory
Interpretability
- Introduction of AI-Kolmogorov for Symbolic Density Estimation (SymDE).
- Multi-stage pipeline includes decomposition, nonparametric estimation, support estimation, and symbolic regression.
- Demonstrated efficacy on synthetic and exotic distributions, including applications in high-energy physics.
- Addresses challenges in ensuring valid probability distributions and discovering complex symbolic expressions.
Read more
Symbolic Density Estimation: A Decompositional Approach
Summary
The paper introduces AI-Kolmogorov, a novel framework for Symbolic Density Estimation (SymDE), which aims to bridge the gap between nonparametric and parametric density estimation methods. Symbolic regression (SR) has been successful in supervised learning but has not been extensively applied to density estimation tasks. The proposed multi-stage pipeline consists of problem decomposition through clustering or probabilistic graphical model structure learning, followed by nonparametric density estimation, support estimation, and finally, symbolic regression on the density estimate. The authors demonstrate the effectiveness of AI-Kolmogorov on various datasets, including synthetic mixture models and multivariate normal distributions, as well as distributions relevant to high-energy physics. The framework is designed to discover underlying distributions and provide insights into the mathematical expressions that describe them, addressing challenges such as ensuring validity constraints on probability distributions and the curse of dimensionality.
Methodology
The methodology involves a multi-stage pipeline: (1) problem decomposition using clustering or probabilistic graphical models, (2) nonparametric density estimation, (3) support estimation, and (4) applying symbolic regression to the density estimate using tools like PySR. The approach emphasizes the search for general symbolic functional forms while ensuring constraints of non-negativity and normalization are met.
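A compressed sketch of the last two stages (nonparametric estimation followed by symbolic regression) under stated assumptions: support estimation is omitted, the mixture data is synthetic, and the PySR operators and iteration count shown are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Stage: nonparametric density estimate on 1-D samples from a two-component mixture.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 1.0, 500)])[:, None]
kde = KernelDensity(bandwidth=0.3).fit(samples)

grid = np.linspace(-5, 5, 400)[:, None]
density = np.exp(kde.score_samples(grid))  # score_samples returns log-density

# Stage: symbolic regression on (x, density) pairs; hyperparameters are placeholders.
# from pysr import PySRRegressor
# sr = PySRRegressor(niterations=100, binary_operators=["+", "-", "*", "/"],
#                    unary_operators=["exp", "square"])
# sr.fit(grid, density)   # returns candidate closed-form expressions for the density
```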
Results
The experiments conducted show that AI-Kolmogorov effectively discovers underlying distributions and provides interpretable mathematical expressions for various datasets. The framework outperforms traditional methods by avoiding restrictive model assumptions and offering insights into the structure of the data.
Implications
The implications of this work extend to fields requiring interpretable models for density estimation, such as high-energy physics and other scientific domains where understanding the underlying distribution is crucial. The framework could enhance model interpretability in complex data-driven tasks.
Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs
Theory
Efficient ML
Optimization
- Introduction of Physics-Guided Transformer (PGT) that integrates physical structure into self-attention.
- PGT achieves significant improvements in reconstruction accuracy and stability compared to traditional PINNs and other methods.
- Utilizes a heat-kernel-derived additive bias to enforce physical consistency in attention mechanisms.
- Demonstrates effective performance on both diffusion-dominated and convection-dominated systems.
Read more
Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs
Summary
The paper introduces the Physics-Guided Transformer (PGT), a novel neural architecture designed to reconstruct continuous physical fields from sparse observations, particularly for nonlinear systems governed by partial differential equations (PDEs). Traditional physics-informed approaches often struggle with issues like gradient imbalance and instability when data is limited. PGT addresses these challenges by embedding physical structures directly into its self-attention mechanism, using a heat-kernel-derived additive bias to ensure that the model respects diffusion physics and temporal causality. The architecture features a FiLM-modulated sinusoidal implicit decoder that adapts its spectral response based on the inferred global context. The authors evaluate PGT on two benchmark systems: the one-dimensional heat equation and the two-dimensional incompressible Navier–Stokes equations. Results show that PGT significantly outperforms existing methods, achieving a relative L2 error of 5.9×10⁻³ in 1D reconstruction with only 100 observations, and a governing-equation residual of 8.3×10⁻⁴ in the 2D cylinder-wake problem. The findings indicate that integrating physical priors at the representational level enhances optimization stability and physical coherence, making PGT a promising approach for reliable reconstruction of nonlinear dynamical systems governed by PDEs.
Methodology
The PGT architecture incorporates a physics-guided attention mechanism that embeds an additive bias derived from the heat-kernel Green’s function into the self-attention process. This allows the model to respect the causal and diffusive structure of PDEs. The architecture also features a FiLM-modulated SIREN decoder that adjusts its frequency response based on the learned context, enabling accurate reconstruction of high-frequency details. A composite uncertainty-weighted loss function is employed, combining various sources of supervision.
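As a rough illustration of what a heat-kernel-derived additive attention bias can look like, the sketch below adds a diffusion-consistent, causally masked bias to standard dot-product scores; the exact kernel form, masking, and decoder used by PGT may differ from this simplification.

```python
import math
import torch

def physics_biased_attention(q, k, v, coords, times, kappa=0.1):
    """Self-attention whose scores carry an additive heat-kernel-style bias.

    q, k, v: (n, d) token projections
    coords:  (n,) spatial coordinate of each token
    times:   (n,) time stamp of each token
    The bias -|x_i - x_j|^2 / (4*kappa*(t_i - t_j)) favors pairs consistent with
    diffusion, and pairs with t_j >= t_i are masked out to respect causality.
    """
    d = q.shape[-1]
    dt = times[:, None] - times[None, :]            # t_i - t_j
    dx2 = (coords[:, None] - coords[None, :]) ** 2
    causal = dt > 0
    bias = torch.where(causal,
                       -dx2 / (4.0 * kappa * dt.clamp(min=1e-6)),
                       torch.full_like(dx2, float("-inf")))
    scores = q @ k.T / math.sqrt(d) + bias
    scores.fill_diagonal_(0.0)  # let each token attend to itself so no row is all -inf
    return torch.softmax(scores, dim=-1) @ v
```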
Results
In experiments, PGT achieved a relative L2 error of 5.9×10⁻³ in one-dimensional heat equation reconstruction with only 100 observations, outperforming PINNs by a factor of 38 and sinusoidal implicit representations by over 90-fold. For the two-dimensional cylinder-wake problem, PGT achieved a governing-equation residual of 8.3×10⁻⁴ and a relative L2 error of 0.034, demonstrating superior performance compared to all baseline methods.
Implications
The integration of physical inductive biases into the attention mechanism of PGT could lead to more reliable and efficient reconstruction of complex physical systems, making it applicable in fields such as climate modeling, fluid dynamics, and material science. This approach may also inspire further research into physics-informed machine learning models that leverage domain knowledge for improved performance.
From Physics to Surrogate Intelligence: A Unified Electro-Thermo-Optimization Framework for TSV Networks
Optimization
Graph Learning
- Introduces a unified framework for electro-thermal modeling and optimization of TSV networks.
- Combines physics-informed analytical modeling with GNN surrogates for efficient design-space exploration.
- Achieves significant reduction in computational time for TSV configuration evaluations.
- Demonstrates strong validation results against traditional FEM methods.
Read more
From Physics to Surrogate Intelligence: A Unified Electro-Thermo-Optimization Framework for TSV Networks
Summary
This paper addresses the challenges posed by high-density through-substrate vias (TSVs) in 2.5D/3D heterogeneous integration, particularly focusing on signal integrity and thermal reliability issues. The authors propose a scalable electro-thermal modeling and optimization framework that integrates physics-informed analytical modeling, graph neural network (GNN) surrogates, and full-wave finite-element method (FEM) validation. The framework allows for efficient exploration of large design spaces, overcoming the computational limitations of traditional FEM simulations. A multi-conductor analytical model computes broadband S-parameters and effective thermal conductivities of TSV arrays, achieving a relative Frobenius error (RFE) of 5%-10% across various array sizes. The GNN surrogate, trained on analytical data and fine-tuned with HFSS simulations, generalizes well to larger arrays, maintaining an RFE below 2%. This enables rapid multi-objective Pareto optimization of TSV configurations, significantly reducing evaluation time from hours to minutes. The final designs are validated against HFSS and Mechanical simulations, demonstrating strong agreement. Overall, the proposed framework facilitates rapid electro-thermal co-design of TSV arrays, enhancing the efficiency of 3D IC design automation.
Methodology
The methodology involves a physics-informed analytical model for computing S-parameters and thermal conductivities, complemented by a GNN surrogate trained on analytical data and fine-tuned with HFSS simulations. This combination allows for rapid exploration of TSV configurations using a multi-objective Pareto optimization framework.
Results
The analytical model achieves a relative Frobenius error of 5%-10% for TSV arrays up to 15×15 in size, while the GNN surrogate maintains an RFE below 2% for larger arrays. The surrogate enables exploration of millions of TSV configurations in minutes, significantly reducing evaluation time from hours to minutes compared to traditional FEM methods.
Implications
The proposed framework has significant implications for the design of high-performance computing systems and heterogeneous integration technologies, allowing for rapid and efficient optimization of TSV networks, which is crucial for enhancing bandwidth, reducing latency, and improving thermal reliability in advanced IC designs.
Physics-Informed Framework for Impact Identification in Aerospace Composites
Theory
Interpretability
- Introduction of a physics-informed framework for impact identification in aerospace composites.
- Integration of physical knowledge with data-driven inference to enhance reliability.
- Demonstrated capability to infer impact parameters with high accuracy under challenging conditions.
- Stable performance even with reduced data and increased noise, indicating robustness.
Read more
Physics-Informed Framework for Impact Identification in Aerospace Composites
Summary
This paper presents a novel physics-informed impact identification (Phy-ID) framework aimed at enhancing the reliability of impact identification in aerospace composite structures, which are susceptible to internal damage from low-velocity impacts. The challenge lies in estimating impact energy from measured responses, which is complicated by sparse and noisy sensor data, nonlinear structural responses, and missing excitation parameters. The proposed Phy-ID framework integrates observational, inductive, and learning biases to merge physical knowledge with data-driven inference, resulting in a unified modeling strategy that ensures physically consistent and numerically stable impact identification. The framework employs physics-based energy indicators to structure the input space, constrains solutions through architectural design, and enforces governing relations via hybrid loss formulations. A disjoint inference formulation is utilized to demonstrate the framework's capabilities, allowing for the decoupled inference of impact velocity and impactor mass, while ensuring kinetic energy consistency for impact energy computation. Experimental evaluations indicate that the framework achieves mean absolute percentage errors below 8% for inferred impact velocity and impactor mass, and below 10% for impact energy, demonstrating stable performance even under reduced data availability and increased measurement noise. The results underscore the potential of the Phy-ID framework for practical monitoring systems in aerospace applications.
Methodology
The methodology involves a physics-informed machine learning approach that combines observational, inductive, and learning biases. It structures the input space using physics-based energy indicators, constrains solutions through architectural design, and enforces governing relations via hybrid loss formulations. The framework utilizes a disjoint inference formulation for decoupled modeling of impact parameters.
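One loose reading of the kinetic-energy consistency idea is sketched below: two decoupled heads predict impactor velocity and mass, and the impact energy is derived from them rather than predicted directly, so E = ½mv² holds by construction. The feature dimension, head sizes, and positivity constraint are assumptions, and the paper's hybrid loss terms are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisjointImpactHeads(nn.Module):
    """Two small heads infer velocity and mass separately from sensor features;
    impact energy is derived as E = 0.5 * m * v**2, so it is consistent by construction."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.v_head = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))
        self.m_head = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, features):
        v = F.softplus(self.v_head(features))  # keep velocity positive
        m = F.softplus(self.m_head(features))  # keep mass positive
        energy = 0.5 * m * v ** 2              # kinetic-energy consistency
        return v, m, energy
```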
Results
The experimental evaluations showed that the Phy-ID framework achieved mean absolute percentage errors below 8% for inferred impact velocity and impactor mass, and below 10% for impact energy. The framework demonstrated stable performance under conditions of reduced data availability and increased measurement noise, as well as generalization capabilities for out-of-distribution cases.
Implications
The findings suggest that the systematic integration of physics-informed biases can lead to reliable and efficient impact identification in aerospace composites, which is crucial for effective structural health monitoring. This approach has the potential to enhance safety and operational decision-making in aerospace applications.
Match or Replay: Self Imitating Proximal Policy Optimization
Reinforcement Learning
Robotics
Efficient ML
- Introduction of Self-Imitating Proximal Policy Optimization (SIPP) for improved exploration and sample efficiency.
- MATCH strategy utilizes optimal transport to prioritize rewarding state-action transitions in dense reward settings.
- REPLAY strategy enhances learning in sparse reward environments by replaying successful trajectories.
- Empirical validation shows SIPP outperforms state-of-the-art self-imitating RL methods.
Read more
Match or Replay: Self Imitating Proximal Policy Optimization
Summary
This paper addresses the challenges of inefficient exploration in Reinforcement Learning (RL), particularly in environments with sparse rewards. The authors propose a novel self-imitating on-policy algorithm called Self-Imitating Proximal Policy Optimization (SIPP), which enhances exploration and sample efficiency by leveraging past high-reward state-action pairs. The method incorporates two strategies: MATCH for dense reward environments, which uses optimal transport to prioritize state-action transitions that align with rewarding trajectories, and REPLAY for sparse reward environments, which replays successful trajectories to reinforce learning. Experimental results demonstrate that SIPP significantly improves learning efficiency and success rates across various environments, including MuJoCo, Animal-AI Olympics, and multi-goal PointMaze, outperforming existing self-imitating RL baselines. The findings highlight the potential of self-imitation as a robust strategy for enhancing exploration in RL, applicable to more complex tasks.
Methodology
The authors developed SIPP, which integrates self-imitation into the Proximal Policy Optimization (PPO) framework. The MATCH strategy employs optimal transport to guide exploration in dense reward environments, while the REPLAY strategy maintains an imitation buffer to replay successful trajectories in sparse reward settings. This approach avoids reliance on replay buffers or off-policy corrections, preserving PPO's stability.
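A minimal sketch of the REPLAY idea: successful trajectories stored in a buffer contribute a behavior-cloning term on top of the usual clipped PPO loss. The weighting coefficient and the assumption of discrete actions are illustrative, and the MATCH optimal-transport weighting is not shown.

```python
import torch
import torch.nn.functional as F

def ppo_plus_replay_loss(policy, ppo_loss, replay_states, replay_actions, beta=0.1):
    """Augment the clipped PPO objective with a self-imitation term that pushes
    the current policy toward actions taken in stored high-return trajectories.

    policy(replay_states) is assumed to return action logits for discrete actions;
    replay_states/replay_actions come from a buffer of the best episodes seen so far.
    """
    logits = policy(replay_states)
    imitation = F.cross_entropy(logits, replay_actions)  # behavior cloning on successes
    return ppo_loss + beta * imitation
```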
Results
SIPP achieved faster convergence and significantly higher success rates compared to existing self-imitating RL baselines across various environments, including dense and sparse reward scenarios. The experimental results indicate substantial improvements in learning efficiency, validating the effectiveness of both MATCH and REPLAY strategies.
Implications
The findings suggest that self-imitation can be a powerful tool for enhancing exploration in RL, potentially leading to more efficient learning in complex tasks. This approach may be applicable in various domains, including robotics, navigation, and other areas requiring effective decision-making under uncertainty.
Meteorology-Driven GPT4AP: A Multi-Task Forecasting LLM for Atmospheric Air Pollution in Data-Scarce Settings
Large Language Models
Time Series
Efficient ML
- GPT4AP is a parameter-efficient multi-task forecasting model for air pollution.
- The model utilizes a pre-trained GPT-2 backbone with adaptations to reduce trainable parameters.
- It demonstrates superior performance in few-shot and zero-shot learning scenarios compared to existing models.
- GPT4AP maintains competitive accuracy in long-term forecasting with full training data.
Read more
Meteorology-Driven GPT4AP: A Multi-Task Forecasting LLM for Atmospheric Air Pollution in Data-Scarce Settings
Summary
The paper introduces Meteorology-Driven GPT for Air Pollution (GPT4AP), a novel multi-task forecasting framework designed to address the challenges of air pollution prediction in data-scarce environments. Leveraging a pre-trained GPT-2 backbone, GPT4AP employs a parameter-efficient approach by freezing the self-attention and feed-forward layers while adapting lightweight positional and output modules through Gaussian rank-stabilized low-rank adaptation (rsLoRA). This design significantly reduces the number of trainable parameters, enhancing the model's efficiency. The performance of GPT4AP is evaluated across six real-world air quality monitoring datasets under various settings, including few-shot, zero-shot, and long-term forecasting. In few-shot scenarios, GPT4AP achieves an average mean squared error (MSE) of 0.686 and mean absolute error (MAE) of 0.442, outperforming existing models such as DLinear and ETSformer. In zero-shot cross-station transfer, it achieves an average MSE of 0.529 and MAE of 0.403, showcasing improved generalization capabilities. Even in long-term forecasting with full training data, GPT4AP remains competitive, achieving an average MAE of 0.429. These results highlight GPT4AP's robustness and efficiency in providing accurate air quality predictions with minimal labeled data, making it a valuable tool for environmental monitoring and policy planning.
Methodology
GPT4AP is built on a pre-trained GPT-2 architecture, where the self-attention and feed-forward layers are frozen to prevent overfitting. It employs Gaussian rank-stabilized low-rank adaptation (rsLoRA) for lightweight adaptation of positional embeddings and prediction heads, significantly reducing the number of trainable parameters while enhancing generalization across different forecasting conditions.
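The core adaptation mechanism can be illustrated with a LoRA-style adapter using the rank-stabilized scaling α/√r on a frozen linear layer; the exact placement (positional embeddings and prediction heads) and the Gaussian initialization details are only loosely modeled here.

```python
import math
import torch
import torch.nn as nn

class RSLoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update.

    rsLoRA scales the update by alpha / sqrt(r) rather than alpha / r, which
    keeps its magnitude stable as the rank r grows.
    """
    def __init__(self, base: nn.Linear, r=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                      # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.02)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # zero init
        self.scale = alpha / math.sqrt(r)

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Usage: wrap only the lightweight modules (e.g., a prediction head) and train those.
head = RSLoRALinear(nn.Linear(768, 24), r=8)
```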
Results
In few-shot learning, GPT4AP achieves an average MSE of 0.686 and MAE of 0.442, outperforming DLinear and ETSformer. In zero-shot cross-station transfer, it achieves an average MSE of 0.529 and MAE of 0.403. In long-term forecasting with full training data, it achieves an average MAE of 0.429, remaining competitive with specialized time-series models.
Implications
GPT4AP advances the application of foundation models in environmental forecasting, enabling accurate predictions with minimal data. This model can facilitate timely environmental assessments, policy planning, and public health protection, particularly in regions with limited monitoring infrastructure.
Distributed Online Submodular Maximization under Communication Delays: A Simultaneous Decision-Making Approach
Optimization
Robotics
Theory
- Introduces the DOG algorithm for distributed online submodular maximization.
- Addresses communication delays that hinder existing sequential and one-hop coordination methods.
- Establishes a trade-off between coordination performance and convergence time based on network structure.
- Provides theoretical performance guarantees and approximation ratios for DOG.
Read more
Distributed Online Submodular Maximization under Communication Delays: A Simultaneous Decision-Making Approach
Summary
This paper presents a novel distributed online algorithm, named Distributed Online Greedy (DOG), for multi-agent submodular maximization in environments characterized by communication delays. The authors address the limitations of existing methods that either rely on sequential communication, leading to excessive delays, or restrict coordination to one-hop neighborhoods, which hampers performance. DOG integrates adversarial bandit learning techniques with delayed feedback, allowing agents to make simultaneous decisions across arbitrary network topologies. The paper provides theoretical guarantees on the approximation performance of DOG relative to an optimal solution, highlighting the trade-off between coordination performance and convergence time influenced by communication delays. This work is particularly relevant for future multi-agent systems engaged in tasks such as target tracking and environmental mapping, where agents must operate under unpredictable and partially observable conditions.
Methodology
The authors developed the DOG algorithm by combining principles from adversarial bandit learning with delayed feedback mechanisms. This allows agents to make decisions simultaneously rather than sequentially, facilitating coordination across diverse network topologies without the need for extensive communication.
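For intuition, the sketch below shows an EXP3-style adversarial bandit whose importance-weighted updates are applied only when delayed rewards arrive, which is one plausible reading of the bandit component each agent runs; DOG's actual estimator, coordination step, and guarantees are not reproduced here.

```python
import numpy as np

class DelayedExp3:
    """EXP3-style adversarial bandit that applies importance-weighted updates
    whenever a (possibly delayed) reward for an earlier round finally arrives."""
    def __init__(self, n_actions, eta=0.05, seed=0):
        self.w = np.zeros(n_actions)   # cumulative importance-weighted reward estimates
        self.eta = eta
        self.rng = np.random.default_rng(seed)

    def probs(self):
        z = self.eta * (self.w - self.w.max())
        p = np.exp(z)
        return p / p.sum()

    def act(self):
        p = self.probs()
        a = int(self.rng.choice(len(p), p=p))
        return a, p[a]                 # keep p[a] so the delayed update can be unbiased

    def update(self, action, prob_at_play, reward):
        # Called once the delayed feedback for that round arrives over the network.
        self.w[action] += reward / prob_at_play
```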
Results
The analysis shows that DOG achieves a specific approximation ratio against optimal solutions, quantifying the suboptimality cost associated with decentralization. The results indicate that DOG can effectively balance coordination performance and convergence time, outperforming existing methods under communication constraints.
Implications
The findings suggest that DOG can significantly enhance the efficiency of multi-agent systems in real-world applications where communication delays are prevalent. This includes scenarios in robotics, environmental monitoring, and collaborative mapping, where timely and effective decision-making is critical.
InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
Computer Vision
Efficient ML
Theory
- InkDrop enhances stealthiness in backdoor attacks against Dataset Condensation.
- The method utilizes model uncertainty near decision boundaries to create effective perturbations.
- InkDrop maintains model utility while embedding malicious behavior into condensed datasets.
- Extensive experiments validate the effectiveness and imperceptibility of the proposed attack.
Read more
InkDrop: Invisible Backdoor Attacks Against Dataset Condensation
Summary
The paper introduces InkDrop, a novel approach to executing backdoor attacks within the context of Dataset Condensation (DC). While DC is a promising method for synthesizing compact datasets that retain the performance of larger datasets, it is vulnerable to backdoor attacks where malicious patterns can be embedded into the condensed datasets. Existing methods often prioritize attack effectiveness and model utility but neglect the stealthiness of the attacks. InkDrop addresses this gap by leveraging the uncertainty near model decision boundaries to create imperceptible perturbations that induce targeted misclassifications. The method selects candidate samples that are close to the decision boundary and learns instance-dependent perturbations that maintain perceptual and spatial consistency. Extensive experiments demonstrate that InkDrop effectively integrates adversarial intent into condensed datasets while preserving model utility and minimizing detectability.
Methodology
InkDrop identifies candidate samples near the decision boundary that show latent affinity to the target class. It employs a learnable attack model to generate customized, low-perceptibility perturbations for each input, trained under multiple objectives including contrastive loss, Earth Mover’s Distance loss, L2 regularization, and perceptual loss.
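A minimal sketch of the candidate-selection step as described, using the logit margin to the target class as a proxy for decision-boundary proximity; the perturbation generator and its multi-objective training are not shown, and the margin criterion here is an assumption.

```python
import torch

def boundary_candidates(model, inputs, target_class, k=64):
    """Pick inputs whose logit margin between their top prediction and the target
    class is smallest, i.e. samples already leaning toward the attacker's class."""
    with torch.no_grad():
        logits = model(inputs)
    top = logits.max(dim=1).values
    margin = top - logits[:, target_class]   # small margin => close to the boundary
    return torch.argsort(margin)[:k]         # indices of the k most promising samples
```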
Results
The experiments conducted across various datasets show that InkDrop successfully integrates backdoor triggers into condensed datasets with high effectiveness while maintaining low detectability and preserving model performance.
Implications
The findings suggest that future backdoor attack strategies must consider stealthiness alongside effectiveness, particularly in safety-critical applications. InkDrop's approach could inform the development of more secure dataset condensation methods and enhance the understanding of vulnerabilities in machine learning models.
Realistic Market Impact Modeling for Reinforcement Learning Trading Environments
Reinforcement Learning
Optimization
Time Series
- Introduction of a suite of Gymnasium-compatible trading environments with realistic market impact models.
- Significant differences in trading behavior and performance metrics when using the AC model compared to a fixed cost model.
- Hyperparameter optimization is crucial for improving out-of-sample performance and preventing pathological trading behaviors.
- The choice of algorithm interacts with the cost model in environment-specific ways, affecting overall trading outcomes.
Read more
Realistic Market Impact Modeling for Reinforcement Learning Trading Environments
Summary
This paper addresses the limitations of traditional reinforcement learning (RL) trading environments that often neglect realistic transaction costs, leading to suboptimal trading behaviors in real-world scenarios. The authors introduce a suite of three Gymnasium-compatible trading environments—MACE stock trading, margin trading, and portfolio optimization—that incorporate nonlinear market impact models based on the Almgren–Chriss framework and the square-root impact law. These environments feature a unified infrastructure with pluggable cost models, permanent impact tracking, and detailed trade-level logging. The authors evaluate five deep reinforcement learning (DRL) algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ 100, comparing a fixed 10 basis points cost model against the more realistic AC model with optimized hyperparameters. The findings reveal that the choice of cost model significantly influences both the performance and trading behavior of the algorithms, highlighting the necessity of hyperparameter optimization to avoid pathological trading behaviors. The paper concludes by releasing the environment suite as an open-source extension to FinRL-Meta, facilitating further research in RL-based trading.
Methodology
The authors developed three trading environments that integrate nonlinear market impact models. They systematically compared five DRL algorithms under different cost models, utilizing Optuna for hyperparameter optimization. The environments were designed to log trade-level data and track permanent market impacts, allowing for detailed performance analysis.
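As a rough illustration of the kind of cost model involved, here is a square-root temporary-impact cost with a linear permanent-impact term and a fixed half-spread; the constants and the exact Almgren–Chriss parameterization used in the released environments are assumptions for this sketch.

```python
import numpy as np

def trade_cost(shares, price, daily_volume, sigma,
               eta=0.1, gamma=0.05, half_spread_bps=1.0):
    """Estimated execution cost of trading `shares` at `price`.

    - temporary impact: eta * sigma * sqrt(|q| / V)  (square-root law, per share)
    - permanent impact: gamma * sigma * (|q| / V)    (linear in participation)
    - plus a fixed half-spread, in basis points
    """
    q = abs(shares)
    participation = q / max(daily_volume, 1.0)
    temporary = eta * sigma * np.sqrt(participation)
    permanent = gamma * sigma * participation
    spread = half_spread_bps * 1e-4
    return (temporary + permanent + spread) * price * q

# Example: cost of buying 50,000 shares of a $100 stock with 2% daily vol and 5M ADV.
print(trade_cost(50_000, 100.0, 5_000_000, 0.02))
```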
Results
The study found that the cost model significantly affects both absolute performance and the relative ranking of algorithms. For instance, optimized PPO achieved a 20% return under the baseline model but dropped to 15% under the AC model, while TD3 improved from 15% to 18%. In portfolio optimization, TD3 under the AC model achieved the best performance at 32%, compared to 26% under the baseline. The AC model also led to reduced trading costs and turnover, demonstrating its impact on trading behavior.
Implications
The findings suggest that incorporating realistic market impact models in RL trading environments can lead to more reliable and effective trading strategies. This work can inform the development of more sophisticated trading algorithms that better reflect real-world market conditions, potentially improving the performance of automated trading systems.
Subspace Optimization for Backpropagation-Free Continual Test-Time Adaptation
Optimization
Efficient ML
- Introduction of PACE, a backpropagation-free TTA method optimizing normalization layers.
- Utilization of CMA-ES and Fastfood projections to enhance adaptation capabilities.
- Dynamic stopping criterion to minimize computational overhead during stable domains.
- Integration of a domain-specialized vector bank for rapid adaptation to recurring domains.
Read more
Subspace Optimization for Backpropagation-Free Continual Test-Time Adaptation
Summary
This paper presents PACE, a novel backpropagation-free continual test-time adaptation (TTA) system that optimizes the affine parameters of normalization layers directly. Existing derivative-free methods often struggle with balancing runtime efficiency and learning capacity, either limiting updates to input prompts or requiring continuous adaptation that is resource-intensive. PACE addresses these issues by employing the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) combined with Fastfood projections to optimize high-dimensional parameters within a low-dimensional subspace. This approach enhances adaptive performance while significantly improving runtime efficiency through a dynamic stopping criterion and a domain-specialized vector bank, which reduces computational redundancy. The framework achieves state-of-the-art accuracy across various benchmarks under continual distribution shifts, demonstrating over a 50% reduction in runtime compared to existing backpropagation-free methods. The findings indicate that PACE not only outperforms current baselines but also provides a tunable mechanism to balance accuracy and runtime, making it particularly suitable for deployment in resource-constrained environments.
Methodology
PACE employs the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize the affine parameters of normalization layers. It uses Fastfood projections to navigate a low-dimensional subspace, allowing for efficient adaptation without backpropagation. The system includes a dynamic stopping criterion based on the mean shift of the CMA-ES distribution to halt adaptation during stable phases, and maintains a vector bank for knowledge accumulation across domain shifts.
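A minimal sketch of subspace search with CMA-ES, substituting a plain Gaussian random projection for Fastfood and omitting the dynamic stopping rule and vector bank; it assumes the `cma` package and a caller-supplied unsupervised test-time loss (e.g., prediction entropy) evaluated with the candidate normalization parameters loaded into the network.

```python
import numpy as np
import cma

def adapt_affine_params(theta0, loss_fn, subspace_dim=32, sigma0=0.1, iters=20, seed=0):
    """Optimize high-dimensional affine parameters theta through a fixed random
    projection P: theta = theta0 + P @ z, with CMA-ES searching over low-dim z."""
    rng = np.random.default_rng(seed)
    d = theta0.size
    P = rng.standard_normal((d, subspace_dim)) / np.sqrt(subspace_dim)

    es = cma.CMAEvolutionStrategy(np.zeros(subspace_dim), sigma0,
                                  {"verbose": -9, "maxiter": iters})
    while not es.stop():
        zs = es.ask()
        es.tell(zs, [loss_fn(theta0 + P @ np.asarray(z)) for z in zs])
    return theta0 + P @ np.asarray(es.result.xbest)
```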
Results
The proposed method outperforms existing backpropagation-free approaches in both efficiency and accuracy, achieving a significant reduction in runtime (over 50%) while maintaining high performance across multiple benchmarks under continual distribution shifts.
Implications
The advancements presented in PACE have significant implications for deploying neural networks in resource-constrained environments, such as edge devices, where traditional backpropagation methods are not feasible. The ability to adapt efficiently to changing data distributions without heavy computational costs enhances the robustness and usability of machine learning models in real-world applications.
Refined Detection for Gumbel Watermarking
NLP
Large Language Models
Theory
- Introduces a refined detection mechanism for Gumbel watermarking.
- Proven to be nearly optimal among model-agnostic watermarking schemes.
- Detection can be performed without access to the model.
- Establishes upper and lower bounds on token requirements for detection.
Read more
Refined Detection for Gumbel Watermarking
Summary
This paper presents a refined detection mechanism for the Gumbel watermarking scheme introduced by Aaronson in 2022. The proposed detection method is shown to be nearly optimal among model-agnostic watermarking schemes, assuming that the next-token distribution is sampled independently and identically distributed (i.i.d.). The authors focus on watermarking the outputs of large language models, where the goal is to enable the model owner to distinguish between sequences generated by the model and those that are not. The detection process involves reconstructing noise random variables from the text and applying a statistical test to determine if the observed sequences were sampled i.i.d. from a uniform distribution. The new detection test is theoretically more efficient than the previous one proposed by Aaronson, and the authors establish upper and lower bounds on the number of tokens required for effective detection, which depend on an entropy-like quantity of the next-token distributions. This work contributes to the growing field of watermarking in language models, providing a more efficient method for ensuring the integrity of generated text.
Methodology
The methodology involves a two-step process: first, watermarking the output of a language model using a secret key and pseudorandom variables; second, detecting the watermark by reconstructing noise variables from the generated text and applying a goodness-of-fit statistical test to assess whether the observed data deviates from the expected uniform distribution.
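The sketch below illustrates the general Gumbel/exponential-sampling watermark and a score-based detector in the spirit of the two-step process above; the pseudorandom function, the statistic, and the thresholding are illustrative stand-ins rather than the paper's refined test.

```python
import hashlib
import numpy as np

def prf_uniforms(key, context, vocab_size):
    """Deterministic per-(key, context) uniforms, one per vocabulary token."""
    seed = int.from_bytes(hashlib.sha256((key + "|" + context).encode()).digest()[:8], "big")
    return np.random.default_rng(seed).random(vocab_size)

def watermarked_sample(probs, key, context):
    """Gumbel-trick watermarking: pick argmax_v r_v ** (1 / p_v)."""
    r = prf_uniforms(key, context, probs.size)
    return int(np.argmax(np.power(r, 1.0 / np.maximum(probs, 1e-12))))

def detection_score(tokens, contexts, key, vocab_size):
    """Sum of -log(1 - r_token). For unwatermarked text each term is ~Exp(1), so the
    sum concentrates near len(tokens); watermarked text scores noticeably higher."""
    score = 0.0
    for tok, ctx in zip(tokens, contexts):
        r = prf_uniforms(key, ctx, vocab_size)
        score += -np.log(1.0 - r[tok] + 1e-12)
    return score  # compare against a threshold set by the desired false-positive rate
```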
Results
The authors demonstrate that their detection mechanism is statistically more efficient than previous methods and provide theoretical bounds on the number of tokens needed for detection, which are dependent on the entropy of the next-token distributions. The results indicate that the new detection test can effectively identify watermarked text with high confidence.
Implications
The findings have significant implications for the development of watermarking techniques in natural language processing, particularly in ensuring the authenticity and ownership of generated content from large language models. This could enhance the security and integrity of AI-generated text in various applications.
The Geometry of Polynomial Group Convolutional Neural Networks
Theory
- Introduction of a mathematical framework for PGCNNs using graded group algebras.
- Two parametrization methods (Hadamard and Kronecker products) for polynomial activation functions.
- Dimension of the neuromanifold is determined by the number of layers and group size, not by polynomial degree.
- Description of the general fiber of the Kronecker parametrization and conjecture for Hadamard parametrization.
Read more
The Geometry of Polynomial Group Convolutional Neural Networks
Summary
This paper introduces a novel mathematical framework for Polynomial Group Convolutional Neural Networks (PGCNNs) using graded group algebras. The authors present two distinct parametrizations of the PGCNN architecture, derived from Hadamard and Kronecker products, which are interconnected through a linear mapping. The study computes the dimension of the associated neuromanifold, revealing that it is solely dependent on the number of layers and the size of the group, independent of the polynomial degree or group structure. Furthermore, the paper describes the general fiber of the Kronecker parametrization and conjectures a similar description for the Hadamard parametrization, supported by computations for small groups and shallow networks. The authors also provide a computational package, PGCNNGeometry2, to facilitate the analysis of these networks for arbitrary finite groups.
Methodology
The authors utilize tools from neuroalgebraic geometry to analyze the neuromanifolds associated with PGCNNs. They formalize the construction of PGCNNs in the context of graded group algebras and derive two parametrization maps based on polynomial activation functions. The analysis includes theoretical computations and conjectures supported by empirical results for specific groups and network configurations.
Results
The paper establishes that the neuromanifold M_Φ and the image M_φ of the parametrization maps have the same dimension, specifically L(|G| - 1) + 1, where L is the number of layers and |G| is the size of the group. The authors also provide a detailed description of the general fiber for the Kronecker parametrization and conjecture a similar structure for the Hadamard parametrization.
Implications
The findings suggest that PGCNNs can effectively leverage group symmetries to reduce sample complexity and improve model expressivity. The mathematical framework and computational tools developed in this paper may facilitate further research in equivariant neural networks and their applications across various domains, including computer vision and theoretical physics.
FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks
Theory
Efficient ML
Interpretability
- Introduction of two FI-KAN architectures: Pure FI-KAN and Hybrid FI-KAN.
- Learnable fractal dimensions allow for adaptive basis functions that match target regularity.
- Hybrid FI-KAN shows substantial performance improvements over traditional KAN across various benchmarks.
- The study provides empirical evidence supporting the regularity-matching hypothesis in function approximation.
Read more
FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks
Summary
This paper introduces Fractal Interpolation Kolmogorov-Arnold Networks (FI-KAN), a novel architecture designed to enhance the approximation of non-smooth functions by integrating learnable fractal interpolation function (FIF) bases into Kolmogorov-Arnold Networks (KAN). Traditional KANs utilize B-spline bases, which are not optimal for functions exhibiting non-trivial geometric regularity. FI-KAN presents two variants: Pure FI-KAN, which completely replaces B-splines with FIF bases, and Hybrid FI-KAN, which retains B-splines while adding a fractal correction. The key innovation lies in treating the iterated function system (IFS) contraction parameters as trainable, allowing the model to adapt the fractal dimension of the basis functions to the target function's regularity during training. Extensive experiments demonstrate that Hybrid FI-KAN significantly outperforms KAN across various regularity levels, achieving up to 79× improvement on non-smooth PDE solutions. The results validate the hypothesis that matching basis geometry to target regularity is crucial for effective function approximation, particularly in applications involving structured roughness.
Methodology
The paper employs a theoretical framework based on iterated function systems (IFS) to develop the FI-KAN architectures. It incorporates learnable parameters that adjust the fractal dimension of the basis functions, allowing the model to adapt to the regularity of the target functions. The performance of FI-KAN is evaluated through comprehensive experiments on Hölder regularity benchmarks and non-smooth PDE solutions, comparing results against traditional KAN architectures.
Results
Hybrid FI-KAN outperforms KAN at every regularity level, achieving improvements ranging from 1.3× to 33× on Hölder regularity benchmarks. On fractal targets, it achieves up to 6.3× reduction in mean squared error (MSE) compared to KAN, maintaining a 4.7× advantage at 5 dB SNR. For non-smooth PDE solutions, Hybrid FI-KAN demonstrates up to 79× improvement on rough-coefficient diffusion problems and 3.5× on L-shaped domain corner singularities.
Implications
The findings suggest that incorporating fractal geometry into neural architectures can significantly enhance the approximation of complex, non-smooth functions. This has potential applications in fields such as computational science, engineering, and finance, where modeling of irregular phenomena is critical. The regularity-matching principle could inform future designs of neural networks for improved performance in various domains.
AutoStan: Autonomous Bayesian Model Improvement via Predictive Feedback
Optimization
Theory
Interpretability
- AutoStan autonomously builds and improves Bayesian models using Stan without manual intervention.
- The framework utilizes NLPD and sampler diagnostics as feedback for iterative model enhancement.
- AutoStan demonstrates superior performance on diverse datasets compared to existing black-box methods.
- The approach is agent-agnostic, applicable to any CLI coding agent capable of executing shell commands.
Read more
AutoStan: Autonomous Bayesian Model Improvement via Predictive Feedback
Summary
The paper introduces AutoStan, a novel framework that enables a command-line interface (CLI) coding agent to autonomously create and iteratively enhance Bayesian models using the Stan programming language. The agent operates in a continuous loop where it writes a Stan model file, performs Markov Chain Monte Carlo (MCMC) sampling, and evaluates model performance based on two feedback signals: the negative log predictive density (NLPD) on held-out data and the diagnostics from the sampler. AutoStan is evaluated on five datasets representing various modeling structures, demonstrating its capability to evolve from simple linear regression to more complex models, such as those incorporating Student-t robustness and hierarchical structures. Notably, AutoStan matches or exceeds the performance of TabPFN, a leading black-box method, while maintaining interpretability. The framework is agent-agnostic, allowing any CLI coding agent with file editing and command execution capabilities to implement the optimization loop. This work marks a significant advancement in automating Bayesian modeling processes, reducing the need for manual intervention and domain-specific guidance.
Methodology
AutoStan employs a CLI coding agent that iteratively edits Stan model files and executes MCMC sampling. The agent evaluates model performance using NLPD and sampler diagnostics, making decisions to keep or revert changes based on these metrics. The agent operates autonomously, exploring data and generating models without specific domain instructions.
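A schematic of the keep-or-revert control flow might look as follows; `fit_and_diagnose`, `nlpd`, and `propose_model` are hypothetical caller-supplied callables standing in for the Stan/MCMC run, the held-out score, and the agent's edit step, so this is a sketch of the loop structure rather than the actual AutoStan code.

```python
def improvement_loop(initial_code, fit_and_diagnose, nlpd, propose_model, n_rounds=10):
    """Keep-or-revert search over Stan model files.

    fit_and_diagnose(code) -> (fit, diagnostics dict)  # runs MCMC, e.g. via cmdstanpy
    nlpd(fit)              -> float                    # held-out negative log predictive density
    propose_model(code, diagnostics) -> str            # the coding agent's edit step
    All three are caller-supplied; only the control flow is shown here.
    """
    best_code = initial_code
    fit, diagnostics = fit_and_diagnose(best_code)
    best_score = nlpd(fit)
    for _ in range(n_rounds):
        candidate = propose_model(best_code, diagnostics)
        fit, diagnostics = fit_and_diagnose(candidate)
        score = nlpd(fit)
        healthy = (diagnostics.get("divergences", 0) == 0
                   and diagnostics.get("max_rhat", 1.0) < 1.01)
        if healthy and score < best_score:      # keep only strict improvements
            best_code, best_score = candidate, score
        # otherwise revert: best_code stays as it was
    return best_code, best_score
```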
Results
The evaluation of AutoStan on five datasets revealed its ability to transition from basic models to complex ones, such as incorporating robustness to outliers and hierarchical structures. In particular, it successfully matched or outperformed TabPFN in predictive accuracy while ensuring model interpretability. The agent's autonomous exploration led to the discovery of effective modeling strategies across various statistical paradigms.
Implications
AutoStan's framework has the potential to significantly streamline the Bayesian modeling process, making it more accessible to practitioners by reducing the need for extensive manual tuning and monitoring. This could lead to broader adoption of Bayesian methods in various fields, enhancing model robustness and interpretability.
DiSGMM: A Method for Time-varying Microscopic Weight Completion on Road Networks
Graph Learning
Time Series
Optimization
- DiSGMM addresses two layers of data sparsity in microscopic weight completion.
- The method uses Gaussian mixture models for flexible and closed-form distribution representation.
- DiSGMM combines static and dynamic embeddings to balance known weights and inherent segment information.
- Experiments show significant performance improvements over existing methods.
Read more
DiSGMM: A Method for Time-varying Microscopic Weight Completion on Road Networks
Summary
The paper presents DiSGMM, a novel approach for time-varying microscopic weight completion in road networks, addressing the challenges of data sparsity at both the network and segment levels. Microscopic weights reflect fine-grained traffic conditions, such as individual vehicle speeds, which are crucial for tasks like traffic microsimulation and vehicle routing. The authors identify two main challenges: the sparsity of available weights across road segments and the need for a flexible representation of weight distributions that can capture complex traffic patterns. DiSGMM combines sparsity-aware embeddings with spatiotemporal modeling to effectively leverage known sparse weights and learn segment properties. It represents microscopic weight distributions as Gaussian mixture models (GMMs), allowing for closed-form representations that can adapt to various traffic conditions. Experimental results on real-world datasets demonstrate that DiSGMM outperforms existing state-of-the-art methods in accurately completing microscopic weights.
Methodology
DiSGMM employs a two-module approach: one for learning static embeddings that incorporate inherent segment information and another for dynamic embeddings that adapt to temporal changes while considering segment-layer sparsity. This dual approach allows for effective distribution estimation of microscopic weights using Gaussian mixture models.
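As a small illustration of why a Gaussian mixture is a convenient closed-form carrier for microscopic weights, the sketch below fits a two-component GMM to synthetic per-vehicle speeds with scikit-learn and queries it analytically; DiSGMM itself predicts GMM parameters from learned embeddings rather than fitting them directly to observed speeds.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# Synthetic per-vehicle speeds (km/h) on one segment during one time window.
rng = np.random.default_rng(1)
speeds = np.concatenate([rng.normal(30, 4, 80), rng.normal(55, 6, 40)])[:, None]

gmm = GaussianMixture(n_components=2, random_state=0).fit(speeds)
weights = gmm.weights_
means = gmm.means_.ravel()
stds = np.sqrt(gmm.covariances_).ravel()

# The closed-form mixture makes downstream queries cheap, e.g. P(speed < 40 km/h):
p_under_40 = sum(w * norm.cdf(40, m, s) for w, m, s in zip(weights, means, stds))
print(weights, means, stds, p_under_40)
```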
Results
The experimental evaluation on two real-world datasets indicates that DiSGMM significantly outperforms current state-of-the-art methods in terms of accuracy in completing microscopic weights, demonstrating its effectiveness in handling the complexities of traffic data.
Implications
The proposed method has potential applications in traffic microsimulation, vehicle routing, and real-time traffic management systems, enhancing the reliability and efficiency of transportation networks.
Loss Gap Parity for Fairness in Heterogeneous Federated Learning
Federated Learning
Optimization
Theory
- EAGLE algorithm minimizes disparities in loss gaps among clients in federated learning.
- Focuses on fairness in relative improvements rather than loss parity, avoiding performance degradation for certain clients.
- Theoretical convergence guarantees are provided for non-convex loss functions.
- Empirical results show EAGLE reduces loss gap variance while maintaining or improving overall model performance.
Read more
Loss Gap Parity for Fairness in Heterogeneous Federated Learning
Summary
This paper addresses the challenge of fairness in heterogeneous federated learning, where clients may have different data distributions and performance expectations from the global model. The authors propose a new algorithm, EAGLE, which aims to minimize disparities in loss gaps among clients. The loss gap is defined as the difference between the global model's performance and the best local model that a client could train on their own data. Unlike existing methods that focus on loss parity, which can disadvantage clients with more complex or lower-quality data, EAGLE emphasizes fairness in relative improvements. The paper provides theoretical convergence guarantees for EAGLE under non-convex loss functions and introduces a novel heterogeneity measure to characterize the performance of its iterates relative to standard federated learning objectives. Empirical results demonstrate that EAGLE effectively reduces loss gap disparities while maintaining competitive performance compared to strong baselines, including FedAvg and other fairness-focused approaches.
Methodology
The authors developed the EAGLE algorithm by augmenting the standard FedAvg approach with a regularization term that penalizes the variance of client loss gaps. This allows for a trade-off between overall accuracy and fairness. Theoretical analysis was conducted to ensure convergence under non-convex conditions, and empirical evaluations were performed using synthetic data and established benchmarks (EMNIST and DirtyMNIST) with both linear and convolutional models.
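The regularized objective can be written compactly; the sketch below is a minimal reading of "mean client loss plus a penalty on the variance of client loss gaps" and omits how EAGLE optimizes it in the federated setting (the λ coefficient and example numbers are illustrative).

```python
import torch

def eagle_objective(global_losses, best_local_losses, lam=1.0):
    """Mean client loss + lam * variance of loss gaps.

    global_losses:     (K,) tensor, loss of the global model on each client's data
    best_local_losses: (K,) tensor, loss of each client's best purely local model
    """
    gaps = global_losses - best_local_losses        # how much each client loses by federating
    return global_losses.mean() + lam * gaps.var(unbiased=False)

# Example with three heterogeneous clients:
g = torch.tensor([0.9, 0.4, 0.7])
b = torch.tensor([0.5, 0.35, 0.3])
print(eagle_objective(g, b, lam=0.5))
```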
Results
EAGLE was shown to consistently reduce the variance in loss gaps across clients while achieving comparable or improved overall performance relative to the FedAvg algorithm and other fairness-focused methods. The results indicate that EAGLE effectively addresses the fairness concerns in heterogeneous federated learning settings.
Implications
The findings suggest that EAGLE can enhance client participation in federated learning by ensuring fairer outcomes across diverse data distributions, which is particularly relevant in sensitive applications such as healthcare and finance. The approach can be adapted to various federated learning scenarios where fairness is a critical concern.
Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals
NLP
Large Language Models
Interpretability
- Introduction of the Squish and Release (S&R) framework for detecting hidden hallucinations in AI models.
- Identification of a fixed detector body in the model's architecture that is crucial for safety evaluations.
- Demonstration that synthetically engineered cores significantly outperform empirically discovered cores in releasing hidden signals.
- Establishment of the Order-Gap Benchmark to evaluate model performance across various domains.
Read more
Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals
Summary
This paper addresses the phenomenon of hallucination in conversational AI models, particularly when they produce confident outputs based on false premises. The authors introduce a novel framework called Squish and Release (S&R), which consists of a fixed detector body and a swappable detector core. The detector body is located in layers 24-31 of the model's residual stream, where safety evaluations occur. The core can be either a safety core, which shifts the model's focus from compliance to detection of hidden signals, or an absorb core, which suppresses detection and reinforces compliance. The authors empirically demonstrate the effectiveness of S&R using the Order-Gap Benchmark-500, which includes five-prompt chains across 500 domains. Key findings reveal that the model's ability to detect false premises is significantly suppressed under conversational pressure, with the safety core achieving a 76.6% release rate of hidden signals. The study highlights the importance of core engineering, showing that synthetically engineered cores outperform empirically discovered ones. The implications of this work extend to critical fields such as litigation and medical protocols, where accurate detection of false premises is essential. The authors propose further research to explore cross-architecture applications and core optimization.
Methodology
The authors developed the S&R framework, which includes a fixed detector body and swappable cores to manipulate the model's perception of safety signals. They conducted empirical evaluations using the Order-Gap Benchmark-500, analyzing the model's performance across 500 domains with five-prompt chains.
Results
The study found that the model exhibited near-total compliance (99.8%) under conversational pressure, with the detector body effectively shifting 93.6% of collapsed chains. The safety core released 76.6% of hidden signals, while the absorb core suppressed 58% of correctly detected chains. The results confirm the need for routing cores rather than blending them and demonstrate the specificity of the detection mechanism.
Implications
The findings have significant implications for fields where accurate information is critical, such as legal and medical contexts. The S&R framework could enhance the reliability of AI systems by improving their ability to detect and surface hidden errors in outputs, potentially reducing the risk of harm caused by false premises.
OneComp: One-Line Revolution for Generative AI Model Compression
Efficient ML
Generative Models
Optimization
- OneComp automates the model compression process, making it accessible to practitioners.
- The framework adapts to available hardware, optimizing quantization stages accordingly.
- It integrates various compression techniques, ensuring improved model quality with each stage.
- OneComp serves as a bridge between theoretical research and practical application in model deployment.
Read more
OneComp: One-Line Revolution for Generative AI Model Compression
Summary
The paper introduces OneComp, an open-source framework designed to streamline the post-training compression of generative AI models. As the deployment of foundation models faces challenges due to their large memory footprints and computational demands, OneComp addresses these issues by providing a resource-adaptive pipeline that automates the quantization process. The framework inspects the model and plans mixed-precision assignments, executing progressive quantization stages that range from layer-wise to global refinement. A notable feature of OneComp is its ability to treat the first quantized checkpoint as a deployable pivot, ensuring that subsequent stages enhance the model's quality as more computational resources are allocated. By transforming complex compression workflows into a user-friendly, automated process, OneComp bridges the gap between theoretical advancements in model compression and practical deployment, making it easier for practitioners to utilize state-of-the-art techniques without extensive manual configuration.
Methodology
OneComp employs a multi-stage quantization pipeline: an investigation phase profiles model sensitivity, a preprocessing step prepares the model for quantization, and a core quantization cascade adapts to the available resources. The cascade applies layer-wise, block-wise, and global post-training quantization in sequence, so each stage refines the checkpoint produced by the one before.
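The cascade's control flow can be pictured as a loop over progressively more expensive quantization passes that stops when the resource budget is exhausted, always keeping the last completed checkpoint as the deployable pivot. The sketch below is a schematic of that idea and not OneComp's actual API; the stage tuple format and budget accounting are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    model_state: dict
    stage: str

def cascade(model_state, budget, stages):
    """Resource-adaptive post-training quantization cascade (schematic).

    `stages` is an ordered list of (name, cost, quantize_fn) tuples, e.g.
    layer-wise -> block-wise -> global refinement. Each stage starts from the
    previous checkpoint, so later stages only ever refine the deployable model.
    """
    pivot = None
    for name, cost, quantize_fn in stages:
        if cost > budget:
            break                                    # budget exhausted: ship the pivot
        model_state = quantize_fn(model_state)
        pivot = Checkpoint(model_state, stage=name)  # last finished stage is deployable
        budget -= cost
    return pivot
```

A caller would wrap its preferred layer-wise, block-wise, and global PTQ routines as the stage functions.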
Results
The framework delivers strong model quality and efficiency, matching more complex quantization methods while requiring far less manual intervention. Its resource-adaptive engine ensures users receive the best quantized model attainable within their hardware constraints.
Implications
OneComp has the potential to democratize access to advanced model compression techniques, enabling organizations with limited resources to deploy large generative AI models effectively. Its open-source nature fosters collaboration and innovation in the field of model optimization.
Why not to use Cosine Similarity between Label Representations
Theory
- Cosine similarity does not correlate with model probabilities in softmax classifiers.
- It is possible to create models with identical probabilities but different cosine similarities.
- Translation of unembeddings can lead to misleading cosine similarity values.
- Centering or fixing the length of representations does not resolve the disconnect between cosine similarity and probabilities.
Read more
Why not to use Cosine Similarity between Label Representations
Summary
This paper critiques the use of cosine similarity as a measure of similarity between label representations (unembeddings) in softmax classifiers, such as image classifiers and autoregressive language models. The author demonstrates that cosine similarity does not reliably reflect the probabilities assigned by the model. Specifically, it is shown that for any two label representations, one can construct an alternative model that maintains the same probability distribution for all inputs while altering the cosine similarity to either 1 or -1. This disconnect arises because cosine similarity is sensitive to vector translations, while the probabilities assigned by the model remain invariant under such transformations. The paper provides theoretical proofs and examples to illustrate these points, ultimately arguing that cosine similarity should not be used to explain model probabilities in softmax classifiers.
Methodology
The paper employs theoretical proofs to establish the relationship between cosine similarity and model probabilities in softmax classifiers. It introduces lemmas that demonstrate how translations of unembeddings affect cosine similarity without altering the probability distributions. Specific examples are provided to illustrate these concepts.
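The core translation argument is easy to reproduce numerically: adding the same vector to every unembedding adds a common constant to all logits for any input, so the softmax probabilities are unchanged while pairwise cosine similarities shift. The snippet below is a minimal illustration of this invariance, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 5                      # feature dimension, number of labels
W = rng.normal(size=(k, d))       # label representations (unembeddings)
b = rng.normal(size=k)            # biases
x = rng.normal(size=d)            # an arbitrary input representation

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

c = 10.0 * rng.normal(size=d)     # translate every unembedding by the same vector
W_shifted = W + c                 # each logit gains the same constant c @ x

print(np.allclose(softmax(W @ x + b), softmax(W_shifted @ x + b)))   # True
print(cosine(W[0], W[1]), cosine(W_shifted[0], W_shifted[1]))        # values differ
```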
Results
The main result is that cosine similarity between label representations is not a reliable indicator of the probabilities assigned by a softmax classifier. The paper proves that equivalent models can exhibit vastly different cosine similarities while maintaining the same probability outputs.
Implications
The findings suggest that researchers and practitioners should be cautious when using cosine similarity to interpret model behavior, particularly in applications involving softmax classifiers. This has implications for model evaluation and the understanding of neural network representations.
A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank
Optimization
- Introduces a hierarchical latent risk-aware machine learning framework for predicting clinical trial success.
- Utilizes a curated dataset from TrialsBank comprising 13,700 trials.
- Achieves high F1-scores (0.93, 0.92, 0.91) across Phase I-III trials.
- Demonstrates improved discrimination of operational failures by incorporating latent risk factors.
Read more
A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank
Summary
This paper presents a novel hierarchical latent risk-aware machine learning framework aimed at predicting the operational success of clinical trials. The framework utilizes a curated subset of TrialsBank, which includes data from 13,700 trials, to forecast the ability to initiate, conduct, and complete trials according to predefined timelines and recruitment targets. The authors highlight the limitations of existing AI approaches that often rely on metrics unavailable at the trial design phase. Their methodology decomposes the prediction of operational success into two stages: first, predicting intermediate latent operational risk factors using over 180 features available before trial initiation, and second, integrating these predicted risks into a downstream model to estimate the probability of operational success. The models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines, achieving F1-scores of 0.93, 0.92, and 0.91 on Phase I, II, and III trials, respectively. The results indicate that incorporating latent risk drivers significantly enhances the discrimination of operational failures, demonstrating the framework's robustness and applicability in real-world clinical development scenarios.
Methodology
The methodology involves a two-stage modeling approach where intermediate latent operational risk factors are predicted using over 180 drug- and trial-level features available prior to trial initiation. These predicted risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and various machine learning models including XGBoost, CatBoost, and Explainable Boosting Machines were used for benchmarking.
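A two-stage design of this kind can be sketched as a stacked model: first-stage learners predict each latent operational risk from pre-initiation features, and a second-stage classifier consumes those predicted risks alongside the raw features to estimate the probability of operational success. The sketch below uses scikit-learn gradient boosting as a stand-in; the risk labels, feature layout, and model choices are illustrative rather than the paper's exact pipeline, and it omits the staged data-splitting that the paper uses to prevent leakage between the two stages.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_two_stage(X_design, risk_labels, success):
    """X_design: (n_trials, n_features) features available before trial initiation.
    risk_labels: dict of risk name -> binary per-trial label (e.g. slow enrolment).
    success: binary operational-success outcome per trial."""
    # Stage 1: one model per latent operational risk factor.
    risk_models = {name: GradientBoostingClassifier().fit(X_design, y)
                   for name, y in risk_labels.items()}

    # Predicted risk probabilities become extra features for the downstream model.
    risk_feats = np.column_stack(
        [m.predict_proba(X_design)[:, 1] for m in risk_models.values()])

    # Stage 2: predict operational success from design features plus predicted risks.
    success_model = GradientBoostingClassifier().fit(
        np.hstack([X_design, risk_feats]), success)
    return risk_models, success_model
```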
Results
The framework demonstrated strong out-of-sample performance with F1-scores of 0.93 for Phase I, 0.92 for Phase II, and 0.91 for Phase III trials. The incorporation of latent risk drivers significantly improved the model's ability to discriminate between operational successes and failures, maintaining robustness under independent inference evaluations.
Implications
The findings suggest that this latent risk-aware AI framework can facilitate early risk assessment in clinical trials, enabling better decision-making and resource allocation in clinical development. This could lead to more efficient trial designs and potentially reduce the high failure rates associated with drug development.
A Neural Tension Operator for Curve Subdivision across Constant Curvature Geometries
Computer Vision
Theory
Generative Models
- Introduction of a shared learned tension predictor for curve subdivision across different geometries.
- The method achieves lower bending energy and angular roughness compared to traditional fixed-tension methods.
- Theoretical guarantees ensure structural safety and convergence of the proposed approach.
- Empirical results demonstrate effective generalization beyond the training distribution.
Read more
A Neural Tension Operator for Curve Subdivision across Constant Curvature Geometries
Summary
This paper introduces a novel approach to curve subdivision across constant curvature geometries by employing a learned neural tension operator. Traditional interpolatory subdivision schemes utilize a single global tension parameter, which can be inadequate for varying curvature in control polygons. The authors propose a shared learned tension predictor that replaces this global parameter with per-edge insertion angles predicted by a 140,505-parameter neural network. This network takes local intrinsic features and a trainable geometry embedding as input, allowing for geometry-specific insertion operators across Euclidean, spherical, and hyperbolic spaces without requiring separate architectures. The method is supported by theoretical results, including a structural guarantee for tangent-safe insertions and a conditional convergence certificate for limit curves. Empirical evaluations demonstrate that the learned predictor significantly reduces bending energy and angular roughness compared to fixed-tension and manifold-lift baselines, achieving a favorable position on the fidelity-smoothness Pareto frontier. The study also highlights the generalization capabilities of the predictor on out-of-distribution examples, indicating its potential for broader applications in visual computing.
Methodology
The authors developed a residual neural network that predicts per-edge insertion angles based on local geometric features and a geometry embedding. The network's output is constrained to ensure valid angular ranges for vertex insertions. The method was evaluated against multiple baselines, including fixed-tension and Riemannian manifold lifts, using a comprehensive set of validation curves.
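Concretely, such a predictor can be a small residual MLP that maps local edge features plus a learned geometry embedding to a single insertion angle squashed into a valid range. The sketch below shows one plausible shape of that network in PyTorch; the layer sizes, feature dimension, and angle bound are assumptions and do not reproduce the paper's 140,505-parameter model.

```python
import math
import torch
import torch.nn as nn

class TensionPredictor(nn.Module):
    """Predict a per-edge insertion angle from local features and a geometry embedding."""

    def __init__(self, feat_dim=8, embed_dim=4, hidden=64, max_angle=0.5 * math.pi):
        super().__init__()
        # One trainable embedding per geometry: Euclidean, spherical, hyperbolic.
        self.geometry_embed = nn.Embedding(3, embed_dim)
        self.inp = nn.Linear(feat_dim + embed_dim, hidden)
        self.block = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden))
        self.out = nn.Linear(hidden, 1)
        self.max_angle = max_angle

    def forward(self, edge_features, geometry_id):
        g = self.geometry_embed(geometry_id)                  # (batch, embed_dim)
        h = torch.relu(self.inp(torch.cat([edge_features, g], dim=-1)))
        h = h + self.block(h)                                 # residual connection
        # A sigmoid keeps the predicted insertion angle inside a valid range.
        return self.max_angle * torch.sigmoid(self.out(h)).squeeze(-1)
```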
Results
The learned tension predictor outperformed fixed-tension methods in terms of bending energy and angular roughness, achieving reductions of 41% and 68% respectively on out-of-distribution examples. The results indicate that the proposed method occupies a unique position on the fidelity-smoothness Pareto frontier, demonstrating effective generalization capabilities.
Implications
This research has significant implications for the fields of geometric modeling and visual computing, particularly in applications requiring smooth curve generation across varying geometries, such as CAD, satellite-track visualization, and panoramic video rendering. The ability to learn adaptive tension parameters could enhance the design of complex geometric shapes and improve the efficiency of rendering processes.