AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 70
Update frequency: 8h
Days of history: 7
Representation-Guided Discrete Molecular Graph Retrosynthesis
Generative Models
Graph Learning
- Introduction of Graph-oriented Representation Guidance (GRG) for molecular graph retrosynthesis.
- Systematic exploration of design choices for representation guidance, including teacher representations and alignment strategies.
- GRG achieves state-of-the-art performance on USPTO-50k with improved accuracy and diversity.
- Representation similarity can be used for lightweight inference-time verification, enhancing output quality.
Summary
This paper addresses the challenge of improving template-free single-step retrosynthesis through the introduction of Graph-oriented Representation Guidance (GRG). Traditional stochastic process-based molecular graph generators typically rely on product-reactant pairs for training, which limits their ability to capture chemistry-relevant representations. The authors investigate whether representation guidance, inspired by advancements in computer vision, can enhance the performance of these generators. They conduct a systematic empirical study exploring various design choices, including teacher molecular representations, guidance schemes, and alignment strategies. The proposed GRG framework achieves significant improvements in accuracy and diversity on the USPTO-50k dataset, outperforming existing models. Notably, GRG enhances all top-k metrics in out-of-distribution settings and reduces training time and computational resources. Additionally, a representation-similarity-based reranking mechanism is introduced to further refine the output without requiring additional training. Overall, the study demonstrates that representation guidance can effectively distill chemical semantics, leading to better retrosynthesis outcomes.
Methodology
The authors conducted a systematic empirical study using a Markov-bridge retrosynthesis generator, exploring various guidance schemes (representation alignment and entanglement) and design choices (teacher representations, guidance location, alignment granularity, and correspondence strategies). They developed GRG to adapt representation guidance for molecular graphs under categorical corruption and validated their findings on the USPTO-50k dataset.
Results
The GRG framework achieved top-1, top-3, top-5, and top-10 accuracies of 58.6%, 77.2%, 83.4%, and 87.1%, respectively, while increasing diversity to 15.5. Representation guidance improved all top-k metrics in out-of-distribution settings and reduced training epochs by 35% and wall-clock time by 30%. The representation-similarity reranking mechanism further improved the quality of the ranked outputs.
Implications
The findings suggest that incorporating representation guidance can significantly enhance the performance of molecular graph retrosynthesis models, making them more effective for drug discovery and synthesis planning. The lightweight reranking mechanism also offers a practical approach for improving model outputs without extensive retraining.
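A minimal sketch of similarity-based reranking, for illustration only. The summary does not say which representations are compared or which similarity is used, so this assumes each candidate comes with an embedding vector and is scored by cosine similarity against a reference embedding. The names `rerank_by_similarity`, `candidate_a`, and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def rerank_by_similarity(candidates, reference):
    """Sort (name, embedding) pairs by similarity to the reference, best first."""
    scored = [(cosine_similarity(emb, reference), name) for name, emb in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

# Hypothetical toy embeddings; real ones would come from a pretrained encoder.
reference = [1.0, 0.0, 1.0]
candidates = [
    ("candidate_a", [0.0, 1.0, 0.0]),
    ("candidate_b", [1.0, 0.1, 0.9]),
    ("candidate_c", [0.5, 0.5, 0.0]),
]
ranking = rerank_by_similarity(candidates, reference)
print(ranking)
```

Because the scoring step only reads embeddings, it can be applied to the outputs of an already trained generator, which is consistent with the summary's point that no additional training is required.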
Relative Repairability: A Calibration-Based Diagnostic for High-Sparsity Post-Pruning Allocation
Efficient ML
- Introduces Relative Repairability (RR) as a diagnostic for evaluating pruning-induced damage and its recoverability.
- RR is most effective near the recoverability transition, where traditional allocation methods lose reliability.
- Demonstrates that RR can outperform existing sparsity allocation methods like ERK and LAMP in specific scenarios.
- Highlights the importance of considering repairable damage in high-sparsity pruning strategies.
Summary
This paper addresses the challenge of neural network pruning at very high sparsity levels, where the allocation of retained weights and the management of pruning-induced damage become critical. The authors introduce a novel diagnostic called Relative Repairability (RR), which evaluates the effectiveness of a lightweight repair procedure by comparing the raw activation distortion caused by layerwise pruning to the residual distortion after applying a channelwise variance-matching repair. RR provides a quantitative measure of how much damage remains after repair, allowing for more informed sparsity allocation across network layers. The study reveals that RR is particularly beneficial near an architecture-dependent recoverability transition, where traditional allocation methods begin to falter. Through experiments on ResNet18, ResNet34, and VGG16 BN with CIFAR10 and CIFAR100 datasets, the authors demonstrate that RR can outperform existing methods like ERK and LAMP in specific sparsity ranges. The findings suggest that effective high-sparsity pruning should consider not only the retention of weights but also the repairability of the damage caused by pruning.
Methodology
The authors developed the Relative Repairability (RR) diagnostic by measuring the raw activation shift caused by pruning and the residual shift after applying a lightweight repair method. This ratio estimates the fraction of distortion that remains after repair, guiding the allocation of sparsity across layers based on their repairability. Experiments were conducted on various neural network architectures using CIFAR10 and CIFAR100 datasets to validate the effectiveness of RR compared to existing allocation methods.
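The ratio described above can be sketched as follows. This is an illustrative reading of the summary, not the authors' implementation: distortion is assumed to be mean squared error between dense and pruned activations, and the repair is assumed to be a per-channel rescaling that matches standard deviations. All function names and the toy activations are hypothetical.

```python
import statistics

def mean_squared_error(a, b):
    """Mean squared difference between two equal-length activation lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def variance_match(pruned, dense):
    """Rescale one pruned channel so its standard deviation matches the dense one."""
    std_pruned = statistics.pstdev(pruned)
    if std_pruned == 0.0:
        return list(pruned)  # a constant channel cannot be rescaled
    scale = statistics.pstdev(dense) / std_pruned
    return [scale * x for x in pruned]

def relative_repairability(dense_channels, pruned_channels):
    """Residual distortion after repair divided by raw distortion.

    Values near 0 mean the repair removes most of the damage;
    values near 1 mean the damage is largely unrepairable.
    """
    raw = 0.0
    residual = 0.0
    for dense, pruned in zip(dense_channels, pruned_channels):
        raw += mean_squared_error(pruned, dense)
        residual += mean_squared_error(variance_match(pruned, dense), dense)
    return residual / raw if raw > 0.0 else 0.0

# Toy layer A: pruning only shrinks the activations, so rescaling repairs it.
rr_a = relative_repairability([[1.0, -1.0, 2.0, -2.0]], [[0.5, -0.5, 1.0, -1.0]])
# Toy layer B: activations are scrambled, so rescaling cannot help.
rr_b = relative_repairability([[1.0, 2.0, 3.0, 4.0]], [[4.0, 3.0, 2.0, 1.0]])
print(rr_a, rr_b)
```

Under this reading, a layer like A could absorb more sparsity than a layer like B at the same raw distortion, which is the kind of distinction the diagnostic is meant to expose.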
Results
The experiments showed that RR improved performance over ERK across a contiguous transition band of sparsity levels on CIFAR100 ResNet18 and surpassed LAMP in the upper part of this band. The analysis indicated that traditional methods could misallocate sparsity, leading to reduced post-repair recovery, while RR effectively identified layers that could tolerate more sparsity without compromising recoverability.
Implications
The findings suggest that incorporating repairability into pruning strategies can enhance the performance of sparse neural networks, particularly in resource-constrained environments. This approach could lead to more efficient deployment of models in practical applications where memory and computation are limited.
Hybrid Quantum-Classical Corrective Diffusion Modeling for Meteorological Downscaling
Generative Models
Theory
Efficient ML
- Introduction of a hybrid quantum-classical model for meteorological downscaling.
- Improved performance metrics (MAE and CRPS) compared to classical models.
- Preservation of key wind field characteristics while enabling controlled changes.
- Identification of limitations in real hardware deployment and generalization gaps.
Summary
This paper presents a novel approach to statistical downscaling in meteorology by integrating hybrid quantum-classical models into corrective diffusion frameworks. The authors propose a hybrid model that incorporates variational quantum circuit (VQC) layers into the bottleneck of a diffusion UNet, while maintaining a fully classical regression branch. This design aims to leverage quantum circuits as compact nonlinear feature maps for enhancing latent-channel mixing. The model is evaluated on 10 m wind components using a dataset from 2020, demonstrating stability and improved performance over classical models in terms of mean absolute error (MAE) and continuous ranked probability score (CRPS). Structural diagnostics indicate that the hybrid model preserves essential characteristics of wind fields, such as kinetic-energy spectra and windspeed distributions, while allowing for controlled modifications in tail behavior and extreme wind localization. The study also addresses the limitations posed by qubit availability and execution fidelity in real hardware deployments. Furthermore, the results reveal a generalization gap when tested on out-of-distribution data, highlighting the need for future work on stabilization and regularization techniques. Overall, this research showcases the potential of quantum hybridization in enhancing statistical downscaling methods for meteorological applications.
Methodology
The methodology involves a hybrid quantum-classical corrective diffusion model that integrates variational quantum circuit layers into the diffusion UNet architecture. The model operates by first predicting a mean output using a classical regression UNet, followed by a stochastic generation of residuals through the diffusion model. The evaluation is conducted on wind components using a dataset from 2020, with performance metrics including MAE and CRPS.
Results
The hybrid model demonstrates stability and improved performance over classical corrective diffusion models, with significant reductions in MAE and CRPS. Structural diagnostics confirm that the hybrid approach maintains the large-scale spatial organization of wind fields while allowing for controlled adjustments in extreme wind behavior. However, the model exhibits a generalization gap when applied to out-of-distribution data, indicating areas for further research.
Implications
The findings suggest that hybrid quantum-classical models can enhance the accuracy of meteorological downscaling, potentially improving weather forecasting and climate modeling applications. The research also emphasizes the importance of addressing hardware limitations and generalization challenges in future quantum machine learning endeavors.
Courtroom Analogy: New Perspective on Uncertainty-Aware Classification
Theory
Interpretability
Efficient ML
- Introduces a courtroom analogy for uncertainty-aware classification, framing it as a debate among class advocates.
- Models each advocate's opinion using Dirichlet distributions with interpretable parameters.
- Proposes MoDEX, a neural architecture that efficiently predicts uncertainty while maintaining interpretability.
- Demonstrates strong theoretical properties and state-of-the-art UQ performance across diverse benchmarks.
Summary
This paper introduces a novel framework for uncertainty-aware classification through a courtroom analogy, where classification is viewed as a structured debate among class-specific advocates. Each advocate represents a class and forms a probabilistic opinion modeled as a Dirichlet distribution. The concentration parameter of this distribution is decomposed into shared evidence and class-specific advocacy, allowing for a structured mixture of Dirichlet distributions. The proposed method, Mixture of Dirichlet EXperts (MoDEX), is a single-pass neural architecture that predicts these courtroom parameters, enabling efficient uncertainty quantification (UQ) while providing interpretable uncertainty estimates. The authors argue that this approach enhances the interpretability of predictive uncertainty and achieves state-of-the-art performance across various benchmarks, addressing the limitations of existing UQ methods that often lack clear semantics and insight into uncertainty aggregation.
Methodology
The authors develop a structured output family for uncertainty quantification by modeling class-specific opinions as Dirichlet distributions. They decompose the concentration parameters into shared evidence and class-specific advocacy, aggregating these opinions using input-dependent plausibility weights. The MoDEX architecture is trained end-to-end to predict these parameters, enabling efficient and interpretable uncertainty quantification.
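A rough sketch of how a weighted mixture of Dirichlet opinions yields class probabilities. The exact parameterization is not given in the summary, so the form used here is an assumption: each advocate's concentration is a uniform prior of 1 plus a shared evidence vector, with a scalar advocacy term added to the advocate's own class. The function names and numbers are hypothetical.

```python
def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with concentration vector alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]

def advocate_concentration(shared_evidence, advocacy, k):
    """Assumed form: prior of 1 + shared evidence, plus advocacy on class k."""
    alpha = [1.0 + e for e in shared_evidence]
    alpha[k] += advocacy
    return alpha

def mixture_predictive(shared_evidence, advocacies, weights):
    """Class probabilities of the mixture: weighted average of advocate means."""
    num_classes = len(shared_evidence)
    probs = [0.0] * num_classes
    for k, (advocacy, weight) in enumerate(zip(advocacies, weights)):
        mean_k = dirichlet_mean(advocate_concentration(shared_evidence, advocacy, k))
        for c in range(num_classes):
            probs[c] += weight * mean_k[c]
    return probs

# Hypothetical outputs of a single forward pass for a 3-class problem.
shared_evidence = [2.0, 0.0, 0.0]   # evidence all advocates agree on
advocacies = [3.0, 1.0, 1.0]        # how strongly each advocate argues its class
weights = [0.6, 0.3, 0.1]           # plausibility weights, summing to 1
probs = mixture_predictive(shared_evidence, advocacies, weights)
print(probs)
```

Keeping the shared and class-specific terms separate is what makes the output inspectable: one can ask whether a confident prediction comes from evidence every advocate sees or from one advocate arguing hard.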
Results
MoDEX achieves state-of-the-art performance in uncertainty quantification across various benchmarks, providing interpretable uncertainty estimates that reflect the underlying semantics of the classification task. The structured approach allows for a clearer understanding of how uncertainty is formed and aggregated, addressing the limitations of previous methods.
Implications
The courtroom analogy framework could be applied in high-stakes domains such as healthcare, finance, and autonomous systems, where understanding and quantifying uncertainty is crucial for decision-making. The interpretability of the proposed method may enhance trust in machine learning models deployed in safety-critical applications.
Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning
Reinforcement Learning
Theory
Optimization
- Develops a global convergence theory for Wasserstein Policy Gradient in entropy-regularized RL.
- Utilizes Bellman structure to replace traditional convexity arguments in convergence analysis.
- Establishes relationships between value improvement, Fisher information, and KL divergence.
- Demonstrates geometric convergence properties despite the non-convex nature of the RL objective.
Summary
This paper presents a global convergence theory for the Wasserstein Policy Gradient (WPG) method applied to entropy-regularized reinforcement learning (RL). The authors address the challenge of understanding the global convergence properties of WPG, which utilizes optimal-transport geometry for policy optimization. Traditional analyses based on convexity do not apply due to the unique structure of the RL objective, which is influenced by the Bellman recursion. The authors introduce a novel approach that leverages the Bellman structure, showing that the soft Bellman residual can be represented in terms of Kullback-Leibler (KL) divergence with respect to a Gibbs policy. They establish a connection between value improvement and Fisher information through a Bellman resolvent identity, and relate the optimality gap to KL divergence via Bellman contraction. Additionally, they provide uniform control over the evolving Gibbs family, leading to a distributional Polyak–Łojasiewicz condition that supports geometric convergence despite the non-convex nature of the RL objective. The findings suggest that WPG can achieve global convergence in continuous-action RL settings, offering a significant advancement in understanding policy optimization methods.
Methodology
The authors employ a theoretical approach that integrates Bellman equations with optimal transport geometry. They analyze the Wasserstein Policy Gradient method by establishing connections between the soft Bellman residual, Fisher information, and KL divergence. The analysis includes deriving a distributional Polyak–Łojasiewicz condition and ensuring uniform control over the evolving Gibbs family to manage discretization errors.
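For readers unfamiliar with this type of condition, the classical Euclidean version shows why it yields geometric convergence without convexity. The block below is the standard textbook statement for minimizing a smooth function f with minimum value f*; the paper's distributional version (stated in terms of Fisher information and KL divergence) is not reproduced in the summary and may differ in form.

```latex
% Classical Polyak–Łojasiewicz (PL) inequality with constant \mu > 0:
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
\quad \text{for all } x .

% Along the gradient flow \dot{x}_t = -\nabla f(x_t):
\frac{d}{dt}\bigl(f(x_t) - f^{*}\bigr)
  = -\lVert \nabla f(x_t) \rVert^{2}
  \;\le\; -2\mu\,\bigl(f(x_t) - f^{*}\bigr),

% so by Grönwall's inequality the gap decays geometrically:
f(x_t) - f^{*} \;\le\; e^{-2\mu t}\,\bigl(f(x_0) - f^{*}\bigr).
```

The inequality only requires that a small gradient implies a small optimality gap, which is why it can hold for non-convex objectives.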
Results
The paper shows that the Wasserstein Policy Gradient method can achieve global convergence in entropy-regularized reinforcement learning. The authors provide a rigorous framework that uses the Bellman structure to establish convergence properties, overcoming the limitations of traditional convexity-based analyses.
Implications
The findings have significant implications for the development of more robust and efficient reinforcement learning algorithms, particularly in continuous-action settings. The theoretical insights could lead to improved policy optimization techniques that are less sensitive to parameterization and step sizes, enhancing the performance of RL applications in various domains.
The Quantization Benefits of Residual-Free Transformers
Efficient ML
Optimization
Large Language Models
- Residual connections in transformers lead to non-Gaussian activations, increasing quantization error.
- Residual-free transformers maintain near-Gaussian activations, resulting in improved robustness to low-bit quantization.
- Orthogonal initialization and second-order optimization techniques can effectively train residual-free transformers.
- The study reveals an accuracy-compressibility trade-off, suggesting architectural changes can enhance quantization performance.
Summary
This paper addresses the challenges of quantizing large-scale transformers, which are often hindered by heavy-tailed and non-Gaussian activation distributions, particularly due to the presence of residual connections. The authors demonstrate that these connections can exacerbate quantization errors, leading to significant performance degradation at low precision. By comparing residual and residual-free transformers, they show that the latter maintains near-Gaussian activations, which are more robust to quantization. The study employs excess kurtosis analysis to explain how residual mixing amplifies non-Gaussianity, while dense mixing in residual-free architectures helps contract it. The authors propose methods to effectively train residual-free transformers using orthogonal initialization, second-order optimization techniques, and depth-aware scaling of attention temperature. Although there is a slight performance drop in full precision, residual-free transformers exhibit superior robustness under aggressive quantization, highlighting an accuracy-compressibility trade-off in transformer design. This research motivates a shift towards architecture-level strategies for developing quantization-friendly foundation models.
Methodology
The authors conducted controlled comparisons between residual and residual-free transformers, utilizing excess kurtosis analysis to assess activation distributions. They implemented orthogonal initialization and second-order optimization methods to train residual-free models, while evaluating their performance on various language tasks under low-bit quantization conditions.
Results
Residual-free transformers demonstrated significantly lower accuracy degradation and higher signal-to-quantization noise ratios compared to residual models when subjected to low-bit quantization. The results indicated that residual-free architectures retained near-Gaussian activation distributions, which correlated with improved quantization robustness.
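A small, self-contained illustration of why heavy-tailed activations hurt low-bit quantization. It does not model residual connections or transformers; it only compares synthetic Gaussian samples with a synthetic heavy-tailed mixture. The per-tensor absolute-maximum quantizer and the 1% outlier mixture are assumptions chosen for simplicity, not the paper's setup.

```python
import math
import random

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0

def quantize_uniform(xs, bits):
    """Symmetric uniform quantization, scale set by the largest magnitude."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / levels
    return [round(x / scale) * scale for x in xs]

def sqnr_db(xs, bits):
    """Signal-to-quantization-noise ratio in decibels."""
    quantized = quantize_uniform(xs, bits)
    signal = sum(x * x for x in xs)
    noise = sum((x - q) ** 2 for x, q in zip(xs, quantized))
    return 10.0 * math.log10(signal / noise)

rng = random.Random(0)  # fixed seed so the comparison is reproducible
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Mostly unit-variance samples with about 1% large outliers (heavy tails).
heavy_tailed = [
    rng.gauss(0.0, 10.0) if rng.random() < 0.01 else rng.gauss(0.0, 1.0)
    for _ in range(20000)
]

for name, data in [("gaussian", gaussian), ("heavy-tailed", heavy_tailed)]:
    print(name, round(excess_kurtosis(data), 2), round(sqnr_db(data, 4), 2))
```

The outliers stretch the quantization range, so most of the 4-bit grid is spent on rare values and the bulk of the distribution is rounded coarsely. This is the mechanism linking high kurtosis to low SQNR.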
Implications
The findings suggest that redesigning transformer architectures to minimize the impact of residual connections could lead to more efficient training and deployment of large models, particularly in resource-constrained environments. This research could influence future developments in quantization-friendly foundation models and large-scale machine learning systems.
An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits
NLP
Large Language Models
Theory
- Introduces the effective rank of the alignment modification matrix as a continuous measure of activation shifts.
- Demonstrates confound-controlled measurement to isolate contributions to activation shifts in LLMs.
- Identifies the distinction between robust and brittle configurations in model calibration.
- Critiques the limitations of rank-based diagnostics in assessing model safety.
Summary
This paper presents an audit of alignment-induced shifts in the residual-stream activations of three instruction-tuned large language models (LLMs): Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct. The author introduces the effective rank of the alignment modification matrix, denoted as ρ_ϵ, which quantifies the alignment-induced changes in activations on safety-relevant inputs. The paper's contributions are threefold: (1) It provides a confound-controlled measurement approach that decomposes the modification matrix into four variants, allowing for the isolation of contributions from chat-template formatting, alignment-stage shifts, and refusal-mediating directions. This analysis reveals that the refusal direction identified by Arditi et al. is not universally applicable across all models. (2) It discusses constructive calibration, highlighting the distinction between robust and brittle configurations in a multi-layer perceptron (MLP). The findings suggest that high ρ_ϵ values achieved through mild rank-maximization regularization enhance robustness, while excessive regularization can lead to fragility. (3) It critiques the limitations of rank-based diagnostics, noting that low effective rank is not exclusively indicative of safety and that the spectral ordering of singular values does not align with causal effects. The paper concludes with open questions regarding the structural implications of low effective rank and its relationship to model safety.
Methodology
The study employs a four-variant decomposition of the alignment modification matrix to control for confounding factors in measuring activation shifts. It utilizes singular value decomposition (SVD) to analyze the effective rank of the modification matrix across different models and configurations. The paper also conducts experiments on a multi-layer perceptron (MLP) to evaluate the robustness of calibration strategies based on effective rank.
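A sketch of one common way to turn singular values into a continuous rank measure. The summary does not define the paper's effective-rank quantity, so the entropy-based definition below is used as a stand-in and may differ from the author's. The singular values are assumed to come from an SVD computed elsewhere, and the example spectra are hypothetical.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    singular values normalized to sum to 1."""
    total = sum(singular_values)
    if total <= 0.0:
        return 0.0
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0.0:  # 0 * log(0) is taken as 0
            entropy -= p * math.log(p)
    return math.exp(entropy)

# Hypothetical spectra: flat, one dominant direction, and exactly rank one.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
dominated = effective_rank([10.0, 0.1, 0.1, 0.1])
rank_one = effective_rank([5.0, 0.0, 0.0, 0.0])
print(flat, dominated, rank_one)
```

A spectrum dominated by one singular value gives an effective rank close to 1, matching the summary's description of a single dominant mean direction in the activation shifts.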
Results
The effective rank values for the models were found to be low (ρ_ϵ values of 0.0029, 0.0048, and 0.0044 for Llama, Gemma, and Qwen, respectively), indicating a dominance of a single mean direction in the activation shifts. The study found that mild rank-maximization regularization improved robustness, while excessive regularization led to significant fragility. Additionally, the paper documented that low effective rank is a generic property of post-training linear concepts and does not necessarily correlate with safety.
Implications
The findings suggest that understanding the geometry of activation shifts in LLMs is crucial for improving model safety and robustness. The insights into calibration strategies could inform the development of more resilient LLMs, while the critiques of rank-based diagnostics may lead to more nuanced approaches in evaluating model performance and safety.
A perspective on fluid mechanical environments for challenges in reinforcement learning
Reinforcement Learning
- Fluid mechanics problems serve as a valuable testbed for developing RL agents.
- Agents can leverage preserved representations in fluid dynamics to learn efficiently.
- The paper outlines specific problem descriptions for RL agents in fluid environments.
- Open-source simulators like Dedalus are highlighted for RL method development.
Summary
This paper explores the challenges of developing reinforcement learning (RL) agents that can efficiently interact with high-dimensional, evolving environments, particularly in the context of fluid mechanics. The authors argue that canonical fluid mechanics problems, such as nonlinear instabilities, provide a compelling testbed for RL methods. These problems, which include phenomena like droplet breakup and rogue waves, highlight the need for agents to leverage preserved representations across changing dynamics. The paper presents two specific problem descriptions involving agents in fluid mechanical environments, detailing their state and action spaces, as well as reward functions. The authors emphasize the importance of nonstationary aspects of the environment and the preserved invariances that agents can exploit. They also introduce open-source simulators like Dedalus and JAX-CFD for developing RL methods. The paper demonstrates the use of Dedalus to create RL agents capable of navigating a stationary environment, laying the groundwork for future research on RL agents that can interact meaningfully with complex fluid dynamics.
Methodology
The authors present two problem descriptions involving RL agents interacting with fluid mechanical environments, specifying state and action spaces, reward functions, and the nonstationary aspects of the environment. They utilize the Dedalus simulator to demonstrate RL agent navigation in a stationary environment, showcasing the potential for future RL applications in fluid dynamics.
Results
The paper successfully illustrates how RL agents can be structured to interact with fluid mechanical environments, emphasizing the importance of leveraging preserved representations across dynamic changes. The use of Dedalus for environment generation is demonstrated, setting a foundation for further exploration of RL in complex fluid dynamics.
Implications
This research has significant implications for advancing RL methods in real-world applications, particularly in industries dealing with fluid dynamics, such as aerospace, energy production, and environmental science. It opens avenues for developing agents that can adapt to and control complex, evolving systems.
Aligning Molecular Graph Explanations with Chemical Identity via InChIfied Invariants
Graph Learning
Interpretability
- Introduction of InChIfied Invariants for molecular graph featurization.
- Achieves 99.62% consistency in representations for chemically equivalent graphs.
- Maintains predictive performance on MoleculeNet tasks while improving explanation consistency.
- Quantitative analysis shows significant improvement in attribution consistency.
Summary
This paper addresses the challenge of obtaining consistent explanations for machine learning models applied to molecular graphs, emphasizing the need for predictions and attributions to align with chemical identity. The authors introduce a new class of features called InChIfied Invariants, which are designed to be invariant under transformations that preserve chemical identity, based on the International Chemical Identifier (InChI). By analyzing one million molecular graphs from PubChem Substances, the authors demonstrate that InChIfied Invariants yield identical representations for chemically equivalent graphs in 99.62% of cases, a significant improvement over standard Daylight invariants, which achieve this in only 0.35% of cases. The paper further evaluates the predictive performance of InChIfied Invariants across MoleculeNet tasks, showing that they maintain predictive accuracy while enhancing consistency in predictions across different graph depictions of the same molecules. A quantitative attribution analysis reveals that standard featurization methods produce varying attributions for chemically equivalent graphs, whereas InChIfied Invariants ensure consistent attributions by design. The authors also provide open-source software implementing these invariants, allowing for easy integration into existing molecular machine learning workflows.
Methodology
The authors developed InChIfied Invariants as node, edge, and graph features that leverage the InChI to ensure invariance under transformations preserving chemical identity. They validated these features using a large dataset from PubChem and conducted predictive performance evaluations on MoleculeNet benchmark tasks, alongside a quantitative analysis of attribution consistency.
Results
InChIfied Invariants produced identical representations for chemically equivalent graphs in 99.62% of cases, significantly outperforming standard Daylight invariants. The new features preserved predictive performance across various tasks while ensuring consistent attributions, contrasting with the substantial variability seen with standard molecular featurization methods.
Implications
The introduction of InChIfied Invariants could enhance the reliability and interpretability of machine learning models in chemistry, fostering greater trust among chemists in model predictions. The open-source nature of the software allows for widespread adoption and further research in molecular machine learning.
Graph-based Complexity Forecasts in UK En Route Airspace Using Relevant Aircraft Interactions
Graph Learning
Time Series
Optimization
- Introduces a probabilistic approach to forecast airspace complexity using relevant aircraft interactions.
- Implements a refined relevant aircraft filter algorithm tailored for the London Middle Sector.
- Achieves improved prediction accuracy with an F1-score of 0.84, outperforming traditional models.
- Forecasts ATCO workload up to 45 minutes in advance with significant correlation to actual interactions.
Summary
This paper addresses the challenge of managing Air Traffic Control Officer (ATCO) workload by forecasting airspace complexity in the London Middle Sector (LMS) using a graph-based probabilistic approach. Traditional models often fail to accurately predict upcoming traffic complexity due to their reliance on heuristic measures that do not account for the nuances of air traffic interactions. The authors propose a novel method that uses the number of relevant aircraft pairs—those requiring monitoring or deconfliction—as a proxy for ATCO workload. By adapting an existing relevant aircraft filter algorithm and refining it through feedback from licensed ATCOs, the authors improved the algorithm's performance, achieving an F1-score of 0.84 compared to 0.69 for the original model. The forecasting method incorporates a resampled graph representation of the route network, allowing for standardized spatial fidelity and accounting for uncertainties in aircraft arrival times. Predictions can be made up to 45 minutes in advance, showing a stronger correlation with actual relevant interactions than standard traffic volume predictions. This data-driven tool aims to assist group supervisors in making informed decisions regarding sector configuration and ATCO rostering.
Methodology
The study adapts a relevant aircraft filter algorithm for the London Middle Sector, incorporating feedback from ATCOs. It constructs a graph-based representation of the airspace and uses historical traffic data alongside live operational data to predict the number of relevant aircraft pairs, accounting for uncertainties in arrival times.
Results
The updated algorithm demonstrated an F1-score of 0.84 on a set of 50 traffic scenarios, significantly improving upon the original model's score of 0.69. The forecasting method showed a correlation of Spearman’s ρ = 0.68 with actual relevant interactions, compared to ρ = 0.55 for standard traffic volume predictions.
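For reference, the two metrics quoted above can be computed as in the sketch below. The formulas are the standard ones for F1 and Spearman's rank correlation; the counts and series in the example are made-up values, not data from the paper.

```python
import math

def f1_score(true_pos, false_pos, false_neg):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up example: 8 relevant pairs found, 2 false alarms, 2 missed.
example_f1 = f1_score(true_pos=8, false_pos=2, false_neg=2)
# Made-up forecast and observed counts of relevant pairs per time window.
forecast = [3, 5, 2, 8, 6]
observed = [2, 6, 1, 9, 4]
example_rho = spearman_rho(forecast, observed)
print(example_f1, example_rho)
```

Spearman's rho only compares orderings, so it rewards a forecast that ranks busy and quiet periods correctly even when the absolute counts are off.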
Implications
The proposed forecasting method has the potential to enhance air traffic management by providing more accurate predictions of ATCO workload, thereby improving operational safety and efficiency. It can inform decisions related to sector configuration and ATCO rostering, ultimately contributing to better air traffic control practices.
Hidden-State Privacy Has an Empty Middle
NLP
Large Language Models
Theory
- No Gaussian release mechanisms tested achieved both moderate utility and privacy against adaptive attackers.
- A diagonal inverse-Fisher mechanism is identified as the unique minimax-optimal solution in the Gaussian class.
- A new split-memory transformer architecture outperforms existing models in both privacy and utility metrics.
- The study highlights the need for redesigning models to achieve better privacy outcomes.
Summary
This paper investigates the limitations of Gaussian release mechanisms in achieving both moderate utility and privacy in single-layer hidden-state privacy against adaptive retrieval attackers. The author tests 1,536 Gaussian release covariances and finds that none meet the criteria for both privacy and utility. A complementary Fisher-ball lower bound is proven, indicating that every full-rank Gaussian release at O(1) Fisher utility has a direction where the Mahalanobis signal increases linearly with hidden width, thus ruling out uniform safety in this class. The paper introduces a unique diagonal inverse-Fisher mechanism that is minimax-optimal and demonstrates superior performance against adaptive attackers. Additionally, a novel split-memory transformer architecture is proposed, which achieves significant improvements in privacy and utility compared to traditional Gaussian mechanisms. The findings suggest a shift in focus from mechanism design to co-design of architecture and release strategies for hidden-state privacy.
Methodology
The paper employs a gradient-covariance decomposition of hidden states to analyze the utility-privacy tradeoff. It tests various Gaussian release mechanisms and introduces a diagonal inverse-Fisher mechanism. The performance of a newly designed split-memory transformer is evaluated against existing models across multiple configurations.
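As a rough illustration of what a diagonal inverse-Fisher Gaussian release could look like, the sketch below adds zero-mean Gaussian noise whose per-coordinate variance is inversely proportional to an estimated diagonal Fisher value. The function names, the `budget` parameter, and the squared-gradient Fisher estimate are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def diagonal_fisher(grads):
    """Assumed estimator: mean squared gradient per hidden coordinate."""
    return np.mean(np.asarray(grads, dtype=float) ** 2, axis=0)

def inverse_fisher_release(hidden, fisher_diag, budget=1.0, eps=1e-8, rng=None):
    """Return hidden + noise with Var[noise_i] = budget / (d * fisher_i).

    With this scaling, the Fisher-weighted distortion
    sum_i fisher_i * Var[noise_i] is approximately `budget`,
    independent of the hidden width d.
    """
    rng = np.random.default_rng() if rng is None else rng
    hidden = np.asarray(hidden, dtype=float)
    d = hidden.shape[-1]
    var = budget / (d * (fisher_diag + eps))
    noise = rng.normal(size=hidden.shape) * np.sqrt(var)
    return hidden + noise, var
```

Coordinates that the downstream loss is insensitive to (small Fisher value) receive more noise, while sensitive coordinates receive less, which is the intuition behind shaping the covariance by the inverse Fisher.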
Results
The study found that none of the 1,536 Gaussian release covariances tested achieved both moderate utility and privacy. The diagonal inverse-Fisher mechanism was the only one that maintained a worst-attacker top-1 ≤ 0.001 across a 32-model-layer grid. The split-memory transformer demonstrated a GMah score between 20 and 33, significantly outperforming GPT baselines in privacy and utility metrics.
Implications
The findings suggest that traditional Gaussian mechanisms are insufficient for ensuring hidden-state privacy in language models. The proposed split-memory architecture could lead to more secure and efficient designs in machine learning applications, particularly in natural language processing tasks where privacy is critical.
The Normalized Maximum Likelihood for Regular Non-Smooth Models: Measure-Theoretic Foundations and Geometric Sampling
Theory
Optimization
Efficient ML
- Establishes a well-posed NML stochastic complexity for path-differentiable Lipschitz estimators.
- Introduces the Propose-and-Project Metropolis-Hastings (PDL-PPMH) sampler for non-differentiable models.
- Demonstrates the method's robustness in high-dimensional Lasso regression problems.
- Provides a data-efficient alternative to cross-validation using the exact NML criterion.
Summary
This paper addresses the computation of the Normalized Maximum Likelihood (NML) codelength for regular non-smooth models, which are prevalent in modern machine learning applications such as Lasso and Sparse SVMs. The authors establish a rigorous framework for calculating the NML by applying classical geometric measure theory and bridging the coarea formula with conservative Jacobians. They demonstrate that the stochastic complexity for non-smooth models is well-defined and consistent with outputs from Automatic Differentiation. To compute the NML effectively, they introduce the Propose-and-Project Metropolis-Hastings (PDL-PPMH) sampler, a geometric MCMC algorithm designed to navigate the non-differentiable level sets of the maximum likelihood estimator. The paper validates this method by sampling from a high-dimensional Lasso posterior and highlights the computational trade-offs involved. The authors also show that their exact NML criterion offers a data-efficient alternative to cross-validation, achieving comparable predictive performance without the need for data splitting. Overall, this work lays the groundwork for the theoretical analysis of NML codelengths in regular non-smooth models, addressing a significant gap in the existing literature.
Methodology
The authors utilize classical geometric measure theory to derive the NML codelength for regular non-smooth models. They introduce the PDL-PPMH sampler, which employs a stochastic tangent space proposal and a convergent non-smooth projection solver to navigate the level sets of the maximum likelihood estimator. The methodology is validated through empirical sampling from a high-dimensional Lasso posterior.
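For orientation, the standard textbook definition of the NML distribution and its codelength is reproduced below. The symbols (data x, estimator \hat{\theta}, density p) are generic notation chosen for this note and may differ from the paper's own.

```latex
\hat{\theta}(x) = \arg\max_{\theta}\, p(x \mid \theta), \qquad
p_{\mathrm{NML}}(x) = \frac{p\bigl(x \mid \hat{\theta}(x)\bigr)}
                           {\int p\bigl(y \mid \hat{\theta}(y)\bigr)\, dy},
\qquad
L_{\mathrm{NML}}(x) = -\log p\bigl(x \mid \hat{\theta}(x)\bigr)
                      + \log \int p\bigl(y \mid \hat{\theta}(y)\bigr)\, dy .
```

The second term of the codelength, the log-normalizer (often called the parametric complexity), integrates over all possible datasets y, which is why computing it is the costly step.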
Results
The PDL-PPMH sampler successfully computes the NML for a high-dimensional Lasso regression problem (P = 2000), demonstrating geometric ergodicity and maintaining performance despite computational challenges. The exact NML criterion achieved predictive performance statistically indistinguishable from traditional cross-validation, without requiring data splitting.
Implications
This work has significant implications for model selection in machine learning, particularly for non-smooth models. The established framework and computational methods can enhance the efficiency and accuracy of model evaluation, potentially influencing practices in regression analysis and other areas where non-smooth estimators are used.
Synheart Capacity: A Theory-Driven Physiological Representation of Cognitive Capacity Dynamics from Wearable Signals
Multimodal
Theory
Time Series
- Introduces a theory-driven framework for modeling cognitive capacity dynamics using wearable physiological signals.
- Demonstrates significant cross-individual generalization in estimating stress and effort states.
- Enables differentiation between productive engagement and cognitive overload.
- Highlights the potential for real-time monitoring of cognitive states to enhance adaptive system interactions.
Summary
This paper addresses the challenge of continuously estimating cognitive capacity dynamics using physiological signals from wearables. The authors propose a multimodal learning framework that models cognitive states as a two-dimensional representation defined by voluntary resource allocation (mental effort) and overload-related strain (stress). The architecture integrates cardiac (IBI/HRV) and electrodermal (EDA) signals through dual-stream encoding, late fusion, and task-specific output heads to independently estimate probabilistic effort and stress states. Evaluations on the SWELL-KW dataset demonstrate cross-individual generalization with balanced accuracies of 70.0% for stress and 72.2% for effort. The framework allows for a structured differentiation between productive engagement and overload-related strain, revealing significant demand-sensitive shifts in predicted state trajectories under controlled workload manipulations. The findings suggest that multidimensional state representations grounded in physiological data can enable adaptive systems for continuous capacity-aware monitoring and human-centered interaction.
Methodology
The authors developed a multimodal learning framework that employs dual-stream encoding of cardiac and electrodermal signals, utilizing late fusion and task-specific output heads to estimate cognitive states of effort and stress. The evaluation was conducted on the SWELL-KW dataset using strict leave-one-subject-out cross-validation.
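The evaluation protocol can be sketched in a few lines: leave-one-subject-out splitting keeps every sample from the held-out person out of training, and balanced accuracy averages per-class recall so that a majority class cannot dominate the score. The helper names below are illustrative and are not part of the paper's code.

```python
from collections import defaultdict

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    for sid, test_idx in by_subject.items():
        train_idx = [i for i, s in enumerate(subject_ids) if s != sid]
        yield sid, train_idx, test_idx

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

A classifier that always predicts the majority class scores 0.5 on a two-class problem under this metric, which is the reference point for reading the 70.0% and 72.2% figures.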
Results
The proposed framework achieved a balanced accuracy of 70.0% for stress estimation and 72.2% for effort estimation, indicating effective cross-individual generalization. The integration of multimodal data and theory-guided supervision led to significant improvements in performance compared to traditional methods.
Implications
The findings suggest that continuous, physiological-based monitoring of cognitive capacity can inform adaptive systems, potentially improving user experience in various applications such as tutoring platforms, decision-support tools, and human-automation interfaces. This approach could lead to more personalized interactions that preemptively address cognitive overload.
Learning Permutation from Structure Without Supervision
Optimization
Computer Vision
Theory
- Introduces a unified framework for unsupervised permutation learning using task-specific structural losses.
- Develops an entropy-adaptive Gumbel-Sinkhorn method that modulates temperature based on assignment uncertainty.
- Proposes a new interpretation of adaptive temperature control in terms of optimal transport over the Birkhoff polytope.
- Demonstrates improved performance in permutation quality and training stability across various tasks.
Summary
This paper addresses the challenge of learning permutations from unordered data without supervision, focusing on tasks like sorting and jigsaw reconstruction where the correct ordering reveals underlying structure. The authors propose a novel entropy-adaptive formulation of the Gumbel-Sinkhorn method, which replaces the traditional single global temperature with a locally varying temperature field. This adaptation allows for confident assignments to discretize early while maintaining exploration in areas of uncertainty. The proposed method enhances training stability and improves the quality of learned permutations, particularly in larger and more ambiguous problems. The authors demonstrate the effectiveness of their approach across various tasks, including sorting, jigsaw reconstruction, and the unsupervised Traveling Salesman Problem (TSP), showing that their method outperforms fixed-temperature baselines in terms of convergence reliability and permutation quality.
Methodology
The authors utilize an entropy-adaptive version of the Gumbel-Sinkhorn algorithm, which allows for a locally varying inverse temperature based on the uncertainty of assignments. This method is integrated into a framework that treats permutations as latent operators, optimizing directly from structural properties of reordered data.
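A minimal NumPy sketch of the idea follows. The Gumbel-Sinkhorn part (Gumbel noise plus alternating row and column normalization in log space) is the standard construction; the specific adaptive rule, a per-row temperature interpolated between `tau_min` and `tau_max` according to the row's assignment entropy, is an assumption made for illustration and may differ from the paper's formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def log_sinkhorn(log_alpha, n_iters=50):
    """Alternate row/column normalization in log space."""
    for _ in range(n_iters):
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=1)
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=0)
    return np.exp(log_alpha)

def entropy_adaptive_gumbel_sinkhorn(scores, tau_min=0.1, tau_max=1.0,
                                     noise_scale=1.0, n_iters=50, rng=None):
    """Soft permutation with a per-row temperature (hypothetical rule)."""
    rng = np.random.default_rng() if rng is None else rng
    n = scores.shape[0]
    noisy = scores + noise_scale * rng.gumbel(size=scores.shape)
    # First pass at the warm temperature to measure assignment uncertainty.
    soft = log_sinkhorn(noisy / tau_max, n_iters)
    row_entropy = -(soft * np.log(soft + 1e-12)).sum(axis=1)
    frac = np.clip(row_entropy / np.log(n), 0.0, 1.0)
    # Confident rows (low entropy) get a cold temperature and sharpen first.
    tau = tau_min + (tau_max - tau_min) * frac
    return log_sinkhorn(noisy / tau[:, None], n_iters), tau
```

Rows whose assignment is already nearly one-hot are pushed toward a hard permutation, while ambiguous rows keep a higher temperature and continue to explore.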
Results
The entropy-adaptive method significantly improves the reliability of convergence and the quality of permutations compared to fixed or globally annealed temperature approaches. This improvement is particularly notable as the size of the problem and the ambiguity of assignments increase.
Implications
The proposed method has potential applications in various fields requiring permutation learning, such as computer vision (for tasks like image reconstruction and sorting), optimization problems, and any domain where the structure of unordered data needs to be uncovered without ground-truth labels.
Simultaneous Spatial-Temporal Message Passing for Dynamic Graph Representation Learning
Graph Learning
- Introduces SiST-GNN, a new paradigm for dynamic graph representation learning that integrates spatial and temporal information simultaneously.
- Addresses the architectural bottleneck in existing DGNNs that limits joint reasoning over topology and evolution.
- Achieves state-of-the-art performance in link prediction tasks, outperforming prior methods by significant margins.
- Demonstrates effectiveness in dynamic node classification tasks, matching or exceeding the performance of continuous-time models.
Summary
The paper introduces SiST-GNN (Simultaneous Spatial-Temporal Graph Neural Network), a novel approach for dynamic graph representation learning that addresses the limitations of existing dynamic graph neural networks (DGNNs). Traditional DGNNs typically follow a temporal-first or spatial-first paradigm, which imposes a strict order on how temporal and spatial information is processed. This rigid sequencing prevents effective joint reasoning over the topology and evolution of the graph. SiST-GNN overcomes this limitation by fusing spatial and temporal signals within a single message-passing operation. It maintains a recurrent hidden state for each node that summarizes its historical information and combines it with the node's current feature vector. This creates a temporally augmented graph that allows for richer message passing, enabling nodes to consider both current features and historical trajectories simultaneously. The empirical evaluation of SiST-GNN across nine public benchmarks demonstrates its superiority in link prediction tasks, achieving state-of-the-art performance improvements over existing methods. Additionally, it shows competitive results in dynamic node classification tasks, indicating its versatility and effectiveness in handling dynamic graph data.
Methodology
SiST-GNN employs a simultaneous message-passing mechanism that combines recurrent hidden states summarizing historical node information with current feature vectors. This is achieved by constructing a temporally augmented graph for each snapshot, where nodes are connected by both spatial and cross-time edges. The model runs a standard graph convolution on this enriched graph, allowing for a more nuanced aggregation of information from both temporal and spatial perspectives.
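One way to picture the temporally augmented graph is sketched below. The layout is an assumption for illustration only: each of the N nodes gets a companion "memory" node holding its previous hidden state, spatial edges come from the current snapshot, and a cross-time edge links every node to its own memory node. A single mean-aggregation graph convolution then mixes both kinds of information in one pass. None of the function names come from the paper.

```python
import numpy as np

def augmented_adjacency(adj):
    """2N x 2N adjacency: current nodes 0..N-1, memory nodes N..2N-1."""
    n = adj.shape[0]
    eye = np.eye(n)
    top = np.hstack([adj + eye, eye])   # spatial edges + self loops | cross-time edges
    bottom = np.hstack([eye, eye])      # cross-time edges | memory self loops
    return np.vstack([top, bottom])

def graph_conv(adj, feats, weight):
    """Mean-aggregation graph convolution followed by tanh."""
    deg = adj.sum(axis=1, keepdims=True)
    return np.tanh(((adj @ feats) / deg) @ weight)

def snapshot_step(adj, x_t, h_prev, weight):
    """One snapshot: message passing over features and hidden states together.

    x_t and h_prev must share the same width in this simplified sketch.
    """
    n = adj.shape[0]
    feats = np.vstack([x_t, h_prev])
    out = graph_conv(augmented_adjacency(adj), feats, weight)
    return out[:n]  # updated per-node hidden state
```

Feeding the returned state back in as `h_prev` for the next snapshot gives the recurrent behavior described above, without a separate temporal module that runs before or after the spatial one.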
Results
SiST-GNN sets new benchmarks in link prediction tasks, improving the mean reciprocal rank (MRR) by 109–277% in fixed-split settings and by 68–194% in live-update settings compared to the strongest prior methods. In dynamic node classification tasks, it outperforms leading discrete-time baselines by 7–22% and matches the performance of continuous-time models.
Implications
The proposed SiST-GNN framework has significant implications for applications involving dynamic graphs, such as social network analysis, financial transaction monitoring, and anomaly detection in evolving systems. Its ability to effectively integrate temporal and spatial information can enhance predictive modeling and decision-making in various domains.
Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m
Interpretability
Large Language Models
Theory
- Introduces the concept of 'polymorphism' in transformer models, highlighting the impact of random rotations on internal coordinates.
- Demonstrates that a single orthogonal Procrustes rotation can align features between independently trained models without retraining.
- Challenges the interpretation of high decoder-column cosine similarity as evidence of universality, revealing encoder failures.
- Validates findings across different model sizes, confirming the robustness of the rotation phenomenon.
Summary
This paper introduces the concept of 'polymorphism' in independently trained transformers, where models compute the same function but have different internal coordinate representations due to a uniform random rotation in their residual-stream bases. The author demonstrates that a single matrix multiplication, specifically an orthogonal Procrustes fit, can effectively transfer feature dictionaries and steering vectors between these models without retraining. The study reveals that while decoder-column cosine similarity appears high (98%), it masks significant encoder failures when applying sparse-autoencoder (SAE) universality metrics. The paper validates the findings on both a small two-layer transformer and nine independently trained Pythia-70m models, establishing that the rotation phenomenon is not an artifact of smaller models but holds at larger scales. The research employs a rigorous methodological framework, including operational bars and analytical lenses, to substantiate its claims and offers insights into the implications for model interpretability and universality in machine learning.
Methodology
The study employs a four-bar operational framework to evaluate model behavior and performance, alongside a five-lens analytical stack for comprehensive analysis. The Procrustes formula is used to fit an orthogonal rotation between model activations, allowing for the transfer of features across models.
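The orthogonal Procrustes fit itself is a few lines of linear algebra: given activation matrices `a` and `b` (rows are the same inputs run through two models), the orthogonal matrix minimizing the Frobenius error is obtained from the SVD of `a.T @ b`. The sketch below uses plain NumPy; the variable names and the explained-variance helper are illustrative choices, not the paper's code.

```python
import numpy as np

def orthogonal_procrustes(a, b):
    """Return the orthogonal R minimizing ||a @ R - b||_F."""
    u, _, vt = np.linalg.svd(a.T @ b)
    return u @ vt

def explained_variance(a, b, rotation):
    """Fraction of variance in b recovered by rotating a."""
    residual = np.sum((a @ rotation - b) ** 2)
    total = np.sum((b - b.mean(axis=0)) ** 2)
    return 1.0 - residual / total
```

Once `rotation` is fitted, a feature direction or steering vector `v` from the first model can be carried over as `v @ rotation`, which is the single matrix multiplication the summary refers to.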
Results
Applying the Procrustes rotation significantly improved the reconstruction of activations across different seeds, achieving an explained variance close to the within-seed ceiling. Statistical tests on the fitted rotation were consistent with a Haar-distributed matrix, that is, a uniformly random orthogonal sample.
Implications
The findings suggest that current metrics for assessing model universality may be misleading, emphasizing the need for more nuanced approaches to interpretability in machine learning. This research could influence future studies on model alignment, feature transferability, and the understanding of transformer architectures.
FastKernels: Benchmarking GPU Kernel Generation in Production
Large Language Models
Optimization
Efficient ML
- FASTKERNELS provides a benchmark-as-framework approach that integrates evaluation and deployment.
- It features a compositional task hierarchy that allows for the reuse of optimizations across different levels.
- The framework evaluates kernels with production baselines, capturing real-world tensor inputs and multi-GPU communication patterns.
- FASTKERNELS covers a wide range of architectures, ensuring broad applicability across various domains.
Summary
The paper introduces FASTKERNELS, a novel benchmarking framework designed to address the misalignment between existing GPU kernel benchmarks and real-world production environments. Current benchmarks often evaluate kernels in isolation, using synthetic inputs and simplified interfaces, which leads to misleading performance signals. FASTKERNELS aims to provide a more accurate evaluation by incorporating a minimal set of 46 representative architectures across 8 categories, ensuring that the kernels generated are compatible with production inference frameworks. The framework supports continuous batching, multimodal inputs, and a production-grade serving API, allowing for direct deployment of optimized kernels. The authors evaluate state-of-the-art kernel agents using FASTKERNELS and find that even the best-performing agent achieves only a 0.94× speedup over production baselines (a speedup below 1× means the generated kernels run slower than the baseline), highlighting the critical impact of benchmark-production misalignment. The paper emphasizes the importance of aligning benchmarks with real-world applications to improve kernel generation and performance in production settings.
Methodology
The authors developed FASTKERNELS as a self-contained inference framework that mirrors production environments. It organizes tasks hierarchically from primitive operations to full models, allowing for dynamic optimization. The framework captures real tensor inputs and evaluates kernels in the context of multi-GPU communication and production workloads, ensuring a more faithful assessment of performance.
Results
Evaluation of state-of-the-art kernel agents on FASTKERNELS revealed that the best agent achieved only a 0.94× speedup over existing production baselines, with weaker agents performing even worse at 0.78× and 0.53×. Because every figure is below 1×, none of the evaluated agents matched production performance. This highlights the significant performance limitations imposed by current benchmark methodologies.
Implications
FASTKERNELS has the potential to improve the development and deployment of GPU kernels in production environments by providing a more accurate benchmarking framework. This could lead to better optimization strategies and enhanced performance for various applications, particularly in large-scale machine learning and AI systems.
Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis
Graph Learning
- Introduction of Graph-in-Graph (GiG) framework for clinical data analysis.
- GiG allows integration of patient-specific gene expression data with biological knowledge graphs.
- Significant performance improvements in clinical tasks, particularly in limited-sample settings.
- Prostate cancer diagnosis task shows up to 49 percentage points improvement in macro-F1 score.
Summary
This paper presents a novel framework called Graph-in-Graph (GiG) that integrates knowledge graphs into deep learning models for analyzing limited-sample clinical data. Traditional biomedical AI models often compress biological knowledge into low-dimensional representations, leading to information loss and reduced performance, especially in clinical settings with limited patient samples. GiG addresses this by representing each patient as a modular graph, where curated biological knowledge graphs are incorporated as edges and patient-specific gene expression profiles serve as node features. This design allows for the integration of multiple biological knowledge graphs while enabling the model to learn patient-specific disease representations that maintain biologically relevant information. The authors demonstrate the effectiveness of GiG across various clinical tasks involving nearly 9,700 patients, showing significant performance improvements over traditional methods, particularly in limited-sample scenarios. Notably, GiG achieves a macro-F1 performance increase of up to 49 percentage points in prostate cancer diagnosis compared to state-of-the-art methods. Control experiments confirm that the performance gains stem from the biologically grounded structure of the knowledge graphs rather than graph modeling alone. The findings suggest that knowledge graph-modulated deep learning enhances robustness, interpretability, and sample efficiency in clinical data analysis, providing a principled approach for integrating biological knowledge into predictive modeling.
Methodology
The GiG framework constructs patient-specific molecular interaction graphs by integrating transcriptomic profiles with curated biological pathway knowledge. Each patient's data is represented as a modular graph, where nodes represent genes and edges encode biological relationships. The framework employs graph neural networks (GNNs) to learn from these graphs, enabling effective predictive modeling across clinical tasks.
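The construction described above can be illustrated with a small data-structure sketch: edges are shared across patients and come from the knowledge graph, while node features are specific to each patient. The class and function names are hypothetical, the gene identifiers in the usage note are placeholders, and the sketch handles a single knowledge graph rather than the multi-graph modular design of the paper.

```python
from dataclasses import dataclass

@dataclass
class PatientGraph:
    genes: list      # node order
    features: list   # one expression value per gene (node feature)
    edges: list      # (i, j) index pairs taken from the knowledge graph

def build_patient_graph(expression, knowledge_edges):
    """expression: {gene: value}; knowledge_edges: iterable of (gene_a, gene_b)."""
    genes = sorted(expression)
    index = {g: i for i, g in enumerate(genes)}
    # Keep only relationships whose two genes were both measured for this patient.
    edges = [(index[a], index[b]) for a, b in knowledge_edges
             if a in index and b in index]
    return PatientGraph(genes, [expression[g] for g in genes], edges)
```

For example, `build_patient_graph({"G1": 1.2, "G2": 0.5}, [("G1", "G2")])` yields a two-node graph with one edge; a graph neural network would then consume one such graph per patient.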
Results
GiG outperformed traditional methods across multiple clinical tasks involving nearly 9,700 patients. In the prostate cancer diagnosis task, GiG achieved a macro-F1 performance improvement of up to 49 percentage points compared to state-of-the-art methods. Control experiments confirmed that the performance enhancements were due to the biologically informed structure of the knowledge graphs.
Implications
The findings indicate that integrating knowledge graphs into deep learning models can significantly enhance the analysis of clinical data, particularly in scenarios with limited samples. This approach could lead to more robust and interpretable predictive models in biomedical research and clinical practice.
Active Learning for Stochastic Contextual Linear Bandits
Reinforcement Learning
Theory
Efficient ML
- Introduces an active learning framework for stochastic contextual linear bandits.
- Demonstrates that active context sampling can improve sample efficiency over passive methods.
- Provides theoretical guarantees showing performance improvements by a factor of √d.
- Empirical results validate the effectiveness of the proposed algorithm in real-world applications.
Summary
This paper addresses the challenge of efficiently learning a near-optimal policy in stochastic contextual linear bandits (SCLBs) by introducing an active learning approach that strategically samples both contexts and actions. Traditional methods typically rely on passive context sampling, which limits their efficiency in real-world applications where prior knowledge of context distributions exists. The authors propose a novel algorithm that leverages this knowledge to actively select context-action pairs, thereby improving sample efficiency. They provide theoretical guarantees that demonstrate the potential of their approach to outperform existing methods by a factor of √d, where d is the linear dimension. Empirical evaluations in practical scenarios, such as warfarin dose prediction and joke recommendation, show that their algorithm significantly reduces the number of samples required to achieve a near-optimal policy. This work highlights the importance of active context sampling in enhancing the performance of contextual bandit algorithms.
Methodology
The authors develop an exploration algorithm that actively samples both contexts and actions based on prior knowledge of the context distribution. They analyze the theoretical performance of this approach, providing instance-dependent guarantees that demonstrate its superiority over traditional passive sampling methods. The algorithm is evaluated empirically in various practical scenarios to assess its effectiveness in reducing sample complexity.
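To make the contrast with passive sampling concrete, the sketch below shows a generic uncertainty-driven design for a linear model: at each round it queries the context-action feature vector with the largest value of x^T V^{-1} x, where V is the regularized design matrix of everything queried so far. This is a standard greedy heuristic used here for illustration and is not the algorithm proposed in the paper; all names are hypothetical.

```python
import numpy as np

def pick_most_uncertain(candidates, v_inv):
    """Index of the candidate with the largest x^T V^{-1} x, plus all scores."""
    scores = np.einsum("id,dk,ik->i", candidates, v_inv, candidates)
    return int(np.argmax(scores)), scores

def active_design(candidates, n_rounds, reg=1.0):
    """Greedily choose which context-action feature vectors to query."""
    d = candidates.shape[1]
    v = reg * np.eye(d)
    chosen = []
    for _ in range(n_rounds):
        idx, _ = pick_most_uncertain(candidates, np.linalg.inv(v))
        x = candidates[idx]
        v += np.outer(x, x)  # querying x shrinks uncertainty along x
        chosen.append(idx)
    return chosen, v
```

A passive learner would instead receive contexts drawn from the environment's distribution and could only choose the action, so directions that rarely occur in the context distribution would stay poorly estimated for longer.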
Results
The proposed algorithm shows a significant reduction in the number of samples needed to learn a near-optimal policy compared to traditional methods. The theoretical analysis confirms that the active context sampling strategy can improve performance by up to a factor of √d, and empirical results in tasks such as warfarin dose prediction and joke recommendation support these findings.
Implications
This research has important implications for fields such as healthcare, marketing, and recommendation systems, where leveraging prior knowledge about context distributions can lead to more efficient and effective decision-making processes. The findings suggest that incorporating active learning strategies into contextual bandit frameworks can enhance the performance of algorithms in real-world applications.
Feature Learning in Wide Neural Networks under $μ$P: Identifiability and Sparse-Dictionary Decomposition of the Mean-Field Limit
Theory
- Establishes global existence and uniqueness of mean-field limit under µP.
- Characterizes identifiability of network functions based on active components.
- Describes sparse-dictionary decomposition of long-time limit measures.
- Derives total feature-learning-error decomposition into various components.
Summary
This paper presents four significant structural results regarding feature learning in wide two-layer neural networks using the Maximal Update Parametrization (µP). The first result establishes the global existence and uniqueness of the mean-field limit of noisy gradient descent under µP, identifying the maximal admissible weight (w*) that governs the moment sequence of initialization. The second result characterizes the identifiability of the mean-field limit, showing that two admissible parameter measures yield the same network function in L2 if their active components align, considering the finite-rank realization symmetry of the architecture. The third result provides a sparse-dictionary decomposition of the long-time limit measure, indicating that the active component is supported on a limited number of atoms, bounded by a specific coefficient-threshold number. The fourth result details the total feature-learning-error decomposition into statistical, optimization, propagation-of-chaos, and sparse-residual components, with a focus on the target-dependent Hermite/Barron tail. These results are interconnected through an architectural identity that defines the learning cell of the architecture-data pair, linking the maximal admissible weight, orbit identifiability depth, and sparse-dictionary depth. The proofs are comprehensive, relying on established results from µP and mean-field Langevin theory.
Methodology
The paper employs theoretical analysis and proofs to establish results related to mean-field limits, identifiability, and error decomposition in the context of wide neural networks under the Maximal Update Parametrization. It utilizes concepts from mean-field Langevin dynamics and tensor programs to derive its conclusions.
Results
The main results include the identification of the maximal admissible weight w*, a characterization of identifiability in terms of active components, a sparse-dictionary decomposition of the long-time limit measure, and a comprehensive total error decomposition that accounts for various contributing factors.
Implications
The findings have implications for understanding the dynamics of feature learning in neural networks, particularly in optimizing architectures for better performance and interpretability. The results could inform future research on neural network training and the design of architectures that leverage these insights.
Open Multimodal Datasets and Open-Source Software for Data-Driven Modeling of Multiphase Transport and Thermal Systems
Multimodal
Time Series
Computer Vision
- Introduction of the S+TD framework for classifying thermal-fluid datasets.
- Organization of public NED3 datasets for easier access and usability.
- Development of open-source software packages to support diverse data analysis tasks.
- Emphasis on SeqReg for sequence regression applications in thermal-fluid systems.
Summary
This paper addresses the challenges in data-driven modeling of multiphase transport and thermal systems, which are often hindered by fragmented datasets and complex data formats. The authors propose an open ecosystem comprising multimodal datasets and open-source software to facilitate reproducible AI-enabled thermal-fluid research. They introduce a spatial-plus-temporal dimensionality framework (S+TD) that classifies datasets based on the dimensionality of the measured or simulated fields, ranging from point values to volumetric fields. The paper organizes various public datasets from the NED3 Laboratory, including boiling images, acoustic measurements, and thermal data, and describes complementary software packages designed for tasks such as computer vision, sequence regression, and surrogate modeling. A particular focus is given to SeqReg, a sequence-regression library that supports various data types and applications. The authors advocate for community efforts to develop interoperable thermal-fluid databanks and AI/ML tool libraries to enhance the accessibility and usability of data-driven research.
Methodology
The authors developed a framework (S+TD) to classify datasets based on spatial and temporal dimensions. They organized existing datasets and created open-source software tools to facilitate data analysis and modeling. The paper also reviews existing literature on AI models and datasets relevant to thermal-fluid systems.
Results
The paper successfully categorizes various datasets using the S+TD framework and presents a suite of open-source software tools that enhance the analysis of thermal-fluid data. The introduction of SeqReg demonstrates the potential for nonintrusive heat-flux estimation and other applications.
Implications
The proposed open ecosystem can significantly improve the reproducibility and accessibility of data-driven research in thermal-fluid systems, enabling better modeling and diagnostics. The community-driven approach encourages collaboration and the development of standardized tools and datasets.
Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation
Theory
Optimization
Efficient ML
- Introduction of the Iterative Refinement Neural Operator (IRNO) to mitigate spectral bias in neural operators.
- Establishment of a contraction-based analysis for IRNO, ensuring convergence to a unique fixed point.
- Demonstration of consistent error reduction across multiple physical systems and tasks.
- Effective spectral bias mitigation and cross-operator transferability, enhancing the versatility of neural operators.
Summary
This paper introduces the Iterative Refinement Neural Operator (IRNO), a novel approach designed to enhance the performance of neural operators in scientific modeling by addressing the issue of spectral bias. Traditional neural operators, while effective as surrogates for solving parametric Partial Differential Equations (PDEs), often struggle with high-frequency detail resolution due to their reliance on a single-pass inference method. The IRNO framework augments pre-trained operators with a learned refinement module that iteratively corrects predictions through fixed-point iterations. This method decomposes the prediction process into an initial coarse estimate followed by successive residual corrections, akin to classical numerical solvers. The authors establish a theoretical foundation for IRNO, demonstrating that it converges to a unique fixed point under local assumptions. A progressive spectral loss function is introduced to specifically target high-frequency errors during training. Empirical results across various physical systems show that IRNO consistently reduces error rates, achieving up to 56.05% improvement in turbulent flow scenarios and significantly lowering normalized error ratios across frequency bands. The framework also exhibits stability in extrapolation beyond training iterations and allows for cross-operator transferability, making it a versatile tool for enhancing the accuracy of neural operators without the need for retraining.
Methodology
The IRNO framework is formalized as a two-stage dynamical system where a pre-trained base operator generates an initial prediction, which is then iteratively refined through a learned correction operator. The refinement process is modeled as a fixed-point iteration in function space, allowing for systematic error reduction and convergence analysis.
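As a loose illustration of the two-stage structure described above, the sketch below runs a fixed-point refinement on a scalar toy equation rather than on a neural operator. The functions `base_prediction` and `correction` are hypothetical stand-ins for the pre-trained operator and the learned refinement module; the correction is written by hand so that the update is a contraction near the solution.

```python
# Toy fixed-point refinement, not the IRNO architecture.
# The "solution" is the root of cos(u) - u = 0; both stages below are
# hypothetical stand-ins chosen so that the update is a contraction.
import math

def residual(u):
    # How far the current estimate is from satisfying the equation.
    return math.cos(u) - u

def base_prediction():
    # Stage 1: a coarse single-pass estimate.
    return 0.0

def correction(u, step=0.5):
    # Stage 2: a damped residual correction (hand-written, not learned).
    return step * residual(u)

def refine(u0, num_iters):
    u, errors = u0, []
    for _ in range(num_iters):
        u = u + correction(u)  # u_{k+1} = u_k + C(u_k)
        errors.append(abs(residual(u)))
    return u, errors

u_final, errors = refine(base_prediction(), num_iters=30)
print(f"estimate={u_final:.6f}, first error={errors[0]:.3e}, last error={errors[-1]:.3e}")
```

Because the update map has slope below one around the root, each pass shrinks the remaining error, which is the property a contraction-based convergence analysis relies on.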
Results
IRNO demonstrated up to 56.05% improvement in error rates for turbulent flow and significantly reduced normalized error ratios across frequency bands (27.72–36.10% in low, 5.07–6.68% in mid, and 1.48–2.04% in high frequencies). The method also showed stable extrapolation capabilities and effective cross-operator transferability.
Implications
The IRNO framework presents a significant advancement in the field of scientific modeling, particularly for applications involving complex physical systems. Its ability to refine predictions without retraining opens new avenues for real-time applications and enhances the accuracy of neural operators in various scientific domains.
Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability
NLP
Large Language Models
Interpretability
- Introduces a field-theoretic framework for Transformer patching.
- Empirically tests the framework on GPT-2-style models.
- Establishes a bounded local linear regime for patch effects.
- Demonstrates structured anisotropic propagation of interventions.
Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability
Summary
This paper introduces a field-theoretic framework for mechanistic interpretability in Transformers, specifically focusing on activation patching and its predictive capabilities. The authors conceptualize the residual stream of a Transformer as a depth-token field, allowing for the formulation of patching as localized source insertions. The effects of these patches are described through sensitivity fields and empirical Green-function responses, enabling a structured approach to understanding how interventions propagate through the model. The methodology is empirically tested on GPT-2-style autoregressive Transformers, where localized interventions are applied to observe changes in residual fields and logit differences. The findings reveal a bounded local linear regime, where first-order sensitivities can predict patch effects, and demonstrate structured anisotropic propagation across depth and token positions. The paper establishes response objects such as sensitivities and Green-operator slices as practical tools for organizing patching experiments and formulating inference across scales, ultimately contributing to a deeper understanding of Transformer behavior.
Methodology
The authors treat the residual stream of Transformers as a depth-token field and formulate patching as localized source insertions. They derive sensitivity fields and Green-response structures to predict the effects of interventions. Empirical tests involve applying localized residual-field interventions and measuring the resulting changes in residual fields and logits.
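The idea of predicting a patch effect from a first-order sensitivity can be sketched on a tiny residual network. Everything here is an assumption made for illustration: the tanh layers, the random weights, the `READOUT` direction standing in for a logit difference, and the finite-difference estimate of the sensitivity. None of it is taken from the paper's GPT-2 experiments.

```python
# Toy residual stack: compare the actual effect of a localized patch with
# the first-order prediction sensitivity . patch. Hypothetical setup only.
import math
import random

random.seed(0)
D, LAYERS = 4, 6
W = [[[random.uniform(-0.3, 0.3) for _ in range(D)] for _ in range(D)]
     for _ in range(LAYERS)]
READOUT = [1.0, -1.0, 0.5, -0.5]  # stands in for a logit-difference direction

def run(h0, patch_layer=None, patch=None):
    h = list(h0)
    for layer in range(LAYERS):
        if layer == patch_layer:
            # Localized source insertion into the residual stream.
            h = [hi + pi for hi, pi in zip(h, patch)]
        update = [math.tanh(sum(W[layer][i][j] * h[j] for j in range(D)))
                  for i in range(D)]
        h = [hi + ui for hi, ui in zip(h, update)]
    return sum(r * hi for r, hi in zip(READOUT, h))

def sensitivity(h0, layer, eps=1e-5):
    # Central finite differences of the readout w.r.t. the residual at `layer`.
    grads = []
    for i in range(D):
        plus = [eps if j == i else 0.0 for j in range(D)]
        minus = [-v for v in plus]
        grads.append((run(h0, layer, plus) - run(h0, layer, minus)) / (2 * eps))
    return grads

h0 = [0.5, -0.2, 0.1, 0.3]
base = run(h0)
g = sensitivity(h0, layer=2)
direction = [1.0, 0.0, -1.0, 0.5]

def compare(scale):
    patch = [scale * d for d in direction]
    actual = run(h0, 2, patch) - base
    predicted = sum(gi * pi for gi, pi in zip(g, patch))
    return actual, predicted

small_actual, small_pred = compare(0.001)
large_actual, large_pred = compare(2.0)
print("small patch gap:", abs(small_actual - small_pred))
print("large patch gap:", abs(large_actual - large_pred))
```

For the small patch the linear prediction and the measured change agree closely; printing the gap for the large patch gives a feel for where a local linear regime stops being a good description in this toy.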
Results
The empirical tests confirm the existence of a bounded local linear regime, where first-order sensitivities accurately predict patch effects. The study also identifies structured anisotropic propagation of responses across depth and token positions, and shows that prompt-induced residual displacements can influence answer behavior.
Implications
The proposed framework offers a systematic approach to mechanistic interpretability in Transformers, enabling researchers to predict the effects of interventions more efficiently. It also lays the groundwork for future studies on model scaling and patch-site inference, potentially enhancing the understanding of large language models.
Looped Diffusion Language Models
NLP
Generative Models
Efficient ML
- LoopMDM improves training efficiency by reducing FLOPs without adding parameters.
- Selective looping enhances performance on reasoning tasks, outperforming non-looped models.
- Adaptive loop counts during inference lead to further gains in compute efficiency.
- Attention analysis reveals increased interactions among masked positions with looping.
Looped Diffusion Language Models
Summary
This paper introduces the Looped Masked Diffusion Model (LoopMDM), a novel architecture for masked diffusion language models (MDMs) that enhances training efficiency and model performance by selectively looping early-middle transformer layers. The authors demonstrate that this selective looping approach allows for a depth-scaling effect without increasing the number of parameters, and it offers flexible compute scaling during inference. LoopMDM achieves significant reductions in training floating-point operations (FLOPs) while maintaining or improving performance on various reasoning benchmarks. The results indicate that LoopMDM can match the performance of non-looped MDMs with up to 3.3 times fewer training FLOPs and outperforms deeper non-looped models trained with similar compute. Additionally, the paper provides insights into how looping promotes interactions among masked positions, leading to improved model capabilities, particularly in complex tasks like math reasoning. The findings suggest that architectural choices, such as selective looping, can substantially enhance the efficiency and effectiveness of MDMs, paving the way for further research in this area.
Methodology
The authors conducted a systematic study to integrate looping into MDMs, focusing on selectively applying loops to early-middle transformer layers. They compared LoopMDM against traditional non-looped MDMs across multiple pre-training corpora, analyzing training compute efficiency, performance on reasoning benchmarks, and the effects of adaptive inference strategies.
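A minimal sketch of selective looping, assuming a stack of six toy residual layers in which layers 1 and 2 play the role of the looped early-middle block. The layer function, sizes, and loop range are hypothetical; the point is only that repeating a block raises the number of applied layers while the parameter count stays fixed.

```python
# Selective looping on a toy layer stack (hypothetical sizes and layers).
import math
import random

random.seed(0)
NUM_LAYERS, WIDTH = 6, 3

def make_layer():
    return [[random.gauss(0.0, 0.3) for _ in range(WIDTH)] for _ in range(WIDTH)]

layers = [make_layer() for _ in range(NUM_LAYERS)]

def apply_layer(weight, h):
    # Residual update h + tanh(W h); stands in for a transformer block.
    return [h[i] + math.tanh(sum(weight[i][j] * h[j] for j in range(WIDTH)))
            for i in range(WIDTH)]

def forward(h, loop_start, loop_end, num_loops):
    # Layers before the block run once, the block runs num_loops times,
    # and the remaining layers run once.
    schedule = (
        list(range(0, loop_start))
        + list(range(loop_start, loop_end)) * num_loops
        + list(range(loop_end, NUM_LAYERS))
    )
    for idx in schedule:
        h = apply_layer(layers[idx], h)
    return h, len(schedule)

num_params = NUM_LAYERS * WIDTH * WIDTH
_, depth_plain = forward([1.0, 0.0, -1.0], loop_start=1, loop_end=3, num_loops=1)
_, depth_looped = forward([1.0, 0.0, -1.0], loop_start=1, loop_end=3, num_loops=3)
print(num_params, depth_plain, depth_looped)
```

Since `num_loops` is only an argument of the forward pass, it can be changed at inference time without touching the weights, which is the kind of flexibility an adaptive loop count depends on.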
Results
LoopMDM achieved the same test negative log-likelihood (NLL) as non-looped MDMs with up to 3.3 times fewer training FLOPs. It outperformed non-looped models on various benchmarks, including an 8.5% improvement on the GSM8K math reasoning task. The model also demonstrated enhanced performance through adaptive adjustments in loop counts during inference.
Implications
The findings suggest that selective architectural choices can significantly enhance the efficiency and performance of language models, encouraging further exploration of looping mechanisms in MDMs and potentially influencing future designs of transformer architectures.
Anytime Training with Schedule-Free Spectral Optimization
Optimization
Theory
Large Language Models
- SF-NorMuon outperforms SF-AdamW and matches tuned AdamW baselines.
- The method enables high-quality checkpoints at any training point without a fixed horizon.
- Theoretical guarantees for stationarity and the importance of weight decay for stability are provided.
- The approach addresses the optimization challenges in continual learning scenarios.
Anytime Training with Schedule-Free Spectral Optimization
Summary
This paper addresses the limitations of standard neural network training, which relies on fixed learning-rate schedules that can lead to path dependence and require extensive re-tuning as data availability changes. The authors introduce SF-NorMuon, a schedule-free spectral optimizer that outperforms the current state-of-the-art anytime optimizer, SF-AdamW, and matches or exceeds the performance of tuned AdamW baselines across various parameter sizes and training horizons. The theoretical contributions include a stationarity guarantee for schedule-free spectral dynamics and the identification of weight decay as essential for stability in long-horizon training. SF-NorMuon allows practitioners to obtain high-quality model checkpoints at any point during training without needing to commit to a specific training horizon, thus facilitating continual learning and making horizon-free optimization more practical.
Methodology
The authors propose SF-NorMuon, a schedule-free spectral optimizer that utilizes spectral norm optimization techniques to improve upon existing methods. The framework maintains three iterate sequences and incorporates weight decay to ensure stability during long-horizon training. The performance is evaluated on large language models with varying parameter sizes and training horizons.
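The three iterate sequences can be illustrated with plain schedule-free SGD, which is a simpler relative of the method summarized here and not SF-NorMuon itself (no spectral update and no weight decay). The quadratic objective, learning rate, and interpolation constant below are assumptions chosen for a small runnable example.

```python
# Schedule-free SGD with a plain SGD base step, on a toy quadratic.
# z: base iterate, y: point where the gradient is taken, x: running average.
TARGET = [3.0, -1.0]
CURVATURE = [1.0, 4.0]

def grad(w):
    # Gradient of 0.5 * sum(c_i * (w_i - target_i)^2).
    return [c * (wi - ti) for c, wi, ti in zip(CURVATURE, w, TARGET)]

def schedule_free_sgd(steps, lr=0.2, beta=0.9):
    z = [0.0, 0.0]
    x = [0.0, 0.0]
    for t in range(1, steps + 1):
        # Interpolate between the base iterate and the average.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        c = 1.0 / t  # uniform averaging weight, no horizon required
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x

x_final = schedule_free_sgd(steps=2000)
print(x_final)
```

The averaged sequence `x` is the one to evaluate or checkpoint. Nothing in the loop depends on the total number of steps, which is what makes stopping at an arbitrary point meaningful.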
Results
SF-NorMuon consistently outperformed SF-AdamW and matched the performance of well-tuned AdamW optimizers on large language models with 125M and 772M parameters across different training horizons, demonstrating its effectiveness in horizon-free optimization.
Implications
The development of SF-NorMuon has significant implications for practitioners in machine learning, particularly in scenarios involving continual learning and dynamic data availability. It simplifies the training process by removing the need for extensive hyperparameter tuning and allows for more flexible model updates.
Active Query Synthesis for Preference Learning
Optimization
Robotics
Efficient ML
- Introduction of a confidence-aware response model for pairwise comparisons to enhance query reliability.
- Development of Info-Synth, an active query synthesis framework that generates optimal continuous queries.
- Two approximation strategies, Pair M-dist and Pair Opt-dist, for effective query selection in finite pools.
- Empirical validation of the framework across synthetic data, real-world preference learning, and robotic gain tuning.
Active Query Synthesis for Preference Learning
Summary
This paper addresses the challenge of efficiently learning user preferences in decision-making systems, which often require costly labeled data. The authors propose an active learning framework called Info-Synth that synthesizes optimal queries by maximizing a mutual information-based objective within a continuous space, thus reducing computational costs associated with traditional pool-based evaluation methods. A key innovation is the introduction of a confidence-aware response model that accounts for the reliability of feedback in pairwise comparisons, particularly in cases where items are nearly identical or entirely dissimilar. The framework includes two strategies, Pair M-dist and Pair Opt-dist, to effectively select queries even when limited to finite pools. The authors validate their approach through empirical evaluations across three distinct settings: synthetic preference learning, constrained text summary datasets, and tuning controller gains for a simulated mobile robot. The results demonstrate the versatility and effectiveness of the proposed methods in improving preference learning and parameter tuning tasks.
Methodology
The authors employ a Bayesian framework to estimate user preferences through active pairwise comparison queries. The Info-Synth framework synthesizes queries by maximizing mutual information, while the confidence-aware response model ensures that the selected queries yield high-confidence feedback. The two approximation strategies, Pair M-dist and Pair Opt-dist, are designed to adapt the synthesized queries for lower and higher-dimensional datasets, respectively.
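To make the mutual-information objective concrete, the sketch below scores pairwise queries against a sample-based posterior and picks the best pair from a small pool. It assumes a plain logistic (Bradley-Terry style) response model and exhaustive pool search, so it corresponds to the pool-based baseline rather than to Info-Synth's continuous synthesis or its confidence-aware response model. All names and sizes are hypothetical.

```python
# Mutual information of a pairwise comparison under a sampled posterior.
# Assumed response model: P(a preferred over b | w) = sigmoid(w . (a - b)).
import math
import random

random.seed(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def binary_entropy(p):
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

# Posterior over a 2-D preference weight vector, represented by samples.
particles = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]

def mutual_information(a, b):
    diff = [ai - bi for ai, bi in zip(a, b)]
    probs = [sigmoid(sum(wi * di for wi, di in zip(w, diff))) for w in particles]
    mean_p = sum(probs) / len(probs)
    # I(answer; w) = H(E[p]) - E[H(p)]
    return binary_entropy(mean_p) - sum(binary_entropy(p) for p in probs) / len(probs)

pool = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(12)]
pairs = [(i, j) for i in range(len(pool)) for j in range(i + 1, len(pool))]
best_pair = max(pairs, key=lambda ij: mutual_information(pool[ij[0]], pool[ij[1]]))
best_score = mutual_information(pool[best_pair[0]], pool[best_pair[1]])
identical_score = mutual_information(pool[0], pool[0])
print(best_pair, round(best_score, 4), identical_score)
```

Comparing an item with itself yields zero information under this model. The exhaustive loop over `pairs` grows quadratically with the pool, which is the cost that optimizing the query directly in a continuous space is meant to avoid.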
Results
The proposed framework shows significant improvements in query efficiency and reliability across various settings. The empirical evaluations indicate that Info-Synth outperforms traditional methods in terms of both computational cost and the quality of preference estimation, effectively demonstrating its applicability in real-world scenarios such as robotic gain tuning.
Implications
The findings suggest that the active query synthesis approach can be widely applied in systems requiring user preference learning, such as recommendation systems and adaptive user interfaces. Additionally, the methods can enhance parameter tuning processes in cyber-physical systems, leading to more efficient and user-centered designs.
JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates
Large Language Models
Optimization
Efficient ML
- JacQuant replaces the STE's identity Jacobian with a learned surrogate sensitivity map, improving gradient alignment.
- The framework is computationally efficient, with negligible overhead and compatibility with existing quantization methods.
- Theoretical convergence guarantees are provided for non-convex objectives under the learned sensitivity model.
- Empirical results show significant accuracy improvements in ultra-low-bit quantization scenarios on LLM benchmarks.
JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates
Summary
The paper introduces JacQuant, a novel framework for quantization-aware training (QAT) that eliminates the reliance on the Straight-Through Estimator (STE) by learning a surrogate Jacobian to stabilize and accelerate training. Traditional QAT methods using STE often face challenges near bin boundaries, leading to instability and suboptimal training outcomes. JacQuant addresses this by learning a lightweight surrogate sensitivity map that adapts the gradient propagation based on the quantizer's local behavior. The method is designed to be computationally efficient, requiring minimal overhead while maintaining compatibility with existing quantization techniques. The authors provide theoretical guarantees for convergence in non-convex settings and demonstrate that JacQuant consistently outperforms STE-based QAT in terms of accuracy on large language models (LLMs) with ultra-low-bit quantization. The empirical results show that JacQuant is a drop-in replacement for STE, enhancing both stability and accuracy without significant runtime costs.
Methodology
JacQuant employs a learned surrogate Jacobian to replace the fixed identity Jacobian used in STE. It utilizes either random probes or subtractive dithering to estimate local sensitivity, allowing for adaptive gradient propagation that aligns with the quantizer's behavior. The surrogate Jacobian is updated intermittently to maintain computational efficiency while ensuring accurate gradient flow during training.
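The contrast between the STE's identity Jacobian and a locally estimated sensitivity can be sketched with a scalar uniform quantizer. The estimator `probe_sensitivity` is a hypothetical illustration of probing, not JacQuant's learned surrogate, and the bin width is arbitrary. The dithering function shows the standard property that subtractive dither makes the quantizer unbiased on average.

```python
# Scalar uniform quantizer: STE slope vs. a probed local slope, plus
# subtractive dithering. Hypothetical illustration, not JacQuant itself.
import random

STEP = 0.5  # bin width (arbitrary choice for the example)

def quantize(w):
    return STEP * round(w / STEP)

def ste_backward(upstream_grad):
    # The straight-through estimator treats dQ/dw as exactly 1 everywhere.
    return upstream_grad

def probe_sensitivity(w, radius, num_probes, rng):
    # Average slope of Q measured with random symmetric probes around w.
    total = 0.0
    for _ in range(num_probes):
        d = rng.uniform(0.1 * radius, radius)
        total += (quantize(w + d) - quantize(w - d)) / (2 * d)
    return total / num_probes

def dithered_quantize(w, rng):
    # Add uniform noise before quantizing, then subtract the same noise.
    u = rng.uniform(-STEP / 2, STEP / 2)
    return quantize(w + u) - u

rng = random.Random(0)
slope_center = probe_sensitivity(0.0, radius=0.1, num_probes=100, rng=rng)
slope_boundary = probe_sensitivity(0.25, radius=0.1, num_probes=100, rng=rng)
dither_mean = sum(dithered_quantize(0.13, rng) for _ in range(20000)) / 20000
print(ste_backward(1.0), slope_center, slope_boundary, dither_mean, quantize(0.13))
```

In the middle of a bin the probed slope is zero, while next to a bin boundary it is large; the STE reports 1 in both places. That mismatch near boundaries is the situation a learned sensitivity map is intended to handle.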
Results
JacQuant demonstrates higher accuracy than STE-based QAT across various LLM benchmarks at quantization levels of 2 bits or fewer. The method shows improved stability and reduced oscillations near quantization boundaries, with empirical evaluations confirming its effectiveness on models like LLaMA3 and Qwen3. The additional computational cost of JacQuant is shown to be negligible, making it a practical choice for ultra-low-bit QAT.
Implications
The development of JacQuant has significant implications for deploying large language models in resource-constrained environments, where memory and latency are critical. By improving the training process for ultra-low-bit quantization, JacQuant can facilitate more efficient model deployment without sacrificing accuracy, potentially broadening the accessibility of advanced AI models.
Building a privacy-preserving Federated Recommender system for mobile devices
Federated Learning
- Introduction of a federated learning approach to enhance user privacy in mobile recommendations.
- Development of a two-stage recommendation pipeline for candidate generation and ranking.
- Implementation of the system on Kotlin Multiplatform for cross-platform deployment.
- Utilization of diverse datasets for system validation and performance evaluation.
Building a privacy-preserving Federated Recommender system for mobile devices
Summary
This paper addresses the growing concerns around user privacy in mobile applications that traditionally rely on centralized data collection for recommendations. With the introduction of privacy regulations like GDPR and increasing user awareness, there is a pressing need for new methods to deliver personalized content without compromising user data. The author proposes a privacy-preserving federated recommender system that utilizes federated learning, allowing models to be trained locally on user devices while sharing only model updates. The proposed system consists of a two-stage pipeline: the first stage generates candidate recommendations, and the second stage ranks these candidates based on hyper-personalized user data. This second stage is designed to operate in a federated manner, ensuring that sensitive user information remains on the device. The author employs various datasets, including MovieLens and Human Activity Recognition datasets, to validate the system. The final implementation is a machine learning library developed on Kotlin Multiplatform, enabling deployment on both Android and iOS devices. The work emphasizes the balance between effective recommendation systems and user privacy, showcasing a proof-of-concept that can be tested in real-world scenarios.
Methodology
The methodology involves a two-stage federated learning pipeline where the first stage focuses on generating candidate recommendations based on local data, and the second stage ranks these candidates using sensitive user data, all while ensuring that no actual data leaves the user's device. The system leverages various open-source datasets for training and evaluation.
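The "train locally, share only model updates" pattern can be sketched with federated averaging on a one-parameter linear model. The summary does not state which aggregation rule the system uses, so FedAvg is an assumption here, as are the synthetic client datasets and hyperparameters. The sketch is in Python for brevity and says nothing about the Kotlin Multiplatform implementation.

```python
# Federated averaging on a 1-D linear model y = w * x (assumed setup).
# Raw (x, y) pairs never leave a client; only the scalar weight is shared.
import random

random.seed(0)
TRUE_W = 2.0

def make_client(n):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return [(x, TRUE_W * x + random.gauss(0, 0.05)) for x in xs]

clients = [make_client(n) for n in (20, 50, 30)]

def local_update(w, data, lr=0.1, epochs=5):
    # A few gradient steps on the client's own data.
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global):
    total = sum(len(d) for d in clients)
    updates = [(local_update(w_global, d), len(d)) for d in clients]
    # Server step: average client weights, weighted by local sample count.
    return sum(w * n for w, n in updates) / total

w = 0.0
for _ in range(30):
    w = federated_round(w)
print(round(w, 3))
```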
Results
The proposed system successfully demonstrates the feasibility of a privacy-preserving federated recommendation engine, achieving effective personalization without compromising user data privacy. The implementation is validated using multiple datasets, showcasing its potential for real-world application.
Implications
This work has significant implications for the development of privacy-conscious applications in the mobile ecosystem, particularly in the context of personalized content delivery. It highlights the potential for federated learning to transform how recommendations are generated while adhering to privacy regulations.
Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data
Interpretability
- Introduction of Trajectory-based Difficulty Score (TDS) for estimating instance difficulty in boosted ensembles.
- TDS utilizes interpretable trajectory descriptors to predict held-out loss effectively.
- Demonstrated strong rank correlation with error and outperformance of existing baselines in classification tasks.
- TDS enhances active learning, selective prediction, and conformal prediction workflows.
Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data
Summary
This paper introduces a novel method called Trajectory-based Difficulty Score (TDS) aimed at improving the performance of gradient-boosted trees on tabular data, which often suffer from a long tail of poorly predicted instances. TDS is an instance-level difficulty estimator derived from the cumulative prediction trajectories of individual trees in an ensemble. By analyzing trajectory descriptors such as variance, oscillation peaks, and sign switches, the authors develop a lightweight regression model that predicts held-out loss. The resulting scores, calibrated using an empirical cumulative distribution function, allow for effective ranking of difficult cases. The paper demonstrates that TDS outperforms existing instance-hardness and uncertainty measures across various classification benchmarks and remains competitive in regression tasks. Furthermore, the authors showcase how TDS can enhance multiple data mining workflows, including active learning, selective prediction, and conformal prediction. The clustering of high-TDS instances using SHAP attributions reveals coherent failure modes, facilitating error analysis and targeted data acquisition. Overall, TDS represents a significant advancement in addressing the challenges of learning from tabular data.
Methodology
The authors compute trajectory descriptors from the cumulative predictions of each tree in a boosted ensemble. They extract features such as variance, oscillation peaks, and sign switches, which are then used to train a lightweight regression model to predict held-out loss. An empirical cumulative distribution function calibrates these predictions into a difficulty score.
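A sketch of how such descriptors might be computed from per-tree outputs. The exact definitions used for TDS are not given in this summary, so the formulas for variance, sign switches, and oscillation peaks below are assumptions, as are the example values.

```python
# Trajectory descriptors from a boosted ensemble's per-tree outputs.
# Descriptor definitions are assumed for illustration.

def cumulative_trajectory(tree_outputs, learning_rate=1.0):
    total, traj = 0.0, []
    for out in tree_outputs:
        total += learning_rate * out
        traj.append(total)
    return traj

def descriptors(traj):
    n = len(traj)
    mean = sum(traj) / n
    variance = sum((v - mean) ** 2 for v in traj) / n
    # Times the running prediction crosses zero (flips predicted class).
    sign_switches = sum(1 for a, b in zip(traj, traj[1:]) if a * b < 0)
    # Interior local extrema, where the trajectory changes direction.
    peaks = sum(1 for a, b, c in zip(traj, traj[1:], traj[2:])
                if (b - a) * (c - b) < 0)
    return {"variance": variance, "sign_switches": sign_switches, "peaks": peaks}

def ecdf_score(value, reference):
    # Fraction of reference values at or below `value`, in [0, 1].
    return sum(1 for r in reference if r <= value) / len(reference)

easy = descriptors(cumulative_trajectory([0.5, 0.4, 0.3, 0.2, 0.1]))
hard = descriptors(cumulative_trajectory([0.5, -0.9, 0.8, -0.7, 0.6]))
score = ecdf_score(0.7, [0.1, 0.5, 0.9, 1.3])
print(easy, hard, score)
```

In the full method these descriptors would feed a regression model that predicts held-out loss, and the ECDF step would map that prediction to a rank-like score; here `ecdf_score` is only shown on raw numbers.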
Results
The empirical evaluation shows that TDS scores correlate strongly with actual prediction errors, outperforming established difficulty and uncertainty measures. TDS-driven methods improve active learning efficiency and enhance the reliability of selective and conformal predictions across diverse tabular datasets.
Implications
The TDS framework can significantly improve the training and evaluation processes for models dealing with tabular data, enabling more effective sample prioritization, better risk-coverage trade-offs, and enhanced interpretability through error analysis. This approach can be applied across various domains where tabular data is prevalent, such as healthcare, finance, and cybersecurity.
On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight
Theory
- Class imbalance severely impacts the learning dynamics of DNNs, leading to underfitting of minority classes.
- DNNs initially focus on the majority class, which results in poor performance on minority class samples.
- Even when minority samples are learned, the representations are often overfitted and non-generalizable.
- A systematic investigation of learning patterns is essential for developing effective methods to address class imbalance.
On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight
Summary
This paper investigates the impact of class imbalance on the learning dynamics of deep neural networks (DNNs), a critical issue that has gained attention in recent years. The authors highlight that while DNNs have shown remarkable success in various applications, they are not immune to the challenges posed by imbalanced datasets. The study systematically monitors the learning patterns of DNNs on datasets with varying imbalance ratios, revealing that class imbalance significantly deteriorates DNN performance. Specifically, the findings indicate that during early training epochs, DNNs tend to underfit minority class samples while primarily learning from the majority class. Although DNNs eventually learn minority samples, the learned representations often fail to generalize at test time, which indicates overfitting to those samples. The paper emphasizes the need for a deeper understanding of how class imbalance affects DNN learning dynamics to develop more effective methods for handling imbalanced data.
Methodology
The authors conducted experiments by monitoring the learning patterns of DNN models on datasets with varying class imbalance ratios. They analyzed the performance of DNNs on both majority and minority classes throughout the training epochs.
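Monitoring majority and minority classes separately usually comes down to a per-class metric logged each epoch. The helper below is a hypothetical example of such a metric (per-class recall), with a toy prediction vector showing why overall accuracy alone hides minority-class underfitting.

```python
# Per-class recall next to overall accuracy on an imbalanced toy set.
from collections import defaultdict

def per_class_recall(labels, predictions):
    correct, total = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}

labels = [0] * 9 + [1]          # 9:1 imbalance
always_majority = [0] * 10      # a model that has only learned class 0

recalls = per_class_recall(labels, always_majority)
accuracy = sum(int(y == p) for y, p in zip(labels, always_majority)) / len(labels)
print(recalls, accuracy)
```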
Results
The experimental findings demonstrate that class imbalance leads to significant performance deterioration in DNNs. In the early training phases, DNNs underfit minority class samples while primarily learning from the majority class. When minority samples are eventually learned, their representations tend to be overfitted and fail to generalize at test time.
Implications
The insights from this study can inform the development of more effective DNN-based methods for handling class imbalance, which is prevalent in many real-world applications such as fraud detection and disease diagnosis.
Batch Normalization Amplifies Memorization and Privacy Risks
Theory
Optimization
- Batch Normalization significantly increases the memorization of outlier samples in deep neural networks.
- Models with Batch Normalization show higher susceptibility to membership inference attacks, indicating privacy risks.
- The study employs a multifaceted approach, combining empirical experiments and theoretical analysis.
- Theoretical insights reveal that BN amplifies the influence of outlier samples during training.
Batch Normalization Amplifies Memorization and Privacy Risks
Summary
This paper investigates the impact of Batch Normalization (BN) on the memorization of atypical or outlier samples in deep neural networks and its implications for privacy risks. While BN is widely used to enhance training stability and convergence speed, the authors reveal that it significantly increases the memorization of outliers, which can lead to privacy vulnerabilities. The study employs three complementary approaches: (1) experiments on out-of-distribution samples, (2) membership inference attacks to assess model susceptibility, and (3) per-sample influence analysis through gradient norms. The findings indicate that models with BN exhibit faster and stronger memorization of outliers and are more susceptible to membership inference attacks. The authors also provide a theoretical analysis that explains how BN amplifies the influence of outlier samples during training, highlighting a critical tension between optimization benefits and privacy guarantees in deep learning. This research underscores the need for careful consideration of BN's use in privacy-sensitive applications.
Methodology
The authors conducted an empirical study using three approaches: (1) controlled experiments with label-flip and out-of-distribution samples to assess memorization, (2) membership inference attacks to quantify privacy leakage, and (3) gradient-based metrics to analyze per-sample influence on model training.
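The third approach, per-sample influence through gradient norms, can be illustrated on logistic regression, where the per-sample gradient has a closed form. This sketch contains no Batch Normalization and makes no claim about its effect; the model, weights, and samples are hypothetical and only show why a label-flipped sample stands out under this metric.

```python
# Per-sample gradient norm for logistic regression (closed form).
# For loss = cross-entropy, d(loss)/d(w, b) = (p - y) * (x, 1).
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def per_sample_grad_norm(w, b, x, y):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    err = p - y
    return abs(err) * math.sqrt(sum(xi * xi for xi in x) + 1.0)

w, b = [2.0, -1.0], 0.0
x = [1.0, -1.0]                                   # logit = 3.0
typical_norm = per_sample_grad_norm(w, b, x, 1)   # label agrees with the model
flipped_norm = per_sample_grad_norm(w, b, x, 0)   # same input, label flipped
print(round(typical_norm, 4), round(flipped_norm, 4))
```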
Results
The results consistently demonstrate that Batch Normalization accelerates the memorization of outlier samples and increases the model's vulnerability to membership inference attacks. The theoretical analysis shows that BN amplifies the per-step influence of outlier samples, providing a mechanistic understanding of the observed phenomena.
Implications
The findings highlight significant privacy risks associated with the use of Batch Normalization in deep learning models, particularly in applications involving sensitive data. This research calls for a reassessment of BN's role in privacy-sensitive contexts and suggests that practitioners should consider alternative normalization techniques or strategies to mitigate these risks.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
Large Language Models
Efficient ML
NLP
- GEMQ introduces a global approach to mixed-precision quantization for MoE-LLMs, improving expert importance estimation.
- The method includes a linear programming formulation for optimal bit-width allocation based on quantization error analysis.
- Router fine-tuning is employed to adapt to quantized experts, enhancing expert selection accuracy.
- GEMQ is integrated into a progressive quantization framework, refining expert importance estimation iteratively.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
Summary
The paper introduces Global Expert-level Mixed-Precision Quantization (GEMQ), a novel approach designed to optimize the memory efficiency of Mixture-of-Experts Large Language Models (MoE-LLMs) while maintaining performance. MoE-LLMs, despite their strong performance, face significant memory overhead due to the large number of expert parameters. Existing mixed-precision quantization methods often rely on layer-wise importance estimation, which fails to account for the global importance of experts and the shifts in routing caused by quantization. GEMQ addresses these limitations through a two-pronged approach: first, it employs a global linear programming formulation to assess expert importance based on quantization error analysis, allowing for more effective bit-width allocation across the entire model rather than within individual layers. Second, it includes efficient router fine-tuning to adapt the routing mechanisms to the quantized experts, thereby improving the accuracy of expert selection. The proposed method is integrated into a progressive quantization framework that iteratively refines the estimation and allocation of expert importance. Experimental results demonstrate that GEMQ significantly reduces memory usage and accelerates inference with minimal accuracy degradation, showcasing its effectiveness for practical deployment of MoE-LLMs.
Methodology
GEMQ utilizes a global linear programming model to determine expert bit-width allocation based on quantization error analysis. It also incorporates a parameter-efficient fine-tuning strategy for routers to adapt to the quantized experts, allowing for improved routing and expert selection. The method is implemented within a progressive quantization framework that iteratively refines the allocation process.
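Global bit-width allocation under a memory budget can be sketched as follows. This is a greedy stand-in rather than the linear program described above, and it rests on two loudly stated assumptions: each expert has a scalar importance, and its quantization error at `b` bits is modeled as `importance * 4**(-b)`. Expert values and the budget are made up for the example.

```python
# Greedy global bit allocation across experts (stand-in for an LP).
# Assumed error model: error(i, b) = importance[i] * 4 ** (-b).

def allocate_bits(importance, avg_bits, min_bits=2, max_bits=4):
    bits = [min_bits] * len(importance)
    budget = round(avg_bits * len(importance)) - sum(bits)

    def error(i, b):
        return importance[i] * 4.0 ** (-b)

    while budget > 0:
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break
        # Spend the next bit where it removes the most modeled error,
        # comparing experts across the whole model, not per layer.
        best = max(candidates, key=lambda i: error(i, bits[i]) - error(i, bits[i] + 1))
        bits[best] += 1
        budget -= 1
    return bits

importance = [8.0, 1.0, 0.5, 5.0]
bits = allocate_bits(importance, avg_bits=3.0)
print(bits)
```

With this error model the marginal gain of each extra bit shrinks, so the greedy choice is a reasonable proxy for the optimum of the separable problem; an LP or integer program would be needed for richer constraints.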
Results
The experiments conducted on various MoE-LLMs demonstrate that GEMQ achieves significant reductions in memory usage while largely preserving the model's capabilities. The method allows for effective low-bit quantization without substantial accuracy loss, making it suitable for practical applications.
Implications
The findings suggest that GEMQ can facilitate the deployment of large-scale MoE-LLMs in resource-constrained environments, enabling more efficient use of memory and computational resources while maintaining high performance in natural language processing tasks.
Merge-Bench: Resolve Merge Conflicts with Large Language Models
Large Language Models
Reinforcement Learning
Optimization
- Introduction of Merge-Bench, a scalable dataset for merge conflict resolution.
- Development of LLMergeJ, the first model trained with online reinforcement learning for this task.
- LLMergeJ outperforms three commercial LLMs in resolving Java merge conflicts.
- The paper proposes a test-free evaluation paradigm to enhance the reliability of model assessments.
Merge-Bench: Resolve Merge Conflicts with Large Language Models
Summary
This paper addresses the challenge of resolving merge conflicts in version control systems using machine learning techniques. The authors introduce a new dataset, Merge-Bench, consisting of 7938 real-world merge conflict hunks sourced from 1439 GitHub repositories, with ground truth based on actual developer resolutions. This dataset is notable for its scalability and lack of manual labeling, making it a robust benchmark for evaluating merge conflict resolution tools. The authors developed LLMergeJ, a 14B parameter model trained using Group Relative Policy Optimization (GRPO), an online reinforcement learning method. LLMergeJ is specifically designed to resolve merge conflicts in Java programs. The paper presents evaluations showing that LLMergeJ outperforms three commercial large language models (LLMs), achieving 49% exact match accuracy and 59% source code match accuracy on real-world conflicts. Furthermore, the study highlights the consistent performance of commercial LLMs across multiple programming languages, although none achieve satisfactory resolution rates, with the best models resolving less than 60% of conflicts. The authors also propose a test-free evaluation paradigm that avoids common pitfalls associated with traditional testing methods, such as reward hacking, thereby enhancing the reliability of model assessments.
Methodology
The authors constructed the Merge-Bench dataset without manual labeling, using real-world merge conflicts from GitHub. They trained LLMergeJ using Group Relative Policy Optimization (GRPO), an online reinforcement learning approach, to resolve conflicts in Java code. The evaluation methodology introduced a test-free paradigm to assess model performance without relying on traditional test execution.
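The group-relative step in GRPO can be sketched without any model: several candidate resolutions for the same conflict are scored, and each reward is normalized against its own group. The reward below (exact match after whitespace normalization) is an assumption for illustration; the summary does not specify the reward used to train LLMergeJ, and the Java snippets are invented.

```python
# Group-relative advantages over sampled resolutions of one conflict.
# The exact-match reward with whitespace normalization is an assumption.
import statistics

def normalize(code):
    # Collapse whitespace so formatting-only differences do not count.
    return " ".join(code.split())

def exact_match_reward(candidate, reference):
    return 1.0 if normalize(candidate) == normalize(reference) else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

reference = "int x = a + b;"
samples = [
    "int x = a + b;",
    "int x = a - b;",
    "int  x = a + b;",
    "int x = b;",
]
rewards = [exact_match_reward(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, [round(a, 3) for a in advantages])
```

Because the reward is computed by comparing against the developer's resolution, no test suite has to be executed, which is in the spirit of a test-free evaluation.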
Results
LLMergeJ achieved 49% exact match accuracy and 59% source code match accuracy on real-world merge conflicts, outperforming three commercial LLMs and trailing only Gemini 2.5 Pro. The study also found that commercial LLM performance remained stable across 11 programming languages, with the best models resolving less than 60% of merge conflicts.
Implications
The findings suggest that large language models can significantly improve the automation of merge conflict resolution, potentially saving developer time and reducing software defects. The introduction of a scalable and reliable dataset like Merge-Bench could facilitate further research and development in this area, leading to more effective tools for software engineering.
Capture-Calibrate-Coach: A Graph-Based Framework for Knowledge Monitoring Estimation and Adaptive Feedback
Graph Learning
- Introduces a novel framework for knowledge monitoring in educational contexts.
- Utilizes heterogeneous graph neural networks for inferring latent perceived states.
- Classifies learners into metacognitive patterns to deliver personalized feedback.
- Demonstrates high accuracy in predicting learners' perceived knowledge states.
Read more
Capture-Calibrate-Coach: A Graph-Based Framework for Knowledge Monitoring Estimation and Adaptive Feedback
Summary
The paper presents the Capture-Calibrate-Coach (3C) framework designed to enhance adaptive learning support by focusing on knowledge monitoring, which is the ability of learners to accurately assess their own understanding. The framework consists of three phases: Capture, Calibrate, and Coach. In the Capture phase, learners' perceived knowledge states are extracted from open-ended self-reports, forming a heterogeneous graph that connects learners to knowledge concepts. The Calibrate phase employs a heterogeneous graph neural network (HGNN) to infer latent perceived states for concepts not explicitly mentioned, addressing the challenge of unobserved perceptions. Finally, the Coach phase classifies learners into five metacognitive patterns and provides personalized feedback that targets knowledge gaps and calibration errors. Evaluation results indicate that the framework achieves an AUC of 85.21% in predicting latent perceived states, significantly outperforming baseline methods. A user study further confirms the positive reception of the feedback provided, highlighting its effectiveness in guiding learners towards improved self-awareness and knowledge growth.
Methodology
The 3C framework integrates test logs and learning reflections through three phases: Capture, where learners' perceptions are extracted using large language models; Calibrate, where a heterogeneous graph neural network infers latent perceived states as a link prediction problem; and Coach, where learners are classified into metacognitive patterns for personalized feedback.
Results
The framework achieved an AUC of 85.21% in predicting latent perceived states, outperforming baseline methods. A user study with 47 participants indicated a positive reception of the feedback quality, particularly in terms of actionable guidance on knowledge gaps.
Implications
The findings suggest that AI-based learning support can be enhanced by incorporating metacognitive elements, leading to improved self-awareness among learners and more effective study strategies. This framework could be applied in various educational settings to foster better learning outcomes.
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
NLP
Generative Models
Large Language Models
- DiLaDiff improves sampling quality and throughput in language modeling by addressing token correlation issues.
- The model integrates a continuous latent space with a latent diffusion model and a consistency distillation approach.
- DiLaDiff achieves significant speed improvements, outperforming masked diffusion baselines while maintaining quality.
- The distillation process allows for efficient latent variable sampling, reducing computational overhead.
Read more
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
Summary
The paper introduces DiLaDiff, a novel approach to language modeling that addresses the limitations of traditional diffusion models in capturing token correlations. DiLaDiff combines a continuous latent space, a latent diffusion model, and a consistency model to enhance sampling quality and throughput. The continuous latent space is learned from an existing masked diffusion language model, allowing for semantic representation. The latent diffusion model learns the prior over the encoder distribution, while the consistency model distills this prior into a few-step generative model. The authors demonstrate that DiLaDiff outperforms masked diffusion baselines in terms of both quality and inference speed, achieving a 7x acceleration at a batch size of 32. Furthermore, the distillation process allows for significant reductions in computational overhead, enabling rapid latent variable sampling compared to traditional discrete decoding methods.
Methodology
The methodology involves training a text auto-encoder with a decoder initialized from a pre-trained discrete diffusion model. A continuous diffusion model learns the latent prior, resulting in a hybrid continuous-discrete diffusion model. The consistency distillation process is then applied to create a few-step generative model from the latent diffusion component.
Results
DiLaDiff outperforms the masked diffusion baseline, achieving high-quality unconditional generation while accelerating inference by a factor of 7 at a batch size of 32. The distillation process allows DiLaDiff to approach the performance of its LaDiff teacher model in far fewer steps (5 versus 200).
Implications
The advancements presented in DiLaDiff could lead to more efficient language models that can generate high-quality text with reduced computational resources. This has potential applications in various NLP tasks, including text generation, summarization, and dialogue systems.
An Open-Source Training Dataset for Foundation Models for Black-box Optimization
Optimization
- Introduction of BBO-Pile, the largest open-source dataset for black-box optimization.
- Dataset consists of 557,100 optimization trajectories from 6 optimizers across 3,095 black-boxes.
- Foundation models trained on this dataset demonstrate effective imitation of existing optimization methods.
- Scaling behavior of models is analyzed with respect to parameter count and token budget.
Read more
An Open-Source Training Dataset for Foundation Models for Black-box Optimization
Summary
This paper addresses the limitations of existing black-box optimization methods, which often require extensive hyperparameter tuning and do not generalize well across domains. The authors introduce BBO-Pile, the first open-source dataset containing over 500,000 optimization trajectories (557,100 in total) evaluated across 3,095 different black-box functions. The dataset is designed to support training foundation models that learn optimization principles from diverse trajectories. The authors train a range of models, from 2M to 80M parameters, and analyze how well they imitate state-of-the-art black-box optimization methods. The results indicate that large-scale pre-training can effectively mimic these methods, which strengthens the case for further research on foundation models for black-box optimization.
Methodology
The authors created the BBO-Pile dataset by collecting optimization trajectories from various optimizers and black-box functions. They trained a family of decoder-based transformer models at different scales and evaluated their performance in mimicking state-of-the-art black-box optimization methods. The study systematically analyzed the models' scaling behavior concerning compute resources.
Results
The empirical evaluation showed that larger models trained on the BBO-Pile dataset could effectively imitate existing black-box optimization methods, indicating that large-scale pre-training is a viable approach. The models demonstrated varying degrees of success based on their parameter count and the amount of training data used.
Implications
The introduction of BBO-Pile could significantly enhance the research landscape in black-box optimization by providing a publicly available dataset that fosters reproducibility and generalization. It opens avenues for developing more advanced foundation models that can adapt to various optimization tasks across different domains.
On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks
Theory
Time Series
Robotics
- Introduction of R-DTLGN, a recurrent architecture that operates in Kleene's three-valued logic.
- Establishment of structural guarantees for graceful degradation under input uncertainty.
- Development of a novel hardening routine for converting learned networks into logic circuits.
- Demonstration of principled abstention and input certainty monotonicity in R-DTLGN.
Read more
On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks
Summary
This paper introduces the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a novel recurrent neural network architecture designed to predict Signal Temporal Logic (STL) verdicts while ensuring graceful degradation under input uncertainty. Conventional recurrent architectures lack structural guarantees when sensor inputs degrade, which can lead to incorrect outputs. The R-DTLGN operates over Kleene's three-valued logic, incorporating a third value (0) to represent unknown states. The architecture uses continuous polynomial surrogates during training and switches to discrete ternary logic circuits at inference. The authors demonstrate that R-DTLGN can maintain stable recurrent dynamics and principled abstention, ensuring that unknown inputs do not lead to erroneous outputs. The paper also establishes a realizability bound that sizes the network's hidden state based on the STL formula's temporal operators, eliminating the need to tune the hidden-state size as a hyperparameter. Evaluations on D4RL PointMaze navigation data show that R-DTLGN effectively balances prediction accuracy with safety, making it a significant advancement for safety-critical systems.
Methodology
The R-DTLGN architecture employs recurrent cells that utilize degree-(2, 2) polynomial surrogates during training, transitioning to exact ternary logic gates during inference. The architecture is designed to handle inputs in Kleene's three-valued logic, allowing for principled handling of unknown inputs. The paper also outlines a hardening routine that distills the learned network into a logic circuit format, ensuring stability and robustness.
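To make the three-valued logic concrete, here is a minimal sketch of exact Kleene gates. The summary only states that 0 encodes "unknown"; the -1/+1 encoding for false/true, the gate names, and the `degrades_gracefully` check are assumptions for illustration and are not the paper's polynomial surrogates or hardening routine.

```python
FALSE, UNKNOWN, TRUE = -1, 0, 1  # assumed encoding; only 0 = unknown is from the summary
VALUES = (FALSE, UNKNOWN, TRUE)


def k_not(a):
    """Kleene negation: swaps true and false, leaves unknown unchanged."""
    return -a


def k_and(a, b):
    """Kleene conjunction: the minimum under FALSE < UNKNOWN < TRUE."""
    return min(a, b)


def k_or(a, b):
    """Kleene disjunction: the maximum under FALSE < UNKNOWN < TRUE."""
    return max(a, b)


def degrades_gracefully(gate):
    """Check that masking inputs to UNKNOWN never flips a definite output.

    After masking, the gate must either abstain (UNKNOWN) or return the
    same verdict it gave on the unmasked inputs.
    """
    for a in VALUES:
        for b in VALUES:
            out = gate(a, b)
            for a_masked in (a, UNKNOWN):
                for b_masked in (b, UNKNOWN):
                    masked_out = gate(a_masked, b_masked)
                    if masked_out not in (UNKNOWN, out):
                        return False
    return True
```

For example, `k_and(FALSE, UNKNOWN)` is still `FALSE` because one false input settles a conjunction, while `k_and(TRUE, UNKNOWN)` abstains with `UNKNOWN`.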
Results
The R-DTLGN was evaluated on STL specifications using D4RL PointMaze navigation data, demonstrating effective prediction accuracy and graceful degradation behavior under input dropout. The results indicate that R-DTLGN maintains safety guarantees while improving the accuracy-versus-safety tradeoff compared to traditional architectures.
Implications
The R-DTLGN architecture has significant implications for the deployment of neural networks in safety-critical applications, where maintaining accurate predictions under uncertain conditions is crucial. Its structural guarantees against erroneous outputs in the presence of degraded inputs could enhance the reliability of monitoring systems in various domains, including autonomous vehicles and robotics.
LLMTabBench: Evaluating LLMs on Binary Tabular Classification From Zero to Few Shots
Large Language Models
NLP
Efficient ML
- LLMTabBench is introduced as a benchmark for evaluating LLMs in tabular classification with limited labeled data.
- LLMs demonstrate strong performance in zero-shot settings, often surpassing few-shot learning models.
- Incorporating few-shot examples can conflict with LLMs' prior knowledge, potentially degrading performance.
- A data complexity threshold exists, beyond which LLM performance declines and few-shot examples are less effective.
Read more
LLMTabBench: Evaluating LLMs on Binary Tabular Classification From Zero to Few Shots
Summary
This paper introduces LLMTabBench, a benchmark designed to evaluate the performance of Large Language Models (LLMs) on binary tabular classification tasks under conditions of limited labeled data. The authors highlight the challenges faced by traditional machine learning models in low-data scenarios and propose LLMs as a flexible alternative capable of zero-shot and few-shot learning. The benchmark systematically investigates how LLMs utilize prior knowledge and in-context information (such as task descriptions and few-shot examples) to perform classification. The study reveals that LLMs can be highly competitive in zero-shot settings, sometimes outperforming models that rely on few-shot examples. However, the incorporation of additional few-shot examples can sometimes conflict with the LLMs' prior knowledge, leading to degraded performance. The authors also identify a threshold of data complexity beyond which LLM performance declines, suggesting that few-shot examples become less effective in more complex scenarios. Overall, the findings provide insights into the constraints of in-context learning for tabular data and offer practical guidance for deploying LLMs in low-data environments.
Methodology
The authors developed LLMTabBench to systematically evaluate LLMs on binary tabular classification tasks. They conducted experiments using both real-world and synthetic datasets, analyzing how LLMs interact with prior knowledge and in-context information. The study also introduced a concept of data complexity to assess task difficulty and performed statistical significance testing to ensure reliable model comparisons.
Results
The results indicate that LLMs can effectively perform zero-shot classification, often outperforming traditional models and few-shot learning approaches. However, the performance of LLMs declines when faced with increased data complexity, and the addition of few-shot examples can sometimes hinder performance due to conflicts with prior knowledge.
Implications
The findings suggest that LLMs can serve as a powerful tool for tabular classification in scenarios where labeled data is scarce. The insights gained from this research can guide practitioners in effectively deploying LLMs for real-world applications in various domains, including healthcare and finance, where data scarcity is a common challenge.
Test-Time Training Undermines Safety Guardrails
NLP
Large Language Models
Generative Models
- TTT introduces new vulnerabilities that can be exploited to bypass safety filters in models.
- Three threat models for TTT are identified: self-supervised, few-shot, and generation-phase.
- TTT can significantly increase the Attack Success Rate (ASR), with rates up to 95% in certain configurations.
- Standard evaluation methods may overestimate ASR due to TTT-induced overfitting, necessitating a validity-aware evaluation approach.
Read more
Test-Time Training Undermines Safety Guardrails
Summary
This paper investigates the vulnerabilities introduced by Test-Time Training (TTT), a method that allows models to adapt their parameters during inference to enhance performance on various tasks. The authors identify three threat models associated with TTT that adversaries can exploit to bypass safety mechanisms in models. Their experiments reveal that TTT significantly increases the Attack Success Rate (ASR), with rates reaching as high as 95% and 93% for different threat models across various model families. The study also highlights the fragility of safety alignment under TTT, demonstrating that even benign adaptations can degrade safety. To address these vulnerabilities, the authors propose a validity-aware evaluation method to correct for overfitting and a lightweight provider-side detector to flag potentially harmful TTT requests. The findings underscore the need for robust defenses against TTT-induced vulnerabilities in production settings.
Methodology
The authors formalize the TTT attack surface through three distinct threat models and conduct experiments across several open-weight models to assess the impact of TTT on safety alignment. They also introduce a validity-aware evaluation pipeline to correct for overfitting effects and propose a detection mechanism based on perplexity shifts.
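The summary does not say how the perplexity-shift detector is built, so the following is only a sketch of one plausible reading: score a fixed probe text before and after a test-time update and flag the request if perplexity moves by more than a threshold. The probe-text setup, the function names, and the `max_ratio` value are all hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_request(logprobs_before, logprobs_after, max_ratio=1.5):
    """Flag an update whose perplexity shift exceeds max_ratio in either direction.

    max_ratio is a hypothetical threshold; in practice it would need to be
    calibrated on benign adaptation requests.
    """
    before = perplexity(logprobs_before)
    after = perplexity(logprobs_after)
    ratio = max(after / before, before / after)
    return ratio > max_ratio


# Toy log-probabilities for the same probe text, before and after adaptation.
before = [math.log(0.5)] * 4   # perplexity 2
after = [math.log(0.25)] * 4   # perplexity 4
suspicious = flag_request(before, after)
```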
Results
The experiments show that TTT can lead to an average ASR@10 of 95% and 93% for the few-shot and generation-phase threat models, respectively. Even self-supervised TTT on clean prompts raised ASR from 4% to 17%. The proposed validity-aware evaluation method corrected for overestimation of ASR by up to 13 percentage points. The lightweight detector effectively flagged harmful requests with a high true positive rate.
Implications
The findings suggest that TTT poses significant risks to the safety of machine learning models, particularly in production environments. The proposed detection methods could help mitigate these risks, but further research is needed to develop robust defenses against TTT vulnerabilities.
Predicting Stock Price Direction on Earnings Announcement Days using Multi-modal Deep Learning
Multimodal
Time Series
NLP
- The study integrates multiple data modalities: firm fundamentals, market dynamics, and news sentiment.
- LSTM and Transformer models are evaluated against a logistic regression baseline for predicting stock price direction on EA days.
- The Transformer model outperforms the LSTM in identifying volatile price movements, achieving a higher macro F1-score.
- Incorporating sentiment analysis from financial news significantly improves prediction accuracy.
Read more
Predicting Stock Price Direction on Earnings Announcement Days using Multi-modal Deep Learning
Summary
This paper addresses the challenge of predicting stock price movements during Earnings Announcements (EAs), which are often characterized by significant price volatility. The authors propose a multi-modal deep learning approach that integrates pre-announcement news sentiment, firm fundamentals, and recent market dynamics to predict the directional price movement of stocks on EA days. They construct a feature space comprising 15 fundamental metrics, 3 technical indicators, and sentiment scores derived from financial news processed using FinBERT. The study compares the performance of a Long Short-Term Memory (LSTM) network and a Transformer-based architecture against a logistic regression baseline. The results indicate that while the LSTM model is more precise, the Transformer model excels in sensitivity to volatile movements, achieving a higher macro F1-score. The ablation studies demonstrate that incorporating news sentiment consistently enhances model performance, highlighting the importance of soft information in predicting stock price movements during critical events.
Methodology
The authors define the task as Event-Driven Price Movement Prediction, utilizing a multi-modal temporal sequence of features from the 30 days prior to an EA. They extract data from financial news and firm fundamentals, construct a heterogeneous feature space, and preprocess the data for modeling. The models are trained using LSTM and Transformer architectures, with an ablation study to assess the impact of sentiment features.
Results
The LSTM model shows higher precision through a conservative strategy, while the Transformer model achieves superior sensitivity to price volatility, reflected in a higher macro F1-score. The inclusion of sentiment features consistently enhances model performance, indicating their value in predicting stock price movements during EAs.
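Macro F1 averages per-class F1 scores with equal weight, so it penalizes a model that rarely predicts one of the classes. The sketch below computes it from scratch on toy labels; the "up"/"down" class names and both prediction lists are made up for illustration and are not results from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["up", "down"]
y_true = ["up", "up", "down", "down", "down", "down"]

# A conservative predictor that never commits to the minority class.
conservative = ["down"] * 6
# A more sensitive predictor that catches both "up" days at the cost of one false alarm.
sensitive = ["up", "up", "up", "down", "down", "down"]

f1_conservative = macro_f1(y_true, conservative, labels)
f1_sensitive = macro_f1(y_true, sensitive, labels)
```

In this toy case the conservative predictor is right on four of six days yet has a macro F1 of only 0.4, because its F1 on the "up" class is zero.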
Implications
The findings suggest that integrating sentiment analysis with traditional financial metrics can improve stock price prediction models, particularly during high-volatility events like Earnings Announcements. This approach may aid investors in making more informed decisions based on a comprehensive understanding of market dynamics.
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
NLP
Large Language Models
Generative Models
- Complete-muE provides a systematic approach for hyperparameter transfer across dense FFN and MoE models.
- The framework utilizes a two-bridge system to address the complexities of architecture and token count changes.
- Empirical results show that hyperparameters tuned on a dense model can be effectively transferred to various MoE configurations.
- The method enables significant convergence speedups in MoE models, reducing the need for extensive hyperparameter searches.
Read more
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
Summary
The paper introduces Complete-muE, a novel framework designed to facilitate the transfer of hyperparameters across dense Feed-Forward Networks (FFN) and Mixture-of-Experts (MoE) setups within transformer architectures. Existing methods like µP and SDE are limited in their ability to handle the complexities of hyperparameter transfer when both architecture and token counts change, which is common in MoE configurations. Complete-muE addresses this challenge through a two-bridge system: Bridge I connects dense FFN to Dense MoE using active-width µP, while Bridge II links Dense MoE to sparse MoE through activated-expert scaling. This framework allows for the effective transfer of hyperparameters across various configurations, including changes in activated experts, total capacity, and network dimensions. The authors conducted extensive experiments on language modeling and diffusion tasks, demonstrating that Complete-muE achieves stable hyperparameter transfer with minimal drift, enabling significant speedups in convergence for MoE models compared to dense models. The findings suggest that tuning hyperparameters on a single dense reference model can be effectively transferred to all MoE configurations, streamlining the tuning process and enhancing model performance.
Methodology
The authors developed a two-bridge system for hyperparameter transfer: Bridge I maps dense FFN to Dense MoE using active-width µP, while Bridge II connects Dense MoE to sparse MoE through activated-expert scaling. They validated the framework through controlled experiments and large-scale training on language modeling and diffusion tasks, assessing the stability of hyperparameter transfer across different configurations.
Results
The experiments confirmed that Complete-muE yields stable hyperparameter optima across various MoE architectures, with only minor drift observed. The framework facilitated a convergence speedup of approximately 4.5× for video diffusion models and 5.3×–5.5× for large language models with 100k training iterations, demonstrating its effectiveness in practical applications.
Implications
Complete-muE has the potential to simplify the hyperparameter tuning process for MoE models, making it easier to scale model capacity without extensive searches. This could lead to more efficient training practices in various applications, particularly in natural language processing and generative modeling.
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
NLP
Large Language Models
- Introduces Quantitative Goal Persistence (QGP) as a new evaluation target for language agents.
- Develops PushBench, a benchmark for measuring agent performance in long-horizon tasks.
- Evaluates controller-level interventions that improve task completion rates and reduce errors.
- Demonstrates that traditional success metrics can obscure important failure modes in agent performance.
Read more
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
Summary
This paper introduces the concept of Quantitative Goal Persistence (QGP), which addresses the challenge of ensuring that long-horizon language agents persist in their tasks until a specified number of valid outputs are confirmed by an external verifier. The authors develop PushBench, a benchmark designed to evaluate agents' performance in repository-artifact collection and verifier-backed work units. The benchmark measures various failure modes such as duplicate submissions and premature stopping, which are often obscured by traditional success metrics. The study evaluates different controller strategies, including a state-tracking retrieval controller and a backlog-tracking work-unit controller, demonstrating their effectiveness in reducing repeated work and improving task completion rates. The results indicate that while advanced models can solve many tasks, they still struggle with maintaining verified progress when the task complexity increases. Overall, the paper emphasizes the importance of QGP as a distinct reliability requirement for agents operating in complex environments.
Methodology
The authors formalize QGP and create PushBench, which includes two controlled task families: repository-artifact collection and verifier-backed work units. They implement state-tracking and backlog-tracking controllers to monitor agent progress and evaluate their effectiveness against passive and memory-agent baselines. The evaluation includes coding-style assessments and black-box frontier-agent tests to assess the transferability of findings to more realistic settings.
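A state-tracking controller of this kind can be sketched as a loop that remembers what has already been submitted and keeps going until the verifier has confirmed the required number of items. The function name `run_controller`, the result fields, and the file-name example are hypothetical and are not PushBench's actual interface.

```python
def run_controller(candidates, verify, quota):
    """Persist until `quota` distinct items pass the external verifier."""
    seen = set()        # everything submitted so far, valid or not
    verified = []       # items confirmed by the verifier
    duplicates = 0      # repeated submissions that were blocked
    for item in candidates:
        if len(verified) >= quota:
            break                   # goal reached, stop working
        if item in seen:
            duplicates += 1         # do not resubmit known items
            continue
        seen.add(item)
        if verify(item):
            verified.append(item)
    return {
        "verified": verified,
        "duplicates_blocked": duplicates,
        "goal_met": len(verified) >= quota,
    }


# Toy stream of agent proposals with one duplicate and one invalid artifact.
stream = ["a.py", "b.py", "a.py", "notes.txt", "c.py", "d.py"]
result = run_controller(stream, verify=lambda path: path.endswith(".py"), quota=3)
```

The `goal_met` flag separates "stopped because the quota was verified" from "ran out of candidates", which is the distinction a plain success metric can hide.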
Results
The state-tracking retrieval controller achieves a success rate of 69–78% while eliminating duplicate submissions, whereas the backlog-tracking controller reaches 25–50% success in verifier-backed tasks. In contrast, standard and completion-gated controllers fail to complete any task instances in these evaluations. The findings reveal that QGP failures persist even with stronger models and memory mechanisms, particularly as task complexity increases.
Implications
The study suggests that ensuring QGP can significantly enhance the reliability of language agents in practical applications, particularly in environments requiring extensive data collection and verification. This could lead to improved performance in various domains, including software development, data analysis, and automated research.
Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation
Robotics
Reinforcement Learning
Theory
- Introduction of GPF networks for continual learning in dynamic environments.
- Theoretical grounding in random matrix theory to ensure stability in learning.
- Empirical success in olfactory navigation tasks with a 94% success rate.
- Generalization of GPFs to other machine learning tasks.
Read more
Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation
Summary
The paper introduces Grow-Prune-Freeze (GPF) networks, a novel framework designed for adaptive and continual learning in the context of olfactory navigation, which is characterized by dynamic and non-stationary environments. Traditional deep learning models struggle in such settings due to their static nature and reliance on large, standardized datasets, which are often unavailable in the field of olfaction. GPF networks allow an agent to dynamically adjust its architecture by growing, pruning, and freezing layers of its neural network based on the complexity of the environment. The authors ground their approach in non-linear random matrix theory, extending previous work to multi-layer models and demonstrating that the eigenvalue composition of network weights remains stable as layers are added. Empirical results show that GPFs achieve a 94% success rate in turbulent plume navigation, a task that simulates real-world challenges in robotics. The framework is also shown to generalize well to other machine learning tasks, including reinforcement learning in Atari games and image classification. The authors provide open-source code and data to promote further research in olfactory robotics.
Methodology
The GPF framework employs a combination of growing, pruning, and freezing mechanisms to adapt the neural network architecture in response to environmental complexity. It utilizes Expected SARSA as the learning algorithm and is grounded in random matrix theory to ensure spectral stability of the model as it adapts.
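For reference, Expected SARSA bootstraps from the expectation of the next state's action values under the current policy rather than from a single sampled action. The tabular sketch below uses an epsilon-greedy policy. The table layout, function names, and hyperparameter values are assumptions for illustration, and GPF itself uses a neural network rather than a table.

```python
def epsilon_greedy_probs(action_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over one state's values."""
    n = len(action_values)
    best = max(range(n), key=lambda a: action_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def expected_sarsa_update(q, s, a, reward, s_next, alpha, gamma, epsilon, done=False):
    """Apply one Expected SARSA update to the table `q` in place and return the TD error."""
    if done:
        expected_next = 0.0
    else:
        probs = epsilon_greedy_probs(q[s_next], epsilon)
        expected_next = sum(p * v for p, v in zip(probs, q[s_next]))
    td_error = reward + gamma * expected_next - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


# Toy two-state table with two actions per state.
q = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
td = expected_sarsa_update(q, "s0", 0, reward=1.0, s_next="s1",
                           alpha=0.5, gamma=0.9, epsilon=0.2)
```

Here the expected next value is 0.1 * 1.0 + 0.9 * 3.0 = 2.8, so the target is 1.0 + 0.9 * 2.8 = 3.52 and `q["s0"][0]` moves halfway there, to about 1.76.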
Results
GPF networks achieved a 94% success rate in turbulent plume navigation tasks, demonstrating effective continual learning in partially observable environments. The framework also showed promising results in other tasks, indicating its versatility and potential for broader applications.
Implications
The GPF framework has significant implications for robotics, particularly in applications requiring real-time learning and adaptation to changing environments. It opens avenues for improved olfactory navigation in robots and may enhance performance in various machine learning tasks beyond olfaction.
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
NLP
Large Language Models
Generative Models
- Introduction of Learned Relay Representations (Relay) for MDMs.
- Relay allows MDMs to retain and utilize latent information across decoding steps.
- Demonstrated effectiveness on a Sudoku-based planning task and Fast-dLLM v2.
- Outperformed standard supervised finetuning on coding tasks with reduced inference latency.
Read more
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
Summary
This paper introduces Learned Relay Representations (Relay) to enhance Masked Diffusion Models (MDMs) by enabling them to retain and propagate latent information across denoising steps. Traditional MDMs face a 'hard reset' problem where valuable internal computations are discarded after each step, limiting their ability to leverage previous information. Relay addresses this by incorporating a differentiable per-token channel that forwards hidden states from one step to the next, trained via truncated backpropagation through time (BPTT). The authors validate Relay on a Sudoku-based planning task and scale it to Fast-dLLM v2, a state-of-the-art Diffusion Language Model (DLM). The results show that Relay not only outperforms standard supervised finetuning on coding tasks but also reduces inference latency by up to 32%. This work demonstrates that MDMs can be effectively trained to relay latent information, thus improving performance while maintaining efficiency.
Methodology
The authors propose Relay, which integrates recurrent computation into MDMs by training the model to pass learned latent states forward across decoding steps. This is achieved through truncated backpropagation through time (BPTT), allowing the model to maintain continuity in its computations and enhance its predictive capabilities.
Results
Relay was validated on a Sudoku planning task and scaled to Fast-dLLM v2, achieving superior performance compared to standard supervised finetuning on coding tasks. Additionally, it demonstrated a reduction in inference latency by up to 32%, indicating significant efficiency improvements.
Implications
The introduction of Relay has the potential to advance the capabilities of MDMs in various applications, particularly in tasks requiring complex reasoning and sequential decision-making. This method could lead to more efficient and effective models in natural language processing and other domains reliant on iterative refinement.
Complement Submodular Information Measures for Balanced and Robust Data Selection
Optimization
Theory
- Introduction of Complement Submodular Information (CSI) as a new class of submodular objectives.
- Derivation of complement-aware variants of classical submodular functions with theoretical guarantees.
- Empirical evidence showing improved performance of CSI objectives in robust subset selection tasks.
- CSI objectives effectively preserve structural balance across selected subsets and their complements.
Read more
Complement Submodular Information Measures for Balanced and Robust Data Selection
Summary
This paper introduces Complement Submodular Information (CSI), a novel framework for complement-aware submodular objectives that enhances data selection processes in machine learning. Traditional submodular optimization focuses solely on the selected subset, neglecting the structural information of the remaining data. The CSI framework addresses this limitation by quantifying the shared structural information between a selected subset and its complement, thereby promoting balanced representation across both. The paper derives complement-aware variants of several classical submodular functions, including Facility Location and Graph Cut, and establishes theoretical properties such as approximate monotonicity and near-(1 − 1/e) approximation guarantees. Empirical evaluations demonstrate that CSI objectives outperform standard submodular objectives in robust subset selection tasks, particularly in preserving coherent rare/tail semantic structures while minimizing the impact of noisy outliers. The findings suggest that CSI objectives can significantly enhance downstream predictive performance in various machine learning applications, including train/validation/test splitting and benchmark construction.
Methodology
The paper develops a framework for CSI objectives by defining complement information between a subset and its complement. It analyzes the theoretical properties of these objectives, demonstrating their approximate monotonicity under certain conditions. The methodology includes deriving complement-aware variants of classical submodular functions and conducting empirical evaluations through synthetic experiments and real-world applications.
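As a point of reference for the classical objectives mentioned above, the following is a minimal sketch of greedy maximization of the standard Facility Location function. It does not reproduce the paper's complement-aware variants; the function names, the exponential similarity kernel, and the toy two-cluster data are illustrative assumptions.

```python
import numpy as np

def facility_location(sim, selected):
    """Classical Facility Location: each point is credited with its
    best similarity to any selected item."""
    if not selected:
        return 0.0
    return float(sim[:, selected].max(axis=1).sum())

def greedy_select(sim, budget):
    """Standard greedy: repeatedly add the item with the largest marginal gain."""
    selected = []
    for _ in range(budget):
        base = facility_location(sim, selected)
        best_gain, best_item = -np.inf, None
        for j in range(sim.shape[0]):
            if j in selected:
                continue
            gain = facility_location(sim, selected + [j]) - base
            if gain > best_gain:
                best_gain, best_item = gain, j
        selected.append(best_item)
    return selected

# Toy data (assumption): two well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
sim = np.exp(-dists)  # non-negative similarity kernel (assumption)
chosen = greedy_select(sim, 2)
```

With a budget of two, the greedy pass picks one representative per cluster, which is the coverage behavior that complement-aware objectives are described as extending to the unselected data.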
Results
The results indicate that CSI objectives consistently outperform traditional submodular objectives in tasks requiring robust subset selection. They show significant improvements in preserving the balance between rare and common semantic structures while effectively suppressing outliers. The theoretical analysis provides near-optimal approximation guarantees for greedy optimization of CSI objectives.
Implications
The introduction of CSI objectives has potential applications in various machine learning domains, particularly in scenarios where balanced representation and robustness are critical, such as dataset construction, active learning, and representation learning. The framework can enhance the quality of data selection processes, leading to improved model performance and generalization.
Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models
Generative Models
Optimization
- Introduces a novel method for dual-target molecular design without retraining generative models.
- Formulates the problem as a constrained multi-objective optimization over the input space of a frozen model.
- Proposes REUSE, an evolutionary input-space search framework that balances efficiency and quality.
- Achieves a 20.9-percentage-point improvement in Dual High Affinity compared to the best prior methods.
Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models
Summary
This paper addresses the challenge of designing dual-target molecules, which are essential for polypharmacology but difficult to generate due to the need for compatibility with two binding sites while maintaining drug-like properties. Traditional methods either retrain generative models or modify the diffusion process, both of which have drawbacks such as high costs and instability. The authors propose a novel approach called REUSE, which formulates dual-target molecular design as a constrained multi-objective optimization problem over the input space of a frozen single-target diffusion model. This method allows for the recovery of dual-target candidates without altering the model's parameters or denoising dynamics. REUSE employs a hierarchical evolutionary input-space search framework that combines pair-conditioned exploration with structured multi-stage selection to ensure dual-target affinity, chemical quality, and diversity. Experimental results demonstrate that REUSE outperforms existing methods that modify the diffusion process, achieving a significant improvement in dual-target affinity while maintaining competitive molecular quality.
Methodology
The authors propose REUSE, a hierarchical evolutionary input-space search framework that iteratively refines a population of input states. It decodes these states through a frozen single-target diffusion model, utilizing family-level evidence for population updates. The method incorporates a cost-aware multi-stage selection process to efficiently prune candidates based on affinity and chemical quality, concentrating computational resources on the most promising candidates.
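The general idea of searching over the inputs of a frozen generator can be sketched with a toy elitist evolutionary loop. Everything here is a stand-in: the `frozen_decode` function replaces the diffusion model, the two affinity functions and the quality check are made-up scalar scores, and the selection rule is a plain sort rather than the paper's hierarchical, cost-aware procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # weights of the frozen "decoder"; never updated

def frozen_decode(z):
    """Stand-in for decoding an input state through a frozen model."""
    return np.tanh(W @ z)

def affinity_a(m):  # hypothetical score for target A
    return -np.sum((m - 0.5) ** 2)

def affinity_b(m):  # hypothetical score for target B
    return -np.sum((m + 0.2) ** 2)

def quality_ok(m):  # hypothetical chemical-quality constraint
    return np.abs(m).max() < 0.95

def fitness(z):
    """Feasible candidates rank first, then by the weaker of the two affinities."""
    m = frozen_decode(z)
    return (bool(quality_ok(m)), float(min(affinity_a(m), affinity_b(m))))

def evolve(pop, generations=30, sigma=0.3):
    for _ in range(generations):
        children = pop + rng.normal(scale=sigma, size=pop.shape)
        pool = np.vstack([pop, children])  # parents stay in the pool (elitism)
        order = sorted(range(len(pool)), key=lambda i: fitness(pool[i]), reverse=True)
        pop = pool[order[: len(pop)]]
    return pop

init = rng.normal(size=(16, 4))
final = evolve(init.copy())
best_init = max(fitness(z) for z in init)
best_final = max(fitness(z) for z in final)
```

Only the input population changes between generations; the decoder's parameters are untouched, which is the property the summary emphasizes.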
Results
REUSE significantly improves dual-target affinity, achieving a 20.9-percentage-point gain in Dual High Affinity over the strongest prior baseline. The method also maintains competitive molecular quality, demonstrating its effectiveness in recovering balanced dual-target candidates.
Implications
This research has potential applications in drug discovery, particularly for complex diseases that require polypharmacological approaches. By enabling the efficient design of dual-target molecules, it could lead to more effective treatments with reduced side effects and improved patient outcomes.
Any-Dimensional Invariant Universality
Theory
Graph Learning
Efficient ML
- Develops a systematic approach to establish any-dimensional universality in machine learning models.
- Identifies the importance of an infinite-dimensional limit space for analyzing universality across varying input sizes.
- Critiques existing architectures for their failure to achieve universality and proposes modifications to restore it.
- Highlights the role of symmetries and norm choices in proving universality in infinite-dimensional spaces.
Any-Dimensional Invariant Universality
Summary
This paper addresses the concept of universality in any-dimensional machine learning models, which are designed to handle inputs of varying sizes, such as graphs and point clouds. Traditional universality is typically studied in the context of fixed-size inputs, leading to a gap in understanding how these any-dimensional models can universally approximate functions across different dimensions. The authors propose a systematic approach to establish any-dimensional universality by identifying these models with a unique function in an infinite-dimensional limit space that encompasses all finite sizes and their limits. They demonstrate that this limit space has a natural topology that supports the establishment of universality. The paper also critiques existing architectures, showing that many fail to achieve universality, and proposes modifications to restore it. The contributions include a general strategy for proving any-dimensional invariant universality, which emphasizes the importance of exploiting symmetries and the choice of norms in infinite-dimensional spaces. The authors apply their framework to sets, graphs, and point clouds, revealing discontinuities in widely used architectures and suggesting new designs that ensure continuity and universality.
Methodology
The authors adopt a framework that views input spaces as a nested sequence of increasing-dimensional vector spaces, embedding them into an infinite-dimensional limit space. They establish a unique extension of any-dimensional models in this space and prove universality by leveraging symmetries and compact sets. The methodology includes analyzing existing architectures and proposing necessary modifications to achieve continuity and universality.
Results
The paper shows that several widely used architectures for any-dimensional models either yield discontinuous extensions in the natural topology or fail to be universal. By applying their framework, the authors propose modifications and new architectures that restore continuity and achieve universality across dimensions.
Implications
The findings have significant implications for the design and evaluation of machine learning models that operate on inputs of varying dimensions. By establishing a clearer understanding of universality in this context, the work can guide future research and development of more robust and expressive architectures in fields such as graph learning and point cloud processing.
Joint Optimization of Training and Inference in Federated Edge Learning via Constrained Multi-Objective Deep Reinforcement Learning
Federated Learning
Reinforcement Learning
Optimization
- Introduces a joint optimization framework for federated training and inference in FEEL systems.
- Develops a tandem-queue-inspired mechanism to link inference requests with training data.
- Proposes the C-MOPPO algorithm to address the NP-hard multi-objective optimization problem.
- Demonstrates significant performance improvements over baseline methods in terms of accuracy, latency, and energy consumption.
Joint Optimization of Training and Inference in Federated Edge Learning via Constrained Multi-Objective Deep Reinforcement Learning
Summary
This paper presents a novel framework for optimizing federated edge learning (FEEL) systems by jointly managing federated training and inference on resource-constrained edge devices. The authors introduce a tandem-queue-inspired conversion mechanism that connects inference requests with training data, while considering both data and model freshness to enhance accuracy in dynamic environments. The optimization problem is framed as a multi-objective optimization challenge, which is NP-hard and complicated by the online nature of the setting. To tackle this, the authors develop a constrained multi-objective proximal policy optimization (C-MOPPO) algorithm that learns a set of policies catering to different objectives and employs constrained policy optimization to enrich the Pareto front for high-quality solutions. Extensive experiments show that C-MOPPO effectively balances trade-offs among objectives, outperforming baseline methods across various configurations.
Methodology
The authors transform the optimization problem into a multi-objective Markov decision process (MOMDP) and develop the C-MOPPO algorithm, which learns diverse policies for different objectives and utilizes constrained policy optimization to enhance the quality of solutions in the Pareto front.
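To make the notion of a Pareto front concrete, here is a small, generic non-dominated filter. It is not part of C-MOPPO itself; the tuple layout (all objectives to be maximized, so latency and energy would be negated) is an assumption made for the example.

```python
def pareto_front(points):
    """Return the points that no other point dominates.
    All objectives are assumed to be maximized."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical policy evaluations as (objective_1, objective_2) pairs.
candidates = [(1, 5), (2, 4), (1, 4), (3, 1), (0, 0)]
front = pareto_front(candidates)
```

Here `(1, 4)` and `(0, 0)` are dropped because another candidate is at least as good on both objectives and strictly better on one.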
Results
The C-MOPPO algorithm achieves well-balanced trade-offs among multiple objectives, significantly outperforming baseline approaches in various system configurations, thereby demonstrating its effectiveness in optimizing both inference accuracy and resource utilization.
Implications
The proposed framework has significant implications for real-time applications in edge computing, particularly in scenarios requiring low latency and high privacy, such as autonomous vehicles and IoT systems. It paves the way for more efficient resource management in federated learning environments.
Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training
Efficient ML
- NMP-QAT allows each neuron to learn its own quantization precision, enhancing adaptability.
- The framework supports both weights-only and weights + activations quantization.
- Empirical results show improved compression-accuracy trade-offs over traditional mixed-precision methods.
- NMP-QAT is designed for efficient deployment on resource-constrained 6G edge devices.
Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training
Summary
This paper presents Neuron-Level Mixed-Precision Quantization Aware Training (NMP-QAT), a novel framework designed to optimize the deployment of deep neural networks on resource-constrained 6G edge devices. Traditional mixed-precision quantization methods operate at a coarse granularity (layer or channel level), which can overlook the fine-grained variability of individual neurons. NMP-QAT addresses this by allowing each neuron to independently learn its quantization precision during training, starting from low-bit precision and expanding only when necessary based on training signals. This adaptability applies to both weights and activations, significantly reducing memory movement and improving efficiency. The methodology employs differentiable surrogates and straight-through estimators to maintain a fully discrete inference graph while enabling fine-grained control over quantization. The framework was empirically validated on telecom and general-purpose datasets using MLP and tabular foundation model architectures, demonstrating superior compression-accuracy trade-offs compared to existing mixed-precision QAT baselines. The results indicate that NMP-QAT is particularly well-suited for Green AI applications at the network edge, where efficiency and minimal accuracy loss are critical.
Methodology
NMP-QAT employs a training paradigm where each neuron learns its quantization precision through a combination of differentiable surrogates and straight-through estimators. The approach initializes neurons at low-bit precision and allows for bit-width expansion based on the training signals. It supports two configurations: weights-only quantization and weights + activations quantization, enabling fine-grained control over both weights and activations.
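A rough sketch of what per-neuron precision can look like in the forward pass is given below, using symmetric uniform fake-quantization with one bit-width per output row. The growth rule in `expand_bits` (raise a neuron's bit-width when its quantization error exceeds a tolerance) is a hypothetical stand-in for the paper's training signals, and bit-widths are assumed to be at least 2. In an autograd framework, the straight-through estimator would treat the rounding step as the identity in the backward pass.

```python
import numpy as np

def quantize_per_neuron(weights, bits):
    """Symmetric uniform fake-quantization with one bit-width per row (neuron).
    Assumes every entry of `bits` is >= 2."""
    weights = np.asarray(weights, dtype=np.float64)
    bits = np.asarray(bits)
    levels = 2.0 ** (bits - 1) - 1                 # largest integer magnitude per row
    scale = np.abs(weights).max(axis=1) / levels   # per-row step size
    scale = np.where(scale == 0, 1.0, scale)       # guard all-zero rows
    q = np.round(weights / scale[:, None])
    q = np.clip(q, -levels[:, None], levels[:, None])
    return q * scale[:, None]

def expand_bits(weights, bits, tol=1e-3, max_bits=8):
    """Hypothetical growth rule: add one bit to neurons whose mean squared
    quantization error is above `tol`, up to `max_bits`."""
    bits = np.asarray(bits)
    err = ((weights - quantize_per_neuron(weights, bits)) ** 2).mean(axis=1)
    return np.where((err > tol) & (bits < max_bits), bits + 1, bits)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
bits = np.array([2, 4, 8])
W_q = quantize_per_neuron(W, bits)
new_bits = expand_bits(W, bits)
```

Starting every neuron at a low bit-width and calling a rule like `expand_bits` periodically mirrors the "expand only when necessary" behavior described above, under the stated assumptions.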
Results
The empirical evaluation of NMP-QAT on telecom and non-telecom datasets revealed that it outperforms existing mixed-precision QAT baselines in terms of accuracy and efficiency. The method achieves significant compression with minimal accuracy loss, making it a viable option for deployment in resource-constrained environments.
Implications
The findings suggest that NMP-QAT can facilitate the deployment of deep learning models in edge computing scenarios, particularly in regions with limited resources. This approach aligns with the goals of Green AI by promoting energy-efficient and sustainable AI solutions.
Disentangled Double Machine Learning for Accurate Causal Effect Estimation
Theory
- DDML improves causal effect estimation by addressing confounding bias more effectively than traditional methods.
- The Causal Role Disentanglement strategy enhances nuisance function estimation by separating covariates into distinct causal roles.
- The Residual Dependence Orthogonalization strategy mitigates residual dependence, improving the precision of causal estimates.
- Extensive experiments show that DDML outperforms existing algorithms in various datasets, indicating its robustness.
Disentangled Double Machine Learning for Accurate Causal Effect Estimation
Summary
This paper addresses the challenge of confounding bias in causal effect estimation from observational data, which is particularly problematic in high-dimensional or finite-sample scenarios. The authors propose a novel algorithm called Disentangled Double Machine Learning (DDML) that improves upon the traditional Double Machine Learning (DML) framework. DDML introduces two key strategies: a Causal Role Disentanglement (CRD) strategy that decomposes covariates into confounders, treatment-specific factors, and outcome-specific factors, and a Residual Dependence Orthogonalization (RDO) strategy that reduces residual dependence caused by errors in nuisance function estimation. By disentangling the roles of covariates, DDML enhances the reliability of nuisance function estimation, which is crucial for accurate causal effect estimation. The authors validate their approach through extensive experiments on synthetic, semi-synthetic, and real-world datasets, demonstrating that DDML significantly outperforms 13 state-of-the-art baseline algorithms in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
Methodology
The authors developed the DDML algorithm, which integrates two main strategies: Causal Role Disentanglement (CRD) to categorize covariates into confounders, treatment-specific, and outcome-specific factors, and Residual Dependence Orthogonalization (RDO) to reduce residual dependence between treatment and outcome errors. This approach allows for more accurate estimation of nuisance functions and causal effects.
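For context, the baseline DML recipe that DDML builds on can be sketched as cross-fitted residual-on-residual regression in a partially linear model. This is the standard procedure, not the CRD or RDO strategies; ordinary least squares is used here as the nuisance learner purely to keep the example dependency-free, and the synthetic data-generating process is an assumption.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_test):
    """Least-squares nuisance model with an intercept."""
    Xtr = np.column_stack([np.ones(len(X_train)), X_train])
    Xte = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(Xtr, y_train, rcond=None)
    return Xte @ coef

def dml_partial_linear(X, t, y, n_folds=2, seed=0):
    """Cross-fitted DML: residualize treatment and outcome on covariates,
    then regress the outcome residual on the treatment residual."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    t_res = np.empty(len(y))
    y_res = np.empty(len(y))
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        t_res[test_idx] = t[test_idx] - ols_fit_predict(X[train_idx], t[train_idx], X[test_idx])
        y_res[test_idx] = y[test_idx] - ols_fit_predict(X[train_idx], y[train_idx], X[test_idx])
    return float(t_res @ y_res / (t_res @ t_res))

# Synthetic confounded data with a true effect of 2.0 (assumption).
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 5))
t = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.3]) + rng.normal(size=n)
y = 2.0 * t + X @ np.array([0.5, 0.5, -1.0, 0.0, 0.0]) + rng.normal(size=n)
theta_hat = dml_partial_linear(X, t, y)
```

In this setup the first two covariates act as confounders, the third affects only the outcome, and the fifth affects only the treatment, which is the kind of role separation that CRD is described as learning.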
Results
Experimental results indicate that DDML significantly outperforms 13 state-of-the-art baseline algorithms in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) across synthetic, semi-synthetic, and real-world datasets, demonstrating its effectiveness in improving causal effect estimation.
Implications
The findings suggest that DDML can be a valuable tool in fields requiring accurate causal effect estimation from observational data, such as economics, healthcare, and social sciences. Its ability to handle high-dimensional covariates and complex relationships makes it suitable for a wide range of applications.
On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits
Reinforcement Learning
Robotics
Theory
- Introduces a new framework for regret minimization that includes a free exploration phase.
- Develops the UFE-KLUCB-H algorithm, combining free exploration and regret minimization strategies.
- Establishes logarithmic scaling of the free exploration budget with respect to the time horizon.
- Demonstrates significant regret savings through simulations and theoretical analysis.
On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits
Summary
This paper investigates a novel stochastic multi-armed bandit problem where an agent is allowed a free exploration phase before any regret is incurred. This approach contrasts with traditional regret minimization strategies that require balancing exploration and exploitation from the start. The authors propose a two-phase algorithm, UFE-KLUCB-H, which incorporates a free exploration policy (UFE) followed by a history-aware regret minimization policy (KLUCB-H). They establish that the free exploration budget can scale logarithmically with the time horizon, leading to significant regret savings. The paper introduces (α, β)-probably saving policies to quantify the benefits of the free exploration phase and derives instance-dependent upper and lower bounds for the proposed algorithm, demonstrating its effectiveness compared to traditional methods. Simulations validate that the adaptive exploration strategy enhances regret minimization, revealing sharp phase transitions in accumulated regret based on the free exploration budget.
Methodology
The authors formalize the regret minimization problem with free exploration and propose a two-phase algorithm, UFE-KLUCB-H, which consists of a free exploration policy and a history-aware regret minimization policy. They derive instance-dependent upper and lower bounds using novel perturbation arguments tailored to the free exploration setting.
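The regret-minimization phase builds on KL-UCB, whose index for Bernoulli rewards can be computed by bisection as sketched below. This is the textbook index with a plain log(t) exploration budget, not the paper's UFE-KLUCB-H; the assumption in the comments is that pulls collected during a free phase would simply be counted in `pulls`.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t, iters=50):
    """Largest q >= mean with pulls * KL(mean, q) <= log(t), found by bisection.
    `pulls` may include samples gathered earlier (assumption: e.g. a free phase)."""
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

few_pulls = klucb_index(0.5, pulls=10, t=100)
many_pulls = klucb_index(0.5, pulls=1000, t=100)
```

An arm with more recorded pulls gets a tighter index, which illustrates why samples collected before regret starts accruing can reduce later exploration.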
Results
The UFE-KLUCB-H algorithm accumulates strictly less regret than traditional policies without free exploration. The theoretical analysis reveals sharp phase transitions in accumulated regret based on the free exploration budget, and simulations confirm the effectiveness of the adaptive exploration strategy.
Implications
This work has practical implications for various applications, such as robotics and A/B testing, where initial exploration can occur without immediate consequences. It provides a theoretical foundation for designing adaptive learning systems that can effectively balance exploration and exploitation in real-world scenarios.
It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
Large Language Models
NLP
- Geopolitical bias in LLMs originates from post-training, not pre-training.
- The language of the prompt significantly amplifies biases in LLM responses.
- Six out of seven models showed bias shifts towards the preferences of their developers after post-training.
- The study utilized a paired-scenario forced-choice probe across multiple languages.
It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
Summary
This paper challenges the prevailing assumption that geopolitical bias in large language models (LLMs) primarily arises from the pre-training data. Instead, the authors demonstrate that such biases are introduced during the post-training phase, influenced significantly by the language used in prompts. The study evaluates seven pairs of open-weight LLMs, comparing their base models (pre-training only) with their chat variants (post-training included) across multiple languages (English, French, Chinese) and geopolitical scenarios. The findings reveal that six out of seven models exhibited a shift in bias towards the geopolitical preferences of their developers after post-training, with the most pronounced shift observed in Alibaba's Qwen 2.5 model. The results indicate that the language of the prompt can further amplify these biases, suggesting that geopolitical preferences are actively shaped during post-training rather than merely inherited from training data. This highlights the need for increased transparency and oversight in the alignment processes of LLMs.
Methodology
The authors conducted a forced-choice multiple-choice question (MCQ) probe on seven pairs of open-weight LLMs (base and post-trained variants) from different labs. They evaluated the models' responses to 79 geopolitical scenarios in English, French, and Chinese to assess bias shifts.
Results
The study found that post-training introduced significant geopolitical bias in LLMs, with the most notable shift in Alibaba's Qwen 2.5 model, which changed from neutral to pro-China after post-training. Additionally, the language used in prompts influenced the degree of bias, with the French-made Mistral model showing pro-France bias only under French prompting.
Implications
The findings suggest that the biases in LLMs are not solely a reflection of their training data but are actively shaped during post-training. This has important implications for the development and deployment of LLMs, emphasizing the need for careful oversight and transparency in how these models are aligned with cultural and political perspectives.
Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers
NLP
Large Language Models
Interpretability
- Introduces a three-step recipe for identifying attention-head circuits in transformers.
- Demonstrates that a small induction circuit is necessary in all tested models.
- Shows that the per-head PR signal can predict seed-specific circuits without task labels.
- Finds a consistent fraction of heads (17-19%) engaged in specialized computation across different model sizes.
Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers
Summary
This paper introduces a novel methodology for identifying attention-head circuits in pretrained transformer models through a three-step recipe. The first step involves utilizing a spectral signal, specifically the time-integrated participation ratio (PR) of per-head attention outputs, to rank heads that perform sustained content-dependent computations without the need for labels or attribution gradients. The second step applies a task-pattern screen to filter the general indicator into a task-specific candidate circuit by measuring attention from relevant query positions to canonical target positions. The final step is a causal verification process that group-ablates the candidate circuit against a matched-random control, ensuring the causal claims are valid. The methodology is validated across a wide parameter range (51M to 1B parameters) and across different architecture families and pretraining pipelines. The findings reveal that a small induction circuit is causally necessary in all tested models, the PR signal is predictive of seed-specific circuits, and the fraction of heads engaged in identifiable specialized computation remains consistent across varying scales.
Methodology
The methodology consists of three steps: (1) calculating the time-integrated participation ratio (PR) for per-head attention outputs to identify heads performing sustained computations; (2) applying a task-pattern screen to filter heads into task-specific circuits; and (3) conducting causal verification through group-ablation against matched-random controls.
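The participation ratio in step (1) is commonly defined from the eigenvalues of a covariance matrix as (sum of eigenvalues) squared divided by the sum of squared eigenvalues. The sketch below computes that plain quantity for one head's outputs; the time integration and ranking used in the paper are not reproduced, and the input layout (rows are token positions, columns are output dimensions) is an assumption.

```python
import numpy as np

def participation_ratio(outputs):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    covariance of `outputs` (rows: samples, columns: dimensions).
    Assumes the outputs are not all identical."""
    X = outputs - outputs.mean(axis=0, keepdims=True)
    cov = X.T @ X / (len(X) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negative round-off
    return float(eig.sum() ** 2 / (eig ** 2).sum())

rng = np.random.default_rng(0)
rank_one = np.outer(np.linspace(-1, 1, 50), [1.0, 2.0, 3.0])  # variance along one direction
isotropic = rng.normal(size=(5000, 8))                        # variance spread over 8 directions
pr_low = participation_ratio(rank_one)
pr_high = participation_ratio(isotropic)
```

A value near 1 means the head's output varies along a single direction, while a value near the output dimension means the variance is spread evenly.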
Results
The results validate the proposed methodology across various models, confirming that a small induction circuit is necessary in all configurations. The PR signal effectively identifies seed-specific circuits without task labels, and the proportion of heads engaged in specialized computation remains stable across an 8× parameter scale.
Implications
The findings suggest that the proposed methodology can enhance the interpretability of transformer models by providing a systematic approach to identify and verify attention-head circuits. This could lead to better understanding of model behavior and capabilities, as well as inform future research in model design and training.
Approaching I/O-optimality for Approximate Attention
NLP
Large Language Models
Efficient ML
- Introduces a technique for computing attention with I/O costs that depend almost-linearly on sequence length.
- Establishes tight and nearly-tight bounds for I/O complexity across different parameter regimes.
- Demonstrates that the proposed method outperforms existing algorithms like FlashAttention in terms of I/O efficiency.
- Categorizes I/O complexity based on the interplay between fast memory size, feature dimension, and polynomial degree.
Approaching I/O-optimality for Approximate Attention
Summary
This paper addresses the I/O complexity of computing the attention matrix in large language models, focusing on minimizing data transfers between fast and slow memory. The authors present a novel technique that reduces the I/O cost of attention computation to almost-linear dependence on the sequence length (n), contrasting with existing methods like FlashAttention, which incur quadratic I/O costs. By leveraging the approximate attention framework proposed by Alman and Song, the authors develop I/O-efficient algorithms and establish lower bounds for various parameter regimes, demonstrating that their approach is close to I/O-optimal. The paper categorizes different scenarios based on the relationship between fast memory capacity (M), feature dimension (d), and the polynomial degree used for approximation, providing tight bounds for I/O complexity in each case. Overall, the findings indicate that the proposed approximate attention algorithm significantly outperforms classical attention methods in both computational and I/O complexity.
Methodology
The authors develop I/O-efficient algorithms inspired by the approximate attention framework of Alman and Song. They analyze the I/O complexity of attention computation under various conditions related to fast memory capacity, feature dimensions, and polynomial approximation degrees. The methodology includes theoretical proofs of upper and lower bounds for I/O complexity in different scenarios.
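Why a low-degree polynomial helps can be illustrated with a degree-2 Taylor surrogate for the exponential: the attention weights factor through an explicit feature map, so the n-by-n score matrix never has to be formed. This is a generic illustration of the polynomial idea only; it is not the authors' algorithm, it does not model fast and slow memory, and the degree and feature map are assumptions chosen for brevity.

```python
import numpy as np

def poly_features(X):
    """Feature map with phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    n, d = X.shape
    ones = np.ones((n, 1))
    quad = np.einsum("ni,nj->nij", X, X).reshape(n, d * d) / np.sqrt(2.0)
    return np.hstack([ones, X, quad])

def poly_attention_linear(Q, K, V):
    """Attention with polynomial weights, computed without the n x n matrix."""
    phi_q, phi_k = poly_features(Q), poly_features(K)
    kv = phi_k.T @ V              # (1 + d + d^2, d_v): one pass over keys and values
    norm = phi_k.sum(axis=0)      # per-feature normalizer
    return (phi_q @ kv) / (phi_q @ norm)[:, None]

def poly_attention_quadratic(Q, K, V):
    """Reference version that materializes all n^2 scores."""
    S = Q @ K.T
    A = 1.0 + S + S ** 2 / 2.0    # always positive, so the row sums are safe
    return (A @ V) / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 4)) * 0.5 for _ in range(3))
out_linear = poly_attention_linear(Q, K, V)
out_quadratic = poly_attention_quadratic(Q, K, V)
```

The two functions return the same result; the first touches each row of `Q`, `K`, and `V` a constant number of times, which is the kind of structure an I/O-efficient schedule can exploit.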
Results
The paper presents several theorems that outline the I/O complexity of the proposed approximate attention algorithm under different conditions. For instance, when fast memory is sufficiently large, the optimal I/O complexity is Θ(n·d). In cases where fast memory is smaller relative to the feature dimension and polynomial degree, the I/O complexity is bounded by more complex expressions, demonstrating significant improvements over traditional methods.
Implications
The findings have significant implications for the design of efficient algorithms in large language models, particularly in optimizing memory usage and computational efficiency. This work could lead to advancements in real-time applications where memory bandwidth is a critical constraint, enhancing the performance of models in NLP tasks.
Automated Random Embedding for Practical Bayesian Optimization with Unknown Effective Dimension
Optimization
- DSEBO automatically adjusts subspace dimensions during optimization, addressing the challenge of unknown effective dimensions.
- The shared embedding technique enhances initialization and convergence in higher-dimensional spaces.
- Theoretical analysis establishes a regret bound, indicating improved performance over traditional methods.
- Extensive experiments confirm DSEBO's superiority in high-dimensional optimization tasks.
Automated Random Embedding for Practical Bayesian Optimization with Unknown Effective Dimension
Summary
This paper addresses the challenge of optimizing complex black-box functions using Bayesian optimization (BO) in high-dimensional spaces, particularly when the effective dimension of the task is unknown. Traditional methods for dimension reduction, such as random embedding (RE), struggle with determining the appropriate subspace dimension, often relying on expert input or trial-and-error approaches that are resource-intensive. The authors propose a novel algorithm called Dynamic Shared Embedding Bayesian Optimization (DSEBO), which begins optimization in a low-dimensional subspace and dynamically adjusts to higher dimensions based on the convergence of solutions. DSEBO incorporates a shared embedding technique that utilizes solutions from lower dimensions to inform and initialize higher-dimensional searches, thereby enhancing convergence speed. Theoretical analysis provides a regret bound for DSEBO, demonstrating its ability to balance approximation and optimization errors more effectively than existing methods. Extensive experiments on synthetic and real-world tasks reveal that DSEBO significantly outperforms state-of-the-art approaches in terms of optimization regret and computational efficiency, showcasing its robustness across various hyper-parameter settings.
Methodology
The DSEBO algorithm initiates optimization in a low-dimensional subspace and dynamically transitions to higher dimensions based on the quality of solutions. It employs a shared embedding technique to leverage information from lower-dimensional solutions to improve the initialization of higher-dimensional searches. Theoretical analysis is conducted to derive a regret bound, and extensive empirical evaluations are performed on both synthetic and real-world tasks.
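One way to picture a growing, shared random embedding is sketched below. It makes several assumptions that the summary does not confirm: the embedding for dimension d is taken as the first d columns of a single random matrix, the switch to a higher dimension is triggered by a fixed stall counter, and random local search stands in for the Gaussian-process surrogate and acquisition step of real Bayesian optimization.

```python
import numpy as np

def optimize_with_growing_embedding(f, D, d_max, steps=300, patience=30, seed=0):
    """Minimize f over R^D through x = A[:, :d] @ y, growing d when progress stalls."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(D, d_max))       # shared embedding: dimension d uses A[:, :d]
    d, y_best = 1, np.zeros(1)
    f_best, stall = f(A[:, :1] @ y_best), 0
    for _ in range(steps):
        # Random local search replaces the BO proposal step (assumption).
        y = y_best + rng.normal(scale=0.3, size=d)
        val = f(A[:, :d] @ y)
        if val < f_best:
            f_best, y_best, stall = val, y, 0
        else:
            stall += 1
        if stall >= patience and d < d_max:
            d += 1
            y_best = np.append(y_best, 0.0)  # zero-padding maps to the same point in R^D
            stall = 0
    return f_best, d

def objective(x):
    """Toy 20-dimensional function that depends on only three coordinates."""
    return float(np.sum((x[:3] - 1.0) ** 2))

best_value, final_dim = optimize_with_growing_embedding(objective, D=20, d_max=6)
```

Because the lower-dimensional incumbent is carried over unchanged when a column is added, the higher-dimensional search starts from the best point found so far rather than from scratch.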
Results
DSEBO demonstrates significant improvements in optimization regret and computational efficiency compared to state-of-the-art methods. The algorithm effectively balances approximation and optimization errors, as evidenced by its performance across various dimensionalities and real-world applications.
Implications
The proposed DSEBO algorithm has potential applications in fields requiring efficient optimization of high-dimensional functions, such as machine learning, engineering design, and resource allocation problems. Its ability to adaptively determine effective dimensions can lead to more efficient use of computational resources in practical scenarios.
SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning
Large Language Models
Reinforcement Learning
Optimization
- SeqRoute addresses the limitations of traditional LLM routing by incorporating sequential decision-making under budget constraints.
- The introduction of Hindsight Budget Relabeling (HBR) allows for the generation of a large dataset enriched with bankruptcy signals, overcoming data starvation.
- Conservative Q-Learning (CQL) is utilized to ensure safe exploration and prevent the model from making costly decisions in budget-tight situations.
- The λ-sweep mechanism enables a single policy to navigate the cost-quality Pareto frontier dynamically during deployment.
SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning
Summary
The paper introduces SeqRoute, a novel framework for routing queries to large language models (LLMs) that considers the sequential nature of user interactions and the constraints of global computational budgets. Traditional routing methods treat each query independently, which can lead to 'budget bankruptcy' where resources are exhausted on early queries, leaving insufficient capacity for more complex subsequent queries. SeqRoute formulates the multi-turn routing problem as a finite-horizon Markov Decision Process (MDP) and employs offline reinforcement learning to optimize routing decisions. A key innovation is Hindsight Budget Relabeling (HBR), which simulates historical interactions under various hypothetical budgets, generating a large dataset of transitions that include critical bankruptcy signals. The framework uses Conservative Q-Learning (CQL) to ensure safety during training by penalizing out-of-distribution actions. Additionally, a dynamic λ-sweep mechanism allows the model to adapt to varying budget constraints at deployment without retraining. Extensive evaluations show that SeqRoute significantly reduces operational costs while maintaining or improving quality and effectively suppressing bankruptcy rates.
Methodology
SeqRoute formulates the routing problem as a finite-horizon MDP, incorporating the remaining budget into the state space. It employs offline reinforcement learning with Conservative Q-Learning to learn optimal routing strategies while mitigating risks associated with budget constraints. Hindsight Budget Relabeling is used to create a rich dataset from historical interactions, and a λ-sweep mechanism is implemented for flexible deployment.
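A minimal sketch of how Hindsight Budget Relabeling could work, assuming each logged conversation is a list of (model, cost) pairs and that an episode ends at the first call the remaining budget cannot cover. The `Transition` fields, the `relabel` helper, and the toy costs are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    turn: int
    remaining_budget: float  # budget left before this call (part of the MDP state)
    action: str              # hypothetical model name chosen at this turn
    cost: float
    bankrupt: bool           # True if this call could not be paid for


def relabel(logged_turns, budget):
    """Replay one logged conversation under a hypothetical global budget."""
    remaining = budget
    transitions = []
    for turn, (action, cost) in enumerate(logged_turns):
        bankrupt = cost > remaining
        transitions.append(Transition(turn, remaining, action, cost, bankrupt))
        if bankrupt:
            break  # assumed: the episode terminates at bankruptcy
        remaining -= cost
    return transitions


# One logged conversation replayed under three hypothetical budgets.
logged = [("large", 5.0), ("small", 1.0), ("large", 5.0)]
dataset = [t for b in (4.0, 8.0, 12.0) for t in relabel(logged, b)]
```

A single log yields several trajectories here, some of which end in bankruptcy. That is the kind of signal an offline learner such as CQL would otherwise rarely see.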
Results
SeqRoute reduces operational costs by 6.0–73.5% while maintaining or improving response quality. It keeps bankruptcy rates below 1%, outperforming behavior cloning, budget-aware heuristics, and static baselines across budget levels.
Implications
The findings suggest that SeqRoute can be effectively applied in real-world applications where LLMs are deployed under strict budget constraints, improving resource allocation and user experience in multi-turn interactions. This framework could be beneficial for businesses looking to optimize costs while maintaining high-quality service.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
NLP
Large Language Models
- CoT prompting is crucial for arithmetic tasks in small language models, but the order of steps is less important than previously believed.
- The study identifies a positional readout mechanism where models copy the trailing number before the answer delimiter, leading to high accuracy.
- Gold-answer presence significantly boosts model accuracy, indicating a reliance on positional copying over logical reasoning.
- Different models exhibit varying degrees of content gating and distractor handling, with implications for their architecture-specific behaviors.
Summary
This paper investigates the effectiveness of Chain-of-Thought (CoT) prompting in small language models (LMs) for arithmetic tasks, specifically focusing on the readout mechanism. The author demonstrates that while CoT prompting is essential for achieving high accuracy in arithmetic, the logical sequencing of CoT steps is less critical than previously thought, as shuffling the steps retains most performance. The study isolates the answer-readout stage using prefix completion and identifies a positional shortcut where the model predominantly copies the number in the trailing position before the answer delimiter, irrespective of the reasoning steps taken. This positional copying accounts for a significant portion of accuracy, with findings showing that the presence of the correct answer can lead to a 54-92 percentage point increase in accuracy. The paper also explores the behavior of different models (Qwen, Llama, and Gemma) in terms of their response to distractor numbers and the architecture-specific mechanisms that influence their readout processes. The results indicate that the readout mechanism can bypass intermediate reasoning, raising questions about the faithfulness of CoT-based evaluations and suggesting that the readout may not rely on genuine computation when a copyable number is available.
Methodology
The research employs a series of experiments on three instruction-tuned language models (Qwen, Llama, and Gemma) using the GSM8K dataset. The methodology includes prefix completion to isolate the answer-readout stage, corruption decomposition to assess the impact of gold-answer presence, and head-level ablation studies to understand architecture-specific behaviors. Statistical analyses, including Wilson confidence intervals and McNemar's test, are used to evaluate the results.
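The Wilson confidence interval mentioned above has a closed form, sketched here for a binomial accuracy estimate. The function name and the default `z=1.96` (roughly a 95% interval) are choices made for this example; the summary does not state which confidence level the author used.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: 50 correct answers out of 100 prompts.
low, high = wilson_interval(50, 100)
```

Unlike the plain normal approximation, the interval stays inside [0, 1] even when accuracy is near 0% or 100%, which matters for conditions where a model almost always copies the trailing number.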
Results
The results reveal that the positional readout mechanism is dominant, with models copying the last number in the answer context 87-95% of the time. The presence of the correct answer can lead to a 54-92 percentage point increase in accuracy. Additionally, the study finds that the readout mechanism is architecture-specific, with varying degrees of content gating across models. The findings also indicate that the readout mechanism can suppress genuine computation when a copyable number is present.
Implications
These findings have significant implications for understanding the limitations of CoT prompting in evaluating model reasoning. They suggest that reliance on positional copying may lead to overestimation of model capabilities in arithmetic tasks. This could influence future research on model evaluation and the design of more robust prompting strategies that ensure genuine reasoning is utilized.
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
NLP
Large Language Models
Efficient ML
- ThriftAttention achieves near-FP16 quality at FP4 inference efficiency.
- The method selectively computes important query-key interactions in higher precision.
- It recovers up to 89.1% of the quality gap between FP4 and FP16 with only 5% of blocks in FP16.
- The performance advantage increases with longer context lengths.
Summary
ThriftAttention targets the cost of attention in long-context workloads when inference runs at low-bit precision. Block-scaled quantization to FP4 often causes significant quality degradation, especially at long context lengths. The paper observes that quantization error is not uniformly distributed; it concentrates in a few critical query-key interactions that are essential for the model's output. ThriftAttention therefore uses a two-stage process: first, a heuristic selects a small number of important query-key block pairs to compute in FP16 while the rest are computed in FP4; second, the two sets of results are merged using an online softmax. With only 5% of the query-key blocks in FP16, ThriftAttention recovers approximately 89.1% of the performance gap between FP4 and FP16. The recovery rate improves with larger FP16 budgets and longer sequence lengths.
Methodology
ThriftAttention employs a two-stage approach where a heuristic scores query-key block pairs based on their importance. The top-scoring blocks are computed in FP16, while the remaining blocks are computed in FP4. The outputs from both sets are merged using an online softmax function to produce a single output.
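The merge step can be sketched with NumPy for a single query vector. This is an illustration under stated assumptions: the `coarse` rounding function is only a stand-in for FP4 storage, the set of high-precision blocks is passed in by hand because the paper's selection heuristic is not described here, and all function names are hypothetical.

```python
import numpy as np


def block_partial(q, K, V):
    """Partial softmax state for one key block: (max score, sum of exp, unnormalized output)."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ V


def merge(a, b):
    """Combine two partial states by rescaling both to a shared maximum (online softmax)."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    w_a, w_b = np.exp(m_a - m), np.exp(m_b - m)
    return m, w_a * l_a + w_b * l_b, w_a * o_a + w_b * o_b


def coarse(x, step=0.25):
    """Hypothetical stand-in for low-precision storage: round to a coarse grid."""
    return np.round(x / step) * step


def mixed_attention(q, K, V, block, hi_blocks, quantize=coarse):
    """Attention for one query; blocks listed in hi_blocks skip quantization."""
    state = None
    for i, start in enumerate(range(0, K.shape[0], block)):
        Kb, Vb = K[start:start + block], V[start:start + block]
        if i not in hi_blocks:
            Kb, Vb = quantize(Kb), quantize(Vb)
        part = block_partial(q, Kb, Vb)
        state = part if state is None else merge(state, part)
    _, l, o = state
    return o / l
```

Because each block only contributes a running maximum, a normalizer, and an unnormalized output, blocks computed at different precisions can be combined in any order without materializing the full score matrix.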
Results
ThriftAttention successfully recovers an average of 89.1% of the performance gap between FP4 and FP16 attention at a 5% FP16 block budget, with recovery rates increasing to 91.8% and 92.4% at 10% and 25% budgets, respectively. The method also reduces inference latency by up to 2× in long-context scenarios.
Implications
ThriftAttention has significant implications for the deployment of large language models, particularly in applications requiring efficient inference with long-context data. It enables the use of lower precision without sacrificing output quality, making it suitable for real-time applications and resource-constrained environments.
MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding
Multimodal
- Introduction of a tri-modal EEG-image-text alignment framework for visual decoding.
- Significant improvement in decoding accuracy on the Things-EEG2 benchmark compared to prior methods.
- Utilization of a masked autoencoder for pre-training the EEG encoder, enhancing performance.
- Demonstration that compact embedding spaces outperform larger models in EEG-to-image retrieval.
Summary
The paper presents MindAlign, a novel tri-modal contrastive framework designed to enhance visual decoding from electroencephalography (EEG) signals by aligning EEG, visual, and textual representations in a unified latent space. The authors identify the challenges posed by the low signal-to-noise ratio of EEG and inter-subject variability, which complicate direct EEG-image supervision. To overcome these issues, the framework employs a two-stage approach: first, it pre-trains an EEG encoder using masked reconstruction on unlabeled EEG trials to learn spatio-temporal regularities; second, it aligns EEG, image, and language model-generated textual descriptions through contrastive learning, where text supervision serves as a semantic regularizer. The proposed method integrates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings. On the Things-EEG2 benchmark, MindAlign achieves a Top-1 accuracy of 54.1% and Top-5 accuracy of 83.4%, significantly outperforming the previous best baseline results of 32.4% and 64.0%, respectively. The findings suggest that compact embedding geometries are more effective than larger models for EEG-to-image retrieval, and the decoding aligns with established neurophysiological principles of visual processing, marking a significant advancement in the field of brain-computer interfaces.
Methodology
The methodology involves a two-stage design: first, pre-training an EEG encoder using masked reconstruction on unlabeled EEG data to capture spatio-temporal patterns; second, applying contrastive learning to align EEG, image, and text embeddings in a shared latent space, where textual descriptions provide additional semantic context.
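The second stage can be illustrated with a symmetric InfoNCE loss over paired embeddings. This is a sketch, not MindAlign's actual objective: the temperature, the `text_weight` factor, and the choice to pair EEG with each other modality separately are assumptions made for the example.

```python
import numpy as np


def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss; row i of `a` is the positive for row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def tri_modal_loss(eeg, img, txt, text_weight=0.5):
    """EEG-image alignment plus a text term acting as a semantic regularizer (assumed weighting)."""
    return info_nce(eeg, img) + text_weight * info_nce(eeg, txt)
```

When the three embedding matrices are correctly paired the loss is near zero; shuffling the image and text rows against the EEG rows drives it up sharply.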
Results
The MindAlign framework achieved a Top-1 accuracy of 54.1% and Top-5 accuracy of 83.4% on the Things-EEG2 benchmark, significantly surpassing the previous best results of 32.4% and 64.0%, respectively. Statistical tests confirmed the significance of these improvements (p < 0.01).
Implications
The findings suggest that the integration of EEG with visual and textual modalities can lead to more robust brain-computer interfaces, enabling applications in assistive technologies, cognitive neuroscience research, and enhanced human-computer interaction.
Steered Generation via Gradient-Based Optimization on Sparse Query Features
NLP
Large Language Models
Optimization
- Introduction of Prototype-Based Sparse Steering for controlled text generation.
- Demonstration of attention query activations as a high-fidelity control mechanism.
- Validation through experiments in both rigid planning and educational feedback contexts.
- Establishment of sparse query representations for improved interpretability and control.
Summary
This paper explores a novel approach to controlled text generation in Large Language Models (LLMs) by focusing on attention query activations as a means for precise steering. The authors introduce a framework called Prototype-Based Sparse Steering, which utilizes Sparse Autoencoders (SAEs) to decompose query activations into interpretable features. By applying gradient-based optimization during inference, the framework aligns these sparse representations with class prototypes of desired behaviors. The research is validated through experiments in two distinct environments: Textualized Gridworld, which tests rigid planning constraints, and a high-dimensional educational domain that assesses cognitive complexity in feedback. The findings indicate that optimizing sparse query features allows for effective navigation of planning requirements and provides a unified control mechanism over both logical and stylistic attributes, outperforming traditional prompt-based methods and static steering vectors.
Methodology
The authors propose a framework that combines Sparse Autoencoders (SAEs) with gradient-based optimization to manipulate attention query activations. This approach treats control as an optimization problem, steering encoded features toward desired prototype distributions while keeping the base model weights fixed. The framework is tested in controlled environments, including Textualized Gridworld and an educational feedback setting, to assess its effectiveness in achieving specific semantic and behavioral targets.
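To make "control as an optimization problem" concrete, here is a deliberately simplified sketch that works directly on a sparse feature vector. It assumes a squared-distance objective to the prototype plus a fidelity penalty that keeps the features near their starting point; the paper's actual loss, the SAE encoder and decoder, and the names `steer_features` and `fidelity` are not taken from the source.

```python
import numpy as np


def steer_features(z0, prototype, fidelity=0.5, lr=0.1, steps=200):
    """Gradient descent on 0.5*||z - prototype||^2 + 0.5*fidelity*||z - z0||^2."""
    z = z0.copy()
    for _ in range(steps):
        grad = (z - prototype) + fidelity * (z - z0)
        z = np.maximum(z - lr * grad, 0.0)  # SAE features are assumed non-negative
    return z


z0 = np.array([0.0, 2.0, 0.0, 1.0])         # current sparse query features (toy values)
prototype = np.array([1.0, 0.0, 0.0, 1.0])  # class prototype of the desired behavior (toy values)
steered = steer_features(z0, prototype)
```

In a full pipeline the steered features would be decoded back into query activations while the base model weights stay fixed; that decoding step is omitted here. The `fidelity` term is one simple way to trade steering strength against distortion.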
Results
The experiments demonstrate that the proposed method enables effective navigation of planning constraints in Textualized Gridworld, allowing for the generation of paths that meet safety and length requirements. In the educational domain, the framework successfully steers cognitive complexity according to Bloom's Taxonomy. Additionally, the results show that query-level interventions result in lower distortion and improved control fidelity compared to traditional static steering methods.
Implications
The findings suggest that manipulating attention query activations can lead to more precise control in LLMs, making this approach valuable for applications requiring adherence to specific stylistic, safety, or logical constraints. This could enhance the usability of LLMs in real-world scenarios, particularly in education and planning tasks.
Cluster Frequency Conformal Prediction for Local Coverage
Theory
Interpretability
Efficient ML
- CFCP improves classwise reliability in many-class classification by utilizing local cluster-level label frequencies.
- The framework preserves standard conformal prediction validity while adapting to local representation structures.
- CFCP achieves superior class coverage in 15 out of 16 dataset comparisons against strong baselines.
- The method is computationally efficient and maintains competitive prediction set sizes.
Summary
The paper introduces Cluster Frequency Conformal Prediction (CFCP), a novel framework aimed at enhancing the reliability of conformal prediction in many-class classification scenarios. Traditional conformal prediction methods provide distribution-free coverage guarantees but often fail to adequately cover specific classes or subpopulations, particularly when calibration data is sparse. CFCP addresses this issue by leveraging local structure in learned representation spaces. The approach involves clustering learned embeddings, estimating cluster-level label frequency distributions from calibration data, and constructing sample-specific probability vectors by mixing nearby cluster distributions. This localized adaptation allows CFCP to maintain standard conformal prediction validity while improving classwise coverage. The empirical evaluation demonstrates that CFCP outperforms existing methods in achieving better class coverage across various image and text benchmarks, suggesting that local cluster-frequency information is a valuable signal for improving reliability in many-class settings.
Methodology
CFCP clusters learned embeddings to estimate label frequency distributions at the cluster level. For each test point, it retrieves nearby clusters, forms a soft mixture of their label distributions, and regularizes this mixture with global priors. The resulting sample-specific probability vector is then used in standard conformal set construction, ensuring both statistical validity and interpretability.
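The pipeline above can be sketched with NumPy. Several details are assumptions made for illustration: cluster centroids are given rather than learned, the mixture weights use a softmax over squared distances, the conformity score is one minus the probability of the true label, and the frequencies are estimated on a split separate from the one used for the conformal quantile. None of these choices is confirmed by the summary.

```python
import math

import numpy as np


def cluster_label_freqs(emb, labels, centroids, n_classes, smoothing=1.0):
    """Smoothed label frequencies for each cluster, from nearest-centroid assignment."""
    dists = ((emb[:, None, :] - centroids[None]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    counts = np.full((len(centroids), n_classes), smoothing)
    np.add.at(counts, (assign, labels), 1.0)
    return counts / counts.sum(axis=1, keepdims=True)


def local_probs(x, centroids, freqs, prior, k=2, temp=1.0, prior_weight=0.1):
    """Soft mixture of the k nearest clusters' label distributions, shrunk toward a global prior."""
    d = ((centroids - x) ** 2).sum(-1)
    near = np.argsort(d)[:k]
    w = np.exp(-(d[near] - d[near].min()) / temp)
    w /= w.sum()
    return (1 - prior_weight) * (w @ freqs[near]) + prior_weight * prior


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else float(np.sort(scores)[rank - 1])


def prediction_set(p, qhat):
    """All labels whose score 1 - p[label] does not exceed the threshold."""
    return np.flatnonzero(1.0 - p <= qhat)
```

The sample-specific vector from `local_probs` simply replaces the classifier's softmax output in an otherwise standard split-conformal procedure, which is why the usual marginal coverage argument still applies.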
Results
CFCP demonstrated the best class coverage in 15 out of 16 dataset and score-family comparisons, while also maintaining competitive prediction set sizes. The results indicate that CFCP effectively balances classwise coverage with practical efficiency, particularly in high-dimensional, many-class settings.
Implications
The CFCP framework can be applied in high-stakes applications where reliable predictions are crucial, such as medical diagnosis and financial forecasting. Its ability to adapt to local structures in data representation can enhance the deployment of machine learning models in diverse real-world scenarios.
Deployment-complete benchmarking
Theory
- Deployment-complete benchmarking ensures that benchmark evidence directly supports deployment actions.
- Mixed fibers indicate missing deployment information, which can lead to unresolved actions.
- Completion curves quantify the evidence needed to resolve ambiguities in deployment decisions.
- Traditional benchmarks often fail to provide reliable deployment guidance despite high scores.
Summary
The paper introduces the concept of deployment-complete benchmarking, which addresses the limitations of traditional benchmarks that only record evidence for a specific response without ensuring that this evidence supports a deployment action. The authors argue that benchmarks should not only provide scores but also clarify the actions they support, the ambiguity in deployment decisions, and the costs associated with resolving this ambiguity. They propose a framework where a benchmark is considered complete for a deployment claim if the action is consistent across all evidence fibers. The paper discusses the importance of mixed fibers as indicators of incomplete information and presents completion curves to quantify the evidence needed to resolve ambiguities. Through various evaluations, including public audits and held-out replays, the authors demonstrate that traditional benchmark scores often fail to translate into reliable deployment actions, highlighting the need for a more comprehensive reporting standard that includes evidence maps and completion costs. The findings suggest that deployment-ready benchmarks should focus on the clarity of supported actions rather than merely high scores.
Methodology
The authors developed a framework for deployment-complete benchmarking that includes the analysis of evidence maps, mixed fibers, and completion curves. They conducted public audits and held-out replays to evaluate the effectiveness of their proposed methods in reducing false decisions and improving deployment readiness.
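One way to read the fiber terminology: a fiber is the set of deployment cases that share the same recorded evidence, and it is mixed when those cases call for different actions. The sketch below follows that reading, which is an interpretation rather than the paper's formal definition. The field names, the toy cases, and the way the completion curve is computed (fraction of cases still in mixed fibers as evidence fields are added) are all assumptions.

```python
from collections import defaultdict


def mixed_fibers(cases, evidence_fields, action_field="action"):
    """Group cases by their evidence tuple; keep groups whose required actions disagree."""
    fibers = defaultdict(list)
    for case in cases:
        key = tuple(case[f] for f in evidence_fields)
        fibers[key].append(case)
    return {k: v for k, v in fibers.items()
            if len({c[action_field] for c in v}) > 1}


def completion_curve(cases, evidence_fields, action_field="action"):
    """Fraction of cases left in mixed fibers after recording the first k evidence fields."""
    curve = []
    for k in range(len(evidence_fields) + 1):
        mixed = mixed_fibers(cases, evidence_fields[:k], action_field)
        unresolved = sum(len(v) for v in mixed.values())
        curve.append(unresolved / len(cases))
    return curve


cases = [
    {"score": "high", "shift": "none", "action": "deploy"},
    {"score": "high", "shift": "large", "action": "hold"},
    {"score": "low", "shift": "none", "action": "hold"},
    {"score": "low", "shift": "large", "action": "hold"},
]
curve = completion_curve(cases, ["score", "shift"])
```

In this toy example a high benchmark score alone leaves half the cases unresolved, because the right action also depends on a deployment property the benchmark never recorded.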
Results
Benchmark-channel conformal coverage fell to 10.07% in unmeasured deployment channels, compared with 94.98% in controlled spaces. Adding completion evidence reduced the false-decision rate from 1.19% to 0.027% on Tox21 and from 20.3% to 0.128% on JARVIS. Public audits revealed a high percentage of mixed fibers, indicating widespread incompleteness in existing benchmarks.
Implications
The findings suggest that benchmarks should evolve to provide more comprehensive information that supports deployment decisions, potentially leading to more reliable and effective machine learning applications in real-world scenarios. This could influence how benchmarks are designed and reported in the future, emphasizing the importance of clarity in deployment actions.
Metropolis-Scale Resilient and Trustworthy Traffic Flow Inference Using Multi-Source Data
Time Series
Graph Learning
Optimization
- Introduction of TA-ANP framework for traffic state inference.
- Effective fusion of floating car data and fixed-detector measurements.
- Unified approach to handle multiple traffic inference sub-tasks.
- Robust uncertainty quantification using neural processes and Monte Carlo Dropout.
Summary
This paper addresses the challenge of inferring network-wide traffic states from sparse observations, focusing on the need for high accuracy and trustworthy uncertainty quantification in intelligent transportation systems (ITS). The authors propose a novel framework called Task-Aware Attentive Neural Process (TA-ANP) that integrates floating car data (FCD) with fixed-detector measurements to enhance global traffic state inference (GTSI). By modeling GTSI as a stochastic process, TA-ANP utilizes meta-learning to adapt to changes in sensing configurations without requiring retraining. The framework incorporates a multi-query attention module to effectively manage three sub-tasks: real-time estimation at unobserved locations, forecasting at observed locations, and forecasting at unobserved locations, while minimizing cross-task interference. For uncertainty quantification, the authors combine neural processes with Monte Carlo Dropout to capture both aleatoric and epistemic uncertainties. They also introduce the Metropolitan Multi-Source Traffic Dataset (MMTD), which includes diverse traffic data across a metropolitan area. Experimental results demonstrate that TA-ANP outperforms existing methods in all sub-tasks, providing well-calibrated uncertainties that facilitate efficient sensor placement and showcasing resilience to disturbances in the sensing environment.
Methodology
The authors developed the TA-ANP framework, which employs meta-learning properties of neural processes to adapt to dynamic sensing configurations. A task-aware multi-query attention module is utilized to address three GTSI sub-tasks simultaneously. Uncertainty quantification is achieved through a combination of neural processes and Monte Carlo Dropout, allowing for the capture of different types of uncertainties.
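A common way to separate the two kinds of uncertainty from repeated stochastic forward passes is the law of total variance. The sketch below assumes each Monte Carlo Dropout pass returns a predictive mean and variance for one location. Whether TA-ANP uses exactly this estimator is not stated in the summary, and the function name is hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split predictive variance across T stochastic passes.

    aleatoric: average of the per-pass predicted variances (data noise)
    epistemic: variance of the per-pass predicted means (model disagreement)
    """
    aleatoric = fmean(variances)
    epistemic = pvariance(means)
    return aleatoric, epistemic, aleatoric + epistemic


# Three dropout passes for one unobserved road segment (toy numbers).
aleatoric, epistemic, total = decompose_uncertainty([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Locations with a large epistemic share are natural candidates for new sensors, since more data can reduce that component but not the aleatoric one.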
Results
TA-ANP achieved state-of-the-art performance across all sub-tasks in the MMTD dataset, outperforming existing methods under both deterministic and probabilistic metrics. The framework also provided well-calibrated uncertainty estimates, enabling more efficient sensor deployment strategies and demonstrating superior resilience in adapting to changes in the sensing environment.
Implications
The findings suggest that TA-ANP can significantly enhance traffic monitoring and control in urban environments, leading to improved decision-making in ITS. The framework's resilience to disturbances can inform better infrastructure planning and sensor deployment strategies, ultimately contributing to smarter and more efficient urban mobility solutions.
Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting
Time Series
Generative Models
Optimization
- PPM synergizes parametric estimation with deep generative modeling to capture non-stationary dynamics.
- Introduces a parametric push-forward mechanism for deriving adaptive conditional priors.
- Utilizes a hybrid objective combining NLL and MSE for training stability and distributional fidelity.
- Empirical results show up to 31.2% reduction in CRPS and 44.3% in QICE compared to state-of-the-art models.
Summary
This paper introduces the Parametric Prior Mapping (PPM) framework aimed at improving non-stationary probabilistic multivariate time series (MTS) forecasting. The authors highlight the challenges of existing methods, which either lack flexibility or struggle with complex temporal dependencies. PPM integrates parametric structural priors into a generative modeling process, allowing for a dynamic, adaptive prior that guides the learning of a complex predictive distribution. This hybrid approach retains the efficiency of parametric methods while leveraging the expressiveness of generative models. The framework is trained using a hybrid objective that combines Negative Log-Likelihood (NLL) and Mean Squared Error (MSE), ensuring both accuracy and robust uncertainty estimates. Empirical evaluations demonstrate that PPM significantly outperforms existing baselines, achieving better accuracy and computational efficiency, making it a promising solution for real-world applications in fields such as finance and transportation where predictive uncertainty is crucial.
Methodology
The PPM framework employs a parametric estimator to derive an adaptive prior from input sequences, which is then refined through a mapping module to produce a flexible predictive distribution. The model is trained using a hybrid objective that balances NLL and MSE to ensure both point-wise accuracy and uncertainty calibration.
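A hybrid NLL-plus-MSE objective can be written in a few lines. The sketch assumes a univariate Gaussian predictive distribution and a single scalar `mse_weight`; the summary does not specify PPM's distribution family or how the two terms are balanced.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def hybrid_loss(ys, mus, sigmas, mse_weight=1.0):
    """Mean Gaussian NLL plus a weighted mean squared error on the predicted means."""
    n = len(ys)
    nll = sum(gaussian_nll(y, m, s) for y, m, s in zip(ys, mus, sigmas)) / n
    mse = sum((y - m) ** 2 for y, m in zip(ys, mus)) / n
    return nll + mse_weight * mse
```

The NLL term rewards well-calibrated spread, while the MSE term keeps the mean from drifting when the model could otherwise lower the NLL by inflating `sigma`.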
Results
PPM outperformed six state-of-the-art baselines in empirical tests, achieving significant reductions in Continuous Ranked Probability Score (CRPS) and Quantile Integrated Calibration Error (QICE). Additionally, it demonstrated a remarkable speed advantage in inference times compared to existing diffusion models.
Implications
The PPM framework has potential applications in high-stakes decision-making areas such as finance, weather forecasting, and transportation, where accurate predictions and well-calibrated uncertainty estimates are critical.
BandVQ: Band-Wise Vector-Quantized EEG Foundation Model
Time Series
- Introduction of BandVQ, a band-wise vector-quantized EEG foundation model.
- Utilization of independent VQ-VAE tokenizers for each EEG frequency band.
- Incorporation of metadata conditioning and region-based masking to enhance model performance.
- Pretraining on a large-scale EEG dataset with over 9,200 subjects.
Summary
The paper presents BandVQ, an EEG foundation model designed to learn transferable representations from diverse EEG recordings. Traditional masked modeling approaches often overlook frequency-specific activity because they rely on broadband continuous patches or a single discrete representation. BandVQ decomposes EEG signals into five frequency bands (delta, theta, alpha, beta, and gamma) and uses an independent vector-quantized variational autoencoder (VQ-VAE) for each band. A shared Transformer encoder is pretrained on the resulting discrete VQ code indices, incorporating masked code tokens, quantized absolute log-power tokens, and metadata prefix tokens that represent various contextual information. The model also employs region-based masking to mitigate trivial reconstructions from spatially adjacent electrodes. BandVQ is pretrained on 71 public EEG corpora, totaling over 9,200 subjects and 357,000 single-channel hours, and is evaluated on six subject-independent classification datasets. It achieves the highest accuracy on three cognitive tasks and competitive results on motor imagery tasks, indicating that the model generalizes across different EEG contexts.
Methodology
BandVQ decomposes EEG signals into five frequency bands and trains separate VQ-VAEs for each band. A shared Transformer encoder is pretrained using masked code prediction, with additional conditioning mechanisms including metadata prefix tokens and quantized log-power tokens. Region-based masking is applied to reduce trivial reconstructions from neighboring electrodes.
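The two front-end steps, band splitting and vector quantization, can be sketched with NumPy. The band edges below are conventional textbook ranges chosen for illustration, the FFT mask is a stand-in for whatever filter the authors use, and the codebook lookup shows only the nearest-neighbor step of a VQ-VAE, not its training.

```python
import numpy as np

# Conventional band edges in Hz; assumed for this example, not taken from the paper.
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}


def split_bands(x, fs):
    """Split a 1-D signal into band-limited components by masking its spectrum."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    out = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        out[name] = np.fft.irfft(spec * mask, n=len(x))
    return out


def quantize(patches, codebook):
    """Index of the nearest codebook vector for each patch (the VQ lookup step)."""
    d = ((patches[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.argmin(axis=1)
```

With one codebook per band, each band yields its own stream of discrete indices, which is the kind of token sequence the shared Transformer is pretrained on.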
Results
The BandVQ model achieved the highest reported accuracy on three cognitive tasks and competitive performance on three motor imagery tasks when evaluated on six subject-independent classification datasets. The results indicate strong transferability and generalization capabilities of the model.
Implications
BandVQ has the potential to improve EEG-based applications in brain-computer interfaces, neurological monitoring, and cognitive assessments by providing robust and transferable representations across diverse EEG datasets.
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
Reinforcement Learning
Robotics
Efficient ML
- Reflex introduces a reflection symmetry-aware learning paradigm for state-based continuous control tasks.
- The paper formalizes axial and bilateral reflection types and their transformations.
- A symmetry regularization term is proposed to enhance policy learning by encouraging invariance under reflection transformations.
- Reflex is integrated with both on-policy (PPO) and off-policy (SAC) RL algorithms.
Summary
The paper introduces Reflex, a novel approach to reinforcement learning (RL) that enhances sample efficiency by exploiting reflection symmetry in state-based continuous control tasks. While previous research has focused on image-based RL and rotational symmetries, Reflex addresses the underexplored area of reflection symmetry, which is particularly relevant for tasks involving coordinated left-right behaviors. The authors formalize two types of reflection—axial and bilateral—and develop a symmetry regularization mechanism that integrates with both on-policy and off-policy RL algorithms, specifically Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). The theoretical foundation of Reflex includes a proof of the symmetry of optimal policies, which supports the proposed regularization term. The authors evaluate Reflex on various benchmarks, including OpenAI Gym and DeepMind Control, demonstrating that it significantly improves sample efficiency and final performance compared to standard baselines. This work contributes to the understanding of symmetry in RL and provides a practical framework for enhancing learning efficiency in continuous control environments.
Methodology
The authors develop Reflex by formalizing reflection symmetries in state and action spaces, introducing a symmetry consistency regularization term that encourages policies to learn from mirrored states. The approach is integrated with PPO and SAC algorithms, and its effectiveness is evaluated through experiments on OpenAI Gym and DeepMind Control benchmarks.
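A symmetry consistency term of this kind can be sketched as follows. It assumes reflections act linearly on states and actions through matrices `M_s` and `M_a`, and it penalizes the gap between the policy's action in the mirrored state and the mirrored action from the original state. The exact form of Reflex's regularizer is not given in the summary, so treat the names and the squared-error choice as illustrative.

```python
import numpy as np


def symmetry_loss(policy, states, M_s, M_a):
    """Mean squared gap between policy(M_s s) and M_a policy(s) over a batch of states."""
    mirrored_actions = policy(states @ M_s.T)
    expected = policy(states) @ M_a.T
    return float(np.mean(np.sum((mirrored_actions - expected) ** 2, axis=1)))


# Toy bilateral reflection: swapping left and right in both state and action.
swap = np.array([[0.0, 1.0], [1.0, 0.0]])
states = np.array([[1.0, 0.0], [0.5, -2.0], [3.0, 1.0]])


def symmetric_policy(s):
    return s @ np.array([[1.0, 2.0], [2.0, 1.0]]).T


def asymmetric_policy(s):
    return s @ np.array([[1.0, 0.0], [0.0, 2.0]]).T
```

In training, a term like this would be added to the PPO or SAC objective with a weighting coefficient, so the policy is pulled toward mirrored behavior without the constraint being enforced exactly.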
Results
Reflex demonstrates superior performance over standard RL baselines, achieving improved sample efficiency and higher expected returns across various environments. The theoretical analysis supports the effectiveness of the symmetry regularization mechanism in enhancing policy robustness.
Implications
The findings suggest that leveraging reflection symmetry can significantly enhance the efficiency of RL algorithms in continuous control tasks, potentially leading to broader applications in robotics and other domains requiring coordinated actions.
TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism
Graph Learning
Time Series
- TGFormer treats temporal graph learning as a time-series analysis problem, enhancing the modeling of complex temporal patterns.
- The Series Transformer layer effectively captures long-term dependencies using a Transformer-based architecture.
- The auto-correlation mechanism (ACoM) captures periodic patterns with reduced computational complexity.
- TGFormer outperforms existing state-of-the-art methods in precision across multiple datasets.
Summary
The paper introduces TGFormer, a novel Transformer architecture tailored for Temporal Graph Neural Networks (TGNNs) that addresses the challenges of capturing long-term dependencies and periodic patterns in temporal graphs. TGFormer redefines temporal graph learning by framing it as a time-series analysis problem, allowing for a systematic examination of historical interactions to derive node representations. A key innovation is the auto-correlation mechanism (ACoM), which leverages stochastic process theory to identify periodic dependencies in node interactions. Unlike traditional attention mechanisms that tend to smooth out high-frequency signals, ACoM operates in the frequency domain, preserving critical periodic fluctuations while reducing computational complexity. Extensive experiments on six public benchmarks demonstrate that TGFormer significantly outperforms state-of-the-art methods, achieving up to a 9.35% improvement in precision. The findings highlight TGFormer’s effectiveness in modeling both long-term dependencies and periodic structures, marking a significant advancement in temporal graph representation learning.
Methodology
TGFormer employs a Transformer-based architecture to model temporal graphs, utilizing a Series Transformer layer for long-term dependency capture and an auto-correlation mechanism (ACoM) for identifying periodic patterns. ACoM transforms time series data into the frequency domain using Fast Fourier Transform (FFT), allowing the model to focus on specific frequency components and preserve high-frequency periodic signals.
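The FFT route to autocorrelation rests on the Wiener-Khinchin relation: the inverse transform of the power spectrum gives the autocorrelation at every lag in O(n log n). The sketch below applies it to a single scalar series; how ACoM applies the idea to node interaction sequences inside the Transformer is not reproduced, and the helper names are hypothetical.

```python
import numpy as np


def autocorrelation(x):
    """Normalized autocorrelation at lags 0..n-1, computed through the power spectrum."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    spec = np.fft.rfft(x, n=2 * n)  # zero-pad to avoid circular wrap-around
    acf = np.fft.irfft(spec * np.conj(spec), n=2 * n)[:n]
    return acf / acf[0]


def top_lags(x, k=1):
    """The k non-zero lags with the highest autocorrelation."""
    acf = autocorrelation(x)
    return (np.argsort(acf[1:])[::-1][:k] + 1).tolist()


# A series with period 8: the strongest lag should be 8.
series = np.sin(2 * np.pi * np.arange(64) / 8)
best = top_lags(series, k=1)
```

Selecting a few high-correlation lags and aggregating the series at those delays is cheaper than computing attention over every pair of time steps.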
Results
Experimental validation on six real-world datasets shows that TGFormer achieves up to a 9.35% improvement in precision compared to state-of-the-art TGNN approaches, demonstrating its superior ability to model long-term dependencies and periodic structures.
Implications
TGFormer has potential applications in dynamic systems modeling such as social networks, traffic flow prediction, and real-time recommendation systems, where understanding temporal dynamics is crucial for accurate forecasting and decision-making.
When Determinants Are Not Enough: Private Rare Switching
Reinforcement Learning
Theory
Optimization
- The standard rare-switching update rule fails under Gaussian noise due to loss of monotonicity.
- A new rare-switching rule based on the generalized Rayleigh quotient is proposed.
- The new rule keeps the number of policy updates logarithmic while controlling regret in private settings.
- The paper includes a cleaned-up proof for the new rule and discusses its implications.
When Determinants Are Not Enough: Private Rare Switching
Summary
This paper addresses the challenges of applying the standard rare-switching policy update in linear bandits and reinforcement learning (RL) when privacy is a concern. The author notes that the conventional determinant-based update rule fails when Gaussian noise is added for privacy, because the noise breaks the monotonicity of the design matrix. The paper proposes a new rare-switching rule based on the generalized Rayleigh quotient, which keeps the number of policy updates logarithmic while keeping regret controlled. The author presents a cleaned-up version of the proof and reflects on the implications of this adaptation. The new update rule lets the learner manage policy updates effectively in a private setting, with performance guarantees similar to those in the non-private case.
Methodology
The author utilizes a theoretical approach to derive a new rare-switching update rule, leveraging the generalized Rayleigh quotient. The proof involves analyzing the properties of the covariance matrices under privacy constraints and establishing performance guarantees through mathematical inequalities.
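For orientation, the contrast between the two switching tests can be written out as below. This is a sketch under assumptions, not the paper's statement: the symbols $V_t$ (design matrix at time $t$), $V_\tau$ (design matrix at the last policy update), $\tilde V$ (their noisy counterparts), $N_t$ (the added noise), and the threshold $C$ are notation introduced here, and the last line is only one plausible form of a Rayleigh-quotient test, since the summary does not give the exact rule.

```latex
% Standard (non-private) rare switching: update the policy at time t when
\det(V_t) > (1 + C)\,\det(V_\tau),
\qquad V_t = V_\tau + \sum_{s=\tau+1}^{t} x_s x_s^\top \succeq V_\tau .

% The determinant test is useful because, for A \succeq B \succ 0,
\sup_{x \neq 0} \frac{x^\top A x}{x^\top B x} \;\le\; \frac{\det(A)}{\det(B)} .

% With Gaussian privacy noise, \tilde V_t = V_t + N_t need not satisfy
% \tilde V_t \succeq \tilde V_\tau, so the bound above can fail.
% Assumed alternative: test the quotient directly (with \tilde V_\tau \succ 0):
\sup_{x \neq 0} \frac{x^\top \tilde V_t x}{x^\top \tilde V_\tau x}
  = \lambda_{\max}\!\left(\tilde V_\tau^{-1} \tilde V_t\right) > 1 + C .
```

A small example of the failure: with $B = I$ and $A = \mathrm{diag}(4, 1/4)$, the determinant ratio is 1 while the quotient reaches 4, because $A \succeq B$ does not hold.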
Results
The proposed rule triggers a policy update when a maximum-eigenvalue condition is met, which keeps the number of updates logarithmic while controlling regret. The results indicate that the new rule can effectively replace the standard one in private linear bandit and RL settings.
Implications
This work has significant implications for the design of privacy-preserving algorithms in reinforcement learning and linear bandits, enabling practitioners to maintain effective learning strategies without compromising user privacy. It also opens avenues for further research into adaptive learning methods under privacy constraints.
Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization
Generative Models
Computer Vision
- Introduction of Filtered Posterior Mean Collections (FPMCs) as a unified framework for analytical denoisers.
- Identification of three principal design axes (query precision, response weights, source distributions) that differentiate prior methodologies.
- Demonstration of improved performance through soft relaxations and data augmentation strategies.
- State-of-the-art sample similarity on the CIFAR-10, FFHQ 64 × 64, and AFHQ 64 × 64 datasets.
Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization
Summary
This paper introduces Filtered Posterior Mean Collections (FPMCs), a unified framework for analytical models that approximate the behavior of neural-network denoising functions used in image diffusion models. The authors observe that these denoising functions exhibit consistent generalization behavior across various architectures and training hyperparameters. They consolidate existing methods that model denoiser outputs through posterior weighted averages of training dataset patches into the FPMC framework, which is defined by three axes: query precision vectors, response weights, and source distributions. The study investigates these axes to improve FPMC performance through soft relaxations of prior methodologies and data augmentation strategies. The authors demonstrate that their approach leads to significant improvements in sample similarity on three natural image datasets, thereby enhancing the understanding and performance of diffusion models in generative tasks.
Methodology
The authors develop the FPMC framework by defining query precision vectors, response weights, and source distributions. They analyze the impact of these design axes on model performance, applying soft relaxations to prior patch-based methods and exploring data augmentation techniques. The methodology includes joint finetuning of the design axes to optimize denoiser performance.
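As background for the "posterior weighted average" idea, the sketch below computes the plain posterior mean of a clean sample given a noisy one, assuming the clean data is uniform over a small training set and the noise is isotropic Gaussian. It does not implement FPMC's query precision vectors, response weights, or source distributions; the function name `posterior_mean_denoiser` and the toy two-point dataset are hypothetical.

```python
import math


def posterior_mean_denoiser(z, data, sigma):
    """E[x | z] when x is uniform over `data` and z = x + sigma * noise."""
    # Log-weight of each training point: -||z - x_i||^2 / (2 sigma^2).
    logits = [
        -sum((zj - xj) ** 2 for zj, xj in zip(z, x)) / (2 * sigma ** 2)
        for x in data
    ]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    # Weighted average of the training points, coordinate by coordinate.
    return [
        sum(w * x[j] for w, x in zip(weights, data)) / total
        for j in range(len(z))
    ]


data = [[0.0, 0.0], [1.0, 1.0]]  # toy "training set"
z = [0.9, 0.8]                   # noisy query

low_noise = posterior_mean_denoiser(z, data, sigma=0.1)    # approximately [1, 1]
high_noise = posterior_mean_denoiser(z, data, sigma=10.0)  # near the data mean
print(low_noise, high_noise)
```

At low noise the weights collapse onto the nearest training point, and at high noise they flatten toward the dataset mean. Patch-based analytical denoisers modify where the weights are computed and what is averaged, which is the kind of variation the design axes above are meant to organize.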
Results
Applying the FPMC framework yielded consistent improvements in how closely its samples match diffusion model outputs across three natural image datasets. The study also shows that specific choices of the design axes recover existing methods, clarifying their similarities and differences.
Implications
The findings suggest that the FPMC framework can enhance the performance of generative models, particularly in image synthesis tasks. This could lead to more effective applications in areas such as computer vision, where high-quality image generation is crucial.
RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases
Graph Learning
- RelPrism addresses the limitations of existing self-supervised learning methods for relational databases by incorporating multi-faceted information.
- The framework constructs intrinsic, relational, and hybrid attributes, allowing for a broader perspective during pre-training.
- Experiments show that RelPrism outperforms state-of-the-art methods, improving ROC-AUC by 4.15% for classification tasks and reducing MAE by 10.75% for regression tasks.
- The methodology emphasizes the importance of multi-granularity clustering to form pseudo-task pools for effective representation learning.
RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases
Summary
The paper introduces RelPrism, a novel self-supervised learning framework designed for relational databases (RDBs) that addresses the challenges of effective pre-training for diverse predictive tasks. Traditional methods often rely on single-faceted information, which limits adaptability and performance in RDB representation learning. RelPrism overcomes this by constructing intrinsic, relational, and hybrid attributes from multiple perspectives and applying multi-granularity clustering to create pseudo-task pools. This approach allows for comprehensive exposure to various information types during pre-training, enhancing the model's ability to adapt to downstream tasks. The authors validate RelPrism through experiments on 14 tasks across 5 real-world datasets, demonstrating significant improvements in performance metrics compared to state-of-the-art methods, thus showcasing its effectiveness in handling the heterogeneous needs of RDB predictive tasks.
Methodology
RelPrism employs a multi-faceted self-supervised learning approach by constructing intrinsic, relational, and hybrid attributes from relational databases. It utilizes multi-granularity clustering to create pseudo-task pools, which expose the model to diverse perspectives and granularities during pre-training. This method allows for a more comprehensive learning experience, facilitating better adaptation to various downstream tasks.
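One way to picture multi-granularity clustering as a source of pseudo-tasks is sketched below: a single numeric attribute is clustered at several values of k, and each clustering becomes a pseudo-classification target. This is an illustration under assumptions, not RelPrism's procedure: the summary does not specify the clustering algorithm, so a tiny 1-D k-means stands in, and the names `kmeans_1d`, `build_pseudo_task_pool`, and the `amounts` column are hypothetical.

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means with deterministic quantile initialization."""
    ordered = sorted(values)
    # Initial centers: evenly spaced order statistics of the data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest center for each value.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels


def build_pseudo_task_pool(attribute, granularities=(2, 3)):
    """One pseudo-task per granularity: predict the cluster id of each row."""
    return {f"k={k}": kmeans_1d(attribute, k) for k in granularities}


# Hypothetical numeric attribute derived from a relational table.
amounts = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]
pool = build_pseudo_task_pool(amounts)
for name, labels in pool.items():
    print(name, labels)
```

Coarse granularities give easy, broad targets and finer ones give more specific targets, so a model pre-trained on the whole pool sees the same attribute at several levels of detail.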
Results
The framework was tested on 14 tasks across 5 real-world datasets, achieving an improvement of 4.15% in ROC-AUC for classification tasks and a reduction of 10.75% in MAE for regression tasks compared to existing state-of-the-art methods.
Implications
RelPrism has significant implications for enhancing predictive modeling in relational databases, making it easier to handle complex tasks that require multi-faceted information. Its approach can be applied to various domains, such as finance and healthcare, where relational databases are prevalent, potentially leading to better decision-making and insights.