AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48
Papers today
8h
Update frequency
7
Days of history
Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models
NLP
Large Language Models
Generative Models
- Introduction of confidence-induced clusters (CICs) as span-level update units for MDLMs.
- Development of CLAD, a training-free cluster-level decoder that enhances parallel decoding.
- Utilization of self-attention maps to model inter-cluster dependencies and ensure compatibility.
- Demonstrated significant speed improvements over existing token-level decoding methods.
Read more
Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models
Summary
This paper introduces CLAD (Cluster-Level Attention-Guided Decoding), a novel approach to decoding in Masked Diffusion Language Models (MDLMs). The authors identify that existing methods typically operate at a token-level granularity, which limits the potential for parallelism during decoding. They propose a new granularity by defining confidence-induced clusters (CICs), which are contiguous spans of high-confidence masked positions. By leveraging self-attention maps to assess inter-cluster dependencies, CLAD allows for conflict-aware selection of CICs for simultaneous commitment. This method enhances the decoding process by enabling larger units of commitment while avoiding incompatible predictions. Experimental results demonstrate that CLAD achieves significant speedups (1.77×–8.47×) over traditional token-level decoding methods while maintaining comparable accuracy across various reasoning and code-generation benchmarks.
Methodology
The authors propose a two-step approach: first, they group adjacent high-confidence candidates into confidence-induced clusters (CICs) as update units. Then, they use self-attention maps from the forward pass to estimate dependencies between these clusters, allowing for conflict-aware selection of compatible CICs for parallel commitment. This results in a more efficient decoding process that can handle multiple predictions simultaneously.
Results
CLAD achieves speedups ranging from 1.77× to 8.47× compared to traditional Vanilla decoding methods. It also shows improved throughput over token-level dependency-aware baselines while maintaining broadly comparable task accuracy across four reasoning and code-generation benchmarks.
Implications
The findings suggest that changing the decoding unit from individual tokens to confidence-induced spans can significantly enhance throughput in language model decoding tasks. This approach could be beneficial in applications requiring efficient text generation, such as chatbots, automated coding assistants, and other NLP tasks.
The Hamilton-Jacobi Theory of Deep Learning
Theory
Optimization
Interpretability
- Training neural networks corresponds to solving Hamilton-Jacobi initial-value problems.
- Log-sum-exp activation functions are smooth deformations of tropical max operations.
- A single parameter ε connects various perspectives on neural networks, including tropical algebra and PDEs.
- The framework provides actionable design principles for optimizing neural network architectures.
Read more
The Hamilton-Jacobi Theory of Deep Learning
Summary
This paper presents a novel perspective on deep learning by framing the training of neural networks as a search through Hamilton-Jacobi initial-value problems. The authors establish that each gradient step corresponds to selecting initial data for a viscous Hamilton-Jacobi equation, with the Hopf-Cole propagator providing the best fit for observations. The paper demonstrates that various neural network architectures, including residual networks, transformers, and recurrent networks, can be structurally related to Hamilton-Jacobi equations, with a single deformation parameter ε unifying these perspectives. Key contributions include the identification of log-sum-exp layers as smooth deformations of tropical max operations, the establishment of a commutative diagram linking neural networks, tropical algebra, PDEs, and convex optimization, and actionable design principles for neural network architecture. The results yield insights into generalization rates, adversarial robustness, and influence functions, ultimately proposing a comprehensive mathematical theory of deep learning that connects disparate concepts under a unified framework.
Methodology
The authors utilize mathematical frameworks from Hamilton-Jacobi theory, tropical algebra, and convex optimization to establish connections between neural network architectures and partial differential equations. They employ the Hopf-Cole linearization and Maslov dequantization to derive relationships between different mathematical objects and neural network components.
Results
The paper establishes that every log-sum-exp activated layer encodes the exact Hopf-Cole solution of a viscous Hamilton-Jacobi PDE. It also identifies a minimax optimal generalization rate of O(n−1/(d+2)), demonstrates that adversarial robustness is controlled by the parameter ε, and provides a closed-form influence function for softmax weights. The findings suggest that the architecture of neural networks can be optimized based on the derived principles.
Implications
The unification of deep learning concepts under the Hamilton-Jacobi framework has the potential to enhance the design and understanding of neural networks, leading to improved performance in various applications. The insights into generalization and robustness could inform future research and practical implementations in machine learning.
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Theory
Optimization
Efficient ML
- Dataset value is influenced by factors beyond size and compute budget.
- The Vendi Score and neural scaling laws are shown to be submodular.
- Matrix spectral functions provide a broader framework for dataset appraisal.
- A new optimization method yields a 35,000× speedup for maximizing the Vendi Score.
Read more
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Summary
This paper addresses the challenge of appraising the value of datasets in machine learning, emphasizing that dataset value is not solely determined by size or compute budget. The authors explore the relationship between neural scaling laws and the Vendi Score, both of which exhibit submodularity. They introduce a broader class of objectives called matrix spectral functions, which includes the Vendi Score and determinantal point processes (DPPs). The paper presents a novel method for efficiently optimizing these objectives using secular-equation-based updates, achieving significant speed improvements. The authors evaluate various data appraisal methods, including the Vendi Score and facility location, against held-out test performance across multiple datasets. The findings reveal that while the Vendi Score is predictive within moderate ranges, it can perform poorly at higher values, whereas facility location consistently outperforms other methods. The study concludes that dataset value is complex and cannot be reduced to size, class balance, or training budget alone.
Methodology
The authors analyze the submodularity of neural scaling laws and the Vendi Score, introducing matrix spectral functions as a generalization. They develop a fast secular-function strategy for optimizing these objectives, significantly reducing computational overhead. The performance of various data appraisal methods is compared using empirical evaluations on ImageNet-1K-scale datasets.
Results
The study demonstrates that while the Vendi Score can predict dataset value, it is less reliable at higher score ranges. Facility location consistently outperforms the Vendi Score and other matrix spectral variants in predicting held-out test performance. Random sampling of datasets shows limited variety in appraisal scores and performance.
Implications
The findings suggest that more nuanced methods for dataset appraisal could enhance the efficiency of machine learning training processes. The introduction of matrix spectral functions may lead to better data selection strategies, impacting various applications in machine learning.
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
Theory
Graph Learning
Efficient ML
- Restricting regressors to the Markov boundary can improve prediction, especially in larger and sparser feature spaces.
- Causal discovery methods often fail to provide a useful boundary for prediction due to computational constraints and misalignment of objectives.
- The exact Markov boundary is not the only effective feature set; alternative sets can also yield better performance than using all features.
- The study highlights the trade-offs between minimality, sufficiency, and scalability in feature selection for tabular prediction.
Read more
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
Summary
This paper investigates the utility of the Markov boundary in tabular prediction, which is the minimal set of features that renders all other features redundant for predicting a target variable. The authors explore whether restricting regressors to the Markov boundary improves prediction performance compared to using the full feature set. They conduct experiments using SCM3K, a synthetic benchmark with 3,450 tasks across various feature counts and regression models. The findings reveal that while restricting to the Markov boundary often enhances prediction, the process of discovering this boundary through causal discovery does not yield the expected benefits. The authors identify three main reasons for this discrepancy: causal discovery prioritizes structural recovery over predictive accuracy, the costs of false positives and negatives are asymmetrical, and many feature sets can outperform the full set without being the exact boundary. The paper concludes with insights on the implications for feature selection and causal structure in tabular models.
Methodology
The authors utilized SCM3K, a controlled benchmark of synthetic structural causal models, to evaluate the performance of six regression models. They compared the test errors of models trained on the full feature set against those trained on the oracle Markov boundary, measuring the difference as the MB gap. They also analyzed the limitations of causal discovery methods in recovering the Markov boundary.
Results
The results indicated that restricting to the Markov boundary generally improved prediction accuracy, with the improvement increasing as the feature space became larger and sparser. However, attempts to recover the boundary through causal discovery did not consistently outperform models using the full feature set, primarily due to computational limitations and the nature of the discovery process.
Implications
The findings suggest that while the Markov boundary is theoretically appealing for feature selection in tabular prediction, practical applications may require alternative approaches that balance predictive performance with computational efficiency. This has implications for the design of future regression models and feature selection techniques in machine learning.
OISD: On-Policy Internal Self-Distillation of Language Models
NLP
Large Language Models
Reinforcement Learning
- Introduction of On-Policy Internal Self-Distillation (OISD) for language models.
- Utilizes the final layer as an internal teacher to guide intermediate layers.
- Employs logit and attention alignment mechanisms for effective knowledge transfer.
- Demonstrates substantial improvements in reasoning tasks over strong RL baselines.
Read more
OISD: On-Policy Internal Self-Distillation of Language Models
Summary
This paper introduces a novel framework called On-Policy Internal Self-Distillation (OISD) aimed at enhancing reasoning capabilities in language models through reinforcement learning (RL). Traditional RL post-training methods often focus solely on optimizing the final output policy using sparse rewards, neglecting the rich predictive signals present in intermediate model representations. OISD addresses this gap by utilizing the final layer of the model as both the acting policy and an internal teacher for selected intermediate layers. The framework employs two key mechanisms: logit alignment, which transfers high-level reasoning behaviors, and attention alignment, which ensures consistent attention patterns across layers. This approach allows for the distillation of informative intermediate representations without relying on external privileged information. The authors demonstrate the effectiveness of OISD through experiments on four mathematical reasoning tasks, showing significant improvements over existing strong RL baselines. The results suggest that leveraging internal computations for distillation can lead to more accurate and coherent reasoning trajectories in language models.
Methodology
The OISD framework operates by keeping the final layer as the sole acting policy during rollout and optimization, while using it as a detached internal teacher for selected intermediate layers. The framework employs logit alignment to transfer predictive beliefs and attention alignment to ensure consistent attention patterns, facilitating the learning of stronger intermediate representations without external supervision.
Results
Experimental results indicate that OISD significantly outperforms strong reasoning RL baselines across four mathematical reasoning benchmarks, showcasing the effectiveness of internal self-distillation in enhancing model reasoning capabilities.
Implications
The findings suggest that internal self-distillation can be a powerful approach for improving reasoning in language models, potentially leading to advancements in various applications such as mathematical reasoning, coding, and complex instruction-following tasks. This research opens avenues for further exploration of internal mechanisms in model training.
The Long-Term Effects of Data Selection in LLM Fine-Tuning
NLP
Large Language Models
Theory
- Introduces the concept of myopic selection in LLM fine-tuning, highlighting the trade-off between short-term gains and long-term adaptability.
- Develops a unified multi-stage evaluation protocol for comparing various data selection strategies.
- Demonstrates through experiments that short-term effective selectors can hinder future learning and increase forgetting.
- Proposes the Long-Horizon Aware Selection (LHAS) objective to improve long-term adaptation and robustness.
Read more
The Long-Term Effects of Data Selection in LLM Fine-Tuning
Summary
This paper investigates the long-term effects of data selection strategies in the fine-tuning of large language models (LLMs). While traditional methods focus on immediate task performance, this study emphasizes the importance of evaluating data selection based on future adaptability, retention, and robustness. The authors introduce the concept of myopic selection, where short-term effective selectors may hinder long-term learning and increase forgetting. They propose a unified multi-stage evaluation protocol to compare various selection strategies, including random, loss-based, gradient-based, diversity-based, and quality-based methods. Through controlled experiments, the authors demonstrate that short-term selectors can lead to rank reversal, improving current performance but slowing down future learning. To address this, they introduce the Long-Horizon Aware Selection (LHAS) objective, which incorporates coverage and anti-concentration terms to mitigate the adverse effects of myopic selection. The findings suggest that data selection should be viewed as a critical training intervention that shapes the model's learning trajectory rather than merely a local efficiency mechanism.
Methodology
The authors conducted controlled experiments using a unified multi-stage protocol to evaluate different data selection strategies. They analyzed the impact of these strategies on future adaptation speed, forgetting, capability imbalance, and out-of-distribution robustness. The study included a theoretical analysis to explain why selectors with equal current-stage gains could differ in future adaptation costs.
Results
The experiments revealed that short-term selectors could lead to rank reversal, where they improve immediate task performance but slow down subsequent learning and increase forgetting. The introduction of the LHAS objective showed potential in reducing the long-term side effects of myopic selection, enhancing future adaptability and robustness.
Implications
The findings suggest that practitioners should reconsider how they evaluate data selection strategies in LLM fine-tuning, focusing on long-term effects rather than just immediate performance. This could lead to more robust and adaptable models in real-world applications.
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood
Reinforcement Learning
NLP
Large Language Models
- RL2ML connects standard reinforcement learning, maximum-likelihood training, and beyond-maximum-likelihood objectives.
- Introduces a closed-form unbiased gradient estimator for finite-rollout surrogate objectives.
- Identifies a subcritical-supercritical update-scale transition that influences the effectiveness of surrogate objectives.
- Demonstrates that the optimal choice of surrogate objective depends on evaluation metrics, local sensitivity, and estimator variance.
Read more
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood
Summary
This paper introduces RL2ML, a novel family of finite-rollout surrogate objectives designed to bridge the gap between reinforcement learning (RL) and maximum likelihood (ML) training. The primary focus is on optimizing language models using binary feedback from sampled outputs, specifically addressing the conflation of the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups. RL2ML provides a closed-form, unbiased gradient estimator that maintains alignment between the estimator and the objective under a fixed rollout budget. The paper reveals a subcritical-supercritical update-scale transition that is not apparent in traditional population-level objective notation, emphasizing the importance of local sensitivity and estimator variance in selecting the best surrogate objective. The findings suggest that the choice of the surrogate objective can be framed as a one-dimensional optimization problem, rather than an unconstrained hyperparameter search, thus simplifying the optimization process.
Methodology
The paper develops the RL2ML framework by defining a truncated power-likelihood surrogate objective and deriving an unbiased estimator under a finite rollout budget. It employs calibrated local-gain analysis and variance decomposition to analyze the update-scale geometry and the implications of different surrogate objectives. The methodology includes a detailed examination of the MaxRL estimator and its relationship to the proposed RL2ML framework.
Results
The results indicate that the RL2ML framework effectively preserves the estimator-objective alignment of MaxRL while allowing for a continuous degree of freedom in the choice of surrogate objectives. The analysis reveals that the optimal choice of the surrogate objective is not solely determined by its proximity to maximum likelihood but is influenced by various factors including the evaluation metric and estimator variance. The paper provides insights into how to allocate weight to low-success prompts more effectively in finite-horizon training.
Implications
The findings of this paper have significant implications for the training of language models and other machine learning tasks that rely on binary feedback. By providing a structured approach to selecting surrogate objectives, RL2ML can enhance the efficiency and effectiveness of reinforcement learning algorithms, particularly in scenarios with limited rollout budgets. This could lead to improved performance in applications such as natural language processing and other areas where feedback is sparse or binary.
Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail
Theory
Optimization
- Introduction of spectral position as a scalable measure of eigenvalue contributions to loss reduction.
- Larger models achieve lower losses by accessing weak spectral signals in the eNTK spectrum.
- Feature learning is identified as a key enabler of spectral reach, amplifying gradients during training.
- The study provides a framework for understanding the dynamics of scaling in large-scale neural networks.
Read more
Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail
Summary
This paper investigates the mechanisms behind neural scaling laws, which describe the predictable relationships between model size, dataset size, compute, and performance. The authors introduce a new metric called spectral position, which measures the eigenvalues of the empirical neural tangent kernel (eNTK) that contribute to loss reduction during training. Their findings reveal that as training progresses, spectral position decreases, indicating a shift from dominant eigenmodes to the spectral tail. Larger models exhibit greater spectral reach, allowing them to learn from weaker spectral signals that smaller models cannot access. The study identifies feature learning as a critical factor enabling this spectral reach, as it adaptively amplifies gradient magnitudes, sustaining learning progress where frozen representations would stall. The authors propose that these insights can inform architectural and optimizer design to enhance model performance.
Methodology
The authors derive a loss-network-position (LNP) decomposition that factors instantaneous loss changes into three components: network scale, loss scale, and scale-free spectral position. This allows for the evaluation of spectral position from per-sample gradients without explicit kernel construction. The framework is validated through controlled experiments with random-feature models, aligning empirical observations with theoretical predictions.
Results
The analysis shows that spectral position consistently decreases throughout training, with larger models reaching lower spectral positions than smaller ones. This indicates that larger models can sustain learning from smaller eigenvalues in the eNTK spectrum, leading to lower losses. Additionally, feature learning is shown to play a significant role in maintaining learning progress as spectral position decreases.
Implications
The findings suggest that understanding spectral reach can guide the design of more effective neural network architectures and optimization strategies. By leveraging insights into spectral dynamics, practitioners can enhance model performance, particularly in large-scale applications.
UniRTL: Unifying Code and Graph for Robust RTL Representation Learning
Multimodal
Graph Learning
- UniRTL integrates RTL code and CDFG for enhanced representation learning.
- The framework employs mutual masked modeling for fine-grained cross-modal alignment.
- A hierarchical training strategy is utilized to maximize data utilization.
- UniRTL outperforms existing methods in performance prediction and code retrieval tasks.
Read more
UniRTL: Unifying Code and Graph for Robust RTL Representation Learning
Summary
The paper presents UniRTL, a multimodal pretraining framework designed to enhance register transfer level (RTL) representation learning by integrating both RTL code and control data flow graph (CDFG) modalities. Existing methods typically rely on a single modality, which limits the expressiveness and generalization of learned representations. UniRTL addresses this by achieving fine-grained alignment between the code and graph through mutual masked modeling and employs a hierarchical training strategy that includes a pretrained graph-aware tokenizer. This approach maximizes data utilization by aligning text and code before integrating the graph. The authors evaluate UniRTL on performance prediction and code retrieval tasks, demonstrating that it consistently outperforms prior methods, thus establishing a more robust foundation for hardware design automation.
Methodology
UniRTL uses a unified Transformer architecture to integrate code and graph modalities, employing mutual masked modeling for alignment. It incorporates a graph-aware tokenizer and follows a hierarchical training strategy, aligning text and code before integrating the graph to leverage the richer information from both modalities.
Results
Experimental evaluations show that UniRTL consistently outperforms previous methods in both performance prediction and code retrieval tasks, demonstrating its effectiveness in providing robust RTL representations.
Implications
The development of UniRTL has significant implications for hardware design automation, potentially accelerating the design workflow and improving the efficiency of tasks like performance prediction and code retrieval. Its multimodal approach could also inspire future research in integrating diverse data modalities in other domains.
Generalized Intention Modeling in Multi-Agent Reinforcement Learning
Reinforcement Learning
- Introduction of a task-adaptive opponent modeling framework that combines multiple intent representations.
- Development of reward-predictive intention embeddings that enhance the ego-agent's understanding of opponent impact on returns.
- Demonstration of improved performance stability and robustness compared to traditional single-component modeling methods.
- Insights into the varying effectiveness of opponent modeling strategies across different environments.
Read more
Generalized Intention Modeling in Multi-Agent Reinforcement Learning
Summary
This paper addresses the challenge of modeling opponents' intentions in multi-agent reinforcement learning (MARL), which is crucial for effective decision-making in competitive environments. Traditional methods rely on fixed episode components, such as predicting the opponent's next action or future states, which may not universally represent intent across different tasks and environments. The authors propose a task-adaptive opponent modeling framework that learns a mixture of multiple intent representations, optimizing for performance-driven outcomes. A novel intention representation is introduced that maximizes mutual information with the ego-agent's future returns, allowing for a more relevant capture of opponent information. The proposed architecture combines various episode components dynamically, enabling the ego-agent to adapt its understanding of intent based on the specific environment. Empirical results demonstrate that this adaptive approach outperforms state-of-the-art baselines across several multi-agent benchmarks, providing insights into the effectiveness of different opponent modeling strategies and their dependence on the environment.
Methodology
The authors propose an adaptive architecture that employs a mixture-of-experts approach to model opponent intentions. This architecture includes separate modules for different episode components (actions, observations, future states, and rewards) that are combined using a learned mixing mechanism. Additionally, they introduce a contrastive InfoNCE objective to model intention embeddings predictive of future rewards, maximizing mutual information with the ego-agent's returns.
Results
The adaptive opponent modeling framework consistently matches or exceeds the performance of state-of-the-art baselines across diverse tasks, including Level-Based Foraging, Predator-Prey, Kuhn Poker, and a customized Google Research Football scenario. The results indicate that modeling intentions based on future ego-agent rewards can yield more informative representations than traditional methods focusing on future states or actions.
Implications
This work has significant implications for improving decision-making in competitive multi-agent environments, potentially enhancing applications in areas such as robotics, game AI, and strategic planning. The insights gained from the adaptive modeling approach could inform the design of more effective agents that better understand and predict the behavior of opponents.
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Optimization
Theory
- Introduction of Singularity-aware Adam (S-Adam) to address issues in non-smooth optimization.
- Development of the Local Geometric Instability (LGI) metric for estimating instability in loss landscapes.
- Adaptive damping mechanism that modulates step sizes based on local geometric conditions.
- Rigorous convergence guarantees established through differential inclusions.
Read more
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Summary
The paper addresses the challenges of optimizing deep learning models that incorporate non-smooth components, such as ReLU activations and quantization operators, which lead to gradient chattering and poor convergence. The authors propose a new optimizer called Singularity-aware Adam (S-Adam) that stabilizes training by dynamically adjusting step sizes based on local geometric instability. The key innovation is the Local Geometric Instability (LGI) metric, which estimates the Clarke subdifferential diameter using the variance of randomized directional derivatives. S-Adam employs an adaptive damping mechanism to slow down updates in regions of high instability while allowing for rapid convergence in smoother areas. The authors provide a rigorous convergence analysis, demonstrating that S-Adam converges almost surely to Clarke stationary points at an optimal rate of O(1/√T). Empirical evaluations show that S-Adam outperforms existing optimizers like AdamW and Prox-SGD in scenarios involving Quantization-Aware Training and high-noise small-batch learning, achieving significant accuracy improvements on benchmark datasets.
Methodology
The authors developed S-Adam by integrating the LGI metric to assess local geometric instability and employing an adaptive damping mechanism that adjusts learning rates in real-time. They conducted a convergence analysis using differential inclusions and performed empirical evaluations on various datasets to compare S-Adam's performance against existing optimizers.
Results
S-Adam demonstrated significant improvements in accuracy, achieving up to +6% on CIFAR-100 and +3% on TinyImageNet compared to AdamW and Prox-SGD. The optimizer effectively mitigated gradient oscillations and improved convergence stability in non-smooth optimization scenarios.
Implications
The findings suggest that S-Adam can serve as a robust alternative to existing adaptive optimizers, particularly in deep learning contexts where non-smooth components are prevalent. This work bridges theoretical insights in non-smooth optimization with practical applications in deep learning, potentially enhancing model training efficiency and performance.
Self-Certifying Transport MCMC via Dual Spectral-Gap Certificates
Theory
Generative Models
Efficient ML
- Introduction of CerT-MCMC framework for learned-transport MCMC with convergence certificates.
- Development of two complementary certificates: covering certificate and quantile-core certificate.
- Quantile-core certificate provides non-vacuous spectral-gap bounds in high dimensions.
- Demonstrated effectiveness on various datasets, including synthetic and real-world applications.
Read more
Self-Certifying Transport MCMC via Dual Spectral-Gap Certificates
Summary
This paper introduces CerT-MCMC, a novel framework that enhances learned-transport Markov chain Monte Carlo (MCMC) methods with automatic and rigorous convergence certificates. The framework utilizes a normalizing flow to map a Gaussian reference distribution to an approximation of the target posterior, which serves as both the proposal in the independence Metropolis–Hastings (IMH) algorithm and the basis for computable spectral-gap bounds. Two complementary certificates are developed: the covering certificate, which bounds weight-ratio oscillation across the full proposal support, and the quantile-core certificate, which focuses on a high-probability core where oscillation is controlled by empirical quantiles. The paper demonstrates that the quantile-core certificate provides non-vacuous spectral-gap bounds even in high dimensions, outperforming the covering certificate in scenarios where the latter becomes ineffective. The framework is validated through experiments on synthetic targets, structural-engineering posteriors, and real-data logistic regression, showing that the quantile-core certificate can effectively track empirical effective sample sizes. This dual-certificate approach is the first of its kind to provide automatic, dimension-aware convergence certificates for learned-transport MCMC, distinguishing between genuine transport failures and limitations of proof techniques.
Methodology
The methodology involves using normalizing flows to create proposals for the IMH algorithm, and deriving two types of spectral-gap certificates. The covering certificate uses finite-sample covering arguments to bound oscillation over the full proposal support, while the quantile-core certificate restricts to a high-probability core, applying one-dimensional empirical quantiles for oscillation control.
Results
The quantile-core certificate consistently delivered non-vacuous spectral-gap bounds across various dimensions (up to D=20) and datasets, while the covering certificate became ineffective in higher dimensions. The quantile-core certificate tracked empirical effective sample sizes within 7%, demonstrating its practical utility.
Implications
The findings suggest that the CerT-MCMC framework can significantly improve the reliability of learned-transport MCMC methods in Bayesian computation, providing practitioners with robust tools for convergence assessment in high-dimensional settings.
Improving Relative Representations with Learned Anchors and Whitened Inner Products
Multimodal
- Introduces learned anchors as robust semantic prototypes for improved relative representations.
- Utilizes a geometry-aware similarity metric that preserves magnitude information and is invariant to affine transformations.
- Demonstrates significant performance gains in cross-model communication across vision and language tasks.
- Enables stable zero-shot communication between models of varying scales and architectures.
Read more
Improving Relative Representations with Learned Anchors and Whitened Inner Products
Summary
This paper addresses the challenge of aligning independently trained neural models that converge to incompatible latent representations, which hinders modular AI systems. The authors propose an enhanced framework for cross-model communication through two key improvements: learning robust semantic anchors and employing a geometry-aware similarity metric. Traditional methods rely on randomly sampled anchors and cosine similarity, which often fail to capture the complex geometries of modern architectures like Transformers. The proposed method ensures better coverage of the data manifold and preserves discriminative magnitude information while being invariant to affine shifts. The results demonstrate significant performance improvements and consistency across various vision and language tasks, enabling nearly lossless information transfer and stable zero-shot communication even among heterogeneous architectures.
Methodology
The authors developed a framework that involves learning anchors to ensure they are informative and stable, covering the data manifold effectively. They also introduced a covariance-aware similarity measure that retains useful angular and magnitude information while being robust to affine distortions. This dual approach addresses the limitations of traditional relative representation methods.
Results
The proposed framework showed substantial improvements in performance and consistency across multiple tasks, allowing for effective zero-shot stitching of embeddings from different models. The method achieved nearly lossless information transfer, even between models with significant architectural differences.
Implications
This work has implications for the development of modular AI systems, enabling better interoperability between different neural models. It can enhance applications in areas requiring cross-model communication, such as transfer learning and multi-task learning in both vision and language domains.
Automating Formal Verification with Reinforcement Learning and Recursive Inference
Reinforcement Learning
Large Language Models
Theory
- Introduces RLVR to improve formal verification in LLMs.
- Achieves a verified reward increase from 2.2% to 58.1% using RLVR.
- Identifies and addresses specification hacking in model outputs.
- Develops a verifier-guided inference scaffold that improves proof generation success rates.
Read more
Automating Formal Verification with Reinforcement Learning and Recursive Inference
Summary
This thesis addresses the challenges of automating formal verification for large language models (LLMs), particularly due to the scarcity of data for proof assistants and the need for precise machine-checkable specifications. The author proposes a novel approach that combines reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search to enhance the generation of verified programs and proofs. The study begins with training open-source models in Dafny using RLVR, achieving a significant increase in verified rewards from 2.2% to 58.1%. However, issues such as specification hacking were identified, where models exploited weak formal specifications. By filtering out underspecified tasks and employing multi-turn RLVR, the verified pass rate improved from 9.7% to 31.1%. Additionally, a verifier-guided inference scaffold in Lean was developed, treating proof generation as a structured search over subgoals, which led to an increase in the pass rate from 46.2% to 69.2% on a pilot set. The thesis also introduces Dalek-Bench, a benchmark derived from the Rust curve25519-dalek verification project, although initial results indicate a need for stronger evaluation methods. Overall, the findings suggest that formal verifiers can significantly enhance LLM capabilities when they are utilized as sources of reward and feedback, provided that the environment offers clean data and robust specifications.
Methodology
The methodology involves training models using reinforcement learning techniques, specifically Group Relative Policy Optimization (GRPO), to optimize the generation of verified programs. The approach includes filtering tasks to eliminate specification vulnerabilities and employing a structured search framework for proof generation in Lean, which incorporates verifier feedback and diagnostics.
Results
The initial experiments showed a verified reward increase from 2.2% to 58.1%, with further refinement leading to a pass rate improvement from 9.7% to 31.1%. The verifier-guided inference scaffold improved the pass rate from 46.2% to 69.2% on a pilot set, and the new benchmark Dalek-Bench revealed areas needing stronger evaluation methods.
Implications
The findings suggest that integrating formal verification processes into LLM training can enhance the reliability and correctness of generated code, which is crucial for applications in cybersecurity and software development. This work lays the groundwork for future research in automated verification methods and their application in real-world scenarios.
Solving Integer Linear Programming with Parallel Tempering
Optimization
- Introduces a solver-free, sampling-based approach for ILP optimization.
- Utilizes Locally-Balanced Proposal and Parallel Tempering to explore discrete feasible regions.
- Demonstrates superior performance compared to SCIP and competitive results against Gurobi.
- Shows robustness to distribution shifts compared to learning-based methods.
Read more
Solving Integer Linear Programming with Parallel Tempering
Summary
This paper presents a novel solver-free, sampling-based optimization framework for Integer Linear Programming (ILP), addressing the limitations of existing learning-based approaches that struggle with generalization and dependence on external solvers. The authors introduce a Locally-Balanced Proposal to construct a transition kernel that avoids gradient approximations, and they integrate Parallel Tempering to navigate the multimodal energy landscapes typical of ILP problems. Two tempering strategies are proposed: standard temperature tempering and penalty tempering, which modulates constraint barriers while maintaining the objective landscape over feasible solutions. The empirical evaluation demonstrates that the proposed method consistently outperforms SCIP across four benchmarks and matches or exceeds Gurobi's performance on two tasks within a 200-second time limit. Additionally, the framework shows robustness to distribution shifts, remaining competitive with classical solvers on real-world MIPLIB 2017 instances without requiring problem-specific tuning.
Methodology
The authors employ a sampling-based approach that leverages the linear structure of ILP to construct a transition kernel using a Locally-Balanced Proposal. They implement Parallel Tempering with two strategies: temperature tempering, which flattens the energy landscape, and penalty tempering, which selectively reduces barriers in infeasible regions while preserving feasible solutions.
Results
The proposed framework outperformed SCIP across all four benchmark tasks and matched or exceeded Gurobi's performance on two tasks within a 200-second budget. It also demonstrated greater robustness to distribution shifts compared to learning-based methods and remained competitive with classical solvers on MIPLIB 2017 instances without any problem-specific tuning.
Implications
This work suggests a new direction for solving ILP problems that could enhance the efficiency and effectiveness of combinatorial optimization methods, particularly in scenarios where traditional solvers struggle or require extensive tuning. The findings may also influence future research in optimization techniques and machine learning applications in combinatorial settings.
Test Time Training for Supervised Causal Learning
Graph Learning
Theory
Efficient ML
- Identifies critical limitations in existing Supervised Causal Learning methods.
- Introduces TTT-SCL, a framework for dynamic training set generation at test time.
- Establishes a theoretical basis connecting TTT-SCL with score-based methods.
- Demonstrates significant performance improvements across various datasets.
Read more
Test Time Training for Supervised Causal Learning
Summary
This paper addresses the limitations of Supervised Causal Learning (SCL) in causal discovery, particularly its struggles with out-of-distribution generalization. The authors identify three main issues with existing SCL methods: a performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization. To overcome these challenges, they propose a novel framework called Test-Time Training for Supervised Causal Learning (TTT-SCL), which dynamically generates training sets tailored to specific test instances. This approach shifts the focus from static training sets to a more adaptive methodology that aligns closely with the characteristics of the test domain. The authors establish a theoretical connection between TTT-SCL and score-based methods, demonstrating that the latter can be viewed as a special case of TTT-SCL. Through extensive experiments on synthetic, pseudo-real, and real-world datasets, the authors show that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods, confirming its effectiveness and practical applicability in real-world scenarios.
Methodology
The authors propose the TTT-SCL framework, which involves dynamically generating training sets that are explicitly aligned with the distribution of the test instance. This is achieved through a specialized training process that adapts to the characteristics of the test data, allowing for improved generalization and robustness against distribution shifts.
Results
Experiments conducted on synthetic benchmarks, pseudo-real, and real-world datasets reveal that TTT-SCL significantly outperforms both existing SCL methods and traditional causal discovery techniques, indicating its superior performance and adaptability.
Implications
The findings suggest that TTT-SCL can enhance the applicability of causal learning in real-world scenarios, where data distributions may vary significantly from training data. This could lead to more reliable causal inference in various fields, including economics, healthcare, and social sciences.
On Distributional Reinforcement Learning in Chaotic Dynamical Systems
Reinforcement Learning
Theory
Optimization
- Distributional RL objectives are smoother than expectation-based objectives in chaotic systems.
- Return distributions under mild statistical stability assumptions are Lipschitz continuous in the 1-Wasserstein metric.
- Empirical analysis shows that distributional objectives lead to smoother loss landscapes and lower variance targets.
- Distributional Q-learning methods outperform non-distributional approaches in chaotic control experiments.
Read more
On Distributional Reinforcement Learning in Chaotic Dynamical Systems
Summary
This paper addresses the challenges posed by chaotic dynamical systems in the context of Reinforcement Learning (RL), particularly focusing on the high sensitivity to initial conditions that leads to high-variance bootstrap targets and poorly conditioned gradient updates. The authors argue that traditional RL methods, which optimize expected returns through scalar value functions, fail in chaotic environments due to the irregularities introduced by chaotic dynamics. They propose that under certain statistical stability assumptions, the return distribution evolves more smoothly than individual trajectories when measured using the 1-Wasserstein metric. This insight leads to a distributional RL framework that aligns optimization with the structure of return distributions, resulting in better-conditioned learning. The paper provides a theoretical foundation for the advantages of distributional methods in chaotic systems and empirically demonstrates that these methods yield smoother optimization landscapes and improved performance in chaotic control tasks compared to non-distributional approaches.
Methodology
The authors analyze the optimization landscape of chaotic systems and demonstrate that distributional RL can provide a smoother optimization objective. They employ theoretical proofs regarding the Lipschitz continuity of return distributions and conduct empirical experiments to compare distributional and non-distributional RL methods in chaotic environments.
Results
The study finds that distributional RL methods result in smoother loss landscapes and lower variance in one-step targets, which in turn leads to improved episodic returns in chaotic control tasks. The theoretical results confirm that the distributional RL objective is more stable than traditional scalar value function approaches.
Implications
The findings suggest that distributional RL could be a more effective approach for learning in chaotic environments, which are prevalent in various scientific and engineering domains. This could enhance the reliability of RL applications in fields such as climate modeling, fluid dynamics, and multi-agent systems.
FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction
Interpretability
- FlagGAM provides a rule-defined basis framework for GAM-style tabular prediction.
- It extends rule construction to handle both numerical and categorical features across classification and regression tasks.
- The framework retains a sparse rule-basis matrix, allowing for feature-specific weighting and flexible prediction heads.
- FlagGAM demonstrates competitive performance against modern additive models and tree-based methods, especially under challenging data conditions.
Read more
FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction
Summary
The paper introduces FlagGAM, a novel framework for explainable tabular prediction that emphasizes accuracy, transparency, and robustness in high-stakes domains. FlagGAM separates the construction of feature-level rules from the prediction process, utilizing a Flag Core Module that transforms numerical and categorical variables into human-readable univariate bases. These bases include threshold flags, category-level flags, tail-deviation bases, and categorical step functions. The framework employs a default additive head to combine these bases into a restricted Generalized Additive Model (GAM) predictor. Unlike traditional models that reduce triggered rules to compact summaries, FlagGAM retains a sparse rule-basis matrix that supports mixed-type classification and regression, feature-specific weighting, and flexible prediction heads. The authors demonstrate that FlagGAM performs comparably to Explainable Boosting Machines (EBM) in transparent additive mode, significantly outperforms ridge regression in mixed-type regression tasks, and exhibits lower AUROC degradation under missing and noisy data conditions. The flexible heads further enhance accuracy, approaching the performance of strong tree-based models, while maintaining interpretability through a rule-basis representation followed by a nonlinear predictor. Overall, FlagGAM offers a practical solution for tabular settings that require competitive accuracy, clear communication of rules, and resilience to imperfect inputs.
Methodology
FlagGAM employs a Flag Core Module to convert raw variables into univariate basis functions, which are then combined using a default additive head to form a restricted GAM-style predictor. The framework includes training-only, within-feature false discovery rate-controlled cutoff screening and rule-level handling of missing values, ensuring that all rules are derived from training data.
Results
FlagGAM was evaluated on various benchmarks, including clinical, credit-risk, census-income, and housing datasets. The results indicate that it closely matches the performance of modern additive models like EBM, significantly outperforms ridge regression in mixed-type regression, and shows reduced AUROC degradation under conditions of missing and noisy data. The flexible prediction heads further enhance accuracy, approaching tree-based model performance.
Implications
FlagGAM has significant implications for fields requiring explainable AI, such as healthcare and finance, where transparent and interpretable models are crucial for decision-making. Its ability to handle mixed-type data and provide clear rules makes it a valuable tool for practitioners in high-stakes environments.
Apertus LLM Family Expansion via Distillation and Quantization
Large Language Models
Efficient ML
NLP
- Introduction of the Apertus-v1.1 model family through distillation and quantization.
- Demonstration of cost-effective model expansion without the need for extensive pre-training.
- Validation of pre-training distillation as a method to enhance model performance with fewer resources.
- Exploration of quantization techniques to optimize models for various hardware constraints.
Read more
Apertus LLM Family Expansion via Distillation and Quantization
Summary
The paper addresses the growing demand for Large Language Models (LLMs) to meet diverse budget and hardware constraints by proposing a cost-effective method for expanding model families through distillation and quantization. The authors introduce the Apertus-v1.1 model family, derived from the Apertus 8B LLM, which includes models with up to 4 billion parameters trained on 1.7 trillion tokens. The study validates the use of pre-training distillation to reduce computational costs significantly while maintaining strong accuracy across various hardware requirements. The methodology emphasizes the importance of model families in providing practitioners with flexible options for deployment, thus democratizing access to advanced AI capabilities. The authors also explore quantization as a means to further optimize models for specific hardware profiles, achieving a balance between performance and resource efficiency. Overall, the paper demonstrates that the Apertus-v1.1 models can deliver competitive performance with reduced training costs compared to traditional methods.
Methodology
The authors employed pre-training distillation to transfer knowledge from a larger teacher model (Apertus 8B) to smaller student models (Apertus-v1.1), allowing for faster convergence and improved performance with fewer training tokens. They also utilized quantization techniques to reduce memory footprint and inference latency while managing the cost-accuracy trade-off. The training involved a mix of KL-Divergence and cross-entropy loss functions, and the models were evaluated on multilingual benchmarks.
Results
The Apertus-v1.1 models demonstrated strong multilingual performance, closely resembling the capabilities of the larger Apertus 8B model while being trained on significantly fewer tokens (1.7 trillion vs. 15 trillion). The models achieved competitive accuracy and efficiency, showcasing the effectiveness of the distillation and quantization approach in expanding the model family.
Implications
The findings suggest that distillation and quantization can significantly lower the barriers to deploying LLMs in various applications, making advanced AI capabilities more accessible across different hardware environments. This approach can facilitate the development of tailored models for specific use cases, enhancing the versatility and applicability of LLMs in real-world scenarios.
Learning Multi-Agent Coordination via Sheaf-ADMM
Optimization
Graph Learning
Robotics
- Introduces Sheaf-ADMM, a framework for multi-agent coordination using differentiable optimization.
- Utilizes cellular sheaf theory to define inter-agent constraints for heterogeneous global consensus.
- Demonstrates improved robustness and performance in tasks like MNIST classification and Sudoku solving.
- Enables distinct analysis of coordination dynamics through the separation of state variables.
Read more
Learning Multi-Agent Coordination via Sheaf-ADMM
Summary
This paper presents Sheaf-ADMM, a differentiable optimization framework designed for multi-agent coordination. The framework allows agents to process overlapping local views of an input, each solving a convex subproblem parameterized by a neural encoder. Coordination among agents is achieved through the Alternating Direction Method of Multipliers (ADMM), with inter-agent constraints defined by a cellular sheaf, which specifies aspects of neighboring solutions that must agree. This approach enables heterogeneous forms of global consensus. The authors demonstrate the effectiveness of Sheaf-ADMM on various tasks, including maze pathfinding, image classification, and Sudoku. The results show that agents, despite having limited local views, can learn to coordinate effectively to produce correct global outputs. Notably, the method improves robustness to distribution shifts in MNIST classification compared to standard CNNs and achieves higher solve rates in Sudoku than parameter-matched message-passing neural network (MPNN) baselines. The structure of ADMM also allows for distinct analysis of primal, consensus, and dual state variables, providing insights into coordination dynamics that are not available in traditional message-passing architectures.
Methodology
The methodology involves formulating coordination as a constrained optimization problem solved using ADMM. Each agent independently solves local subproblems, followed by a consensus step that projects their proposals towards global consistency. The framework is differentiable, allowing for backpropagation through the optimization trajectory, and incorporates neural network parameterizations for agent subproblems.
Results
The evaluation of Sheaf-ADMM on tasks such as maze pathfinding, image classification (MNIST), and Sudoku shows that agents can effectively coordinate despite limited local views. The method outperforms standard CNNs in robustness to distribution shifts and achieves significantly higher solve rates in Sudoku compared to MPNN baselines.
Implications
The Sheaf-ADMM framework has potential applications in areas requiring multi-agent coordination, such as robotics, distributed systems, and collaborative AI. Its ability to learn coordination dynamics and improve robustness to local view limitations could enhance performance in complex, real-world tasks.
Constrained Flow Optimization via Sequential Fine Tuning for Molecular Design
Generative Models
Optimization
- Introduction of a formal framework for Constrained Generative Optimization.
- Development of the Constrained Flow Optimization (CFO) algorithm for balancing reward maximization and constraint satisfaction.
- CFO provides convergence guarantees for constrained generative optimization.
- Experimental results show consistent improvements in reward and constraint satisfaction.
Read more
Constrained Flow Optimization via Sequential Fine Tuning for Molecular Design
Summary
This paper addresses the challenge of optimizing generative models, specifically diffusion and flow models, to maximize reward functions while adhering to constraints relevant in molecular design and protein engineering. The authors introduce a formal framework for Constrained Generative Optimization, which allows for the adaptation of pre-trained models to generate samples that not only maximize task-specific rewards (like binding affinity) but also satisfy domain-specific constraints (such as molecular synthesizability). The proposed algorithm, Constrained Flow Optimization (CFO), employs a dual approach based on the augmented Lagrangian scheme, transforming the constrained optimization problem into a sequence of simpler generative optimization tasks. This method enables automatic balancing between reward maximization and constraint satisfaction without the need for manual weight adjustments. The authors provide convergence guarantees for both constrained generative optimization and constrained generation through CFO. Experimental evaluations demonstrate that CFO consistently improves reward outcomes while maintaining high levels of constraint satisfaction across synthetic and molecular design tasks, highlighting its practical applicability in scientific discovery.
Methodology
The authors propose a dual approach using the augmented Lagrangian scheme to convert the constrained optimization problem into a series of ordinary generative optimization subproblems. CFO alternates between solving a KL-regularized fine-tuning problem to maximize an augmented reward function and updating the parameters based on estimated constraint violations from generated samples.
Results
CFO was evaluated in both synthetic settings and real molecular design tasks, achieving significant increases in reward while ensuring high constraint satisfaction. The results demonstrate the algorithm's effectiveness in optimizing generative models for practical applications in molecular design.
Implications
The findings suggest that CFO can enhance the reliability and predictability of generative models in scientific discovery applications, particularly in fields requiring adherence to strict constraints, such as drug discovery and protein engineering.
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers
Graph Learning
Optimization
Efficient ML
- LoRe introduces a per-step operator budgeting framework for iterative graph solvers.
- The method dynamically routes computation to high-conflict interactions, improving efficiency.
- LoRe achieves a 15× speedup and 44× memory reduction on the TSP problem.
- The framework is a drop-in solution that does not require retraining of existing models.
Read more
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers
Summary
The paper introduces LoRe, a novel framework designed to enhance the efficiency of diffusion-based neural solvers for combinatorial optimization problems, particularly in scenarios where inference time and memory usage are critical. Traditional iterative graph solvers often face scalability issues due to the need for dense evaluations of interactions, which can lead to out-of-memory errors and excessive latency. LoRe addresses this by implementing a per-step interaction-evaluation budget, dynamically routing computation to focus on high-conflict or high-uncertainty interactions rather than relying on static sparsification methods. This approach is inspired by methodologies from many-body physics, specifically the Cluster-Bath decomposition, allowing the solver to maintain accuracy while reducing computational load. The authors validate LoRe through extensive experiments, demonstrating significant improvements in scalability and efficiency across various combinatorial optimization tasks, including the Maximum Independent Set (MIS) and the Traveling Salesperson Problem (TSP).
Methodology
LoRe operates as a training-free, inference-time wrapper that implements a Cluster-Bath decomposition for graph solvers. It identifies a dynamic subset of high-conflict interactions to evaluate at each step while approximating the influence of omitted interactions through a global recall signal. This allows for efficient computation while maintaining solution quality.
Results
LoRe significantly extends the feasible inference capabilities of graph solvers, achieving over 3× improvement beyond baseline out-of-memory limits on the MIS problem. It also delivers up to an 8× speedup and a 12× reduction in peak memory usage, with competitive solution quality maintained. For the TSP, it achieves a 15× speedup and a 44× memory reduction.
Implications
The findings suggest that LoRe can be effectively applied in real-time decision-making systems where computational resources are constrained, such as dynamic vehicle routing and resource allocation in data centers. The dynamic routing approach can lead to more efficient and scalable solutions in various combinatorial optimization tasks.
IRIS: time-structured manifold projections
Time Series
Optimization
Theory
- IRIS integrates time-structured layouts with manifold learning, enhancing the visualization of dynamic biomedical data.
- The algorithm operates in two phases: optimizing radial distances for timestamps and adjusting angular positions for high-dimensional structure.
- Evaluation across multiple datasets shows IRIS outperforms UMAP in representing temporal relationships while retaining class structure.
- The method is open-source, promoting accessibility and further research in the field.
Read more
IRIS: time-structured manifold projections
Summary
The paper introduces IRIS, a novel Manifold Learning algorithm designed to visualize high-dimensional biomedical data while incorporating temporal information. Traditional algorithms like t-SNE and UMAP struggle to represent time-ordered data effectively, often leading to a loss of critical temporal dynamics in visualizations. IRIS addresses this limitation by structuring layouts both chronologically and according to manifold topology, enabling clearer insights into the evolution of cell types and other classes over time. The algorithm operates in two phases: first, it optimizes the mapping of timestamps to radial distances, ensuring a uniform density of points, and second, it adjusts the angular positions of points to minimize divergence between high-dimensional and low-dimensional representations. The authors evaluate IRIS on diverse datasets, including single-cell RNA sequencing (scRNA-seq) and comparative metagenomics, demonstrating its ability to reveal dynamic phenomena that are obscured in traditional layouts. The results indicate that IRIS effectively structures data by time while maintaining class distinctions, outperforming UMAP in temporal representation. The algorithm is open-source and implemented in Python and C++, with potential future enhancements aimed at improving computational efficiency and developing interactive visualization tools.
Methodology
IRIS employs a two-phase optimization process: first, it maps timestamps to radial distances to ensure uniform density, and second, it optimizes the angular positions of points to minimize divergence between high-dimensional and low-dimensional spaces. This is achieved through a polar reparameterization of the Euclidean cost function.
Results
IRIS effectively structures layouts by time, demonstrating superior performance in temporal representation compared to UMAP. Quantitative metrics confirm that IRIS retains class structuring similar to UMAP while elucidating dynamic phenomena that are not apparent in traditional visualizations.
Implications
The ability to visualize high-dimensional biomedical data with integrated temporal information has significant implications for understanding biological processes, such as developmental stages in scRNA-seq and temporal trends in metagenomics. The open-source nature of IRIS encourages further exploration and application in various biomedical research contexts.
Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens
Optimization
Theory
Efficient ML
- Introduces a unified framework for ZO Hessian approximation using single-step Policy Optimization.
- Presents ZoVH, a comprehensive suite of variance-reduced Hessian estimators.
- Establishes theoretical guarantees for the unbiasedness and variance optimality of the proposed methods.
- Demonstrates significant improvements in estimation accuracy and convergence performance in empirical evaluations.
Read more
Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens
Summary
This paper addresses the challenge of accurate Zeroth-Order (ZO) Hessian estimation, which is crucial for derivative-free optimization tasks such as bilevel optimization and Bayesian inference. The authors propose a unified framework that reinterprets ZO Hessian approximation through the lens of single-step Policy Optimization (PO), establishing a theoretical equivalence between ZO Hessian estimators and the Hessian of a smoothed PO objective. This leads to the introduction of ZoVH, a suite of variance-reduced estimators for the full Hessian matrix and its inverse. The methodology leverages an optimal baseline to minimize variance and a query reuse strategy to enhance sample efficiency. Theoretical analysis confirms the unbiasedness and variance optimality of the proposed estimators, while empirical results demonstrate superior estimation accuracy and convergence performance compared to classical methods. The paper also develops a curvature-aware Zeroth-Order Optimization (ZOO) algorithm, which incorporates ridge regularization and bias correction, proving its effectiveness through extensive experiments.
Methodology
The authors reinterpret ZO Hessian approximation as the Hessian of a smoothed objective under a parameterized sampling policy. They introduce ZoVH, which utilizes an optimal baseline to minimize variance and a query reuse strategy to leverage historical function queries, enhancing sample efficiency. Theoretical analysis is provided to confirm the properties of the estimators, and empirical evaluations are conducted to validate the findings.
Results
The theoretical analysis confirms that the ZoVH estimators are unbiased and optimal in terms of variance. Empirical results show that ZoVH achieves lower estimation errors compared to classical Hessian estimators across various synthetic functions and neural networks, and the curvature-aware ZOO algorithm demonstrates substantial improvements over existing ZO methods in practical applications.
Implications
The proposed methods can significantly enhance the performance of derivative-free optimization tasks, particularly in high-dimensional settings where traditional methods struggle. The findings have potential applications in areas requiring efficient optimization techniques, such as machine learning model training and uncertainty quantification.
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
Robotics
Theory
Multimodal
- BOKBO is the first conformal abstention layer for K-sample VLA inference, providing safety guarantees.
- The method achieves high reliability and task success rates while addressing silent failures in traditional K-sampling.
- A critical analysis reveals that existing nonconformity scores fail to measure policy uncertainty accurately.
- The introduction of a learned violation predictor improves safety calibration significantly.
Read more
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
Summary
The paper introduces BOKBO, a novel conformal abstention layer designed for K-sample vision-language-action (VLA) policies. Traditional K-sample inference methods assume at least one candidate action is safe, which often leads to silent failures when all candidates are unsafe. BOKBO addresses this by providing finite-sample distribution-free upper bounds on the unsafe execution rate among non-abstained decisions. The authors demonstrate that BOKBO can maintain a conditional risk control (CRC) bound with 86% reliability across bootstrap splits, achieving 78% coverage and 70% net task success on specific benchmarks. They also reveal a critical flaw in existing nonconformity scores used in K-sampling, which correlate more with perturbation than with actual policy uncertainty. The paper proposes a learned violation predictor as a more reliable alternative. Additionally, it highlights the importance of per-task expert force calibration to mitigate inflated violation rates in safety evaluations. Overall, BOKBO represents a significant advancement in ensuring safety in K-sample VLA inference.
Methodology
The authors developed BOKBO by applying conformal risk control (CRC) to the K-sample VLA setting. They tested various nonconformity scores, including base-policy confidence, K-sample disagreement, and a learned violation predictor, to determine their effectiveness in predicting safety violations. The methodology also involved extensive evaluations on benchmarks like LIBERO, assessing the performance of BOKBO against traditional K-sample methods.
Results
BOKBO demonstrated a conditional CRC bound holding on 86% of bootstrap splits, achieving 78% coverage and 70% net task success on the libero_object_temp_x0.1 benchmark. The per-task variant improved the minimum conditional hold from 0.71 to 0.93. The analysis revealed that existing free signals correlated strongly with perturbation rather than uncertainty, while the learned predictor provided better calibration. Additionally, the paper identified that globally-set thresholds inflated violation rates by 5x, which was resolved through expert-calibrated thresholds.
Implications
BOKBO has significant implications for enhancing the safety and reliability of VLA policies in robotics and other applications where K-sampling is employed. By ensuring that unsafe actions can be identified and abstained from, BOKBO can improve the robustness of autonomous systems in uncertain environments. The findings also suggest a need for reevaluating existing methodologies in safety evaluations and the calibration of action thresholds.
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Computer Vision
Theory
Optimization
- Introduction of a large-scale benchmark for post-hoc calibration covering diverse tasks and models.
- Standardized evaluation framework for comparing dozens of calibration methods.
- Post-Hoc Improvement (PHI) proposed as a new metric for assessing calibration quality.
- Empirical results show that smooth calibration functions are superior to binning methods.
Read more
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Summary
The paper introduces CalArena, a comprehensive benchmark for evaluating post-hoc calibration methods in machine learning. It addresses the critical issue of poorly calibrated probability estimates in classifiers, which can undermine decision-making in high-stakes applications. The authors compile nearly 2000 experiments across various tasks, including binary and multiclass classification in both tabular and computer vision domains. They provide a standardized framework for comparing numerous calibration methods, emphasizing the importance of Post-Hoc Improvement (PHI) as a more principled metric for assessing calibration quality. The study reveals that smooth calibration functions generally outperform binning-based methods, and highlights the necessity of calibration-specific designs for generic models. The authors also release all data, code, and tools to facilitate further research in this area.
Methodology
The authors constructed a suite of benchmarks by aggregating predictions from various classical and modern models across multiple datasets. They standardized implementations of numerous calibration methods and employed a new evaluation metric, Post-Hoc Improvement (PHI), to assess both calibration error reduction and predictive performance degradation.
Results
The results indicate consistent patterns across domains, with smooth calibration functions outperforming binning-based approaches. The study also found that dedicated multiclass methods are crucial in high-dimensional settings, and generic machine learning models often require calibration-specific designs to be competitive.
Implications
The findings have significant implications for practitioners in machine learning, particularly in fields where reliable probability estimates are essential. The benchmark and tools provided can guide the selection and development of effective calibration methods, ultimately improving the reliability of machine learning systems in critical applications.
Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended
Efficient ML
Graph Learning
Large Language Models
- Introduces Invariant Bit Packing (IBP), a lossless compression algorithm tailored for ML workloads.
- IBP achieves significant performance improvements, including 74% faster GNN training and 180% faster DLRM embedding lookup.
- The method minimizes GPU memory usage while avoiding the accuracy trade-offs associated with lossy compression.
- Provides easy-to-use APIs for integration into existing ML frameworks.
Read more
Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended
Summary
This paper addresses the significant GPU memory bottleneck encountered during machine learning (ML) training and inference, particularly when handling large datasets that exceed GPU memory capacity. Traditional methods often rely on PCIe for on-demand tensor transfers, leading to critical transfer bottlenecks. While lossy compression techniques have been proposed to mitigate these issues, they introduce unpredictable accuracy loss, complicating deployment in real-world applications. The authors propose a novel approach using lossless compression, specifically through a new algorithm called Invariant Bit Packing (IBP). IBP minimizes data transfer time by identifying and eliminating invariant bits across tensors, optimizing GPU-accelerated decompression, and leveraging warp parallelism and asynchronous PCIe transfers. The paper demonstrates the integration of IBP into popular ML frameworks, showcasing its effectiveness in enhancing performance without sacrificing accuracy. The results indicate substantial improvements in training and inference times across various ML models, including Graph Neural Networks (GNNs), Deep Learning Recommendation Models (DLRMs), and Large Language Models (LLMs).
Methodology
The authors analyze PCIe transfer bottlenecks and existing compression methods, then develop IBP, which identifies invariant bits across tensors and compresses data while minimizing metadata. The algorithm is optimized for GPU decompression, allowing for efficient data transfer and processing. IBP is integrated into popular ML systems, and its performance is evaluated against state-of-the-art GPU-accelerated compression libraries.
Results
IBP significantly accelerates GNN training by 74%, DLRM embedding lookup by 180%, and LLM inference by 24% on average when tested on an A100 GPU, all while maintaining model accuracy. The method also demonstrates effectiveness on streaming datasets.
Implications
The findings suggest that lossless compression can effectively alleviate GPU memory bottlenecks in ML applications, enabling the use of larger datasets without compromising model performance. This has implications for various domains utilizing ML, including e-commerce, drug discovery, and fraud detection.
Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption
Theory
Efficient ML
Optimization
- Introduces a robust recovery algorithm for Gaussian Single Index Models with non-monotonic link functions.
- Establishes the existence of a convex basin in the loss landscape that aids in robust recovery.
- Demonstrates efficient convergence to low estimation error under adversarial conditions.
- Fills a significant gap in robust statistics literature for non-monotonic link functions.
Read more
Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption
Summary
This paper addresses the challenge of robustly learning Gaussian Single Index Models (SIMs) amidst heavy-tailed noise and a fraction of adversarially corrupted data. Previous research has focused on specific cases like linear regression and monotonic link functions, but these methods do not extend to generic asymmetric non-monotonic link functions, which are prevalent in modern neural architectures. The authors propose a novel robust recovery algorithm that operates with near-linear sample and time complexity for these non-monotonic link functions, filling a significant gap in the literature. The key contribution is a new structural understanding of the loss landscape under adversarial conditions, demonstrating the existence of a constant-radius convex basin around the true parameter. This basin can be efficiently accessed through robust spectral initialization, allowing for effective gradient descent that converges to a low estimation error. The findings provide the first robust recovery guarantees for a wide range of nonlinear SIMs, which were previously unaddressed, thus advancing the field of robust statistics in high-dimensional settings.
Methodology
The authors utilize a combination of robust spectral initialization and gradient descent techniques to navigate the loss landscape of non-monotonic SIMs. They establish theoretical guarantees for the existence of convex basins around the true parameters, which facilitate efficient recovery even in the presence of adversarial corruption.
Results
The proposed algorithm achieves a final estimation error of O(σ√ϵ) with a time complexity of ˜O(nd) and requires ˜O(d) samples, where ϵ denotes the contamination fraction. This represents a significant improvement over previous methods that either failed under adversarial conditions or were limited to monotonic link functions.
Implications
The findings have broad implications for robust statistical modeling in high-dimensional data settings, particularly in applications involving neural networks and other machine learning models that utilize non-monotonic link functions. The ability to recover parameters robustly in the presence of adversarial corruption enhances the reliability of machine learning systems in real-world scenarios.
CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment
Graph Learning
Time Series
Interpretability
- CellBRIDGE augments Optimal Transport with interaction-aware costs derived from ligand-receptor signaling.
- The method improves trajectory inference by explicitly modeling cell-cell communication.
- CellBRIDGE enables interpretable in silico perturbations that align with expected biological outcomes.
- The approach shows broad applicability across various trajectory inference frameworks.
Read more
CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment
Summary
The paper addresses the challenge of inferring cellular dynamics from population snapshots in single-cell RNA sequencing (scRNA-seq), where direct tracking of individual cells is not possible due to destructive measurements. Traditional methods rely on gene-expression distances for Optimal Transport (OT) but overlook the structured cell-cell interactions mediated by ligand-receptor signaling. The authors introduce CellBRIDGE, a novel approach that enhances OT by incorporating a directed, typed interaction cost based on ligand-receptor activity. This method improves cross-snapshot couplings and trajectory estimates in both synthetic and real scRNA-seq datasets. CellBRIDGE also allows for mechanistically interpretable in silico perturbations, demonstrating its effectiveness in modeling cellular dynamics and its potential for guiding experimental design in drug discovery.
Methodology
CellBRIDGE employs a Fused Gromov-Wasserstein (FGW) framework that minimizes both the cost of transport in gene expression space and the structural distortion of inferred communication networks. It constructs proxy communication networks based on directed ligand-receptor pairs within local expression neighborhoods, allowing for a biologically meaningful prior in trajectory inference.
Results
The experiments conducted on synthetic and real-world scRNA-seq datasets indicate that CellBRIDGE significantly enhances trajectory inference accuracy compared to feature-only baselines. The method also successfully captures the effects of silencing specific ligand-receptor pairs, leading to trajectory shifts that mirror expected biological responses.
Implications
CellBRIDGE has the potential to advance our understanding of cellular dynamics in development and disease, providing a valuable tool for drug discovery and experimental design. Its ability to incorporate biological interactions into trajectory inference could lead to more accurate models of cellular behavior.
The Sample Complexity of Multiclass and Sparse Contextual Bandits
Theory
Reinforcement Learning
Efficient ML
- Introduces improved sample complexity bounds for contextual bandits with sparse rewards.
- Establishes algorithms that achieve ε-optimal policies with significantly reduced dependence on the number of actions.
- Bridges a gap in existing literature by providing tight bounds that are minimax optimal up to logarithmic factors.
- Utilizes two complementary approaches: DEC-based exploration and low-variance exploration techniques.
Read more
The Sample Complexity of Multiclass and Sparse Contextual Bandits
Summary
This paper investigates the sample complexity of contextual bandits in a stochastic i.i.d. setting, focusing on the s-sparse scenario where the reward vector has an L1-norm bounded by s, significantly smaller than the number of actions |A|. The authors present algorithms that achieve an ε-optimal policy with a sample complexity of O((s/ε² + |A|/ε) log |Π|/δ), improving upon previous bounds that included an extraneous Θ(|A|⁹) dependence. The study bridges a gap in prior research by establishing tight sample complexity bounds for contextual bandits with sparse rewards, particularly in the context of multiclass classification. The results are derived through two main approaches: an exploration-by-optimization algorithm informed by the decision-estimation coefficient (DEC) and a low-variance exploration technique that leads to tractable algorithms. These findings not only enhance the theoretical understanding of contextual bandits but also provide practical algorithms with improved sample complexity guarantees for applications such as bandit multiclass list classification.
Methodology
The authors employ two main methodologies: first, they analyze contextual bandits using an exploration-by-optimization algorithm informed by the decision-estimation coefficient (DEC), which allows for optimal sample complexity rates based on sparsity. Second, they develop a low-variance exploration technique that leads to concrete algorithms and extends to contextual combinatorial semi-bandits, enhancing sample complexity guarantees.
Results
The paper presents algorithms that achieve a sample complexity of O((s/ε² + |A|/ε) log |Π|/δ), which is a significant improvement over previous bounds that had a Θ(|A|⁹) dependence. The results are shown to be minimax optimal up to logarithmic factors, providing a theoretical foundation for efficient learning in contextual bandit settings.
Implications
The findings have significant implications for various applications in online learning, such as recommendation systems, adaptive experimentation, and medical decision-making, where efficient decision-making under uncertainty is crucial. The improved sample complexity guarantees can lead to more effective algorithms in practical scenarios.
Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation
Interpretability
- Introduces an expert-augmented framework combining machine learning with chemists' expertise for route evaluation.
- Utilizes a DeepSets-based model trained on tree edit distances and fine-tuned with expert evaluations.
- Achieves significant improvements in scoring accuracy and interpretability compared to existing methods.
- Provides a dual-output evaluation system that aligns with real-world decision-making in synthesis planning.
Read more
Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation
Summary
This paper addresses the challenge of selecting efficient multi-step synthetic routes in organic synthesis, particularly in medicinal and process chemistry. The authors propose an expert-augmented, data-driven scoring framework that integrates machine learning with chemists' domain knowledge to provide both numerical scores and interpretable qualitative assessments of synthetic routes. The framework employs a DeepSets-based model trained on tree edit distances between reference and machine-generated routes, which is then fine-tuned using expert evaluations. This approach allows for a dual-output evaluation, providing a multi-objective quality score alongside a feasibility rating. The model demonstrates significant improvements over previous baselines in both quantitative scoring and qualitative assessments, achieving a Spearman correlation coefficient of 0.78 and a Pearson correlation of 0.77 for category assessments, as well as a top-1 ranking accuracy of 60.2% for score predictions. The results indicate that the framework effectively captures the nuances of expert chemical judgment, making it a valuable tool for retrosynthetic planning.
Methodology
The authors developed a DeepSets-based scoring model that processes synthetic routes by comparing tree edit distances. The model is trained on a large dataset of patent routes and fine-tuned using expert evaluations to enhance its predictive capabilities. The framework translates complex chemical reasoning into a learnable format, allowing for both numerical scoring and qualitative assessments of route feasibility.
Results
The proposed model achieved a Spearman correlation of 0.78 ± 0.05 and a Pearson correlation of 0.77 ± 0.06 in predicting category assessments. It also reached a top-1 ranking accuracy of 60.2% for score predictions, significantly outperforming the previous baseline of 17.5%. The model effectively captures expert judgment nuances, achieving a classification accuracy of 67 ± 6.4% in a three-tier rating system.
Implications
This framework has the potential to streamline the process of synthetic route evaluation in organic chemistry, making it more efficient and scalable. By integrating expert knowledge with machine learning, it can enhance decision-making in drug discovery and development, ultimately leading to more effective and cost-efficient synthesis strategies.
Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions
Optimization
Graph Learning
Interpretability
- COAST provides a principled framework for designing interventions that induce state transitions in complex systems.
- The framework integrates causal discovery and multi-objective optimization to balance efficacy, complexity, and stability of interventions.
- COAST is modular and domain-agnostic, making it applicable across various fields, particularly in biomedicine.
- The approach successfully identifies causal drivers and robust intervention strategies from both synthetic and real datasets.
Read more
Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions
Summary
The paper introduces COAST (Causally Optimal Actions for State Transitions), a novel framework designed to facilitate the in-silico design of interventions that can induce specific state transitions in complex systems, particularly in biomedical contexts. Traditional predictive models often lack mechanistic insight and do not provide a structured approach for decision-making regarding interventions. COAST addresses this gap by employing causal intelligence to learn context-specific causal graphs and structural causal models from data that characterize source and target states. The framework incorporates a multi-objective optimization approach that balances the efficacy of transitions, the complexity of interventions, and the stability of target states. COAST is modular and domain-agnostic, allowing for the integration of various components such as feature selection, causal discovery, and intervention evaluation. The authors demonstrate the effectiveness of COAST through synthetic benchmarks and real biological datasets, successfully identifying key causal drivers and robust intervention strategies that achieve desired state transitions while providing transparent mechanistic rationales for experimental validation.
Methodology
COAST employs a modular approach that includes context-specific feature selection, learning of causal graphs and structural causal models, and a multi-objective optimization formulation. This allows it to identify interventions that can induce desired state transitions while adhering to biological and practical constraints.
Results
The application of COAST on synthetic benchmarks and real biological datasets demonstrated its capability to recover key causal drivers and identify effective single- and multi-target intervention strategies. The framework achieved desired state transitions while providing clear mechanistic rationales for the interventions.
Implications
The COAST framework has significant implications for fields such as biomedicine and drug discovery, where understanding causal relationships and designing effective interventions are crucial. It can accelerate the discovery process by reducing experimental burdens and enhancing the prioritization of interventions based on mechanistic insights.
SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction
Time Series
- Introduces a novel temporal CWT-LSTM architecture for ICU alarm classification.
- Achieves a mean AUC of 0.822, significantly outperforming static classification methods.
- Demonstrates the importance of temporal chunking and multi-channel signal fusion.
- Identifies specific alarm types that are easier or harder to classify.
Read more
SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction
Summary
The paper addresses the critical issue of alarm fatigue in intensive care units (ICUs), where clinical monitors generate an overwhelming number of alarms, most of which are clinically irrelevant. This desensitization can lead to missed true emergencies, posing risks to patient safety. The author introduces SigmaMedStat, a machine learning system designed to evaluate the trustworthiness of physiological alarm signals before clinical action is taken. The proposed methodology involves a temporal modeling framework that segments each 60-second alarm recording into six consecutive 10-second chunks. Each chunk is processed using Continuous Wavelet Transform (CWT) to generate scalograms, which are then encoded with an EfficientNet-B0 encoder and analyzed by a two-layer Long Short-Term Memory (LSTM) network. The system was evaluated on the PhysioNet/Computing in Cardiology Challenge 2015 dataset, achieving a mean AUC of 0.822 ± 0.016 through five-fold stratified cross-validation. The results indicate that temporal modeling significantly outperforms static classification methods, with ablation studies confirming the independent contributions of temporal chunking and multi-channel signal fusion to performance. The analysis also highlights the varying classification accuracy across different alarm types, with Ventricular Flutter being the most accurately classified and Asystole the most challenging. The findings suggest that temporal structure in physiological signals is a valuable feature for improving alarm classification in clinical settings.
Methodology
The methodology involves splitting 60-second ICU alarm recordings into six 10-second chunks, applying Continuous Wavelet Transform (CWT) to generate scalograms for each chunk, encoding them with an EfficientNet-B0 encoder, and analyzing the resulting feature sequences using a two-layer Long Short-Term Memory (LSTM) network. The system was validated using five-fold stratified cross-validation on the PhysioNet dataset.
Results
The proposed system achieved a mean AUC of 0.822 ± 0.016, outperforming a static EfficientNet baseline by 18.1 AUC points. Ablation studies confirmed that both temporal chunking and multi-channel signal fusion independently enhance classification performance. The analysis revealed that Ventricular Flutter was classified with an AUC of 0.820, while Asystole had the lowest classification accuracy (AUC 0.722).
Implications
The findings suggest that incorporating temporal modeling into alarm classification systems can significantly reduce false alarms in ICUs, potentially improving patient safety and clinical response times. The approach may be applicable to other areas of healthcare monitoring where alarm fatigue is a concern.
Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach
Generative Models
Graph Learning
Optimization
- Combines WGANs with Genetic Algorithms for graph generation refinement.
- Addresses structural deviations in generated graphs compared to real data.
- Implements evolutionary edge editing to optimize graph connectivity.
- Demonstrates improved alignment with real graph statistical properties.
Read more
Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach
Summary
This paper addresses the challenge of generating realistic graph-structured data, which is essential for various applications such as data augmentation and privacy-preserving data sharing. The authors propose a hybrid approach that combines Wasserstein Generative Adversarial Networks (WGANs) with Genetic Algorithms (GAs) to refine the generated graphs. The WGAN framework is utilized to produce initial graphs with node features and connectivity patterns, while a Graph Neural Network (GNN) critic evaluates the realism and class consistency of these graphs. To enhance the quality of the generated graphs, a GA is applied post-generation to refine the edges, correcting structural deviations and improving alignment with real data distributions. The refinement process focuses on optimizing graph structures through evolutionary edge editing, which allows for precise adjustments to connectivity patterns. Experimental results demonstrate that the GA refinement significantly reduces discrepancies in structural properties, such as degree and spectral distributions, leading to synthetic graphs that better reflect real-world characteristics. This work highlights the effectiveness of evolutionary refinement in improving GAN-based graph generation methods, making them more suitable for practical applications.
Methodology
The methodology involves two main stages: (1) a coarse generation stage using a WGAN to produce initial graph structures, and (2) a refinement stage employing a Genetic Algorithm to iteratively optimize the generated graphs by modifying edges based on fitness measures derived from real graph statistics.
Results
The experimental results indicate that the GA refinement consistently lowers the Maximum Mean Discrepancy (MMD) between generated graphs and real graphs, resulting in synthetic graphs that exhibit more coherent structural patterns and improved connectivity reflective of the underlying data relationships.
Implications
The proposed hybrid approach enhances the capability of GAN-based models for generating realistic graph-structured data, which can be beneficial for applications in social networks, molecular biology, and other fields requiring synthetic graph data for analysis or model training.
Towards Continuous-time Causal Foundation Models
Time Series
- Introduces a continuity criterion for continuous-time causal priors based on trajectory-law invariance.
- Develops a three-tier taxonomy for categorizing causal priors in time series analysis.
- Demonstrates that fine-grid integration outperforms naive integration in empirical tests.
- Proposes a construction for continuous-time causal models using OU processes and random DAGs.
Read more
Towards Continuous-time Causal Foundation Models
Summary
This paper addresses the limitations of discrete-time causal Prior-data Fitted Networks (PFNs) in time series analysis by proposing a framework for continuous-time causal foundation models. The authors introduce a precise continuity criterion that requires the joint law of a sampled trajectory to be invariant to the observation schedule. They present a three-tier taxonomy of causal priors: discrete, naive observation-grid integration, and fine-grid integration with decoupled observation. The top tier is realized through a construction using Ornstein–Uhlenbeck (OU) processes or small-Multilayer Perceptron (MLP) nonlinear drifts on random directed acyclic graphs (DAGs) with various types of interventions. An empirical evaluation demonstrates that fine-grid integration significantly outperforms naive integration across multiple scenarios, confirming the effectiveness of their proposed continuous-time framework. The authors also release a preliminary zero-shot protocol for real-world applications in pharmacokinetics and physical systems, although detailed results from these applications are deferred to an appendix.
Methodology
The authors propose a three-tier taxonomy for continuous-time causal priors and develop a construction that implements the top tier using Ornstein–Uhlenbeck processes or small-MLP nonlinear drifts on random DAGs. They conduct a 2x2 ablation study comparing encoder and integrator configurations on both linear and nonlinear priors, evaluating performance across different discretizations.
Results
The empirical evaluation shows that fine-grid integration consistently outperforms naive integration in all tested scenarios, with a significant p-value indicating strong statistical significance. The performance gap increases as the evaluation grid is refined, highlighting the advantages of the proposed continuous-time approach.
Implications
The proposed framework has potential applications in fields requiring accurate modeling of time-dependent causal relationships, such as pharmacokinetics, healthcare analytics, and any domain involving irregularly sampled time series data. It opens avenues for more robust causal inference in continuous time, which could enhance predictive modeling and decision-making processes.
PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation
Generative Models
Time Series
- PrismFlow addresses mode collapse in time-series generation by using a bank of Koopman-inspired experts.
- The method employs a Winner-Take-All training objective to promote expert specialization and reduce averaging effects.
- PrismFlow achieves state-of-the-art performance with significant improvements in key evaluation metrics.
- The approach is robust in low-data settings and effective for various time-series tasks, including forecasting and imputation.
Read more
PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation
Summary
The paper introduces PrismFlow, a novel method for generating high-quality time-series data that addresses the challenges posed by multimodal patterns and multiscale dynamics in real-world signals. Traditional Flow Matching (FM) methods often rely on a single global vector-field estimator, which can lead to oversmoothing and poor mode coverage due to the averaging of incompatible dynamics. PrismFlow mitigates this by employing a set of Koopman-inspired dynamical experts that learn residual corrections in a latent space, allowing for local nonlinear temporal evolution to be approximated by linear transitions. The authors propose a confidence-aware Winner-Take-All (WTA) objective that encourages specialization among experts, ensuring that only the most relevant expert updates its parameters for each sample. This approach preserves the stability of the global transport field while enabling the recovery of fine-grained temporal structures. Empirical evaluations demonstrate that PrismFlow outperforms standard FM methods, achieving significant improvements in metrics such as Context-FID and Discriminative Score, while remaining effective in low-data scenarios and applicable for forecasting and imputation tasks.
Methodology
PrismFlow utilizes a bank of Koopman-inspired dynamical experts that learn residual corrections to a global transport field. The training employs a confidence-aware Winner-Take-All (WTA) objective, allowing for competitive selection of experts based on their alignment with the current sample, which promotes specialization and reduces regression-to-the-mean behavior.
Results
PrismFlow demonstrated a 15.6% improvement in Context-FID and a 38.6% increase in Discriminative Score compared to standard FM methods. The method effectively recovers diverse modes in time-series generation tasks and maintains robustness in low-data settings.
Implications
The findings suggest that PrismFlow can significantly enhance the generation of time-series data in various fields such as healthcare, finance, and environmental monitoring, where high-fidelity signal synthesis is crucial. Its ability to handle multimodal dynamics and low-data scenarios makes it a valuable tool for real-world applications.
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
NLP
Large Language Models
Interpretability
- Demonstrates that synthetic dishonesty can be rapidly induced in language models through supervised fine-tuning.
- Linear representations of dishonesty are highly detectable, achieving near-perfect AUC in most models evaluated.
- Probes trained on one domain (TruthfulQA) generalize effectively to diverse reasoning tasks (MMLU) with minimal performance loss.
- Identifies two architectural regimes in models regarding their handling of dishonesty: collapse-type and high-dimensional models.
Read more
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
Summary
This paper addresses the issue of deceptive alignment in AI, where models may produce false outputs while maintaining accurate internal representations. The author introduces a multi-model experimental framework to study synthetic dishonesty, where models are fine-tuned to generate incorrect outputs. Five transformer architectures (ranging from 1.4B to 9B parameters) are evaluated, and linear probes are employed to detect dishonesty in model activations. The findings reveal that dishonest representations can be rapidly induced and detected with high accuracy, particularly in the early layers of the models. The study also demonstrates robust cross-domain generalization of these representations and highlights the differences in architectural responses to adversarial noise. A mechanistic analysis identifies two distinct regimes in model behavior: collapse-type models, which exhibit concentrated dishonesty, and high-dimensional models, which maintain richer representations. The results underscore the potential for monitoring and understanding dishonesty in language models, with implications for AI safety and model interpretability.
Methodology
The study employs a multi-model experimental paradigm where honest and deceptive variants of transformer models are fine-tuned using LoRA. Linear and nonlinear probes are trained on mean-pooled hidden states to classify activations as honest or deceptive. A mechanistic analysis is conducted to explore the geometry of dishonesty representations across different model architectures.
Results
The results show that linear probes can detect synthetic dishonesty with AUC values of 0.99 or higher in four out of five models, with Pythia-1.4B being an exception at 0.705. The study also finds that late-layer representations are more robust to noise, and cross-domain transfer of dishonesty representations is achieved with negligible performance loss.
Implications
The findings suggest that understanding and monitoring dishonesty in language models is crucial for AI safety. The ability to detect and analyze deceptive behaviors can inform the development of safer AI systems and improve interpretability in model outputs.
Momentum Based Reward Design for Low Emission Traffic Signal Control
Reinforcement Learning
Optimization
- Introduction of a Momentum-Based Reward Function (MBRF) that promotes continuous vehicle movement.
- Evaluation of the MBRF in SUMO shows better performance than traditional delay and queue-based rewards.
- The proposed method leads to improved throughput-emission trade-offs and more stable learning behaviors.
- Demonstrates the effectiveness of DRL in adaptive traffic signal control without requiring major infrastructure changes.
Read more
Momentum Based Reward Design for Low Emission Traffic Signal Control
Summary
This paper addresses the challenge of urban traffic congestion, which significantly contributes to environmental pollution and long commute times. Traditional traffic signal control systems often struggle to adapt to dynamic traffic conditions, leading to inefficiencies. The authors propose a novel Momentum-Based Reward Function (MBRF) for Deep Reinforcement Learning (DRL) traffic signal control, which encourages continuous vehicle movement rather than merely penalizing congestion. The MBRF is designed to promote sustained vehicle flow by incentivizing phase persistence based on observed discharge efficiency. The methodology is evaluated using the SUMO (Simulation of Urban MObility) environment, employing standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. The results demonstrate that the MBRF leads to improved throughput-emission trade-offs and more stable learning behaviors compared to traditional delay or queue-based rewards, as well as classical controllers like Max Pressure and LQF. This approach not only enhances traffic efficiency but also reduces emissions without explicitly optimizing for environmental metrics, showcasing the potential for DRL in adaptive traffic signal control.
Methodology
The authors formulated the traffic signal control problem as a Markov Decision Process (MDP) and implemented the MBRF within a DRL framework. The MBRF incentivizes sustained vehicle motion by rewarding phase persistence proportional to discharge efficiency, thereby aligning the learning objective with real traffic dynamics. The evaluation was conducted using the SUMO simulation environment, measuring various traffic performance metrics.
Results
The proposed MBRF outperformed traditional reward structures in terms of traffic throughput and emissions. The results indicated more stable learning behaviors and improved traffic efficiency, demonstrating that the MBRF effectively encourages smoother control policies without rigid constraints.
Implications
The findings suggest that the MBRF can be a valuable tool for urban traffic management, potentially leading to reduced congestion and emissions in real-world scenarios. This approach could be adapted for various traffic conditions and integrated into existing traffic management systems to enhance their responsiveness and efficiency.
Non-destructive Identification of Oyster Species is possible from Hyperspectral Images with Machine Learning
Computer Vision
- Hyperspectral imaging can non-destructively differentiate between oyster species.
- PLS-DA outperformed CNN in classification accuracy for oyster species identification.
- Distinct elemental compositions were found between Black-Lip and Sydney rock oysters.
- The methodology has potential applications in aquaculture for species traceability and broodstock management.
Read more
Non-destructive Identification of Oyster Species is possible from Hyperspectral Images with Machine Learning
Summary
This study explores the use of hyperspectral imaging (HSI) combined with machine learning techniques to non-destructively identify two oyster species: Black-Lip rock (BL) and Sydney rock (SR) oysters. Traditional methods of species identification, such as DNA profiling, are often destructive and time-consuming, making them unsuitable for large-scale applications. The researchers scanned 156 live oyster samples using an HSI camera and applied Partial Least Square Discriminant Analysis (PLS-DA) and Convolutional Neural Networks (CNN) to analyze the spectral reflectance data. The PLS-DA model achieved a remarkable 100% classification accuracy, significantly outperforming the CNN, which achieved 83% and 96% accuracy for the left and right valves, respectively. The study also examined the elemental and mineralogical composition of the oyster valves, revealing distinct differences in the number of layers and concentrations of carbon and oxygen between the two species. These findings suggest that HSI can effectively differentiate between oyster species based on their spectral signatures, paving the way for rapid, non-destructive identification methods that could enhance operational efficiency in aquaculture and improve species traceability in seafood supply chains.
Methodology
The study involved scanning live samples of Black-Lip and Sydney rock oysters using a hyperspectral imaging camera across a wavelength range of 950-2515 nm. Machine learning models, specifically Partial Least Square Discriminant Analysis (PLS-DA) and Convolutional Neural Networks (CNN), were trained using Monte Carlo Cross Validation (MCCV) to classify the oysters based on their spectral reflectance data. Additionally, electron microscopy was used to analyze the elemental and mineralogical composition of the oyster valves.
Results
The PLS-DA model achieved a median test set classification accuracy of 100%, while the CNN achieved 83% and 96% accuracy for the left and right valves, respectively. Analysis of the oyster valves revealed that Black-Lip oysters had more layers and different elemental compositions compared to Sydney rock oysters, indicating potential structural differences that could be detected through HSI.
Implications
The findings suggest that hyperspectral imaging combined with machine learning can provide an effective, rapid, and non-destructive method for identifying oyster species. This has significant implications for improving operational efficiency in aquaculture, enhancing traceability in seafood supply chains, and facilitating the use of wild spat as broodstock.
MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment
NLP
Efficient ML
Optimization
- Introduction of MIC framework for optimizing multi-granular embeddings.
- Development of Soft Collapse Regularization to manage redundancy in nested subspaces.
- Implementation of Spectral Isotropy Regularization for ensuring uniformity in embeddings.
- Demonstrated significant performance improvements over existing MRL baselines.
Read more
MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment
Summary
The paper introduces MIC, a novel framework designed to enhance multi-scale representation learning by addressing issues of dimensional redundancy and spectral collapse in nested subspaces. The authors propose two key regularization techniques: Soft Collapse Regularization (SCR) and Spectral Isotropy Regularization (SIR). SCR minimizes redundancy between prefix and residual subspaces through cross-correlation penalties, while SIR ensures uniform distribution of embeddings in low-dimensional prefixes. By integrating these strategies within a self-distillation objective, MIC generates semantically dense representations that retain high discriminative power, particularly in scenarios requiring high compression. The framework shifts the focus of Matryoshka Representation Learning from usability to maximizing informational capacity through geometric alignment. Extensive experiments demonstrate that MIC outperforms standard baselines, especially in low-dimensional settings where maintaining informational capacity is critical.
Methodology
The MIC framework enhances Matryoshka Representation Learning by combining a nested contrastive loss with two novel regularizers: Soft Collapse Regularization and Spectral Isotropy Regularization. SCR employs a thresholded correlation penalty to manage redundancy between prefix and residual subspaces, while SIR ensures isotropic distribution of embeddings. The framework utilizes self-distillation to optimize these geometric properties, preventing representation collapse and ensuring high semantic density.
Results
The experiments conducted show that MIC consistently outperforms state-of-the-art MRL baselines across various tasks, particularly excelling in high-compression scenarios where the preservation of informational capacity is crucial. The results indicate that MIC effectively mitigates redundancy and enhances the discriminative power of embeddings.
Implications
The findings suggest that MIC could be applied in various domains requiring efficient representation learning, such as Natural Language Processing and other fields where high-dimensional embeddings are prevalent. The techniques developed could lead to more efficient models that balance performance with resource constraints.
How's it going? Reinforcement learning in language models recruits a functional welfare axis
NLP
Large Language Models
Reinforcement Learning
- Reinforcement learning recruits a pre-existing representation of functional welfare in language models.
- The study demonstrates that punishment and reward vectors behave as representations of negative and positive welfare, respectively.
- The effects of these vectors are robust across various training conditions and persist even in pre-trained models.
- The functional welfare axis influences model behavior in unrelated domains, indicating a generalization of learned representations.
Read more
How's it going? Reinforcement learning in language models recruits a functional welfare axis
Summary
This paper investigates how reinforcement learning (RL) influences the internal representations of language models, specifically focusing on the concept of functional welfare. The authors present evidence that RL recruits a pre-existing representation of functional welfare, which reflects how well or poorly the model is performing relative to its goals. They train language models in a novel maze environment with semantically neutral rewards and extract concept vectors for both rewarded and punished trajectories. The analysis reveals that the punishment vector aligns with negative welfare indicators, promoting failure-related tokens and negative emotions, while the reward vector corresponds to positive welfare, encouraging completion-related tokens and positive sentiments. These vectors are shown to be effective even in models prior to maze training, suggesting that the functional welfare axis pre-exists post-training. The findings highlight the ability of minimal reward signals to broadly influence model behavior and have implications for interpretability, post-training dynamics, and alignment.
Methodology
The authors trained language models in a text-based maze environment with neutral rewards. They extracted concept vectors for rewarded and punished trajectories and evaluated their effects on model behavior in various unrelated tasks, including sentiment analysis and confidence assessments.
Results
The analysis showed that the punishment vector (vMOLD) promotes negative outcomes, while the reward vector (vGOLD) encourages positive outcomes. The vectors were nearly antiparallel and effectively influenced model behavior across different tasks. These effects were consistent regardless of model family, scale, and training algorithms.
Implications
The study suggests that minimal reward signals can significantly affect model behavior by recruiting pre-existing welfare-like representations. This has important implications for the interpretability of language models, their post-training dynamics, and alignment with human values.
De-attribute to Forget for LLM Unlearning
NLP
Large Language Models
Reinforcement Learning
- Introduces a novel data de-attribution objective for LLM unlearning.
- Presents DareU, the first LLM unlearning framework utilizing reinforcement learning.
- Demonstrates effective unlearning while preserving model utility.
- Outperforms existing LLM unlearning methods in empirical evaluations.
Read more
De-attribute to Forget for LLM Unlearning
Summary
This paper addresses the challenges of unlearning in large language models (LLMs), particularly in the context of inappropriate training data. Existing methods often focus on optimizing prediction losses, which can lead to issues like over-forgetting and degraded model performance. The authors propose a novel framework called DareU, which reframes the unlearning objective as reducing data attribution scores for the forget set. By employing reinforcement learning, specifically Proximal Policy Optimization (PPO), DareU aims to minimize the attribution of LLM-generated responses to the data owners from whom data is to be forgotten. The empirical results demonstrate that DareU effectively balances the quality of forgetting with the utility of the model, outperforming existing unlearning baselines. This approach not only provides a more precise optimization target but also ensures that the model does not generate incoherent outputs post-unlearning.
Methodology
The authors propose DareU, which uses reinforcement learning to optimize LLM responses by minimizing the attribution scores associated with the forget set. The framework employs Proximal Policy Optimization (PPO) to align the model's outputs with the goal of reducing the influence of specific data owners.
Results
DareU was empirically validated and shown to achieve a better balance between forget quality and model utility compared to existing unlearning baselines. The results indicate that DareU effectively reduces the attribution scores of LLM-generated responses to the forget set while maintaining coherent output.
Implications
The findings suggest that DareU can be applied in scenarios requiring compliance with data protection regulations, such as GDPR, by providing an efficient method for LLMs to forget specific data without extensive retraining. This has significant implications for privacy and data management in AI applications.
Bifurcated Remaining Useful Life Prediction: A Hybrid Approach for Realistic Uncertainty Characterization
Time Series
- Introduces a hybrid prognostic framework for RUL estimation that incorporates uncertainty quantification.
- Utilizes a bifurcated approach to classify engine states into healthy and degraded regimes.
- Employs an LSTM-based autoencoder for state classification and a Conditional Weibull model for RUL estimation.
- Generates continuous state probabilities for improved prediction accuracy and uncertainty characterization.
Read more
Bifurcated Remaining Useful Life Prediction: A Hybrid Approach for Realistic Uncertainty Characterization
Summary
This paper presents a novel hybrid framework for estimating the Remaining Useful Life (RUL) of turbofan engines, focusing on uncertainty quantification. Utilizing the NASA C-MAPSS dataset, the authors bifurcate the operational lifespan of engines into 'healthy' and 'degraded' states. An LSTM-based autoencoder is employed to classify these states based on reconstruction error from nominal data. For the healthy regime, a Conditional Weibull Survival Analysis is used for estimating Mean Residual Life, while a Probabilistic Neural Network with Monte Carlo Dropout addresses uncertainties in the degraded regime. The framework innovatively uses a calibrated sigmoid function to convert autoencoder outputs into continuous state probabilities, allowing for dynamic weighting of predictions. This approach generates physically consistent uncertainty bands, enhancing prediction confidence, particularly near end-of-life scenarios, and provides a robust tool for risk-informed maintenance decisions.
Methodology
The methodology involves a hybrid approach where an LSTM-based autoencoder classifies engine states into healthy and degraded regions. For RUL estimation, a Conditional Weibull Survival Analysis is applied in the healthy region, while a Probabilistic Neural Network with Monte Carlo Dropout is utilized in the degraded region. The predictions from both models are fused using probability weights derived from the autoencoder's output.
Results
The proposed framework successfully generates uncertainty bands that are physically consistent, yielding high-confidence predictions, especially as the engine approaches end-of-life. The integration of uncertainty quantification significantly enhances the reliability of RUL predictions compared to traditional methods.
Implications
This research has significant implications for predictive maintenance in safety-critical sectors such as aviation, automotive, and heavy manufacturing. By providing a robust tool for RUL estimation with integrated uncertainty quantification, it aids operators in making informed maintenance decisions, thereby enhancing operational safety and reliability.
Retriever Portfolios: A Principled Approach to Adaptive RAG
NLP
Large Language Models
Optimization
- Introduces retriever portfolio optimization to enhance RAG systems by selecting diverse retrievers for heterogeneous queries.
- Formalizes an expected best-of-k objective to evaluate retriever portfolios, ensuring coverage of different query types.
- Demonstrates that fixed portfolios can achieve comparable or better accuracy with lower latency than adaptive hyperparameter tuning methods.
- Empirical results show significant improvements in retrieval recall and answer accuracy across multiple QA benchmarks.
Read more
Retriever Portfolios: A Principled Approach to Adaptive RAG
Summary
This paper addresses the limitations of traditional retrieval-augmented generation (RAG) systems, which typically rely on a single retriever and a fixed set of hyperparameters, leading to suboptimal performance across diverse query types. The authors propose a novel method called retriever portfolio optimization, which automatically selects a small, diverse subset of retrievers from a larger pool to better cover the heterogeneous distribution of queries. They formalize this approach using an expected best-of-k objective, allowing for efficient portfolio construction with near-optimal guarantees. The proposed method includes a pipeline that learns a static portfolio of complementary retrievers and employs a lightweight router to dynamically select the best retriever for each query. Empirical evaluations on various question-answering benchmarks demonstrate that the learned portfolios significantly outperform both single-retriever and naive multi-retriever baselines in terms of retrieval metrics and answer quality, while also reducing latency and token costs compared to inference-time hyperparameter tuning methods.
Methodology
The authors develop a pipeline that learns a static portfolio of retrievers offline, using an expected best-of-k objective to select a diverse subset of retrievers from a larger pool. A lightweight router is trained to dynamically select the most suitable retriever for each query, thus avoiding costly inference-time hyperparameter tuning.
Results
The proposed method consistently outperforms single-retriever baselines and other adaptive methods like Vendi-RAG across various QA benchmarks, achieving better retrieval recall and answer accuracy while significantly reducing latency and token usage.
Implications
This work has potential applications in improving the efficiency and effectiveness of RAG systems in various domains, including open-domain question answering and knowledge-intensive dialogue systems, by enabling more tailored retrieval strategies that adapt to the complexity of user queries.
MAAT: Multi-phase Adapter-Aware Targeted Unlearning
NLP
Large Language Models
Theory
- Introduction of 5WBENCH, a benchmark that quantifies causal unlearning failures.
- MAAT framework achieves high forgetting and retention of Why-type causal knowledge.
- Demonstrates the challenges of gradient dilution and multi-hop reasoning in unlearning.
- Outperforms existing methods on the forget-retain tradeoff across multiple models.
Read more
MAAT: Multi-phase Adapter-Aware Targeted Unlearning
Summary
The paper addresses a significant gap in machine unlearning evaluation, specifically the underrepresentation of causal knowledge in existing benchmarks. The authors introduce 5WBENCH, a balanced benchmark consisting of 5,000 samples categorized into Who, What, When, Where, and Why questions, highlighting the inadequacy of current methods in handling causal knowledge. They demonstrate that existing unlearning methods struggle to achieve both high forgetting and high retention of Why-type causal knowledge due to the complexity of multi-hop reasoning and gradient dilution. To tackle this issue, the authors propose MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a novel three-phase framework that operates on LoRA adapter weights. MAAT employs structured adapter surgery techniques, including gradient projection, SVD-based pruning, task vector negation, and hybrid KL–hidden-state retain repair. The results show that MAAT is the first method to effectively balance forgetting and retention of causal knowledge, achieving a new benchmark on the forget-retain Pareto frontier, outperforming all existing baselines.
Methodology
The MAAT framework consists of three phases: (1) gradient projection to orthogonalize forget updates against retain gradients, (2) SVD-based pruning of adapter dimensions to focus forgetting signals, and (3) hybrid KL–hidden-state retain repair to prevent re-learning of forgotten content. This structured approach allows for targeted unlearning without merging adapter weights into the base model.
Results
MAAT successfully achieves a balance between forgetting and retaining causal knowledge, reaching a new operational point on the forget-retain Pareto frontier. It outperforms all baseline methods evaluated on both Llama 3.2-3B and Gemma 3-4B models.
Implications
The findings suggest that existing unlearning methods may not adequately address causal knowledge, which is critical for applications requiring reliable and interpretable AI systems. The introduction of 5WBENCH and the MAAT framework can guide future research in improving unlearning techniques, particularly in domains where causal reasoning is essential.
Scaling Higher-Order Graph Learning with Maximal Clique Complexes
Graph Learning
- Introduction of sCWL and fCWL tests that preserve expressivity while improving scalability.
- Development of the maximal clique complex for efficient higher-order graph representation.
- CliqueWalk method for sampling maximal cliques, enabling linear scaling with graph size.
- Competitive performance on classification benchmarks compared to existing GNNs.
Read more
Scaling Higher-Order Graph Learning with Maximal Clique Complexes
Summary
This paper addresses the limitations of traditional graph neural networks (GNNs) that primarily model pairwise interactions by introducing a scalable framework for higher-order graph learning. The authors propose simplified and factored cellular Weisfeiler–Leman tests (sCWL and fCWL) that maintain the expressivity of the original CWL test while enhancing computational efficiency. They introduce the maximal clique complex, which encodes only the maximal cliques of a graph, thereby reducing time and memory complexity. To avoid the computational burden of explicit clique enumeration, the authors develop CliqueWalk, a biased random walk that efficiently samples maximal cliques, allowing for linear scaling with graph size. The proposed methods demonstrate strong empirical performance on node and graph classification tasks, achieving competitive results compared to existing GNNs while significantly improving scalability and efficiency.
Methodology
The authors introduce simplified and factored versions of the cellular Weisfeiler–Leman tests (sCWL and fCWL) and their corresponding neural architectures (sCWNs and fCWNs). They propose the maximal clique complex to encode only maximal cliques and develop CliqueWalk, a biased random walk algorithm for efficient sampling of these cliques. This approach allows for scalable higher-order graph learning without the need for exhaustive clique enumeration.
Results
The proposed methods, sCWNs and fCWNs, demonstrate competitive performance on various node and graph classification benchmarks, matching or exceeding the performance of existing GNNs while achieving better scalability and efficiency. The empirical results validate the effectiveness of the introduced techniques in handling larger graphs.
Implications
The findings suggest that the proposed framework can be applied to various domains where higher-order interactions are significant, such as social network analysis, molecular property prediction, and other complex relational data tasks. The scalability of the methods opens up possibilities for their use in large-scale graph learning applications.
Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints
Generative Models
Optimization
Theory
- Introduces a principled constrained optimization framework for unlearning in diffusion models.
- Formulates three optimization problems based on KL divergences and likelihood constraints.
- Establishes strong duality for the proposed problems, enabling effective solution characterization.
- Demonstrates superior performance of KL-constrained methods over traditional weight-based approaches.
Read more
Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints
Summary
This paper addresses the challenge of unlearning in diffusion models, which involves removing undesirable data or concepts while maintaining the utility of pretrained models. The authors propose a constrained optimization framework that formulates unlearning as minimizing the deviation from a pretrained model while enforcing constraints to separate the model from unlearning distributions. They introduce three optimization problems based on reverse and forward KL divergences and likelihood constraints, generalizing existing approaches for concept and data unlearning. The paper establishes strong duality for these nonconvex problems, allowing for the characterization of optimal solutions and the development of primal-dual algorithms. Experimental results show that the KL-constrained approach outperforms weight-based baselines in achieving a better retention-unlearning tradeoff, while the likelihood-based method effectively balances unlearning and concept preservation.
Methodology
The authors employ a constrained optimization framework to minimize the distance between a pretrained model and the unlearning targets, using reverse and forward KL divergences to formulate the problems. They develop primal-dual algorithms based on the strong duality of the formulated problems, which allows for efficient computation of optimal solutions.
Results
The experimental results indicate that the KL-constrained approach achieves better retention-unlearning tradeoffs compared to weight-based methods. The likelihood-based unlearning method matches the effectiveness of unlearning while better preserving the retained concepts, demonstrating the advantages of the proposed framework.
Implications
This work has significant implications for the ethical deployment of generative models, allowing for the removal of harmful content while maintaining model utility. It provides a systematic approach to machine unlearning that can be applied in various domains where data privacy and ethical considerations are paramount.
Early Prediction of Future Behavioral Strategy from Process Traces
Reinforcement Learning
Time Series
Robotics
- The paper formulates early cross-task behavioral strategy prediction as a relevant problem.
- Introduction of the Process-Level Latent Variable Model (PLVM) for fusing task-specific traces.
- PLVM outperforms traditional outcome-based models and single-task models in predicting behavior.
- Controlled simulations validate the effectiveness of PLVM in recovering behavioral phenotypes.
Read more
Early Prediction of Future Behavioral Strategy from Process Traces
Summary
This paper addresses the challenge of predicting future behavioral strategies in adaptive systems using limited prior evidence. It highlights the difficulty of inferring person-level tendencies from standard behavioral outcomes, which often collapse distinct processes into similar results. Instead, the authors propose leveraging process-level traces that capture the unfolding of behavior within tasks. They introduce a Process-Level Latent Variable Model (PLVM) that encodes task-specific traces and fuses them into a shared latent representation for predicting behavior in a target task. The study is instantiated using the PowerWash Simulator dataset, where the model predicts whether players will exhibit locally persistent or frequently switching behaviors in a held-out level based on partial traces from two source tasks. The findings demonstrate that PLVM significantly outperforms outcome-based models and single-task process models, suggesting that cross-task modeling can effectively support early predictions of behavioral strategies when observing sufficient target-task behavior is impractical.
Methodology
The authors developed the Process-Level Latent Variable Model (PLVM), which utilizes task-specific encoders to summarize process traces from multiple source tasks. These summaries are then fused into a shared person-level latent representation that is used to predict behavior in a held-out target task. The methodology includes controlled simulations with known latent types to validate the model's effectiveness.
Results
The PLVM demonstrated superior performance in predicting player behavior in the PowerWash Simulator compared to outcome-based models and single-task process models. It effectively distinguished between locally persistent and frequently switching behaviors, indicating that transferable behavioral strategies are not fully captured by aggregate outcomes or single-task traces.
Implications
The findings suggest that adaptive systems, such as educational tutors or game AI, can benefit from early predictions of user behavior based on process traces. This could lead to more personalized and effective interventions and support strategies tailored to individual behavioral tendencies.