AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
67
Papers today
8h
Update frequency
7
Days of history
Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10
Computer Vision
Efficient ML
Theory
- Student capacity is a critical factor in the effectiveness of knowledge distillation.
- Feature-KD can outperform Logit-KD when implemented correctly, contradicting previous assumptions.
- Architectural adjustments for input resolution are essential for optimal performance in KD.
- Implementation bugs can significantly skew the results of KD methods.
Read more
Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10
Summary
This paper investigates the impact of teacher-student capacity relationships on the effectiveness of knowledge distillation (KD) in ResNet-based image classification tasks using the CIFAR-10 dataset. The study systematically evaluates three teacher-student pairs (R50→R18, R34→R18, and R50→R34) and compares two KD methods: Logit-KD and Feature-KD. The findings reveal that student capacity significantly influences distillation gains, with R34 students benefiting more than R18 students, even when teacher-student accuracy gaps are similar. Additionally, the study highlights the importance of implementation correctness in Feature-KD, noting that a bug in gradient clipping adversely affected performance. After correcting this issue, Feature-KD was shown to match or exceed Logit-KD performance in two out of three pairs. Lastly, the research emphasizes that an architecture designed for the input resolution is crucial for effective distillation, as correcting the ResNet stem for 32×32 inputs improved teacher accuracy significantly.
Methodology
The study employed a systematic ablation approach to evaluate Logit-KD and Feature-KD across various configurations of distillation weight and temperature. It involved controlled experiments with three ResNet teacher-student pairs, ensuring reproducibility through multiple seeds. The researchers also corrected implementation issues that affected Feature-KD performance.
Results
The results indicated that R34 students achieved a distillation gain of +0.30 percentage points with Feature-KD compared to +0.18 for R34→R18 and no gain for R34→R18 Logit-KD. After correcting the implementation bug, Feature-KD performance improved, achieving 95.55% accuracy on R50→R34 against a baseline of 95.25%. Furthermore, correcting the ResNet architecture for 32×32 inputs led to a more than 5 percentage point increase in teacher accuracy.
Implications
The findings suggest that optimizing student capacity and ensuring architectural correctness can enhance the effectiveness of knowledge distillation in deep learning models, particularly for resource-constrained environments. This has potential applications in deploying efficient models in real-world scenarios where computational resources are limited.
Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules
Computer Vision
- Untrained neural networks often align better with early visual cortex than trained networks.
- Supervised training, particularly with backpropagation, significantly degrades V1 alignment.
- Different learning rules affect alignment dynamics differently, with BP being the most destructive.
- Predictive coding and STDP preserve more brain-like structure compared to BP.
Read more
Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules
Summary
This paper investigates the surprising phenomenon where untrained neural networks often exhibit better alignment with early visual cortex representations than their trained counterparts. The study tracks representational similarity to human fMRI data across training for four different learning rules: backpropagation (BP), feedback alignment (FA), predictive coding (PC), and spike-timing-dependent plasticity (STDP). Using a dataset of 720 object images and fMRI data from three subjects, the author measures the Spearman correlations between model and brain representational dissimilarity matrices at various training epochs. The findings reveal that training significantly degrades alignment with the early visual cortex (V1), with BP causing the most substantial reduction. In contrast, PC and STDP preserve more alignment. Interestingly, while BP erodes V1 alignment, it shows a slight increase in alignment in higher-level object-selective areas (LOC). This suggests that untrained networks capture low-level visual statistics effectively, and that the global error signals used in BP reshape early representations more aggressively than local learning rules. The study highlights the dynamics of representational alignment during training and raises questions about the intrinsic properties of network architectures versus the effects of training.
Methodology
The study employs representational similarity analysis (RSA) to compare neural network activations with human fMRI data. Four learning rules (BP, FA, PC, STDP) are implemented on a shared convolutional architecture. The models are trained on a subset of CIFAR-10 images, and representational dissimilarity matrices (RDMs) are computed at multiple checkpoints during training to assess alignment with brain data.
Results
The results indicate that all learning rules degrade V1 alignment, with BP causing a reduction of 25-90% in alignment within a single epoch. BP shows the most significant decrease in V1 alignment (∆r = -0.080), while PC and STDP maintain better alignment (∆r ≈ -0.04). Conversely, BP leads to a slight increase in alignment in the LOC, suggesting a complex relationship between training and representational structure.
Implications
These findings challenge the conventional view that training always improves alignment with biological systems. They suggest that the choice of learning rule can significantly influence how neural networks develop representations that relate to human visual processing, which could inform future neural network designs and training strategies.
Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended
Efficient ML
Graph Learning
Large Language Models
- Introduces Invariant Bit Packing (IBP), a lossless compression algorithm tailored for ML workloads.
- IBP achieves significant performance improvements, including 74% faster GNN training and 180% faster DLRM embedding lookup.
- The method minimizes GPU memory usage while avoiding the accuracy trade-offs associated with lossy compression.
- Provides easy-to-use APIs for integration into existing ML frameworks.
Read more
Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended
Summary
This paper addresses the significant GPU memory bottleneck encountered during machine learning (ML) training and inference, particularly when handling large datasets that exceed GPU memory capacity. Traditional methods often rely on PCIe for on-demand tensor transfers, leading to critical transfer bottlenecks. While lossy compression techniques have been proposed to mitigate these issues, they introduce unpredictable accuracy loss, complicating deployment in real-world applications. The authors propose a novel approach using lossless compression, specifically through a new algorithm called Invariant Bit Packing (IBP). IBP minimizes data transfer time by identifying and eliminating invariant bits across tensors, optimizing GPU-accelerated decompression, and leveraging warp parallelism and asynchronous PCIe transfers. The paper demonstrates the integration of IBP into popular ML frameworks, showcasing its effectiveness in enhancing performance without sacrificing accuracy. The results indicate substantial improvements in training and inference times across various ML models, including Graph Neural Networks (GNNs), Deep Learning Recommendation Models (DLRMs), and Large Language Models (LLMs).
Methodology
The authors analyze PCIe transfer bottlenecks and existing compression methods, then develop IBP, which identifies invariant bits across tensors and compresses data while minimizing metadata. The algorithm is optimized for GPU decompression, allowing for efficient data transfer and processing. IBP is integrated into popular ML systems, and its performance is evaluated against state-of-the-art GPU-accelerated compression libraries.
Results
IBP significantly accelerates GNN training by 74%, DLRM embedding lookup by 180%, and LLM inference by 24% on average when tested on an A100 GPU, all while maintaining model accuracy. The method also demonstrates effectiveness on streaming datasets.
Implications
The findings suggest that lossless compression can effectively alleviate GPU memory bottlenecks in ML applications, enabling the use of larger datasets without compromising model performance. This has implications for various domains utilizing ML, including e-commerce, drug discovery, and fraud detection.
Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning
Reinforcement Learning
Theory
Efficient ML
- Linear RNNs can effectively represent log-belief dynamics in partially observable environments.
- The Adaptive Logit Filter (ALF) achieves optimal asymptotic error rates in state decoding.
- The study establishes a connection between the eigenvalues of latent dynamics and environmental determinism.
- The proposed filters highlight the representational efficiency of linear memories in RL.
Read more
Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning
Summary
This paper investigates the effectiveness of linear recurrent neural networks (RNNs) in partially observable reinforcement learning (RL) environments. The authors provide a theoretical foundation for the performance of linear RNNs by constructing two linear filters that relate to the log-belief dynamics in hidden Markov models (HMMs). The first filter accurately reproduces the pre-softmax logits of the belief vector in HMMs with deterministic transitions, serving as a sufficient statistic for optimal policy learning. The second filter, the Adaptive Logit Filter (ALF), achieves a vanishing state-decoding error in nearly deterministic transition matrices, effectively reducing state ambiguity. The results extend to action-controlled HMMs, demonstrating that linear RNNs can represent or approximate the log-belief vector in partially observable RL. The paper also includes numerical experiments that validate the theoretical findings and showcase the ALF's effectiveness as a feature extractor in a small RL game.
Methodology
The authors construct two linear filters to analyze the log-belief dynamics in HMMs and POMDPs. They derive theoretical results regarding the performance of these filters, particularly focusing on the ALF's ability to recover true states with minimal error in nearly deterministic environments. Numerical simulations and experiments in a small RL environment are conducted to illustrate the theoretical properties and practical effectiveness of the filters.
Results
The paper demonstrates that a time-invariant linear recurrent memory can exactly reproduce log-belief dynamics in HMMs with deterministic transitions. The ALF achieves an asymptotic error rate that vanishes as the stochastic perturbation approaches zero, matching the optimal rates of nonlinear filtering. Additionally, the experiments show that the time-varying ALF yields strong learned policies in a reinforcement learning setting.
Implications
The findings suggest that linear RNNs can be effectively utilized in partially observable RL tasks, providing insights into their design and optimization. The theoretical justification for their performance may lead to more efficient algorithms and architectures in RL applications, particularly in environments with limited observability.
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
NLP
Large Language Models
Interpretability
- Demonstrates that synthetic dishonesty can be rapidly induced in language models through supervised fine-tuning.
- Linear representations of dishonesty are highly detectable, achieving near-perfect AUC in most models evaluated.
- Probes trained on one domain (TruthfulQA) generalize effectively to diverse reasoning tasks (MMLU) with minimal performance loss.
- Identifies two architectural regimes in models regarding their handling of dishonesty: collapse-type and high-dimensional models.
Read more
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
Summary
This paper addresses the issue of deceptive alignment in AI, where models may produce false outputs while maintaining accurate internal representations. The author introduces a multi-model experimental framework to study synthetic dishonesty, where models are fine-tuned to generate incorrect outputs. Five transformer architectures (ranging from 1.4B to 9B parameters) are evaluated, and linear probes are employed to detect dishonesty in model activations. The findings reveal that dishonest representations can be rapidly induced and detected with high accuracy, particularly in the early layers of the models. The study also demonstrates robust cross-domain generalization of these representations and highlights the differences in architectural responses to adversarial noise. A mechanistic analysis identifies two distinct regimes in model behavior: collapse-type models, which exhibit concentrated dishonesty, and high-dimensional models, which maintain richer representations. The results underscore the potential for monitoring and understanding dishonesty in language models, with implications for AI safety and model interpretability.
Methodology
The study employs a multi-model experimental paradigm where honest and deceptive variants of transformer models are fine-tuned using LoRA. Linear and nonlinear probes are trained on mean-pooled hidden states to classify activations as honest or deceptive. A mechanistic analysis is conducted to explore the geometry of dishonesty representations across different model architectures.
Results
The results show that linear probes can detect synthetic dishonesty with AUC values of 0.99 or higher in four out of five models, with Pythia-1.4B being an exception at 0.705. The study also finds that late-layer representations are more robust to noise, and cross-domain transfer of dishonesty representations is achieved with negligible performance loss.
Implications
The findings suggest that understanding and monitoring dishonesty in language models is crucial for AI safety. The ability to detect and analyze deceptive behaviors can inform the development of safer AI systems and improve interpretability in model outputs.
Apertus LLM Family Expansion via Distillation and Quantization
Large Language Models
Efficient ML
NLP
- Introduction of the Apertus-v1.1 model family through distillation and quantization.
- Demonstration of cost-effective model expansion without the need for extensive pre-training.
- Validation of pre-training distillation as a method to enhance model performance with fewer resources.
- Exploration of quantization techniques to optimize models for various hardware constraints.
Read more
Apertus LLM Family Expansion via Distillation and Quantization
Summary
The paper addresses the growing demand for Large Language Models (LLMs) to meet diverse budget and hardware constraints by proposing a cost-effective method for expanding model families through distillation and quantization. The authors introduce the Apertus-v1.1 model family, derived from the Apertus 8B LLM, which includes models with up to 4 billion parameters trained on 1.7 trillion tokens. The study validates the use of pre-training distillation to reduce computational costs significantly while maintaining strong accuracy across various hardware requirements. The methodology emphasizes the importance of model families in providing practitioners with flexible options for deployment, thus democratizing access to advanced AI capabilities. The authors also explore quantization as a means to further optimize models for specific hardware profiles, achieving a balance between performance and resource efficiency. Overall, the paper demonstrates that the Apertus-v1.1 models can deliver competitive performance with reduced training costs compared to traditional methods.
Methodology
The authors employed pre-training distillation to transfer knowledge from a larger teacher model (Apertus 8B) to smaller student models (Apertus-v1.1), allowing for faster convergence and improved performance with fewer training tokens. They also utilized quantization techniques to reduce memory footprint and inference latency while managing the cost-accuracy trade-off. The training involved a mix of KL-Divergence and cross-entropy loss functions, and the models were evaluated on multilingual benchmarks.
Results
The Apertus-v1.1 models demonstrated strong multilingual performance, closely resembling the capabilities of the larger Apertus 8B model while being trained on significantly fewer tokens (1.7 trillion vs. 15 trillion). The models achieved competitive accuracy and efficiency, showcasing the effectiveness of the distillation and quantization approach in expanding the model family.
Implications
The findings suggest that distillation and quantization can significantly lower the barriers to deploying LLMs in various applications, making advanced AI capabilities more accessible across different hardware environments. This approach can facilitate the development of tailored models for specific use cases, enhancing the versatility and applicability of LLMs in real-world scenarios.
Non-destructive Identification of Oyster Species is possible from Hyperspectral Images with Machine Learning
Computer Vision
- Hyperspectral imaging can non-destructively differentiate between oyster species.
- PLS-DA outperformed CNN in classification accuracy for oyster species identification.
- Distinct elemental compositions were found between Black-Lip and Sydney rock oysters.
- The methodology has potential applications in aquaculture for species traceability and broodstock management.
Read more
Non-destructive Identification of Oyster Species is possible from Hyperspectral Images with Machine Learning
Summary
This study explores the use of hyperspectral imaging (HSI) combined with machine learning techniques to non-destructively identify two oyster species: Black-Lip rock (BL) and Sydney rock (SR) oysters. Traditional methods of species identification, such as DNA profiling, are often destructive and time-consuming, making them unsuitable for large-scale applications. The researchers scanned 156 live oyster samples using an HSI camera and applied Partial Least Square Discriminant Analysis (PLS-DA) and Convolutional Neural Networks (CNN) to analyze the spectral reflectance data. The PLS-DA model achieved a remarkable 100% classification accuracy, significantly outperforming the CNN, which achieved 83% and 96% accuracy for the left and right valves, respectively. The study also examined the elemental and mineralogical composition of the oyster valves, revealing distinct differences in the number of layers and concentrations of carbon and oxygen between the two species. These findings suggest that HSI can effectively differentiate between oyster species based on their spectral signatures, paving the way for rapid, non-destructive identification methods that could enhance operational efficiency in aquaculture and improve species traceability in seafood supply chains.
Methodology
The study involved scanning live samples of Black-Lip and Sydney rock oysters using a hyperspectral imaging camera across a wavelength range of 950-2515 nm. Machine learning models, specifically Partial Least Square Discriminant Analysis (PLS-DA) and Convolutional Neural Networks (CNN), were trained using Monte Carlo Cross Validation (MCCV) to classify the oysters based on their spectral reflectance data. Additionally, electron microscopy was used to analyze the elemental and mineralogical composition of the oyster valves.
Results
The PLS-DA model achieved a median test set classification accuracy of 100%, while the CNN achieved 83% and 96% accuracy for the left and right valves, respectively. Analysis of the oyster valves revealed that Black-Lip oysters had more layers and different elemental compositions compared to Sydney rock oysters, indicating potential structural differences that could be detected through HSI.
Implications
The findings suggest that hyperspectral imaging combined with machine learning can provide an effective, rapid, and non-destructive method for identifying oyster species. This has significant implications for improving operational efficiency in aquaculture, enhancing traceability in seafood supply chains, and facilitating the use of wild spat as broodstock.
The Sample Complexity of Multiclass and Sparse Contextual Bandits
Theory
Reinforcement Learning
Efficient ML
- Introduces improved sample complexity bounds for contextual bandits with sparse rewards.
- Establishes algorithms that achieve ε-optimal policies with significantly reduced dependence on the number of actions.
- Bridges a gap in existing literature by providing tight bounds that are minimax optimal up to logarithmic factors.
- Utilizes two complementary approaches: DEC-based exploration and low-variance exploration techniques.
Read more
The Sample Complexity of Multiclass and Sparse Contextual Bandits
Summary
This paper investigates the sample complexity of contextual bandits in a stochastic i.i.d. setting, focusing on the s-sparse scenario where the reward vector has an L1-norm bounded by s, significantly smaller than the number of actions |A|. The authors present algorithms that achieve an ε-optimal policy with a sample complexity of O((s/ε² + |A|/ε) log |Π|/δ), improving upon previous bounds that included an extraneous Θ(|A|⁹) dependence. The study bridges a gap in prior research by establishing tight sample complexity bounds for contextual bandits with sparse rewards, particularly in the context of multiclass classification. The results are derived through two main approaches: an exploration-by-optimization algorithm informed by the decision-estimation coefficient (DEC) and a low-variance exploration technique that leads to tractable algorithms. These findings not only enhance the theoretical understanding of contextual bandits but also provide practical algorithms with improved sample complexity guarantees for applications such as bandit multiclass list classification.
Methodology
The authors employ two main methodologies: first, they analyze contextual bandits using an exploration-by-optimization algorithm informed by the decision-estimation coefficient (DEC), which allows for optimal sample complexity rates based on sparsity. Second, they develop a low-variance exploration technique that leads to concrete algorithms and extends to contextual combinatorial semi-bandits, enhancing sample complexity guarantees.
Results
The paper presents algorithms that achieve a sample complexity of O((s/ε² + |A|/ε) log |Π|/δ), which is a significant improvement over previous bounds that had a Θ(|A|⁹) dependence. The results are shown to be minimax optimal up to logarithmic factors, providing a theoretical foundation for efficient learning in contextual bandit settings.
Implications
The findings have significant implications for various applications in online learning, such as recommendation systems, adaptive experimentation, and medical decision-making, where efficient decision-making under uncertainty is crucial. The improved sample complexity guarantees can lead to more effective algorithms in practical scenarios.
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers
Graph Learning
Optimization
Efficient ML
- LoRe introduces a per-step operator budgeting framework for iterative graph solvers.
- The method dynamically routes computation to high-conflict interactions, improving efficiency.
- LoRe achieves a 15× speedup and 44× memory reduction on the TSP problem.
- The framework is a drop-in solution that does not require retraining of existing models.
Read more
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers
Summary
The paper introduces LoRe, a novel framework designed to enhance the efficiency of diffusion-based neural solvers for combinatorial optimization problems, particularly in scenarios where inference time and memory usage are critical. Traditional iterative graph solvers often face scalability issues due to the need for dense evaluations of interactions, which can lead to out-of-memory errors and excessive latency. LoRe addresses this by implementing a per-step interaction-evaluation budget, dynamically routing computation to focus on high-conflict or high-uncertainty interactions rather than relying on static sparsification methods. This approach is inspired by methodologies from many-body physics, specifically the Cluster-Bath decomposition, allowing the solver to maintain accuracy while reducing computational load. The authors validate LoRe through extensive experiments, demonstrating significant improvements in scalability and efficiency across various combinatorial optimization tasks, including the Maximum Independent Set (MIS) and the Traveling Salesperson Problem (TSP).
Methodology
LoRe operates as a training-free, inference-time wrapper that implements a Cluster-Bath decomposition for graph solvers. It identifies a dynamic subset of high-conflict interactions to evaluate at each step while approximating the influence of omitted interactions through a global recall signal. This allows for efficient computation while maintaining solution quality.
Results
LoRe significantly extends the feasible inference capabilities of graph solvers, achieving over 3× improvement beyond baseline out-of-memory limits on the MIS problem. It also delivers up to an 8× speedup and a 12× reduction in peak memory usage, with competitive solution quality maintained. For the TSP, it achieves a 15× speedup and a 44× memory reduction.
Implications
The findings suggest that LoRe can be effectively applied in real-time decision-making systems where computational resources are constrained, such as dynamic vehicle routing and resource allocation in data centers. The dynamic routing approach can lead to more efficient and scalable solutions in various combinatorial optimization tasks.
Diffusion Models Preferentially Memorize Prototypical Examples or: Why Does My Diffusion Model Love Slop?
Generative Models
- Diffusion models preferentially memorize common substrings over atypical samples.
- Memorization behavior is influenced by dataset characteristics, particularly the presence of fat-tailed distributions.
- An intermediate regime of partial memorization can lead to bland outputs, termed 'slop'.
- Dataset diversity at higher abstraction levels is crucial for reducing memorization risks.
Read more
Diffusion Models Preferentially Memorize Prototypical Examples or: Why Does My Diffusion Model Love Slop?
Summary
This paper investigates the memorization behavior of diffusion models, particularly focusing on the types of samples that are memorized during training. The authors challenge the prevailing notion that atypical or rare samples are memorized first, demonstrating instead that common substrings are preferentially memorized. Through experiments with data generated from the Random Hierarchy Model (RHM), they find that even unique training samples lead to the memorization of common features. The study reveals that delayed memorization occurs for fat-tailed datasets, where atypical samples are present, and this effect is intensified when such samples are integrated into high-level production rules. The authors also identify a regime of partial memorization where common substrings are learned first, leading to a phenomenon they describe as 'slop'—a blandness in generated outputs when training is halted prematurely. This work highlights the importance of dataset diversity at higher abstraction levels in mitigating memorization risks, which has implications for the safe deployment of generative models.
Methodology
The authors trained diffusion models on strings generated according to the Random Hierarchy Model (RHM) and analyzed the memorization patterns of the models. They compared the memorization of common versus rare samples and examined the effects of dataset diversity and structure on memorization behavior.
Results
The findings indicate that common substrings are memorized preferentially, even in datasets with unique samples. Delayed memorization was observed in fat-tailed datasets, and the introduction of fat-tails into high-level production rules amplified this effect. The study also identified a regime of partial memorization leading to the generation of bland outputs when training is prematurely stopped.
Implications
The results suggest that understanding the memorization dynamics of generative models is essential for addressing privacy concerns and enhancing the creative capabilities of AI. By focusing on dataset diversity and structure, developers can mitigate risks associated with memorization, leading to more reliable and creative generative models.
Solving Integer Linear Programming with Parallel Tempering
Optimization
- Introduces a solver-free, sampling-based approach for ILP optimization.
- Utilizes Locally-Balanced Proposal and Parallel Tempering to explore discrete feasible regions.
- Demonstrates superior performance compared to SCIP and competitive results against Gurobi.
- Shows robustness to distribution shifts compared to learning-based methods.
Read more
Solving Integer Linear Programming with Parallel Tempering
Summary
This paper presents a novel solver-free, sampling-based optimization framework for Integer Linear Programming (ILP), addressing the limitations of existing learning-based approaches that struggle with generalization and dependence on external solvers. The authors introduce a Locally-Balanced Proposal to construct a transition kernel that avoids gradient approximations, and they integrate Parallel Tempering to navigate the multimodal energy landscapes typical of ILP problems. Two tempering strategies are proposed: standard temperature tempering and penalty tempering, which modulates constraint barriers while maintaining the objective landscape over feasible solutions. The empirical evaluation demonstrates that the proposed method consistently outperforms SCIP across four benchmarks and matches or exceeds Gurobi's performance on two tasks within a 200-second time limit. Additionally, the framework shows robustness to distribution shifts, remaining competitive with classical solvers on real-world MIPLIB 2017 instances without requiring problem-specific tuning.
Methodology
The authors employ a sampling-based approach that leverages the linear structure of ILP to construct a transition kernel using a Locally-Balanced Proposal. They implement Parallel Tempering with two strategies: temperature tempering, which flattens the energy landscape, and penalty tempering, which selectively reduces barriers in infeasible regions while preserving feasible solutions.
Results
The proposed framework outperformed SCIP across all four benchmark tasks and matched or exceeded Gurobi's performance on two tasks within a 200-second budget. It also demonstrated greater robustness to distribution shifts compared to learning-based methods and remained competitive with classical solvers on MIPLIB 2017 instances without any problem-specific tuning.
Implications
This work suggests a new direction for solving ILP problems that could enhance the efficiency and effectiveness of combinatorial optimization methods, particularly in scenarios where traditional solvers struggle or require extensive tuning. The findings may also influence future research in optimization techniques and machine learning applications in combinatorial settings.
Augmented Lagrangian Predictive Coding
Optimization
Theory
- PC-ALM integrates augmented Lagrangian methods into predictive coding to enhance local learning dynamics.
- The method achieves BP-equivalent performance in nonlinear networks, particularly in deep narrow architectures.
- PC-ALM introduces recurrent dynamics that facilitate faster and more uniform credit propagation across layers.
- The approach maintains a finite inference budget while aligning weight updates with BP gradients.
Read more
Augmented Lagrangian Predictive Coding
Summary
This paper introduces Augmented Lagrangian Predictive Coding (PC-ALM), a novel approach that enhances predictive coding (PC) by integrating the augmented Lagrangian framework to align weight updates with backpropagation (BP) gradients. Unlike traditional BP, which requires global error propagation, PC-ALM maintains a local learning dynamic while accumulating per-layer constraint errors into a layer-local Lagrange multiplier. The authors demonstrate that in linear PC networks, PC-ALM converges to a solution that mirrors BP gradients through local updates. The paper further explores the performance of PC-ALM in nonlinear networks, showing that it effectively closes the performance gap with BP across various architectures, particularly in deep narrow networks where PC typically underperforms. The introduction of recurrent dynamics in activations allows for improved credit propagation, with PC-ALM exhibiting 'ballistic' credit distribution across layers, contrasting with the slower, diffusive nature of standard PC. The findings suggest that the augmented Lagrangian framework not only generalizes PC but also provides insights into local learning dynamics that could inform future neural network training methodologies.
Methodology
The authors developed PC-ALM by modifying the inference step of traditional predictive coding with a dual variable (Lagrange multiplier) that integrates prediction errors. This method operates within a finite inference iteration budget, allowing for local updates that converge to BP gradients in linear networks and match BP performance in nonlinear networks up to a depth of 128.
Results
PC-ALM demonstrated convergence to BP gradients in linear networks and matched BP performance in nonlinear networks across various width-depth configurations. The method showed significant improvements in credit propagation dynamics, distributing credit signals more uniformly across layers and accelerating the propagation wavefront in deep networks compared to standard PC.
Implications
The findings suggest that PC-ALM could lead to more biologically plausible training methods for deep learning, potentially influencing future research in local learning algorithms and decentralized network training. The insights gained from the augmented Lagrangian framework may also inform the design of more efficient neural networks.
Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail
Theory
Optimization
- Introduction of spectral position as a scalable measure of eigenvalue contributions to loss reduction.
- Larger models achieve lower losses by accessing weak spectral signals in the eNTK spectrum.
- Feature learning is identified as a key enabler of spectral reach, amplifying gradients during training.
- The study provides a framework for understanding the dynamics of scaling in large-scale neural networks.
Read more
Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail
Summary
This paper investigates the mechanisms behind neural scaling laws, which describe the predictable relationships between model size, dataset size, compute, and performance. The authors introduce a new metric called spectral position, which measures the eigenvalues of the empirical neural tangent kernel (eNTK) that contribute to loss reduction during training. Their findings reveal that as training progresses, spectral position decreases, indicating a shift from dominant eigenmodes to the spectral tail. Larger models exhibit greater spectral reach, allowing them to learn from weaker spectral signals that smaller models cannot access. The study identifies feature learning as a critical factor enabling this spectral reach, as it adaptively amplifies gradient magnitudes, sustaining learning progress where frozen representations would stall. The authors propose that these insights can inform architectural and optimizer design to enhance model performance.
Methodology
The authors derive a loss-network-position (LNP) decomposition that factors instantaneous loss changes into three components: network scale, loss scale, and scale-free spectral position. This allows for the evaluation of spectral position from per-sample gradients without explicit kernel construction. The framework is validated through controlled experiments with random-feature models, aligning empirical observations with theoretical predictions.
Results
The analysis shows that spectral position consistently decreases throughout training, with larger models reaching lower spectral positions than smaller ones. This indicates that larger models can sustain learning from smaller eigenvalues in the eNTK spectrum, leading to lower losses. Additionally, feature learning is shown to play a significant role in maintaining learning progress as spectral position decreases.
Implications
The findings suggest that understanding spectral reach can guide the design of more effective neural network architectures and optimization strategies. By leveraging insights into spectral dynamics, practitioners can enhance model performance, particularly in large-scale applications.
Spectral Anatomy of Quantum Gaussian Process Kernels
Theory
Optimization
- Introduces normalized spectral entropy as a key diagnostic for QGP kernels.
- Establishes a connection between spectral properties and the performance of QGPs in Bayesian optimization.
- Identifies a 'Goldilocks region' for kernel design that balances expressiveness and informativeness.
- Demonstrates empirical validation of findings on quantum hardware with minimal error.
Read more
Spectral Anatomy of Quantum Gaussian Process Kernels
Summary
This paper explores the spectral properties of quantum Gaussian process (QGP) kernels, revealing that two recent findings regarding QGPs are interconnected through the normalized spectral entropy of the kernel Gram matrix. The authors demonstrate that the eigenspectrum of the kernel matrix influences both the approximation capabilities of classical random-feature schemes and the informativeness of the GP posterior. They introduce a spectral coordinate, S(K)/log n, as a unified measure for assessing dequantization difficulty, posterior calibration, and variance contraction across various quantum and classical kernel families. The study identifies a 'Goldilocks region' where kernels are neither trivial nor overly concentrated, thus maintaining a useful GP posterior. Empirical results indicate that the optimal predictive performance varies with the target function, and the spectral statistics can be reliably estimated on current quantum hardware, showcasing the practical applicability of their findings.
Methodology
The authors analyze the spectral properties of kernel Gram matrices and derive theoretical bounds related to Nyström approximation errors and variance contraction. They conduct empirical experiments across multiple quantum and classical kernel families, utilizing both simulators and IBM Heron quantum hardware to validate their findings.
Results
The study finds that the normalized spectral entropy S(K)/log n serves as a robust indicator of QGP performance, with empirical results showing median absolute errors of 3.2% and mean errors of 5.2% across various configurations. The optimal predictive performance is shown to depend on the target function's characteristics, with different entropy levels yielding the best results for smooth versus band-limited targets.
Implications
The findings suggest that the spectral properties of quantum kernels can guide the design of more effective quantum algorithms for Bayesian optimization and other applications in quantum machine learning. The ability to estimate spectral statistics on current hardware indicates a pathway for practical implementations of QGPs in real-world scenarios.
Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models
NLP
Large Language Models
Generative Models
- Introduction of confidence-induced clusters (CICs) as span-level update units for MDLMs.
- Development of CLAD, a training-free cluster-level decoder that enhances parallel decoding.
- Utilization of self-attention maps to model inter-cluster dependencies and ensure compatibility.
- Demonstrated significant speed improvements over existing token-level decoding methods.
Read more
Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models
Summary
This paper introduces CLAD (Cluster-Level Attention-Guided Decoding), a novel approach to decoding in Masked Diffusion Language Models (MDLMs). The authors identify that existing methods typically operate at a token-level granularity, which limits the potential for parallelism during decoding. They propose a new granularity by defining confidence-induced clusters (CICs), which are contiguous spans of high-confidence masked positions. By leveraging self-attention maps to assess inter-cluster dependencies, CLAD allows for conflict-aware selection of CICs for simultaneous commitment. This method enhances the decoding process by enabling larger units of commitment while avoiding incompatible predictions. Experimental results demonstrate that CLAD achieves significant speedups (1.77×–8.47×) over traditional token-level decoding methods while maintaining comparable accuracy across various reasoning and code-generation benchmarks.
Methodology
The authors propose a two-step approach: first, they group adjacent high-confidence candidates into confidence-induced clusters (CICs) as update units. Then, they use self-attention maps from the forward pass to estimate dependencies between these clusters, allowing for conflict-aware selection of compatible CICs for parallel commitment. This results in a more efficient decoding process that can handle multiple predictions simultaneously.
Results
CLAD achieves speedups ranging from 1.77× to 8.47× compared to traditional Vanilla decoding methods. It also shows improved throughput over token-level dependency-aware baselines while maintaining broadly comparable task accuracy across four reasoning and code-generation benchmarks.
Implications
The findings suggest that changing the decoding unit from individual tokens to confidence-induced spans can significantly enhance throughput in language model decoding tasks. This approach could be beneficial in applications requiring efficient text generation, such as chatbots, automated coding assistants, and other NLP tasks.
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
Theory
Graph Learning
Efficient ML
- Restricting regressors to the Markov boundary can improve prediction, especially in larger and sparser feature spaces.
- Causal discovery methods often fail to provide a useful boundary for prediction due to computational constraints and misalignment of objectives.
- The exact Markov boundary is not the only effective feature set; alternative sets can also yield better performance than using all features.
- The study highlights the trade-offs between minimality, sufficiency, and scalability in feature selection for tabular prediction.
Read more
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
Summary
This paper investigates the utility of the Markov boundary in tabular prediction, which is the minimal set of features that renders all other features redundant for predicting a target variable. The authors explore whether restricting regressors to the Markov boundary improves prediction performance compared to using the full feature set. They conduct experiments using SCM3K, a synthetic benchmark with 3,450 tasks across various feature counts and regression models. The findings reveal that while restricting to the Markov boundary often enhances prediction, the process of discovering this boundary through causal discovery does not yield the expected benefits. The authors identify three main reasons for this discrepancy: causal discovery prioritizes structural recovery over predictive accuracy, the costs of false positives and negatives are asymmetrical, and many feature sets can outperform the full set without being the exact boundary. The paper concludes with insights on the implications for feature selection and causal structure in tabular models.
Methodology
The authors utilized SCM3K, a controlled benchmark of synthetic structural causal models, to evaluate the performance of six regression models. They compared the test errors of models trained on the full feature set against those trained on the oracle Markov boundary, measuring the difference as the MB gap. They also analyzed the limitations of causal discovery methods in recovering the Markov boundary.
Results
The results indicated that restricting to the Markov boundary generally improved prediction accuracy, with the improvement increasing as the feature space became larger and sparser. However, attempts to recover the boundary through causal discovery did not consistently outperform models using the full feature set, primarily due to computational limitations and the nature of the discovery process.
Implications
The findings suggest that while the Markov boundary is theoretically appealing for feature selection in tabular prediction, practical applications may require alternative approaches that balance predictive performance with computational efficiency. This has implications for the design of future regression models and feature selection techniques in machine learning.
Multivariate Distributional Reinforcement Learning Using Sliced Divergences
Reinforcement Learning
Theory
- Introduction of Sliced Distributional Reinforcement Learning (SDRL) for multivariate return distributions.
- Establishment of Bellman contraction under shared scalar discounting and a maximum-slicing variant for dense discount matrices.
- Analysis of various base divergences suitable for SDRL, including Wasserstein and MMD.
- Evaluation of SDRL on multiple environments, showcasing its practical applicability.
Read more
Multivariate Distributional Reinforcement Learning Using Sliced Divergences
Summary
This paper addresses the challenges of extending distributional reinforcement learning (DRL) to multivariate settings, where traditional metrics struggle with computational tractability and theoretical guarantees. The authors introduce Sliced Distributional Reinforcement Learning (SDRL), which utilizes projections to lift one-dimensional divergences to multivariate return distributions. They prove Bellman contraction for uniform slicing under shared scalar discounting and propose a maximum-slicing variant (MSDRL) that provides contraction guarantees for general dense discount matrices. The SDRL framework supports various base divergences, including Wasserstein, Cramér, and Maximum Mean Discrepancy (MMD), and is evaluated on a toy chain problem, a gridworld image-based environment, and a subset of Atari games, demonstrating its effectiveness in multivariate scenarios.
Methodology
The authors develop SDRL by leveraging the concept of slicing, which projects multivariate distributions into one-dimensional spaces for easier comparison. They establish theoretical foundations for Bellman contraction in both uniform and maximum slicing scenarios, allowing for efficient updates in multivariate DRL settings. The framework is designed to accommodate various divergence metrics, enhancing its flexibility and applicability.
Results
The SDRL framework was successfully applied to a toy chain problem and a gridworld environment, as well as a subset of Atari games, demonstrating improved performance in multivariate reinforcement learning tasks. The theoretical guarantees established for both uniform and maximum slicing variants provide a solid foundation for future research in this area.
Implications
The introduction of SDRL has significant implications for reinforcement learning applications that involve multivariate return distributions, such as multi-agent systems, complex decision-making environments, and scenarios requiring sophisticated discounting mechanisms. This work opens avenues for further exploration of distributional methods in high-dimensional settings.
Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach
Generative Models
Graph Learning
Optimization
- Combines WGANs with Genetic Algorithms for graph generation refinement.
- Addresses structural deviations in generated graphs compared to real data.
- Implements evolutionary edge editing to optimize graph connectivity.
- Demonstrates improved alignment with real graph statistical properties.
Read more
Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach
Summary
This paper addresses the challenge of generating realistic graph-structured data, which is essential for various applications such as data augmentation and privacy-preserving data sharing. The authors propose a hybrid approach that combines Wasserstein Generative Adversarial Networks (WGANs) with Genetic Algorithms (GAs) to refine the generated graphs. The WGAN framework is utilized to produce initial graphs with node features and connectivity patterns, while a Graph Neural Network (GNN) critic evaluates the realism and class consistency of these graphs. To enhance the quality of the generated graphs, a GA is applied post-generation to refine the edges, correcting structural deviations and improving alignment with real data distributions. The refinement process focuses on optimizing graph structures through evolutionary edge editing, which allows for precise adjustments to connectivity patterns. Experimental results demonstrate that the GA refinement significantly reduces discrepancies in structural properties, such as degree and spectral distributions, leading to synthetic graphs that better reflect real-world characteristics. This work highlights the effectiveness of evolutionary refinement in improving GAN-based graph generation methods, making them more suitable for practical applications.
Methodology
The methodology involves two main stages: (1) a coarse generation stage using a WGAN to produce initial graph structures, and (2) a refinement stage employing a Genetic Algorithm to iteratively optimize the generated graphs by modifying edges based on fitness measures derived from real graph statistics.
Results
The experimental results indicate that the GA refinement consistently lowers the Maximum Mean Discrepancy (MMD) between generated graphs and real graphs, resulting in synthetic graphs that exhibit more coherent structural patterns and improved connectivity reflective of the underlying data relationships.
Implications
The proposed hybrid approach enhances the capability of GAN-based models for generating realistic graph-structured data, which can be beneficial for applications in social networks, molecular biology, and other fields requiring synthetic graph data for analysis or model training.
The Hamilton-Jacobi Theory of Deep Learning
Theory
Optimization
Interpretability
- Training neural networks corresponds to solving Hamilton-Jacobi initial-value problems.
- Log-sum-exp activation functions are smooth deformations of tropical max operations.
- A single parameter ε connects various perspectives on neural networks, including tropical algebra and PDEs.
- The framework provides actionable design principles for optimizing neural network architectures.
Read more
The Hamilton-Jacobi Theory of Deep Learning
Summary
This paper presents a novel perspective on deep learning by framing the training of neural networks as a search through Hamilton-Jacobi initial-value problems. The authors establish that each gradient step corresponds to selecting initial data for a viscous Hamilton-Jacobi equation, with the Hopf-Cole propagator providing the best fit for observations. The paper demonstrates that various neural network architectures, including residual networks, transformers, and recurrent networks, can be structurally related to Hamilton-Jacobi equations, with a single deformation parameter ε unifying these perspectives. Key contributions include the identification of log-sum-exp layers as smooth deformations of tropical max operations, the establishment of a commutative diagram linking neural networks, tropical algebra, PDEs, and convex optimization, and actionable design principles for neural network architecture. The results yield insights into generalization rates, adversarial robustness, and influence functions, ultimately proposing a comprehensive mathematical theory of deep learning that connects disparate concepts under a unified framework.
Methodology
The authors utilize mathematical frameworks from Hamilton-Jacobi theory, tropical algebra, and convex optimization to establish connections between neural network architectures and partial differential equations. They employ the Hopf-Cole linearization and Maslov dequantization to derive relationships between different mathematical objects and neural network components.
Results
The paper establishes that every log-sum-exp activated layer encodes the exact Hopf-Cole solution of a viscous Hamilton-Jacobi PDE. It also identifies a minimax optimal generalization rate of O(n−1/(d+2)), demonstrates that adversarial robustness is controlled by the parameter ε, and provides a closed-form influence function for softmax weights. The findings suggest that the architecture of neural networks can be optimized based on the derived principles.
Implications
The unification of deep learning concepts under the Hamilton-Jacobi framework has the potential to enhance the design and understanding of neural networks, leading to improved performance in various applications. The insights into generalization and robustness could inform future research and practical implementations in machine learning.
Constrained Flow Optimization via Sequential Fine Tuning for Molecular Design
Generative Models
Optimization
- Introduction of a formal framework for Constrained Generative Optimization.
- Development of the Constrained Flow Optimization (CFO) algorithm for balancing reward maximization and constraint satisfaction.
- CFO provides convergence guarantees for constrained generative optimization.
- Experimental results show consistent improvements in reward and constraint satisfaction.
Read more
Constrained Flow Optimization via Sequential Fine Tuning for Molecular Design
Summary
This paper addresses the challenge of optimizing generative models, specifically diffusion and flow models, to maximize reward functions while adhering to constraints relevant in molecular design and protein engineering. The authors introduce a formal framework for Constrained Generative Optimization, which allows for the adaptation of pre-trained models to generate samples that not only maximize task-specific rewards (like binding affinity) but also satisfy domain-specific constraints (such as molecular synthesizability). The proposed algorithm, Constrained Flow Optimization (CFO), employs a dual approach based on the augmented Lagrangian scheme, transforming the constrained optimization problem into a sequence of simpler generative optimization tasks. This method enables automatic balancing between reward maximization and constraint satisfaction without the need for manual weight adjustments. The authors provide convergence guarantees for both constrained generative optimization and constrained generation through CFO. Experimental evaluations demonstrate that CFO consistently improves reward outcomes while maintaining high levels of constraint satisfaction across synthetic and molecular design tasks, highlighting its practical applicability in scientific discovery.
Methodology
The authors propose a dual approach using the augmented Lagrangian scheme to convert the constrained optimization problem into a series of ordinary generative optimization subproblems. CFO alternates between solving a KL-regularized fine-tuning problem to maximize an augmented reward function and updating the parameters based on estimated constraint violations from generated samples.
Results
CFO was evaluated in both synthetic settings and real molecular design tasks, achieving significant increases in reward while ensuring high constraint satisfaction. The results demonstrate the algorithm's effectiveness in optimizing generative models for practical applications in molecular design.
Implications
The findings suggest that CFO can enhance the reliability and predictability of generative models in scientific discovery applications, particularly in fields requiring adherence to strict constraints, such as drug discovery and protein engineering.
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Computer Vision
Theory
Optimization
- Introduction of a large-scale benchmark for post-hoc calibration covering diverse tasks and models.
- Standardized evaluation framework for comparing dozens of calibration methods.
- Post-Hoc Improvement (PHI) proposed as a new metric for assessing calibration quality.
- Empirical results show that smooth calibration functions are superior to binning methods.
Read more
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Summary
The paper introduces CalArena, a comprehensive benchmark for evaluating post-hoc calibration methods in machine learning. It addresses the critical issue of poorly calibrated probability estimates in classifiers, which can undermine decision-making in high-stakes applications. The authors compile nearly 2000 experiments across various tasks, including binary and multiclass classification in both tabular and computer vision domains. They provide a standardized framework for comparing numerous calibration methods, emphasizing the importance of Post-Hoc Improvement (PHI) as a more principled metric for assessing calibration quality. The study reveals that smooth calibration functions generally outperform binning-based methods, and highlights the necessity of calibration-specific designs for generic models. The authors also release all data, code, and tools to facilitate further research in this area.
Methodology
The authors constructed a suite of benchmarks by aggregating predictions from various classical and modern models across multiple datasets. They standardized implementations of numerous calibration methods and employed a new evaluation metric, Post-Hoc Improvement (PHI), to assess both calibration error reduction and predictive performance degradation.
Results
The results indicate consistent patterns across domains, with smooth calibration functions outperforming binning-based approaches. The study also found that dedicated multiclass methods are crucial in high-dimensional settings, and generic machine learning models often require calibration-specific designs to be competitive.
Implications
The findings have significant implications for practitioners in machine learning, particularly in fields where reliable probability estimates are essential. The benchmark and tools provided can guide the selection and development of effective calibration methods, ultimately improving the reliability of machine learning systems in critical applications.
Early Prediction of Future Behavioral Strategy from Process Traces
Reinforcement Learning
Time Series
Robotics
- The paper formulates early cross-task behavioral strategy prediction as a relevant problem.
- Introduction of the Process-Level Latent Variable Model (PLVM) for fusing task-specific traces.
- PLVM outperforms traditional outcome-based models and single-task models in predicting behavior.
- Controlled simulations validate the effectiveness of PLVM in recovering behavioral phenotypes.
Read more
Early Prediction of Future Behavioral Strategy from Process Traces
Summary
This paper addresses the challenge of predicting future behavioral strategies in adaptive systems using limited prior evidence. It highlights the difficulty of inferring person-level tendencies from standard behavioral outcomes, which often collapse distinct processes into similar results. Instead, the authors propose leveraging process-level traces that capture the unfolding of behavior within tasks. They introduce a Process-Level Latent Variable Model (PLVM) that encodes task-specific traces and fuses them into a shared latent representation for predicting behavior in a target task. The study is instantiated using the PowerWash Simulator dataset, where the model predicts whether players will exhibit locally persistent or frequently switching behaviors in a held-out level based on partial traces from two source tasks. The findings demonstrate that PLVM significantly outperforms outcome-based models and single-task process models, suggesting that cross-task modeling can effectively support early predictions of behavioral strategies when observing sufficient target-task behavior is impractical.
Methodology
The authors developed the Process-Level Latent Variable Model (PLVM), which utilizes task-specific encoders to summarize process traces from multiple source tasks. These summaries are then fused into a shared person-level latent representation that is used to predict behavior in a held-out target task. The methodology includes controlled simulations with known latent types to validate the model's effectiveness.
Results
The PLVM demonstrated superior performance in predicting player behavior in the PowerWash Simulator compared to outcome-based models and single-task process models. It effectively distinguished between locally persistent and frequently switching behaviors, indicating that transferable behavioral strategies are not fully captured by aggregate outcomes or single-task traces.
Implications
The findings suggest that adaptive systems, such as educational tutors or game AI, can benefit from early predictions of user behavior based on process traces. This could lead to more personalized and effective interventions and support strategies tailored to individual behavioral tendencies.
Momentum Based Reward Design for Low Emission Traffic Signal Control
Reinforcement Learning
Optimization
- Introduction of a Momentum-Based Reward Function (MBRF) that promotes continuous vehicle movement.
- Evaluation of the MBRF in SUMO shows better performance than traditional delay and queue-based rewards.
- The proposed method leads to improved throughput-emission trade-offs and more stable learning behaviors.
- Demonstrates the effectiveness of DRL in adaptive traffic signal control without requiring major infrastructure changes.
Read more
Momentum Based Reward Design for Low Emission Traffic Signal Control
Summary
This paper addresses the challenge of urban traffic congestion, which significantly contributes to environmental pollution and long commute times. Traditional traffic signal control systems often struggle to adapt to dynamic traffic conditions, leading to inefficiencies. The authors propose a novel Momentum-Based Reward Function (MBRF) for Deep Reinforcement Learning (DRL) traffic signal control, which encourages continuous vehicle movement rather than merely penalizing congestion. The MBRF is designed to promote sustained vehicle flow by incentivizing phase persistence based on observed discharge efficiency. The methodology is evaluated using the SUMO (Simulation of Urban MObility) environment, employing standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. The results demonstrate that the MBRF leads to improved throughput-emission trade-offs and more stable learning behaviors compared to traditional delay or queue-based rewards, as well as classical controllers like Max Pressure and LQF. This approach not only enhances traffic efficiency but also reduces emissions without explicitly optimizing for environmental metrics, showcasing the potential for DRL in adaptive traffic signal control.
Methodology
The authors formulated the traffic signal control problem as a Markov Decision Process (MDP) and implemented the MBRF within a DRL framework. The MBRF incentivizes sustained vehicle motion by rewarding phase persistence proportional to discharge efficiency, thereby aligning the learning objective with real traffic dynamics. The evaluation was conducted using the SUMO simulation environment, measuring various traffic performance metrics.
Results
The proposed MBRF outperformed traditional reward structures in terms of traffic throughput and emissions. The results indicated more stable learning behaviors and improved traffic efficiency, demonstrating that the MBRF effectively encourages smoother control policies without rigid constraints.
Implications
The findings suggest that the MBRF can be a valuable tool for urban traffic management, potentially leading to reduced congestion and emissions in real-world scenarios. This approach could be adapted for various traffic conditions and integrated into existing traffic management systems to enhance their responsiveness and efficiency.
CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment
Graph Learning
Time Series
Interpretability
- CellBRIDGE augments Optimal Transport with interaction-aware costs derived from ligand-receptor signaling.
- The method improves trajectory inference by explicitly modeling cell-cell communication.
- CellBRIDGE enables interpretable in silico perturbations that align with expected biological outcomes.
- The approach shows broad applicability across various trajectory inference frameworks.
Read more
CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment
Summary
The paper addresses the challenge of inferring cellular dynamics from population snapshots in single-cell RNA sequencing (scRNA-seq), where direct tracking of individual cells is not possible due to destructive measurements. Traditional methods rely on gene-expression distances for Optimal Transport (OT) but overlook the structured cell-cell interactions mediated by ligand-receptor signaling. The authors introduce CellBRIDGE, a novel approach that enhances OT by incorporating a directed, typed interaction cost based on ligand-receptor activity. This method improves cross-snapshot couplings and trajectory estimates in both synthetic and real scRNA-seq datasets. CellBRIDGE also allows for mechanistically interpretable in silico perturbations, demonstrating its effectiveness in modeling cellular dynamics and its potential for guiding experimental design in drug discovery.
Methodology
CellBRIDGE employs a Fused Gromov-Wasserstein (FGW) framework that minimizes both the cost of transport in gene expression space and the structural distortion of inferred communication networks. It constructs proxy communication networks based on directed ligand-receptor pairs within local expression neighborhoods, allowing for a biologically meaningful prior in trajectory inference.
Results
The experiments conducted on synthetic and real-world scRNA-seq datasets indicate that CellBRIDGE significantly enhances trajectory inference accuracy compared to feature-only baselines. The method also successfully captures the effects of silencing specific ligand-receptor pairs, leading to trajectory shifts that mirror expected biological responses.
Implications
CellBRIDGE has the potential to advance our understanding of cellular dynamics in development and disease, providing a valuable tool for drug discovery and experimental design. Its ability to incorporate biological interactions into trajectory inference could lead to more accurate models of cellular behavior.
AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis
NLP
Large Language Models
Multimodal
- AMNESIA is the first large-scale, clinically-grounded benchmark for medical unlearning.
- The dataset includes 70,560 question-answer pairs from real patient notes across 11 disease categories.
- The benchmark supports both factual and reasoning questions, enabling diverse evaluation scenarios.
- Four unlearning methods were evaluated, revealing significant challenges in maintaining knowledge integrity.
Read more
AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis
Summary
The paper introduces AMNESIA, the first large-scale benchmark for medical unlearning, addressing the need for models to selectively forget specific training data while maintaining overall utility. Existing benchmarks have largely focused on synthetic or general data, leaving a gap in clinical unlearning. AMNESIA comprises 70,560 question-answer pairs derived from 8,820 patient notes across 11 disease categories, allowing for both factual and reasoning questions. The benchmark facilitates evaluation of unlearning methods at random patient and disease levels, highlighting the challenges of erasing patient-specific knowledge without affecting shared clinical knowledge. The authors evaluate four widely used unlearning methods and introduce a novel metric for assessing the leakage of medical terminology. The findings reveal that unlearning individual patients can inadvertently erode knowledge of others with the same condition, underscoring the need for improved unlearning techniques that can better isolate patient data from shared clinical insights.
Methodology
The authors constructed the AMNESIA dataset from de-identified patient notes using the PMC-Patients-v2 dataset. They categorized patient notes into disease categories using a GPT model and created a comprehensive set of factual and reasoning questions. The evaluation involved applying four representative unlearning methods and analyzing their effectiveness in forgetting specific patient data while retaining overall model performance.
Results
The evaluation demonstrated that unlearning individual patients negatively impacted the model's knowledge of other patients with the same condition. This finding indicates that current unlearning methods may not effectively isolate patient-specific knowledge from shared clinical knowledge, highlighting the need for more sophisticated unlearning techniques.
Implications
AMNESIA provides a crucial resource for researchers in medical AI, particularly in developing methods that comply with privacy regulations like GDPR. The insights gained from this benchmark could lead to improved unlearning techniques that enhance patient privacy while preserving the utility of medical models.
Test Time Training for Supervised Causal Learning
Graph Learning
Theory
Efficient ML
- Identifies critical limitations in existing Supervised Causal Learning methods.
- Introduces TTT-SCL, a framework for dynamic training set generation at test time.
- Establishes a theoretical basis connecting TTT-SCL with score-based methods.
- Demonstrates significant performance improvements across various datasets.
Read more
Test Time Training for Supervised Causal Learning
Summary
This paper addresses the limitations of Supervised Causal Learning (SCL) in causal discovery, particularly its struggles with out-of-distribution generalization. The authors identify three main issues with existing SCL methods: a performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization. To overcome these challenges, they propose a novel framework called Test-Time Training for Supervised Causal Learning (TTT-SCL), which dynamically generates training sets tailored to specific test instances. This approach shifts the focus from static training sets to a more adaptive methodology that aligns closely with the characteristics of the test domain. The authors establish a theoretical connection between TTT-SCL and score-based methods, demonstrating that the latter can be viewed as a special case of TTT-SCL. Through extensive experiments on synthetic, pseudo-real, and real-world datasets, the authors show that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods, confirming its effectiveness and practical applicability in real-world scenarios.
Methodology
The authors propose the TTT-SCL framework, which involves dynamically generating training sets that are explicitly aligned with the distribution of the test instance. This is achieved through a specialized training process that adapts to the characteristics of the test data, allowing for improved generalization and robustness against distribution shifts.
Results
Experiments conducted on synthetic benchmarks, pseudo-real, and real-world datasets reveal that TTT-SCL significantly outperforms both existing SCL methods and traditional causal discovery techniques, indicating its superior performance and adaptability.
Implications
The findings suggest that TTT-SCL can enhance the applicability of causal learning in real-world scenarios, where data distributions may vary significantly from training data. This could lead to more reliable causal inference in various fields, including economics, healthcare, and social sciences.
Fixed Universal Transformers
Theory
- Introduces the notion of universal transformers that can simulate any transformer via input embeddings.
- Provides explicit constructions for universal transformers and shows that randomly initialized transformers are universal.
- Establishes lower bounds on the embedding dimensions required for universality in transformers with multiple heads.
- Empirical evaluations demonstrate high accuracy in specific algorithmic tasks, supporting the theoretical findings.
Read more
Fixed Universal Transformers
Summary
This paper introduces the concept of universal transformers, which are fixed transformers capable of simulating any transformer within a specified class through appropriate input embeddings. This concept parallels the idea of universal Turing machines, where the input embedding serves as a program encoding the target model's parameters while keeping all internal parameters of the universal transformer constant. The authors provide explicit sparse constructions that achieve universality when the embedding dimension is sufficiently large and demonstrate that randomly initialized transformers are almost surely universal. The paper empirically validates its theoretical claims through algorithmic tasks such as parenthesis balancing and multi-hop reasoning, suggesting that a significant portion of a transformer's expressive power may stem from its input representation rather than its learned weights.
Methodology
The authors formalize the concept of universal transformers and provide both explicit constructions and theoretical proofs regarding their universality. They analyze the embedding dimensions necessary for different configurations of transformers and conduct empirical evaluations on specific tasks to validate their theoretical claims.
Results
The paper shows that a fixed universal transformer can simulate any target transformer with a suitable input embedding, achieving universality under specific conditions. The empirical results indicate that both sparse and randomly initialized universal transformers perform well on tasks like parenthesis balancing and k-hop induction, achieving perfect or near-perfect accuracy.
Implications
The findings suggest that the architecture of transformers can be simplified by focusing on input representations, which could lead to more efficient model designs. Additionally, the concept of universal transformers may enhance transfer learning approaches by allowing pre-trained models to adapt to new tasks with minimal adjustments.
UniRTL: Unifying Code and Graph for Robust RTL Representation Learning
Multimodal
Graph Learning
- UniRTL integrates RTL code and CDFG for enhanced representation learning.
- The framework employs mutual masked modeling for fine-grained cross-modal alignment.
- A hierarchical training strategy is utilized to maximize data utilization.
- UniRTL outperforms existing methods in performance prediction and code retrieval tasks.
Read more
UniRTL: Unifying Code and Graph for Robust RTL Representation Learning
Summary
The paper presents UniRTL, a multimodal pretraining framework designed to enhance register transfer level (RTL) representation learning by integrating both RTL code and control data flow graph (CDFG) modalities. Existing methods typically rely on a single modality, which limits the expressiveness and generalization of learned representations. UniRTL addresses this by achieving fine-grained alignment between the code and graph through mutual masked modeling and employs a hierarchical training strategy that includes a pretrained graph-aware tokenizer. This approach maximizes data utilization by aligning text and code before integrating the graph. The authors evaluate UniRTL on performance prediction and code retrieval tasks, demonstrating that it consistently outperforms prior methods, thus establishing a more robust foundation for hardware design automation.
Methodology
UniRTL uses a unified Transformer architecture to integrate code and graph modalities, employing mutual masked modeling for alignment. It incorporates a graph-aware tokenizer and follows a hierarchical training strategy, aligning text and code before integrating the graph to leverage the richer information from both modalities.
Results
Experimental evaluations show that UniRTL consistently outperforms previous methods in both performance prediction and code retrieval tasks, demonstrating its effectiveness in providing robust RTL representations.
Implications
The development of UniRTL has significant implications for hardware design automation, potentially accelerating the design workflow and improving the efficiency of tasks like performance prediction and code retrieval. Its multimodal approach could also inspire future research in integrating diverse data modalities in other domains.
Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects
Theory
- Chem-PerturBridge harmonizes diverse transcriptomic datasets, providing a unified resource for small-molecule perturbation modeling.
- The resource includes over 37,000 compounds and 1.25 million transcriptomic samples, standardized for better usability.
- Fine-grained logFC agreement across datasets is weak, while logFC direction agreement is more consistent.
- Pretraining on Chem-PerturBridge significantly improves compound representation learning outcomes.
Read more
Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects
Summary
The paper introduces Chem-PerturBridge, a comprehensive and harmonized resource designed to facilitate the training and evaluation of small-molecule transcriptomic perturbation models. This resource integrates over 37,000 compounds, 136 cellular contexts, and 1.25 million transcriptomic samples across various assay types, including bulk RNA-seq and single-cell data. The authors address the fragmentation of existing transcriptomic resources by standardizing metadata and creating condition-level perturbation effects. They evaluate the agreement of matched conditions across datasets and find that while fine-grained logFC rankings show weak agreement, the direction of logFC is more stable. Additionally, the paper demonstrates that embeddings pretrained on Chem-PerturBridge improve performance in compound representation learning compared to existing methods. The resource not only serves as a benchmark for evaluating cross-dataset signature agreement but also enhances the predictive capabilities of models trained on heterogeneous perturbation data.
Methodology
The authors developed Chem-PerturBridge by integrating multiple transcriptomic datasets, standardizing metadata across compounds, cellular contexts, doses, and assays. They constructed condition-level perturbation effects using a shared replicate-aware layer and performed differential gene expression analysis. The resource was then evaluated for matched-condition agreement and used for pretraining compound representation models.
Results
The evaluation revealed that matched same-compound conditions exhibited weak agreement in logFC rankings across datasets, while direction agreement was more stable. Models pretrained on Chem-PerturBridge outperformed those trained solely on L1000 embeddings and other baseline methods in downstream prediction tasks.
Implications
Chem-PerturBridge provides a valuable resource for researchers in pharmacogenomics and systems biology, enabling better integration of diverse transcriptomic data for improved modeling of small-molecule effects. It facilitates the development of more accurate predictive models and enhances the understanding of chemical perturbation effects across different cellular contexts.
Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens
Optimization
Theory
Efficient ML
- Introduces a unified framework for ZO Hessian approximation using single-step Policy Optimization.
- Presents ZoVH, a comprehensive suite of variance-reduced Hessian estimators.
- Establishes theoretical guarantees for the unbiasedness and variance optimality of the proposed methods.
- Demonstrates significant improvements in estimation accuracy and convergence performance in empirical evaluations.
Read more
Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens
Summary
This paper addresses the challenge of accurate Zeroth-Order (ZO) Hessian estimation, which is crucial for derivative-free optimization tasks such as bilevel optimization and Bayesian inference. The authors propose a unified framework that reinterprets ZO Hessian approximation through the lens of single-step Policy Optimization (PO), establishing a theoretical equivalence between ZO Hessian estimators and the Hessian of a smoothed PO objective. This leads to the introduction of ZoVH, a suite of variance-reduced estimators for the full Hessian matrix and its inverse. The methodology leverages an optimal baseline to minimize variance and a query reuse strategy to enhance sample efficiency. Theoretical analysis confirms the unbiasedness and variance optimality of the proposed estimators, while empirical results demonstrate superior estimation accuracy and convergence performance compared to classical methods. The paper also develops a curvature-aware Zeroth-Order Optimization (ZOO) algorithm, which incorporates ridge regularization and bias correction, proving its effectiveness through extensive experiments.
Methodology
The authors reinterpret ZO Hessian approximation as the Hessian of a smoothed objective under a parameterized sampling policy. They introduce ZoVH, which utilizes an optimal baseline to minimize variance and a query reuse strategy to leverage historical function queries, enhancing sample efficiency. Theoretical analysis is provided to confirm the properties of the estimators, and empirical evaluations are conducted to validate the findings.
Results
The theoretical analysis confirms that the ZoVH estimators are unbiased and optimal in terms of variance. Empirical results show that ZoVH achieves lower estimation errors compared to classical Hessian estimators across various synthetic functions and neural networks, and the curvature-aware ZOO algorithm demonstrates substantial improvements over existing ZO methods in practical applications.
Implications
The proposed methods can significantly enhance the performance of derivative-free optimization tasks, particularly in high-dimensional settings where traditional methods struggle. The findings have potential applications in areas requiring efficient optimization techniques, such as machine learning model training and uncertainty quantification.
CSULoRA: Closest Safe Update Low-Rank Adaptation
NLP
Large Language Models
Efficient ML
- CSULoRA is a post-hoc method that enhances safety in low-rank adaptation without retraining the model.
- It decomposes LoRA updates into components based on their alignment with a safety-aligned subspace.
- The method preserves task-relevant information while mitigating unsafe updates.
- Experimental results show a significant reduction in attack success rates while retaining utility improvements.
Read more
CSULoRA: Closest Safe Update Low-Rank Adaptation
Summary
The paper introduces CSULoRA, a novel post-hoc method designed to enhance the safety of low-rank adaptation (LoRA) for fine-tuning large language models. While LoRA is effective for parameter-efficient adaptation, it poses risks to safety when exposed to adversarial or unsafe fine-tuning data. Existing methods to preserve safety often involve hard interventions that can compromise task-relevant information or require additional tuning. CSULoRA addresses these limitations by estimating a safety-aligned subspace based on the weight displacement between a safety-aligned model and its base checkpoint. It decomposes LoRA updates into fully aligned, partially aligned, and off-subspace components. Instead of discarding the off-subspace components, CSULoRA employs a closed-form penalized minimum-change optimization to preserve the fully aligned components while attenuating potentially unsafe directions. Experimental results demonstrate that CSULoRA significantly reduces attack success rates in adversarial fine-tuning scenarios while maintaining the utility gains achieved through standard LoRA fine-tuning.
Methodology
CSULoRA estimates a safety-aligned subspace by analyzing the weight displacement between a safety-aligned model and its base model. It decomposes each LoRA update into three components: fully aligned, partially aligned, and off-subspace. The method then solves a closed-form optimization problem that preserves the fully aligned component and softly penalizes the energy in the remaining components, ensuring a smooth adjustment without the need for retraining.
Results
In adversarial fine-tuning experiments, CSULoRA significantly lowers the attack success rate compared to standard LoRA and other safety-preserving methods, while also preserving most of the utility improvements gained from LoRA fine-tuning.
Implications
CSULoRA has the potential to improve the safety of large language models in real-world applications, particularly in scenarios where models are fine-tuned on user-generated data that may contain unsafe examples. This method can be applied in various domains where safety and reliability are critical, such as healthcare, finance, and customer service.
Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions
Optimization
Graph Learning
Interpretability
- COAST provides a principled framework for designing interventions that induce state transitions in complex systems.
- The framework integrates causal discovery and multi-objective optimization to balance efficacy, complexity, and stability of interventions.
- COAST is modular and domain-agnostic, making it applicable across various fields, particularly in biomedicine.
- The approach successfully identifies causal drivers and robust intervention strategies from both synthetic and real datasets.
Read more
Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions
Summary
The paper introduces COAST (Causally Optimal Actions for State Transitions), a novel framework designed to facilitate the in-silico design of interventions that can induce specific state transitions in complex systems, particularly in biomedical contexts. Traditional predictive models often lack mechanistic insight and do not provide a structured approach for decision-making regarding interventions. COAST addresses this gap by employing causal intelligence to learn context-specific causal graphs and structural causal models from data that characterize source and target states. The framework incorporates a multi-objective optimization approach that balances the efficacy of transitions, the complexity of interventions, and the stability of target states. COAST is modular and domain-agnostic, allowing for the integration of various components such as feature selection, causal discovery, and intervention evaluation. The authors demonstrate the effectiveness of COAST through synthetic benchmarks and real biological datasets, successfully identifying key causal drivers and robust intervention strategies that achieve desired state transitions while providing transparent mechanistic rationales for experimental validation.
Methodology
COAST employs a modular approach that includes context-specific feature selection, learning of causal graphs and structural causal models, and a multi-objective optimization formulation. This allows it to identify interventions that can induce desired state transitions while adhering to biological and practical constraints.
Results
The application of COAST on synthetic benchmarks and real biological datasets demonstrated its capability to recover key causal drivers and identify effective single- and multi-target intervention strategies. The framework achieved desired state transitions while providing clear mechanistic rationales for the interventions.
Implications
The COAST framework has significant implications for fields such as biomedicine and drug discovery, where understanding causal relationships and designing effective interventions are crucial. It can accelerate the discovery process by reducing experimental burdens and enhancing the prioritization of interventions based on mechanistic insights.
Active Timepoint Selection for Learning Measure-Valued Trajectories
Time Series
Optimization
Theory
- Introduces a framework for active timepoint selection in measure-valued trajectories.
- Utilizes Linearized Optimal Transport to facilitate Gaussian Process modeling in non-Euclidean spaces.
- Addresses the challenge of epistemic uncertainty quantification in distributional interpolation.
- Empirical results show improved performance over uniform and random baselines.
Read more
Active Timepoint Selection for Learning Measure-Valued Trajectories
Summary
This paper addresses the challenge of inferring continuous probability paths from sparse snapshots, particularly in fields like single-cell biology where data acquisition is costly and destructive. The authors propose a novel active learning framework that strategically selects optimal measurement times to minimize uncertainty in estimating underlying probability paths. The framework leverages Linearized Optimal Transport (LOT) to map distributional snapshots into a tangent space suitable for Gaussian Process (GP) modeling. This approach allows for the construction of a tractable probabilistic surrogate for the probability path, enabling the iterative selection of measurement times. The proposed method overcomes challenges associated with the non-Euclidean geometry of the output space and the difficulty of quantifying epistemic uncertainty in measure-valued dynamics. Empirical evaluations demonstrate that the active timepoint selection strategy significantly outperforms traditional uncertainty-agnostic baselines on both synthetic and real-world datasets.
Methodology
The authors propose an active timepoint selection strategy that involves linearizing the Wasserstein space using Linearized Optimal Transport. They map observed snapshots to a tangent space, compress these tangent vectors, and apply a warped Gaussian Process prior to model the temporal coefficients. This allows for the quantification of epistemic uncertainty and the selection of optimal measurement times.
Results
The proposed acquisition strategy outperformed uniform and random baselines in empirical tests conducted on both synthetic and real-world datasets, demonstrating its effectiveness in minimizing uncertainty in estimating probability paths.
Implications
This work has significant implications for fields requiring high-fidelity data acquisition under budget constraints, such as single-cell biology, where strategic measurement timing can enhance the understanding of dynamic processes without incurring excessive costs.
IRIS: time-structured manifold projections
Time Series
Optimization
Theory
- IRIS integrates time-structured layouts with manifold learning, enhancing the visualization of dynamic biomedical data.
- The algorithm operates in two phases: optimizing radial distances for timestamps and adjusting angular positions for high-dimensional structure.
- Evaluation across multiple datasets shows IRIS outperforms UMAP in representing temporal relationships while retaining class structure.
- The method is open-source, promoting accessibility and further research in the field.
Read more
IRIS: time-structured manifold projections
Summary
The paper introduces IRIS, a novel Manifold Learning algorithm designed to visualize high-dimensional biomedical data while incorporating temporal information. Traditional algorithms like t-SNE and UMAP struggle to represent time-ordered data effectively, often leading to a loss of critical temporal dynamics in visualizations. IRIS addresses this limitation by structuring layouts both chronologically and according to manifold topology, enabling clearer insights into the evolution of cell types and other classes over time. The algorithm operates in two phases: first, it optimizes the mapping of timestamps to radial distances, ensuring a uniform density of points, and second, it adjusts the angular positions of points to minimize divergence between high-dimensional and low-dimensional representations. The authors evaluate IRIS on diverse datasets, including single-cell RNA sequencing (scRNA-seq) and comparative metagenomics, demonstrating its ability to reveal dynamic phenomena that are obscured in traditional layouts. The results indicate that IRIS effectively structures data by time while maintaining class distinctions, outperforming UMAP in temporal representation. The algorithm is open-source and implemented in Python and C++, with potential future enhancements aimed at improving computational efficiency and developing interactive visualization tools.
Methodology
IRIS employs a two-phase optimization process: first, it maps timestamps to radial distances to ensure uniform density, and second, it optimizes the angular positions of points to minimize divergence between high-dimensional and low-dimensional spaces. This is achieved through a polar reparameterization of the Euclidean cost function.
Results
IRIS effectively structures layouts by time, demonstrating superior performance in temporal representation compared to UMAP. Quantitative metrics confirm that IRIS retains class structuring similar to UMAP while elucidating dynamic phenomena that are not apparent in traditional visualizations.
Implications
The ability to visualize high-dimensional biomedical data with integrated temporal information has significant implications for understanding biological processes, such as developmental stages in scRNA-seq and temporal trends in metagenomics. The open-source nature of IRIS encourages further exploration and application in various biomedical research contexts.
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Graph Learning
Time Series
- GC-MoE utilizes a dual-pathway routing mechanism that combines static topology features with dynamic traffic input signals.
- The framework allows for expert specialization by assigning different experts to different nodes based on their unique traffic patterns.
- GC-MoE achieves significant improvements in forecasting accuracy while maintaining a low parameter count.
- The optional output refinement layer can enhance performance further without substantial additional costs.
Read more
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Summary
The paper introduces GC-MoE, a novel framework for spatio-temporal forecasting on sensor graphs, specifically targeting traffic prediction. Traditional approaches often apply a single model across all nodes, which can be suboptimal due to the varying dynamics of different road segments. GC-MoE addresses this by employing a mixture of experts approach, where each node is assigned a personalized combination of frozen forecasting experts based on the graph's topology and recent traffic data. The framework integrates multiple pretrained spatio-temporal graph neural network (GNN) experts and utilizes a lightweight, input-aware routing mechanism that adapts to current traffic conditions. Additionally, the authors explore an optional graph-conditioned output refinement layer and conduct ablation studies using node-adaptive ST-LoRA adapters. The results demonstrate that GC-MoE outperforms a zero-parameter ensemble baseline in terms of Mean Absolute Error (MAE) while maintaining competitive performance in Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE), all while training only approximately 17K parameters on top of 1.5M frozen expert weights.
Methodology
The GC-MoE framework involves pretraining multiple diverse spatio-temporal GNN experts, freezing them, and then training a lightweight routing module that assigns expert weights based on both static and dynamic inputs. The framework also includes an optional output refinement layer and evaluates the use of node-adaptive ST-LoRA adapters for further performance insights.
Results
GC-MoE demonstrated improved MAE over a zero-parameter ensemble baseline across four standard benchmarks (PEMS04, PEMS07, METR-LA, and PEMS-BAY). It achieved competitive RMSE and MAPE metrics while only requiring about 17K trainable parameters, leveraging the representational capacity of 1.5M frozen expert weights.
Implications
The proposed framework has significant implications for urban traffic forecasting, enabling more accurate predictions tailored to specific road segments. This could enhance traffic management systems, inform urban planning, and improve overall transportation efficiency.
When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?
Reinforcement Learning
Large Language Models
Robotics
- Prompted Policy Optimization (PromptPO) leverages LLMs to optimize policies for RL tasks.
- PromptPO often outperforms standard RL algorithms in terms of performance and sample efficiency.
- The method generates a diverse range of policies based on the provided context.
- LLMs may struggle with tasks requiring fine-grained control, indicating limitations in certain environments.
Read more
When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?
Summary
This paper investigates the effectiveness of large language models (LLMs) as black-box policy optimizers for reinforcement learning (RL) tasks, specifically through a method called Prompted Policy Optimization (PromptPO). PromptPO prompts an LLM with Python descriptions of the state space, action space, and reward function, allowing it to generate and refine executable policies based on feedback from environment rollouts. The authors demonstrate that PromptPO often matches or exceeds the performance of classical RL algorithms while requiring significantly fewer interactions with the environment. The study reveals that LLMs can produce a variety of policies, from simple proportional controllers to more complex planning algorithms, depending on the context provided. However, the method shows limitations in environments requiring fine-grained continuous control, such as MuJoCo domains. Overall, the findings suggest that LLM-based optimization can serve as a competitive alternative to traditional RL methods in many sequential decision-making tasks.
Methodology
The authors introduce Prompted Policy Optimization (PromptPO), which involves prompting an LLM with Python-formatted descriptions of the state space, action space, and reward function. The LLM generates policies that are evaluated through rollouts in the environment, with feedback used to refine the policies iteratively.
Results
PromptPO matches or exceeds the performance of standard RL algorithms like SAC, PPO, and DQN in various environments, including Meta-World robotics tasks and real-world control problems. However, it underperforms in MuJoCo continuous control tasks, highlighting its limitations in environments requiring precise control.
Implications
The findings suggest that LLMs can be effectively utilized for policy optimization in RL, potentially simplifying the design process by reducing the need for extensive hyperparameter tuning and algorithm selection. This could lead to broader adoption of RL techniques in practical applications.
Generalized Intention Modeling in Multi-Agent Reinforcement Learning
Reinforcement Learning
- Introduction of a task-adaptive opponent modeling framework that combines multiple intent representations.
- Development of reward-predictive intention embeddings that enhance the ego-agent's understanding of opponent impact on returns.
- Demonstration of improved performance stability and robustness compared to traditional single-component modeling methods.
- Insights into the varying effectiveness of opponent modeling strategies across different environments.
Read more
Generalized Intention Modeling in Multi-Agent Reinforcement Learning
Summary
This paper addresses the challenge of modeling opponents' intentions in multi-agent reinforcement learning (MARL), which is crucial for effective decision-making in competitive environments. Traditional methods rely on fixed episode components, such as predicting the opponent's next action or future states, which may not universally represent intent across different tasks and environments. The authors propose a task-adaptive opponent modeling framework that learns a mixture of multiple intent representations, optimizing for performance-driven outcomes. A novel intention representation is introduced that maximizes mutual information with the ego-agent's future returns, allowing for a more relevant capture of opponent information. The proposed architecture combines various episode components dynamically, enabling the ego-agent to adapt its understanding of intent based on the specific environment. Empirical results demonstrate that this adaptive approach outperforms state-of-the-art baselines across several multi-agent benchmarks, providing insights into the effectiveness of different opponent modeling strategies and their dependence on the environment.
Methodology
The authors propose an adaptive architecture that employs a mixture-of-experts approach to model opponent intentions. This architecture includes separate modules for different episode components (actions, observations, future states, and rewards) that are combined using a learned mixing mechanism. Additionally, they introduce a contrastive InfoNCE objective to model intention embeddings predictive of future rewards, maximizing mutual information with the ego-agent's returns.
Results
The adaptive opponent modeling framework consistently matches or exceeds the performance of state-of-the-art baselines across diverse tasks, including Level-Based Foraging, Predator-Prey, Kuhn Poker, and a customized Google Research Football scenario. The results indicate that modeling intentions based on future ego-agent rewards can yield more informative representations than traditional methods focusing on future states or actions.
Implications
This work has significant implications for improving decision-making in competitive multi-agent environments, potentially enhancing applications in areas such as robotics, game AI, and strategic planning. The insights gained from the adaptive modeling approach could inform the design of more effective agents that better understand and predict the behavior of opponents.
SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction
Time Series
- Introduces a novel temporal CWT-LSTM architecture for ICU alarm classification.
- Achieves a mean AUC of 0.822, significantly outperforming static classification methods.
- Demonstrates the importance of temporal chunking and multi-channel signal fusion.
- Identifies specific alarm types that are easier or harder to classify.
Read more
SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction
Summary
The paper addresses the critical issue of alarm fatigue in intensive care units (ICUs), where clinical monitors generate an overwhelming number of alarms, most of which are clinically irrelevant. This desensitization can lead to missed true emergencies, posing risks to patient safety. The author introduces SigmaMedStat, a machine learning system designed to evaluate the trustworthiness of physiological alarm signals before clinical action is taken. The proposed methodology involves a temporal modeling framework that segments each 60-second alarm recording into six consecutive 10-second chunks. Each chunk is processed using Continuous Wavelet Transform (CWT) to generate scalograms, which are then encoded with an EfficientNet-B0 encoder and analyzed by a two-layer Long Short-Term Memory (LSTM) network. The system was evaluated on the PhysioNet/Computing in Cardiology Challenge 2015 dataset, achieving a mean AUC of 0.822 ± 0.016 through five-fold stratified cross-validation. The results indicate that temporal modeling significantly outperforms static classification methods, with ablation studies confirming the independent contributions of temporal chunking and multi-channel signal fusion to performance. The analysis also highlights the varying classification accuracy across different alarm types, with Ventricular Flutter being the most accurately classified and Asystole the most challenging. The findings suggest that temporal structure in physiological signals is a valuable feature for improving alarm classification in clinical settings.
Methodology
The methodology involves splitting 60-second ICU alarm recordings into six 10-second chunks, applying Continuous Wavelet Transform (CWT) to generate scalograms for each chunk, encoding them with an EfficientNet-B0 encoder, and analyzing the resulting feature sequences using a two-layer Long Short-Term Memory (LSTM) network. The system was validated using five-fold stratified cross-validation on the PhysioNet dataset.
Results
The proposed system achieved a mean AUC of 0.822 ± 0.016, outperforming a static EfficientNet baseline by 18.1 AUC points. Ablation studies confirmed that both temporal chunking and multi-channel signal fusion independently enhance classification performance. The analysis revealed that Ventricular Flutter was classified with an AUC of 0.820, while Asystole had the lowest classification accuracy (AUC 0.722).
Implications
The findings suggest that incorporating temporal modeling into alarm classification systems can significantly reduce false alarms in ICUs, potentially improving patient safety and clinical response times. The approach may be applicable to other areas of healthcare monitoring where alarm fatigue is a concern.
Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation
Interpretability
- Introduces an expert-augmented framework combining machine learning with chemists' expertise for route evaluation.
- Utilizes a DeepSets-based model trained on tree edit distances and fine-tuned with expert evaluations.
- Achieves significant improvements in scoring accuracy and interpretability compared to existing methods.
- Provides a dual-output evaluation system that aligns with real-world decision-making in synthesis planning.
Read more
Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation
Summary
This paper addresses the challenge of selecting efficient multi-step synthetic routes in organic synthesis, particularly in medicinal and process chemistry. The authors propose an expert-augmented, data-driven scoring framework that integrates machine learning with chemists' domain knowledge to provide both numerical scores and interpretable qualitative assessments of synthetic routes. The framework employs a DeepSets-based model trained on tree edit distances between reference and machine-generated routes, which is then fine-tuned using expert evaluations. This approach allows for a dual-output evaluation, providing a multi-objective quality score alongside a feasibility rating. The model demonstrates significant improvements over previous baselines in both quantitative scoring and qualitative assessments, achieving a Spearman correlation coefficient of 0.78 and a Pearson correlation of 0.77 for category assessments, as well as a top-1 ranking accuracy of 60.2% for score predictions. The results indicate that the framework effectively captures the nuances of expert chemical judgment, making it a valuable tool for retrosynthetic planning.
Methodology
The authors developed a DeepSets-based scoring model that processes synthetic routes by comparing tree edit distances. The model is trained on a large dataset of patent routes and fine-tuned using expert evaluations to enhance its predictive capabilities. The framework translates complex chemical reasoning into a learnable format, allowing for both numerical scoring and qualitative assessments of route feasibility.
Results
The proposed model achieved a Spearman correlation of 0.78 ± 0.05 and a Pearson correlation of 0.77 ± 0.06 in predicting category assessments. It also reached a top-1 ranking accuracy of 60.2% for score predictions, significantly outperforming the previous baseline of 17.5%. The model effectively captures expert judgment nuances, achieving a classification accuracy of 67 ± 6.4% in a three-tier rating system.
Implications
This framework has the potential to streamline the process of synthetic route evaluation in organic chemistry, making it more efficient and scalable. By integrating expert knowledge with machine learning, it can enhance decision-making in drug discovery and development, ultimately leading to more effective and cost-efficient synthesis strategies.
Self-Certifying Transport MCMC via Dual Spectral-Gap Certificates
Theory
Generative Models
Efficient ML
- Introduction of CerT-MCMC framework for learned-transport MCMC with convergence certificates.
- Development of two complementary certificates: covering certificate and quantile-core certificate.
- Quantile-core certificate provides non-vacuous spectral-gap bounds in high dimensions.
- Demonstrated effectiveness on various datasets, including synthetic and real-world applications.
Read more
Self-Certifying Transport MCMC via Dual Spectral-Gap Certificates
Summary
This paper introduces CerT-MCMC, a novel framework that enhances learned-transport Markov chain Monte Carlo (MCMC) methods with automatic and rigorous convergence certificates. The framework utilizes a normalizing flow to map a Gaussian reference distribution to an approximation of the target posterior, which serves as both the proposal in the independence Metropolis–Hastings (IMH) algorithm and the basis for computable spectral-gap bounds. Two complementary certificates are developed: the covering certificate, which bounds weight-ratio oscillation across the full proposal support, and the quantile-core certificate, which focuses on a high-probability core where oscillation is controlled by empirical quantiles. The paper demonstrates that the quantile-core certificate provides non-vacuous spectral-gap bounds even in high dimensions, outperforming the covering certificate in scenarios where the latter becomes ineffective. The framework is validated through experiments on synthetic targets, structural-engineering posteriors, and real-data logistic regression, showing that the quantile-core certificate can effectively track empirical effective sample sizes. This dual-certificate approach is the first of its kind to provide automatic, dimension-aware convergence certificates for learned-transport MCMC, distinguishing between genuine transport failures and limitations of proof techniques.
Methodology
The methodology involves using normalizing flows to create proposals for the IMH algorithm, and deriving two types of spectral-gap certificates. The covering certificate uses finite-sample covering arguments to bound oscillation over the full proposal support, while the quantile-core certificate restricts to a high-probability core, applying one-dimensional empirical quantiles for oscillation control.
Results
The quantile-core certificate consistently delivered non-vacuous spectral-gap bounds across various dimensions (up to D=20) and datasets, while the covering certificate became ineffective in higher dimensions. The quantile-core certificate tracked empirical effective sample sizes within 7%, demonstrating its practical utility.
Implications
The findings suggest that the CerT-MCMC framework can significantly improve the reliability of learned-transport MCMC methods in Bayesian computation, providing practitioners with robust tools for convergence assessment in high-dimensional settings.
Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints
Generative Models
Optimization
Theory
- Introduces a principled constrained optimization framework for unlearning in diffusion models.
- Formulates three optimization problems based on KL divergences and likelihood constraints.
- Establishes strong duality for the proposed problems, enabling effective solution characterization.
- Demonstrates superior performance of KL-constrained methods over traditional weight-based approaches.
Read more
Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints
Summary
This paper addresses the challenge of unlearning in diffusion models, which involves removing undesirable data or concepts while maintaining the utility of pretrained models. The authors propose a constrained optimization framework that formulates unlearning as minimizing the deviation from a pretrained model while enforcing constraints to separate the model from unlearning distributions. They introduce three optimization problems based on reverse and forward KL divergences and likelihood constraints, generalizing existing approaches for concept and data unlearning. The paper establishes strong duality for these nonconvex problems, allowing for the characterization of optimal solutions and the development of primal-dual algorithms. Experimental results show that the KL-constrained approach outperforms weight-based baselines in achieving a better retention-unlearning tradeoff, while the likelihood-based method effectively balances unlearning and concept preservation.
Methodology
The authors employ a constrained optimization framework to minimize the distance between a pretrained model and the unlearning targets, using reverse and forward KL divergences to formulate the problems. They develop primal-dual algorithms based on the strong duality of the formulated problems, which allows for efficient computation of optimal solutions.
Results
The experimental results indicate that the KL-constrained approach achieves better retention-unlearning tradeoffs compared to weight-based methods. The likelihood-based unlearning method matches the effectiveness of unlearning while better preserving the retained concepts, demonstrating the advantages of the proposed framework.
Implications
This work has significant implications for the ethical deployment of generative models, allowing for the removal of harmful content while maintaining model utility. It provides a systematic approach to machine unlearning that can be applied in various domains where data privacy and ethical considerations are paramount.
Automating Formal Verification with Reinforcement Learning and Recursive Inference
Reinforcement Learning
Large Language Models
Theory
- Introduces RLVR to improve formal verification in LLMs.
- Achieves a verified reward increase from 2.2% to 58.1% using RLVR.
- Identifies and addresses specification hacking in model outputs.
- Develops a verifier-guided inference scaffold that improves proof generation success rates.
Read more
Automating Formal Verification with Reinforcement Learning and Recursive Inference
Summary
This thesis addresses the challenges of automating formal verification for large language models (LLMs), particularly due to the scarcity of data for proof assistants and the need for precise machine-checkable specifications. The author proposes a novel approach that combines reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search to enhance the generation of verified programs and proofs. The study begins with training open-source models in Dafny using RLVR, achieving a significant increase in verified rewards from 2.2% to 58.1%. However, issues such as specification hacking were identified, where models exploited weak formal specifications. By filtering out underspecified tasks and employing multi-turn RLVR, the verified pass rate improved from 9.7% to 31.1%. Additionally, a verifier-guided inference scaffold in Lean was developed, treating proof generation as a structured search over subgoals, which led to an increase in the pass rate from 46.2% to 69.2% on a pilot set. The thesis also introduces Dalek-Bench, a benchmark derived from the Rust curve25519-dalek verification project, although initial results indicate a need for stronger evaluation methods. Overall, the findings suggest that formal verifiers can significantly enhance LLM capabilities when they are utilized as sources of reward and feedback, provided that the environment offers clean data and robust specifications.
Methodology
The methodology involves training models using reinforcement learning techniques, specifically Group Relative Policy Optimization (GRPO), to optimize the generation of verified programs. The approach includes filtering tasks to eliminate specification vulnerabilities and employing a structured search framework for proof generation in Lean, which incorporates verifier feedback and diagnostics.
Results
The initial experiments showed a verified reward increase from 2.2% to 58.1%, with further refinement leading to a pass rate improvement from 9.7% to 31.1%. The verifier-guided inference scaffold improved the pass rate from 46.2% to 69.2% on a pilot set, and the new benchmark Dalek-Bench revealed areas needing stronger evaluation methods.
Implications
The findings suggest that integrating formal verification processes into LLM training can enhance the reliability and correctness of generated code, which is crucial for applications in cybersecurity and software development. This work lays the groundwork for future research in automated verification methods and their application in real-world scenarios.
De-attribute to Forget for LLM Unlearning
NLP
Large Language Models
Reinforcement Learning
- Introduces a novel data de-attribution objective for LLM unlearning.
- Presents DareU, the first LLM unlearning framework utilizing reinforcement learning.
- Demonstrates effective unlearning while preserving model utility.
- Outperforms existing LLM unlearning methods in empirical evaluations.
Read more
De-attribute to Forget for LLM Unlearning
Summary
This paper addresses the challenges of unlearning in large language models (LLMs), particularly in the context of inappropriate training data. Existing methods often focus on optimizing prediction losses, which can lead to issues like over-forgetting and degraded model performance. The authors propose a novel framework called DareU, which reframes the unlearning objective as reducing data attribution scores for the forget set. By employing reinforcement learning, specifically Proximal Policy Optimization (PPO), DareU aims to minimize the attribution of LLM-generated responses to the data owners from whom data is to be forgotten. The empirical results demonstrate that DareU effectively balances the quality of forgetting with the utility of the model, outperforming existing unlearning baselines. This approach not only provides a more precise optimization target but also ensures that the model does not generate incoherent outputs post-unlearning.
Methodology
The authors propose DareU, which uses reinforcement learning to optimize LLM responses by minimizing the attribution scores associated with the forget set. The framework employs Proximal Policy Optimization (PPO) to align the model's outputs with the goal of reducing the influence of specific data owners.
Results
DareU was empirically validated and shown to achieve a better balance between forget quality and model utility compared to existing unlearning baselines. The results indicate that DareU effectively reduces the attribution scores of LLM-generated responses to the forget set while maintaining coherent output.
Implications
The findings suggest that DareU can be applied in scenarios requiring compliance with data protection regulations, such as GDPR, by providing an efficient method for LLMs to forget specific data without extensive retraining. This has significant implications for privacy and data management in AI applications.
Improving Relative Representations with Learned Anchors and Whitened Inner Products
Multimodal
- Introduces learned anchors as robust semantic prototypes for improved relative representations.
- Utilizes a geometry-aware similarity metric that preserves magnitude information and is invariant to affine transformations.
- Demonstrates significant performance gains in cross-model communication across vision and language tasks.
- Enables stable zero-shot communication between models of varying scales and architectures.
Read more
Improving Relative Representations with Learned Anchors and Whitened Inner Products
Summary
This paper addresses the challenge of aligning independently trained neural models that converge to incompatible latent representations, which hinders modular AI systems. The authors propose an enhanced framework for cross-model communication through two key improvements: learning robust semantic anchors and employing a geometry-aware similarity metric. Traditional methods rely on randomly sampled anchors and cosine similarity, which often fail to capture the complex geometries of modern architectures like Transformers. The proposed method ensures better coverage of the data manifold and preserves discriminative magnitude information while being invariant to affine shifts. The results demonstrate significant performance improvements and consistency across various vision and language tasks, enabling nearly lossless information transfer and stable zero-shot communication even among heterogeneous architectures.
Methodology
The authors developed a framework that involves learning anchors to ensure they are informative and stable, covering the data manifold effectively. They also introduced a covariance-aware similarity measure that retains useful angular and magnitude information while being robust to affine distortions. This dual approach addresses the limitations of traditional relative representation methods.
Results
The proposed framework showed substantial improvements in performance and consistency across multiple tasks, allowing for effective zero-shot stitching of embeddings from different models. The method achieved nearly lossless information transfer, even between models with significant architectural differences.
Implications
This work has implications for the development of modular AI systems, enabling better interoperability between different neural models. It can enhance applications in areas requiring cross-model communication, such as transfer learning and multi-task learning in both vision and language domains.
OISD: On-Policy Internal Self-Distillation of Language Models
NLP
Large Language Models
Reinforcement Learning
- Introduction of On-Policy Internal Self-Distillation (OISD) for language models.
- Utilizes the final layer as an internal teacher to guide intermediate layers.
- Employs logit and attention alignment mechanisms for effective knowledge transfer.
- Demonstrates substantial improvements in reasoning tasks over strong RL baselines.
Read more
OISD: On-Policy Internal Self-Distillation of Language Models
Summary
This paper introduces a novel framework called On-Policy Internal Self-Distillation (OISD) aimed at enhancing reasoning capabilities in language models through reinforcement learning (RL). Traditional RL post-training methods often focus solely on optimizing the final output policy using sparse rewards, neglecting the rich predictive signals present in intermediate model representations. OISD addresses this gap by utilizing the final layer of the model as both the acting policy and an internal teacher for selected intermediate layers. The framework employs two key mechanisms: logit alignment, which transfers high-level reasoning behaviors, and attention alignment, which ensures consistent attention patterns across layers. This approach allows for the distillation of informative intermediate representations without relying on external privileged information. The authors demonstrate the effectiveness of OISD through experiments on four mathematical reasoning tasks, showing significant improvements over existing strong RL baselines. The results suggest that leveraging internal computations for distillation can lead to more accurate and coherent reasoning trajectories in language models.
Methodology
The OISD framework operates by keeping the final layer as the sole acting policy during rollout and optimization, while using it as a detached internal teacher for selected intermediate layers. The framework employs logit alignment to transfer predictive beliefs and attention alignment to ensure consistent attention patterns, facilitating the learning of stronger intermediate representations without external supervision.
Results
Experimental results indicate that OISD significantly outperforms strong reasoning RL baselines across four mathematical reasoning benchmarks, showcasing the effectiveness of internal self-distillation in enhancing model reasoning capabilities.
Implications
The findings suggest that internal self-distillation can be a powerful approach for improving reasoning in language models, potentially leading to advancements in various applications such as mathematical reasoning, coding, and complex instruction-following tasks. This research opens avenues for further exploration of internal mechanisms in model training.
MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment
NLP
Efficient ML
Optimization
- Introduction of MIC framework for optimizing multi-granular embeddings.
- Development of Soft Collapse Regularization to manage redundancy in nested subspaces.
- Implementation of Spectral Isotropy Regularization for ensuring uniformity in embeddings.
- Demonstrated significant performance improvements over existing MRL baselines.
Read more
MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment
Summary
The paper introduces MIC, a novel framework designed to enhance multi-scale representation learning by addressing issues of dimensional redundancy and spectral collapse in nested subspaces. The authors propose two key regularization techniques: Soft Collapse Regularization (SCR) and Spectral Isotropy Regularization (SIR). SCR minimizes redundancy between prefix and residual subspaces through cross-correlation penalties, while SIR ensures uniform distribution of embeddings in low-dimensional prefixes. By integrating these strategies within a self-distillation objective, MIC generates semantically dense representations that retain high discriminative power, particularly in scenarios requiring high compression. The framework shifts the focus of Matryoshka Representation Learning from usability to maximizing informational capacity through geometric alignment. Extensive experiments demonstrate that MIC outperforms standard baselines, especially in low-dimensional settings where maintaining informational capacity is critical.
Methodology
The MIC framework enhances Matryoshka Representation Learning by combining a nested contrastive loss with two novel regularizers: Soft Collapse Regularization and Spectral Isotropy Regularization. SCR employs a thresholded correlation penalty to manage redundancy between prefix and residual subspaces, while SIR ensures isotropic distribution of embeddings. The framework utilizes self-distillation to optimize these geometric properties, preventing representation collapse and ensuring high semantic density.
Results
The experiments conducted show that MIC consistently outperforms state-of-the-art MRL baselines across various tasks, particularly excelling in high-compression scenarios where the preservation of informational capacity is crucial. The results indicate that MIC effectively mitigates redundancy and enhances the discriminative power of embeddings.
Implications
The findings suggest that MIC could be applied in various domains requiring efficient representation learning, such as Natural Language Processing and other fields where high-dimensional embeddings are prevalent. The techniques developed could lead to more efficient models that balance performance with resource constraints.
Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences
Federated Learning
Reinforcement Learning
Large Language Models
- Introduces FedVPA-GP to address limitations of monolithic reward models in federated learning.
- Utilizes a Federated Mixture Prior to stabilize variational inference and prevent posterior collapse.
- Incorporates Orthogonal Loss to ensure semantic separation of conflicting preference prototypes.
- Demonstrates significant performance improvements over traditional methods on the HH-RLHF dataset.
Read more
Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences
Summary
This paper presents Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a novel framework designed to address the challenges of personalizing user preferences in a federated learning setting. Traditional federated learning approaches often rely on monolithic reward models that fail to capture the diverse and conflicting preferences of users, such as the balance between helpfulness and harmlessness. The authors identify that existing methods lead to posterior collapse due to local data scarcity and heterogeneity, which undermines the effectiveness of preference learning. FedVPA-GP introduces a Federated Mixture Prior that allows clients to utilize the aggregate population distribution as a dynamic prior, stabilizing variational inference. Additionally, an Orthogonal Loss is incorporated to enforce the separation of preference prototypes in the latent space, preventing posterior collapse. Experimental results on the HH-RLHF dataset demonstrate that FedVPA-GP significantly outperforms traditional monolithic baselines, effectively disentangling conflicting user intents and enabling dynamic preference switching based on context. The framework thus provides a promising solution for personalized user preference alignment while maintaining data privacy.
Methodology
The FedVPA-GP framework combines Federated Learning with Variational Preference Learning. It employs a Federated Mixture Prior to aggregate learned distributions from clients, stabilizing local inference. An Orthogonal Loss is introduced to maintain the separation of distinct preference prototypes in the latent space. The Gumbel-Softmax relaxation is used for end-to-end differentiability, facilitating the learning of personalized reward models without sharing raw data.
Results
Experiments conducted on the HH-RLHF dataset reveal that FedVPA-GP significantly outperforms existing monolithic baselines. The qualitative analysis confirms the framework's ability to disentangle user preferences in the latent space, allowing the model to switch dynamically between helpful and harmless modes based on the context inferred from user interactions.
Implications
The proposed framework has significant implications for developing personalized AI systems that respect user privacy while effectively capturing diverse preferences. It can be applied in various domains, including recommendation systems, conversational agents, and any application requiring nuanced understanding of user intents.
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood
Reinforcement Learning
NLP
Large Language Models
- RL2ML connects standard reinforcement learning, maximum-likelihood training, and beyond-maximum-likelihood objectives.
- Introduces a closed-form unbiased gradient estimator for finite-rollout surrogate objectives.
- Identifies a subcritical-supercritical update-scale transition that influences the effectiveness of surrogate objectives.
- Demonstrates that the optimal choice of surrogate objective depends on evaluation metrics, local sensitivity, and estimator variance.
Read more
RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood
Summary
This paper introduces RL2ML, a novel family of finite-rollout surrogate objectives designed to bridge the gap between reinforcement learning (RL) and maximum likelihood (ML) training. The primary focus is on optimizing language models using binary feedback from sampled outputs, specifically addressing the conflation of the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups. RL2ML provides a closed-form, unbiased gradient estimator that maintains alignment between the estimator and the objective under a fixed rollout budget. The paper reveals a subcritical-supercritical update-scale transition that is not apparent in traditional population-level objective notation, emphasizing the importance of local sensitivity and estimator variance in selecting the best surrogate objective. The findings suggest that the choice of the surrogate objective can be framed as a one-dimensional optimization problem, rather than an unconstrained hyperparameter search, thus simplifying the optimization process.
Methodology
The paper develops the RL2ML framework by defining a truncated power-likelihood surrogate objective and deriving an unbiased estimator under a finite rollout budget. It employs calibrated local-gain analysis and variance decomposition to analyze the update-scale geometry and the implications of different surrogate objectives. The methodology includes a detailed examination of the MaxRL estimator and its relationship to the proposed RL2ML framework.
Results
The results indicate that the RL2ML framework effectively preserves the estimator-objective alignment of MaxRL while allowing for a continuous degree of freedom in the choice of surrogate objectives. The analysis reveals that the optimal choice of the surrogate objective is not solely determined by its proximity to maximum likelihood but is influenced by various factors including the evaluation metric and estimator variance. The paper provides insights into how to allocate weight to low-success prompts more effectively in finite-horizon training.
Implications
The findings of this paper have significant implications for the training of language models and other machine learning tasks that rely on binary feedback. By providing a structured approach to selecting surrogate objectives, RL2ML can enhance the efficiency and effectiveness of reinforcement learning algorithms, particularly in scenarios with limited rollout budgets. This could lead to improved performance in applications such as natural language processing and other areas where feedback is sparse or binary.
TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery
Graph Learning
- TabCausal addresses the limitations of existing CDFMs by utilizing a broad causal pretraining framework.
- The model can perform one-pass dataset-to-graph inference, enhancing efficiency in causal discovery.
- Dynamic task construction enables learning from a wide range of causal environments, improving transferability.
- TabCausal outperforms classical and neural causal discovery methods on synthetic and semantic benchmarks.
Read more
TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery
Summary
The paper presents TabCausal, a causal discovery foundation model (CDFM) designed to enhance the recovery of causal relations from observational and interventional data in tabular formats. The authors identify that existing CDFMs struggle to outperform classical methods due to limitations in causal pretraining task construction. To address this, TabCausal employs a broad causal pretraining framework that samples diverse causal environments, including various graph priors, structural mechanisms, noise models, and intervention regimes. This dynamic task construction allows the model to learn transferable structural cues, enabling it to predict directed edge probabilities or adjacency matrices in a single forward pass without the need for dataset-specific retraining. The model is evaluated on large-scale synthetic benchmarks and a novel semantic causal environment benchmark, demonstrating superior performance in structure recovery, particularly when interventional data is available. The findings suggest that broad causal pretraining is crucial for effective causal discovery across diverse scenarios.
Methodology
TabCausal employs a data-driven approach to causal discovery, utilizing a causal environment engine that samples a variety of causal settings. It constructs dynamic tasks that combine different graph structures, mechanisms, and intervention types, allowing the model to learn from a diverse set of causal environments. The model predicts causal relationships in a single forward pass, leveraging broad pretraining to enhance its structural inference capabilities.
Results
TabCausal achieved the best overall average rank in causal discovery tasks across large-scale synthetic benchmarks, particularly excelling in scenarios with interventional evidence. The model demonstrated robust structure recovery capabilities, outperforming classical, neural, and amortized baselines in both observational and mixed-interventional regimes.
Implications
The findings suggest that broad causal pretraining can significantly improve the performance of causal discovery models, making them more applicable in real-world scenarios where data may be limited or complex. This approach could enhance decision-making processes in various fields such as biology, finance, and social sciences by providing more reliable causal insights.
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Theory
Optimization
Efficient ML
- Dataset value is influenced by factors beyond size and compute budget.
- The Vendi Score and neural scaling laws are shown to be submodular.
- Matrix spectral functions provide a broader framework for dataset appraisal.
- A new optimization method yields a 35,000× speedup for maximizing the Vendi Score.
Read more
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Summary
This paper addresses the challenge of appraising the value of datasets in machine learning, emphasizing that dataset value is not solely determined by size or compute budget. The authors explore the relationship between neural scaling laws and the Vendi Score, both of which exhibit submodularity. They introduce a broader class of objectives called matrix spectral functions, which includes the Vendi Score and determinantal point processes (DPPs). The paper presents a novel method for efficiently optimizing these objectives using secular-equation-based updates, achieving significant speed improvements. The authors evaluate various data appraisal methods, including the Vendi Score and facility location, against held-out test performance across multiple datasets. The findings reveal that while the Vendi Score is predictive within moderate ranges, it can perform poorly at higher values, whereas facility location consistently outperforms other methods. The study concludes that dataset value is complex and cannot be reduced to size, class balance, or training budget alone.
Methodology
The authors analyze the submodularity of neural scaling laws and the Vendi Score, introducing matrix spectral functions as a generalization. They develop a fast secular-function strategy for optimizing these objectives, significantly reducing computational overhead. The performance of various data appraisal methods is compared using empirical evaluations on ImageNet-1K-scale datasets.
Results
The study demonstrates that while the Vendi Score can predict dataset value, it is less reliable at higher score ranges. Facility location consistently outperforms the Vendi Score and other matrix spectral variants in predicting held-out test performance. Random sampling of datasets shows limited variety in appraisal scores and performance.
Implications
The findings suggest that more nuanced methods for dataset appraisal could enhance the efficiency of machine learning training processes. The introduction of matrix spectral functions may lead to better data selection strategies, impacting various applications in machine learning.
On Distributional Reinforcement Learning in Chaotic Dynamical Systems
Reinforcement Learning
Theory
Optimization
- Distributional RL objectives are smoother than expectation-based objectives in chaotic systems.
- Return distributions under mild statistical stability assumptions are Lipschitz continuous in the 1-Wasserstein metric.
- Empirical analysis shows that distributional objectives lead to smoother loss landscapes and lower variance targets.
- Distributional Q-learning methods outperform non-distributional approaches in chaotic control experiments.
Read more
On Distributional Reinforcement Learning in Chaotic Dynamical Systems
Summary
This paper addresses the challenges posed by chaotic dynamical systems in the context of Reinforcement Learning (RL), particularly focusing on the high sensitivity to initial conditions that leads to high-variance bootstrap targets and poorly conditioned gradient updates. The authors argue that traditional RL methods, which optimize expected returns through scalar value functions, fail in chaotic environments due to the irregularities introduced by chaotic dynamics. They propose that under certain statistical stability assumptions, the return distribution evolves more smoothly than individual trajectories when measured using the 1-Wasserstein metric. This insight leads to a distributional RL framework that aligns optimization with the structure of return distributions, resulting in better-conditioned learning. The paper provides a theoretical foundation for the advantages of distributional methods in chaotic systems and empirically demonstrates that these methods yield smoother optimization landscapes and improved performance in chaotic control tasks compared to non-distributional approaches.
Methodology
The authors analyze the optimization landscape of chaotic systems and demonstrate that distributional RL can provide a smoother optimization objective. They employ theoretical proofs regarding the Lipschitz continuity of return distributions and conduct empirical experiments to compare distributional and non-distributional RL methods in chaotic environments.
Results
The study finds that distributional RL methods result in smoother loss landscapes and lower variance in one-step targets, which in turn leads to improved episodic returns in chaotic control tasks. The theoretical results confirm that the distributional RL objective is more stable than traditional scalar value function approaches.
Implications
The findings suggest that distributional RL could be a more effective approach for learning in chaotic environments, which are prevalent in various scientific and engineering domains. This could enhance the reliability of RL applications in fields such as climate modeling, fluid dynamics, and multi-agent systems.
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Optimization
Theory
- Introduction of Singularity-aware Adam (S-Adam) to address issues in non-smooth optimization.
- Development of the Local Geometric Instability (LGI) metric for estimating instability in loss landscapes.
- Adaptive damping mechanism that modulates step sizes based on local geometric conditions.
- Rigorous convergence guarantees established through differential inclusions.
Read more
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Summary
The paper addresses the challenges of optimizing deep learning models that incorporate non-smooth components, such as ReLU activations and quantization operators, which lead to gradient chattering and poor convergence. The authors propose a new optimizer called Singularity-aware Adam (S-Adam) that stabilizes training by dynamically adjusting step sizes based on local geometric instability. The key innovation is the Local Geometric Instability (LGI) metric, which estimates the Clarke subdifferential diameter using the variance of randomized directional derivatives. S-Adam employs an adaptive damping mechanism to slow down updates in regions of high instability while allowing for rapid convergence in smoother areas. The authors provide a rigorous convergence analysis, demonstrating that S-Adam converges almost surely to Clarke stationary points at an optimal rate of O(1/√T). Empirical evaluations show that S-Adam outperforms existing optimizers like AdamW and Prox-SGD in scenarios involving Quantization-Aware Training and high-noise small-batch learning, achieving significant accuracy improvements on benchmark datasets.
Methodology
The authors developed S-Adam by integrating the LGI metric to assess local geometric instability and employing an adaptive damping mechanism that adjusts learning rates in real-time. They conducted a convergence analysis using differential inclusions and performed empirical evaluations on various datasets to compare S-Adam's performance against existing optimizers.
Results
S-Adam demonstrated significant improvements in accuracy, achieving up to +6% on CIFAR-100 and +3% on TinyImageNet compared to AdamW and Prox-SGD. The optimizer effectively mitigated gradient oscillations and improved convergence stability in non-smooth optimization scenarios.
Implications
The findings suggest that S-Adam can serve as a robust alternative to existing adaptive optimizers, particularly in deep learning contexts where non-smooth components are prevalent. This work bridges theoretical insights in non-smooth optimization with practical applications in deep learning, potentially enhancing model training efficiency and performance.
Bifurcated Remaining Useful Life Prediction: A Hybrid Approach for Realistic Uncertainty Characterization
Time Series
- Introduces a hybrid prognostic framework for RUL estimation that incorporates uncertainty quantification.
- Utilizes a bifurcated approach to classify engine states into healthy and degraded regimes.
- Employs an LSTM-based autoencoder for state classification and a Conditional Weibull model for RUL estimation.
- Generates continuous state probabilities for improved prediction accuracy and uncertainty characterization.
Read more
Bifurcated Remaining Useful Life Prediction: A Hybrid Approach for Realistic Uncertainty Characterization
Summary
This paper presents a novel hybrid framework for estimating the Remaining Useful Life (RUL) of turbofan engines, focusing on uncertainty quantification. Utilizing the NASA C-MAPSS dataset, the authors bifurcate the operational lifespan of engines into 'healthy' and 'degraded' states. An LSTM-based autoencoder is employed to classify these states based on reconstruction error from nominal data. For the healthy regime, a Conditional Weibull Survival Analysis is used for estimating Mean Residual Life, while a Probabilistic Neural Network with Monte Carlo Dropout addresses uncertainties in the degraded regime. The framework innovatively uses a calibrated sigmoid function to convert autoencoder outputs into continuous state probabilities, allowing for dynamic weighting of predictions. This approach generates physically consistent uncertainty bands, enhancing prediction confidence, particularly near end-of-life scenarios, and provides a robust tool for risk-informed maintenance decisions.
Methodology
The methodology involves a hybrid approach where an LSTM-based autoencoder classifies engine states into healthy and degraded regions. For RUL estimation, a Conditional Weibull Survival Analysis is applied in the healthy region, while a Probabilistic Neural Network with Monte Carlo Dropout is utilized in the degraded region. The predictions from both models are fused using probability weights derived from the autoencoder's output.
Results
The proposed framework successfully generates uncertainty bands that are physically consistent, yielding high-confidence predictions, especially as the engine approaches end-of-life. The integration of uncertainty quantification significantly enhances the reliability of RUL predictions compared to traditional methods.
Implications
This research has significant implications for predictive maintenance in safety-critical sectors such as aviation, automotive, and heavy manufacturing. By providing a robust tool for RUL estimation with integrated uncertainty quantification, it aids operators in making informed maintenance decisions, thereby enhancing operational safety and reliability.
MAAT: Multi-phase Adapter-Aware Targeted Unlearning
NLP
Large Language Models
Theory
- Introduction of 5WBENCH, a benchmark that quantifies causal unlearning failures.
- MAAT framework achieves high forgetting and retention of Why-type causal knowledge.
- Demonstrates the challenges of gradient dilution and multi-hop reasoning in unlearning.
- Outperforms existing methods on the forget-retain tradeoff across multiple models.
Read more
MAAT: Multi-phase Adapter-Aware Targeted Unlearning
Summary
The paper addresses a significant gap in machine unlearning evaluation, specifically the underrepresentation of causal knowledge in existing benchmarks. The authors introduce 5WBENCH, a balanced benchmark consisting of 5,000 samples categorized into Who, What, When, Where, and Why questions, highlighting the inadequacy of current methods in handling causal knowledge. They demonstrate that existing unlearning methods struggle to achieve both high forgetting and high retention of Why-type causal knowledge due to the complexity of multi-hop reasoning and gradient dilution. To tackle this issue, the authors propose MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a novel three-phase framework that operates on LoRA adapter weights. MAAT employs structured adapter surgery techniques, including gradient projection, SVD-based pruning, task vector negation, and hybrid KL–hidden-state retain repair. The results show that MAAT is the first method to effectively balance forgetting and retention of causal knowledge, achieving a new benchmark on the forget-retain Pareto frontier, outperforming all existing baselines.
Methodology
The MAAT framework consists of three phases: (1) gradient projection to orthogonalize forget updates against retain gradients, (2) SVD-based pruning of adapter dimensions to focus forgetting signals, and (3) hybrid KL–hidden-state retain repair to prevent re-learning of forgotten content. This structured approach allows for targeted unlearning without merging adapter weights into the base model.
Results
MAAT successfully achieves a balance between forgetting and retaining causal knowledge, reaching a new operational point on the forget-retain Pareto frontier. It outperforms all baseline methods evaluated on both Llama 3.2-3B and Gemma 3-4B models.
Implications
The findings suggest that existing unlearning methods may not adequately address causal knowledge, which is critical for applications requiring reliable and interpretable AI systems. The introduction of 5WBENCH and the MAAT framework can guide future research in improving unlearning techniques, particularly in domains where causal reasoning is essential.
Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption
Theory
Efficient ML
Optimization
- Introduces a robust recovery algorithm for Gaussian Single Index Models with non-monotonic link functions.
- Establishes the existence of a convex basin in the loss landscape that aids in robust recovery.
- Demonstrates efficient convergence to low estimation error under adversarial conditions.
- Fills a significant gap in robust statistics literature for non-monotonic link functions.
Read more
Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption
Summary
This paper addresses the challenge of robustly learning Gaussian Single Index Models (SIMs) amidst heavy-tailed noise and a fraction of adversarially corrupted data. Previous research has focused on specific cases like linear regression and monotonic link functions, but these methods do not extend to generic asymmetric non-monotonic link functions, which are prevalent in modern neural architectures. The authors propose a novel robust recovery algorithm that operates with near-linear sample and time complexity for these non-monotonic link functions, filling a significant gap in the literature. The key contribution is a new structural understanding of the loss landscape under adversarial conditions, demonstrating the existence of a constant-radius convex basin around the true parameter. This basin can be efficiently accessed through robust spectral initialization, allowing for effective gradient descent that converges to a low estimation error. The findings provide the first robust recovery guarantees for a wide range of nonlinear SIMs, which were previously unaddressed, thus advancing the field of robust statistics in high-dimensional settings.
Methodology
The authors utilize a combination of robust spectral initialization and gradient descent techniques to navigate the loss landscape of non-monotonic SIMs. They establish theoretical guarantees for the existence of convex basins around the true parameters, which facilitate efficient recovery even in the presence of adversarial corruption.
Results
The proposed algorithm achieves a final estimation error of O(σ√ϵ) with a time complexity of ˜O(nd) and requires ˜O(d) samples, where ϵ denotes the contamination fraction. This represents a significant improvement over previous methods that either failed under adversarial conditions or were limited to monotonic link functions.
Implications
The findings have broad implications for robust statistical modeling in high-dimensional data settings, particularly in applications involving neural networks and other machine learning models that utilize non-monotonic link functions. The ability to recover parameters robustly in the presence of adversarial corruption enhances the reliability of machine learning systems in real-world scenarios.
How's it going? Reinforcement learning in language models recruits a functional welfare axis
NLP
Large Language Models
Reinforcement Learning
- Reinforcement learning recruits a pre-existing representation of functional welfare in language models.
- The study demonstrates that punishment and reward vectors behave as representations of negative and positive welfare, respectively.
- The effects of these vectors are robust across various training conditions and persist even in pre-trained models.
- The functional welfare axis influences model behavior in unrelated domains, indicating a generalization of learned representations.
Read more
How's it going? Reinforcement learning in language models recruits a functional welfare axis
Summary
This paper investigates how reinforcement learning (RL) influences the internal representations of language models, specifically focusing on the concept of functional welfare. The authors present evidence that RL recruits a pre-existing representation of functional welfare, which reflects how well or poorly the model is performing relative to its goals. They train language models in a novel maze environment with semantically neutral rewards and extract concept vectors for both rewarded and punished trajectories. The analysis reveals that the punishment vector aligns with negative welfare indicators, promoting failure-related tokens and negative emotions, while the reward vector corresponds to positive welfare, encouraging completion-related tokens and positive sentiments. These vectors are shown to be effective even in models prior to maze training, suggesting that the functional welfare axis pre-exists post-training. The findings highlight the ability of minimal reward signals to broadly influence model behavior and have implications for interpretability, post-training dynamics, and alignment.
Methodology
The authors trained language models in a text-based maze environment with neutral rewards. They extracted concept vectors for rewarded and punished trajectories and evaluated their effects on model behavior in various unrelated tasks, including sentiment analysis and confidence assessments.
Results
The analysis showed that the punishment vector (vMOLD) promotes negative outcomes, while the reward vector (vGOLD) encourages positive outcomes. The vectors were nearly antiparallel and effectively influenced model behavior across different tasks. These effects were consistent regardless of model family, scale, and training algorithms.
Implications
The study suggests that minimal reward signals can significantly affect model behavior by recruiting pre-existing welfare-like representations. This has important implications for the interpretability of language models, their post-training dynamics, and alignment with human values.
ScaleMAP: Preserving Local Density and Neighborhood Structure in Low-Dimensional Embeddings
Graph Learning
Theory
Efficient ML
- ScaleMAP preserves local density and neighborhood structure better than existing methods.
- It introduces a change of variables approach to reintroduce scale information in embeddings.
- ScaleMAP matches DensMAP on density preservation while maintaining UMAP-level neighborhood preservation.
- The method successfully recovers critical structures in various scientific datasets.
Read more
ScaleMAP: Preserving Local Density and Neighborhood Structure in Low-Dimensional Embeddings
Summary
The paper introduces ScaleMAP, a novel modification of UMAP designed to preserve local density and neighborhood structure in low-dimensional embeddings. Traditional nonlinear dimensionality-reduction methods, such as UMAP and DensMAP, often fail to maintain the scale of local neighborhoods, leading to the loss of important structures in high-dimensional data. ScaleMAP addresses this issue by reintroducing scale information through a change of variables in the embedding process, rather than adding a competing objective. This method divides the pairwise embedding displacement by the geometric mean of the local radii of the two endpoints in the original space, effectively maintaining neighborhood preservation while enhancing density representation. The authors demonstrate that ScaleMAP achieves comparable neighborhood preservation to UMAP while significantly improving density preservation, particularly in scientific datasets such as transcriptomics, hyperspectral imaging, and flow cytometry. The results indicate that ScaleMAP can recover critical structures that UMAP collapses, such as sparse bridges between cell populations and narrow spectral spikes, while accurately representing density across a wide range of magnitudes.
Methodology
ScaleMAP modifies the UMAP embedding process by adjusting the pairwise distances between points using the geometric mean of their local radii in the original space. This change of variables allows for the preservation of neighborhood structure while incorporating local scale information without introducing additional loss terms.
Results
ScaleMAP was evaluated on standard benchmarks and scientific datasets, demonstrating that it maintains UMAP-level neighborhood preservation while broadly matching or exceeding DensMAP in terms of density preservation. It effectively recovers sparse structures in transcriptomic data and accurately represents density across 17 orders of magnitude in flow cytometry data.
Implications
The development of ScaleMAP has significant implications for exploratory data analysis in various scientific fields, as it enhances the ability to visualize and interpret complex high-dimensional datasets. This method can be particularly useful in applications involving biological data, hyperspectral imaging, and any domain where preserving local density and neighborhood relationships is crucial.
Towards Continuous-time Causal Foundation Models
Time Series
- Introduces a continuity criterion for continuous-time causal priors based on trajectory-law invariance.
- Develops a three-tier taxonomy for categorizing causal priors in time series analysis.
- Demonstrates that fine-grid integration outperforms naive integration in empirical tests.
- Proposes a construction for continuous-time causal models using OU processes and random DAGs.
Read more
Towards Continuous-time Causal Foundation Models
Summary
This paper addresses the limitations of discrete-time causal Prior-data Fitted Networks (PFNs) in time series analysis by proposing a framework for continuous-time causal foundation models. The authors introduce a precise continuity criterion that requires the joint law of a sampled trajectory to be invariant to the observation schedule. They present a three-tier taxonomy of causal priors: discrete, naive observation-grid integration, and fine-grid integration with decoupled observation. The top tier is realized through a construction using Ornstein–Uhlenbeck (OU) processes or small-Multilayer Perceptron (MLP) nonlinear drifts on random directed acyclic graphs (DAGs) with various types of interventions. An empirical evaluation demonstrates that fine-grid integration significantly outperforms naive integration across multiple scenarios, confirming the effectiveness of their proposed continuous-time framework. The authors also release a preliminary zero-shot protocol for real-world applications in pharmacokinetics and physical systems, although detailed results from these applications are deferred to an appendix.
Methodology
The authors propose a three-tier taxonomy for continuous-time causal priors and develop a construction that implements the top tier using Ornstein–Uhlenbeck processes or small-MLP nonlinear drifts on random DAGs. They conduct a 2x2 ablation study comparing encoder and integrator configurations on both linear and nonlinear priors, evaluating performance across different discretizations.
Results
The empirical evaluation shows that fine-grid integration consistently outperforms naive integration in all tested scenarios, with a significant p-value indicating strong statistical significance. The performance gap increases as the evaluation grid is refined, highlighting the advantages of the proposed continuous-time approach.
Implications
The proposed framework has potential applications in fields requiring accurate modeling of time-dependent causal relationships, such as pharmacokinetics, healthcare analytics, and any domain involving irregularly sampled time series data. It opens avenues for more robust causal inference in continuous time, which could enhance predictive modeling and decision-making processes.
PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation
Generative Models
Time Series
- PrismFlow addresses mode collapse in time-series generation by using a bank of Koopman-inspired experts.
- The method employs a Winner-Take-All training objective to promote expert specialization and reduce averaging effects.
- PrismFlow achieves state-of-the-art performance with significant improvements in key evaluation metrics.
- The approach is robust in low-data settings and effective for various time-series tasks, including forecasting and imputation.
Read more
PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation
Summary
The paper introduces PrismFlow, a novel method for generating high-quality time-series data that addresses the challenges posed by multimodal patterns and multiscale dynamics in real-world signals. Traditional Flow Matching (FM) methods often rely on a single global vector-field estimator, which can lead to oversmoothing and poor mode coverage due to the averaging of incompatible dynamics. PrismFlow mitigates this by employing a set of Koopman-inspired dynamical experts that learn residual corrections in a latent space, allowing for local nonlinear temporal evolution to be approximated by linear transitions. The authors propose a confidence-aware Winner-Take-All (WTA) objective that encourages specialization among experts, ensuring that only the most relevant expert updates its parameters for each sample. This approach preserves the stability of the global transport field while enabling the recovery of fine-grained temporal structures. Empirical evaluations demonstrate that PrismFlow outperforms standard FM methods, achieving significant improvements in metrics such as Context-FID and Discriminative Score, while remaining effective in low-data scenarios and applicable for forecasting and imputation tasks.
Methodology
PrismFlow utilizes a bank of Koopman-inspired dynamical experts that learn residual corrections to a global transport field. The training employs a confidence-aware Winner-Take-All (WTA) objective, allowing for competitive selection of experts based on their alignment with the current sample, which promotes specialization and reduces regression-to-the-mean behavior.
Results
PrismFlow demonstrated a 15.6% improvement in Context-FID and a 38.6% increase in Discriminative Score compared to standard FM methods. The method effectively recovers diverse modes in time-series generation tasks and maintains robustness in low-data settings.
Implications
The findings suggest that PrismFlow can significantly enhance the generation of time-series data in various fields such as healthcare, finance, and environmental monitoring, where high-fidelity signal synthesis is crucial. Its ability to handle multimodal dynamics and low-data scenarios makes it a valuable tool for real-world applications.
FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction
Interpretability
- FlagGAM provides a rule-defined basis framework for GAM-style tabular prediction.
- It extends rule construction to handle both numerical and categorical features across classification and regression tasks.
- The framework retains a sparse rule-basis matrix, allowing for feature-specific weighting and flexible prediction heads.
- FlagGAM demonstrates competitive performance against modern additive models and tree-based methods, especially under challenging data conditions.
Read more
FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction
Summary
The paper introduces FlagGAM, a novel framework for explainable tabular prediction that emphasizes accuracy, transparency, and robustness in high-stakes domains. FlagGAM separates the construction of feature-level rules from the prediction process, utilizing a Flag Core Module that transforms numerical and categorical variables into human-readable univariate bases. These bases include threshold flags, category-level flags, tail-deviation bases, and categorical step functions. The framework employs a default additive head to combine these bases into a restricted Generalized Additive Model (GAM) predictor. Unlike traditional models that reduce triggered rules to compact summaries, FlagGAM retains a sparse rule-basis matrix that supports mixed-type classification and regression, feature-specific weighting, and flexible prediction heads. The authors demonstrate that FlagGAM performs comparably to Explainable Boosting Machines (EBM) in transparent additive mode, significantly outperforms ridge regression in mixed-type regression tasks, and exhibits lower AUROC degradation under missing and noisy data conditions. The flexible heads further enhance accuracy, approaching the performance of strong tree-based models, while maintaining interpretability through a rule-basis representation followed by a nonlinear predictor. Overall, FlagGAM offers a practical solution for tabular settings that require competitive accuracy, clear communication of rules, and resilience to imperfect inputs.
Methodology
FlagGAM employs a Flag Core Module to convert raw variables into univariate basis functions, which are then combined using a default additive head to form a restricted GAM-style predictor. The framework includes training-only, within-feature false discovery rate-controlled cutoff screening and rule-level handling of missing values, ensuring that all rules are derived from training data.
Results
FlagGAM was evaluated on various benchmarks, including clinical, credit-risk, census-income, and housing datasets. The results indicate that it closely matches the performance of modern additive models like EBM, significantly outperforms ridge regression in mixed-type regression, and shows reduced AUROC degradation under conditions of missing and noisy data. The flexible prediction heads further enhance accuracy, approaching tree-based model performance.
Implications
FlagGAM has significant implications for fields requiring explainable AI, such as healthcare and finance, where transparent and interpretable models are crucial for decision-making. Its ability to handle mixed-type data and provide clear rules makes it a valuable tool for practitioners in high-stakes environments.
Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing
NLP
Large Language Models
Efficient ML
- Introduces a unified framework for causal linear token mixing that generalizes existing architectures.
- Explores the trade-offs between computational complexity and expressive power in token mixers.
- Constructs token mixers with varying complexities, enhancing expressivity while managing runtime.
- Empirical validation on synthetic benchmarks and language modeling tasks supports theoretical claims.
Read more
Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing
Summary
This paper presents a unified framework for token mixing in language models, focusing on the trade-off between computational complexity and expressivity. The authors introduce a structured approach that separates the direct influence of inputs on outputs from the recurrent propagation of information through past outputs. This framework encompasses various architectures, including attention mechanisms and state-space models, while allowing for higher-order recurrences that depend on multiple past states. By designing new recurrence patterns, the authors achieve a controlled range of complexities, from O(n) to O(n^2), providing theoretical insights into their expressivity. Empirical validation is conducted on synthetic tasks and language modeling, demonstrating the effectiveness of the proposed token mixers. The results offer a comprehensive toolkit for understanding and designing efficient token mixing mechanisms across different model families.
Methodology
The authors develop a general framework for token mixing that decomposes causal linear mixers into two components: direct input influence and recurrent propagation. They design structured recurrence patterns that allow for higher-order dependencies, enabling a controlled exploration of complexity and expressivity. The methodology includes theoretical analysis and empirical testing on various tasks.
Results
The proposed token mixers demonstrate a range of complexities and expressivities, achieving performance improvements on synthetic benchmarks and language modeling tasks. The empirical results validate the theoretical insights regarding the trade-offs between complexity and expressivity, confirming the effectiveness of the structured approach.
Implications
The findings suggest that by strategically designing token mixing mechanisms, it is possible to enhance the performance of language models while managing computational resources. This work could influence future research in model architecture design, particularly in optimizing for long-range dependencies and efficiency in large-scale language models.
Retriever Portfolios: A Principled Approach to Adaptive RAG
NLP
Large Language Models
Optimization
- Introduces retriever portfolio optimization to enhance RAG systems by selecting diverse retrievers for heterogeneous queries.
- Formalizes an expected best-of-k objective to evaluate retriever portfolios, ensuring coverage of different query types.
- Demonstrates that fixed portfolios can achieve comparable or better accuracy with lower latency than adaptive hyperparameter tuning methods.
- Empirical results show significant improvements in retrieval recall and answer accuracy across multiple QA benchmarks.
Read more
Retriever Portfolios: A Principled Approach to Adaptive RAG
Summary
This paper addresses the limitations of traditional retrieval-augmented generation (RAG) systems, which typically rely on a single retriever and a fixed set of hyperparameters, leading to suboptimal performance across diverse query types. The authors propose a novel method called retriever portfolio optimization, which automatically selects a small, diverse subset of retrievers from a larger pool to better cover the heterogeneous distribution of queries. They formalize this approach using an expected best-of-k objective, allowing for efficient portfolio construction with near-optimal guarantees. The proposed method includes a pipeline that learns a static portfolio of complementary retrievers and employs a lightweight router to dynamically select the best retriever for each query. Empirical evaluations on various question-answering benchmarks demonstrate that the learned portfolios significantly outperform both single-retriever and naive multi-retriever baselines in terms of retrieval metrics and answer quality, while also reducing latency and token costs compared to inference-time hyperparameter tuning methods.
Methodology
The authors develop a pipeline that learns a static portfolio of retrievers offline, using an expected best-of-k objective to select a diverse subset of retrievers from a larger pool. A lightweight router is trained to dynamically select the most suitable retriever for each query, thus avoiding costly inference-time hyperparameter tuning.
Results
The proposed method consistently outperforms single-retriever baselines and other adaptive methods like Vendi-RAG across various QA benchmarks, achieving better retrieval recall and answer accuracy while significantly reducing latency and token usage.
Implications
This work has potential applications in improving the efficiency and effectiveness of RAG systems in various domains, including open-domain question answering and knowledge-intensive dialogue systems, by enabling more tailored retrieval strategies that adapt to the complexity of user queries.
Scaling Higher-Order Graph Learning with Maximal Clique Complexes
Graph Learning
- Introduction of sCWL and fCWL tests that preserve expressivity while improving scalability.
- Development of the maximal clique complex for efficient higher-order graph representation.
- CliqueWalk method for sampling maximal cliques, enabling linear scaling with graph size.
- Competitive performance on classification benchmarks compared to existing GNNs.
Read more
Scaling Higher-Order Graph Learning with Maximal Clique Complexes
Summary
This paper addresses the limitations of traditional graph neural networks (GNNs) that primarily model pairwise interactions by introducing a scalable framework for higher-order graph learning. The authors propose simplified and factored cellular Weisfeiler–Leman tests (sCWL and fCWL) that maintain the expressivity of the original CWL test while enhancing computational efficiency. They introduce the maximal clique complex, which encodes only the maximal cliques of a graph, thereby reducing time and memory complexity. To avoid the computational burden of explicit clique enumeration, the authors develop CliqueWalk, a biased random walk that efficiently samples maximal cliques, allowing for linear scaling with graph size. The proposed methods demonstrate strong empirical performance on node and graph classification tasks, achieving competitive results compared to existing GNNs while significantly improving scalability and efficiency.
Methodology
The authors introduce simplified and factored versions of the cellular Weisfeiler–Leman tests (sCWL and fCWL) and their corresponding neural architectures (sCWNs and fCWNs). They propose the maximal clique complex to encode only maximal cliques and develop CliqueWalk, a biased random walk algorithm for efficient sampling of these cliques. This approach allows for scalable higher-order graph learning without the need for exhaustive clique enumeration.
Results
The proposed methods, sCWNs and fCWNs, demonstrate competitive performance on various node and graph classification benchmarks, matching or exceeding the performance of existing GNNs while achieving better scalability and efficiency. The empirical results validate the effectiveness of the introduced techniques in handling larger graphs.
Implications
The findings suggest that the proposed framework can be applied to various domains where higher-order interactions are significant, such as social network analysis, molecular property prediction, and other complex relational data tasks. The scalability of the methods opens up possibilities for their use in large-scale graph learning applications.
The Long-Term Effects of Data Selection in LLM Fine-Tuning
NLP
Large Language Models
Theory
- Introduces the concept of myopic selection in LLM fine-tuning, highlighting the trade-off between short-term gains and long-term adaptability.
- Develops a unified multi-stage evaluation protocol for comparing various data selection strategies.
- Demonstrates through experiments that short-term effective selectors can hinder future learning and increase forgetting.
- Proposes the Long-Horizon Aware Selection (LHAS) objective to improve long-term adaptation and robustness.
Read more
The Long-Term Effects of Data Selection in LLM Fine-Tuning
Summary
This paper investigates the long-term effects of data selection strategies in the fine-tuning of large language models (LLMs). While traditional methods focus on immediate task performance, this study emphasizes the importance of evaluating data selection based on future adaptability, retention, and robustness. The authors introduce the concept of myopic selection, where short-term effective selectors may hinder long-term learning and increase forgetting. They propose a unified multi-stage evaluation protocol to compare various selection strategies, including random, loss-based, gradient-based, diversity-based, and quality-based methods. Through controlled experiments, the authors demonstrate that short-term selectors can lead to rank reversal, improving current performance but slowing down future learning. To address this, they introduce the Long-Horizon Aware Selection (LHAS) objective, which incorporates coverage and anti-concentration terms to mitigate the adverse effects of myopic selection. The findings suggest that data selection should be viewed as a critical training intervention that shapes the model's learning trajectory rather than merely a local efficiency mechanism.
Methodology
The authors conducted controlled experiments using a unified multi-stage protocol to evaluate different data selection strategies. They analyzed the impact of these strategies on future adaptation speed, forgetting, capability imbalance, and out-of-distribution robustness. The study included a theoretical analysis to explain why selectors with equal current-stage gains could differ in future adaptation costs.
Results
The experiments revealed that short-term selectors could lead to rank reversal, where they improve immediate task performance but slow down subsequent learning and increase forgetting. The introduction of the LHAS objective showed potential in reducing the long-term side effects of myopic selection, enhancing future adaptability and robustness.
Implications
The findings suggest that practitioners should reconsider how they evaluate data selection strategies in LLM fine-tuning, focusing on long-term effects rather than just immediate performance. This could lead to more robust and adaptable models in real-world applications.
Fraud Type Decomposition and the Observation-Mechanism Taxonomy: Class-Specific Detection Limits in Payment Networks
Theory
- Fraud detection models often treat diverse fraud types as a single entity, leading to statistical inefficiencies.
- The paper identifies five distinct observation-mechanism classes for fraud types, each requiring specific estimation strategies.
- Class-specific estimation significantly outperforms pooled estimation methods, as quantified by a Jensen penalty.
- The study provides theoretical constraints for fraud detection, emphasizing the need for tailored detection strategies.
Read more
Fraud Type Decomposition and the Observation-Mechanism Taxonomy: Class-Specific Detection Limits in Payment Networks
Summary
This paper addresses the inefficiencies in fraud detection models within payment networks, which typically treat various fraud types as a homogeneous category. The author argues that fraud encompasses diverse phenomena, each with distinct observation pipelines that affect label generation. The study classifies over 30 fraud types into five observation-mechanism classes based on their structural censorship pipelines. This classification is both minimal and complete, ensuring that each fraud type maps to one class without overlap. The author demonstrates that estimating fraud rates separately by class significantly outperforms pooled estimations, with the efficiency gap quantified through a closed-form Jensen penalty. The paper also derives theoretical constraints for each class, revealing that detection strategies must be tailored to the unique characteristics of each fraud type. The findings highlight the limitations of existing fraud detection frameworks and propose a more nuanced approach to improve detection accuracy.
Methodology
The author employs a theoretical framework to classify fraud types based on their observation mechanisms, deriving structural signatures for each type. The study uses mathematical proofs to establish the efficiency of class-specific estimations over pooled estimations and quantifies the efficiency gap.
Results
The classification of fraud types into five observation-mechanism classes is shown to be both minimal and complete. The study proves that class-specific estimation dominates pooled estimation, with a quantifiable efficiency gap. The theoretical constraints derived for each class reveal the unique challenges in detecting different types of fraud.
Implications
The findings suggest that payment networks should adopt class-specific fraud detection strategies to enhance accuracy and efficiency. This approach could lead to better resource allocation in fraud prevention efforts and improve overall security in payment systems.
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
Robotics
Theory
Multimodal
- BOKBO is the first conformal abstention layer for K-sample VLA inference, providing safety guarantees.
- The method achieves high reliability and task success rates while addressing silent failures in traditional K-sampling.
- A critical analysis reveals that existing nonconformity scores fail to measure policy uncertainty accurately.
- The introduction of a learned violation predictor improves safety calibration significantly.
Read more
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
Summary
The paper introduces BOKBO, a novel conformal abstention layer designed for K-sample vision-language-action (VLA) policies. Traditional K-sample inference methods assume at least one candidate action is safe, which often leads to silent failures when all candidates are unsafe. BOKBO addresses this by providing finite-sample distribution-free upper bounds on the unsafe execution rate among non-abstained decisions. The authors demonstrate that BOKBO can maintain a conditional risk control (CRC) bound with 86% reliability across bootstrap splits, achieving 78% coverage and 70% net task success on specific benchmarks. They also reveal a critical flaw in existing nonconformity scores used in K-sampling, which correlate more with perturbation than with actual policy uncertainty. The paper proposes a learned violation predictor as a more reliable alternative. Additionally, it highlights the importance of per-task expert force calibration to mitigate inflated violation rates in safety evaluations. Overall, BOKBO represents a significant advancement in ensuring safety in K-sample VLA inference.
Methodology
The authors developed BOKBO by applying conformal risk control (CRC) to the K-sample VLA setting. They tested various nonconformity scores, including base-policy confidence, K-sample disagreement, and a learned violation predictor, to determine their effectiveness in predicting safety violations. The methodology also involved extensive evaluations on benchmarks like LIBERO, assessing the performance of BOKBO against traditional K-sample methods.
Results
BOKBO demonstrated a conditional CRC bound holding on 86% of bootstrap splits, achieving 78% coverage and 70% net task success on the libero_object_temp_x0.1 benchmark. The per-task variant improved the minimum conditional hold from 0.71 to 0.93. The analysis revealed that existing free signals correlated strongly with perturbation rather than uncertainty, while the learned predictor provided better calibration. Additionally, the paper identified that globally-set thresholds inflated violation rates by 5x, which was resolved through expert-calibrated thresholds.
Implications
BOKBO has significant implications for enhancing the safety and reliability of VLA policies in robotics and other applications where K-sampling is employed. By ensuring that unsafe actions can be identified and abstained from, BOKBO can improve the robustness of autonomous systems in uncertain environments. The findings also suggest a need for reevaluating existing methodologies in safety evaluations and the calibration of action thresholds.
Learning Multi-Agent Coordination via Sheaf-ADMM
Optimization
Graph Learning
Robotics
- Introduces Sheaf-ADMM, a framework for multi-agent coordination using differentiable optimization.
- Utilizes cellular sheaf theory to define inter-agent constraints for heterogeneous global consensus.
- Demonstrates improved robustness and performance in tasks like MNIST classification and Sudoku solving.
- Enables distinct analysis of coordination dynamics through the separation of state variables.
Read more
Learning Multi-Agent Coordination via Sheaf-ADMM
Summary
This paper presents Sheaf-ADMM, a differentiable optimization framework designed for multi-agent coordination. The framework allows agents to process overlapping local views of an input, each solving a convex subproblem parameterized by a neural encoder. Coordination among agents is achieved through the Alternating Direction Method of Multipliers (ADMM), with inter-agent constraints defined by a cellular sheaf, which specifies aspects of neighboring solutions that must agree. This approach enables heterogeneous forms of global consensus. The authors demonstrate the effectiveness of Sheaf-ADMM on various tasks, including maze pathfinding, image classification, and Sudoku. The results show that agents, despite having limited local views, can learn to coordinate effectively to produce correct global outputs. Notably, the method improves robustness to distribution shifts in MNIST classification compared to standard CNNs and achieves higher solve rates in Sudoku than parameter-matched message-passing neural network (MPNN) baselines. The structure of ADMM also allows for distinct analysis of primal, consensus, and dual state variables, providing insights into coordination dynamics that are not available in traditional message-passing architectures.
Methodology
The methodology involves formulating coordination as a constrained optimization problem solved using ADMM. Each agent independently solves local subproblems, followed by a consensus step that projects their proposals towards global consistency. The framework is differentiable, allowing for backpropagation through the optimization trajectory, and incorporates neural network parameterizations for agent subproblems.
Results
The evaluation of Sheaf-ADMM on tasks such as maze pathfinding, image classification (MNIST), and Sudoku shows that agents can effectively coordinate despite limited local views. The method outperforms standard CNNs in robustness to distribution shifts and achieves significantly higher solve rates in Sudoku compared to MPNN baselines.
Implications
The Sheaf-ADMM framework has potential applications in areas requiring multi-agent coordination, such as robotics, distributed systems, and collaborative AI. Its ability to learn coordination dynamics and improve robustness to local view limitations could enhance performance in complex, real-world tasks.