AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
59 papers today · 8h update frequency · 7 days of history
Reinforcement Learning for Neural Model Editing
Reinforcement Learning
Computer Vision
NLP
- Introduces a reinforcement learning framework for neural model editing.
- Develops two environments: MaskWorld and ShiftWorld for different weight modification strategies.
- Reduces forget-set accuracy to nearly 0% while keeping retain-set accuracy above 90% in machine unlearning, and improves bias-related performance by more than 5% in bias mitigation.
- Demonstrates the potential of RL to automate model editing without specialized algorithms.
Summary
This paper presents a novel framework that formulates the task of editing pretrained neural networks as a reinforcement learning (RL) problem. The proposed approach allows agents to modify neural models based on reward feedback, thus eliminating the need for specialized algorithms tailored to specific editing objectives. The authors introduce two distinct environments for the RL agents: MaskWorld, where weights are scaled multiplicatively, and ShiftWorld, where weights are updated additively. The reward function is designed to balance a utility-preservation objective with a task-specific editing goal, enabling agents to learn targeted modifications while maintaining overall model performance. The framework is evaluated on two critical tasks: bias mitigation in text classification and machine unlearning in image classification. The results demonstrate that the learned policies effectively reduce forget set accuracy to nearly 0% while preserving over 90% accuracy on the retain set in the unlearning task. Additionally, in the bias mitigation scenario, the policies improve bias-related performance by more than 5% without compromising general classification utility. These findings suggest that RL can serve as a general framework for learning neural model editing algorithms, potentially streamlining the editing process across various applications.
Methodology
The authors create RL environments where agents interact with neural network parameters by modifying weights through either multiplicative scaling (MaskWorld) or additive updates (ShiftWorld). The agents receive rewards based on the performance of the modified models, which are designed to reflect both utility preservation and task-specific objectives. The agents learn policies for weight updates through trial and error, optimizing their actions based on the feedback received.
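The two action types and the combined reward can be illustrated with a minimal Python sketch. The function names, the flat list of weights, and the trade-off coefficient `lam` are assumptions made for illustration; the summary does not specify the paper's actual state, action, or reward definitions.

```python
from typing import List


def mask_world_step(weights: List[float], scales: List[float]) -> List[float]:
    """MaskWorld-style action: scale each weight multiplicatively."""
    return [w * s for w, s in zip(weights, scales)]


def shift_world_step(weights: List[float], deltas: List[float]) -> List[float]:
    """ShiftWorld-style action: update each weight additively."""
    return [w + d for w, d in zip(weights, deltas)]


def reward(utility_score: float, edit_score: float, lam: float = 0.5) -> float:
    """Hypothetical reward: weighted mix of utility preservation and the editing goal."""
    return (1.0 - lam) * utility_score + lam * edit_score


weights = [0.5, -1.0, 2.0]
masked = mask_world_step(weights, [1.0, 0.0, 0.5])     # a scale of 0 removes the second weight
shifted = shift_world_step(weights, [0.1, 0.0, -0.5])  # additive nudges

# Unlearning-style example: retain accuracy as utility, (1 - forget accuracy) as edit score.
r = reward(utility_score=0.90, edit_score=1.0 - 0.02)
```

In a full environment the two scores would come from evaluating the edited model on held-out data after each action; here they are passed in directly to keep the sketch self-contained.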
Results
The learned policies achieved nearly 0% accuracy on the forget set while maintaining over 90% accuracy on the retain set in the machine unlearning task. In the bias mitigation task, the policies improved bias-related performance by over 5%, demonstrating the effectiveness of the RL approach in achieving desired editing outcomes without degrading model utility.
Implications
This framework could significantly simplify the process of neural model editing across various applications, such as ensuring fairness in AI systems and enabling efficient model updates without retraining. It opens avenues for further research into automated model modifications and could lead to more robust and adaptable AI systems.
The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter
Theory
- AI winters were influenced by formal mathematical barriers, not just engineering failures.
- Key limitations include representational capacity, optimization hardness, and statistical learnability.
- The paper synthesizes existing mathematical theories to provide a unified interpretation of AI winters.
- Later breakthroughs in mathematics and algorithms were crucial for overcoming the identified barriers.
Summary
This paper explores the historical phenomenon of AI winters, periods marked by reduced funding and confidence in artificial intelligence research, through a mathematical lens. The authors argue that these downturns were not solely due to engineering failures or commercial disappointments but were also influenced by formal mathematical barriers. They identify key limitations in representation, optimization, computational complexity, statistical learnability, and high-dimensional approximation that contributed to the fragility of AI paradigms during these periods. The paper synthesizes existing mathematical theories, such as the perceptron impossibility results, NP-hardness of neural network training, and vanishing-gradient problems, to illustrate how these barriers aligned with the disappointments experienced in early AI research. The authors emphasize that while these mathematical challenges did not directly cause the AI winters, they played a significant role in the vulnerabilities of the leading paradigms. The paper concludes by discussing how later breakthroughs in mathematics and algorithms helped to mitigate these barriers, leading to the resurgence of AI research.
Methodology
The authors synthesize prior work on historical AI paradigms, integrating established mathematical theories and results to identify formal barriers that contributed to the fragility of AI research during the winters. They draw on classical results in learning theory and computational complexity to support their claims.
Results
The paper identifies several formal barriers that contributed to the AI winters, including representational limitations of early models, computational hardness of training neural networks, and statistical challenges in generalization. It also discusses how subsequent mathematical advancements and algorithmic innovations helped to address these issues, leading to renewed progress in AI.
Implications
The findings suggest that understanding the mathematical foundations of AI can inform future research directions and help avoid similar downturns in the field. By recognizing the importance of overcoming formal barriers, researchers can develop more robust AI systems and methodologies.
Clustering Node Attributed Networks with Graph Neural Networks and Self Learning
Graph Learning
- Introduces DCSL-GNN, a novel unsupervised framework for clustering attributed networks.
- Utilizes self-learning and context generation across multiple rounds to improve node representation.
- Demonstrates superior performance compared to traditional clustering methods that use only network structure or node attributes.
- Performs competitively with state-of-the-art methods on real datasets, especially when cluster sizes are balanced.
Summary
This paper addresses the problem of clustering nodes in attributed networks, where nodes possess informative attributes alongside their connections. The authors propose a novel framework called Dynamic Context Self-Learning Graph Neural Network (DCSL-GNN), which operates in rounds of self-learning in a fully unsupervised manner. In each round, a Graph Neural Network (GNN) generates node representations that are subsequently used for clustering. The clustering results influence the context graph used to generate new node representations in the next round. This iterative process allows the framework to leverage both network edges and node attributes effectively, improving clustering performance. The empirical results demonstrate that DCSL-GNN outperforms traditional methods that rely solely on either network structure or node attributes, particularly when both sources of information are noisy. Additionally, the framework consistently outperforms a single long training round typical of classic GNN approaches. On real datasets, DCSL-GNN shows competitive performance against state-of-the-art methods, especially when cluster sizes are balanced.
Methodology
The DCSL-GNN framework operates in iterative rounds where a GNN generates node representations based on a context graph that evolves with each round. The context graph is influenced by the clustering results from the previous round, allowing for dynamic adaptation and improved representation learning. The final node representations are then clustered using a classic k-means algorithm.
Results
On synthetic datasets, DCSL-GNN outperformed traditional algorithms that use only network structure or only node attributes, particularly when both sources of information were noisy. On real datasets, it achieved competitive results compared to state-of-the-art clustering methods, especially when cluster sizes were balanced.
Implications
The findings suggest that integrating node attributes with network structure can significantly enhance clustering performance in attributed networks. This has potential applications in various fields such as social network analysis, biological network clustering, and information retrieval, where both node features and relationships are crucial for understanding underlying structures.
Understanding helpfulness and harmless tension in reward models
NLP
Large Language Models
Reinforcement Learning
- Mixed-objective reward models underperform compared to single-objective models due to alignment tension.
- Neurons associated with helpfulness and harmlessness objectives exhibit interference, affecting model performance.
- A significant proportion of neurons are shared between the two objectives, contributing to alignment tension.
- The study provides insights into the internal mechanisms of reward models and their representation of alignment objectives.
Summary
This paper investigates the alignment tension in reward models (RMs) used in reinforcement learning from human feedback (RLHF), focusing on the conflicting objectives of helpfulness and harmlessness. The authors explore how RMs trained under different alignment regimes—helpfulness-only, harmlessness-only, and mixed-objective—perform on various tasks. They find that mixed-objective models often underperform compared to single-objective models, indicating interference between the objectives. Through activation-based methods, the study identifies neurons associated with each objective and examines their roles via targeted ablations. The results reveal that many neurons are shared between helpfulness and harmlessness, leading to alignment tension and negatively impacting model behavior. The findings highlight the need for future work on disentangled and controllable alignment methods to address the challenges of multi-objective alignment in RMs.
Methodology
The authors trained reward models under three alignment regimes (helpfulness-only, harmlessness-only, and mixed-objective) and evaluated their performance on diverse tasks. They employed activation-based methods to identify neurons associated with each objective and conducted targeted ablation studies to analyze their functional roles.
Results
The study found that mixed-objective models consistently underperformed compared to single-objective models across various tasks. Additionally, many neurons were identified as shared between helpfulness and harmlessness, which negatively impacted the performance of mixed-objective models. The research demonstrated that alignment tension arises from the interactions between these shared neuron sets.
Implications
The findings suggest that multi-objective alignment in reward models is not just a behavioral trade-off but also a representational interference problem. This has significant implications for the design of future reward models, emphasizing the need for approaches that can effectively disentangle and control alignment objectives.
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
NLP
Large Language Models
Interpretability
- Introduces a causal framework for analyzing CoT reasoning in language models.
- Identifies a commitment boundary where models stabilize their answers.
- Demonstrates that reasoning beyond the commitment boundary is often redundant.
- Uses lightweight attention probes to predict answer-formation stages accurately.
Summary
This paper investigates the causal influence of individual steps in Chain-of-Thought (CoT) reasoning within large language models. The authors introduce a causal framework to analyze CoT reasoning traces, focusing on the identification of a 'commitment boundary'—a point where the model transitions from transient guesses to a stable, high-confidence answer. They find that this transition typically occurs after a single reasoning step, leading to a significant shift in answer probabilities. Beyond this boundary, additional reasoning steps often do not alter the final answer, indicating the presence of epiphenomenal reasoning. The study employs attention probes to predict answer-formation stages from model activations, achieving high accuracy across various reasoning tasks. The authors demonstrate that early-exiting at the commitment boundary can reduce the length of CoTs by up to 55% without significantly impacting model performance, thus optimizing the reasoning process in large models.
Methodology
The authors utilize a step-level causal framework to measure the impact of each CoT step on the final answer. They employ early exit strategies to identify the commitment boundary and lightweight attention probes to decode answer-formation stages from model activations. This approach allows for monitoring the causal shifts in answer probabilities throughout the reasoning process.
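As a rough illustration of how a commitment boundary could be read off early-exit measurements, the sketch below assumes that `final_answer_probs[i]` is the probability the model assigns to its eventual final answer when forced to answer after reasoning step `i`. The 0.9 threshold and the "stays above the threshold from here on" rule are illustrative assumptions, not the paper's definition.

```python
from typing import List, Optional


def commitment_boundary(final_answer_probs: List[float], threshold: float = 0.9) -> Optional[int]:
    """Return the first step index from which the probability stays >= threshold, else None."""
    boundary = None
    for i, p in enumerate(final_answer_probs):
        if p >= threshold:
            if boundary is None:
                boundary = i  # candidate boundary: confidence just crossed the threshold
        else:
            boundary = None  # confidence dropped again, so the model had not committed
    return boundary


def fraction_of_steps_saved(num_steps: int, boundary: int) -> float:
    """Share of reasoning steps that come after the boundary step."""
    return (num_steps - (boundary + 1)) / num_steps


probs = [0.20, 0.35, 0.30, 0.95, 0.97, 0.96, 0.99, 0.98]  # made-up trace for illustration
b = commitment_boundary(probs)
saved = fraction_of_steps_saved(len(probs), b)
```

For this made-up trace the boundary falls at step index 3, so exiting there would skip the last four of eight steps.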
Results
The study finds that the commitment boundary typically occurs after a single pivotal reasoning step, leading to a large shift in final-answer probabilities. The use of attention probes allows for reliable prediction of answer-formation stages, achieving high precision in both in-distribution and out-of-distribution tasks. Early-exiting at the commitment boundary results in an average reduction of 55% in reasoning length with negligible impact on accuracy.
Implications
The findings suggest that optimizing reasoning processes in large language models can lead to more efficient inference without sacrificing performance. This has potential applications in improving the efficiency of AI systems that rely on complex reasoning tasks, enhancing their usability in real-world scenarios.
When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals
Interpretability
Large Language Models
- Introduces a routing-ablation framework for analyzing Block AttnRes models.
- Demonstrates that explicit depth routing alone does not guarantee mechanistic interpretation.
- Identifies three localized causal motifs in the trained Block AttnRes model.
- Finds a sharp dissociation between routing mass and causal importance.
Summary
This paper investigates the interpretability of routing mechanisms in Block Attention Residuals (Block AttnRes), a model architecture that allows for explicit cross-layer information flow in Transformers. The study compares two 0.6B Block AttnRes checkpoints: a vanilla Qwen3 model wrapped with a deterministic recency-bias schedule and a Block AttnRes Qwen3 trained from scratch. The research aims to determine whether the explicit exposure of routing weights can lead to mechanistic interpretation of the model's behavior. The findings reveal that while the wrapped baseline produces routing weights that align with the recency-bias schedule, the trained AttnRes model exhibits distinct causal motifs. Notably, the study uncovers a dissociation between average routing mass and causal importance, indicating that high routing mass does not necessarily correlate with significant causal contributions. The paper concludes that while architectural exposure of routing is necessary for interpretability, it is not sufficient, and structured depth routing only emerges when routing is integrated into the training process.
Methodology
The study employs a routing-ablation framework to analyze two Block AttnRes checkpoints under identical routing-ablation interventions. It contrasts a vanilla Qwen3 model wrapped with a deterministic recency-bias schedule against a Block AttnRes model trained from scratch, examining the routing weights produced by both models during probing.
Results
The results indicate that the baseline model's routing weights reproduce the recency-bias schedule's predictions, while the trained AttnRes model reveals three distinct causal motifs. Additionally, the research highlights a significant dissociation between average routing mass and causal contributions, with some source families showing high routing mass but no causal impact.
Implications
The findings suggest that while architectural changes can enhance interpretability, they must be coupled with appropriate training strategies to yield meaningful insights into model mechanisms. This has implications for the design of interpretable AI systems, particularly in understanding complex models like Transformers.
Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation
Optimization
- ML-based behavior models improve the realism of traffic microsimulation.
- The study demonstrates that ML-generated conflicts yield better crash predictions compared to rule-based models.
- Current ML models struggle to produce realistic crash scenarios despite accurately simulating conflicts.
- Extreme Value Theory is effectively used to model crash frequency from simulated conflicts.
Summary
This paper explores the integration of machine learning (ML) into traffic microsimulation to enhance the prediction of crash frequency based on simulated traffic conflicts. Traditional microsimulation approaches utilize simplified rule-based behavior models that, while effective in replicating traffic flow, often fail to accurately reflect conflict dynamics, leading to poor crash prediction accuracy. The authors conducted microsimulation for five real-world signalized intersections in Leeds, UK, employing both a standard rule-based model and an advanced ML model trained on large-scale trajectory datasets. They analyzed simulated vehicle trajectories using a Time-to-Collision metric to identify conflicts, which were then modeled using Extreme Value Theory (EVT) to predict crash frequency. The findings indicate that the ML model produced crash predictions that aligned closely with actual crash data, in contrast to the rule-based model, which lacked meaningful predictive capability. However, the study also revealed that while ML models can effectively simulate conflicts, they do not yet generate realistic crash scenarios. The results suggest that ML-based behavior models hold promise for improving crash frequency predictions without requiring location-specific calibration, paving the way for future advancements in traffic microsimulation methodologies.
Methodology
The authors conducted traffic microsimulation for five signalized intersections using both a standard rule-based model and a state-of-the-art ML model. They analyzed vehicle trajectories with a Time-to-Collision metric to identify conflicts, which were then modeled using Extreme Value Theory to predict crash frequency.
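To make the conflict-identification step concrete, here is a minimal sketch of Time-to-Collision (TTC) for the simplest case: a follower closing on a leader in the same lane. This one-dimensional form and the 1.5 s conflict threshold are assumptions for illustration; the summary does not state which TTC variant or threshold the study applies at intersections, and the sample values are made up.

```python
import math
from typing import Iterable, Tuple


def time_to_collision(gap_m: float, follower_speed: float, leader_speed: float) -> float:
    """TTC in seconds: remaining gap divided by closing speed (speeds in m/s)."""
    closing_speed = follower_speed - leader_speed
    if closing_speed <= 0:
        return math.inf  # the follower is not closing the gap, so no collision course
    return gap_m / closing_speed


def count_conflicts(samples: Iterable[Tuple[float, float, float]],
                    ttc_threshold_s: float = 1.5) -> int:
    """Count samples whose TTC falls below the (assumed) conflict threshold."""
    return sum(1 for gap, vf, vl in samples
               if time_to_collision(gap, vf, vl) < ttc_threshold_s)


# (gap in m, follower speed, leader speed): illustrative values only
samples = [(10.0, 15.0, 5.0), (30.0, 12.0, 10.0), (8.0, 10.0, 12.0)]
ttcs = [time_to_collision(*s) for s in samples]
n_conflicts = count_conflicts(samples)
```

The per-conflict TTC minima are the kind of extreme values that an Extreme Value Theory model would then be fitted to; that fitting step is not shown here.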
Results
The ML model's predictions of crash frequency were consistent with real-world data, while the rule-based model failed to provide meaningful predictions. The study found that ML models could realistically reproduce traffic conflicts but were not yet capable of generating realistic crash events.
Implications
The findings suggest that ML-based traffic microsimulation can enhance the predictive accuracy of crash frequency assessments, potentially leading to better-informed infrastructure designs and safety interventions. This approach may facilitate the evaluation of new road safety technologies and modifications before their real-world implementation.
WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition
Efficient ML
Time Series
- Introduction of a standardized benchmarking framework for WHAR.
- Curation of 30 datasets and 17 models to facilitate fair comparisons.
- Evaluation of performance metrics alongside deployment efficiency.
- Identification of a distributed state of the art in WHAR rather than dominance by a single architecture.
Summary
The paper addresses the comparability crisis in Wearable Human Activity Recognition (WHAR) by proposing an open-source benchmarking framework that standardizes the evaluation of diverse datasets and models. The authors curate a WHAR Datasets library that includes 30 heterogeneous datasets and a WHAR Models library that unifies 17 representative architectures under a common inference interface. This framework enables large-scale benchmarking and reproducible analysis of performance and efficiency trade-offs. The study evaluates predictive performance alongside on-device latency, peak memory usage, and model size on an Android reference device, revealing that while CNN-HAR achieves the highest mean macro-F1 score, the state of the art is distributed across multiple models. The results indicate that compact models like TinierHAR and classical Random Forests are more efficient for deployment, suggesting that future progress in WHAR should focus on optimizing deployment efficiency and adapting to domain shifts. The authors release their framework to support transparent reuse and extension, aiming to enhance comparability and facilitate further research in WHAR.
Methodology
The authors developed an open-source benchmarking framework that standardizes 30 WHAR datasets and evaluates 17 model architectures under consistent data processing and cross-subject evaluation protocols. They measured predictive performance, latency, RAM usage, and model size using an Android reference device across 4,760 training runs.
Results
The benchmarking revealed that while CNN-HAR achieved the highest mean macro-F1 score, the performance of models clustered closely, indicating a convergence near a predictive performance ceiling. Compact models and classical methods defined the Pareto frontier for deployment efficiency, while larger models incurred high costs without significant performance improvements.
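The Pareto frontier mentioned above can be computed with a simple dominance check. The sketch below uses two criteria, latency (lower is better) and macro-F1 (higher is better). The model names and numbers are placeholders invented for illustration and are not results from the benchmark.

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, latency_ms, macro_f1)


def pareto_frontier(models: List[Model]) -> List[str]:
    """Keep models that no other model beats or ties on both criteria with a strict win on one."""
    frontier = []
    for name, lat, f1 in models:
        dominated = any(
            (o_lat <= lat and o_f1 >= f1) and (o_lat < lat or o_f1 > f1)
            for _, o_lat, o_f1 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values, not measurements from the paper.
candidates = [
    ("small", 2.0, 0.80),
    ("medium", 5.0, 0.82),
    ("large", 40.0, 0.82),     # same F1 as "medium" but slower, so dominated
    ("slow_weak", 10.0, 0.70), # slower and less accurate than "small", so dominated
]
frontier = pareto_frontier(candidates)
```

The same check extends to more criteria, such as peak memory and model size, by adding fields to the tuple and to the dominance condition.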
Implications
The findings suggest that optimizing for deployment efficiency is crucial for practical applications of WHAR in real-time systems. The standardized framework can serve as a basis for future research, enabling better comparisons and advancements in the field.
Boosting Direct Preference Optimization with Penalization
NLP
Large Language Models
Optimization
- DPOP enhances DPO by incorporating a gated penalty on reference-greedy responses.
- The penalty is activated only when the current policy misclassifies the preferred response.
- Empirical results show DPOP outperforms existing methods like DPO, SimPO, and AlphaDPO.
- The method demonstrates the utility of reference-greedy responses in preference optimization.
Summary
This paper introduces Direct Preference Optimization with Penalization (DPOP), an enhancement of the existing Direct Preference Optimization (DPO) framework for offline preference optimization. DPO simplifies the alignment of large language models with human preferences by framing it as a pairwise classification task, but it has limitations as it only utilizes chosen and rejected responses from a static dataset, neglecting the potential insights from the reference model's greedy responses. DPOP addresses this by incorporating a gated penalty on the likelihood of the reference-greedy response, which is activated only when the current policy ranks the rejected response higher than the preferred one. This method effectively utilizes the reference model's output to improve the learning signal. Empirical evaluations on AlpacaEval 2.0 demonstrate that DPOP significantly outperforms DPO and its variants, achieving notable improvements in length-controlled win rates across different models. The findings suggest that leveraging reference-greedy responses can enhance offline penalization signals, thereby improving preference optimization.
Methodology
The methodology involves augmenting the DPO framework with a penalization mechanism that selectively penalizes the likelihood of responses generated by a reference model. The penalty is applied through a gated mechanism that activates only when the policy's likelihood for the rejected response exceeds that of the preferred response. Various penalty families are explored, including token-level unlikelihood and NPO-style penalties, to optimize the learning process.
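The gating idea can be sketched on top of the standard DPO loss. In the sketch below, the DPO term follows the usual pairwise formulation, while the penalty is a sequence-level unlikelihood term chosen only for illustration (the summary mentions token-level unlikelihood and NPO-style penalties without giving formulas). The function names, the weight `lam`, and the choice to gate on raw policy log-likelihoods are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss from sequence log-likelihoods of chosen (w) and rejected (l) responses."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def gated_penalized_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
                         pol_greedy: float, beta: float = 0.1, lam: float = 1.0) -> float:
    """DPO loss plus an assumed penalty on the reference-greedy response, applied only when gated.

    pol_greedy is the policy's log-likelihood of the reference model's greedy response
    and must be strictly negative for the unlikelihood term to be defined.
    """
    loss = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
    gate_open = pol_l > pol_w  # policy currently prefers the rejected response
    if gate_open:
        loss += lam * -math.log1p(-math.exp(pol_greedy))  # -log(1 - p_greedy)
    return loss


base = dpo_loss(-10.0, -12.0, -11.0, -11.0)
closed = gated_penalized_loss(-10.0, -12.0, -11.0, -11.0, pol_greedy=-3.0)  # gate closed
opened = gated_penalized_loss(-12.0, -10.0, -11.0, -11.0, pol_greedy=-3.0)  # gate open
```

When the gate is closed the loss reduces exactly to DPO, so correctly ranked pairs are unaffected by the penalty.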
Results
DPOP achieved a length-controlled win rate of 46.35 with Llama-3-8b-it, compared with 44.01 for SimPO and 41.48 for AlphaDPO. With Gemma-2-9b-it, it reached 78.22, compared with 73.08 for SimPO and 74.90 for AlphaDPO. Measured against the stronger baseline on each model, these are relative gains of about 5.3% (46.35 / 44.01 ≈ 1.053) and 4.4% (78.22 / 74.90 ≈ 1.044).
Implications
The findings suggest that incorporating reference-greedy responses can enhance the effectiveness of offline preference optimization methods. This approach could lead to more robust alignment of language models with human preferences, potentially improving their performance in various applications such as conversational agents and content generation.
CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees
Efficient ML
Interpretability
Optimization
- CLARITree combines lookahead search strategies with Cholesky updates for efficient regression tree construction.
- The algorithm maintains numerical stability and computational feasibility for continuous-feature searches.
- Empirical results show that CLARITree consistently outperforms greedy induction methods and scales better than optimal baselines.
- The method achieves a strong trade-off between runtime and accuracy, making it suitable for large-scale problems.
Summary
The paper introduces CLARITree, a novel algorithm designed for constructing interpretable piecewise linear regression trees. Traditional methods for building regression trees often rely on greedy induction, which can lead to suboptimal performance. While optimal methods exist, they are computationally expensive and not scalable for general linear regression trees. The authors propose a combination of lookahead-style search strategies and efficient rank-one Cholesky updates of the Gram matrix to enhance the performance of regression trees. This approach allows for a favorable balance between computational efficiency, predictive accuracy, and sparsity. Theoretical and empirical analyses demonstrate that CLARITree achieves near-optimal accuracy while significantly improving scalability compared to existing methods. The results indicate that CLARITree outperforms greedy baselines and approaches optimal performance on various datasets, making it a promising tool for interpretable machine learning.
Methodology
The authors developed CLARITree by integrating a lookahead-style split optimization strategy with efficient rank-one Cholesky updates to maintain regularized Gram factorizations. This allows for fast and stable evaluation of candidate splits during the tree-building process. The algorithm is designed to handle continuous features directly, improving the efficiency of the regression tree search.
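For readers unfamiliar with rank-one Cholesky updates, the sketch below shows the textbook algorithm in plain Python: given a lower-triangular factor `L` with `A = L Lᵀ`, it returns the factor of `A + x xᵀ` in O(n²) time instead of refactorizing in O(n³). How CLARITree applies such updates while scanning split candidates is not detailed in the summary, so this illustrates the building block only, not the authors' implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def chol_rank_one_update(L: Matrix, x: List[float]) -> Matrix:
    """Given lower-triangular L with A = L L^T, return the Cholesky factor of A + x x^T."""
    n = len(x)
    L = [row[:] for row in L]  # work on copies so the inputs are left untouched
    x = x[:]
    for k in range(n):
        r = math.hypot(L[k][k], x[k])  # new diagonal entry
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L


def gram_from_factor(L: Matrix) -> Matrix:
    """Multiply L by its transpose to recover the matrix it factorizes."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)] for i in range(n)]


L0 = [[2.0, 0.0], [1.0, 3.0]]             # factor of [[4, 2], [2, 10]]
L1 = chol_rank_one_update(L0, [1.0, 2.0]) # adds [[1, 2], [2, 4]]
A1 = gram_from_factor(L1)                 # approximately [[5, 4], [4, 14]]
```

A regularized Gram matrix `XᵀX + λI` changes by exactly one `x xᵀ` term when a single sample is added to a node, which is why this kind of update is a natural fit for evaluating many candidate splits.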
Results
CLARITree demonstrated better mean squared error (MSE) and R² than greedy methods and optimal-tree baselines. For instance, on a synthetic dataset, CLARITree achieved an MSE of 4.03 and an R² of 0.97, significantly outperforming greedy trees and other state-of-the-art methods.
Implications
The development of CLARITree has significant implications for interpretable machine learning, particularly in applications requiring both accuracy and transparency. Its scalability makes it suitable for large datasets, potentially expanding the use of interpretable models in various domains such as finance, healthcare, and social sciences.
Multi-Bitwidth Quantization for LLMs Using Additive Codebooks
Large Language Models
Efficient ML
Theory
- Drop-by-Drop enables inference-time precision control for LLMs without retraining.
- The method is grounded in information theory and successive refinement, allowing for progressive compression.
- It utilizes Matryoshka-style supervision to enhance the training of additive codebooks.
- The framework maintains low perplexity and strong task accuracy across multiple bitwidths.
Summary
The paper introduces Drop-by-Drop, a novel multi-bitwidth post-training quantization framework designed for large language models (LLMs) that allows adaptive inference-time precision control over model weights from a single trained model. This approach addresses the challenges posed by deploying LLMs across heterogeneous hardware with varying resource constraints, where maintaining multiple quantized model variants is impractical. The authors leverage information theory and successive refinement principles to show that LLM weights, typically following a Gaussian distribution, can be reconstructed with increasing fidelity as more bits are added. Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, enabling ordered subsets of additive codebooks to yield accurate partial reconstructions at different precision levels. The method significantly reduces storage and memory overhead while maintaining competitive perplexity and accuracy across various architectures, including Qwen, LLaMA, Gemma, and Mistral.
Methodology
The authors propose a multi-bitwidth quantization framework called Drop-by-Drop, which builds on the AQLM method. It employs additive quantization using multiple learned codebooks and introduces Matryoshka-style supervision during training to facilitate accurate partial reconstructions at various precision levels. The theoretical foundation is established through a theorem that connects Gaussian sources with successive refinement under a weighted mean squared error distortion measure.
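The idea of ordered codebook subsets giving progressively better reconstructions can be shown with a toy example. In the sketch below, a weight vector is approximated by the sum of one codeword from each of the first `m` codebooks, and a Matryoshka-style objective sums the reconstruction error over every prefix length. The codebook values, the helper names, and the unweighted sum of prefix errors are assumptions for illustration, not the Drop-by-Drop or AQLM implementation.

```python
from typing import List

Codebook = List[List[float]]


def reconstruct(codebooks: List[Codebook], codes: List[int], num_codebooks: int) -> List[float]:
    """Sum the selected codeword from each of the first `num_codebooks` codebooks."""
    out = [0.0] * len(codebooks[0][0])
    for m in range(num_codebooks):
        codeword = codebooks[m][codes[m]]
        out = [o + c for o, c in zip(out, codeword)]
    return out


def squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matryoshka_loss(target: List[float], codebooks: List[Codebook], codes: List[int]) -> float:
    """Assumed nested objective: add up the error of every prefix of the codebook list."""
    return sum(
        squared_error(target, reconstruct(codebooks, codes, m))
        for m in range(1, len(codebooks) + 1)
    )


target = [1.1, -0.6]
codebooks = [
    [[1.0, -0.5], [-1.0, 0.5]],  # coarse codebook
    [[0.1, 0.0], [-0.1, 0.0]],   # refinement of the first coordinate
    [[0.0, -0.1], [0.0, 0.1]],   # refinement of the second coordinate
]
codes = [0, 0, 0]
errors = [squared_error(target, reconstruct(codebooks, codes, m)) for m in (1, 2, 3)]
loss = matryoshka_loss(target, codebooks, codes)
```

Dropping trailing codebooks at inference time lowers the bitwidth, and training against every prefix is what keeps those shorter reconstructions usable.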
Results
Drop-by-Drop adapts its precision at inference time, achieving a smooth trade-off between model performance and bitwidth. The framework maintains competitive perplexity and accuracy across different architectures, degrading gracefully under resource constraints without the need for retraining or recalibration.
Implications
The proposed framework has significant implications for deploying large language models in real-world applications, particularly in environments with limited computational resources. It allows for flexible model deployment across diverse hardware platforms, enhancing the accessibility and efficiency of LLMs in various applications.
Let's Ask Gauss: Improved One-Run Privacy Auditing
Theory
Efficient ML
Federated Learning
- Introduces a Gaussian-pair auditor for improved one-run privacy auditing in DP-SGD.
- Demonstrates that canary-aligned scores converge to a Gaussian distribution, allowing for tighter privacy bounds.
- Proves practical convergence guarantees for the Gaussian asymptotics within typical training steps.
- Achieves significant improvements in empirical lower bounds compared to existing auditing methods.
Summary
This paper addresses the challenge of privacy auditing in differentially private (DP) machine learning, specifically focusing on efficient one-run methods for mechanisms like DP-SGD. Traditional one-run approaches often reduce canary-aligned signals to binary membership guesses, discarding valuable information. The authors propose a new framework that leverages the distributional properties of canary-aligned observations, which converge to a Gaussian distribution. By modeling the auditing process with a Gaussian-pair auditor, they derive tighter privacy lower bounds from a single training run. The paper also proves that the Gaussian asymptotics manifest within practical training steps, ensuring the method's applicability. Experimental results demonstrate significant improvements in privacy auditing accuracy compared to existing methods, making this approach particularly relevant for federated learning scenarios where multiple runs are impractical.
Methodology
The authors develop a one-run auditing framework that models the distribution of canary scores as a pair of one-dimensional Gaussians. They utilize the closed-form hockey-stick divergence between these Gaussians to derive tighter lower bounds on the DP distance. The methodology includes proving convergence guarantees for the Gaussian behavior of canary scores and conducting empirical evaluations on DP-SGD and DP-FTRL mechanisms.
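A minimal sketch of the closed-form hockey-stick divergence for a Gaussian pair follows. It assumes both Gaussians share the same standard deviation `sigma` and differ only by a mean shift `mu > 0`, which may be simpler than the pair the paper fits. The function names and the bisection helper are hypothetical, and a real audit would also need confidence intervals on the estimated `mu` and `sigma`.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def gaussian_hockey_stick(mu, sigma, eps):
    """H_{e^eps}( N(mu, sigma^2) || N(0, sigma^2) ) for mu > 0."""
    d = mu / sigma  # mean separation in units of the shared standard deviation
    return (std_normal_cdf(d / 2.0 - eps / d)
            - math.exp(eps) * std_normal_cdf(-d / 2.0 - eps / d))

def eps_lower_bound(mu, sigma, delta, hi=50.0, iters=100):
    """Largest eps in [0, hi] whose divergence still exceeds delta (bisection).

    If a mechanism were (eps, delta)-DP, the divergence at that eps could not
    exceed delta, so every eps below the returned value is ruled out.
    """
    lo = 0.0
    if gaussian_hockey_stick(mu, sigma, lo) <= delta:
        return 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gaussian_hockey_stick(mu, sigma, mid) > delta:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical fitted separations: a larger mean gap implies a larger lower bound.
for mu in (1.0, 2.0):
    print(mu, eps_lower_bound(mu, sigma=1.0, delta=1e-5))
```

At `eps = 0` the expression reduces to the total variation distance between the two Gaussians, which is a quick sanity check on the formula.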
Results
The proposed auditing framework yields an empirical privacy lower bound of approximately 6.7 for DP-SGD on CIFAR-10, about 84% of the analytic upper bound. Previous methods achieved lower bounds of approximately 4.7 and 3.3. The results indicate a 1-2x improvement across various epsilon regimes.
Implications
The findings suggest that the proposed Gaussian-pair auditing method can significantly enhance the accuracy of privacy audits in machine learning models, particularly in scenarios where multiple runs are not feasible, such as federated learning. This could lead to more reliable implementations of differential privacy in real-world applications, ensuring better protection of sensitive data.
The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry
Theory
Optimization
Interpretability
- Complex models often fail to outperform simple baselines in drug-response prediction when faced with unseen chemistry.
- A staged approach combining baseline reporting, non-parametric retrieval, and fusion with chemistry embeddings improves prediction accuracy.
- Model rankings can invert based on the evaluation metric, emphasizing the need for careful metric selection in model assessment.
- Deep learning models can outperform simpler models when evaluated with a well-calibrated metric that reflects true predictive performance.
Summary
This paper addresses the challenge of predicting cellular responses to drugs that have not been previously encountered, a significant issue in computational cell biology. The authors focus on the VCPI DRUG-seq task using THP-1 cells, where they demonstrate that complex models often fail to outperform simple baselines when test compounds are withheld by chemistry. They propose a three-stage approach: first, establishing baseline performance using untreated controls and mean training responses; second, employing non-parametric retrieval to predict responses based on chemical similarity; and third, integrating a chemistry embedding with retrieval features to enhance predictions. The study reveals that model rankings can significantly change based on the evaluation metric used. Under one metric, a linear regression model appears superior, while under the contest's true metric, deep learning models outperform simpler approaches. This highlights the importance of metric selection in model evaluation, confirming that the choice of metric can influence perceived model effectiveness in drug-response prediction tasks.
Methodology
The authors developed a staged pipeline for drug-response prediction that includes baseline performance reporting, a non-parametric retrieval method for predicting responses based on chemical similarity, and a fusion model that combines a chemistry embedding with retrieval features. They implemented a scaffold-based cross-validation protocol to ensure generalization to new chemistry and validated their findings against the official scoring metrics of the VCPI contest.
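A scaffold-based split can be sketched as a grouped k-fold in which every compound sharing a scaffold lands in the same fold. The sketch below assumes scaffold identifiers have already been computed elsewhere and are passed in as plain strings; the function name `scaffold_kfold` and the greedy balancing rule are illustrative choices, not the contest's or the authors' protocol.

```python
from collections import defaultdict

def scaffold_kfold(scaffold_ids, n_folds):
    """Yield (train_idx, test_idx) pairs where no scaffold spans both sides."""
    groups = defaultdict(list)
    for i, scaffold in enumerate(scaffold_ids):
        groups[scaffold].append(i)

    folds = [[] for _ in range(n_folds)]
    # Place the largest scaffold groups first, each into the currently smallest fold.
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        min(folds, key=len).extend(groups[scaffold])

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test

# Hypothetical scaffold labels for eight compounds.
scaffolds = ["A", "A", "B", "C", "C", "C", "D", "E"]
splits = list(scaffold_kfold(scaffolds, n_folds=3))
for train, test in splits:
    print("train", train, "test", test)
```

Because whole groups move together, fold sizes are only approximately equal, which is the usual price of testing on chemistry the model has not seen.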
Results
The results indicate that while retrieval methods perform well under shared chemical conditions, they struggle under strict scaffold splits. The fusion model significantly outperformed the linear regression baseline under the contest's true active-set metric, demonstrating the importance of capturing the biological effects of compounds. The findings confirmed that the choice of evaluation metric can drastically alter model rankings, with deep models ultimately proving more effective than simpler baselines when assessed with the appropriate metrics.
Implications
This research underscores the critical role of evaluation metrics in machine learning for drug-response prediction, suggesting that future studies should prioritize well-calibrated metrics to accurately assess model performance. The findings could lead to improved methodologies in computational biology and drug discovery, enhancing the ability to predict cellular responses to novel compounds.
Speculative Rollback Correction for Quality-Diverse Web Agent Imitation
Reinforcement Learning
Robotics
Optimization
- Introduction of Speculative Rollback Correction (SRC) for web agent imitation learning.
- SRC allows for localized expert intervention, preserving useful exploration while correcting harmful actions.
- The framework achieves significant performance gains on long-horizon tasks compared to baseline methods.
- SRC supports the retention of diverse solution paths, enhancing the training signal for agents.
Summary
This paper introduces Speculative Rollback Correction (SRC), a novel framework designed to enhance the training of interactive web agents through imitation learning. Traditional methods face challenges in determining the optimal timing for expert intervention, leading to compounded errors or excessive reliance on expert policies. SRC addresses this by implementing a branch-level imitation framework that allows agents to execute short speculative segments before expert review. This approach enables the identification of harmful deviations while preserving useful prefixes of the trajectory. The framework also incorporates a hard verifier to filter successful rollouts and maintain a quality-diversity archive of trajectories. The authors demonstrate that SRC effectively mitigates the compounding error associated with standard behavior cloning and supports diverse solution paths. Evaluations on the WebArena-Infinity benchmark show that SRC collects a substantial number of verifier-passing trajectories and next-action examples, achieving significant performance improvements over baseline methods in long-horizon tasks. The proposed framework is model-agnostic and serves as a general paradigm for training interactive agents, promoting a shift from passive imitation to autonomous execution.
Methodology
The SRC framework employs a branch-level training mechanism where the student agent executes short speculative segments of actions. After this segment, an expert reviews the trajectory to identify the first harmful deviation. If a deviation is found, the framework rolls back to preserve the useful prefix of the trajectory while correcting the harmful suffix. A hard verifier assesses the overall success of the trajectory, and a lightweight quality-diversity archive retains high-quality successful trajectories for training.
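The speculate, review, and roll back loop described above can be sketched on a toy task. Everything here is hypothetical: the integer-state environment, the `student`, `expert`, `is_harmful`, and `verifier` callables, and the function `src_rollout` are stand-ins chosen for illustration, and the quality-diversity archive is left out.

```python
def src_rollout(student, expert, is_harmful, verifier, step, state,
                seg_len=3, max_steps=20):
    """Collect one trajectory, keeping student prefixes and patching harmful steps."""
    trajectory = []  # entries are (state, action, source)
    while len(trajectory) < max_steps and not verifier(state):
        # 1. The student speculates a short segment from the current state.
        segment, s = [], state
        for _ in range(seg_len):
            a = student(s)
            segment.append((s, a))
            s = step(s, a)
        # 2. The expert reviews the segment and locates the first harmful action.
        bad = next((i for i, (si, ai) in enumerate(segment) if is_harmful(si, ai)), None)
        keep = segment if bad is None else segment[:bad]
        # 3. The useful prefix is kept as student data.
        for si, ai in keep:
            trajectory.append((si, ai, "student"))
            state = step(si, ai)
        # 4. Roll back to the deviation point and substitute the expert's action.
        if bad is not None:
            s_bad = segment[bad][0]
            a_fix = expert(s_bad)
            trajectory.append((s_bad, a_fix, "expert"))
            state = step(s_bad, a_fix)
    return trajectory, verifier(state)

# Toy task: walk from 0 up to at least 6; the student errs whenever state % 3 == 2.
step = lambda s, a: s + a
student = lambda s: -1 if s % 3 == 2 else 1
expert = lambda s: 1
is_harmful = lambda s, a: a < 0
verifier = lambda s: s >= 6

trajectory, success = src_rollout(student, expert, is_harmful, verifier, step, state=0)
print(success, trajectory)
```

In this toy run the expert only supplies the steps where the student deviates, so most of the recorded actions remain the student's own.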
Results
The SRC framework demonstrated its effectiveness by collecting 977 verifier-passing trajectories and 9,183 next-action examples on the WebArena-Infinity benchmark. It showed consistent performance improvements over baseline methods, particularly in long-horizon tasks, and exhibited strong cross-domain generalization capabilities.
Implications
The SRC framework has the potential to significantly enhance the training of interactive agents in web and GUI environments, enabling them to learn more effectively from real-world interactions. This could lead to more robust and versatile agents capable of handling diverse tasks autonomously.
Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score
Time Series
Efficient ML
Theory
- Introduction of Trajectory-based Quantization Sensitivity Score (TQS) for quantization sensitivity analysis.
- Decoupling of sensitivity estimation from quantization choices allows for flexible quantization budget planning.
- Development of TQS-PTQ, a calibration-free mixed-precision quantization framework.
- Identification of distinct quantization sensitivity patterns in time-series models compared to large language models.
Summary
This paper introduces the Trajectory-based Quantization Sensitivity Score (TQS), a novel metric that approaches post-training quantization (PTQ) from the perspective of dynamical systems stability. By modeling the network's output as a discrete-time dynamical system, TQS quantifies how errors from quantization propagate over time. Unlike traditional PTQ methods that intertwine sensitivity analysis with quantization choices, TQS allows for independent sensitivity estimation, facilitating quantization budget planning for black-box models. The authors also present TQS-PTQ, a mixed-precision framework that operates without calibration data or complex approximations. The findings indicate that viewing quantization through a dynamical systems lens yields a robust method for low-precision deployment, particularly in resource-limited environments. The paper further discusses the transferability of PTQ assumptions from large language models to time-series forecasting models, revealing that quantization sensitivity is concentrated in specific model components, which can inform better quantization strategies.
Methodology
The authors propose TQS as a Lyapunov-inspired metric that evaluates the sensitivity of quantization by modeling the network's output as a discrete-time map. They compute the growth rate of output prediction divergence over a specified rollout horizon, allowing for a clear assessment of how quantization errors propagate. The TQS-PTQ framework is introduced to allocate mixed-precision quantization without requiring calibration data, relying instead on sensitivity scores derived from the TQS metric.
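A Lyapunov-style growth rate over a rollout horizon can be sketched as follows. This is an assumption-laden illustration rather than the TQS definition, which the summary does not spell out: a small perturbation stands in for quantization error, the model is any map from a state to the next state, and the names `divergence_growth_rate`, `stable_block`, and `sensitive_block` are hypothetical.

```python
import math

def divergence_growth_rate(step, x0, perturb, horizon):
    """Average log growth per step of the gap between a clean and a perturbed rollout."""
    x = list(x0)
    y = [a + p for a, p in zip(x0, perturb)]
    d0 = math.dist(x, y)
    for _ in range(horizon):
        x, y = step(x), step(y)
    return math.log(math.dist(x, y) / d0) / horizon

# Two toy components: one contracts errors, the other amplifies them.
def stable_block(x):
    return [0.5 * v for v in x]

def sensitive_block(x):
    return [1.5 * v for v in x]

scores = {
    name: divergence_growth_rate(block, [1.0, 1.0], [1e-3, 0.0], horizon=10)
    for name, block in (("stable", stable_block), ("sensitive", sensitive_block))
}
print(scores)
```

A positive score means injected error grows along the rollout, so under this reading such a component would be a candidate for higher precision in a mixed-precision budget.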
Results
The experimental results demonstrate that the TQS framework provides a reliable method for estimating quantization sensitivity, enabling effective low-precision deployment of time-series models. The TQS-PTQ framework shows promise in maintaining model performance while significantly reducing computational and memory requirements. Additionally, the study reveals that quantization sensitivity in time-series models behaves differently than in large language models, suggesting the need for tailored quantization strategies.
Implications
The findings have significant implications for deploying time-series models in resource-constrained environments, such as early warning systems in meteorology. The ability to plan quantization budgets effectively can enhance the operational efficiency of these models, ensuring rapid and accurate forecasts while minimizing resource usage.
Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations
Audio & Speech
- Dolph2Vec is the first large-scale, species-specific SSL model for dolphin vocalizations.
- The dataset includes over 180,000 whistles collected over five years, enabling detailed analysis of dolphin communication.
- Dolph2Vec significantly outperforms general-purpose models in signature whistle classification and whistle detection tasks.
- The model's embeddings capture interpretable acoustic units, aiding in the understanding of dolphin communication patterns.
Summary
The paper introduces Dolph2Vec, a self-supervised learning (SSL) model specifically designed for analyzing dolphin vocalizations. The authors highlight the limitations of existing SSL models in bioacoustics, which often prioritize generalization across species rather than focusing on the intricate structures of individual communication systems. To address this, they present a novel dataset comprising over five years of longitudinal recordings from five dolphins in a semi-naturalistic marine environment, which is significantly larger than previous datasets. Dolph2Vec is based on the Wav2Vec2.0 architecture and is trained exclusively on this dataset. The model is benchmarked on two biologically relevant tasks: signature whistle classification and whistle detection, where it outperforms general-purpose baselines. The learned embeddings and codebook structure provide interpretable acoustic units that align with dolphin whistle categories, facilitating a fine-grained analysis of communication patterns. The findings underscore the potential of SSL as both a modeling approach and a scientific tool to explore hypotheses in animal communication research, thereby bridging the gap between machine learning and bioacoustics.
Methodology
The authors adapted the Wav2Vec2.0 architecture for dolphin vocalizations and trained the model on a newly collected dataset of dolphin whistles. They benchmarked the model on tasks relevant to dolphin communication, specifically focusing on signature whistle classification and whistle detection.
Results
Dolph2Vec achieved superior performance compared to general-purpose baselines in both classification and detection tasks. The learned embeddings provided insights into the structure of dolphin whistles, aligning with known categories and potentially revealing sub-whistle structures.
Implications
The study highlights the potential of self-supervised learning in bioacoustics, offering a new approach to analyze animal communication without extensive manual annotation. It opens avenues for further research into the complexities of dolphin communication and could inspire similar methodologies for other species.
Order Is Not Control
Theory
Interpretability
Large Language Models
- Control requires a receiver-gated response law that maps states and actions to measurable outcomes.
- Order is distinct from control; interventions can induce order without achieving control.
- Empirical evidence from biological systems and LLMs supports the proposed response law framework.
- Local control is defined by the ability to move a target response while keeping side effects bounded.
Summary
The paper argues that the concepts of order and control in adaptive systems are often conflated, particularly in the context of AI alignment and interpretability. It introduces a framework for understanding control through a receiver-gated response law, which maps various states and actions to measurable responses. The authors provide empirical evidence from biological systems (mouse ALM, C. elegans, zebrafish) and large language models (LLMs) to illustrate their points. They emphasize that control is only achieved when a finite effort can move a target response while keeping side effects bounded. The paper delineates between induced order, observable response evidence, and local control, proposing a driven-dissipative response-system account that operates at a mesoscopic level. The findings suggest that while interventions can induce structure, they do not necessarily equate to control unless they lead to measurable outcomes under specific conditions. The implications extend to organizational alignment, where established principles and prompts are evaluated based on their ability to produce desired responses without exceeding defined limits.
Methodology
The authors employed empirical studies across biological systems and large language models to identify and validate the proposed receiver-gated response laws. They analyzed the response behavior under various conditions, measuring the accuracy of generated outputs and the predictability of response vectors.
Results
The study found that response vectors in LLMs were predictable with 72.8-73.7% accuracy, increasing to 84.3-84.8% for nonzero components. Observers were able to predict system effects and target families with 93.6% and 91.7% accuracy, respectively. The evidence supports the existence of local admitted control and measurable stochastic response operators.
Implications
The findings suggest a need to reassess how control is defined in adaptive systems, particularly in AI alignment contexts. The framework could inform the design of interventions in organizational settings, ensuring that principles and prompts lead to desired outcomes without exceeding defined limits.
Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models
Theory
Robotics
Time Series
- Introduces a certified horizon for equivariant latent world models, stratified by Lyapunov spectrum.
- Establishes that only equivariant models can achieve a certified predictable horizon, with a matching lower bound for approximate equivariance.
- Empirical validation shows that equivariant networks significantly outperform non-equivariant models in predicting chaotic dynamics.
- The certificate allows for training-free audits of pretrained models, enhancing trustworthiness without additional data.
Summary
This paper addresses the limitations of traditional world models in predicting future states by introducing a method to certify the predictable horizon of equivariant latent world models. The author argues that while average prediction error is commonly used to evaluate models, it does not provide insight into the reliability of specific predictions over time. The key contribution is the development of a computable, multi-step certificate that guarantees the predictable horizon, stratified by the model's Lyapunov spectrum. The paper presents several theorems and propositions that establish the relationship between equivariance and predictability, demonstrating that only equivariant models can achieve a certified horizon. Empirical results on the 40-D Lorenz-96 system show that equivariant networks outperform non-equivariant baselines in recovering the Lyapunov spectrum, achieving a high R² value of 0.98. The findings suggest that the proposed certificate can be used to audit pretrained models without additional calibration data, providing actionable insights into their reliability and performance.
Methodology
The paper employs theoretical analysis to derive the relationship between equivariance and predictability, using Lyapunov spectrum to stratify the predictable horizon. It includes theorems and propositions that provide bounds and conditions for equivariant models. Empirical validation is conducted on the Lorenz-96 system to demonstrate the effectiveness of the proposed method.
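The link between a Lyapunov exponent and a predictable horizon can be illustrated with a textbook heuristic. This is not the paper's certificate, which the summary does not state; the symbols $\epsilon_0$ (initial error), $\delta$ (error tolerance), and $\lambda_1$ (largest Lyapunov exponent) are introduced here for illustration only.

```latex
\|\epsilon(t)\| \approx \epsilon_0 \, e^{\lambda_1 t}
\quad\Longrightarrow\quad
T(\delta) \approx \frac{1}{\lambda_1} \ln\frac{\delta}{\epsilon_0},
\qquad \lambda_1 > 0 .
```

Under this heuristic the horizon grows only logarithmically as the initial error shrinks, so halving $\epsilon_0$ extends $T(\delta)$ by just $\ln 2 / \lambda_1$.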
Results
The main results include a computable certificate for the predictable horizon of equivariant models, with empirical evidence showing that only equivariant networks can recover the full Lyapunov spectrum. The proposed method allows for effective audits of pretrained models, confirming their reliability across various tasks without requiring additional calibration data.
Implications
This work has significant implications for the development of robust and reliable world models in machine learning, particularly in applications involving chaotic systems and dynamic environments. The ability to certify predictability can enhance the deployment of AI systems in real-world scenarios, where understanding the limits of model predictions is crucial.
Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation
Time Series
Efficient ML
Theory
- ESE enables simultaneous forecasting of multiple systems in a single pass, improving efficiency.
- The method demonstrates a 10–70× speedup compared to state-of-the-art methods while maintaining accuracy.
- ESE can be integrated with existing predictors, enhancing their capabilities for multi-prediction.
- The approach is robust under diverse perturbations and scales effectively with the number of systems.
Summary
This paper introduces Equilibrium State Estimation (ESE), a novel approach for simultaneous forecasting of multiple interacting systems, which is particularly relevant in fields like economics and healthcare. Traditional methods often predict each system individually, which can be inefficient and overlook inter-system interactions. ESE addresses this by estimating a collective equilibrium state across all systems, allowing for holistic forecasts based on the deviations from this equilibrium. The methodology consists of two components: an Equilibrium State Estimator that infers the latent equilibrium and a Predictor that forecasts system states based on this estimation. The authors conducted extensive experiments on synthetic and real-world datasets, including currency exchange rates and COVID-19 spread modeling, demonstrating that ESE achieves comparable accuracy to state-of-the-art methods while significantly improving speed. ESE's linear-time complexity allows it to scale efficiently as the number of systems increases, and it remains robust under various perturbations. Additionally, ESE can be integrated with existing prediction models, enhancing their performance in simultaneous multi-prediction tasks.
Methodology
The methodology involves estimating the equilibrium state across multiple interacting systems and generating forecasts based on the differences between the current state and the estimated equilibrium. It consists of two main components: an Equilibrium State Estimator and a Predictor, which can operate independently or in conjunction with existing forecasting models.
Results
The experiments showed that ESE achieves accuracy comparable to state-of-the-art forecasting methods while being significantly faster. It achieved a speedup of 10–70× and maintained linear-time complexity, making it scalable as the number of systems increases. ESE also demonstrated robustness against various perturbations.
Implications
ESE has potential applications in fields requiring simultaneous forecasting of interdependent systems, such as economics, healthcare, and environmental modeling. Its efficiency and scalability make it a valuable tool for real-time predictions in complex scenarios.
How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?
Theory
- Causal invariance can improve supervised domain adaptation by identifying invariant predictors.
- Finite-sample gains depend on the target-risk margins and finite-source estimation errors.
- An adaptive aggregation procedure can outperform target-only learning under certain conditions.
- The study connects theoretical results to structural shifts in linear Structural Causal Models (SCMs).
Summary
This paper investigates the role of causal invariance in supervised domain adaptation (sDA) when dealing with finite sample sizes. The authors highlight that machine learning models often perform poorly when applied to target distributions that differ from their training source distributions. They explore how shared causal structures can lead to invariant predictors, which are models that maintain stable risk across domain shifts. The study focuses on linear regression, where causal knowledge can define invariant feature subsets that yield candidate predictors. The authors derive upper and lower bounds demonstrating that the finite-sample gains depend on the target-risk margins between candidates and the finite-source estimation error. They propose an adaptive aggregation procedure that can match the best candidate predictor when these margins are sufficiently large. Conversely, if the margins are too small, no algorithm can effectively leverage the candidate collection to achieve better performance than target-only learning. The theoretical findings are validated against real-world causal benchmarks, providing insights into the practical utility of causal knowledge in finite-sample settings.
Methodology
The authors analyze linear regression models under the framework of causal invariance, deriving theoretical bounds for performance based on the relationship between candidate predictors and their associated risks. They utilize an adaptive aggregation approach to optimize the selection of predictors based on available causal knowledge and sample sizes.
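The choice between source-fitted candidates and target-only learning can be sketched with plain hold-out selection. This is a simpler stand-in for the paper's adaptive aggregation procedure, not a reproduction of it: the toy data, the one-feature predictors, and the names `fit_slope`, `mse`, and `select_predictor` are all assumptions made for illustration.

```python
def fit_slope(xs, ys):
    """Least-squares slope for a one-feature regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(predict, rows, ys):
    return sum((predict(r) - y) ** 2 for r, y in zip(rows, ys)) / len(ys)

def select_predictor(candidates, rows, ys):
    """Fit a target-only baseline on half the target data, score everything on the rest."""
    half = len(rows) // 2
    slope = fit_slope([r[0] for r in rows[:half]], ys[:half])
    pool = dict(candidates)
    pool["target_only"] = lambda r: slope * r[0]
    risks = {name: mse(f, rows[half:], ys[half:]) for name, f in pool.items()}
    return min(risks, key=risks.get), risks

# Source domain: feature 0 causes y (y = 2 * x0); feature 1 is a copy of y.
source_rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
source_ys = [2.0, 4.0, 6.0, 8.0]
a = fit_slope([r[0] for r in source_rows], source_ys)  # invariant candidate
b = fit_slope([r[1] for r in source_rows], source_ys)  # spurious candidate

# Target domain: same causal mechanism plus noise, but feature 1 flips sign.
target_ys = [2.5, 3.5, 6.1, 7.9]
target_rows = [(1.0, -2.5), (2.0, -3.5), (3.0, -6.1), (4.0, -7.9)]

candidates = {"invariant": lambda r: a * r[0], "spurious": lambda r: b * r[1]}
best, risks = select_predictor(candidates, target_rows, target_ys)
print(best, risks)
```

With few noisy target samples the invariant candidate wins because its source fit is sharper than the target-only fit. When the candidates' target risks are close together, a hold-out comparison like this cannot reliably tell them apart, which is the regime the paper's lower bound addresses.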
Results
The results indicate that when the target-risk margins are large relative to the sample sizes, it is possible to achieve better performance than traditional target-only learning methods. However, if the margins are small, leveraging causal knowledge does not yield significant improvements. The theoretical framework is supported by empirical validation on causal benchmarks.
Implications
The findings suggest that incorporating causal knowledge can enhance the performance of machine learning models in domain adaptation scenarios, particularly when sample sizes are limited. This has implications for various applications in fields where domain shifts are common, such as healthcare and social sciences.
Limits of spectral learning under noise
Theory
Interpretability
- Noise induces a predictable drift in spectral coefficient vectors.
- The magnitude of drift is related to the effective number of active spectral modes.
- A closed-form expression for the overlap between noisy and noiseless coefficients is derived.
- Numerical experiments validate the theoretical predictions across multiple spectral bases.
Summary
This paper investigates the effects of additive noise on spectral learning, a method used to approximate unknown functions through spectral expansions. The authors focus on supervised regression scenarios where the observed labels are corrupted by noise. They demonstrate that noise leads to a predictable drift in the learned coefficient vector, with the degree of drift depending on the effective number of active spectral modes. By transforming the empirical feature geometry, the authors derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve influenced by a single intrinsic noise scale. The theoretical findings are supported by numerical experiments using various spectral bases, including Fourier, Legendre, Bessel, and Haar bases. The results indicate that there exists a fundamental noise threshold beyond which the stability of coefficient estimates deteriorates, thereby limiting the recovery of functional structures from noisy data. This work clarifies how noise affects spectral learning and the intrinsic limits it imposes on model discovery from noisy observations.
Methodology
The authors analyze the degradation of spectral learning under additive noise by employing sparse spectral expansions. They derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors and conduct numerical experiments to validate their theoretical predictions across different spectral bases.
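The overlap between noisy and noiseless coefficient vectors can be measured empirically with a short experiment. The sketch below is not the paper's closed-form result: it assumes a sine basis on a uniform grid (where least squares reduces to inner products), takes the overlap to be cosine similarity, and uses hypothetical names such as `noisy_overlap` and `true_coeffs`.

```python
import math
import random

def sine_basis(k, x):
    """Orthonormal sine mode on a uniform grid over [0, 1)."""
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * k * x)

def fit_coefficients(ys, n_modes):
    """Least-squares coefficients; on this grid they are plain inner products."""
    n = len(ys)
    return [sum(y * sine_basis(k, i / n) for i, y in enumerate(ys)) / n
            for k in range(1, n_modes + 1)]

def overlap(a, b):
    """Cosine similarity between two coefficient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def noisy_overlap(true_coeffs, noise_std, n_points=64, seed=0):
    """Overlap between coefficients fitted to noisy labels and the true coefficients."""
    rng = random.Random(seed)
    clean = [sum(c * sine_basis(k + 1, i / n_points) for k, c in enumerate(true_coeffs))
             for i in range(n_points)]
    noisy = [y + rng.gauss(0.0, noise_std) for y in clean]
    return overlap(fit_coefficients(noisy, len(true_coeffs)), true_coeffs)

# Sparse target: two active modes out of eight fitted modes.
true_coeffs = [1.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
for noise_std in (0.0, 0.1, 1.0, 5.0, 20.0):
    print(noise_std, round(noisy_overlap(true_coeffs, noise_std), 4))
```

Fitting more inactive modes gives the noise more directions to leak into, which is one informal way to read the dependence on the number of active modes described above.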
Results
The study reveals that as noise increases, the learned functions become distorted, and the spectral coefficients deviate significantly from their noiseless counterparts. The derived intrinsic noise scale indicates the transition point between stable and noise-dominated learning regimes. The numerical experiments confirm the theoretical degradation curve and the existence of a noise threshold affecting the stability of coefficient estimates.
Implications
The findings have significant implications for scientific inference and model discovery in various fields, including physics, chemistry, and engineering. Understanding the limits imposed by noise on spectral learning can guide researchers in developing more robust methods for extracting functional relationships from noisy data.
Disparate Impact in Synthetic Data Generation
Generative Models
Graph Learning
Theory
- Introduces a new definition of fairness in SDG based on disparate impact.
- Highlights the importance of assessing utility equality across sensitive groups.
- Investigates the causes of disparate impact, including estimation errors and sampling biases.
- Proposes a group-wise modeling approach to enhance utility and fairness.
Summary
This paper addresses the fairness notion of disparate impact in synthetic data generation (SDG), focusing on whether the utility of generated records is equitable across sensitive groups. Unlike previous approaches that treat SDG as a means to correct biases in real-world data, this work posits that non-disparate impact is achieved when synthetic and real distributions align. The authors explore the reasons for potential failures in achieving this alignment, including approximation and estimation errors that may disproportionately affect different groups. They investigate the expressive power of SDG methods, the impact of sampling errors due to group proportions, and the effects of differential privacy mechanisms. Through experiments on both artificial and real-world datasets, the paper illustrates instances of disparate impact, particularly in SDG methods utilizing probabilistic graphical models (PGMs). A novel strategy for learning group-wise SDG models is introduced, demonstrating improvements in both overall utility and fairness across groups.
Methodology
The authors utilize probabilistic graphical models (PGMs) to learn distributions for synthetic data generation. They conduct experiments on both artificial and real datasets to analyze the disparate impact of SDG methods, focusing on the effects of sampling errors, estimation errors from differential privacy, and the complexity of underlying distributions across different sensitive groups.
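The difference between a pooled model and group-wise models can be shown with a deliberately small example. A one-variable categorical distribution stands in for the PGMs used in the paper, total variation distance stands in for utility, and the records and names (`empirical`, `tv_distance`, `pooled_gap`, `groupwise_gap`) are hypothetical.

```python
from collections import Counter

def empirical(values):
    """Empirical categorical distribution over a list of values."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Toy records (group, value): the minority group has a different value profile.
records = ([("majority", "a")] * 80 + [("majority", "b")] * 10
           + [("minority", "b")] * 9 + [("minority", "a")] * 1)

by_group = {}
for group, value in records:
    by_group.setdefault(group, []).append(value)
real = {g: empirical(vals) for g, vals in by_group.items()}

# Pooled generator: one distribution fitted to everyone, ignoring the group.
pooled = empirical([value for _, value in records])
pooled_gap = {g: tv_distance(real[g], pooled) for g in real}

# Group-wise generator: one distribution fitted per group.
groupwise = {g: empirical(vals) for g, vals in by_group.items()}
groupwise_gap = {g: tv_distance(real[g], groupwise[g]) for g in real}

print("pooled:", pooled_gap)
print("group-wise:", groupwise_gap)
```

The pooled model tracks the majority and misrepresents the minority. The group-wise fit removes that gap in this noiseless toy, although with small groups or differential privacy noise each per-group estimate would itself be less precise.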
Results
The experiments reveal significant instances of disparate impact in synthetic data generation, particularly highlighting how approximation and estimation errors can lead to unequal utility across sensitive groups. The proposed group-wise SDG modeling approach shows promise in mitigating these disparities, leading to improved overall utility and fairness.
Implications
The findings underscore the necessity of considering fairness in synthetic data generation, especially in applications where synthetic data is used for training machine learning models or for educational purposes. The proposed methodologies could be applied in various domains to ensure equitable treatment of different groups, thereby enhancing the trustworthiness of machine learning systems.
Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well
Theory
- Cross-validation significantly reduces benchmarking variance and increases confidence in performance estimates.
- The concept of 'sample gain' quantifies the benefits of using multiple cross-validation splits.
- Diminishing returns from additional splits occur later than expected, suggesting more splits can be beneficial.
- A dynamic early-stopping procedure for cross-validation can optimize computational efficiency.
Summary
This paper addresses the validation crisis in machine learning, where the statistical variability in performance evaluation can obscure genuine advancements due to limited test samples. The authors demonstrate that cross-validation (CV) significantly enhances the reliability of performance estimates when benchmarking learning algorithms. They introduce the concept of 'sample gain,' which quantifies the virtual data augmentation achieved through multiple CV splits. Experiments conducted on both synthetic and real-world datasets, including histopathologic scans and NLP fine-tuning, reveal that using multiple splits can greatly improve the stability of performance estimates. The authors also propose a dynamic early-stopping procedure for cross-validation, allowing for the estimation of potential sample gains from initial folds, thus optimizing computational resources. The findings emphasize the importance of leveraging cross-validation to achieve robust benchmarking in machine learning, particularly in domains with limited data.
Methodology
The authors conducted experiments using synthetic and real-world datasets to evaluate the impact of cross-validation on benchmarking variance. They introduced the concept of sample gain and developed a dynamic early-stopping procedure based on initial CV splits to assess the potential for further gains.
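The variance reduction from averaging over several splits can be illustrated with a small simulation. The sketch below is a hypothetical toy setup (a one-dimensional two-class problem with a nearest-mean classifier), not the paper's experiments. The variance ratio it reports is only an informal stand-in for the paper's sample gain, whose exact definition is not given in this summary.

```python
import random
import statistics


def fold_accuracies(rng, n=100, k=5):
    """Draw a fresh toy dataset and return the accuracy on each of k folds."""
    # Toy data (assumption): class 1 is the same Gaussian shifted by +1.
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(float(label), 1.0), label))

    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [point for j in range(k) if j != i for point in folds[j]]
        mean0 = statistics.fmean([x for x, y in train if y == 0])
        mean1 = statistics.fmean([x for x, y in train if y == 1])
        # Nearest-mean classifier: predict the class whose mean is closer.
        correct = sum(
            (abs(x - mean1) < abs(x - mean0)) == (y == 1) for x, y in test
        )
        accuracies.append(correct / len(test))
    return accuracies


def variance_ratio(reps=300, seed=0):
    """Variance of a single-split estimate divided by that of the k-fold mean."""
    rng = random.Random(seed)
    runs = [fold_accuracies(rng) for _ in range(reps)]
    single_split = statistics.variance([run[0] for run in runs])
    cross_validated = statistics.variance([statistics.fmean(run) for run in runs])
    return single_split / cross_validated


print(f"variance ratio (single split / 5-fold mean): {variance_ratio():.2f}")
```

A ratio above 1 means the cross-validated estimate fluctuates less across datasets than a single held-out split of the same size.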
Results
The experiments showed that cross-validation yields a sample gain, which acts like an increase in statistical power, with values reaching around 10. The benefits of additional cross-validation splits also persisted longer than previously anticipated, making performance comparisons more reliable.
Implications
The findings suggest that adopting more rigorous cross-validation practices can lead to more reliable benchmarking in machine learning, particularly in fields with limited data. This could facilitate more accurate assessments of algorithm performance and foster genuine advancements in the field.
Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention
NLP
Large Language Models
Theory
- Boltzmann Attention introduces learnable pairwise couplings to enhance attention mechanisms.
- The method outperforms standard softmax attention in tasks involving longer sequences.
- Ablation studies confirm that improvements are due to the learnable couplings.
- The Ising model framework allows for potential integration with quantum computing techniques.
Summary
This paper introduces Boltzmann Attention, an innovative attention mechanism that enhances standard attention models by incorporating learnable pairwise couplings based on the Ising model. Traditional attention mechanisms primarily compute relevance through individual query-key similarities, which limits their ability to model cooperative or antagonistic relationships among attention decisions. The proposed Boltzmann Attention augments local fields with these learnable couplings, enabling the model to capture inter-position correlations that standard softmax or sigmoid attention cannot. The authors demonstrate the effectiveness of Boltzmann Attention through experiments on character-level language modeling and synthetic bracket matching, showing consistent improvements over standard softmax attention, particularly as sequence lengths increase. An ablation study confirms that the enhancements stem from the learnable couplings. Additionally, the Ising model framework opens avenues for quantum-computing-based sampling strategies, with the authors showcasing that diabatic quantum annealing can serve as a practical training method while maintaining competitive performance with exact Boltzmann computation.
Methodology
The authors formulate attention as an interacting spin system using the Ising model, where each key position is represented as a binary spin. They introduce learnable pairwise couplings between these spins, allowing for the modeling of inter-position correlations. The attention weights are derived from the marginal spin magnetizations under the Boltzmann distribution, enhancing the representational capacity of the attention mechanism.
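The idea of reading attention weights off spin marginals can be sketched by brute-force enumeration for a handful of key positions. The code below is a minimal illustration under stated assumptions: spins take values in {0, 1}, the energy is E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j, and the function name `ising_marginals` is hypothetical. How the paper computes or normalizes the magnetizations at scale is not described in this summary.

```python
import itertools
import math


def ising_marginals(h, J):
    """Exact P(s_i = 1) under p(s) proportional to exp(-E(s)).

    h: local fields, e.g. query-key similarities (list of floats).
    J: symmetric coupling matrix with zero diagonal (list of lists).
    Enumerates all 2**n states, so it is only usable for very small n.
    """
    n = len(h)
    states = list(itertools.product([0, 1], repeat=n))
    weights = []
    for s in states:
        field_term = sum(h[i] * s[i] for i in range(n))
        pair_term = sum(
            J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n)
        )
        weights.append(math.exp(field_term + pair_term))
    partition = sum(weights)
    return [
        sum(w for w, s in zip(weights, states) if s[i] == 1) / partition
        for i in range(n)
    ]


h = [0.5, -1.0, 2.0]
no_coupling = [[0.0] * 3 for _ in range(3)]
cooperative = [[0.0, 1.5, 0.0], [1.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(ising_marginals(h, no_coupling))   # independent, sigmoid-like weights
print(ising_marginals(h, cooperative))   # position 1 is pulled up by position 0
```

With all couplings set to zero the marginals reduce to an element-wise sigmoid of the local fields, which is the sense in which the couplings add something that independent per-position scoring cannot express.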
Results
Experiments demonstrate that Boltzmann Attention consistently outperforms standard softmax attention in character-level language modeling and synthetic bracket matching tasks. The performance improvement is particularly significant as the sequence length increases. The ablation study confirms that the learnable pairwise couplings are the source of these enhancements.
Implications
The findings suggest that explicitly modeling inter-position interactions can significantly improve attention-based sequence modeling. The integration of quantum computing techniques for training Boltzmann Attention could lead to more efficient and powerful models in the future.
Multimodal Graph Negative Learning
Graph Learning
Multimodal
- Introduces GraphMNL, a framework for addressing semantic imbalance in MAGs.
- Utilizes Negative Learning to guide inferior branches without forcing imitation of dominant branches.
- Implements a graph-aware reliability arbitration mechanism for branch selection.
- Achieves significant performance improvements over existing methods on benchmark datasets.
Summary
This paper introduces GraphMNL, a novel framework for learning on Multimodal Attributed Graphs (MAGs), which combine graph topology with heterogeneous attributes like text and images. The authors identify a critical challenge in MAGs: node-level branch semantic imbalance, where different branches (representing various modalities) provide unequal levels of semantic information across nodes. Existing methods often rely on dominant branches for supervision, which can propagate biases and suppress useful semantics from inferior branches. GraphMNL addresses this by employing Negative Learning to guide inferior branches on what classes a node is unlikely to belong to, rather than forcing them to imitate potentially biased dominant predictions. The framework consists of four modules: branch construction to disentangle prediction pathways, graph-aware reliability arbitration to select dominant branches based on multiple factors, stability gating to manage transfer stability, and target-preserving negative learning to suppress unlikely alternatives. This approach allows for more reliable integration of multimodal information and improves classification performance.
Methodology
GraphMNL employs a four-module architecture: (1) branch construction to separate prediction pathways, (2) graph-aware reliability arbitration for selecting branches based on confidence and context, (3) stability gating to manage the transfer of information, and (4) target-preserving negative learning to suppress unlikely class predictions while maintaining focus on the target class.
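The negative-learning step can be sketched with the generic complementary-label loss, -log(1 - p_c), applied to classes that the dominant branch rates as unlikely. This is a sketch under assumptions: the threshold `tau`, the helper names, and the rule for picking negative classes are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_classes(dominant_probs, target, tau=0.05):
    """Classes the dominant branch rates below tau; the target is never included."""
    return [c for c, p in enumerate(dominant_probs) if p < tau and c != target]


def negative_learning_loss(inferior_logits, negatives, eps=1e-12):
    """Penalize probability mass the inferior branch puts on negative classes."""
    probs = softmax(inferior_logits)
    return -sum(math.log(max(1.0 - probs[c], eps)) for c in negatives)


dominant = [0.90, 0.07, 0.02, 0.01]
negatives = negative_classes(dominant, target=0)
print(negatives)
print(negative_learning_loss([0.0, 0.0, -5.0, -5.0], negatives))
print(negative_learning_loss([0.0, 0.0, 5.0, 5.0], negatives))
```

The loss only says which classes to avoid, so the inferior branch stays free to disagree with the dominant branch about the remaining classes.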
Results
GraphMNL achieved a classification accuracy of 72.47% on the Grocery dataset and a 76.60 F1 score on the Reddit M dataset, outperforming the second-best baseline by 1.81% and 3.63%, respectively.
Implications
The proposed framework can enhance applications in recommendation systems, social media analysis, and other domains that utilize multimodal data, by providing more accurate and reliable node representations.
Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers
NLP
Large Language Models
Reinforcement Learning
- RGSD eliminates the need for external LLM verifiers in training, reducing computational overhead.
- The method transforms sparse trajectory-level rewards into dense per-token learning signals.
- RGSD achieves competitive performance compared to traditional judge-based methods while being more efficient.
- The study highlights the importance of rubric conditioning in enhancing model responses.
Summary
This paper introduces Rubric-Guided Self-Distillation (RGSD), a novel training method designed to enhance the performance of language models in open-ended tasks without relying on external rubric verifiers. Traditional approaches utilize large language model (LLM) verifiers to score responses based on predefined rubrics, which can introduce computational overhead and biases. RGSD addresses these issues by allowing the base policy, conditioned on the rubric, to act as a teacher for an unconditioned student model. This method replaces sparse trajectory-level rewards with dense per-token learning signals, effectively removing the need for LLM judges during training. The authors demonstrate RGSD's effectiveness across various models (Qwen-2.5 and Qwen3-Thinking) in medical and scientific domains, achieving comparable rubric satisfaction to judge-based methods while significantly reducing computational costs. The findings suggest that RGSD can serve as a complementary approach when verifier reliability or cost is a concern, with the potential to improve training efficiency and model performance.
Methodology
RGSD employs a self-distillation approach where the student model generates a response based on a prompt, while a teacher model, conditioned on the same prompt and the rubric, provides per-token target distributions. The student is trained to match these distributions using a clipped Jensen-Shannon divergence loss, effectively internalizing the rubric without external grading.
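A per-token Jensen-Shannon loss of this kind can be sketched in a few lines. The version below is illustrative only: it assumes that "clipped" means capping each token's divergence at a maximum value, and the `clip` default and function names are hypothetical rather than taken from the paper.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    mixture = [0.5 * (a + b) for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with zero probability contribute nothing to the sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def clipped_jsd_loss(student_dists, teacher_dists, clip=0.5):
    """Mean over tokens of min(JSD(student, teacher), clip)."""
    per_token = [
        min(js_divergence(s, t), clip)
        for s, t in zip(student_dists, teacher_dists)
    ]
    return sum(per_token) / len(per_token)


# Two token positions over a three-word toy vocabulary.
student = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
teacher = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
print(clipped_jsd_loss(student, teacher))
```

Because the divergence is computed at every token position, each generated token receives its own training signal instead of one score for the whole response.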
Results
RGSD was evaluated on Qwen-2.5 and Qwen3-Thinking models and achieved rubric satisfaction scores comparable to judge-based methods (e.g., +6.1 pp vs. +5.9 pp in medical tasks). It was also more computationally efficient, requiring only one on-policy rollout per prompt and no verifier calls. Notably, RGSD reached peak quality at shorter response lengths than traditional methods.
Implications
The findings suggest that RGSD can be a viable alternative for training language models in open-ended tasks, particularly in scenarios where verifier costs or reliability are significant concerns. This method could enhance the efficiency and effectiveness of model training in various applications, including clinical advice and creative writing.
MiniPIC: Flexible Position-Independent Caching in <100LOC
Large Language Models
NLP
Efficient ML
- MiniPIC enables flexible position-independent caching with minimal code changes.
- It introduces user-controlled primitives for cache reuse, enhancing flexibility.
- The system achieves significant performance improvements in LLM inference tasks.
- MiniPIC integrates seamlessly with existing KV cache implementations.
Summary
The paper introduces MiniPIC, a novel approach to position-independent caching designed for retrieval-augmented and agentic workloads in large language models (LLMs). Traditional prefix caching methods in systems like vLLM are limited to requests with identical prefixes, which prevents effective reuse of cached key-value (KV) entries when requests share common spans but differ in context. MiniPIC addresses this limitation by implementing a minimalistic design that requires fewer than 100 lines of code changes to the core engine. It utilizes a positional-encoding-free KV cache and provides user-controlled cache-reuse primitives, allowing for flexible caching strategies. The system integrates three user-facing primitives—block-aligned padding, SSEP, and PDEP—that modify hashing behavior and causal attention structure. The results demonstrate that MiniPIC significantly enhances throughput and reduces latency, achieving a 49% improvement in prefill throughput on the 2WikiMultihopQA benchmark while maintaining linear scaling for uncached spans and incurring minimal overhead.
Methodology
The authors developed MiniPIC by creating a position-free KV cache that stores unrotated keys and applies rotary position embeddings (RoPE) during attention computation. They designed three user-facing primitives to control cache reuse, allowing users to modify caching behavior without extensive changes to the inference engine. Additionally, a new scheduling policy was introduced to optimize the prefill process.
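Why storing unrotated keys makes a cached span reusable at any offset can be shown with a toy RoPE implementation. This is a minimal sketch, not MiniPIC's code: the function names, the base of 10000, and the pairing of adjacent dimensions are assumptions chosen to match the common RoPE formulation.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    dim = len(vec)
    out = [0.0] * dim
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * cos_t - vec[i + 1] * sin_t
        out[i + 1] = vec[i] * sin_t + vec[i + 1] * cos_t
    return out


def attention_scores(query, query_pos, cached_keys, span_start):
    """Score a query against a cached span placed at span_start, span_start+1, ...

    cached_keys holds keys WITHOUT positional rotation; the rotation is
    applied here, at read time, using whatever position the span now occupies.
    """
    rotated_query = rope(query, query_pos)
    scores = []
    for offset, key in enumerate(cached_keys):
        rotated_key = rope(key, span_start + offset)
        scores.append(sum(a * b for a, b in zip(rotated_query, rotated_key)))
    return scores


query = [0.3, -1.2, 0.8, 0.5]
span = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.9, -0.3, 0.4]]
print(attention_scores(query, 10, span, 4))
print(attention_scores(query, 110, span, 104))  # same relative offsets
```

Both calls produce the same scores up to floating-point error because RoPE scores depend only on relative position, so the same unrotated entries can serve a span wherever it lands in a new request.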
Results
MiniPIC demonstrated a 49% increase in prefill throughput on the 2WikiMultihopQA dataset compared to the baseline vLLM. It also reduced the time-to-first-token for cached spans by up to two orders of magnitude and maintained linear scaling for uncached spans, with only a 5.7% worst-case overhead.
Implications
The findings suggest that MiniPIC can significantly enhance the efficiency of LLM inference systems, making them more adaptable to varied workloads. This could lead to improved performance in applications requiring rapid retrieval and processing of structured inputs, such as document generation and interactive AI systems.
μVLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
Multimodal
Robotics
- Introduces μVLA, a family of recurrent fine-tunes of OpenVLA-OFT focused on isolating the effects of recurrence.
- Demonstrates significant performance improvements on MIKASA-Robo tasks, particularly under partial observability.
- Establishes a controlled study framework to evaluate the impact of recurrence without confounding factors.
- Identifies performance regimes for minimal recurrence, highlighting when it is sufficient and when additional memory structures are needed.
Summary
The paper presents μVLA, a novel approach to enhancing Vision-Language-Action (VLA) models by incorporating recurrent memory to address the challenges posed by partial observability in manipulation tasks. Traditional VLA models predict future actions based solely on current observations, which can lead to failures when relevant information is not visible. The authors conduct a controlled study isolating the effects of recurrence within a strong pretrained VLA backbone, avoiding the confounding influences of auxiliary losses or architectural changes. μVLA introduces a small set of learnable memory tokens that are updated through self-attention and carried across timesteps, trained end-to-end using truncated backpropagation through time (TBPTT). The study evaluates μVLA on the MIKASA-Robo benchmark, demonstrating significant improvements in task performance, particularly in scenarios with partial observability. The results indicate that minimal in-backbone recurrence can effectively enhance model performance under certain conditions, while also identifying limitations where additional memory structures may be necessary. This work contributes to the understanding of how recurrence can be effectively utilized in VLA models, providing insights into the conditions under which it is beneficial.
Methodology
The methodology involves augmenting a pretrained VLA model with learnable memory tokens that are updated through self-attention across timesteps. The model is trained using truncated backpropagation through time (TBPTT) without auxiliary losses or architectural modifications, allowing for a focused investigation of the role of recurrence in performance.
Results
On the MIKASA-Robo benchmark, μVLA improved the average success rate from 0.42 to 0.84 across five training tasks. On held-out tasks with the same memory structure, it achieved a success rate of 0.23, compared to 0.07 for the memoryless baseline. In fully observable settings, the strongest recurrent variant reached an average success rate of 0.962.
Implications
The findings suggest that incorporating minimal recurrent memory can significantly enhance the performance of VLA models in partially observable environments, which has implications for robotics and other applications requiring decision-making under uncertainty. This work also lays the groundwork for future research on memory structures in machine learning models.
Emerging Flexible Designs for Geospatial Multimodal Foundation Models
Computer Vision
Multimodal
- Standardized benchmarking of geospatial foundation models using unified pretraining objectives and evaluation protocols.
- Insights into the impact of tokenization and fusion strategies on model robustness and spectral reasoning.
- Identification of trade-offs between flexibility and homogeneity in model architectures.
- Demonstration of Flex's adaptability to missing or heterogeneous bands compared to standard architectures.
Summary
This paper investigates the architectural diversity of foundation models (FMs) in geospatial multimodal reasoning, focusing on their performance trade-offs. The authors conduct a systematic comparison of three leading FM architectures—DOFA, SatMAE, and Flex—using standardized pretraining and evaluation protocols. By employing identical self-supervised learning objectives and datasets, the study isolates architectural contributions from other variables, allowing for a fair assessment of model flexibility, modality alignment, and downstream task performance. The results reveal critical insights into how tokenization and fusion strategies impact model robustness and spectral reasoning, highlighting trade-offs between flexibility and generalization. The findings suggest that while Flex's modular design enhances adaptability to diverse spectral bands, it may underperform in homogeneous settings, emphasizing the need for alignment between architecture and data diversity. Overall, this research provides practical guidance for developing next-generation geospatial foundation models capable of robust multimodal reasoning.
Methodology
The study compares three foundation model architectures (Flex, DOFA, and SatMAE) under controlled conditions using a shared Sentinel-2 dataset. All models are pretrained with identical self-supervised learning strategies and evaluated on the GeoBench benchmark for classification and segmentation tasks. The evaluation employs linear probing for classification and a shared decoder for segmentation, ensuring consistent adaptation across modalities.
Results
The results indicate that architectural choices significantly influence model performance in geospatial tasks. Flex demonstrated improved adaptability to diverse spectral bands but showed limitations in spectrally homogeneous scenarios. The standardized benchmarking approach provided clear insights into the strengths and weaknesses of each architecture, facilitating a better understanding of design trade-offs.
Implications
The findings have significant implications for the development of future geospatial foundation models, particularly in enhancing their ability to process and reason across multiple modalities. This research could lead to improved applications in environmental monitoring, disaster response, and agricultural assessment by leveraging robust multimodal reasoning capabilities.
A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling
Generative Models
Optimization
Theory
- Introduces a stabilized path-space framework for diffusion-based posterior sampling.
- Connects posterior sampling to stochastic optimal control, enhancing uncertainty quantification.
- Eliminates bias from the initial value function through time reparameterization.
- Demonstrates improved accuracy and robustness in benchmark inverse problems.
Summary
This paper presents a novel stabilized path-space framework for diffusion-based posterior sampling, addressing the limitations of existing methods that rely on heuristic approximations. The authors propose a method that formulates posterior sampling as a measure-matching problem in the path space of a diffusion model. By defining a likelihood-weighted target measure on trajectories, they connect diffusion posterior sampling to stochastic optimal control while maintaining the necessary Bayesian structure for uncertainty quantification. A key innovation is the introduction of a time reparameterization that eliminates bias from the unknown initial value function without the need for auxiliary training. The control is learned through a trust-region path-space optimization method with log-variance objectives. The framework also provides a unified perspective on existing guidance-based samplers and quantifies sampling errors induced by approximate controls. The proposed method is evaluated on benchmark inverse problems, demonstrating improved accuracy and robustness compared to leading approaches, thus offering insights into the behavior of diffusion-based posterior samplers.
Methodology
The authors develop a path-space formulation for posterior sampling that involves defining a likelihood-weighted target measure on trajectories. They optimize divergences between the controlled path measure and the target using a trust-region path-space optimization method, allowing for effective learning of the control without auxiliary training.
Results
The proposed framework was tested on a suite of benchmark inverse problems, showing significant improvements in sampling accuracy and robustness compared to existing diffusion-based posterior samplers. The experiments provided principled assessments of sampling accuracy and uncertainty quantification.
Implications
This work has potential applications in various fields requiring Bayesian inference and uncertainty quantification, such as fluid dynamics, medical imaging, and computational biology. The improved sampling methods can enhance decision-making processes in these domains.
Reliability of Probabilistic Emulation of Physical Systems
Generative Models
Theory
Efficient ML
- Developed a framework to evaluate the reliability of probabilistic emulation methods.
- CRPS-trained ensembles generally provide more reliable uncertainties and faster inference than generative models.
- Generative models trained in latent space can achieve comparable coverage to CRPS ensembles but with higher latency.
- Introduced AutoCast and AutoSim for modular modeling and flexible dataset generation.
Summary
This paper addresses the reliability of probabilistic emulation methods for physical systems, focusing on two dominant approaches: generative models and CRPS-trained ensembles. While both methods have shown strong predictive accuracy, their uncertainty quantification (UQ) reliability has not been systematically evaluated. The authors develop a framework to assess these approaches across various 2D spatiotemporal physical systems, emphasizing empirical coverage of predictive intervals, accuracy, and computational efficiency. The findings reveal that CRPS-trained ensembles generally provide more reliable uncertainties and faster inference compared to generative models, particularly when trained in ambient space. When generative models are trained in latent space, they can achieve comparable coverage to CRPS ensembles but at a higher inference cost. The paper introduces AutoCast, a modular framework for implementing both modeling approaches, and AutoSim, a dataset generation package that facilitates rapid prototyping. The results underscore the importance of reliable UQ in real-world applications, highlighting future research directions for improving probabilistic forecasts in complex physical emulation.
Methodology
The authors compared two probabilistic modeling approaches: generative models (e.g., diffusion and flow matching) and CRPS-trained ensembles. They assessed their performance using empirical coverage of predictive intervals, accuracy metrics, and computational efficiency across diverse 2D spatiotemporal physical systems. The study utilized AutoSim for dataset generation and AutoCast for implementing the modeling frameworks.
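The two evaluation quantities named here, CRPS and empirical coverage, can be computed from ensemble members as sketched below. The CRPS function uses the standard ensemble estimator, mean |x_i - y| minus half the mean |x_i - x_j|. The nearest-rank interval used for coverage and the function names are illustrative assumptions, not the AutoCast API.

```python
import math


def crps_ensemble(members, observation):
    """Standard ensemble CRPS estimate for one scalar observation."""
    n = len(members)
    accuracy = sum(abs(x - observation) for x in members) / n
    spread = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return accuracy - spread


def empirical_coverage(ensembles, observations, level=0.9):
    """Fraction of observations inside the central `level` ensemble interval."""
    alpha = (1.0 - level) / 2.0
    hits = 0
    for members, y in zip(ensembles, observations):
        ordered = sorted(members)
        last = len(ordered) - 1
        lower = ordered[math.floor(alpha * last)]
        upper = ordered[math.ceil((1.0 - alpha) * last)]
        hits += lower <= y <= upper
    return hits / len(observations)


ensemble = [0.0, 1.0, 2.0, 3.0, 4.0]
print(crps_ensemble(ensemble, 2.0))
print(empirical_coverage([ensemble, ensemble], [2.0, 3.5], level=0.5))
```

A well-calibrated emulator has empirical coverage close to the nominal level; coverage far below it signals overconfident predictive intervals.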
Results
The study found that CRPS-trained ensembles typically achieved better empirical coverage and faster inference than generative models, particularly generative models trained in ambient space. When generative models were trained in latent space, they achieved coverage comparable to CRPS ensembles but with higher latency. The results highlight the trade-offs between reliability and computational efficiency in probabilistic emulation.
Implications
The findings have significant implications for the deployment of probabilistic emulators in real-world applications, such as climate modeling and materials science, where reliable uncertainty quantification is crucial for informed decision-making and risk assessment. The introduction of AutoCast and AutoSim also facilitates further research and development in this area.
SupraBench: A Benchmark for Supramolecular Chemistry
Large Language Models
NLP
- Introduction of SUPRABENCH, the first benchmark for supramolecular chemistry tasks.
- Development of four fundamental tasks and one auxiliary task for evaluating LLMs.
- Release of SUPRAPMC, a large corpus of supramolecular chemistry articles for research.
- Benchmarking reveals significant performance gaps in existing LLMs, highlighting areas for improvement.
Summary
The paper introduces SUPRABENCH, the first benchmark specifically designed for evaluating large language models (LLMs) in the context of supramolecular chemistry. This field focuses on non-covalent host–guest assemblies, which are crucial for various applications such as drug delivery and chemical sensing. The authors highlight the challenges in designing host–guest systems, which typically require extensive computational verification. To address this, they propose a systematic evaluation framework comprising four fundamental tasks: binding affinity prediction, top-binder selection, solvent identification, and host–guest description, along with an auxiliary vision-based task for molecular identification. Additionally, they release SUPRAPMC, a 16M-token corpus of supramolecular chemistry literature to facilitate domain adaptation for LLMs. The evaluation of multiple LLMs reveals significant performance gaps, indicating that while domain adaptation improves results, there remains substantial room for improvement across all tasks. The authors provide insights into the strengths and weaknesses of current LLMs in this domain, aiming to foster further research and development in supramolecular chemistry.
Methodology
The authors collaborated with domain experts to define the benchmark tasks and curated a comprehensive dataset (SUPRAPMC) from existing literature. They then evaluated a range of open and proprietary LLMs on these tasks, analyzing their performance and identifying specific failure modes.
Results
The benchmarking results indicated that LLMs exhibit substantial headroom across all evaluated tasks. Domain adaptation through pretraining on SUPRAPMC improved performance, particularly in in-distribution regression tasks, but also revealed trade-offs in output formatting. The difficulty profile varied significantly across task families, exposing distinct challenges in supramolecular chemical reasoning.
Implications
The introduction of SUPRABENCH and SUPRAPMC is expected to accelerate research in supramolecular chemistry by providing a standardized evaluation framework for LLMs. This could lead to more efficient design processes for host–guest systems, ultimately benefiting applications in drug delivery and chemical sensing.
Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning
Optimization
Efficient ML
Theory
- SCSB transitions from uniform priors to sparse posteriors in bagging ensembles.
- Introduces a concave quadratic penalty to address the L1-simplex paradox.
- Achieves up to 96% ensemble compression with linear inference speedups.
- Improves probability calibration and generalization accuracy.
Summary
This paper introduces Simplex-Constrained Sparse Bagging (SCSB), a novel framework aimed at enhancing the performance of bootstrap-based bagging ensembles by transitioning from uniform priors to sparse posteriors. Traditional bagging methods, such as Random Forests and Bagged SVMs, assign equal voting power to all base estimators, which can lead to overconfidence in model predictions and inefficient use of computational resources. SCSB addresses these issues by formulating ensemble pruning and calibration as a joint optimization problem over the probability simplex, minimizing Out-Of-Bag (OOB) loss to ensure effective model performance without data leakage. A key innovation is the introduction of a concave quadratic penalty to overcome the L1-simplex paradox, which typically hinders sparsity in simplex-constrained optimization. The authors demonstrate that SCSB can achieve up to 96% ensemble compression, resulting in faster inference times and improved probability calibration while maintaining or enhancing generalization accuracy. The framework is model-agnostic and provides a mathematically rigorous approach to optimizing ensemble weights, making it a significant contribution to the field of ensemble learning.
Methodology
The SCSB framework formulates the optimization of ensemble weights as a constrained problem over the probability simplex, minimizing OOB loss. It employs a concave quadratic penalty to induce sparsity in the weight vector, allowing for the pruning of less competent estimators. Analytical gradients for both classification and regression tasks are derived to facilitate efficient optimization using Sequential Least Squares Programming (SLSQP).
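The reason an L1 penalty cannot induce sparsity on the probability simplex, and why a concave quadratic can, is easy to check numerically. The sketch below assumes one possible concave quadratic, lam * (1 - sum of squared weights); the paper's exact penalty and its SLSQP setup are not given in this summary.

```python
def l1_norm(weights):
    """L1 norm; equal to 1 for every point on the probability simplex."""
    return sum(abs(w) for w in weights)


def concave_penalty(weights, lam=1.0):
    """Assumed concave quadratic: lam * (1 - ||w||^2).

    It is largest for uniform weights and zero at a simplex vertex,
    so minimizing it pushes mass onto fewer estimators.
    """
    return lam * (1.0 - sum(w * w for w in weights))


uniform = [0.25, 0.25, 0.25, 0.25]   # classic bagging: equal votes
sparse = [0.5, 0.5, 0.0, 0.0]        # two estimators pruned
vertex = [1.0, 0.0, 0.0, 0.0]        # a single surviving estimator

for name, w in [("uniform", uniform), ("sparse", sparse), ("vertex", vertex)]:
    print(name, l1_norm(w), concave_penalty(w))
```

All three weight vectors have the same L1 norm, so an L1 term adds only a constant to the objective, while the concave penalty strictly decreases as the weights become sparser.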
Results
The empirical results indicate that SCSB can compress ensembles by up to 96%, leading to significant reductions in computational latency during inference. Additionally, the method enhances probability calibration, as evidenced by a lower Expected Calibration Error (ECE), while preserving or improving the generalization accuracy of the models.
Implications
The SCSB framework has the potential to make ensemble learning more efficient and applicable in resource-constrained environments, such as real-time systems. By improving model calibration and reducing redundancy, it can enhance the deployment of machine learning models in various applications, including those requiring high reliability and speed.
CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts
Time Series
Multimodal
Large Language Models
- CausalMoE addresses the limitations of existing GCD methods by modeling patch-level temporal heterogeneity.
- The model utilizes a Pattern-Routed Mixture of Heterogeneous Experts to route time-series data to specialized experts.
- Integration of LLMs and VLMs allows for the incorporation of multimodal semantic priors in causal discovery.
- CausalMoE achieves state-of-the-art results on supervised benchmarks and demonstrates effective generalization in few-shot settings.
Summary
CausalMoE introduces a novel approach to Granger Causal Discovery (GCD) by addressing the limitations of existing neural GCD methods that rely on a uniform distribution model, which often leads to spurious causal graphs due to the inability to capture dynamic regime changes in time series data. The proposed model is a billion-scale multimodal foundation model that employs a Pattern-Routed Mixture of Heterogeneous Experts (MoHE) architecture. This architecture dynamically identifies latent temporal patterns and routes time-series patches to specialized domain experts, effectively separating regime-specific mechanisms from shared dynamics. Additionally, CausalMoE integrates Large Language Models (LLMs) and Vision-Language Models (VLMs) to align numerical signals with textual and visual information, enhancing causal estimation in complex scenarios. The model employs a Causality-Aware Self-Attention mechanism to ensure interpretable graph recovery, yielding sparse Granger causal graphs through proximal optimization. Extensive experiments demonstrate that CausalMoE achieves state-of-the-art performance on fully supervised benchmarks and shows robust generalization capabilities in few-shot settings, outperforming traditional methods that struggle in such scenarios.
Methodology
CausalMoE employs a Pattern-Routed Mixture of Heterogeneous Experts architecture that dynamically routes time-series patches to appropriate domain experts based on identified latent temporal patterns. It integrates LLMs and VLMs to leverage multimodal information, enhancing the causal discovery process. The model also incorporates a Causality-Aware Self-Attention mechanism for interpretable graph recovery, utilizing proximal optimization to yield sparse Granger causal graphs.
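The proximal step mentioned above is commonly realized as group soft-thresholding, which can set a whole source-to-target weight group exactly to zero and so produce a sparse graph. The sketch below illustrates that generic operator only; it is not CausalMoE's code. The function names, the `W[target][source]` layout of lag coefficients, and the threshold are assumptions made for illustration.

```python
import math

def prox_group_lasso(group, threshold):
    """Group soft-thresholding: shrink the group's L2 norm by `threshold`,
    returning all zeros when the norm is at or below the threshold."""
    norm = math.sqrt(sum(w * w for w in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * w for w in group]

def granger_graph(weights, threshold):
    """weights[i][j] holds the lag coefficients from source series j to
    target series i. Edge j -> i survives if its group is not zeroed."""
    adjacency = []
    for row in weights:
        adjacency.append([
            int(any(w != 0.0 for w in prox_group_lasso(group, threshold)))
            for group in row
        ])
    return adjacency

# Hypothetical lag weights for 2 series with 3 lags each.
W = [
    [[0.9, 0.4, 0.1], [0.02, -0.01, 0.03]],  # target 0: weak link from series 1
    [[0.5, -0.6, 0.2], [0.7, 0.1, 0.0]],     # target 1: depends on both series
]
A = granger_graph(W, threshold=0.1)
```

Here the weak group from series 1 to series 0 is zeroed, so that edge is absent from `A`, while the three stronger groups survive.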
Results
CausalMoE sets a new state of the art on fully supervised benchmarks for Granger causal discovery and generalizes effectively to few-shot settings, outperforming traditional GCD methods that rely on a uniform distribution model.
Implications
The development of CausalMoE has significant implications for causal discovery in complex systems, enabling more accurate analysis of temporal dependencies in various fields such as economics, genetics, and social sciences. Its multimodal capabilities could enhance the understanding of causal relationships where numerical data alone is insufficient.
DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics
Large Language Models
Efficient ML
NLP
- DynamicPTQ addresses the issue of quantization collapse in activations during PTQ.
- The method introduces phase-aware mixed-precision quantization based on residual-stream dynamics.
- DynamicPTQ improves model performance while maintaining low memory overhead.
- The approach can be integrated with existing PTQ methods like QuaRot and SpinQuant.
Summary
The paper addresses the challenges of post-training quantization (PTQ) for large language models (LLMs), particularly when quantizing activations to 4-bit precision. The authors identify that massive activations, which dominate the activation range, lead to significant quantization errors. Existing methods primarily focus on transformation-based smoothing techniques but fail to account for the cross-layer dynamics of the residual stream. The authors propose DynamicPTQ, a novel quantization policy that recognizes the phase-wise behavior of massive activations across network depth. By introducing metrics such as Jump Ratio and Historical Feature SNR, they demonstrate that static smoothing methods are insufficient for addressing dynamic quantization instability. DynamicPTQ selectively applies 8-bit precision to quantization-sensitive layers while maintaining 4-bit precision for others, allowing for improved performance without extensive memory overhead. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ improves perplexity and zero-shot QA performance while also raising throughput.
Methodology
The authors analyze the behavior of massive activations across different phases of network depth and introduce metrics to quantify the impact of residual changes on quantization. DynamicPTQ is developed to dynamically assign quantization precision based on the sensitivity of layers to quantization errors, allowing for a mixed-precision approach.
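To make the effect of a massive activation concrete, the sketch below quantizes a small activation vector with symmetric per-tensor rounding at 4 and 8 bits and compares the error. It is a toy illustration under stated assumptions: the activation values are invented, and `choose_bits` uses a simple max-to-median ratio as a placeholder sensitivity test, not the paper's Jump Ratio or Historical Feature SNR.

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization: round to a signed integer grid
    whose scale is set by the largest magnitude, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def choose_bits(values, ratio_threshold=20.0):
    """Placeholder policy: use 8 bits when one value dwarfs the median."""
    mags = sorted(abs(v) for v in values)
    median = mags[len(mags) // 2]
    ratio = mags[-1] / median if median > 0 else float("inf")
    return 8 if ratio > ratio_threshold else 4

# Hypothetical activations: small values plus one massive outlier.
activations = [0.3, -0.7, 1.1, -0.2, 0.5, 0.9, -1.3, 120.0]

err4 = mse(activations, quantize(activations, 4))
err8 = mse(activations, quantize(activations, 8))
bits_outlier_layer = choose_bits(activations)
bits_tame_layer = choose_bits([0.3, -0.7, 1.1, -0.2])
```

With the outlier setting the scale, every small value rounds to zero at 4 bits, which is the collapse a mixed-precision policy tries to avoid by spending 8 bits only where it matters.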
Results
DynamicPTQ consistently improves perplexity and zero-shot QA performance on LLaMA-2 and LLaMA-3 under W4A4KV4 quantization settings. The method achieves a throughput improvement of 1.05× to 1.07× with modest memory overhead, demonstrating its effectiveness in practical low-bit LLM inference.
Implications
The findings suggest that incorporating dynamic analysis of activation behavior can lead to more robust quantization strategies for LLMs, potentially enabling more efficient deployment in resource-constrained environments while maintaining performance.
Representing Time Series as Structured Programs for LLM Reasoning
Large Language Models
Time Series
- Introduction of T2SP, a structured representation for time series that aligns with LLM capabilities.
- T2SP is deterministic, invertible, and training-free, making it compatible with off-the-shelf LLMs.
- Demonstrated improvements in reasoning performance and reduced inference time across various tasks.
- Addresses the representation mismatch that hampers LLMs' performance on time series analysis.
Summary
This paper addresses the challenge of effectively representing time series data for large language models (LLMs), which excel in reasoning and instruction-following but struggle with raw numerical sequences. The authors propose a novel representation method called Time-Series-to-Structured-Program (T2SP), which transforms time series into a structured symbolic program format. This approach decomposes time series into trends, periods, and salient events, allowing LLMs to leverage their existing capabilities without the need for fine-tuning or extensive computational resources. The T2SP representation aligns with the textual and code-like modalities that LLMs are trained on, thus facilitating better reasoning over time series data. The authors evaluate T2SP on three reasoning tasks: editing, captioning, and question answering, demonstrating that it consistently improves performance, reduces reasoning time, and lowers failure rates compared to traditional raw-string representations. The findings suggest that T2SP provides an effective interface between time series and LLMs, enhancing the models' ability to understand and analyze temporal data.
Methodology
The authors developed the Time-Series-to-Structured-Program (T2SP) representation, which decomposes time series data into structured components (trends, periods, events) and expresses them in a program-friendly syntax. This method allows LLMs to reason directly over the temporal structure without requiring them to infer it from raw numerical sequences. The evaluation involved testing T2SP on three reasoning tasks to assess its performance compared to raw-string representations.
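A rough sketch of what a trend, period, and event decomposition rendered as program-like text could look like is given below. This is not the T2SP algorithm or its syntax: the least-squares trend, the autocorrelation-based period pick, the z-score event rule, and the `trend(...)`, `season(...)`, `event(...)` line format are all assumptions chosen for illustration, and unlike T2SP this toy version is not invertible.

```python
import math

def fit_line(y):
    """Least-squares slope and intercept against t = 0, 1, 2, ..."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def dominant_period(resid, max_lag):
    """Lag in [2, max_lag] with the largest (unnormalized) autocorrelation."""
    def autocorr(lag):
        return sum(resid[i] * resid[i - lag] for i in range(lag, len(resid)))
    return max(range(2, max_lag + 1), key=autocorr)

def to_program(y, event_z=3.0):
    """Render a series as a few program-like lines an LLM could read."""
    slope, intercept = fit_line(y)
    resid = [v - (intercept + slope * i) for i, v in enumerate(y)]
    period = dominant_period(resid, len(y) // 2)
    std = math.sqrt(sum(r * r for r in resid) / len(resid))
    events = [(i, round(r, 2)) for i, r in enumerate(resid)
              if abs(r) > event_z * std]
    lines = [
        f"trend(slope={slope:.3f}, intercept={intercept:.3f})",
        f"season(period={period})",
    ]
    lines += [f"event(t={i}, residual={r})" for i, r in events]
    return "\n".join(lines)

# Synthetic series: linear trend, period-12 cycle, one spike at t=30.
series = [0.1 * t + math.sin(2 * math.pi * t / 12) for t in range(48)]
series[30] += 8.0
program = to_program(series)
```

The resulting text is a handful of short lines instead of 48 raw numbers, which is the kind of compression that lets a model reason about structure rather than digits.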
Results
The evaluation results showed that T2SP consistently outperformed raw-string representations in terms of reasoning performance, reduced inference time, and lower failure rates across the three tasks (editing, captioning, and question answering). The structured representation allowed LLMs to leverage their existing reasoning capabilities more effectively, particularly with longer sequences.
Implications
The findings suggest that T2SP can significantly enhance the ability of LLMs to analyze and interpret time series data, making it a valuable tool for applications in various domains such as finance, healthcare, and environmental monitoring where time series analysis is critical. This approach may also inspire further research into structured representations for other types of data.
Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
Optimization
Large Language Models
NLP
- Different transformer modules (attention vs. MLP) prefer different weight-space geometries.
- Stiefel geometry for attention layers and DGram geometry for MLP layers yields optimal performance.
- Uniform manifold constraints can lead to instability in training dynamics.
- Singular value growth in DGram-constrained attention weights can cause softmax saturation.
Summary
This paper investigates the impact of weight-space geometry on the optimization of transformer models, specifically focusing on the GPT-2 architecture. The author questions the conventional approach of applying uniform manifold constraints across all weight matrices and proposes that different transformer modules, such as attention and MLP layers, may benefit from distinct manifold geometries. The study employs the Manifold Muon optimization technique to analyze layer-wise assignments of Stiefel and DGram constraints. The findings reveal that constraining attention layers with Stiefel geometry while applying DGram geometry to MLP layers yields the best performance. In contrast, the reverse assignment and an all-DGram configuration lead to instability during training. The paper attributes this instability to singular value growth in DGram-constrained attention weights, which can amplify logits and induce softmax saturation, thereby degrading training dynamics. The results advocate for module-specific optimization strategies in transformer architectures, emphasizing the need for tailored weight-space geometries to enhance stability and performance.
Methodology
The study utilizes the Manifold Muon optimization framework to explore layer-wise manifold assignments in GPT-2 pretraining. It compares five different configurations of Stiefel and DGram constraints applied to attention and MLP layers, assessing their impact on training stability and performance.
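For intuition about the Stiefel constraint, the sketch below projects a small matrix onto the set of matrices with orthonormal columns using a basic Newton-Schulz iteration. It is a minimal illustration in pure Python, not the Manifold Muon update, and it does not attempt the DGram constraint. The example matrix and helper names are assumptions.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def project_to_stiefel(w, steps=50):
    """Approximate the nearest matrix with orthonormal columns via the
    Newton-Schulz iteration X <- 1.5 X - 0.5 X (X^T X)."""
    fro = sum(v * v for row in w for v in row) ** 0.5
    x = [[v / fro for v in row] for row in w]  # singular values now <= 1
    for _ in range(steps):
        correction = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * a - 0.5 * c for a, c in zip(row_x, row_c)]
             for row_x, row_c in zip(x, correction)]
    return x

W = [[2.0, 0.5], [0.1, 3.0], [-1.0, 0.4]]  # hypothetical 3x2 weight matrix
Q = project_to_stiefel(W)
gram = matmul(transpose(Q), Q)  # close to the 2x2 identity

# Orthonormal columns preserve vector norms, so outputs cannot blow up.
out = matmul(Q, [[3.0], [4.0]])
out_norm = sum(v[0] ** 2 for v in out) ** 0.5
```

Because every singular value of `Q` equals one, products involving `Q` cannot grow in scale, which fits the paper's account of why unconstrained singular value growth in attention weights can push logits toward softmax saturation.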
Results
The experiments demonstrate that the HETERO configuration (Stiefel for attention and DGram for MLP) achieves the best validation loss, while the HETERO-INV configuration (DGram for attention and Stiefel for MLP) and the ALL-DGRAM configuration both result in unstable training outcomes. The findings indicate that the choice of manifold geometry significantly affects the training dynamics of transformer models.
Implications
The results suggest that transformer optimization should consider the specific computational roles of different modules, leading to improved training stability and performance. This approach could inform the design of more effective optimization strategies for large language models and other transformer-based architectures.
Loss-Shift Transfer via Bayes Quotients
Theory
- Identifies loss shift as a distinct transfer failure mode from distribution shift.
- Introduces Bayes quotients to compare losses and their refinements.
- Establishes that a representation optimal for a coarser loss is insufficient for a finer loss.
- Quantifies the frozen-transfer gap in terms of conditional mutual information.
Summary
This paper introduces the concept of loss shift in transfer learning, which occurs when the data distribution remains constant while the loss function changes. Unlike traditional transfer learning that focuses on distribution shifts, the author identifies a failure mode where a representation trained for a coarser loss (like accuracy) may not suffice for a finer loss (like log loss) under the same joint distribution of input-output pairs. The paper formalizes this idea using Bayes quotients, which rank losses by their refinement levels. It demonstrates that a representation that is optimal for a coarser loss is inadequate for a finer one, leading to a frozen-transfer gap quantified by the conditional mutual information between the output and the representation. The findings are supported by experiments across various settings, showing that representations can yield different optimal performances despite being classification-equivalent. This work contributes to the understanding of representation sufficiency in transfer learning and highlights the importance of aligning loss functions with the required predictive information.
Methodology
The paper employs a theoretical framework based on Bayes quotients to analyze the relationship between different loss functions and their impact on representation learning. It formalizes the concept of loss shift and uses experiments to validate the theoretical predictions across various settings, including controlled, learned, synthetic, and real-image scenarios.
Results
The main results indicate that the frozen-transfer gap, which arises when transferring representations trained under different loss functions, can be precisely quantified as the conditional mutual information about the output that is lost due to the representation. This gap highlights the limitations of using representations trained for simpler tasks when faced with more complex predictive objectives.
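A small numeric example can make the gap tangible. In the sketch below, a representation Z keeps only the Bayes-optimal class label of each input. Predicting from Z gives the same 0-1 error as predicting from X, but a strictly worse log loss, and the difference H(Y|Z) - H(Y|X) equals I(Y; X | Z) because Z is a function of X. The four-input distribution is invented for illustration and is not taken from the paper.

```python
import math

# Hypothetical setup: X uniform over 4 inputs, binary Y with P(Y=1 | X=x).
p_y1_given_x = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}
p_x = {x: 0.25 for x in p_y1_given_x}

def z(x):
    """Representation that keeps only the Bayes-optimal class label."""
    return int(p_y1_given_x[x] > 0.5)

def entropy(p):
    """Binary entropy in nats."""
    return -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)

def bayes_risks(groups):
    """Bayes 0-1 error and Bayes log loss when only the group id is seen."""
    err = logloss = 0.0
    for members in groups:
        mass = sum(p_x[x] for x in members)
        p1 = sum(p_x[x] * p_y1_given_x[x] for x in members) / mass
        err += mass * min(p1, 1 - p1)
        logloss += mass * entropy(p1)
    return err, logloss

# Group inputs by their representation value.
by_z = {}
for x in p_x:
    by_z.setdefault(z(x), []).append(x)

err_x, ll_x = bayes_risks([[x] for x in p_x])  # predict from X itself
err_z, ll_z = bayes_risks(list(by_z.values()))  # predict from Z only
gap = ll_z - ll_x  # equals I(Y; X | Z) in nats
```

Both predictors reach 25% error, yet the log-loss gap is positive, so a representation that is sufficient for accuracy still discards information that the finer loss needs.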
Implications
The findings suggest that careful consideration of loss functions is crucial in transfer learning scenarios, as mismatches can lead to significant performance drops. This work may influence future research on representation learning, particularly in developing models that can adapt to varying loss requirements without losing critical predictive information.
Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models
Graph Learning
Time Series
Theory
- Identification of 'attribution bypass' in graph-based neural marketing mix models.
- Introduction of DICE-MMM as a two-stage diagnostic framework for graph learning.
- Demonstration that low forecasting error can coexist with misaligned attribution graphs.
- Empirical evidence showing that oracle graphs significantly improve attribution diagnostics.
Summary
This paper addresses the distinction between forecasting and attribution in marketing mix models (MMM), particularly focusing on a failure mode termed 'attribution bypass' in graph-based neural MMMs. The authors introduce DICE-MMM, a diagnostic framework designed to separate the recovery of a plausible temporal graph, accurate forecasting, and the alignment of decoder-induced influence with the graph used for attribution. The framework consists of two stages: the first trains a graph encoder with a restricted graph-mediated decoder, while the second freezes the encoder and trains a graph-safe latent decoder. The study evaluates the decoder's performance using counterfactual influence graphs (CIG), autoregressive rollout influence graphs (AR-CIG), and frozen-decoder graph-swap tests. The findings reveal that while DICE-MMM improves graph recovery over existing methods, low forecasting error does not guarantee accurate attribution. The paper concludes that the learned graph interfaces and current selection methods are insufficient for reliable attribution, highlighting the need for better deployable graph-support selection.
Methodology
The methodology involves a two-stage framework called DICE-MMM. In Stage 1, a graph encoder is trained with a restricted graph-mediated decoder to prevent high-capacity models from dominating graph discovery. In Stage 2, the encoder is frozen, and a graph-safe latent decoder is trained, ensuring that the decoder's influence is aligned with the supplied graph. The performance is evaluated using various diagnostic tests, including CIG and AR-CIG metrics, as well as frozen-decoder graph swaps.
Results
The results indicate that DICE-MMM improves stable graph recovery compared to existing models such as CausalMMM, but they also show that low mean squared error (MSE) does not imply valid attribution. In tests, oracle graphs yielded high AR-CIG scores, while learned graphs performed poorly, indicating that the decoder is not graph-blind but that current graph interfaces are inadequate. The study emphasizes that the bottleneck lies in the selection of deployable graph-support methods rather than in forecasting capabilities.
Implications
The findings suggest that while forecasting models may perform well, they can still fail in providing accurate attribution, which is critical for effective marketing decision-making. The DICE-MMM framework offers a diagnostic tool for identifying when a learned graph is not suitable for attribution, guiding future research towards improving graph-support selection methods.
A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning
Optimization
- Development of a machine learning tool for green solvent screening.
- Integration of uncertainty quantification in predictions.
- High performance achieved on limited data targets.
- Augmentation of solubility descriptor data by up to two orders of magnitude.
Summary
This paper addresses the challenge of accurately predicting solubility parameters, which is crucial for the development of sustainable materials in industries such as photovoltaics and batteries. The authors propose a machine learning-based green solvent screening tool built on a foundation model pre-trained on QM9 targets, allowing for effective transfer learning with minimal data requirements. The tool incorporates uncertainty quantification, enabling users to assess the confidence of predictions. The authors successfully predict Hansen solubility parameters and dielectric constants, achieving high performance even on targets with limited data, such as Gutmann donor and acceptor numbers. The methodology significantly augments the available data on solubility descriptors, enhancing it by up to two orders of magnitude. The tool is designed for ease of use and integration with high-throughput laboratories, facilitating the ranking and screening of potential solvent substitutes. Notably, the study not only rediscovered known green solvents but also proposed new candidates, demonstrating its relevance in the search for eco-friendly solvents.
Methodology
The authors employed a transfer learning approach using a model pre-trained on QM9 targets, integrating uncertainty quantification to enhance prediction confidence. The model was adapted to predict various solubility parameters, including Hansen parameters and dielectric constants, with a focus on limited data scenarios.
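One way predicted Hansen parameters and their uncertainties might feed a screening step is sketched below. The Hansen distance formula Ra = sqrt(4(ΔδD)² + (ΔδP)² + (ΔδH)²) is standard; everything else is an assumption for illustration, including the candidate names, all numeric values, the `max_std` cutoff, and the idea of flagging rather than discarding uncertain predictions. The paper's actual ranking procedure may differ.

```python
import math

def hansen_distance(a, b):
    """Ra between two (dD, dP, dH) triples, in MPa^0.5."""
    return math.sqrt(
        4 * (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2
    )

def screen(target, candidates, max_std=1.0):
    """Rank candidates by distance to the target solvent and flag whether
    each prediction's uncertainty is within the (hypothetical) cutoff."""
    ranked = sorted(
        candidates,
        key=lambda name: hansen_distance(target, candidates[name]["hsp"]),
    )
    return [
        (name,
         round(hansen_distance(target, candidates[name]["hsp"]), 2),
         candidates[name]["std"] <= max_std)
        for name in ranked
    ]

# Invented values: solvent to replace, and model predictions with std.
target = (18.0, 12.0, 7.0)
candidates = {
    "candidate_1": {"hsp": (17.5, 11.0, 8.0), "std": 0.3},
    "candidate_2": {"hsp": (16.0, 6.0, 12.0), "std": 0.2},
    "candidate_3": {"hsp": (18.2, 12.5, 6.5), "std": 2.5},
}
ranking = screen(target, candidates)
```

In this toy run the closest candidate is also the least certain one, which is the situation where an uncertainty estimate changes what a chemist would test first.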
Results
The proposed tool demonstrated high predictive accuracy for Hansen solubility parameters and dielectric constants, with significant performance on Gutmann donor and acceptor numbers despite limited data. The tool expanded the dataset of solubility descriptors substantially, facilitating the identification of both known and novel green solvents.
Implications
The findings suggest that machine learning can effectively address the challenges of solvent selection in sustainable chemistry, potentially leading to the adoption of greener alternatives in various industrial applications. The tool's design allows for integration into existing laboratory workflows, enhancing the efficiency of solvent screening processes.
Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents
Large Language Models
NLP
- Identifies a gap between user correction access and compliance in coding agents.
- Introduces Trace, a skill-layer pipeline that converts user corrections into runtime-enforceable rules.
- Demonstrates that memory alone is insufficient for ensuring compliance with user preferences.
- Achieves significant reductions in preference violations across multiple coding tasks.
Summary
This paper addresses the limitations of interactive coding agents in adhering to user corrections across sessions. Despite the ability to remember user corrections, these agents often fail to comply with them, leading to repeated user frustration. The authors introduce a novel approach called Test-time Rule Acquisition and Compiled Enforcement (Trace), which transforms user corrections into enforceable runtime rules. This method compiles user feedback into a library of atomic rules that agents must follow during task execution. The effectiveness of Trace is evaluated through simulated user-in-the-loop experiments on two benchmarks: ClawArena and MemoryArena. The results demonstrate a significant reduction in preference violations, indicating that Trace effectively bridges the gap between memory access and compliance, thereby enhancing the collaborative capabilities of coding agents.
Methodology
The authors developed Trace as a drop-in skill-layer pipeline that extracts user corrections from natural language feedback, formulates them into atomic rules, and compiles these rules into runtime checks for coding agents. The effectiveness of Trace was evaluated using simulated user-in-the-loop experiments on two coding task benchmarks, ClawArena and MemoryArena, measuring preference violation rates in both in-distribution and out-of-distribution scenarios.
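The difference between remembering a correction and enforcing it can be sketched as follows. This is a hypothetical miniature, not Trace's implementation: the `Rule` type, the lookup table standing in for natural-language extraction, and the example corrections are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    name: str
    violates: Callable[[str], bool]  # True if a proposed command breaks the rule

def compile_rule(correction: str) -> Rule:
    """Stand-in for the extraction step: map a known correction phrase to a
    check. A real pipeline would derive this from free-form feedback."""
    table = {
        "use pnpm, not npm": Rule(
            "prefer-pnpm", lambda cmd: cmd.split()[:1] == ["npm"]
        ),
        "never force-push": Rule(
            "no-force-push",
            lambda cmd: cmd.startswith("git push") and "--force" in cmd,
        ),
    }
    return table[correction]

def enforce(command: str, rules: List[Rule]) -> List[str]:
    """Return the names of violated rules; an empty list means the
    command may run."""
    return [rule.name for rule in rules if rule.violates(command)]

rules = [compile_rule("use pnpm, not npm"), compile_rule("never force-push")]
blocked = enforce("npm install left-pad", rules)
allowed = enforce("pnpm install left-pad", rules)
```

The check runs on every proposed action, so compliance no longer depends on the agent choosing to consult its memory of the correction.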
Results
On ClawArena tasks, Trace reduced preference violations from 100.0% to 37.6% for in-distribution tasks and from 100.0% to 2.0% for out-of-distribution tasks. For MemoryArena-derived tasks, it decreased in-distribution violations from 100.0% to 60.5%, while matching or exceeding the strongest memory baseline in task success rates.
Implications
The findings suggest that implementing runtime enforcement of user corrections can significantly enhance the reliability and usability of coding agents, reducing the need for users to repeatedly provide the same corrections. This could lead to more efficient workflows and improved user satisfaction in interactive coding environments.
Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting
Theory
- Introduces a unified mathematical framework for applying OOD detection methods to RF fingerprinting.
- Demonstrates the feasibility of tuning OOD detectors without access to OOD data.
- Achieves comparable performance to traditional methods using true OOD data on the POWDER dataset.
- Establishes a baseline for future research in open-set RF fingerprinting.
Summary
This paper addresses the challenges of applying Out-of-Distribution (OOD) detection methods to Open-Set Radio-Frequency (RF) fingerprinting, a domain where signals from unknown transmitters and temporal drift can lead to distribution shifts during testing. The authors highlight that traditional OOD detectors often require auxiliary OOD data for parameter tuning, which is difficult to obtain in RF environments. They propose a unified mathematical framework based on information theory to systematically analyze and adapt existing OOD detection methods for RF fingerprinting. The study introduces post-hoc OOD detection methods, particularly focusing on feature-shaping approaches, and demonstrates their effectiveness in tuning without OOD data. The evaluation is conducted on the POWDER RF fingerprinting dataset, showing that the proposed methods achieve performance comparable to traditional methods that utilize true OOD tuning data, while significantly outperforming those that do not. This work establishes a baseline for future research in open-set RF fingerprinting and highlights the practical viability of OOD detection in this context.
Methodology
The authors adapt state-of-the-art OOD detection methods from machine learning literature, particularly focusing on feature-shaping approaches. They employ a unified mathematical framework based on information theory to analyze these methods and demonstrate their applicability in RF fingerprinting. The evaluation is performed on the POWDER dataset, comparing the performance of OOD detectors tuned without OOD data against traditional methods.
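As one example of a feature-shaping detector that can be tuned from in-distribution (ID) data alone, the sketch below clips penultimate-layer activations at a percentile of ID activations (in the style of ReAct) and scores inputs with the log-sum-exp of the logits (the energy score). Whether the paper uses this exact combination is not stated in the summary; the feature vectors, the linear head, and the 90th-percentile choice are assumptions for illustration.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def energy_score(features, weights, clip):
    """Clip features at `clip`, apply a linear head, and return the
    log-sum-exp of the logits. Higher scores suggest ID inputs."""
    shaped = [min(f, clip) for f in features]
    logits = [sum(w * f for w, f in zip(row, shaped)) for row in weights]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits))

# Hypothetical penultimate features of known transmitters and a 2-class head.
id_features = [[0.9, 0.1, 0.3], [0.2, 1.0, 0.4], [0.8, 0.2, 0.5], [0.1, 0.9, 0.2]]
head = [[2.0, -1.0, 0.5], [-1.0, 2.0, 0.5]]

# The clipping threshold comes from ID activations only; no OOD data needed.
clip = percentile([f for row in id_features for f in row], 90)

unknown = [0.1, 0.2, 6.0]  # abnormally large activation on one unit
raw_score = energy_score(unknown, head, clip=float("inf"))
shaped_score = energy_score(unknown, head, clip=clip)
id_scores = [energy_score(f, head, clip) for f in id_features]
```

Without clipping, the unknown input scores higher than every ID sample and would be accepted; after clipping, it scores lower than all of them. A rejection threshold can then be set from ID scores as well, for example at a fixed true-positive rate.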
Results
The experimental results show that the OOD detectors tuned without any given OOD data achieve performance levels comparable to those that use true OOD tuning data. Furthermore, these detectors significantly outperform baseline approaches that lack access to true OOD tuning data, indicating their practical viability for RF fingerprinting applications.
Implications
The findings suggest that OOD detection methods can be effectively utilized in open-set RF fingerprinting scenarios, enhancing the security and robustness of RF identification systems. This work opens avenues for further research in adapting OOD techniques for various applications in wireless communication and security.
Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers
Theory
Interpretability
Large Language Models
- Introduction of Frequency Synchronization Degree (FSD) as a predictor for grokking in transformers.
- FSD peaks well before the grokking event, providing a causal link between circuit formation and generalization.
- Weight decay influences the timing of generalization, with a derived empirical scaling law relating timing to weight decay.
- Multi-block transformer architectures show stronger FSD precursors compared to single-layer models.
Summary
This paper investigates the phenomenon of grokking in transformers, where a model trained on modular arithmetic experiences a sudden transition from low to high validation accuracy. The author introduces the Frequency Synchronization Degree (FSD), a novel metric for assessing the synchronization of Fourier-based algorithmic circuits without prior knowledge of the circuit structure. The study finds that FSD reaches its peak 500 to 3,000 steps before the grokking event, establishing it as a reliable predictor of generalization. The research also provides causal evidence that the gap between circuit formation and generalization is a regularization phase, influenced by weight decay during training. The findings suggest that the timing of generalization can be predicted based on the weight decay parameter, with a derived empirical scaling law. Additionally, the paper explores the role of different model architectures in grokking, demonstrating that multi-block architectures exhibit the strongest FSD precursors. Overall, this work enhances the understanding of the mechanisms behind grokking and the formation of Fourier circuits in transformers.
Methodology
The author trained a two-layer transformer on various modular arithmetic configurations and employed the Frequency Synchronization Degree (FSD) to measure Fourier circuit synchronization. Causal interventions were conducted by forking training at specific steps with varying weight decay parameters to observe their effects on grokking timing. The study also included architecture ablation experiments to compare the performance of different model configurations.
Results
The results indicate that FSD reaches its peak 500 to 3,000 steps before grokking across all tested configurations, with a mean lead of +1,722 steps. The causal interventions confirmed that earlier grokking is achievable through specific weight decay settings, demonstrating a consistent inverse relationship between weight decay and timing. The findings also showed that an attention-only transformer model could grok with a strong FSD precursor, while an MLP-only model did not exhibit grokking.
Implications
This research provides insights into the mechanisms of generalization in neural networks, particularly in transformer architectures. The findings could inform the design of more efficient training strategies and architectures that leverage the timing of circuit formation for improved performance in tasks requiring generalization.
Select and Improve: Understanding the Mechanics of Post-Training for Reasoning
NLP
Large Language Models
Reinforcement Learning
- Identifies two core mechanisms of RL post-training: strategy selection and strategy improvement.
- Demonstrates that diverse reasoning patterns in pre-training data are essential for effective strategy selection.
- Shows that RL training with more difficult questions enhances strategy improvement and out-of-distribution generalization.
- Links observed phenomena like strategy amplification and composition to the core mechanisms rather than treating them as separate.
Summary
This paper investigates the mechanisms by which reinforcement learning (RL) enhances the capabilities of reasoning and coding models, specifically focusing on the Qwen-2.5-1.5B model. The authors identify two primary mechanisms: strategy selection and strategy improvement. Strategy selection involves routing problems to existing reasoning patterns learned during pre-training, while strategy improvement enhances these existing patterns. The study emphasizes the importance of high-quality supervised fine-tuning (SFT) data and the role of reinforcement learning data in activating these mechanisms. The findings reveal that strategy selection is crucial for performance and requires diverse reasoning patterns in the pre-training data, whereas strategy improvement necessitates more challenging questions in the RL dataset. The paper also discusses phenomena such as strategy amplification and composition, linking them to the core mechanisms identified. Overall, the research provides a mechanistic understanding of RL post-training, highlighting the need for effective pre-training data to scale reasoning capabilities in language models.
Methodology
The authors conducted controlled experiments using a synthetic finite-field arithmetic task to analyze the effects of reinforcement learning post-training on the Qwen-2.5-1.5B model. They employed a standard language model training setup involving both supervised fine-tuning (SFT) and reinforcement learning (RL), evaluating model performance based on the RL objective on a held-out dataset.
Results
The experiments revealed that strategy selection significantly drives performance improvements, while strategy improvement is contingent on the difficulty of the RL dataset. The study found that RL does not induce novel reasoning patterns but refines existing ones acquired during pre-training. Additionally, phenomena such as strategy amplification and composition were observed as outcomes of the core mechanisms.
Implications
The findings suggest that future advancements in RL post-training should focus on enhancing the quality and diversity of pre-training data. This could lead to more effective scaling of reasoning capabilities in language models, informing both academic research and practical applications in AI development.
Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning
Federated Learning
- FED-FBD provides architecturally guaranteed block-level isolation to prevent adversarial contamination.
- The framework achieves privacy-by-design, with membership inference indistinguishable from chance before additional privacy measures.
- Surgical unlearning is facilitated through aggregate block replacement, achieving minimal AUC loss without retraining.
- Experimental results show that FED-FBD maintains accuracy close to FedAvg while providing enhanced security and privacy.
Summary
The paper introduces FED-FBD, a novel federated learning architecture designed to address critical issues in medical data training, such as adversarial contamination, privacy concerns, and the right to be forgotten. Unlike traditional methods like FedAvg, which treat clients as black boxes, FED-FBD decomposes a ResNet backbone into six functional blocks, each independently tracked and contributor-stamped. This modular approach ensures block-level isolation, preventing adversarial clients from affecting clean model outputs. Additionally, it incorporates privacy-by-design principles, achieving a membership inference AUC of 0.50 before any privacy mechanisms are applied. The framework also enables surgical unlearning, allowing for the removal of a client's contributions in less than a second without retraining. Experimental results demonstrate that FED-FBD maintains competitive accuracy while providing robust isolation and privacy guarantees across various datasets, including MedMNIST-2D and CIFAR-10.
Methodology
FED-FBD employs a modular architecture that decomposes a ResNet backbone into functional blocks, each tracked and stamped with client contributions. This design allows for block-level accountability and isolation. The framework utilizes a warehouse of color variants, enabling the assembly of models from independently verified blocks. The methodology also includes techniques for surgical unlearning through aggregate block replacement, ensuring rapid removal of client contributions.
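The idea of removing a client by replacing block aggregates, rather than retraining, can be sketched with a toy data structure. This is a guess at the general shape only: the `warehouse[block][client]` layout, plain averaging as the aggregation rule, and the client names are assumptions, and the paper's actual warehouse and color-variant mechanics are not described in enough detail here to reproduce.

```python
from typing import Dict, List

# warehouse[block][client] -> that client's parameters for the block
# (hypothetical layout; real blocks would hold tensors, not short lists).
Warehouse = Dict[str, Dict[str, List[float]]]

def aggregate_block(contribs: Dict[str, List[float]]) -> List[float]:
    """Average the per-client parameters of one block."""
    n = len(contribs)
    return [sum(vals) / n for vals in zip(*contribs.values())]

def assemble(warehouse: Warehouse) -> Dict[str, List[float]]:
    """Build a model by aggregating every block independently."""
    return {block: aggregate_block(c) for block, c in warehouse.items()}

def unlearn(warehouse: Warehouse, client: str) -> Dict[str, List[float]]:
    """Drop one client's stamped contributions and re-aggregate each block.
    No gradient steps are involved."""
    pruned = {
        block: {k: v for k, v in contribs.items() if k != client}
        for block, contribs in warehouse.items()
    }
    return assemble(pruned)

warehouse = {
    "block1": {"hospital_a": [1.0, 2.0], "hospital_b": [3.0, 4.0], "hospital_c": [5.0, 6.0]},
    "block2": {"hospital_a": [0.0, 1.0], "hospital_b": [2.0, 3.0], "hospital_c": [4.0, 8.0]},
}
model = assemble(warehouse)
model_without_b = unlearn(warehouse, "hospital_b")
```

Because contributions are stored per client and per block, removal is a bookkeeping operation whose cost scales with the number of stored blocks, which is consistent with the sub-second unlearning the paper reports.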
Results
Experiments on six MedMNIST-2D datasets, PathMNIST, and CIFAR-10 indicate that FED-FBD incurs a modest accuracy gap of 0.3%–3.1% compared to traditional methods. It successfully confines adversarial attacks to the affected client's blocks, with clean model outputs showing minimal deviation (±0.01 AUC). The framework achieves a membership inference AUC of 0.50 ± 0.01, indicating effective suppression of memorization. Surgical unlearning is accomplished with less than 0.25% AUC loss in sub-second time.
Implications
The proposed FED-FBD framework has significant implications for federated learning in sensitive domains such as healthcare, where data privacy and security are paramount. Its ability to isolate adversarial contributions and facilitate rapid unlearning could enhance trust in collaborative model training, making it more viable for real-world applications involving sensitive patient data.
Uncertainty Estimation for Molecular Diffusion Models
Generative Models
- Introduces a method for estimating uncertainty in molecular diffusion models.
- Utilizes a Laplace approximation to measure noise prediction variability.
- Demonstrates that the uncertainty score correlates negatively with established quality metrics.
- Shows that filtering based on uncertainty can improve sample quality without retraining.
Summary
This paper addresses the challenge of uncertainty estimation in molecular diffusion models, which are widely used for 3D molecular generation but lack a reliable mechanism to signal low-quality outputs. The authors propose a post-hoc method that leverages a Laplace approximation of the denoising network to estimate per-sample uncertainty. By measuring the variability of noise predictions throughout the generation process, they derive an uncertainty score that correlates negatively with established quality metrics for generated molecules. This score can be utilized to filter out low-quality samples, thereby enhancing the overall performance of the model during test-time scaling. The empirical results demonstrate that the proposed uncertainty score is more predictive of molecular quality than traditional diffusion likelihood metrics, particularly in experiments conducted on the QM9 dataset. The work represents a significant step towards integrating uncertainty estimation into molecular diffusion models, which is crucial for applications where downstream evaluations are costly and time-consuming.
Methodology
The authors employ a post-hoc approach to uncertainty estimation by fitting a Laplace approximation to the denoising network of a pretrained molecular diffusion model. They measure the variability of noise predictions across the generation trajectory and aggregate this variability into a single uncertainty score for each generated molecule. This involves sampling from the posterior distribution of model parameters and calculating the variance of noise predictions at selected timesteps.
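The scoring step described above can be sketched on a toy model. The example assumes a linear "denoiser" and an isotropic Gaussian posterior around its point-estimate weights, which is a simplification: a real Laplace approximation derives the covariance from curvature information. The function name `uncertainty_score` and all parameters are hypothetical.

```python
import numpy as np


def uncertainty_score(x_traj, w_map, posterior_std, n_samples=32, seed=0):
    """Mean variance of noise predictions under sampled parameters.

    x_traj: array (T, D), states at the selected timesteps.
    w_map: array (D, D), point-estimate weights of a toy linear denoiser.
    posterior_std: std of an assumed isotropic Gaussian posterior around w_map.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        # Draw one parameter sample from the approximate posterior.
        w = w_map + posterior_std * rng.standard_normal(w_map.shape)
        preds.append(x_traj @ w.T)                    # (T, D) noise predictions
    preds = np.stack(preds)                           # (n_samples, T, D)
    per_step_var = preds.var(axis=0).mean(axis=-1)    # variance per timestep, (T,)
    return float(per_step_var.mean())                 # one score per molecule


x_traj = np.ones((4, 3))
w_map = np.eye(3)
narrow = uncertainty_score(x_traj, w_map, posterior_std=0.1)
wide = uncertainty_score(x_traj, w_map, posterior_std=0.5)
```

Samples could then be ranked by this score and the highest-uncertainty ones dropped before expensive downstream evaluation.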
Results
The proposed uncertainty score was found to be informative of molecular sample quality, exhibiting a negative correlation with metrics such as molecular stability and validity. In experiments on the QM9 dataset, the uncertainty score proved to be more predictive of quality than the diffusion likelihood baseline. Additionally, the method allowed for effective filtering of high-uncertainty samples, leading to improved model performance during test-time scaling.
Implications
The findings suggest that incorporating uncertainty estimation into molecular diffusion models can significantly enhance the reliability of generated samples, making it easier to identify candidates for expensive downstream evaluations. This advancement could facilitate more efficient drug discovery processes and other applications in computational chemistry.
Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations
Audio & Speech
Multimodal
- Cognitive load can be predicted from speech dynamics in natural dyadic conversations.
- The study employs a regression approach rather than classification to capture continuous variations in cognitive load.
- Turn-taking dynamics and speaker participation are critical indicators of cognitive load.
- The research utilizes a diverse dataset of remote collaborative tasks to enhance ecological validity.
Summary
This paper investigates the prediction of cognitive load during dyadic conversations using speech and interaction dynamics. Unlike previous studies that primarily focused on controlled environments, this research analyzes audio from 53 dyads engaged in nine collaborative tasks to extract various acoustic and interaction features. A two-head Gated Recurrent Unit (GRU) encoder is employed to model cognitive load as a regression task, addressing gaps in existing literature that often treated cognitive load as a classification problem. The study finds that conversational dynamics, such as turn-taking and speaker participation, significantly correlate with perceived cognitive load, particularly in relation to time pressure and mental effort. The results underscore the importance of considering both acoustic features and interaction patterns in modeling cognitive load in naturalistic settings, providing insights into the complexities of remote collaboration and its impact on cognitive demands.
Methodology
The study analyzes audio recordings from dyadic conversations, extracting static acoustic, dynamic, and interaction features. A two-head Gated Recurrent Unit (GRU) encoder is trained to predict cognitive load scores, evaluated using metrics such as Concordance Correlation Coefficient (CCC) and Pearson correlation. The research emphasizes cross-dyad generalization for realistic assessments.
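For reference, the Concordance Correlation Coefficient penalizes both low correlation and systematic offset between predictions and targets:

$$\mathrm{CCC} = \frac{2\,\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$$

A minimal pure-Python sketch follows, using population (biased) variances; the function name `concordance_cc` is an illustrative choice, not taken from the paper.

```python
def concordance_cc(y_true, y_pred):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


perfect = concordance_cc([1, 2, 3], [1, 2, 3])   # identical sequences
shifted = concordance_cc([1, 2, 3], [2, 3, 4])   # same shape, constant offset
```

Unlike Pearson correlation, which is 1 for both pairs above, CCC drops for the shifted predictions, so it rewards agreement in scale and location as well as in trend.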
Results
The findings indicate that conversational interaction provides valuable signals for predicting cognitive load, with specific dynamics linked to temporal and mental demands. Temporal demand correlates with turn-taking behaviors, while mental demand is associated with participation imbalances. The model demonstrates effective prediction of cognitive load, highlighting the relevance of interaction features.
Implications
This research has significant implications for designing tools and systems that monitor cognitive load in real-time during remote collaborations. It suggests that understanding interaction dynamics can enhance decision-making and performance in knowledge work environments, potentially leading to improved well-being and productivity.
Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance
Theory
Interpretability
- Introduces the VER framework for monitoring representational adequacy.
- Distinguishes between representational inadequacy and ordinary prediction errors.
- Emphasizes the importance of identifying persistent residual structures in learned representations.
- Aims to complement existing evaluation methods rather than replace them.
Summary
This paper introduces the Vigilant Evaluator of Representations (VER), a conceptual framework designed to monitor the adequacy of learned representations in machine learning. While traditional evaluation metrics focus on predictive performance and robustness, they may overlook persistent residual structures that indicate explanatory insufficiency. VER aims to formalize a diagnostic process that identifies, analyzes, and interprets these residual structures, distinguishing them from ordinary prediction errors and uncertainties. The framework outlines a monitoring sequence that includes representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. The authors emphasize that VER is not intended to replace existing evaluation methods but to complement them by explicitly addressing representational adequacy. The paper also discusses a path toward empirical evaluation through representational-vigilance benchmarks, highlighting the importance of understanding whether learned representations adequately organize the phenomena they model.
Methodology
The paper presents a conceptual framework rather than a new algorithm or model architecture. It outlines a diagnostic process for monitoring learned representations, focusing on identifying and analyzing residual structures that indicate explanatory insufficiency.
Results
The VER framework provides a structured approach to detect and evaluate explanatory insufficiency in learned representations, emphasizing the need for representational vigilance in machine learning diagnostics.
Implications
The framework has potential applications in improving the interpretability and robustness of machine learning models by ensuring that learned representations adequately capture the underlying phenomena. It encourages researchers to adopt a more nuanced approach to representation evaluation.
Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling
Generative Models
Computer Vision
- Introduces a density-aware extension to classifier-guided diffusion models.
- Targets low-density regions during sampling without additional training.
- Implements dual guidance to enhance sample diversity and fidelity.
- Demonstrates improved recall of rare samples on ImageNet.
Summary
This paper addresses the challenge of generating rare or low-density samples in classifier-guided diffusion models, which typically concentrate on high-density regions of class-conditional distributions. The authors propose a novel sampling-time, density-aware extension of the classifier-guided diffusion model that enhances exploration of low-density regions without requiring additional training. By modifying the reverse diffusion sampling process, they steer trajectories toward low-confidence areas using a modified classifier gradient while simultaneously guiding the sampling towards predicted real images. This dual guidance approach aims to improve the diversity and fidelity of generated samples. The method is evaluated on ImageNet, demonstrating improved recall of rare samples while maintaining competitive Fréchet Inception Distance (FID) scores. The results indicate that the proposed method effectively balances the exploration of low-density regions with the generation of high-fidelity images.
Methodology
The authors modify the reverse diffusion sampling process of a pretrained conditional diffusion model and classifier. They manipulate the classifier gradient to steer the sampling process towards low-density regions while also guiding it towards predicted real images at each time step. This approach allows for controlled exploration of low-probability samples without altering the training of the underlying models.
Results
The proposed method improves the recall of rare samples in the generated outputs at a resolution of 64x64 while maintaining a comparable FID score. Visual evaluations with a 256x256 ADM model further support the dual guidance strategy, showing that it can produce samples of high perceptual quality on ImageNet.
Implications
The findings suggest that the proposed sampling strategy can enhance the performance of diffusion models in applications requiring the generation of diverse and high-fidelity images, particularly in scenarios where rare samples are critical. This could have implications for fields such as computer vision, art generation, and any domain where high-quality image synthesis is essential.
How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators
Theory
Efficient ML
Time Series
- AMGFNO introduces a dynamic memory weight modulation mechanism for neural operators.
- The optimal memory weight varies with resolution and viscosity, necessitating an adaptive approach.
- AMGFNO achieves significant performance improvements over fixed-memory approaches.
- The method is validated on complex PDEs, showcasing its practical applicability.
Summary
This paper introduces the Adaptive Memory Gate for Neural Operators (AMGFNO), a novel approach to enhance the performance of neural operators in solving time-dependent partial differential equations (PDEs). Traditional memory-augmented neural operators utilize a fixed memory weight, which limits their adaptability to varying observation conditions such as resolution and viscosity. The authors demonstrate through preliminary experiments that the optimal memory weight varies significantly with these conditions, indicating that a static approach is insufficient. AMGFNO addresses this limitation by dynamically modulating memory weight through a learnable gate, allowing it to adapt to the specific requirements of the input data. The effectiveness of AMGFNO is validated through experiments on the Kuramoto-Sivashinsky and Burgers’ equations, where it achieves substantial reductions in normalized root mean square error (nRMSE) compared to existing methods. The results show that AMGFNO can reduce nRMSE by 55-79% at low resolution and 25-40% at high resolution, demonstrating its capability to effectively adapt memory usage based on the input conditions.
Methodology
The authors propose AMGFNO, which incorporates a learnable gate to adjust the memory weight dynamically based on the input conditions. The method uses past states as memory, captures long-range dependencies, and adapts the memory weight; the authors describe this combination as a first in the field of neural operators.
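One way to picture a condition-dependent memory weight is a scalar sigmoid gate that scales the memory term before it is added to the current features. This is only a sketch under assumptions: the additive form, the linear gate, and the names `gated_update`, `gate_weights`, and `gate_bias` are hypothetical and may differ from the AMGFNO architecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_update(h_current, h_memory, condition, gate_weights, gate_bias):
    """Add a memory term scaled by a gate in (0, 1) computed from the conditions.

    condition: features describing the input regime (e.g. resolution, viscosity).
    gate_weights, gate_bias: parameters that would be learned during training.
    """
    logit = sum(w * c for w, c in zip(gate_weights, condition)) + gate_bias
    g = sigmoid(logit)
    out = [hc + g * hm for hc, hm in zip(h_current, h_memory)]
    return out, g


# With a negative weight on a resolution-like feature, the gate shrinks
# as that feature grows, so memory contributes less.
out_low, g_low = gated_update([1.0], [2.0], [0.0], [-1.0], 0.0)
out_high, g_high = gated_update([1.0], [2.0], [5.0], [-1.0], 0.0)
```

A fixed-memory baseline corresponds to holding `g` constant regardless of `condition`.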
Results
AMGFNO demonstrates a 55-79% reduction in nRMSE at low resolution and a 25-40% reduction at high resolution when tested on the Kuramoto-Sivashinsky and Burgers’ equations, outperforming the best fixed-memory baseline (S4FFNO). The learned memory gate value decreases from approximately 0.7 to near-zero as resolution increases, aligning with theoretical predictions.
Implications
The findings suggest that adaptive memory mechanisms can significantly enhance the performance of neural operators in various applications involving time-dependent PDEs, potentially leading to more efficient and accurate solutions in fields such as fluid dynamics and reaction-diffusion systems.
Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier
Audio & Speech
- PULSE addresses the limitations of existing automated tools for insect bioacoustics by integrating semi-supervised and multi-task learning.
- The framework outperforms state-of-the-art models in species classification metrics, demonstrating the effectiveness of combining labelled and unlabelled data.
- Active learning further improves model performance, indicating the potential for continuous learning in ecological monitoring.
- The embeddings generated by PULSE encode ecologically relevant information, supporting ecological research and conservation efforts.
Summary
This paper presents PULSE, a novel semi-supervised, multi-task framework designed for the classification of Orthoptera bioacoustics, specifically targeting the challenges of passive acoustic monitoring (PAM) for insects. The authors highlight the limitations of existing automated tools, which are often narrowly trained and lack transferability. PULSE combines weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. The framework utilizes a large collection of unlabelled UK field recordings alongside labelled data from various sources, addressing the scarcity of labelled Orthoptera data. The methodology involves transforming audio into mel spectrograms, employing a modified VGGish backbone for embedding extraction, and optimizing three objectives: supervised classification, ecological prior matching with BirdNET embeddings, and self-supervised learning using the Bootstrap Your Own Latent (BYOL) method. The results demonstrate that PULSE significantly outperforms a state-of-the-art general model across multiple metrics, with active learning further enhancing performance. Additionally, the learned embeddings reveal ecologically meaningful structures, which can be explored through an interactive visualization tool, facilitating ecological discovery.
Methodology
PULSE employs a multi-task framework that includes supervised classification using labelled data, ecological prior matching with BirdNET embeddings, and self-supervised learning on unlabelled field recordings. Audio inputs are transformed into mel spectrograms, which are processed through a modified VGGish backbone to extract embeddings. The model is trained using a joint optimization approach that balances three loss functions corresponding to the different tasks.
Results
PULSE achieves a macro F1 score of 0.21 and an AUC of 0.74, significantly outperforming a state-of-the-art general model (F1: 0.07, AUC: 0.45). With active learning, the F1 score improves to 0.34 and AUC to 0.84. The embeddings produced by the model reveal ecologically meaningful structures.
Implications
The development of PULSE has significant implications for ecological monitoring and conservation strategies, particularly for insect populations. By providing a robust tool for classifying insect sounds, it can enhance biodiversity assessments and inform land management practices.
Accelerating Speculative Diffusions via Block Verification
Generative Models
Efficient ML
Theory
- Introduces efficient Γ-maximal coupling for diffusion models, simplifying existing techniques.
- Adapts block verification from LLMs to enhance acceptance rates in diffusion sampling.
- Presents the Free Drafter, a heuristic that outperforms previous drafting strategies.
- Demonstrates empirical reductions of up to 6.3% in sampling latency without additional training.
Summary
This paper addresses the challenge of accelerating inference in diffusion models through a novel approach that integrates speculative sampling with block verification. Speculative decoding, originally designed for large language models (LLMs), is adapted to continuous diffusion models, which traditionally struggle with efficient sampling from residual distributions. The authors introduce a new scheme that implements Γ-maximal coupling for diffusion models, allowing for efficient sampling with only a single evaluation of the target model. They also adapt the block verification technique from LLMs to the diffusion context, which enhances the acceptance rate of draft samples. The paper presents the Free Drafter, a heuristic self-speculative mechanism that operates without additional training and empirically demonstrates improved performance over existing methods. The proposed methods yield significant speedups in sampling, achieving up to a 6.3% reduction in latency compared to standard speculative sampling techniques, all while maintaining sample quality.
Methodology
The authors develop a one-step algorithm for Γ-maximal coupling that allows efficient sampling from the residual distribution in diffusion models. They adapt the block verification technique from LLMs, which verifies an entire block of drafts jointly rather than one draft at a time, to improve the efficiency of speculative sampling. The Free Drafter is introduced as a self-speculative mechanism that does not require training, enhancing the overall performance of the sampling process.
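As background, the discrete accept/reject rule used in LLM speculative decoding shows what "sampling from the residual distribution" means. The sketch below covers only that token-level rule; it is not the paper's Γ-maximal coupling for continuous diffusion, nor its block verification. The function names are illustrative.

```python
import random


def speculative_step(p, q, rng):
    """Draft from q, accept with prob min(1, p/q), else resample from the residual."""
    x = rng.choices(range(len(q)), weights=q)[0]            # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = [max(px - qx, 0.0) for px, qx in zip(p, q)]  # unnormalized
    return rng.choices(range(len(p)), weights=residual)[0], False


def speculative_output_distribution(p, q):
    """Exact distribution of the value returned by speculative_step."""
    accept_mass = [min(px, qx) for px, qx in zip(p, q)]     # q[x] * min(1, p[x]/q[x])
    reject_prob = 1.0 - sum(accept_mass)
    residual = [max(px - qx, 0.0) for px, qx in zip(p, q)]
    total = sum(residual)
    if total == 0.0:                                        # p == q: always accepted
        return accept_mass
    return [a + reject_prob * r / total for a, r in zip(accept_mass, residual)]


target = [0.5, 0.3, 0.2]
draft = [0.2, 0.2, 0.6]
output = speculative_output_distribution(target, draft)    # matches target
token, accepted = speculative_step(target, draft, random.Random(0))
```

The output distribution equals the target regardless of the draft, and the acceptance probability is the overlap `sum(min(p, q))`, so a better draft raises speed without changing what is sampled.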
Results
The proposed methods reduce wall-clock latency by up to 6.3% compared to existing speculative sampling methods. The Free Drafter mechanism shows empirical advantages over previously considered strategies, with improved acceptance rates and efficiency in generating high-fidelity samples.
Implications
The advancements presented in this paper could have substantial implications for the efficiency of generative modeling tasks, particularly in applications requiring real-time inference from diffusion models. The integration of block verification and self-speculative mechanisms may pave the way for more efficient generative models in various domains, including image and audio generation.
Exposure Bias as Epistemic Underidentification in Recursive Forecasting
Theory
Time Series
- Exposure bias in recursive forecasting is linked to epistemic underidentification due to insufficient state representation.
- The authors introduce a formal framework using induced states and provenance to analyze recursive forecasting errors.
- Empirical evidence shows that fixed induced states create distinct local corrective tasks and that closed-loop corrections can improve performance.
- The study highlights the importance of considering provenance information in recursive forecasting to mitigate exposure bias.
Summary
This paper addresses the issue of exposure bias in recursive multi-step forecasting, framing it not only as a distribution shift but also as an epistemic underidentification problem. The authors prove that under conditions of partial observability or state truncation, the one-step Bayes supervision may fail to identify the recursive predictor when it encounters self-generated states. They formalize this concept using induced states and provenance variables, leading to a decomposition of induced-state error into three components: teacher-forcing/rollout mismatch, representation-class approximation, and provenance information gaps. Empirical results demonstrate that rollout leads to a distinct induced-state regime and that fixed induced states create a unique local corrective task. The study also shows that closed-loop correction can enhance performance by altering the induced states encountered during rollout, with improvements being conditional rather than uniform. Overall, the findings recast exposure bias as a challenge of reasoning under self-induced epistemic uncertainty, emphasizing the need for better representation in autoregressive systems.
Methodology
The authors develop a theoretical framework to analyze recursive forecasting by defining induced states and provenance variables. They decompose induced-state errors and conduct empirical experiments to observe the effects of fixed induced states and closed-loop corrections on forecasting performance.
Results
The paper demonstrates that recursive rollout leads to a distinct induced-state regime, and that fixed induced states define a unique local corrective task. Additionally, closed-loop correction strategies can enhance performance by changing the induced states encountered during rollout, although the improvements are conditional.
Implications
The findings suggest that addressing exposure bias in autoregressive models requires a deeper understanding of the epistemic uncertainties involved. This could lead to more robust forecasting methods in various applications, including language generation and dynamic forecasting.
LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
Optimization
Efficient ML
Large Language Models
- LoRA-Muon is derived from the spectral steepest descent rule of the Muon optimizer, tailored for low-rank settings.
- The method ensures optimal learning rates transfer across different model configurations, enhancing tuning efficiency.
- LoRA-Muon is gauge-invariant and avoids the computational overhead of QR-decomposition, improving memory efficiency.
- Empirical results indicate that LoRA-Muon can outperform dense training baselines in specific scenarios.
Summary
The paper introduces LoRA-Muon, a novel optimization method designed for Low-Rank Adaptation (LoRA) in deep learning, which addresses the challenges associated with tuning existing optimizers like AdamW when applied to low-rank settings. LoRA-Muon utilizes the spectral steepest descent rule from the Muon optimizer, allowing for effective learning rate transfer across various model configurations, including rank, width, and depth. The authors derive a split weight-decay rule to ensure consistency in weight norms and step sizes between full-rank and low-rank settings. They demonstrate that LoRA-Muon is gauge-invariant and more efficient than existing methods, such as Spectron and LoRA-RITE, by avoiding QR-decomposition and second moment storage. Empirical results show that LoRA-Muon achieves lower mean validation loss compared to dense baselines in a compute-matched TinyShakespeare study, indicating its potential as a robust low-rank proxy for dense optimizers.
Methodology
The authors derive LoRA-Muon by applying the spectral steepest descent rule to the low-rank manifold, focusing on optimizing the learning rates and weight updates without QR-decomposition. They also introduce a split weight-decay rule to ensure compatibility between low-rank and full-rank settings.
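For context, the dense spectral steepest descent direction replaces a gradient matrix with its nearest semi-orthogonal matrix, that is, with $UV^\top$ from the SVD $G = U \Sigma V^\top$. The sketch below computes this exactly with an SVD on a full matrix; Muon approximates it iteratively, and the low-rank derivation specific to LoRA-Muon (including the split weight-decay rule) is not shown. Function names are illustrative.

```python
import numpy as np


def orthogonalize(grad):
    """Return U @ V^T from the thin SVD of grad (all singular values set to 1)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def spectral_step(weight, grad, lr):
    """One plain update along the orthogonalized gradient (no momentum)."""
    return weight - lr * orthogonalize(grad)


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))
direction = orthogonalize(grad)
new_weight = spectral_step(np.zeros((4, 3)), grad, lr=0.1)
```

Because every singular value of the update equals 1, the step size in spectral norm is set by the learning rate alone, which is part of why learning rates can transfer across shapes.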
Results
In experiments using the TinyShakespeare dataset, a rank-2 LoRA-Muon proxy successfully recovers the best learning rate for dense Muon, and at rank 32, it achieves lower mean validation loss than the dense baseline. The results demonstrate the effectiveness of LoRA-Muon in transferring optimal learning rates across various configurations.
Implications
LoRA-Muon has significant implications for the efficient training of deep learning models, particularly in resource-constrained environments. Its ability to maintain performance while reducing computational overhead makes it a valuable tool for practitioners in machine learning, especially in applications requiring low-rank adaptations.
ReCal: Reward Calibration for RL-based LLM Routing
NLP
Large Language Models
Reinforcement Learning
- ReCal introduces a hierarchical reward decomposition mechanism to clarify learning signals for RL-based LLM routing.
- The framework employs variance-aware reweighting and per-dataset normalization to address optimization variability.
- ReCal improves routing performance and training stability across diverse datasets compared to existing methods.
- The approach separates objective-level supervision from distribution-level optimization variability, enhancing policy learning.
Summary
The paper introduces ReCal, a Reward Calibration framework designed to enhance reinforcement learning (RL)-based routing of large language models (LLMs). The authors identify key challenges in existing routing methods, particularly the ambiguity in learning signals due to the aggregation of multiple objectives into a single scalar reward. This leads to difficulties in credit assignment and optimization biases favoring trivial samples. ReCal addresses these issues through a hierarchical reward decomposition mechanism that provides clearer learning signals by estimating component-wise advantages. Additionally, it employs a distribution-aware optimization strategy that includes variance-aware reweighting and per-dataset normalization to calibrate optimization variability. The proposed framework aims to improve the routing performance and training stability of LLMs across heterogeneous tasks. Experimental results on seven datasets demonstrate that ReCal consistently outperforms baseline methods, highlighting its effectiveness in refining routing policies and enhancing learning stability.
Methodology
ReCal utilizes a two-stage calibration process for policy learning. It first restructures reward signals into clearer supervision through hierarchical reward decomposition, allowing for component-wise advantage estimation. Then, it applies a distribution-aware optimization strategy that includes variance-aware reweighting at the instance level and per-dataset normalization to ensure balanced contributions from different datasets during policy updates.
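The combination of component-wise advantages and per-dataset normalization can be sketched as follows. This is an assumption-laden illustration: the component names (`quality`, `cost`), the z-score form, and the function `per_dataset_advantages` are hypothetical, and the variance-aware reweighting step is omitted.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def per_dataset_advantages(samples, eps=1e-8):
    """Standardize each reward component within its own dataset.

    samples: list of (dataset_name, {component_name: reward}).
    Returns one {component_name: advantage} dict per sample, in order.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for dataset, rewards in samples:
        for name, value in rewards.items():
            grouped[dataset][name].append(value)

    advantages = []
    for dataset, rewards in samples:
        adv = {}
        for name, value in rewards.items():
            values = grouped[dataset][name]
            adv[name] = (value - fmean(values)) / (pstdev(values) + eps)
        advantages.append(adv)
    return advantages


samples = [
    ("math", {"quality": 1.0, "cost": 0.2}),
    ("math", {"quality": 0.0, "cost": 0.4}),
    ("chat", {"quality": 10.0, "cost": 2.0}),
    ("chat", {"quality": 8.0, "cost": 4.0}),
]
advantages = per_dataset_advantages(samples)
```

Keeping components separate preserves which objective earned the credit, and normalizing per dataset stops a dataset with large raw rewards from dominating the policy update.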
Results
The experiments conducted on seven datasets show that ReCal significantly improves routing performance and training stability compared to baseline RL-based routing methods. The results indicate that the proposed calibration framework effectively mitigates the issues of ambiguous credit assignment and optimization bias, leading to more reliable routing policies.
Implications
The findings suggest that ReCal can be applied to enhance the performance of RL-based systems in various applications involving LLMs, particularly in scenarios requiring dynamic model selection and reasoning strategy adaptation. This framework could lead to more robust and effective AI systems capable of handling complex tasks with varying difficulty levels.
A Stationary (and Therefore Compatible) Representation is All You Need
Theory
Computer Vision
Efficient ML
- Stationary representations learned via d-Simplex fixed classifiers imply compatibility.
- Combining cross-entropy and contrastive loss captures higher-order dependencies.
- The proposed method achieves state-of-the-art performance in compatible representation learning.
- The approach allows for uninterrupted retrieval services during model updates.
Summary
This paper addresses the challenge of learning compatible representations in machine learning, particularly when models are updated over time. The authors demonstrate that stationary representations learned through d-Simplex fixed classifiers inherently satisfy compatibility requirements. They propose a novel training approach that combines cross-entropy loss with contrastive loss to capture higher-order dependencies in representations while maintaining compatibility. The findings are validated through extensive experiments, showcasing state-of-the-art performance in open-set image recognition and demonstrating the effectiveness of stationary representations in scenarios where models are sequentially fine-tuned and occasionally replaced. This work lays a theoretical foundation for future research in compatible representation learning and offers practical solutions for efficient model updates without the need for reprocessing gallery data.
Methodology
The authors utilize d-Simplex fixed classifiers to learn stationary representations and propose a training method that combines cross-entropy loss with contrastive loss. This approach is designed to ensure compatibility while capturing higher-order dependencies in the learned representations.
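A fixed simplex classifier uses class prototypes that are never trained and sit at the vertices of a regular simplex, so every pair of classes has the same angle. One standard construction, shown below as a sketch, centers the standard basis vectors and normalizes them; the function name `simplex_classifier` is illustrative, and the paper may use a different but equivalent parameterization.

```python
import math


def simplex_classifier(num_classes):
    """Unit-norm class prototypes at the vertices of a regular simplex.

    Rows live in R^k but span a (k - 1)-dimensional subspace, i.e. a d-simplex
    with d = k - 1. Any two distinct rows have cosine similarity -1 / (k - 1).
    """
    k = num_classes
    centered = [
        [(1.0 if i == j else 0.0) - 1.0 / k for j in range(k)]
        for i in range(k)
    ]
    norm = math.sqrt(1.0 - 1.0 / k)   # length of each centered basis vector
    return [[v / norm for v in row] for row in centered]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


prototypes = simplex_classifier(4)
cosine_01 = dot(prototypes[0], prototypes[1])   # equals -1/3 for four classes
```

Since the prototypes stay fixed across model updates, features trained against them share a common reference frame, which is the intuition behind the stationarity-implies-compatibility claim.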
Results
The experiments conducted demonstrate that the proposed method achieves state-of-the-art performance in open-set image recognition tasks. The results confirm that stationary representations facilitate effective model updates without the need for reprocessing existing gallery images.
Implications
The findings have significant implications for the development of machine learning models that require frequent updates, particularly in applications where computational efficiency and data privacy are concerns. The theoretical insights provided can guide future research in compatible representation learning.
Adaptive Weighted Averaging
Optimization
Theory
Efficient ML
- Introduces the SBern strategy, which is admissible and strictly dominates uniform random selection.
- Constructs the Speel strategy that dominates any arbitrary fixed deterministic strategy.
- Demonstrates impossibility results in non-independent observation settings.
- Establishes new online-to-batch conversion bounds in stochastic optimization.
Summary
This paper addresses the problem of selecting the largest among n unknown values given a single unbiased estimate for each value. The authors seek strategies that are admissible and never perform worse than a baseline strategy such as uniform random selection. They introduce a new strategy, SBern, which is derived from the multilinear extension of a base strategy optimized for Bernoulli observations, and prove that it strictly dominates the uniform strategy. They also construct a second strategy, Speel, that dominates any arbitrary fixed deterministic strategy. The authors further prove impossibility results for settings where observations are not independent: there, uniform selection is itself admissible, so no strategy dominates it. The findings are applied to stochastic optimization, providing an online-to-batch conversion bound that improves upon standard techniques, allowing for better performance in benign settings without sacrificing worst-case optimality. The paper concludes with discussions on applications in ensemble methods and federated learning.
Methodology
The authors utilize a theoretical framework based on Bayesian decision theory to design strategies that maximize utility while ensuring admissibility. They derive strategies through multilinear extensions and analyze their performance against established benchmarks. The paper also employs counterexamples to demonstrate the limitations of certain strategies in non-independent settings.
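The idea of a multilinear extension can be sketched generically. The base rule below is a hypothetical stand-in, not the paper's SBern construction: for 0/1 observations it spreads weight uniformly over the entries that observed 1. The extension then evaluates that rule at observations in [0, 1] by averaging over the corners of the hypercube, each weighted by the probability of drawing it from independent Bernoulli coordinates.

```python
from itertools import product

def base_strategy(bits):
    """Hypothetical base rule for 0/1 observations: uniform weight on the
    entries that observed 1, or on all entries if none did."""
    n = len(bits)
    ones = sum(bits)
    if ones == 0:
        return [1.0 / n] * n
    return [b / ones for b in bits]

def multilinear_extension(x):
    """Extend base_strategy from {0,1}^n to [0,1]^n.

    Corner b is weighted by prod_i x_i^b_i * (1 - x_i)^(1 - b_i).
    Enumeration costs 2^n, so this is only practical for small n.
    """
    n = len(x)
    weights = [0.0] * n
    for bits in product((0, 1), repeat=n):
        p = 1.0
        for xi, bi in zip(x, bits):
            p *= xi if bi else 1.0 - xi
        if p == 0.0:
            continue
        w = base_strategy(bits)
        for i in range(n):
            weights[i] += p * w[i]
    return weights

# Larger observations receive more weight, and the weights sum to one.
w = multilinear_extension([0.9, 0.5, 0.1])
# At a corner of the hypercube the extension agrees with the base rule.
corner = multilinear_extension([1.0, 0.0, 0.0])
```

The extension is linear in each coordinate separately, which is the property that makes this kind of construction convenient to analyze.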
Results
The SBern strategy is shown to improve the expected value over the mean by a term related to the variance of the means. The Speel strategy is constructed to dominate arbitrary fixed deterministic strategies. The paper provides a new online-to-batch conversion bound that uniformly improves upon standard techniques, particularly in benign settings.
Implications
The findings have significant implications for stochastic optimization, allowing for improved performance in practical applications. The strategies developed can enhance decision-making processes in various domains, including ensemble methods and federated learning, by providing robust methods for aggregating information.
Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market
Interpretability
- The XGBoost model with TreeSHAP attribution effectively decomposes equity return predictability into interpretable factors.
- Behavioral signals (turnover and momentum) are significantly more predictive than traditional valuation ratios in the Chinese A-share market.
- The model demonstrates strong performance with a mean AUC of 0.547 and an annualized Sharpe Ratio of 2.23.
- Ablation analysis reveals insights into feature substitutability, enhancing the understanding of predictive factors.
Summary
This paper presents an interpretable machine learning pipeline designed to decompose cross-sectional equity return predictability into auditable factor contributions, specifically applied to the Chinese A-share market. The authors employ an XGBoost model enhanced by TreeSHAP attribution to analyze 3,632 Chinese A-share stocks from 2009 to 2019. The study utilizes 60-month rolling windows to evaluate the model's performance over 55 months of out-of-sample data. The findings reveal that the XGBoost model achieves a mean AUC of 0.547 and a long-short spread of +2.38% per month for the top versus bottom quintiles, demonstrating significant predictive power. Notably, the SHAP decomposition indicates that behavioral signals, such as turnover and momentum, account for 58.2% of the predictive attribution, while valuation ratios contribute only 10.7%. The research also includes an ablation analysis to validate the ranking of features, highlighting the importance of feature substitutability. Overall, the study emphasizes the effectiveness of interpretable machine learning in financial decision-making and the critical role of behavioral indicators in stock return predictability.
Methodology
The authors constructed a daily panel of firm-level observations from the baostock API, focusing on 3,632 stocks in the Chinese A-share market from 2009 to 2019. They employed an XGBoost model with fixed hyperparameters and utilized TreeSHAP for feature attribution. The model was validated using rolling windows and ablation analysis to assess the impact of different feature groups on predictive performance.
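The rolling-window validation can be sketched as an index generator. This is a minimal illustration under assumptions: it supposes each model is fit on 60 consecutive months and evaluated on the single month that follows, which the summary does not spell out. The function name `rolling_splits` and the toy panel length are hypothetical.

```python
def rolling_splits(n_months, train_len=60):
    """Yield (train_months, test_month) index pairs for a rolling window.

    Each model sees train_len consecutive months and is scored on the month
    immediately after, so no future data leaks into training.
    """
    for test in range(train_len, n_months):
        yield range(test - train_len, test), test

# Toy panel of 64 months: four out-of-sample months (60, 61, 62, 63).
splits = list(rolling_splits(n_months=64, train_len=60))
first_train, first_test = splits[0]
```

Keeping the window length fixed means every out-of-sample month is predicted by a model trained on the same amount of history.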
Results
The XGBoost model achieved a mean AUC of 0.547 and a long-short spread of +2.38% per month for the top versus bottom quintiles. The SHAP decomposition attributed 58.2% of the predictive attribution to behavioral signals (turnover and momentum), while valuation ratios contributed only 10.7%. The model's performance was validated against the Carhart four-factor model, showing persistent alpha generation.
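A long-short quintile spread for one month can be computed as in the sketch below. It assumes equal-weighted quintiles, which the summary does not confirm, and the scores and returns are toy values for ten hypothetical stocks, not data from the paper.

```python
def long_short_spread(scores, returns):
    """Mean return of the top score quintile minus the bottom quintile."""
    if len(scores) != len(returns) or len(scores) < 5:
        raise ValueError("need matching inputs with at least 5 stocks")
    # Rank stocks from lowest to highest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    q = len(order) // 5
    bottom, top = order[:q], order[-q:]

    def mean_return(indices):
        return sum(returns[i] for i in indices) / len(indices)

    return mean_return(top) - mean_return(bottom)

# Toy cross-section (assumed values, for illustration only).
scores = [0.1, 0.9, 0.4, 0.8, 0.3, 0.6, 0.2, 0.7, 0.5, 0.0]
returns = [-0.02, 0.03, 0.00, 0.02, -0.01, 0.01, -0.01, 0.01, 0.00, -0.03]
spread = long_short_spread(scores, returns)
```

Averaging this monthly spread over the out-of-sample months would give a figure comparable in form to the per-month spread reported above.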
Implications
The findings suggest that behavioral factors play a crucial role in stock return predictability, which could inform investment strategies and portfolio management. The interpretable nature of the model enhances decision-making in financial markets, allowing for transparent and auditable analyses of stock performance.
Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition
Audio & Speech
Efficient ML
NLP
- Memristors enable efficient analog computation for neural models in NLP.
- Large output values from positional encodings can degrade performance in memristor-based systems.
- Adjusting ADC configurations can significantly reduce performance degradation.
- Relative positional encodings improve model performance in low-precision environments.
Summary
This paper explores the integration of memristor-based analog computation in automatic speech recognition (ASR) models, specifically focusing on the role of positional encodings (PEs). Memristors offer a resource-efficient method for executing vector-matrix multiplications, but they face challenges related to distortion during weight programming and execution. The authors identify that large output values from transformed positional encodings can significantly degrade performance in analog-to-digital conversion (ADC) processes. By adjusting the weight and precision bits of the ADC in specific memristor layers, they achieve a reduction in degradation by approximately 50% while maintaining stable energy consumption. The study also examines scenarios where ADC modifications are not possible, demonstrating a 30% relative reduction in degradation by eliminating encoding-related linear transformations. The findings confirm that relative PEs enhance model performance, particularly in low-precision settings, and highlight the importance of adapting model and device configurations to mitigate performance losses during memristor execution.
Methodology
The authors conducted simulations of a CTC-based Conformer model with relative positional encodings on memristor hardware using SynaptogenML. They analyzed the impact of positional encodings on model performance and degradation during execution, adjusting ADC configurations and exploring scenarios without modification capabilities.
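Why large output values hurt analog-to-digital conversion can be illustrated with a generic uniform quantizer. This is not SynaptogenML's ADC model; the functions `adc_quantize` and `mean_abs_error`, the bit widths, and the sample values are all assumptions made for illustration.

```python
def adc_quantize(values, bits, full_scale):
    """Uniform symmetric ADC model: clip to [-full_scale, full_scale], then
    round to one of 2**bits - 1 evenly spaced levels (zero included)."""
    levels = 2 ** (bits - 1) - 1      # number of levels on each side of zero
    step = full_scale / levels
    out = []
    for v in values:
        clipped = max(-full_scale, min(full_scale, v))
        out.append(round(clipped / step) * step)
    return out

def mean_abs_error(values, bits, full_scale):
    """Average absolute difference between inputs and their quantized form."""
    quantized = adc_quantize(values, bits, full_scale)
    return sum(abs(a - b) for a, b in zip(values, quantized)) / len(values)

# Hypothetical layer outputs: mostly small, plus one large value.
outputs = [0.11, -0.23, 0.37, -0.42, 0.05, 8.0]

err_tight = mean_abs_error(outputs, bits=4, full_scale=0.5)    # clips the large value
err_wide = mean_abs_error(outputs, bits=4, full_scale=8.0)     # coarse steps erase small values
err_wide_8bit = mean_abs_error(outputs, bits=8, full_scale=8.0)
small_at_wide = adc_quantize(outputs[:5], bits=4, full_scale=8.0)
```

With a narrow range the large value is clipped, and with a wide range at low precision the small values all collapse to zero. Raising the precision for the affected range recovers them, which mirrors the kind of ADC adjustment the summary describes.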
Results
Relative positional encodings improved model performance by approximately 15% compared to configurations without positional encodings. With adjusted ADC configurations, the degradation during memristor execution was reduced by about 50%. In scenarios where ADC modifications were not feasible, a 30% relative reduction in degradation was achieved by removing the encoding-related linear transformations.
Implications
The findings suggest that optimizing positional encoding configurations can enhance the performance of ASR models on memristor hardware, paving the way for more efficient and effective deployment of neural networks in resource-constrained environments. This research can guide future developments in both hardware design and model training strategies for energy-efficient computing.