AI-generated summaries
Today's ML research, without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
Papers today: 59
Update frequency: 8h
Days of history: 7
From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation
Large Language Models
NLP
Theory
- Introduces a low-cost evaluation framework for LLMs that avoids reliance on large-scale human annotations.
- Uses calibrated win probabilities in place of binary labels, which substantially improves Elo estimation accuracy.
- Utilizes split conformal prediction to provide distribution-free uncertainty bounds for Elo ratings.
- Achieves a mean absolute error of 17.9 Elo on held-out models compared to human-derived ratings.
Summary
This paper addresses the challenges of evaluating large language models (LLMs) using LLM-as-a-judge methodologies, which often suffer from biases leading to miscalibrated rankings. The authors propose a novel approach called Conformal Elo Estimation, which improves the accuracy of Elo ratings derived from LLM judgments by addressing systematic errors such as position bias and self-preference. The methodology involves estimating per-battle uncertainty through calibrated win probabilities instead of binary labels, significantly enhancing Elo estimation accuracy. Additionally, the authors apply split conformal prediction to account for the residual gap between LLM-derived and human-derived Elo ratings, yielding reliable prediction intervals. The results demonstrate that their approach achieves a mean absolute error of 17.9 Elo on held-out models compared to human ratings, providing a cost-effective alternative to traditional human annotation campaigns. The code for the proposed method is made publicly available to facilitate reproducibility.
Methodology
The authors employ a two-tiered approach for Elo estimation: at the local level, they derive calibrated win probabilities from LLM judge scores using maximum likelihood estimation, which enhances the accuracy of individual battle assessments. At the global level, they apply split conformal prediction to the discrepancies between LLM-derived and human-derived Elo ratings, producing reliable prediction intervals with guaranteed coverage.
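The global step can be illustrated with a minimal split conformal sketch. It assumes a calibration set of models that have both LLM-derived and human-derived Elo ratings, and it uses the absolute residual as the nonconformity score. The function names, the toy ratings, and the choice of score are illustrative assumptions, not the authors' released code.

```python
import math

def conformal_halfwidth(llm_elo, human_elo, alpha=0.1):
    """Split conformal half-width from absolute residuals on calibration models."""
    residuals = sorted(abs(l - h) for l, h in zip(llm_elo, human_elo))
    n = len(residuals)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); if it exceeds n,
    # the only interval with guaranteed coverage is unbounded.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return residuals[k - 1]

def elo_interval(llm_elo_new, halfwidth):
    """Symmetric prediction interval for the human-derived Elo of a new model."""
    return (llm_elo_new - halfwidth, llm_elo_new + halfwidth)

# Made-up calibration ratings (hypothetical, for illustration only).
cal_llm = [1210.0, 1185.0, 1302.0, 1250.0, 1120.0, 1275.0, 1190.0, 1230.0, 1160.0]
cal_human = [1200.0, 1200.0, 1290.0, 1262.0, 1140.0, 1270.0, 1175.0, 1238.0, 1165.0]

q = conformal_halfwidth(cal_llm, cal_human, alpha=0.25)
interval = elo_interval(1240.0, q)
```

Under exchangeability of calibration and held-out models, an interval built this way covers the human-derived rating with probability at least 1 - alpha, without assumptions on the residual distribution.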
Results
The proposed method results in a mean absolute error of 17.9 Elo when comparing LLM-derived ratings to human-derived ratings across 55 held-out models on LMArena, indicating a significant improvement in accuracy over traditional methods.
Implications
This work provides a scalable and efficient framework for evaluating LLMs, which could be particularly beneficial for developers seeking to benchmark their models without incurring the high costs associated with human evaluations. The methodology can be applied to various LLM evaluation scenarios, enhancing the reliability of rankings in automated settings.
Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations
Audio & Speech
Multimodal
- Cognitive load can be predicted from speech and interaction dynamics in natural dyadic conversations.
- The study employs a regression approach using a two-head Gated Recurrent Unit encoder, enhancing prediction accuracy.
- Turn-taking dynamics and speaker participation are critical indicators of cognitive load.
- The research utilizes a diverse dataset of remote collaborative tasks to assess cognitive load across various contexts.
Summary
This paper investigates the prediction of cognitive load during dyadic conversations by analyzing speech and interaction dynamics. Unlike previous studies that focused on controlled environments, this research utilizes audio from 53 dyads engaged in nine collaborative tasks to explore how various features can predict cognitive load. The study employs a two-head Gated Recurrent Unit (GRU) encoder to model cognitive load as a regression task, moving beyond traditional classification approaches. The authors extract static acoustic, dynamic, and interaction features, revealing that conversational dynamics such as turn-taking and speaker participation significantly correlate with perceived cognitive load. The findings indicate that temporal demand is linked to turn-taking dynamics, while mental demand relates to the balance of participation between speakers. This research highlights the importance of considering task structure and conversational interaction in modeling cognitive load in natural settings, providing insights for real-time monitoring in remote collaboration contexts.
Methodology
The study analyzes audio recordings from 53 dyads performing nine collaborative tasks, extracting static acoustic, dynamic, and interaction features. A two-head Gated Recurrent Unit encoder is used to model cognitive load as a regression task, with performance evaluated using metrics like Concordance Correlation Coefficient (CCC) and RMSE. The research emphasizes cross-dyad generalization for realistic assessment.
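For readers unfamiliar with the evaluation metrics, the sketch below computes the Concordance Correlation Coefficient and RMSE in their standard forms, using population variances. The function names and the toy values are assumptions for illustration; the summary does not say how the study implemented them.

```python
import math

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin's CCC), population statistics."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    # Penalizes both low correlation and systematic offset between the series.
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

ratings = [1.0, 2.0, 3.0, 4.0]   # toy cognitive-load ratings
shifted = [2.0, 3.0, 4.0, 5.0]   # perfectly correlated but biased by +1
```

Unlike Pearson correlation, CCC drops below 1 for the shifted predictions, which is why it suits regression targets where absolute agreement matters.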
Results
The results demonstrate that conversational interaction provides significant signals for predicting cognitive load, particularly related to time pressure and mental effort. Temporal demand is associated with turn-taking dynamics, while mental demand correlates with participation imbalance. The model shows improved predictive performance when incorporating both temporal and interaction features.
Implications
The findings suggest that speech-based cognitive load prediction can enhance real-time monitoring in remote collaboration, potentially improving decision-making and well-being in knowledge work settings. This research could inform the development of tools for assessing cognitive load in various high-stakes environments.
Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well
Theory
Efficient ML
NLP
- Cross-validation significantly reduces benchmarking variance in machine learning evaluations.
- The concept of sample gain quantifies the benefits of using multiple CV splits.
- Diminishing returns from additional splits set in later than anticipated, so extra splits keep improving reliability for longer.
- A dynamic early-stopping procedure for CV can optimize computational resources.
Summary
This paper addresses the validation crisis in machine learning, where the statistical variability in performance evaluation can obscure genuine advancements due to limited test samples. The authors demonstrate that cross-validation (CV) significantly enhances the reliability of performance estimates by introducing the concept of sample gain, which quantifies the virtual data augmentation achieved through multiple CV splits. Experiments conducted on both synthetic and real-world datasets, including histopathologic scans and NLP fine-tuning, reveal that using multiple splits can substantially improve the stability of performance estimates, with diminishing returns occurring later than expected. Additionally, the authors propose a dynamic early-stopping procedure for cross-validation that estimates the potential sample gains from initial folds, optimizing the trade-off between computational cost and benchmarking variance. The findings underscore the importance of leveraging cross-validation to achieve robust benchmarking in machine learning.
Methodology
The authors conducted experiments using synthetic and real datasets to evaluate the impact of cross-validation on performance estimation. They introduced the concept of sample gain to quantify the benefits of multiple CV splits and developed a dynamic early-stopping procedure to optimize the number of splits based on initial results.
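The variance-reduction idea can be sketched with a simple ratio. The summary does not give the paper's formal definition of sample gain, so the quantity below is only an assumed stand-in: the variance of a single-split score divided by the variance of the score averaged over k splits, measured across repeated experiments. The data layout and the toy numbers are hypothetical.

```python
from statistics import fmean, pvariance

def variance_ratio(scores, k):
    """Variance of a single-split estimate over variance of the k-split average.

    scores[r][s] is the test score of repetition r on CV split s
    (hypothetical layout). A larger ratio means averaging splits helps more.
    """
    single = [row[0] for row in scores]
    averaged = [fmean(row[:k]) for row in scores]
    return pvariance(single) / pvariance(averaged)

# Toy scores: three repetitions, two splits each (made-up values).
toy_scores = [[0.6, 0.8], [0.8, 0.8], [1.0, 0.8]]
ratio = variance_ratio(toy_scores, k=2)
```

In this toy case the two-split average has a quarter of the variance of a single split, which would correspond to needing four times as many test samples to get the same stability from one split, if variance scales inversely with sample size.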
Results
The experiments demonstrated that cross-validation could provide a sample gain equivalent to a substantial increase in statistical power, with values reaching around 10 in some cases. The results indicated that increasing the number of splits yielded diminishing returns more slowly than expected, and the early-stopping procedure effectively predicted the potential benefits of additional splits.
Implications
The findings suggest that adopting cross-validation more widely can lead to more reliable benchmarking practices in machine learning, particularly in fields with limited data. This could facilitate clearer assessments of algorithm performance and foster genuine advancements in the field.
LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning
Time Series
Audio & Speech
Efficient ML
- Introduces LongSpike, a framework that uses fractional-order dynamics for Spiking Neural Networks (SNNs).
- Overcomes the memoryless bottleneck of traditional first-order SNNs.
- Enables efficient parallel training through a state-space representation.
- Outperforms state-of-the-art SNNs on long-sequence benchmarks.
Summary
The paper introduces LongSpike, a novel framework that enhances Spiking Neural Networks (SNNs) by integrating fractional-order State-Space Modeling (f-SSM) to improve long-sequence learning. Traditional SNNs, which rely on first-order Ordinary Differential Equations (ODEs), face limitations in capturing long-range dependencies due to their 'memoryless' nature. LongSpike addresses this by employing fractional-order dynamics, allowing for the hierarchical integration of neuronal dynamics with long-memory kernels. This approach mitigates the computational overhead typically associated with fractional operators by reformulating the dynamics into a parallelizable state-space representation, enabling efficient training on long sequences while maintaining the energy efficiency of SNNs. The authors conduct extensive experiments on benchmarks such as Long Range Arena (LRA), WikiText-103, and Speech Commands, demonstrating that LongSpike outperforms existing SNN architectures in accuracy and long-range modeling capabilities while preserving sparse synaptic computation.
Methodology
The authors extend traditional integer-order State-Space Models to fractional-order dynamics, allowing for the modeling of long-memory effects. They reformulate the dynamics into a parallelizable state-space representation to facilitate efficient training on long sequences, thus overcoming the computational challenges associated with fractional operators.
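The contrast between first-order and fractional-order memory can be seen in the Grünwald-Letnikov discretization of a fractional derivative. This is a standard textbook discretization used here as an assumption; the summary does not state which discretization LongSpike uses, and the function name is illustrative.

```python
def gl_weights(alpha, n_terms):
    """Grünwald-Letnikov weights w_k = (-1)^k * binom(alpha, k) for k = 0..n_terms.

    The derivative of order alpha at step t is approximated by
    sum_k w_k * x[t - k] / h**alpha, so nonzero w_k for large k means
    the operator keeps looking far back into the past.
    """
    weights = [1.0]
    for k in range(1, n_terms + 1):
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights

first_order = gl_weights(1.0, 4)   # only two nonzero terms: a plain difference
half_order = gl_weights(0.5, 3)    # slowly decaying tail: long memory
```

For alpha = 1 the weights vanish after the first two terms, which matches the memoryless first-order case, while for alpha = 0.5 every past step keeps a nonzero weight.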
Results
LongSpike consistently outperformed state-of-the-art SNNs in accuracy and long-range modeling capabilities across various benchmarks, including Long Range Arena, WikiText-103, and Speech Commands, while maintaining high energy efficiency.
Implications
The LongSpike framework has potential applications in various domains requiring efficient long-sequence modeling, such as natural language processing, speech recognition, and time-series analysis, leveraging the energy efficiency of SNNs while capturing complex temporal dependencies.
Exposure Bias as Epistemic Underidentification in Recursive Forecasting
Theory
Time Series
- Exposure bias is not solely a distribution shift but also an epistemic underidentification problem under partial observability.
- Induced states and provenance variables are crucial for understanding recursive forecasting failures.
- Empirical evidence shows distinct induced-state regimes and the impact of fixed induced states on corrective tasks.
- Closed-loop correction can improve performance by changing the induced states during rollout.
Summary
This paper investigates the phenomenon of exposure bias in recursive multi-step forecasting, traditionally viewed as a distribution shift where models trained on observed data are deployed on their own predictions. The authors argue that this framing is insufficient, particularly under conditions of partial observability or state truncation, leading to an epistemic underidentification problem. They demonstrate that even with deterministic dynamics, one-step Bayes supervision may not adequately identify the recursive predictor when it encounters self-generated states. The authors introduce the concepts of induced states and provenance variables to formalize this issue, deriving a decomposition of induced-state error into three components: teacher-forcing/rollout mismatch, representation-class approximation, and provenance information gaps. Empirical results reveal that rollout leads to a distinct induced-state regime and that fixed induced states create a unique local corrective task. The study also shows that closed-loop correction can enhance performance by altering the induced states visited during rollout, although these gains are conditional. Overall, the findings recast exposure bias as a challenge of reasoning under self-induced epistemic uncertainty, providing a deeper understanding of recursive forecasting failures.
Methodology
The authors formalize recursive forecasting using induced states and provenance variables, deriving a decomposition of errors. They conduct empirical experiments to analyze the effects of rollout on induced-state regimes and the efficacy of provenance-aware correction methods.
Results
The study finds that recursive forecasting can enter a distinct induced-state regime, and that fixed induced states lead to a unique local corrective task. Closed-loop correction strategies can improve performance by altering the states encountered during rollout, with gains being conditional rather than uniform across scenarios.
Implications
The findings suggest that addressing exposure bias requires a deeper understanding of epistemic uncertainty in forecasting models. This could lead to improved training methodologies and correction strategies in autoregressive systems, enhancing their robustness in practical applications.
When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals
Interpretability
Large Language Models
- Introduces a routing-ablation framework for Block AttnRes to analyze source families and their contributions.
- Demonstrates that explicit depth routing does not guarantee mechanistic interpretation.
- Identifies three localized routing motifs in the trained Block AttnRes model.
- Finds a significant dissociation between routing mass and causal importance in the model.
Summary
This paper investigates the interpretability of routing in Block Attention Residuals (Block AttnRes), introduced by Chen et al. (2026). Block AttnRes replaces fixed additive residuals with a learned softmax over earlier depth-source representations, which makes cross-layer routing directly observable during the forward pass. The study probes two Block AttnRes checkpoints: a vanilla Qwen3 model wrapped with a deterministic recency-bias schedule, and a Block AttnRes Qwen3 trained from scratch. The findings reveal that architectural exposure of routing is necessary for interpretability but not sufficient. The vanilla model's routing weights are content-independent and align with the recency-bias schedule, whereas the trained model exhibits distinct routing motifs. The results indicate a dissociation between average routing mass and causal importance, suggesting that descriptive routing summaries should be treated as hypotheses to be tested rather than definitive evidence of mechanism.
Methodology
The methodology involves probing two 0.6B Block AttnRes checkpoints using a routing-ablation framework that masks mutually exclusive source families. The authors conduct identical routing-ablation interventions on both a vanilla Qwen3 model and a Block AttnRes model to analyze the routing weights and their causal contributions.
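A toy version of a routing ablation is sketched below. It is not the paper's framework: the sources are small hand-picked vectors, a "source family" is reduced to a set of indices, and all names are hypothetical. The point is only to show how a source can carry large routing mass while its removal leaves the output unchanged.

```python
import math

def softmax(logits):
    """Softmax that tolerates -inf entries (they receive zero weight)."""
    m = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(sources, logits, masked=()):
    """Mix earlier depth-source vectors with softmax routing weights.

    Indices in `masked` are ablated by setting their logit to -inf,
    so the remaining weights renormalize.
    """
    gated = [-math.inf if i in masked else l for i, l in enumerate(logits)]
    weights = softmax(gated)
    dim = len(sources[0])
    mixed = [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dim)]
    return weights, mixed

def ablation_effect(sources, logits, masked):
    """Euclidean change in the mixed output when `masked` sources are removed."""
    _, full = route(sources, logits)
    _, ablated = route(sources, logits, masked)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(full, ablated)))

# Source 2 gets half of the routing mass but equals the mean of sources 0 and 1.
sources = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
logits = [0.0, 0.0, math.log(2.0)]
weights, _ = route(sources, logits)
effect_of_source_2 = ablation_effect(sources, logits, masked={2})
effect_of_source_0 = ablation_effect(sources, logits, masked={0})
```

Here source 2 holds the largest weight yet ablating it changes nothing, while ablating the lighter source 0 shifts the output, a small-scale analogue of the mass versus causal-importance dissociation.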
Results
The results show that the vanilla model's routing weights reproduce the recency-bias schedule's predictions, while the trained Block AttnRes model reveals three distinct routing motifs. Furthermore, there is a clear dissociation between the average routing mass and the causal contributions of different source families, with some carrying significant mass but lacking causal roles.
Implications
The findings suggest that while architectural changes can enhance interpretability, they must be accompanied by training that incorporates routing for meaningful mechanistic insights. This has implications for the design of interpretable models in machine learning, particularly in understanding complex architectures like Transformers.
Order Is Not Control
Theory
Interpretability
Large Language Models
- Control requires a receiver-gated response law, distinguishing it from mere order induction.
- Empirical evidence from biological systems and LLMs supports the proposed response laws.
- Interventions can induce structure without guaranteeing control, necessitating validation through receiver admission.
- The study introduces a stochastic response kernel to formalize the relationship between drives and responses.
Summary
The paper argues that while AI alignment and interpretability studies identify order-inducing objects, true control requires a receiver-gated response law. This law maps various states and actions to measurable outcomes, distinguishing between induced order and actual control. The authors present empirical evidence from biological systems (mouse ALM, C. elegans, zebrafish) and large language models (LLMs) to illustrate how interventions can induce structure without guaranteeing control. They define a stochastic response kernel that captures the relationship between drives and responses, emphasizing that control is only established when finite effort leads to target movement while keeping side effects bounded. The study highlights the importance of separating order, response evidence, and local control, proposing a framework applicable to organizational alignment and AI systems. The findings suggest that interventions in adaptive systems can be misleading if not validated through receiver admission, and they provide a structured approach to understanding control in both biological and artificial systems.
Methodology
The authors employed empirical studies across biological systems and large language models to analyze response laws. They defined a stochastic response kernel to characterize the relationship between material states, actions, and responses, and used various panels to gather evidence on control and response mechanisms.
Results
The results showed that response vectors in LLMs were predictable with high accuracy (72.8-73.7% for component-sign accuracy, rising to 84.3-84.8% for nonzero components). Additionally, held-out observers predicted system effects with 93.6% accuracy. The findings confirmed the existence of local admitted control and measurable stochastic response operators while excluding broader claims of universal control.
Implications
The findings have significant implications for AI alignment and interpretability, suggesting that interventions must be rigorously validated to ensure they lead to genuine control. This framework can guide the design of AI systems and organizational practices by clarifying the distinction between induced order and effective control.
Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning
Optimization
Efficient ML
Theory
- SCSB transitions from uniform priors to sparse posteriors in ensemble learning.
- Introduces a concave quadratic penalty to address the L1-simplex paradox.
- Achieves up to 96% ensemble compression with linear inference speedups.
- Improves probability calibration while preserving or enhancing generalization accuracy.
Summary
This paper introduces Simplex-Constrained Sparse Bagging (SCSB), a novel framework aimed at enhancing the efficiency and calibration of bootstrap-based bagging ensembles. Traditional bagging methods, such as Random Forests and Bagged SVMs, typically assign uniform voting weights to all base estimators, which can lead to overconfidence in predictions and redundancy among estimators. SCSB addresses these issues by formulating ensemble pruning and calibration as a joint optimization problem over the probability simplex, minimizing Out-Of-Bag (OOB) loss to ensure effective weight assignment without data leakage. A key innovation is the introduction of a concave quadratic penalty to overcome the L1-simplex paradox, which allows for effective sparsity in the weight vector. The methodology is model-agnostic, enabling significant ensemble compression and improved probability calibration while maintaining or enhancing generalization accuracy. The empirical results demonstrate that SCSB can achieve up to 96% compression of ensembles, leading to faster inference times and better-calibrated predictions.
Methodology
The SCSB framework formulates the optimization of estimator weights as a constrained problem over the probability simplex, minimizing the OOB loss. It employs a concave quadratic penalty to induce sparsity in the weight vector, allowing for effective pruning of redundant estimators. Analytical gradients for both classification and regression tasks are derived to facilitate efficient optimization.
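The L1-simplex paradox and the effect of a concave penalty can be sketched in a few lines. The penalty form lam * (1 - sum(w_i^2)), the step function, and all names below are assumptions made for illustration; the summary does not give the paper's exact objective. The projection is the standard sort-based Euclidean projection onto the probability simplex.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # keep the threshold from the last index that stays positive
    return [max(x - theta, 0.0) for x in v]

def l1_norm(w):
    return sum(abs(x) for x in w)

def concave_penalty(w, lam):
    """Assumed penalty: zero at simplex vertices, largest at uniform weights."""
    return lam * (1.0 - sum(x * x for x in w))

def penalized_step(w, loss_grad, lam, lr):
    """One projected-gradient step on loss + concave_penalty."""
    grad = [g - 2.0 * lam * x for g, x in zip(loss_grad, w)]
    return project_to_simplex([x - lr * g for x, g in zip(w, grad)])

uniform = [0.25, 0.25, 0.25, 0.25]
vertex = [1.0, 0.0, 0.0, 0.0]
stepped = penalized_step([0.6, 0.4], loss_grad=[0.0, 0.0], lam=0.5, lr=0.1)
```

Every point on the simplex has L1 norm 1, so an L1 penalty cannot prefer sparse weights there, whereas the concave term is smaller at the vertex than at the uniform vector and each step shifts mass toward the already larger weight.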
Results
Empirical evaluations show that SCSB can compress ensembles by up to 96%, resulting in linear speedups during inference and significantly improved probability calibration, as evidenced by a reduction in Expected Calibration Error (ECE). The method maintains or enhances the generalization accuracy of the ensemble.
Implications
The SCSB framework has potential applications in scenarios where computational efficiency and model calibration are critical, such as in real-time systems and resource-constrained environments. It can be particularly beneficial for deploying ensemble methods in practical applications where speed and reliability are paramount.
Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation
Theory
- ML-based behavior models improve the realism of traffic microsimulation compared to traditional rule-based models.
- Simulated conflicts from the ML model align better with real-world crash data, enhancing prediction accuracy.
- Direct application of ML-generated crashes for predicting real-world crash frequencies remains challenging.
- The study highlights the potential of ML to advance traffic safety assessments without extensive calibration.
Summary
This paper explores the integration of machine learning (ML) models into traffic microsimulation to enhance the prediction of crash frequencies based on simulated traffic conflicts. Traditional microsimulation approaches often rely on simplified rule-based behavior models, which, while effective for traffic flow, fail to accurately replicate the dynamics of traffic conflicts. The authors conducted microsimulation experiments at five real-world signalized intersections in Leeds, UK, comparing a standard rule-based model with an advanced ML model trained on large-scale trajectory datasets. The study employed a Time-to-Collision metric to analyze simulated vehicle trajectories and identify conflicts, which were then modeled using Extreme Value Theory (EVT) to predict crash frequencies. Results indicated that the ML model produced conflict data that aligned more closely with real-world crash statistics, while the rule-based model's predictions were less reliable. However, the study also found that directly using ML-generated simulated crashes to predict real-world crash frequencies yielded poor results, indicating that while ML models can effectively simulate conflicts, they are not yet capable of generating realistic crash scenarios. The findings suggest that ML-based behavior models hold significant promise for improving crash frequency predictions without necessitating location-specific calibrations, paving the way for future advancements in traffic microsimulation methodologies.
Methodology
The authors conducted traffic microsimulation for five signalized intersections using both a standard rule-based model and a machine learning model. They analyzed simulated vehicle trajectories with a Time-to-Collision metric to identify conflicts and applied Extreme Value Theory to predict crash frequencies.
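A minimal sketch of a Time-to-Collision check is given below for the simplest case, one vehicle following another in the same lane. The study works with intersection trajectories and may use a more general formulation; the one-dimensional simplification, the 1.5 s conflict threshold, and the function names are all assumptions for illustration.

```python
import math

def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """Seconds until the follower reaches the leader at constant speeds."""
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return math.inf  # not closing in, so no collision course
    return gap_m / closing_speed

def min_ttc(samples):
    """Smallest TTC over a trajectory of (gap, follower speed, leader speed) samples."""
    return min(time_to_collision(g, vf, vl) for g, vf, vl in samples)

def is_conflict(ttc_s, threshold_s=1.5):
    """Flag a conflict when TTC falls below an assumed threshold."""
    return ttc_s < threshold_s

# Made-up trajectory: the follower closes in, then the leader pulls away.
trajectory = [(20.0, 15.0, 10.0), (6.0, 15.0, 10.0), (8.0, 10.0, 15.0)]
worst = min_ttc(trajectory)
```

Per-encounter minima like `worst` are the kind of extreme values that an Extreme Value Theory model would then be fitted to.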
Results
The ML model's conflict data yielded crash predictions consistent with real-world data, while the rule-based model did not provide meaningful predictions. However, using ML-generated simulated crashes for real-world predictions was ineffective, indicating limitations in crash generation realism.
Implications
The findings suggest that integrating ML into traffic microsimulation can significantly enhance the accuracy of crash frequency predictions, which is crucial for proactive traffic safety assessments and infrastructure planning. This approach could facilitate the evaluation of unimplemented safety interventions and contribute to the iterative development of road safety measures.
EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models
Theory
Generative Models
Time Series
- EPM-JEPA introduces operator-side experience modulation to adapt JEPA models to distribution shifts.
- The study compares operand-side injection (EI-JEPA) and operator-side modulation (EPM-JEPA) in a controlled experiment.
- EPM-JEPA shows a 1.90% improvement over a no-memory baseline, while EI-JEPA underperforms compared to this baseline.
- The performance trajectory is governed by three independent dynamical processes, indicating complex interactions in model adaptation.
Summary
This paper introduces EPM-JEPA, a novel approach to enhance the Joint-Embedding Predictive Architecture (JEPA) by incorporating operator-side experience modulation to adapt to distribution shifts during inference. Traditional JEPA models utilize a static predictor that does not adjust its weights when faced with changes in dynamics at test time. The study compares two mechanisms for integrating accumulated experience into the JEPA predictor: operand-side injection (EI-JEPA) and operator-side modulation (EPM-JEPA). The authors conduct experiments using the Moving MNIST dataset with a gravity shift to evaluate the performance of both methods. The comparison between EPM-JEPA and EI-JEPA is reported as a null result: the performance difference of 4.74% falls below the 5% threshold. However, EPM-JEPA shows a 1.90% improvement over a no-memory baseline, suggesting that weight-level modulation is beneficial. The paper also provides a detailed mechanism analysis, revealing that the performance trajectory is influenced by three independent dynamical processes rather than converging to a stable equilibrium. This analysis sets the stage for future work on PEM-JEPA, a physics-grounded successor aimed at addressing the identified limitations.
Methodology
The authors conducted a controlled experiment using the Moving MNIST dataset with a gravity shift to compare the performance of three models: a Vanilla JEPA baseline with no memory, EI-JEPA with operand-side injection of experience, and EPM-JEPA with operator-side modulation of weights using low-rank adaptation (LoRA). Performance was evaluated based on the ability to predict future states under distribution shifts.
Results
EPM-JEPA achieved a score of 0.7848 ± 0.0078 and EI-JEPA scored 0.8238; the reported comparisons imply that lower is better. The difference between the two is treated as a null result (|δ| < 5%). EPM-JEPA outperformed the no-memory baseline (0.8000) by 1.90%, indicating that operator-side modulation is effective in this context, while EI-JEPA's operand-side approach scored worse than the baseline.
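The reported percentages can be checked with a short calculation. It assumes the scores are error-like (lower is better), which is what the reported 1.90% improvement of 0.7848 over 0.8000 implies, and that each difference is taken relative to the score being compared against. The helper name is illustrative.

```python
def relative_reduction(reference, value):
    """Fractional reduction of `value` relative to `reference`."""
    return (reference - value) / reference

baseline, epm, ei = 0.8000, 0.7848, 0.8238

epm_vs_baseline = 100 * relative_reduction(baseline, epm)  # about 1.90
epm_vs_ei = 100 * relative_reduction(ei, epm)              # about 4.73
ei_vs_baseline = 100 * relative_reduction(baseline, ei)    # about -3.0
```

The rounded scores give roughly 4.73% for the EPM-JEPA versus EI-JEPA gap, close to the reported 4.74%; the small mismatch is consistent with the paper using unrounded values. The negative last figure reflects EI-JEPA scoring worse than the no-memory baseline.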
Implications
The findings suggest that directly modulating the weights of predictive models based on accumulated experience can lead to better adaptation in dynamic environments. This approach could have significant implications for developing more robust machine learning models in various applications, particularly in scenarios where environmental conditions change unexpectedly.
LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data
Large Language Models
Multimodal
Time Series
- Introduces GlyLLM, an LLM-powered framework for glycemic assessment.
- Integration of wearable sensor data with personalized static metadata enhances model performance.
- GlyLLM outperforms traditional machine learning methods in glucose forecasting and diabetes categorization.
- Ablation studies reveal the critical role of personal static metadata in glycemic assessment.
Summary
This paper addresses the challenge of personalized glycemic assessment in Type 2 Diabetes (T2D) using wearable sensor data, particularly continuous glucose monitors (CGMs). Traditional machine learning methods often rely on historical blood glucose measurements and neglect individual-level context, leading to suboptimal performance across diverse diabetes populations. To overcome these limitations, the authors propose GlyLLM, a framework that integrates large language models (LLMs) with wearable sensor data and structured metadata. GlyLLM utilizes a vision transformer (ViT) encoder to process sensor data and combines it with personalized static metadata to enhance glycemic assessment. The framework is evaluated on the AI-READI dataset, demonstrating significant improvements over traditional ML methods, achieving an average of 13.66% reduction in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% improvement in Area Under the Receiver Operating Characteristic curve (AUROC) for diabetes categorization. The study also highlights the importance of diabetes surveys and biometric tests in the assessment process, marking a promising advancement in personalized diabetes care.
Methodology
GlyLLM employs a vision transformer (ViT) encoder to generate representations of wearable sensor data, which are then concatenated with personalized static metadata. This unified input is processed by a pre-trained LLM to model glycemic dynamics effectively. The framework is evaluated through two clinical tasks: glucose level forecasting and diabetes categorization.
Results
GlyLLM achieved an average 13.66% reduction in RMSE for glucose forecasting and a 13.08% improvement in AUROC for diabetes categorization compared to traditional machine learning approaches. The ablation study indicated that diabetes surveys and biometric tests are more significant than other health information for effective glycemic assessment.
Implications
The findings suggest that integrating LLMs with wearable sensor data can significantly enhance personalized glycemic assessment in T2D care, potentially leading to better management strategies and improved patient outcomes.
MiniPIC: Flexible Position-Independent Caching in <100LOC
Large Language Models
Efficient ML
- MiniPIC provides a flexible and minimalistic approach to position-independent caching in LLMs.
- The implementation requires less than 100 lines of code changes, making it easy to integrate into existing systems.
- User-controlled primitives allow for multiple caching strategies without extensive modifications to the inference engine.
- The proposed method achieves a 49% improvement in prefill throughput and significantly reduces time-to-first-token for cached spans.
Summary
The paper introduces MiniPIC, a novel approach to position-independent caching (PIC) designed for retrieval-augmented and agentic workloads in large language models (LLMs). Traditional prefix caching methods in systems like vLLM are limited as they only reuse key-value (KV) entries with identical prefixes, which hinders performance when prompts share common spans but differ in context. MiniPIC addresses this by implementing a minimalistic design that requires fewer than 100 lines of code changes to the core engine. It utilizes a positional-encoding-free KV cache and user-controlled cache-reuse primitives, allowing for flexible caching strategies. The authors propose three user-facing primitives: block-aligned padding, SSEP, and PDEP, which modify hashing behavior and causal attention structure. The results demonstrate that MiniPIC significantly enhances prefill throughput by 49% on the 2WikiMultihopQA benchmark, reduces time-to-first-token for cached spans by up to two orders of magnitude, and maintains linear scaling for uncached spans while incurring minimal overhead.
Methodology
The authors developed MiniPIC by creating a position-free KV cache that stores unrotated keys and applies rotary positional encoding (RoPE) during attention computation. They introduced three user-facing primitives to control cache reuse and implemented a scheduling policy that interleaves span prefill and final-prompt requests, optimizing throughput without synchronization barriers.
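The position-free cache idea can be illustrated with a minimal sketch. It assumes a single attention head, standard RoPE with base 10000, and plain Python lists; the helper names `rope` and `dot` are illustrative and are not taken from MiniPIC or vLLM. The point it shows is that a key stored unrotated can be rotated at lookup time for whatever position the span lands on, and the attention score then depends only on the query-key offset.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary positional encoding to an even-length vector at position pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one rotation frequency per coordinate pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical cached span: the key is stored WITHOUT rotation.
cached_key = [0.3, -1.2, 0.7, 0.5]
query = [1.0, 0.2, -0.4, 0.9]

# The same cached key reused at two different absolute offsets.
# In both cases the query sits 4 positions after the key.
score_at_5 = dot(rope(query, 9), rope(cached_key, 5))
score_at_105 = dot(rope(query, 109), rope(cached_key, 105))
```

Because the two scores agree up to floating-point error, the stored entry does not need to be recomputed when the span moves, which is the property a position-independent cache relies on.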
Results
MiniPIC improved prefill throughput by 49% over the baseline vLLM and reduced the cached-span time-to-first-token by up to two orders of magnitude. It preserved the linear prefill scaling of uncached spans and incurred only 5.7% worst-case overhead, demonstrating its efficiency and effectiveness.
Implications
The findings suggest that MiniPIC can enhance the performance of LLM inference systems, particularly in scenarios involving repeated spans across different contexts. This could lead to more efficient deployment of LLMs in applications requiring rapid response times and high throughput.
Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting
Time Series
- Timeflies reformulates time series forecasting as a joint problem of observability inference and value estimation.
- The framework includes dedicated streams for observations and values, enhancing the modeling of historical irregularities.
- A new benchmark dataset, Shadow, is introduced to evaluate the model's performance in realistic scenarios.
- The Observation-Value Joint Entropy (OVJE) metric provides a comprehensive evaluation of the model's predictability.
Summary
The paper addresses the challenges of time series forecasting in real-world scenarios characterized by incomplete and irregular data due to factors like sensor dormancy and event-driven sampling. Traditional forecasting methods often assume that future observation timestamps are known, which is not the case in many practical applications. To overcome this limitation, the authors propose a novel framework called Timeflies, which reformulates forecasting as a joint problem of future observability inference and value estimation. Timeflies consists of two main streams: an observation stream and a value stream, interconnected through three modules designed for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. The authors also introduce a benchmark dataset, Shadow, which combines natural missingness from public datasets with real-world industrial data, along with a new metric, Observation-Value Joint Entropy (OVJE), for evaluating the predictability of the model. Experimental results demonstrate that Timeflies consistently outperforms existing methods, emphasizing the importance of explicitly modeling future observability in time series forecasting with missing values.
Methodology
The Timeflies framework employs a dual-stream architecture that captures the interaction between observational existence and evolving states. It utilizes three main modules: Reliability-Gated Patch Embedding for assessing observation reliability, Cross-track Conditioned Transformer for modeling dependencies, and a joint prediction mechanism to forecast both the existence of observations and their corresponding values.
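One way to picture joint prediction of existence and value is a combined objective. The sketch below is an assumption for illustration only: the summary does not state the loss Timeflies uses, and `joint_loss` is a hypothetical name. It adds a binary cross-entropy term on whether each future step is observed to a squared-error term computed only over steps that were actually observed.

```python
import math

def joint_loss(p_obs, y_hat, mask, y, eps=1e-12):
    """Mean BCE on observation existence plus MSE over observed steps only.

    p_obs: predicted probability that each future step is observed
    y_hat: predicted value at each step
    mask:  1 if the step was actually observed, else 0
    y:     true value (ignored where mask is 0)
    """
    n = len(mask)
    bce = -sum(m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps)
               for p, m in zip(p_obs, mask)) / n
    n_obs = sum(mask)
    mse = sum(m * (yh - yt) ** 2
              for yh, yt, m in zip(y_hat, y, mask)) / max(n_obs, 1)
    return bce + mse

mask = [1, 0, 1]
y = [2.0, 0.0, 4.0]
p_obs = [0.9, 0.2, 0.8]
loss = joint_loss(p_obs, [2.5, 99.0, 3.0], mask, y)
```

The value predicted at the unobserved middle step (99.0 here) does not affect the loss, so the value stream is never penalized for steps where no ground truth exists.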
Results
The experimental results indicate that Timeflies consistently outperforms traditional forecasting methods, particularly in scenarios with missing data. The introduction of the Shadow benchmark and the OVJE metric allows for a more nuanced evaluation of the model's performance, demonstrating its effectiveness in real-world applications.
Implications
The proposed framework has significant implications for various fields that rely on time series data, such as finance, healthcare, and IoT systems, where accurate forecasting is critical despite incomplete data. By addressing the issue of observability, Timeflies can enhance decision-making processes in these domains.
Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents
Reinforcement Learning
Robotics
Large Language Models
- Naive self-distillation can degrade long-horizon tool-use performance by reinforcing shortcuts.
- SGCD introduces a novel approach to credit assignment using sibling rollouts to enhance learning.
- The method maintains the integrity of the policy gradient update while reshaping token credit.
- Empirical results show significant improvements in performance on AppWorld and τ³-airline benchmarks.
Summary
This paper addresses the challenges of long-horizon tool-use in reinforcement learning (RL), particularly the inefficiencies of naive self-distillation (SD) that can degrade the performance of agents by amplifying both useful skills and harmful shortcuts. The authors propose a novel method called Sibling-Guided Credit Distillation (SGCD), which focuses on credit assignment rather than acting as a competing loss. SGCD utilizes dynamic sampling to generate mixed rollouts of successful and failed attempts, which are then summarized by an external large language model (LLM) to create a stepwise credit reference. This reference guides the policy gradient updates without directly influencing the actor's behavior during deployment. The method preserves the advantages of group-relative policy-gradient methods like GRPO while enhancing the learning signal through a structured credit assignment process. The empirical results demonstrate that SGCD outperforms matched GRPO comparators across benchmarks, indicating its effectiveness in improving long-horizon tool-use capabilities.
Methodology
The authors developed Sibling-Guided Credit Distillation (SGCD), which involves dynamic sampling of sibling rollouts to create a mixed set of successful and failed attempts. An external LLM summarizes these rollouts into a credit reference that guides the policy gradient updates. The method uses dense teacher/student divergence to reweight the policy gradient advantage without introducing direct distillation losses.
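The reweighting idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the SGCD implementation: the function names are hypothetical, the per-token divergences are taken as given non-negative numbers, and the normalization (mean token weight of 1) is one plausible choice among many. The group-relative advantage is the usual GRPO-style standardized reward.

```python
import statistics

def group_relative_advantages(rewards):
    """Standardize rewards within one group of sibling rollouts (GRPO-style)."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + 1e-8) for r in rewards]

def reweight_tokens(advantage, divergences):
    """Spread one rollout-level advantage over tokens in proportion to
    teacher/student divergence, keeping the mean token weight at 1."""
    mean_div = sum(divergences) / len(divergences)
    return [advantage * d / (mean_div + 1e-8) for d in divergences]

rewards = [1.0, 0.0, 0.0, 1.0]          # two successful and two failed siblings
advantages = group_relative_advantages(rewards)
token_credit = reweight_tokens(advantages[1], [0.1, 0.3, 0.2])
```

In this sketch the divergence only changes how much credit each token receives. The sign of every token's credit still comes from the rollout's policy-gradient advantage, which matches the summary's point that no separate distillation loss competes with the policy gradient.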
Results
SGCD improved performance on the AppWorld benchmark from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge. For the τ³-airline benchmark, the pass@1 metric increased from 0.583 to 0.602, demonstrating the effectiveness of the proposed method over traditional GRPO approaches.
Implications
The findings suggest that SGCD can significantly enhance the performance of long-horizon tool-use agents in reinforcement learning, making it applicable in various domains requiring complex decision-making and tool utilization. This method could lead to more robust and efficient RL systems in real-world applications.
Uncertainty Estimation for Molecular Diffusion Models
Generative Models
- Introduces a method for estimating uncertainty in molecular diffusion models.
- Utilizes a Laplace approximation to measure noise prediction variability.
- Demonstrates a negative correlation between uncertainty scores and sample quality metrics.
- Shows that filtering based on uncertainty can improve generated sample quality.
Summary
This paper addresses the challenge of assessing the quality of generated molecules using diffusion models, which have become prevalent in 3D molecular generation. The authors propose a post-hoc method for estimating uncertainty in pretrained molecular diffusion models. By employing a Laplace approximation of the denoising network, they measure the variability of noise predictions throughout the generation process. The resulting uncertainty score is shown to correlate negatively with established quality metrics, indicating its effectiveness in predicting sample quality. The authors also demonstrate that this uncertainty score can be utilized to filter generated samples, enhancing model performance through test-time scaling without the need for retraining. This work represents a significant step towards integrating uncertainty estimation into molecular diffusion models, which is crucial for applications where downstream evaluations are costly.
Methodology
The authors fit a Laplace approximation to the denoising network of a pretrained molecular diffusion model. They measure the variability of noise predictions across the generation trajectory and aggregate this variability into a single uncertainty score for each generated molecule. This score is derived from the sample variance of noise predictions at selected timesteps, atoms, and feature dimensions.
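The aggregation step can be sketched in a few lines. The sketch assumes the noise predictions from several Laplace posterior samples are already available as flat lists, and that the per-entry sample variances are combined by a plain mean; both the data layout and the mean are assumptions, and the function names are hypothetical.

```python
import statistics

def uncertainty_score(noise_preds):
    """noise_preds[s][k] is entry k (one selected timestep/atom/feature)
    of the noise prediction under posterior sample s (needs >= 2 samples)."""
    n_entries = len(noise_preds[0])
    variances = [statistics.variance([sample[k] for sample in noise_preds])
                 for k in range(n_entries)]
    return sum(variances) / n_entries

def keep_most_certain(molecules, scores, keep_fraction):
    """Keep the fraction of generated molecules with the lowest uncertainty."""
    ranked = sorted(zip(scores, molecules), key=lambda pair: pair[0])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [mol for _, mol in ranked[:n_keep]]

score = uncertainty_score([[1.0, 2.0], [3.0, 2.0]])
kept = keep_most_certain(["a", "b", "c", "d"], [0.4, 0.1, 0.9, 0.2], 0.5)
```

Filtering of this kind needs no retraining: it only ranks samples that the pretrained model has already produced.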
Results
The empirical results indicate that the uncertainty score is informative of molecular sample quality, exhibiting a negative correlation with established metrics such as molecular stability and validity. The proposed method outperforms diffusion likelihood baselines in predicting sample quality in experiments conducted on the QM9 dataset. Additionally, the uncertainty score enables effective filtering of generated samples, leading to improved quality without retraining the model.
Implications
The proposed uncertainty estimation method has significant implications for molecular generation tasks, particularly in contexts where downstream evaluations are expensive. By providing a reliable signal of sample quality, it can facilitate more efficient resource allocation in experimental validation and drug discovery processes.
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
Large Language Models
Efficient ML
- TWLA achieves significant model compression with ternary weights and low-bit activations.
- The framework effectively addresses heavy-tailed activation distributions, enhancing quantization performance.
- Three innovative components (E2M-ATQ, KOTMS, ILA-AMP) work together to optimize both weights and activations.
- TWLA maintains high accuracy while delivering substantial inference acceleration.
Summary
The paper presents TWLA, a novel post-training quantization (PTQ) framework designed to optimize large language models (LLMs) by achieving ternary weights and low-bit activations. The authors address the challenges posed by heavy-tailed activation distributions that hinder effective quantization, leading to high precision requirements for activations and limiting inference acceleration. TWLA consists of three main components: (1) the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error during weight ternarization through a two-stage optimization process; (2) the Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which reshapes weights into ternary-friendly tri-modal distributions while suppressing activation outliers; and (3) the Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which optimizes bit allocation by considering the interaction costs between adjacent layers. The extensive experiments demonstrate that TWLA achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy, significantly improving inference speed compared to existing methods.
Methodology
TWLA employs a post-training quantization approach that includes three main components: E2M-ATQ for optimizing weight ternarization, KOTMS for reshaping weights into tri-modal distributions and suppressing activation outliers, and ILA-AMP for optimizing bit allocation by considering inter-layer interactions. This systematic approach allows for effective joint optimization of weights and activations.
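For readers unfamiliar with ternary weights, the sketch below shows a plain round-to-nearest ternarizer with a single abs-mean scale. This is a generic baseline swapped in for illustration, not E2M-ATQ, which the summary describes as a two-stage optimization of layer-output error. It also shows where the 1.58-bit figure comes from: a symbol with three possible values carries log2(3) bits.

```python
import math

def ternarize(weights):
    """Round-to-nearest ternary quantization with one abs-mean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0  # avoid divide-by-zero
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

codes, scale = ternarize([0.9, -0.05, -1.1, 0.4])
bits_per_weight = math.log2(3)  # information content of one ternary symbol
```

Each weight is stored as one of {-1, 0, +1} plus a shared scale, so matrix multiplications reduce to additions and subtractions followed by one rescaling.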
Results
TWLA achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy across various tasks. The framework demonstrates robust performance under both weight-only and weight-activation quantization, outperforming existing methods that struggle with low-bit activations.
Implications
The TWLA framework has significant implications for deploying large language models in resource-constrained environments, enabling efficient inference without compromising model performance. This approach can facilitate the use of LLMs in edge devices and applications requiring low latency and reduced memory usage.
Select and Improve: Understanding the Mechanics of Post-Training for Reasoning
Reinforcement Learning
Large Language Models
NLP
- Identified two core mechanisms of RL post-training: strategy selection and strategy improvement.
- Performance improvement through RL is dependent on the diversity of pre-training data and the difficulty of RL data.
- Strategy selection routes problems to existing reasoning patterns, while strategy improvement enhances these patterns.
- High-quality pre-training data is essential for effective RL training in reasoning tasks.
Summary
This paper investigates the mechanisms by which reinforcement learning (RL) enhances the capabilities of reasoning and coding models, particularly focusing on math reasoning tasks. The authors identify two primary mechanisms: strategy selection, which routes problems to existing reasoning patterns learned during pre-training, and strategy improvement, which enhances these patterns. The study reveals that the effectiveness of these mechanisms is heavily influenced by the quality and diversity of the pre-training and RL datasets. Notably, strategy selection is crucial for performance, requiring a rich set of reasoning patterns in the pre-training data, while strategy improvement necessitates more challenging questions in the RL dataset. The findings suggest that RL primarily refines pre-existing reasoning patterns rather than introducing new ones, emphasizing the importance of high-quality pre-training data for scaling RL applications in reasoning tasks.
Methodology
The authors conducted controlled experiments using a synthetic finite-field arithmetic task with the Qwen-2.5-1.5B model. They employed a standard language model post-training setup involving supervised fine-tuning (SFT) followed by reinforcement learning (RL). The experiments analyzed how different datasets influenced the activation of strategy selection and improvement mechanisms.
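To make the setup concrete, here is a hypothetical generator for arithmetic problems over the integers modulo a prime. The problem format, the choice of prime, and the use of chain length as a difficulty knob are assumptions for illustration; the summary does not specify how the authors' task is constructed.

```python
import random

def make_problem(rng, prime=7, n_ops=2):
    """Build a fully parenthesized arithmetic chain and its answer modulo `prime`."""
    value = rng.randrange(prime)
    text = str(value)
    for _ in range(n_ops):
        op = rng.choice("+*")
        operand = rng.randrange(prime)
        if op == "+":
            value = (value + operand) % prime
        else:
            value = (value * operand) % prime
        text = f"({text} {op} {operand})"
    return text, value

rng = random.Random(0)
easy_text, easy_answer = make_problem(rng, prime=7, n_ops=2)   # e.g. an SFT-level item
hard_text, hard_answer = make_problem(rng, prime=7, n_ops=6)   # e.g. a harder RL item
```

Longer chains give a simple way to produce RL problems that are harder than those seen during SFT, which is the condition the summary associates with strategy improvement.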
Results
The experiments demonstrated that strategy selection was the primary driver of performance improvements, contingent on the presence of diverse reasoning patterns in pre-training data. Strategy improvement was observed when RL datasets contained more difficult problems than those encountered during SFT. The study also confirmed phenomena such as strategy amplification and composition as outcomes of the core mechanisms rather than separate processes.
Implications
The findings suggest that improving pre-training datasets can significantly enhance the effectiveness of RL in reasoning tasks. This insight can guide future research and practical applications in model training, particularly in developing more capable reasoning and coding models through better data strategies.
Multi-Bitwidth Quantization for LLMs Using Additive Codebooks
Large Language Models
Efficient ML
Theory
- Introduction of Drop-by-Drop, a multi-bitwidth quantization framework for LLMs.
- Establishes a theoretical foundation linking Gaussian source refinement to quantization methods.
- Enables inference-time control over model weight precision without retraining.
- Maintains low perplexity and strong accuracy across various bitwidths.
Summary
The paper introduces Drop-by-Drop, a novel multi-bitwidth post-training quantization framework designed for large language models (LLMs). As LLMs are deployed across diverse hardware with varying resource constraints, the ability to manage the trade-off between performance and efficiency without retraining is crucial. The authors establish that LLM weights, typically following a Gaussian distribution, can be optimally reconstructed with increasing fidelity as more bits are added, guided by a weighted mean squared error distortion. The Drop-by-Drop framework incorporates Matryoshka-style supervision into the loss function, leveraging the structure of additive codebooks to allow a single model to operate at multiple precision levels. This approach significantly reduces storage and memory overhead while maintaining competitive perplexity and accuracy across major architectures like Qwen, LLaMA, Gemma, and Mistral. The paper bridges the gap between information theory and practical quantization techniques, demonstrating that adaptive precision control during inference can enhance the deployment of LLMs.
Methodology
The Drop-by-Drop framework utilizes additive quantization and Matryoshka-style supervision to train codebooks that allow for progressive compression. It is grounded in information theory, particularly the concept of successive source refinement, which enables the model to adjust its precision dynamically during inference.
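The link between additive codebooks and successive refinement can be seen in a toy scalar example. The sketch uses greedy residual quantization with hand-picked codebooks, which is a simplified stand-in and not the trained codebooks or Matryoshka-style loss from the paper. Each codebook includes 0 (an assumption made here) so that adding a stage can never increase the reconstruction error.

```python
def encode(value, codebooks):
    """Greedily pick one entry per codebook to approximate the running residual."""
    indices, residual = [], value
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        indices.append(idx)
        residual -= book[idx]
    return indices

def decode(indices, codebooks, n_stages):
    """Reconstruct using only the first n_stages codebooks (lower precision)."""
    return sum(codebooks[s][indices[s]] for s in range(n_stages))

codebooks = [[-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5], [-0.25, 0.0, 0.25]]
indices = encode(0.8, codebooks)
errors = [abs(0.8 - decode(indices, codebooks, k)) for k in (1, 2, 3)]
```

The weight is encoded once, and the precision used at inference is chosen by how many codebook entries are summed, so dropping the later stages lowers the bit budget without re-encoding.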
Results
The Drop-by-Drop framework demonstrated that it can maintain low perplexity and strong task accuracy across a range of supported bitwidths, effectively adapting to resource constraints without the need for retraining or recalibration.
Implications
This work has significant implications for the deployment of LLMs in resource-constrained environments, allowing for more flexible and efficient use of models across different hardware platforms. It opens avenues for further research in adaptive quantization techniques and their applications in real-world scenarios.
Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions
Theory
- Introduces a formal definition of statistical consistency for predictions in machine learning.
- Develops the Fair Bayesian classifier that ensures consistency across all subgroups.
- Demonstrates that standard classifiers often yield statistically inconsistent predictions.
- Achieves zero consistency error while exceeding baseline accuracy on benchmark datasets.
Summary
This paper addresses the challenge of ensuring fairness in machine learning predictions, particularly in high-stakes domains where predictions can disproportionately affect minority groups. The authors highlight the issue of statistical inconsistency in predictions across subgroups, especially those defined by multiple features. They propose a formal definition of consistent predictions built on two requirements, determinism and statistical consistency, which leads to the Fair Bayesian classifier. This classifier aims to provide consistent predictions across all subgroups in a categorical dataset and abstains from making predictions when no consistent output is possible. The authors demonstrate that standard classifiers often produce inconsistent predictions, particularly for small subgroups, while the Fair Bayesian classifier achieves zero consistency error and maintains competitive accuracy across various datasets. The findings suggest that enforcing Bayesian consistency can enhance prediction quality and fairness in machine learning applications.
Methodology
The authors define two key requirements for consistent predictions: determinism (identical individuals receive identical predictions) and statistical consistency (predictions cannot be rejected as drawn from the optimal target distribution). They develop the Fair Bayesian classifier based on these principles, which operates across all groups and subgroups simultaneously and abstains from predictions when necessary.
Results
The Fair Bayesian classifier was tested on three benchmark datasets (Adult, COMPAS, and Bank Marketing) and achieved zero consistency error, outperforming standard classifiers like decision trees and neural networks in terms of accuracy on the subgroups where it made predictions. It also performed competitively on the multicalibration metric, indicating its effectiveness in maintaining fairness.
Implications
The findings suggest that statistical consistency is crucial for improving prediction quality and fairness in machine learning, particularly in sensitive applications such as finance and healthcare. The ability to abstain from making predictions when evidence is insufficient can help mitigate the risks of unfair outcomes for minority groups.
Understanding Truncated Positional Encodings for Graph Neural Networks
Graph Learning
Theory
- Truncated spectral and walk-based positional encodings have different expressive powers.
- Truncated spectral PEs can be less expressive than the 1-WL test.
- k-harmonic distances provide a bridge between spectral and polynomial positional encodings.
- A combination of truncated PEs yields better performance than using a single family.
Summary
This paper investigates the theoretical and empirical properties of truncated positional encodings (PEs) used in Graph Neural Networks (GNNs). While spectral and walk-based PEs are known to be equivalent in expressive power when using complete forms, this study reveals that their truncated versions exhibit fundamentally different expressive capabilities. The authors demonstrate that truncated spectral PEs do not surpass the expressivity of the 1-WL test, and they explore the k-harmonic distances as a bridge between spectral and polynomial encodings. The research highlights that a mix of truncated PEs outperforms any single family on real-world datasets, suggesting that practitioners should consider combinations of PEs for improved performance in GNNs.
Methodology
The authors conducted theoretical analyses to compare the expressive power of truncated spectral and walk-based positional encodings. They also performed empirical experiments on real-world datasets to evaluate the performance of various combinations of PEs, including k-harmonic distances and effective resistance.
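As a concrete reference for the walk-based family, the sketch below computes a truncated random-walk positional encoding: for each node, the probability of returning to itself after 1 to `num_steps` steps. It is a minimal pure-Python illustration that assumes an unweighted adjacency matrix with no isolated nodes; the function name is illustrative, and the spectral family (eigenvectors of the graph Laplacian) is not shown.

```python
def random_walk_pe(adj, num_steps):
    """Per node, return probabilities after 1..num_steps random-walk steps."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Row-normalized transition matrix D^-1 A.
    walk = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    power = [row[:] for row in walk]
    pe = [[] for _ in range(n)]
    for _ in range(num_steps):
        for i in range(n):
            pe[i].append(power[i][i])  # diagonal entry = return probability
        power = [[sum(power[i][k] * walk[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return pe

# Path graph on three nodes: 0 - 1 - 2
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
pe = random_walk_pe(path, num_steps=2)
```

Truncation here means stopping at a small `num_steps`; the encoding then separates the middle node from the two endpoints but cannot tell the endpoints apart, since they are structurally equivalent.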
Results
The study found that truncated spectral and walk encodings differ significantly in their ability to distinguish graphs. Specifically, it was shown that truncated spectral PEs do not exceed the expressivity of the 1-WL test. The k-harmonic distances were identified as a powerful alternative, with the potential to match the expressivity of complete spectral and walk encodings when sufficiently many distances are used. Empirical results indicated that using a mix of truncated PEs consistently outperformed any single encoding approach.
Implications
The findings suggest that practitioners should reconsider the use of truncated positional encodings in GNNs, as combining different types can enhance model performance. This research opens avenues for exploring new positional encodings that may offer competitive expressivity without the computational burden of complete forms.
The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics
Generative Models
Theory
- Introduces a geometric framework for understanding phase transitions in generative dynamics.
- Identifies projection caustics as critical points where generative models exhibit sharp transitions.
- Develops the Critical Boundary Detector (CBD) to diagnose score-direction instability.
- Demonstrates the effectiveness of CBD across toy models and latent text-to-image diffusion models.
Summary
This paper explores the geometric underpinnings of phase-transition-like behaviors in continuous-state generative models, such as diffusion and flow-matching models. These models exhibit abrupt qualitative changes in their samples, despite evolving continuously in state space. The authors propose a geometric interpretation of denoising as gradient descent on a free energy landscape, identifying sharp transitions occurring near projection caustics—regions where the nearest-point projection onto the data support is not unique. To diagnose these transitions, they introduce the Critical Boundary Detector (CBD), which identifies instability in score directions along generative trajectories. The CBD is tested across various models, demonstrating its ability to localize mode commitment and predict sensitive intervention windows. The findings establish a connection between the geometry of data and the dynamics of diffusion generation, providing insights into how small perturbations can lead to significant changes in generated outputs.
Methodology
The authors analyze the behavior of generative models through a geometric lens, focusing on the concept of projection caustics. They develop the Critical Boundary Detector (CBD) to track changes in score directions along generative trajectories, allowing for the identification of regions sensitive to perturbations. The methodology includes theoretical analysis and empirical testing on various generative models.
Results
The CBD successfully identifies regions of mode commitment and predicts intervention-sensitive windows in generative processes. The results indicate that sharp transitions in generative dynamics are closely related to the geometry of the underlying data distribution, particularly near projection caustics.
Implications
The findings suggest that understanding the geometric properties of data can enhance the control and predictability of generative models. This could lead to improved techniques for steering generative processes in applications such as image synthesis, text generation, and other creative AI tasks.
Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models
Graph Learning
Time Series
Theory
- Identification of 'attribution bypass' in graph-based neural marketing mix models.
- Introduction of DICE-MMM as a two-stage framework for graph learning and diagnostics.
- Development of CIG and AR-CIG diagnostics to assess decoder alignment with the graph.
- Empirical evidence showing that low forecasting error does not guarantee valid attribution.
Summary
This paper addresses a critical issue in marketing mix models (MMM) that combine forecasting and attribution tasks, highlighting a phenomenon termed 'attribution bypass.' The authors argue that high forecasting accuracy does not guarantee valid attribution to marketing channels, as a powerful decoder can achieve low forecasting error without effectively utilizing the underlying graph structure meant for attribution. To tackle this problem, the authors introduce DICE-MMM, a diagnostic framework designed to differentiate between graph recovery, forecasting accuracy, and the decoder's alignment with the graph. DICE-MMM consists of two stages: the first trains a graph encoder with a restricted decoder, while the second freezes the encoder and trains a graph-safe latent decoder. The paper employs counterfactual influence graphs (CIG), autoregressive rollout influence graphs (AR-CIG), and frozen-decoder graph-swap tests to evaluate the model's performance. The findings reveal that while DICE improves graph recovery compared to existing methods, low MSE does not equate to valid attribution. The results indicate that the learned graph interfaces and current sparsification techniques are insufficient for reliable attribution, thus localizing the challenge to the selection of deployable graph support rather than the forecasting or decoder capacity.
Methodology
The methodology involves a two-stage framework (DICE-MMM) where the first stage trains a graph encoder with a restricted decoder to ensure that the graph influences the forecasting process. The second stage freezes the encoder and trains a graph-safe latent decoder, ensuring that the decoder's responses are aligned with the graph structure. The evaluation employs CIG and AR-CIG diagnostics alongside frozen-decoder graph-swap tests to assess the model's performance and alignment with the graph.
Results
The results demonstrate that DICE-MMM improves stable graph recovery compared to existing models like CausalMMM. However, they also reveal that both no-graph and full-graph decoders can achieve MSE similar to that obtained with oracle graphs, while their AR-CIG alignment remains low. The decoder is nonetheless not graph-blind: it performs significantly better when provided with an oracle graph. The current learned graph interfaces and sparsification techniques remain inadequate for reliable attribution.
Implications
The implications of this research are significant for the deployment of marketing mix models in real-world scenarios. It emphasizes the need for careful evaluation of graph-based models to ensure that they not only forecast accurately but also provide valid attribution to marketing channels. The DICE-MMM framework can serve as a diagnostic tool for practitioners to identify potential pitfalls in their models and improve the reliability of their attribution assessments.
Reliability of Probabilistic Emulation of Physical Systems
Generative Models
Time Series
Theory
- CRPS-trained ensembles provide more reliable uncertainty estimates than generative models.
- Generative models trained in ambient space can match the coverage of CRPS ensembles but are computationally expensive.
- The study introduces AutoCast and AutoSim for benchmarking and dataset generation, respectively.
- Reliable uncertainty quantification is crucial for effective decision-making in physical system modeling.
Summary
This paper addresses the reliability of probabilistic emulation methods for physical systems, focusing on two dominant approaches: generative models and CRPS-trained ensembles. While both methods have shown strong predictive accuracy, their uncertainty quantification (UQ) has not been systematically evaluated. The authors develop a framework to assess the empirical coverage of predictive intervals across various 2D spatiotemporal physical systems, ensuring matched model size and computational budget. The study finds that CRPS-trained ensembles typically yield more reliable uncertainties than generative models, particularly in terms of coverage and computational efficiency. When generative models are trained in ambient space, they can achieve comparable coverage to CRPS ensembles but at a higher inference cost. The paper introduces AutoCast, a modular framework for implementing both modeling approaches, and AutoSim, a dataset generation package for rapid prototyping. The findings highlight the importance of reliable UQ in real-world applications and suggest future research directions for enhancing probabilistic emulation methods.
Methodology
The authors developed a framework to evaluate the reliability of probabilistic emulation methods by inspecting the empirical coverage of predictive intervals. They compared generative models and CRPS-trained ensembles across various 2D spatiotemporal systems, focusing on predictive accuracy, uncertainty quantification, and computational efficiency.
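As a rough illustration of what checking empirical coverage can look like, the sketch below measures how often the truth falls inside the central interval of an ensemble of sampled forecasts. It is not the AutoCast implementation; the function name `empirical_coverage`, the array shapes, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def empirical_coverage(ensemble, truth, level=0.9):
    """Fraction of true values inside the central `level` interval of the ensemble.

    ensemble: array of shape (n_members, ...) holding sampled forecasts.
    truth:    array of shape (...) holding the observed values.
    """
    alpha = 1.0 - level
    lower = np.quantile(ensemble, alpha / 2, axis=0)
    upper = np.quantile(ensemble, 1.0 - alpha / 2, axis=0)
    inside = (truth >= lower) & (truth <= upper)
    return float(inside.mean())

# Synthetic 2D field (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
truth = rng.normal(size=(32, 32))
calibrated = rng.normal(size=(200, 32, 32))            # spread matches the truth
overconfident = 0.3 * rng.normal(size=(200, 32, 32))   # spread is too narrow

cov_cal = empirical_coverage(calibrated, truth)
cov_over = empirical_coverage(overconfident, truth)
print(f"calibrated: {cov_cal:.2f}, overconfident: {cov_over:.2f}")
```

A reliable emulator should give coverage close to the nominal level (0.9 here), while an overconfident one falls well below it.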
Results
The results indicate that CRPS-trained ensembles generally achieve better empirical coverage and faster inference compared to generative models. When generative models are trained in ambient space, they can achieve similar coverage to CRPS ensembles but with increased latency. The study also highlights the effectiveness of AutoSim for flexible dataset generation.
Implications
The findings suggest that improving the reliability of probabilistic forecasts is essential for real-world applications, particularly in fields like climate modeling and materials science. The developed frameworks can facilitate future research and practical implementations in probabilistic emulation.
Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
NLP
Large Language Models
Interpretability
- Introduces the Bag of Dims framework for interpretability in transformer models.
- Demonstrates that sign patterns in hidden states can predict semantic content effectively.
- Achieves unsupervised discovery of 175 semantic categories with high accuracy.
- Shows that features are preserved through attention projections and linked to specific neurons.
Summary
This paper introduces the 'Bag of Dims' framework, which leverages the standard basis of transformer hidden states to provide a training-free, architecture-general method for mechanistic interpretability. The framework posits that individual dimensions in transformer models encode semantic content through their signs (±1) and confidence via their magnitudes, functioning as independent binary registers. The author validates this framework across three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four experiments. The findings reveal that sign patterns alone can achieve significant predictive accuracy, with top-5 next-token accuracy ranging from 72% to 93%. The study also demonstrates that these sign patterns can be used to discover 175 semantic categories without any training, achieving a mean AUC of 0.80. Furthermore, the research indicates that the discovered features persist through attention projections and that static feedforward network (FFN) weight inspection links a substantial portion of features to individual writer neurons. The results suggest that the standard basis is sufficient for feature reading throughout the transformer compute pathway, requiring no additional training or optimization.
Methodology
The methodology involves treating transformer hidden states as collections of independent binary registers, where each dimension encodes semantic content via its sign and confidence via its magnitude. The framework is validated through a series of experiments that assess the predictive power of sign patterns, the discovery of semantic categories using a single-token type cache, and the analysis of attention projections and FFN weights.
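A minimal sketch of the "binary register" reading described above, assuming a hidden state is simply a NumPy vector: the sign of each dimension is kept as a pattern and the magnitude as a confidence. The helper names `sign_pattern` and `sign_agreement` and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def sign_pattern(hidden):
    """Split a hidden state into per-dimension signs (+1/-1) and confidences."""
    signs = np.where(hidden >= 0, 1, -1).astype(np.int8)  # zeros are treated as +1
    confidence = np.abs(hidden)
    return signs, confidence

def sign_agreement(a, b):
    """Fraction of dimensions whose signs match between two patterns."""
    return float(np.mean(a == b))

# Toy hidden states (hypothetical values).
base = np.array([0.5, -1.2, 0.3, -0.1, 2.0, -0.7])
scaled = 2.0 * base       # same direction, larger magnitudes
flipped = -base           # every dimension has the opposite sign

s_base, c_base = sign_pattern(base)
s_scaled, _ = sign_pattern(scaled)
s_flipped, _ = sign_pattern(flipped)

same = sign_agreement(s_base, s_scaled)
opposite = sign_agreement(s_base, s_flipped)
```

Rescaling a vector leaves its sign pattern unchanged, which is why magnitude can be read separately as confidence in this picture.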
Results
The results indicate that sign patterns alone can achieve 72-93% top-5 next-token accuracy and 80-90% top-4096 accuracy without any learned decoder. The unsupervised discovery process yields 175 semantic categories with a mean AUC of 0.80, and the features discovered are shown to be preserved in attention projections. Additionally, 20% of features are linked to individual writer neurons with high agreement, confirming the effectiveness of the Bag of Dims framework.
Implications
The findings suggest that the Bag of Dims framework can make transformer models more interpretable without additional training or optimization. This may be useful in safety-critical deployments of language models, where understanding model behavior is essential.
Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios
Time Series
- Introduction of PowerPhase, a benchmark for probabilistic forecasting in power systems with up to 36,964 channels.
- Incorporation of voltage-safety metrics to evaluate model performance alongside traditional accuracy measures.
- Identification of a safety-fidelity trade-off in model rankings, emphasizing the need for constraint satisfaction.
- Development of PowerForge, a scenario-based forecasting model tailored for transmission-scale grids.
Summary
This paper addresses the challenges of probabilistic forecasting in multivariate power systems, which require consideration of both operational constraints and temporal dependencies. The authors introduce PowerPhase, a new benchmark for probabilistic forecasting that encompasses six transmission grids with up to 36,964 channels, significantly exceeding the scale of existing benchmarks. PowerPhase incorporates voltage-safety evaluations and a suite of constraint-aware metrics to assess model performance beyond traditional distributional accuracy. The study reveals a trade-off between safety and fidelity in model rankings, highlighting that constraint satisfaction must be evaluated alongside forecasting accuracy. Additionally, the authors propose PowerForge, a scenario-based quantile forecaster designed to handle the complexities of transmission-scale grids. PowerForge employs architectural innovations such as type-aware decoding heads and a causal cross-type bridge to maintain tractability while producing accurate forecasts. The results demonstrate that PowerForge achieves superior performance across various grid sizes and conditions, establishing a new standard for forecasting in power systems.
Methodology
The authors developed PowerPhase as a benchmarking framework for probabilistic forecasting in power systems, generating target trajectories through AC power-flow simulations. They introduced a suite of constraint-aware metrics to evaluate model performance, focusing on voltage safety. PowerForge, the proposed forecasting model, utilizes a scenario-based approach with architectural features that account for the unique characteristics of transmission grids, including type-aware decoding heads and a causal cross-type bridge.
Results
The evaluation of PowerForge against eight baseline models across various grid sizes and seeds showed that it consistently ranked highest in terms of both distributional accuracy and constraint satisfaction. The findings highlighted the trade-off between safety and fidelity, with models performing differently based on their ability to meet operational constraints.
Implications
The introduction of PowerPhase and PowerForge has significant implications for the operational forecasting of power systems, particularly as the integration of renewable energy sources increases grid complexity. This work sets a new standard for evaluating forecasting models in high-dimensional settings and emphasizes the importance of safety in probabilistic predictions.
Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
NLP
Large Language Models
Optimization
- Different transformer modules (attention and MLP layers) prefer different weight-space geometries.
- Assigning Stiefel geometry to attention layers and DGram geometry to MLP layers yields optimal performance.
- DGram constraints on attention layers can lead to instability due to singular value growth and softmax saturation.
- Module-specific optimization strategies are necessary for effective transformer training.
Summary
This paper investigates the impact of weight-space geometry on the optimization of transformer models, specifically focusing on whether different modules (attention and MLP layers) benefit from distinct manifold geometries. The study employs the Manifold Muon framework during the pretraining of the GPT-2 model, comparing the performance of various layer-wise assignments of Stiefel and DGram constraints. The findings reveal a significant asymmetry: assigning Stiefel geometry to attention layers and DGram geometry to MLP layers yields the best performance, while the reverse assignment and an all-DGram configuration lead to instability. This instability is traced back to singular value growth in DGram-constrained attention weights, which can amplify logits and cause softmax saturation, degrading training dynamics. The results suggest that optimization strategies for transformers should be tailored to the specific geometrical preferences of different modules rather than applying uniform constraints.
Methodology
The study utilizes the Manifold Muon framework to apply structured matrix-manifold constraints during the pretraining of the GPT-2 model. It evaluates five different layer-wise manifold assignments: UNCONSTRAINED, ALL-STIEFEL, ALL-DGRAM, HETERO (Stiefel for attention and DGram for MLP), and HETERO-INV (DGram for attention and Stiefel for MLP). Performance is measured through validation loss and stability of training outcomes.
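To make the Stiefel constraint concrete: a matrix lies on the Stiefel manifold when its columns are orthonormal, so all of its singular values equal 1. The sketch below projects an arbitrary weight matrix onto that set using the polar factor from an SVD. This is a generic illustration, not the Manifold Muon update rule, and the DGram constraint is not shown because the summary does not define it; `project_to_stiefel` is a name chosen for this example.

```python
import numpy as np

def project_to_stiefel(w):
    """Return the closest matrix (in Frobenius norm) with orthonormal columns.

    Uses the polar factor U @ V^T from the thin SVD of w. Assumes w is
    tall or square (rows >= columns) and has full column rank.
    """
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))        # hypothetical weight matrix
q = project_to_stiefel(w)

gram = q.T @ q                                       # should be the 4x4 identity
singular_values = np.linalg.svd(q, compute_uv=False)  # should all be 1
```

Because every singular value is pinned at 1, a Stiefel-constrained matrix cannot amplify its input, which is consistent with the summary's point that singular value growth under a different constraint can inflate attention logits.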
Results
The results indicate that the HETERO assignment (Stiefel for attention and DGram for MLP) achieves the lowest validation loss (3.3544), while the ALL-DGRAM configuration becomes unstable. The study highlights the importance of module-specific geometry in transformer optimization, with the HETERO configuration outperforming others under a shared training setup.
Implications
These findings suggest that transformer optimization can be significantly improved by adopting module-specific weight-space geometries, which could lead to more stable training dynamics and better performance in large language models. This approach may also inform the design of future optimization algorithms tailored for heterogeneous neural network architectures.
Adaptive Weighted Averaging
Optimization
Theory
- Introduces adaptive strategies for selecting the maximum among unknown values based on unbiased estimates.
- Presents the SBern strategy, which is admissible and strictly dominates uniform random selection.
- Constructs the Speel strategy that dominates any fixed deterministic strategy.
- Demonstrates impossibility results for dependent observations and the limitations of certain strategies.
Summary
This paper addresses the problem of selecting the largest value among n unknown values based on a single unbiased estimate of each. The authors propose strategies that are admissible and that outperform a baseline strategy such as uniform random selection. They introduce a method called SBern, which is derived from a multilinear extension optimized for Bernoulli observations, and show that it strictly dominates uniform random selection. The paper also constructs a new strategy, Speel, that dominates any fixed deterministic strategy. Furthermore, the authors explore the limitations of these strategies in cases where observations are dependent and provide impossibility results. The findings are applied to stochastic optimization, yielding improved online-to-batch conversion bounds that maintain worst-case optimality while enhancing performance in favorable scenarios. The results suggest that the proposed methods can be beneficial in various applications, including ensemble methods and federated learning.
Methodology
The authors develop strategies based on the concept of admissibility and domination over baseline strategies. They utilize a multilinear extension approach for Bernoulli observations to derive the SBern strategy and construct the Speel strategy for arbitrary benchmarks. The paper also includes theoretical analyses and counterexamples to explore the limitations of these strategies in dependent observation settings.
Results
The SBern strategy is shown to improve upon the mean of the unknown values by a term related to their variance, while the Speel strategy is proven to dominate any fixed deterministic strategy. The paper establishes new online-to-batch conversion bounds that enhance performance in benign settings without sacrificing worst-case optimality.
Implications
The proposed strategies can significantly improve decision-making processes in stochastic optimization and have potential applications in ensemble methods and federated learning, where adaptive performance is crucial.
Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning
Theory
Time Series
Interpretability
- Introduction of DYSCO, a novel multi-view contrastive learning framework for system identification.
- Theoretical guarantees for identifying latent dynamical systems under noisy observations.
- Compatibility with symbolic regression for recovering governing equations.
- Empirical validation across diverse dynamical regimes, demonstrating robustness to noise.
Summary
This paper addresses the challenge of identifying latent dynamical systems from noisy, high-dimensional measurements, a critical issue in representation learning, system identification, and scientific discovery. The authors introduce DYSCO, a multi-view temporal contrastive learning algorithm that recovers latent trajectories and governing dynamics by utilizing multiple independent noisy observations of the same process. The framework disentangles signal from noise through a structured functional basis for parameterizing dynamics, allowing for symbolic recovery of governing equations within an affine gauge. The authors provide theoretical guarantees for strong identification of the latent system up to affine indeterminacy, extending previous results to noisy nonlinear observations. Empirical results demonstrate the method's effectiveness in accurately recovering latent trajectories and flow fields across various dynamical regimes, including chaotic, oscillatory, and metastable dynamics, even under significant observational noise. This work suggests that structured contrastive learning can effectively bridge the gap between raw observations and interpretable descriptions of underlying computations.
Methodology
The methodology involves a multi-view temporal contrastive learning framework that leverages multiple independent noisy views of a latent dynamical system. The approach includes learning an encoder to recover latent states and a structured dynamics model expressed in a predefined functional basis. The learning process is driven by a temporal contrastive objective that ensures consistency across views while respecting the underlying dynamics.
Results
The results indicate that DYSCO can accurately recover both latent trajectories and flow fields across various dynamical regimes, including chaotic, oscillatory, and metastable dynamics, even in the presence of substantial observational noise. The theoretical framework provides identifiability guarantees, confirming the method's robustness and applicability in realistic settings.
Implications
The findings have significant implications for fields such as neuroscience and physics, where understanding the governing dynamics of complex systems from noisy observations is crucial. The ability to recover interpretable models from high-dimensional data can enhance scientific discovery and improve the development of generalizable machine learning models.
Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration
Graph Learning
Efficient ML
Optimization
- Introduces a unified graph-based framework for dataset pruning that captures both intrinsic and extrinsic sample values.
- Formulates dataset pruning as a Maximum Weight Clique Problem (MWCP) and provides a principled greedy solution.
- Proves formal approximation guarantees for a broad family of importance metrics under mild conditions.
- Demonstrates significant reduction in training time (over 40%) without sacrificing accuracy on ImageNet-1k.
Summary
This paper addresses the high computational cost of training on large datasets by proposing a unified dataset pruning (DP) framework that leverages graph-based modeling. Traditional DP methods either focus on intrinsic signals, assessing samples independently, or extrinsic signals, promoting diversity through pairwise relations. The authors argue that both approaches are limited because each captures only one aspect of sample utility. To overcome this, they introduce a graph-based framework where dataset samples are represented as nodes with weights indicating intrinsic value, and edges represent extrinsic relationships. The pruning task is formulated as a Maximum Weight Clique Problem (MWCP), which is NP-hard. However, the authors develop a greedy solution based on sample-wise marginal gains and prove that their unified objective has formal approximation guarantees under certain conditions. Their experiments show that the method reduces training time by over 40% on ImageNet-1k with ResNet-50 while maintaining accuracy.
Methodology
The authors model the dataset as a weighted graph, where node weights represent intrinsic importance and edge weights represent extrinsic relationships. They reformulate the dataset pruning problem as a Maximum Weight Clique Problem (MWCP) and derive a greedy algorithm based on sample-wise marginal gains to efficiently approximate the optimal solution.
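The greedy idea can be sketched as follows, under a simplifying assumption that is not taken from the paper: the value of a subset is the sum of its node weights plus the sum of edge weights between its members, so the marginal gain of a sample is its own weight plus its edge weights to everything already selected. The function name `greedy_prune` and the toy weights are hypothetical.

```python
import numpy as np

def greedy_prune(node_w, edge_w, k):
    """Greedily keep k samples by largest marginal gain.

    node_w: (n,) intrinsic importance of each sample.
    edge_w: (n, n) symmetric pairwise weights with a zero diagonal
            (here larger means the two samples are more complementary).
    """
    n = len(node_w)
    gain = np.asarray(node_w, dtype=float).copy()  # gains relative to the empty set
    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        best = int(np.argmax(np.where(available, gain, -np.inf)))
        selected.append(best)
        available[best] = False
        gain += edge_w[best]  # picking `best` shifts every other sample's gain
    return selected

# Samples 0 and 1 are near-duplicates (edge weight 0); sample 2 is
# individually weak but complements both of them.
node_w = np.array([1.0, 0.9, 0.2])
edge_w = np.array([[0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
selected = greedy_prune(node_w, edge_w, k=2)
print(selected)  # → [0, 2]
```

In this toy case a purely intrinsic ranking would keep samples 0 and 1, while the combined objective keeps 0 and 2, which is the kind of interaction the unified framework is meant to capture.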
Results
The proposed method outperforms existing dataset pruning techniques, achieving over a 40% reduction in training time on ImageNet-1k with ResNet-50, while maintaining model accuracy, demonstrating the practical effectiveness of the unified framework.
Implications
This framework has the potential to enhance the efficiency of training large-scale machine learning models, making it applicable in various domains where computational resources are a constraint. It also provides a theoretical foundation for developing new importance metrics in dataset pruning.
MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting
Graph Learning
Time Series
- MP3 effectively addresses the temporal mirage phenomenon in spatio-temporal forecasting.
- The plugin enhances existing STGNNs by learning multi-period patterns from long time series.
- Experiments show consistent performance improvements across multiple datasets and models.
- MP3 reduces forecasting errors significantly, demonstrating its robustness and scalability.
Summary
The paper introduces MP3, a novel pre-training plugin designed to enhance spatio-temporal forecasting by addressing the challenges posed by temporal mirages in urban data. Temporal mirages occur when similar short-window inputs lead to divergent future trends, complicating accurate predictions. The authors identify three main issues with existing spatio-temporal graph neural networks (STGNNs): incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. MP3 innovatively incorporates multi-period pattern learning, which utilizes edge convolution for temporal modeling and a global memory bank for spatial relations. Additionally, it employs a causality-enhanced Transformer to capture dependencies across different periods. The plugin can be integrated with existing STGNN architectures, significantly improving their forecasting capabilities. Experiments conducted on five STGNN baselines across five real-world datasets demonstrate that MP3 consistently enhances performance, achieving an average reduction of 4.7% in Mean Absolute Error (MAE) and 5.0% in Root Mean Square Error (RMSE).
Methodology
The authors developed MP3 as a pre-training plugin that incorporates multi-period pattern learning through edge convolution for temporal modeling, a global memory bank for capturing heterogeneous spatial correlations, and a causality-enhanced Transformer to understand dependencies across different periods. This approach allows for the effective identification of temporal mirages and enhances the forecasting capabilities of existing STGNNs.
Results
MP3 was tested on five STGNN baselines using five real-world datasets, resulting in an average reduction of 4.7% in MAE and 5.0% in RMSE. The results indicate that MP3 significantly improves forecasting accuracy and demonstrates strong adaptability and scalability across various models.
Implications
The findings suggest that MP3 can be a valuable tool for improving spatio-temporal forecasting in various applications, including transportation, climate modeling, and energy management. By effectively addressing the challenges of temporal mirages, MP3 can enhance decision-making processes in these critical fields.
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Reinforcement Learning
Generative Models
Large Language Models
- MaxProof integrates generative-verifier RL with population-level scaling for mathematical proofs.
- The framework combines proof generation, verification, and repair capabilities into a single model.
- MaxProof achieves scores exceeding human gold-medal thresholds in major mathematical competitions.
- The methodology emphasizes tournament selection from a population of candidate proofs to enhance accuracy.
Summary
The paper presents MaxProof, a novel framework designed to enhance the capabilities of mathematical proof generation through a combination of generative-verifier reinforcement learning (RL) and population-level test-time scaling. The authors detail the training of three distinct proof-oriented capabilities: proof generation, proof verification, and critique-conditioned proof repair, all integrated into a single model known as M3. At test time, MaxProof operates by treating the model as a generator, verifier, refiner, and ranker, employing a population of candidate proofs and utilizing tournament selection to identify the most promising proof. The results demonstrate that MaxProof significantly outperforms previous models, achieving scores of 35/42 on the International Mathematical Olympiad (IMO) 2025 and 36/42 on the USAMO 2026, surpassing the human gold-medal threshold in both competitions. This advancement highlights the potential of combining generative models with RL techniques to tackle complex reasoning tasks in mathematics.
Methodology
The methodology involves training a generative verifier using reinforcement learning to develop three core capabilities: proof generation, verification, and repair. The MaxProof framework then applies a population-level search strategy at test time, utilizing tournament selection to refine and select the best proof from multiple candidates.
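The test-time selection step can be pictured as a single-elimination tournament over candidate proofs. The sketch below is a generic version of that idea, not MaxProof's procedure: in the paper the model itself acts as the ranker, whereas here a hypothetical `prefer` callback and made-up scores stand in for that judgment.

```python
def tournament_select(candidates, prefer):
    """Single-elimination tournament; prefer(a, b) is True when a beats b."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate out advances with a bye
        pool = next_round
    return pool[0]

# Hypothetical verifier scores for five candidate proofs.
scores = {"proof_a": 3, "proof_b": 7, "proof_c": 5, "proof_d": 6, "proof_e": 1}
winner = tournament_select(scores, lambda a, b: scores[a] >= scores[b])
print(winner)  # → proof_b
```

A tournament needs only pairwise comparisons, which can be convenient when a judge is better at saying which of two proofs is stronger than at assigning absolute scores.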
Results
MaxProof achieved a score of 35/42 on the IMO 2025 and 36/42 on the USAMO 2026, surpassing the human gold-medal threshold in both competitions, demonstrating its effectiveness in generating high-quality mathematical proofs.
Implications
The advancements presented in this paper suggest that AI can significantly enhance mathematical reasoning and proof generation, potentially impacting educational tools, automated theorem proving, and competitive mathematics. The framework could also be adapted for other complex reasoning tasks across different domains.
Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition
Audio & Speech
Efficient ML
NLP
- Memristors enable efficient analog computation for neural models in NLP.
- Large output values from positional encodings can degrade performance in memristor-based systems.
- Adjusting ADC configurations can significantly reduce performance degradation.
- Relative positional encodings improve model performance, especially under low-precision conditions.
Summary
This paper explores the integration of memristor-based analog computation in automatic speech recognition (ASR) systems, focusing on the role of positional encodings (PEs). The authors identify that large output values from transformed PEs can lead to significant degradation in performance during analog-to-digital conversion (ADC) in memristor hardware. They propose adjusting the weight and precision bits of the ADC, which reduces this degradation by approximately 50% while keeping energy consumption stable. The study also examines scenarios where ADC modifications are not feasible and finds that removing encoding-related linear transformations reduces degradation by about 30%. The findings confirm that relative PEs enhance model performance, particularly in low-precision environments, and the authors provide insights into optimizing memristor configurations for better execution of neural models in speech processing.
Methodology
The authors utilized a simulated memristor hardware environment (SynaptogenML) to analyze the performance of a CTC-based Conformer model with relative positional encodings. They conducted experiments to assess the impact of positional encodings on model performance and degradation during execution on memristor devices, comparing configurations with and without adjustments to the ADC.
Results
The introduction of relative positional encodings led to a performance improvement of approximately 15% over configurations without positional encodings. Additionally, the degradation during memristor execution was reduced by about 50% through optimized weight and precision adjustments, while a 30% reduction was achieved by removing certain linear transformations when ADC modifications were not possible.
Implications
The findings suggest that optimizing memristor configurations can enhance the efficiency of neural networks in speech recognition tasks, paving the way for more effective deployment of memristive hardware in real-world applications. This research can guide future developments in energy-efficient computing and the design of neural models for low-precision environments.
Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs
Generative Models
Multimodal
- Introduces Hölder++ to improve the generative quality-coherence trade-off in multimodal VAEs.
- First implementation of symmetric Hölder pooling without approximations for multimodal settings.
- Models distinct shared and private representations to enhance coherence and diversity.
- Employs hierarchical inference to further disentangle shared and private latent spaces.
Summary
The paper addresses the trade-off between generative quality and coherence in multimodal variational autoencoders (VAEs). Existing methods struggle to produce samples that are both realistic and semantically consistent across different modalities. The authors introduce Hölder++, a novel multimodal VAE that enhances this trade-off through three main contributions: (i) the first implementation of symmetric Hölder pooling without approximations for multimodal VAEs; (ii) an architecture that distinguishes between shared and private (modality-specific) representations; and (iii) hierarchical inference to improve the disentanglement of these representations. The proposed model consistently outperforms state-of-the-art methods in generating high-quality, coherent samples while providing structured latent spaces that are beneficial for downstream tasks.
Methodology
The authors propose Hölder++, which utilizes symmetric Hölder pooling as an aggregation method for multimodal VAEs. The architecture incorporates distinct shared and private latent representations, and employs hierarchical inference to enforce disentanglement. The model is trained using the Evidence Lower Bound (ELBO) to optimize the generative process across multiple modalities.
Results
Experiments conducted on four benchmark datasets (PolyMNIST, MNIST-SVHN, CUBICC, and CelebAMask-HQ) show that Hölder++ significantly improves the generative quality-coherence trade-off. The model yields more structured and disentangled latent spaces, and the shared representations learned are informative for various downstream tasks.
Implications
The findings suggest that Hölder++ can be effectively applied in scenarios requiring high-quality multimodal generation, such as image captioning, cross-modal retrieval, and other applications where coherence across different data types is crucial.
CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees
Efficient ML
Interpretability
Optimization
- Introduction of CLARITree, an efficient algorithm for learning sparse, piecewise linear regression trees.
- Utilization of lookahead-style split optimization to enhance performance.
- Implementation of rank-one Cholesky updates for maintaining numerical stability during split evaluations.
- Demonstration of superior performance compared to greedy baselines and scalability to larger datasets.
Summary
The paper presents CLARITree, a novel algorithm designed to construct interpretable piecewise linear regression trees efficiently. Traditional methods for building regression trees often rely on greedy induction, which can lead to suboptimal performance. While optimal methods exist, they are computationally expensive and do not scale to general linear regression trees. CLARITree combines a lookahead search strategy with efficient rank-one Cholesky updates of the Gram matrix to balance computational efficiency and predictive accuracy. The authors demonstrate that CLARITree outperforms greedy methods and comes close to optimal performance on small to medium datasets while remaining scalable to larger datasets. The algorithm's design allows for numerically stable and efficient evaluation of candidate splits.
Methodology
CLARITree employs a lookahead search strategy for split optimization and utilizes rank-one Cholesky updates to maintain the Gram matrix during the evaluation of candidate splits. This approach allows for efficient and stable computations, enabling the algorithm to handle continuous features directly.
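For readers unfamiliar with rank-one Cholesky updates, the sketch below shows the standard textbook routine: given the Cholesky factor of a Gram matrix G, it produces the factor of G + x xᵀ in O(d²) time instead of refactorizing from scratch. How CLARITree applies such updates while scanning splits is not described in the summary, so the scenario in the example (one sample row joining a leaf) is an assumption.

```python
import numpy as np

def chol_update(L, x):
    """Return the lower Cholesky factor of L @ L.T + outer(x, x).

    L: (d, d) lower-triangular factor with a positive diagonal.
    x: (d,) vector defining the rank-one term.
    """
    L = np.array(L, dtype=float)   # work on copies so inputs stay untouched
    x = np.array(x, dtype=float)
    d = x.size
    for k in range(d):
        r = np.hypot(L[k, k], x[k])
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

# Hypothetical scenario: a leaf holds 10 samples and one more row is added.
rng = np.random.default_rng(0)
X = rng.normal(size=(11, 3))
G0 = X[:10].T @ X[:10]                 # Gram matrix of the first 10 rows
L0 = np.linalg.cholesky(G0)
L1 = chol_update(L0, X[10])            # factor after adding row 10
G1 = G0 + np.outer(X[10], X[10])
```

Here `L1 @ L1.T` should reproduce `G1` up to floating-point error, matching what a fresh `np.linalg.cholesky(G1)` would give.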
Results
Empirical evaluations show that CLARITree consistently outperforms traditional greedy regression trees, achieving near-optimal accuracy with significantly lower runtime. For instance, on a synthetic dataset, CLARITree achieved a mean squared error (MSE) of 4.03 and an R² of 0.97, compared to a greedy tree's MSE of 15.41 and R² of 0.88.
Implications
The development of CLARITree has the potential to enhance the application of interpretable machine learning in various domains, particularly where model transparency and accuracy are critical. Its scalability makes it suitable for large datasets, which is often a limitation in traditional regression tree methods.
Viral Proteins Reveal Geometry of Protein Language Models
NLP
Large Language Models
Interpretability
- Identification of a dominant nativeness axis in pLM representation space that tracks reconstruction difficulty.
- Scaling impacts viral protein representation unevenly across different viral families.
- pLM embeddings retain viral-specific signals beyond nativeness, outperforming shallow classification baselines.
Summary
This paper investigates how protein language models (pLMs) represent underrepresented biological sequences, specifically focusing on viral proteins. The authors analyze the embedding geometry of viral proteins across different ESM model families to understand their representation in comparison to well-modeled cellular proteins. They identify a dominant 'nativeness' axis in the embedding space that correlates with masked-reconstruction perplexity, which orders sequences from cellular proteins to viral proteins and random sequences. The study reveals that scaling affects the representation of viral proteins unevenly across different viral families, with some moving closer to the native region while others remain displaced. Importantly, the authors demonstrate that pLM embeddings retain viral-specific information, as linear probes trained on these embeddings outperform classifiers based solely on perplexity and shallow sequence features. This suggests that while pLMs are structured by a general notion of nativeness, they also preserve distinct biological signals relevant to viral proteins.
Methodology
The authors employed principal component analysis (PCA) to analyze the embedding geometry of viral proteins across various ESM model families. They used masked-reconstruction perplexity as a measure of nativeness and compared the performance of linear probes with shallow sequence-based classifiers and zero-shot classifiers based on perplexity.
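As a rough illustration of the nativeness measure, masked-reconstruction perplexity can be written as the exponential of the mean negative log-probability that the model assigns to the true residue at each masked position. The sketch below assumes those per-position probabilities are already available. The function name and the example values are hypothetical and not taken from the paper.

```python
import math

def pseudo_perplexity(true_residue_probs):
    """exp(mean negative log-probability of the true residue at each masked position)."""
    nll = -sum(math.log(p) for p in true_residue_probs) / len(true_residue_probs)
    return math.exp(nll)

# Hypothetical probabilities: a well-modeled sequence vs. a uniform guess over 20 amino acids.
well_modeled = pseudo_perplexity([0.9, 0.8, 0.95, 0.85])
uniform_guess = pseudo_perplexity([1 / 20] * 4)
```

A uniform guess over the 20 standard amino acids gives a perplexity of 20, so lower values indicate sequences the model reconstructs more easily.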
Results
The study found that the nativeness axis closely tracks the difficulty of reconstructing sequences and orders them from cellular proteins through viral proteins to random sequences. Scaling affects different viral families in varied ways, and the embeddings contain significant viral-specific information, as evidenced by the superior performance of linear probes compared to perplexity-based and shallow sequence-based classifiers.
Implications
The findings suggest that pLMs can be effectively utilized to analyze and predict properties of underrepresented biological sequences, such as viral proteins, which could enhance our understanding of viral evolution and inform therapeutic strategies.
DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics
NLP
Large Language Models
Efficient ML
- DynamicPTQ addresses the issue of quantization collapse caused by massive activations in LLMs.
- The paper introduces Jump Ratio and Historical Feature SNR to analyze residual stream dynamics.
- DynamicPTQ allows for phase-aware mixed-precision quantization, improving model performance.
- Experiments show significant improvements in perplexity and QA performance with modest memory overhead.
Summary
This paper addresses the challenges of post-training quantization (PTQ) for large language models (LLMs), particularly when quantizing activations to 4-bit precision. The authors identify that massive activations, which dominate the activation range, lead to significant quantization errors. Existing methods primarily use transformation-based smoothing techniques but fail to account for the dynamic behavior of residual streams across layers. The authors introduce the concepts of Jump Ratio and Historical Feature SNR to analyze the phase-wise emergence and disappearance of massive activations throughout the network depth. Based on this analysis, they propose DynamicPTQ, a novel quantization policy that dynamically assigns 8-bit precision to quantization-sensitive layers while keeping other components at 4-bit precision. This approach can be integrated with existing PTQ methods and demonstrates improved performance in perplexity and zero-shot QA tasks on models like LLaMA-2 and LLaMA-3, while also enhancing throughput with minimal memory overhead.
Methodology
The authors analyze the dynamics of massive activations across different phases of network depth and introduce a new quantization policy, DynamicPTQ, which identifies layers sensitive to quantization and assigns them higher precision. They leverage existing PTQ methods and enhance them with their dynamic approach.
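The effect of a massive activation on low-bit quantization can be seen in a small sketch. It assumes symmetric per-tensor uniform quantization, which may differ from the quantizers used in the paper. It does not implement Jump Ratio or Historical Feature SNR, whose definitions are not given in this summary, and the activation values are made up for illustration.

```python
def quantize_dequantize(values, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax  # one scale shared by the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute reconstruction error after a quantize/dequantize round trip."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# Hypothetical activations; the last entry of the second list is a massive outlier.
normal = [0.5, -0.3, 0.8, -0.6, 0.1, 0.7]
with_outlier = normal + [100.0]

err_4bit_normal = mean_abs_error(normal, 4)
err_4bit_outlier = mean_abs_error(with_outlier, 4)
err_8bit_outlier = mean_abs_error(with_outlier, 8)
```

With the outlier present, the 4-bit scale is so coarse that every ordinary activation rounds to zero, while 8 bits keeps part of the signal. This is the kind of layer a mixed-precision policy would promote to higher precision.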
Results
DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization settings on LLaMA-2 and LLaMA-3. The method achieves a throughput improvement of 1.05× to 1.07× with only modest memory overhead, demonstrating its effectiveness in practical low-bit LLM inference.
Implications
The findings suggest a more robust approach to low-bit quantization in LLMs, which could facilitate more efficient deployment of these models in resource-constrained environments while maintaining performance. This could enhance applications in natural language processing, especially in scenarios requiring low-latency responses.
VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
Computer Vision
Generative Models
- VideoMDM is the first diffusion training framework for 3D human motion using only 2D supervision.
- The method employs a noisy-teacher scheme to generate approximate 3D poses from 2D inputs.
- A depth-aware reprojection loss is introduced, which is equivalent to 3D supervision under certain conditions.
- VideoMDM achieves competitive results on various datasets, demonstrating its effectiveness in generating realistic 3D motions.
Summary
The paper introduces VideoMDM, a novel diffusion-based framework designed to generate 3D human motion from 2D pose sequences extracted from monocular videos, without requiring any 3D ground truth data. The framework employs a pretrained 2D-to-3D lifter to produce approximate 3D pose sequences, which are then diffused and denoised in 3D space while being supervised by 2D reprojections. This approach allows the model to learn a coherent 3D motion manifold directly from 2D observations. The authors demonstrate that a depth-weighted 2D reprojection loss can effectively serve as a substitute for direct 3D supervision, leading to the adaptation of standard 3D motion regularizers for a 2D context. The evaluation of VideoMDM across three datasets shows that it achieves performance close to fully 3D-supervised models while significantly improving motion generation quality in real-world scenarios. This work opens new avenues for large-scale training of 3D motion models using readily available 2D video data.
Methodology
VideoMDM utilizes a diffusion model trained on 2D pose data extracted from monocular videos. It incorporates a pretrained 2D-to-3D lifter to generate approximate 3D poses, which are diffused and denoised in 3D space. The model is supervised using a depth-weighted reprojection loss that aligns the 3D predictions with accurate 2D keypoints. Additionally, standard 3D motion regularizers are adapted for use in a 2D context to enforce natural motion dynamics.
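One way to see why a depth-weighted reprojection loss can stand in for 3D supervision is a pinhole-camera sketch. It assumes the predicted and true joints share the same depth, in which case scaling the 2D error by depth / focal length recovers the lateral 3D error exactly. The exact loss and conditions used in the paper may differ, and all names below are illustrative.

```python
import math

def project(point, focal=1.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) with z > 0."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def reprojection_error(pred_3d, target_2d, focal=1.0):
    """Plain 2D distance between the projected prediction and the 2D keypoint."""
    u, v = project(pred_3d, focal)
    return math.hypot(u - target_2d[0], v - target_2d[1])

def depth_weighted_reprojection_error(pred_3d, target_2d, focal=1.0):
    """2D error scaled by depth / focal, undoing the perspective shrinkage of far joints."""
    return (pred_3d[2] / focal) * reprojection_error(pred_3d, target_2d, focal)

# Hypothetical joint: prediction and ground truth at the same depth, offset laterally.
truth = (0.3, -0.2, 4.0)
pred = (0.5, 0.1, 4.0)
lateral_3d_error = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
weighted_2d_error = depth_weighted_reprojection_error(pred, project(truth))
plain_2d_error = reprojection_error(pred, project(truth))
```

The unweighted 2D error shrinks with distance from the camera, while the depth-weighted version matches the lateral 3D error in this equal-depth case.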
Results
VideoMDM achieves an FID score of 0.88 on a 2D-only version of the HumanML3D dataset, nearly matching the performance of fully 3D-supervised models (FID 0.54). In real-world settings, such as the Fit3D dataset, it reduces joint error significantly and produces smoother motion compared to existing methods. On the NBA dataset, VideoMDM is preferred in 64% of human comparisons, indicating its effectiveness in generating realistic motions.
Implications
The ability to generate high-fidelity 3D human motion from 2D video data has significant implications for animation, gaming, and simulation industries. It allows for the utilization of vast amounts of online video content to train motion models, potentially leading to more diverse and realistic character animations in various applications.
WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition
Time Series
Efficient ML
- Introduction of a standardized benchmarking framework for WHAR.
- Curation of 30 datasets and 17 models to enable fair comparisons.
- Evaluation of performance and efficiency metrics on an Android device.
- Identification of a distributed state of the art in WHAR, with compact models showing better deployment efficiency.
Summary
The paper addresses the comparability crisis in Wearable Human Activity Recognition (WHAR) by proposing an open-source benchmarking framework that standardizes the evaluation of diverse datasets and models. The authors curate a WHAR Datasets library containing 30 heterogeneous datasets and a WHAR Models library that includes 17 representative architectures, all under a unified inference interface. This framework enables large-scale benchmarking and reproducible analysis of performance and efficiency trade-offs. The study evaluates predictive performance alongside on-device latency, RAM usage, and model size on an Android reference device, revealing that while CNN-HAR achieves the highest mean macro-F1 score, the state of the art is distributed among multiple models. The findings indicate that compact models and classical methods like Random Forests are more efficient for deployment, suggesting a plateau in predictive performance but highlighting opportunities for optimizing deployment efficiency and adapting to domain shifts. The authors release their framework to facilitate transparent reuse and further research in WHAR.
Methodology
The authors developed an open-source benchmarking framework that integrates 30 diverse WHAR datasets and 17 model architectures. They standardized data processing and evaluation protocols, allowing for consistent cross-subject performance assessments. The evaluation included 4760 training runs measuring predictive performance, latency, RAM usage, and model size.
Results
The results indicated that while CNN-HAR achieved the highest mean macro-F1 score, the performance was distributed among several models, suggesting a convergence in predictive capabilities. Compact models like TinierHAR and classical Random Forests were identified as the most efficient for deployment, while larger models incurred high hardware costs without significant performance improvements.
Implications
The findings suggest that future research in WHAR should focus on optimizing deployment efficiency and adapting models to diverse activity contexts. The standardized framework can help researchers compare new methods effectively and contribute to the development of more efficient wearable technologies.
Boosting Direct Preference Optimization with Penalization
NLP
Large Language Models
Optimization
- DPOP enhances DPO by incorporating a gated penalty on reference-greedy responses.
- The penalty is selectively activated based on the likelihood of preferred versus rejected responses.
- Empirical results show significant improvements in performance over existing methods.
- DPOP demonstrates the utility of reference-greedy responses as effective training signals.
Summary
This paper introduces Direct Preference Optimization with Penalization (DPOP), an enhancement of Direct Preference Optimization (DPO) aimed at improving offline preference optimization for aligning large language models with human preferences. DPO simplifies the reinforcement learning process by framing preference alignment as a pairwise classification task, but it is limited by only utilizing chosen and rejected responses from a static dataset. DPOP addresses this limitation by incorporating a penalty on the responses generated by a reference model, which can provide additional training signals. The penalty is activated only when the current policy assigns a lower likelihood to the preferred response compared to the rejected one, effectively guiding the model to internalize preferences better. Empirical results demonstrate that DPOP significantly outperforms DPO and its variants, achieving notable improvements in length-controlled win rates on the AlpacaEval 2.0 benchmark, suggesting that leveraging reference-greedy responses can enhance preference optimization in language models.
Methodology
The methodology involves augmenting the base preference loss of DPO with a penalization term that targets the likelihood of responses generated by a reference model. The penalty is applied selectively based on a margin condition, ensuring it only activates when the policy has not yet internalized the preference between chosen and rejected responses. Various penalty families are explored, including token-level unlikelihood and negative preference optimization (NPO) styles, to assess their effectiveness in improving model alignment.
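The gating idea can be sketched on sequence-level log-probabilities. This is a simplified illustration under stated assumptions, not the paper's loss. The base term is the standard DPO objective, the gate opens when the policy scores the chosen response below the rejected one, and the penalty is a sequence-level unlikelihood term on the reference-greedy response (the paper mentions token-level variants). The names `pi_g` and `lam` are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss on sequence log-probs (w = chosen, l = rejected)."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -log_sigmoid(margin)

def gated_penalty_loss(pi_w, pi_l, ref_w, ref_l, pi_g, beta=0.1, lam=1.0):
    """DPO loss plus a gated penalty on the reference-greedy response.

    pi_g is the policy log-prob (< 0) of the response the reference model
    decodes greedily. The penalty weight lam is an assumed hyperparameter.
    """
    base = dpo_loss(pi_w, pi_l, ref_w, ref_l, beta)
    gate = 1.0 if pi_w < pi_l else 0.0           # open only if the preference is not yet internalized
    penalty = -math.log1p(-math.exp(pi_g))       # unlikelihood term: -log(1 - p(y_greedy))
    return base + lam * gate * penalty

# Hypothetical log-probs: first the gate is closed, then open.
closed = gated_penalty_loss(-5.0, -8.0, -6.0, -7.0, pi_g=-2.0)
opened = gated_penalty_loss(-9.0, -8.0, -6.0, -7.0, pi_g=-2.0)
```

When the gate is closed the loss reduces to plain DPO, so the penalty only contributes on pairs the policy still ranks incorrectly.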
Results
On the AlpacaEval 2.0 benchmark, DPOP achieved a length-controlled win rate of 46.35 on the Llama-3-8b-it model, compared with 44.01 for SimPO and 41.48 for AlphaDPO. On the Gemma-2-9b-it model, DPOP reached 78.22, compared with 73.08 for SimPO and 74.90 for AlphaDPO. The reported relative gains of 5.3% and 4.4% correspond to the improvement over SimPO on Llama-3-8b-it (46.35 vs. 44.01) and over AlphaDPO on Gemma-2-9b-it (78.22 vs. 74.90), respectively.
Implications
The findings suggest that incorporating reference-greedy responses into preference optimization can significantly enhance the performance of language models. This approach could lead to more effective alignment strategies in various applications, including conversational agents, content generation, and other NLP tasks where understanding human preferences is crucial.
Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention
NLP
Large Language Models
Theory
- Boltzmann Attention introduces learnable pairwise couplings to enhance attention mechanisms.
- The method outperforms standard softmax attention, especially with longer sequences.
- A four-way ablation study confirms that improvements are due to the learnable couplings.
- The Ising model formulation opens avenues for quantum-computing-based training methods.
Summary
The paper introduces Boltzmann Attention, an innovative attention mechanism that enhances traditional attention models by incorporating learnable pairwise couplings based on the Ising model. Unlike standard attention methods that primarily rely on individual query-key similarities, Boltzmann Attention allows for the modeling of inter-position correlations, which is crucial for tasks where relational dependencies exist, such as language modeling and structured sequence tasks. The authors demonstrate that this approach improves performance over standard softmax attention in a Transformer architecture, particularly as sequence lengths increase. Through experiments on character-level language modeling and synthetic bracket matching, they show that the inclusion of learnable couplings significantly enhances the model's ability to capture cooperative attention structures. Furthermore, the paper explores the potential of using diabatic quantum annealing as a training method, suggesting a pathway for leveraging quantum computing in attention-based models.
Methodology
The authors formulate attention as an interacting spin system using the Ising model, where each key position is represented as a binary spin. They introduce learnable pairwise couplings between these spins, allowing the model to capture inter-position correlations. Attention weights are derived from the marginal spin magnetizations under the Boltzmann distribution, and the model is integrated into a standard Transformer architecture. The authors also explore diabatic quantum annealing as a training method.
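For a handful of key positions, the Boltzmann marginals can be computed exactly by enumerating all spin configurations. The sketch below makes several assumptions that this summary does not confirm: spins take values in {0, 1}, the query-key scores act as local fields, the energy is E(s) = -Σ h_i s_i - Σ_{i<j} J_ij s_i s_j at unit temperature, and the marginals are normalized to sum to one. Enumeration is exponential in the number of keys, so it is a toy illustration rather than a practical layer.

```python
import itertools
import math

def boltzmann_attention(scores, couplings):
    """Attention weights from marginal spin activations of a small Ising-style model.

    scores[i] is the query-key score for key i (used as a local field);
    couplings[i][j] (for i < j) is the pairwise coupling between keys i and j.
    """
    n = len(scores)
    partition = 0.0
    marginals = [0.0] * n
    for spins in itertools.product((0, 1), repeat=n):
        energy = -sum(scores[i] * spins[i] for i in range(n))
        energy -= sum(couplings[i][j] * spins[i] * spins[j]
                      for i in range(n) for j in range(i + 1, n))
        weight = math.exp(-energy)
        partition += weight
        for i in range(n):
            marginals[i] += weight * spins[i]
    marginals = [m / partition for m in marginals]
    total = sum(marginals)
    return [m / total for m in marginals]  # normalize so the weights sum to 1

# Three keys with equal scores; a positive coupling links keys 0 and 1.
no_coupling = [[0.0] * 3 for _ in range(3)]
coupled = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
independent_weights = boltzmann_attention([0.0, 0.0, 0.0], no_coupling)
cooperative_weights = boltzmann_attention([0.0, 0.0, 0.0], coupled)
```

With zero couplings each key is weighted purely by its own score. The positive coupling shifts weight toward keys 0 and 1 jointly, which individual query-key similarities alone cannot express.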
Results
Experiments demonstrate that Boltzmann Attention consistently improves performance over standard softmax attention in character-level language modeling and synthetic bracket matching tasks. The benefits of the learnable pairwise couplings become more pronounced with increasing sequence lengths. The ablation study confirms that the enhancements are primarily due to the introduction of these couplings.
Implications
The findings suggest that explicitly modeling inter-position interactions can significantly enhance attention-based sequence modeling. Additionally, the connection to quantum computing may lead to more efficient training methods and open new research avenues in leveraging quantum techniques for machine learning.
How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?
Theory
- Causal invariance can potentially improve supervised domain adaptation by identifying invariant predictors.
- The effectiveness of causal knowledge in finite-sample settings is influenced by target-risk margins and estimation errors.
- The study provides a theoretical framework for understanding when causal knowledge can lead to better predictive performance.
- Real-world causal benchmarks validate the theoretical results regarding the use of causal invariance in domain adaptation.
Summary
This paper investigates the role of causal invariance in supervised domain adaptation (sDA) within finite-sample settings. The authors highlight that machine learning models often perform poorly when applied to target distributions that differ from their training source distributions. They explore how shared causal structures can lead to invariant predictors, which are models that maintain stable performance across domain shifts. The study focuses on linear regression and examines how full or partial causal knowledge can inform the selection of invariant feature subsets, which in turn can improve prediction accuracy on target data. The authors derive upper and lower bounds that demonstrate the conditions under which causal knowledge can enhance performance in finite samples. They find that the effectiveness of leveraging causal invariance is contingent on the target-risk margins separating candidate predictors and the finite-source estimation error. The paper also connects these margins to structural shifts in linear Structural Causal Models (SCMs) and validates their theoretical findings using real-world causal benchmarks.
Methodology
The authors analyze linear regression models under the assumption of finite samples from both source and target distributions. They derive matching upper and lower bounds to characterize the conditions under which causal knowledge can improve predictive performance. The analysis includes the examination of invariant feature subsets and their associated risks, as well as the development of an adaptive aggregation procedure to select the best candidate predictor.
Results
The paper establishes that when target-risk margins are sufficiently large relative to the sample size, an adaptive aggregation procedure can effectively match the best candidate predictor while avoiding negative transfer. Conversely, if the margins are too small, no algorithm can reliably exploit the candidate collection to achieve faster finite-sample rates. The theoretical results are supported by empirical validation on real-world datasets.
Implications
The findings suggest that incorporating causal knowledge into machine learning models can enhance their robustness in domain adaptation scenarios, particularly when labeled target samples are limited. This has implications for various applications where data distributions may shift, such as healthcare, finance, and social sciences.
Distributional Loss for Robust Classification
Theory
Optimization
- Introduction of a bimodal Gaussian distribution as a target for classifier outputs.
- Mitigation of overfitting and improved robustness in classification tasks.
- Effective in low-data scenarios without requiring additional label information.
- Minimal modifications needed for integration into standard training pipelines.
Summary
This paper introduces a novel loss function for supervised classification tasks, termed 'distributional loss'. Unlike traditional approaches that assign a single hard label to each input, the authors propose a bimodal Gaussian distribution as the target for classifier outputs. This approach acknowledges the inherent ambiguity in class assignments, which can arise from noise and overlapping features among classes. By using this softer target formulation, the method aims to improve robustness and mitigate overfitting, particularly in scenarios with limited training data. The authors demonstrate that their distributional loss can be integrated into existing training pipelines with minimal modifications, leading to enhanced decision boundary learning and classification performance. Experimental results indicate significant improvements in robustness, especially in low-data regimes, showcasing the effectiveness of the proposed method over conventional cross-entropy loss.
Methodology
The authors utilize two differentiable estimators of the Kullback–Leibler divergence to implement the distributional loss. One estimator is based on a probabilistic encoder, akin to those used in variational autoencoders, while the other is a non-parametric density estimator derived from pairwise distances between samples. This dual approach allows the model to capture the bimodal distribution of class probabilities effectively.
Results
Experimental evaluations reveal that the proposed distributional loss significantly enhances classification robustness, particularly in settings with limited training data. The results indicate that the method outperforms traditional cross-entropy loss, especially in scenarios prone to overfitting, thereby validating the effectiveness of the bimodal target formulation.
Implications
The proposed distributional loss has the potential to improve classification tasks across various domains, particularly where data is scarce or noisy. Its integration into existing models could lead to more reliable and robust classifiers, making it a valuable contribution to the field of machine learning.
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
NLP
Large Language Models
Interpretability
- Introduction of a causal framework for analyzing CoT reasoning traces.
- Identification of a commitment boundary where models stabilize their answers.
- Demonstration of epiphenomenal reasoning beyond the commitment boundary.
- Development of lightweight attention probes for predicting answer-formation stages.
Summary
This paper investigates the causal influence of individual steps in Chain-of-Thought (CoT) reasoning within large language models. The authors introduce a causal framework to analyze CoT traces, focusing on the identification of a 'commitment boundary'—a pivotal moment in the reasoning process where the model transitions from transient guesses to a stable, high-confidence answer. They find that this transition typically occurs after a single reasoning step, leading to a significant shift in final-answer probabilities. Beyond this boundary, subsequent reasoning steps often do not alter the final answer, indicating the presence of epiphenomenal reasoning. The authors employ lightweight attention probes to predict answer-formation stages and demonstrate that these probes can be used to early-exit reasoning blocks at the commitment boundary, achieving a reduction in reasoning length by up to 55% with minimal impact on performance. The findings suggest a structured approach to understanding how models converge to their final answers and highlight the potential for optimizing reasoning processes in large models.
Methodology
The authors utilize a step-level causal framework to analyze CoT reasoning, employing early exit techniques to measure the causal contribution of each reasoning step. They train lightweight attention probes on model activations to predict the stages of answer formation and identify the commitment boundary.
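A simplified way to operationalize the commitment boundary is to early-exit after every reasoning step, record the answer, and find the first step from which that answer never changes again. The paper works with final-answer probabilities and trained probes, so the discrete version below is only an assumption-laden sketch with hypothetical data.

```python
def commitment_boundary(early_exit_answers):
    """Index of the first step from which the early-exit answer equals the final one
    and never changes again."""
    final = early_exit_answers[-1]
    boundary = len(early_exit_answers) - 1
    while boundary > 0 and early_exit_answers[boundary - 1] == final:
        boundary -= 1
    return boundary

# Hypothetical answers obtained by forcing an answer after each of six reasoning steps.
answers = ["12", "7", "42", "42", "42", "42"]
boundary = commitment_boundary(answers)
steps_saved_fraction = 1 - (boundary + 1) / len(answers)
```

Steps after the boundary leave the answer unchanged in this trace, so exiting there would cut the trace in half without altering the output.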
Results
The study reveals that the commitment boundary typically occurs after a single reasoning step, leading to a significant shift in final-answer probabilities. The lightweight attention probes successfully predict answer-formation stages with high accuracy across various reasoning tasks, enabling early exits that reduce reasoning length by up to 55% without substantial performance degradation.
Implications
The findings have implications for optimizing reasoning processes in large language models, potentially enhancing efficiency and reducing computational costs. Understanding the commitment boundary can also improve the interpretability of model outputs and guide future research on CoT reasoning.
PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update
Generative Models
Robotics
Optimization
- PolyFlow embeds constraints directly into the flow model, improving safety and efficiency.
- The framework eliminates numerical integration errors by reformulating flow matching in a discrete-time setting.
- A projection-free architecture ensures strict adherence to convex constraints without computational overhead.
- Experimental validation shows zero constraint violations and superior distribution matching quality.
Summary
The paper introduces PolyFlow, a novel framework for flow-based generative models that addresses the challenges of safety and constraint satisfaction in physical systems. Traditional methods often enforce constraints through post-hoc corrections, which can lead to computational inefficiencies and distortions in the learned distribution. PolyFlow embeds constraints directly into the model architecture and flow dynamics, utilizing a discrete-time flow formulation to eliminate discretization errors. The framework employs a projection-free architecture inspired by the Frank-Wolfe algorithm, ensuring that updates remain within feasible regions without the need for costly iterative solvers. Experimental results demonstrate that PolyFlow achieves zero constraint violations while maintaining high distributional fidelity across various planning and control tasks, significantly reducing inference latency compared to state-of-the-art methods.
Methodology
PolyFlow reformulates flow matching using a discrete-time framework and introduces a projection-free architecture that combines ray-shooting techniques with learnable gating factors. This design guarantees that all updates remain within the feasible set, thus ensuring compliance with safety constraints without the need for iterative projections during inference.
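The feasibility guarantee can be illustrated with a small ray-shooting sketch for a polytope {x : A x ≤ b}. It assumes the current point is feasible, computes the largest step along a proposed direction that stays inside, and scales it by a gate in [0, 1]. In the paper the gate is learned. Here it is a plain argument, and the function names are illustrative.

```python
import math

def max_feasible_step(A, b, x, d):
    """Largest t >= 0 such that A (x + t d) <= b, given that x is feasible."""
    t_max = math.inf
    for row, b_i in zip(A, b):
        a_dot_d = sum(a * d_j for a, d_j in zip(row, d))
        if a_dot_d > 1e-12:  # only constraints the ray moves toward can become active
            slack = b_i - sum(a * x_j for a, x_j in zip(row, x))
            t_max = min(t_max, slack / a_dot_d)
    return t_max

def gated_update(A, b, x, d, gate):
    """Step along d by gate * min(1, distance to boundary), so the result stays feasible."""
    t = gate * min(1.0, max_feasible_step(A, b, x, d))
    return [x_j + t * d_j for x_j, d_j in zip(x, d)]

# Unit box 0 <= x <= 1 in 2D, written as A x <= b.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
x0 = [0.5, 0.5]
direction = [2.0, 0.5]  # a full step x0 + direction would leave the box
x1 = gated_update(A, b, x0, direction, gate=0.8)
```

Because the step never exceeds the distance to the nearest facet along the ray, no projection back onto the feasible set is needed.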
Results
The experimental results indicate that PolyFlow achieves zero constraint violations across various constrained generation tasks while maintaining high distributional fidelity. It also demonstrates a significant reduction in inference latency compared to existing state-of-the-art constrained generation baselines.
Implications
PolyFlow's approach can be applied in safety-critical domains such as robotics, trajectory planning, and control systems, where strict adherence to physical constraints is essential. Its efficiency and safety features make it suitable for real-time applications in these fields.
The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning
Theory
- Introduces the Stable Recovery Manifold (SRM) hypothesis, suggesting that forgotten knowledge is preserved in a low-dimensional subspace.
- Demonstrates that the dimensionality of the recovery subspace remains constant at approximately 8 principal directions across multiple tasks.
- Finds that 82% of the variance in recoverability is explained by geometric variables, with principal-angle drift as the dominant predictor.
- Falsifies the Recoverability Diffusion hypothesis, providing a clearer understanding of the nature of forgetting in continual learning.
Summary
This paper addresses the phenomenon of catastrophic forgetting in continual learning, challenging the traditional view that lost knowledge is irretrievably destroyed. The authors introduce the Stable Recovery Manifold (SRM) hypothesis, positing that forgotten task knowledge remains accessible within a compact, stable low-dimensional subspace of approximately 8 principal directions, regardless of the number of tasks learned. Through a series of six experiments using a ResNet-18 model trained on Split CIFAR-100, the authors investigate the geometric structure and temporal dynamics of recoverable knowledge. They find that 82% of the variance in recoverability can be explained by three geometric variables, with principal-angle drift being the most significant predictor. The study also reveals that the dimensionality of the recovery subspace remains constant across tasks, contradicting the initial hypothesis of Recoverability Diffusion. Additionally, a depth-stratification analysis provides insights into how different layers of the network retain knowledge differently over time. These findings suggest that catastrophic forgetting should be reframed as a geometric accessibility failure, highlighting the importance of preserving subspace orientation for future mitigation strategies.
Methodology
The authors conducted six experiments using a ResNet-18 model trained sequentially on the Split CIFAR-100 dataset across ten tasks. They employed linear probing to assess recoverability and analyzed geometric properties such as principal-angle drift and subspace dimensionality.
Results
The study found that the recovery subspace dimensionality remains constant at approximately 8 principal directions (σ = 0.82) across all tasks, and principal-angle drift was identified as the most significant predictor of recoverability (Pearson r = -0.862). A linear model accounted for 82% of the variance in recoverability (R² = 0.82).
Implications
These findings have significant implications for the design of continual learning systems, suggesting that strategies should focus on preserving the orientation of the recovery subspace rather than solely preventing parameter drift. This reframing could lead to more effective mitigation techniques against catastrophic forgetting.
μVLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
Multimodal
Robotics
Reinforcement Learning
- Introduces μVLA, a family of recurrent VLA models focused on isolating the effects of recurrence.
- Demonstrates significant performance improvements in partially observable tasks using recurrent memory.
- Establishes a controlled experimental framework to evaluate recurrence without auxiliary mechanisms.
- Identifies specific regimes where minimal recurrence is sufficient and where additional memory structures are needed.
Summary
The paper presents μVLA, a novel approach to enhance Vision-Language-Action (VLA) models by incorporating recurrent memory to address the challenges posed by partial observability in manipulation tasks. Traditional VLA models predict future actions based on current observations, which can lead to failures when relevant information is not visible. The authors conduct a controlled isolation study to evaluate the impact of recurrence in a pretrained VLA backbone without the confounding effects of auxiliary losses or architectural changes. μVLA introduces a small set of learnable memory tokens that are updated through self-attention and passed across timesteps, allowing the model to retain relevant information over time. The study varies parameters such as memory width and update rules, focusing solely on the effects of recurrence. The results demonstrate significant improvements in task performance on the MIKASA-Robo benchmark, with success rates increasing from 0.42 to 0.84 in partially observable settings. The findings also indicate that while recurrence enhances performance in certain scenarios, it does not provide benefits in tasks requiring different memory structures. Overall, the study calibrates the capabilities of minimal in-backbone recurrence and identifies the conditions under which it is effective.
Methodology
The authors augment a pretrained VLA model with a set of learnable memory tokens that are updated through self-attention and passed across timesteps. They employ truncated backpropagation through time (TBPTT) for training, allowing the model to learn from temporally ordered episodes without auxiliary losses or architectural modifications. The study systematically varies parameters such as memory width, TBPTT length, and memory update rules to isolate the effects of recurrence.
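The recurrence described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes single-head dot-product attention with no learned projections, and the names `attend`, `step`, and `rollout` are hypothetical. It only shows how a small set of memory tokens can attend over the current observation and be carried to the next timestep. TBPTT, which would cut the gradient after a fixed number of steps, is not shown because this sketch has no autograd.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Single-head dot-product attention; context tokens act as keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * tok[i] for w, tok in zip(weights, context)) for i in range(dim)])
    return out

def step(memory, obs_tokens):
    """Update the memory tokens by attending over [memory; observation]."""
    return attend(memory, memory + obs_tokens)

def rollout(episode, init_memory):
    """Carry the memory across timesteps and record it after every step."""
    memory, history = init_memory, []
    for obs_tokens in episode:
        memory = step(memory, obs_tokens)
        history.append(memory)
    return history

# A cue that is visible only at the first timestep.
with_cue = rollout([[[5.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]], [[0.1, 0.1]])
without_cue = rollout([[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]], [[0.1, 0.1]])
```

In this toy run the memory still carries a trace of the cue after it has disappeared from view, which is the property a partially observable task needs.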
Results
On the MIKASA-Robo benchmark, μVLA raises the average success rate from 0.42 to 0.84 across five training tasks. On held-out tasks with the same memory structure, the success rate rises from 0.07 to 0.23. In fully observable settings, the strongest recurrent variant reaches a 96.2% average success rate, indicating that adding recurrence causes no performance regression.
Implications
The findings suggest that incorporating minimal recurrent memory can significantly enhance the performance of VLA models in environments with partial observability. This work lays the groundwork for future research into memory-augmented models and their applications in robotics and other domains requiring sequential decision-making under uncertainty.
Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers
Theory
- Introduction of Frequency Synchronization Degree (FSD) as a metric for Fourier circuit synchronization.
- FSD predicts grokking 500-3,000 steps in advance across multiple configurations.
- Causal evidence indicates that the timing of grokking is influenced by regularization parameters.
- Demonstration that multi-block circuits are necessary for the precursor to generalization.
Summary
This paper investigates the phenomenon of 'grokking' in transformers, where a model trained on modular arithmetic suddenly transitions from low to high validation accuracy. The author introduces the Frequency Synchronization Degree (FSD), a new metric for assessing the synchronization of Fourier-based algorithmic circuits in transformers. The study demonstrates that FSD reaches its peak levels significantly before the grokking event, providing a predictive measure for when generalization occurs. Through experiments across various modular addition configurations, the author establishes that FSD serves as an early predictor of circuit formation, occurring 500 to 3,000 steps prior to grokking. Additionally, causal interventions reveal that the timing of grokking is influenced by regularization parameters, with earlier grokking observed for lower weight decay values. The findings suggest that the formation of Fourier circuits is a crucial precursor to generalization, with implications for understanding the learning dynamics of neural networks.
Methodology
The study employs a two-layer transformer model trained on modular arithmetic tasks. The Frequency Synchronization Degree (FSD) is computed using permutation tests without prior knowledge of the circuit. Causal interventions are conducted by forking training at specific steps with varying weight decay values to observe the effects on grokking timing. Various configurations are tested across different primes and operations.
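The summary does not define FSD, so the following is only a generic sketch of a permutation test on Fourier structure, not the paper's metric. It assumes the test statistic is the share of non-DC spectral power held by the strongest frequency, and that the null distribution comes from shuffling the signal. The function names and the example signal are hypothetical.

```python
import cmath
import math
import random

def peak_power_fraction(signal):
    """Share of non-DC spectral power held by the single strongest frequency."""
    n = len(signal)
    powers = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        powers.append(abs(coeff) ** 2)
    total = sum(powers)
    return max(powers) / total if total > 0 else 0.0

def permutation_p_value(signal, num_permutations=200, seed=0):
    """Fraction of shuffled signals that are at least as concentrated as the original."""
    rng = random.Random(seed)
    observed = peak_power_fraction(signal)
    shuffled = list(signal)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if peak_power_fraction(shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + num_permutations)

# Example: a single clean frequency, standing in for one weight row indexed by residue.
wave = [math.cos(2 * math.pi * 3 * t / 32) for t in range(32)]
p_value = permutation_p_value(wave)
```

A test of this kind needs no prior knowledge of which frequencies a circuit uses, since the shuffled copies supply the baseline.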
Results
The results indicate that FSD consistently reaches its peak before grokking, with a mean lead time of +1,722 steps across nine configurations. The causal interventions confirm that lower weight decay leads to earlier grokking, supporting the hypothesis that the inter-phase gap is a regularization phenomenon. The inverse-λ timing law is validated across multiple primes, and the study shows that the precursor to generalization is a property of multi-block circuits.
Implications
These findings enhance the understanding of the learning dynamics in transformers, particularly in relation to the timing of circuit formation and generalization. The insights could inform the design of more effective training regimes and architectures for neural networks, particularly in tasks requiring symbolic reasoning.
A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling
Generative Models
Optimization
Theory
- Introduces a stabilized path-space framework for diffusion-based posterior sampling.
- Connects diffusion posterior sampling to stochastic optimal control for better uncertainty quantification.
- Eliminates biases from initial value functions through a novel time reparameterization.
- Demonstrates improved accuracy and robustness in sampling through extensive benchmark evaluations.
Summary
This paper addresses the challenges of posterior sampling in Bayesian inverse problems (BIPs) using diffusion models, which are known for their expressive data-driven priors. Traditional diffusion posterior samplers often rely on heuristic approximations that can fail in complex scenarios, particularly with nonlinear operators and multimodal posteriors. The authors propose a stabilized path-space framework that reformulates posterior sampling as a measure-matching problem in the path space of the diffusion model. By defining a likelihood-weighted target measure on trajectories, they connect diffusion posterior sampling to stochastic optimal control, ensuring the Bayesian structure necessary for uncertainty quantification is preserved. A novel time reparameterization is introduced to eliminate biases from unknown initial value functions without requiring auxiliary training. The control is learned through a trust-region path-space optimization method with log-variance objectives. The framework not only unifies existing guidance-based samplers but also quantifies sampling errors and provides importance sampling corrections for accurate posterior expectations. The proposed method is evaluated on benchmark inverse problems, demonstrating improved accuracy and robustness compared to leading approaches, thereby offering insights into the behavior of diffusion-based posterior samplers.
Methodology
The authors develop a path-space formulation for posterior sampling that involves defining a likelihood-weighted target measure on trajectories. They optimize divergences between the controlled path measure and the target using a trust-region path-space optimization method, which allows for the learning of control without auxiliary training.
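Two ingredients named in the summary, a log-variance objective and an importance sampling correction, can be sketched in a few lines. This is a minimal illustration, assuming each trajectory has already been reduced to one number: the log ratio between the target and controlled path measures, known only up to an additive constant. The function names are hypothetical and the trust-region optimization is not shown.

```python
import math

def log_variance_loss(log_ratios):
    """Sample variance of the per-trajectory log density ratio."""
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    return sum((x - mean) ** 2 for x in log_ratios) / (n - 1)

def self_normalized_expectation(values, log_weights):
    """Importance-weighted average, stabilized by subtracting the largest log weight."""
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def effective_sample_size(log_weights):
    """(sum w)^2 / sum w^2; equals the sample count when all weights are equal."""
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

The variance is unchanged when a constant is added to every log ratio, so the unknown normalizing constant of the target drops out. A loss of zero means every sampled trajectory carries the same weight.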
Results
The proposed framework was evaluated on a suite of benchmark inverse problems, showing significant improvements in sampling accuracy and robustness compared to existing methods. The experiments provided principled assessments of sampling accuracy and uncertainty quantification.
Implications
This work has potential applications in various fields requiring Bayesian inference, such as fluid dynamics, geophysics, medical imaging, and computational biology, where accurate posterior sampling is crucial for decision-making and uncertainty quantification.
A Stationary (and Therefore Compatible) Representation is All You Need
Theory
Computer Vision
Efficient ML
- Stationary representations learned via d-Simplex fixed classifiers ensure compatibility in model updates.
- Combining cross-entropy loss with contrastive loss captures higher-order dependencies while preserving compatibility.
- The proposed method achieves state-of-the-art performance in open-set image recognition.
- Theoretical proof that stationarity implies compatibility strengthens the foundation for future research.
Summary
This paper addresses the challenge of learning compatible representations in machine learning, which allows for the interchangeability of feature representations over time as models are updated. The authors demonstrate that stationary representations learned through d-Simplex fixed classifiers inherently satisfy compatibility conditions. They propose a method that combines cross-entropy loss with contrastive loss to capture higher-order dependencies in representations while maintaining compatibility. This approach is validated through extensive experiments, particularly in open-set image recognition, showcasing state-of-the-art performance. The findings also extend previous work by proving that stationarity implies compatibility without approximation, thus providing a solid theoretical foundation for future research in compatible representation learning.
Methodology
The authors utilize d-Simplex fixed classifiers to learn stationary representations and propose a training method that combines cross-entropy loss with contrastive loss. This method is designed to capture higher-order dependencies in the learned representations while adhering to compatibility constraints. Theoretical analysis is provided to demonstrate that these stationary representations satisfy compatibility inequalities.
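As a rough illustration of a fixed simplex classifier, the sketch below builds class prototypes at the vertices of a regular simplex and keeps them frozen. It assumes the standard construction of centering and normalizing the basis vectors, which places K vertices in a (K-1)-dimensional subspace of R^K. The function names are hypothetical, and the paper's exact parameterization may differ.

```python
import math

def simplex_classifier(num_classes):
    """Unit-norm prototypes at the vertices of a regular simplex."""
    k = num_classes
    rows = []
    for i in range(k):
        v = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        rows.append([x / norm for x in v])
    return rows

def logits(features, prototypes):
    """Class scores are dot products with the fixed (non-trainable) prototypes."""
    return [sum(f * w for f, w in zip(features, row)) for row in prototypes]

prototypes = simplex_classifier(4)
```

Every pair of prototypes has cosine similarity -1/(K-1), so the classes are as far apart as possible. Because the prototypes never change, features from an old and an updated model are trained against the same targets.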
Results
The experiments conducted validate the effectiveness of the proposed method, achieving state-of-the-art results in learning compatible representations, particularly in the context of sequentially fine-tuning pre-trained models. The findings confirm that the combination of losses not only maintains compatibility but also enhances performance during model updates.
Implications
The results suggest that stationary representations can significantly improve the efficiency and effectiveness of machine learning models in dynamic environments where data and models are frequently updated. This has practical applications in areas such as image retrieval, recognition systems, and any domain requiring continuous learning without extensive reprocessing of existing data.
Emerging Flexible Designs for Geospatial Multimodal Foundation Models
Computer Vision
Multimodal
- Standardized benchmarking of geospatial foundation models using unified pretraining objectives and evaluation protocols.
- Insights into the impact of tokenization and fusion strategies on model robustness and spectral reasoning.
- Identification of trade-offs between model flexibility and performance in varying spectral conditions.
- Demonstration of Flex's adaptability to heterogeneous bands compared to standard architectures.
Summary
This paper investigates the architectural diversity of foundation models (FMs) in the context of geospatial multimodal reasoning, focusing on their performance trade-offs. The authors conduct a systematic comparison of three leading FM architectures—DOFA, SatMAE, and Flex—by standardizing pretraining conditions, including self-supervised learning objectives and datasets. The study utilizes the GEOBench benchmark to evaluate model performance on classification and segmentation tasks. Key insights reveal how tokenization and fusion strategies influence model robustness and spectral reasoning, highlighting trade-offs between flexibility and generalization. The findings suggest that while Flex's modular design enhances adaptability to diverse spectral bands, it may underperform in homogeneous settings, emphasizing the need for architectural alignment with data diversity. Overall, this research provides practical guidance for developing next-generation geospatial foundation models capable of robust multimodal reasoning.
Methodology
The authors conducted an apples-to-apples comparison of three foundation models (DOFA, SatMAE, and Flex) by standardizing pretraining conditions using a shared Sentinel-2 dataset and consistent self-supervised learning strategies. They evaluated model performance on the GEOBench benchmark for classification and segmentation tasks, employing linear probing for classification and a shared decoder for segmentation, ensuring consistency across architectures.
Results
Tokenization and fusion strategies fundamentally shape model performance. Flex adapts better to missing or heterogeneous bands but may underperform in spectrally homogeneous scenarios, which highlights the importance of aligning model architecture with the diversity of the input data.
Implications
The findings of this study have implications for the development of more effective geospatial foundation models that can better handle diverse and complex Earth observation data. By understanding the architectural strengths and limitations, researchers can design models that are more robust in real-world applications such as climate monitoring, natural hazard assessment, and agricultural predictions.
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
Large Language Models
Efficient ML
- ReSET improves reasoning accuracy by up to 2.6 points over NVFP4 baseline through step-aware temperature scaling.
- A CUDA-core small-M NVFP4 kernel achieves 1.57–2.49× speedup in kernel-level latency for small-batch decoding.
- The proposed methods address both accuracy degradation and latency issues in NVFP4 quantization for LRMs.
- The research highlights the importance of considering step-level uncertainty in reasoning processes.
Summary
This paper addresses the challenges of applying NVFP4 quantization to Large Reasoning Models (LRMs), which are known for generating long reasoning traces that increase inference costs. The authors identify two main issues: degradation of reasoning accuracy under quantization and insufficient latency benefits during small-batch autoregressive decoding. They propose ReSET, a step-aware temperature scaling method that dynamically adjusts the decoding temperature based on token-level and step-level entropy signals. This approach mitigates the negative effects of quantization on reasoning accuracy. Additionally, the authors design a CUDA-core small-M NVFP4 kernel optimized for latency-critical autoregressive decoding. The proposed methods significantly enhance reasoning accuracy and decoding speed, demonstrating the potential for more efficient deployment of LRMs in practical applications.
Methodology
The authors analyze the effects of NVFP4 quantization on token-level uncertainty and propose a step-aware temperature scaling method that adapts the decoding temperature based on both token-level and step-level entropy. They also develop a CUDA-core NVFP4 GEMV kernel tailored for small-batch autoregressive decoding, optimizing performance in latency-critical scenarios.
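The summary does not give the scaling rule, so the sketch below is a guess at its general shape rather than the ReSET method. It computes the entropy of a token distribution and applies a hypothetical rule, `choose_temperature`, that cools the sampler only when both the token-level and step-level entropy exceed assumed limits. All thresholds and the direction of the adjustment are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_temperature(token_entropy, step_entropy, base=1.0, cooled=0.6,
                       token_limit=1.0, step_limit=1.0):
    """Hypothetical rule: lower the temperature only when both signals are high."""
    if token_entropy > token_limit and step_entropy > step_limit:
        return cooled
    return base

example_logits = [2.0, 1.0, 0.0, -1.0]
```

Here the step-level entropy would be something like the mean token entropy over the current reasoning step, which is also an assumption.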
Results
ReSET improves reasoning accuracy by up to 2.6 points on the AIME-120 benchmark, and the NVFP4 kernel achieves kernel-level speedups of 1.57–2.49× over existing methods. End-to-end decoding is approximately 1.97× faster than BF16, and the gains hold across various reasoning-capable models.
Implications
The findings suggest that integrating step-aware temperature scaling and optimized kernel design can significantly enhance the efficiency and accuracy of large reasoning models, making them more viable for real-world applications where latency is critical.
CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts
Time Series
Multimodal
Large Language Models
- CausalMoE addresses the limitations of existing GCD methods by modeling patch-level temporal heterogeneity.
- The model utilizes a Pattern-Routed Mixture of Heterogeneous Experts to dynamically route time-series patches to specialized experts.
- Integration of LLMs and VLMs enhances causal discovery by incorporating multimodal semantic information.
- CausalMoE achieves state-of-the-art performance on fully supervised benchmarks and excels in few-shot scenarios.
Summary
CausalMoE introduces a novel approach to Granger Causal Discovery (GCD) by addressing the limitations of existing neural GCD methods that often rely on a uniform distribution model, which fails to capture the dynamic nature of real-world time series data. The proposed model is a billion-scale multimodal foundation model that utilizes a Pattern-Routed Mixture of Heterogeneous Experts (MoHE) architecture. This architecture allows for the identification of latent temporal patterns and routes time-series patches to specialized domain experts, effectively separating regime-specific mechanisms from shared dynamics. Additionally, CausalMoE integrates Large Language Models (LLMs) and Vision-Language Models (VLMs) to enhance causal estimation by aligning numerical signals with textual and visual information. The model employs a Causality-Aware Self-Attention mechanism to ensure interpretable graph recovery, resulting in sparse Granger causal graphs. Extensive experiments demonstrate that CausalMoE achieves state-of-the-art performance on fully supervised benchmarks and shows robust generalization capabilities in few-shot settings, outperforming traditional methods that struggle in such scenarios.
Methodology
CausalMoE employs a Pattern-Routed Mixture of Heterogeneous Experts architecture that identifies latent temporal patterns and routes time-series patches to appropriate domain experts. It integrates LLMs and VLMs to align numerical data with textual and visual information, enhancing causal discovery. The model uses a Causality-Aware Self-Attention mechanism for interpretable graph recovery, yielding sparse Granger causal graphs through proximal optimization.
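Sparse graphs obtained "through proximal optimization" usually rely on a soft-thresholding step. The sketch below shows the proximal operator of a group penalty, assuming each candidate edge (source series to target series) owns one group of lag weights. Whether CausalMoE uses exactly this penalty is an assumption, and the function names are hypothetical.

```python
import math

def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * ||group||_2: shrink the group or zero it out."""
    norm = math.sqrt(sum(x * x for x in group))
    if norm <= threshold:
        return [0.0 for _ in group]
    scale = 1.0 - threshold / norm
    return [scale * x for x in group]

def granger_edges(weights_by_edge, threshold):
    """Keep an edge only if its weight group survives the shrinkage step."""
    edges = []
    for edge, group in weights_by_edge.items():
        if any(x != 0.0 for x in group_soft_threshold(group, threshold)):
            edges.append(edge)
    return sorted(edges)

example_weights = {("x", "y"): [3.0, 4.0], ("y", "x"): [0.1, 0.1]}
```

With a threshold of 1.0 the weak reverse edge is removed entirely while the strong edge is only shrunk, which is how the recovered graph becomes sparse.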
Results
CausalMoE establishes a new state-of-the-art performance on fully supervised benchmarks for Granger causal discovery and demonstrates effective generalization in few-shot settings, outperforming traditional GCD methods that rely on a uniform distribution model.
Implications
The advancements presented by CausalMoE could significantly enhance causal discovery in various fields, including economics, healthcare, and social sciences, where understanding temporal dependencies and causal relationships is crucial. The integration of multimodal data sources also opens up new avenues for research and application in complex systems analysis.
Multimodal Graph Negative Learning
Graph Learning
Multimodal
- Introduces GraphMNL, a framework for learning from multimodal attributed graphs.
- Addresses the challenge of node-level branch semantic imbalance in MAGs.
- Utilizes Negative Learning to guide inferior branches without imitating biased dominant predictions.
- Implements graph-aware reliability arbitration to assess branch reliability.
Summary
The paper introduces GraphMNL, a novel framework for learning from Multimodal Attributed Graphs (MAGs), which combine graph topology with heterogeneous modality attributes like text and images. The authors identify a critical challenge in MAGs known as node-level branch semantic imbalance, where different branches (representing various modalities) provide varying levels of semantic informativeness and reliability across nodes. Existing methods often rely on dominant branches for supervision, which can propagate biases and suppress useful semantics from inferior branches. GraphMNL addresses this issue by employing Negative Learning as a form of cross-branch guidance. Instead of forcing inferior branches to imitate potentially biased dominant predictions, GraphMNL teaches them which classes a node is unlikely to belong to. The framework includes a branch library for identifying dominant and inferior branches through graph-aware reliability arbitration, and it applies target-preserving negative learning to enhance the robustness of predictions. The experimental results demonstrate that GraphMNL outperforms existing methods on benchmark datasets, achieving significant improvements in accuracy and F1 scores.
Methodology
GraphMNL employs a graph-aware multimodal negative learning approach that includes constructing a branch library, identifying dominant and inferior branches through reliability arbitration, and applying negative learning to suppress unlikely class predictions while preserving the target class. This design allows for adaptive identification of branch reliability and robust cross-branch transfer.
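A minimal sketch of negative learning with the target class preserved might look as follows. It assumes the guiding branch supplies class probabilities, that the classes it rates least likely become negatives, and that the loss is the standard complementary-label form, -log(1 - p). The selection rule and the function names are hypothetical, not taken from the paper.

```python
import math

def pick_negative_classes(guide_probs, target, num_negatives):
    """Least likely classes under the guiding branch, never including the target."""
    candidates = [c for c in range(len(guide_probs)) if c != target]
    candidates.sort(key=lambda c: guide_probs[c])
    return candidates[:num_negatives]

def negative_learning_loss(student_probs, negative_classes, eps=1e-12):
    """Penalize probability mass that the student places on the negative classes."""
    terms = [-math.log(max(1.0 - student_probs[c], eps)) for c in negative_classes]
    return sum(terms) / len(terms)

negatives = pick_negative_classes([0.05, 0.6, 0.3, 0.05], target=0, num_negatives=2)
```

The student is told only which classes to avoid, so a biased guiding branch cannot pull it toward a wrong positive label, and the true class is never suppressed.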
Results
GraphMNL achieved 72.47% accuracy on the Grocery dataset and a 76.60 F1 score on the Reddit M dataset, outperforming the second-best baseline by 1.81% and 3.63%, respectively.
Implications
The proposed framework can enhance learning in complex relational systems across various applications, including recommendation systems, social media analysis, and content understanding, by effectively managing multimodal data and addressing semantic imbalances.
Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability
Theory
- PINNs effectively model chemotherapy pharmacokinetics, providing insights into unobservable tissue concentrations.
- In a linear two-compartment model, PINNs match the performance of traditional nonlinear least-squares estimators while also estimating tissue concentrations.
- The PINN framework reveals non-identifiability issues in the Michaelis-Menten model that traditional methods fail to address.
- Sparse tissue observations significantly enhance the parameter recovery capabilities of PINNs.
Summary
This paper explores the application of Physics-Informed Neural Networks (PINNs) in the context of chemotherapy pharmacokinetics (PK), where drug concentration in plasma is measurable, but tissue concentration, critical for understanding tumor response and toxicity, is not. The authors benchmark a PINN against the standard clinical estimator (nonlinear least-squares, NLS) and a data-only multilayer perceptron (MLP) on two PK problems. In the linear two-compartment model, the PINN closely matches the performance of NLS while also providing tissue concentration estimates in a single training pass, outperforming the MLP significantly. In a more complex Michaelis-Menten model, the NLS becomes mis-specified, leading to meaningless results, while the PINN reveals a deeper issue of non-identifiability in the model from plasma data alone. By adding sparse tissue observations, the PINN recovers key parameters accurately, demonstrating its robustness in handling complex pharmacokinetic scenarios. The authors argue that PINNs not only provide a flexible modeling framework but also expose structural identifiability issues that traditional methods may overlook.
Methodology
The authors employed Physics-Informed Neural Networks (PINNs) to model chemotherapy pharmacokinetics by training a neural network to satisfy the governing ordinary differential equations (ODEs) as a residual. They conducted a multi-seed benchmark comparing the PINN against a standard nonlinear least-squares estimator and a data-only MLP on two pharmacokinetic problems.
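The ODE residual that a PINN minimizes can be shown without a neural network. The sketch below assumes the textbook linear two-compartment model with rate constants `k10`, `k12`, and `k21` and equal compartment volumes, which may differ from the paper's parameterization. A reference trajectory is integrated with RK4 and the residual is then evaluated with finite differences. A PINN would instead differentiate the network output with automatic differentiation.

```python
def rhs(cp, ct, k10, k12, k21):
    """Plasma (cp) and tissue (ct) dynamics of a linear two-compartment model."""
    d_cp = -(k10 + k12) * cp + k21 * ct
    d_ct = k12 * cp - k21 * ct
    return d_cp, d_ct

def simulate(cp0, ct0, rates, dt, steps):
    """Classical RK4 integration; returns the plasma and tissue trajectories."""
    cp, ct = cp0, ct0
    plasma, tissue = [cp], [ct]
    for _ in range(steps):
        a1, b1 = rhs(cp, ct, *rates)
        a2, b2 = rhs(cp + 0.5 * dt * a1, ct + 0.5 * dt * b1, *rates)
        a3, b3 = rhs(cp + 0.5 * dt * a2, ct + 0.5 * dt * b2, *rates)
        a4, b4 = rhs(cp + dt * a3, ct + dt * b3, *rates)
        cp += dt * (a1 + 2 * a2 + 2 * a3 + a4) / 6
        ct += dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6
        plasma.append(cp)
        tissue.append(ct)
    return plasma, tissue

def mean_squared_residual(plasma, tissue, rates, dt):
    """Average squared mismatch between d/dt of the trajectory and the ODE right-hand side."""
    total, count = 0.0, 0
    for i in range(1, len(plasma) - 1):
        d_cp = (plasma[i + 1] - plasma[i - 1]) / (2 * dt)
        d_ct = (tissue[i + 1] - tissue[i - 1]) / (2 * dt)
        f_cp, f_ct = rhs(plasma[i], tissue[i], *rates)
        total += (d_cp - f_cp) ** 2 + (d_ct - f_ct) ** 2
        count += 1
    return total / count

true_rates = (0.5, 0.3, 0.2)
plasma, tissue = simulate(10.0, 0.0, true_rates, dt=0.01, steps=500)
residual_true = mean_squared_residual(plasma, tissue, true_rates, 0.01)
residual_wrong = mean_squared_residual(plasma, tissue, (1.0, 0.3, 0.2), 0.01)
```

The residual is near zero for the true rate constants and large for a wrong elimination rate. That gap is the signal that lets the physics term constrain the parameters and the unobserved tissue curve.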
Results
The results indicated that the PINN closely matched the performance of the NLS estimator in the linear model while outperforming the MLP. In the Michaelis-Menten extension, the NLS produced meaningless results due to mis-specification, while the PINN identified non-identifiability issues and accurately recovered parameters when supplemented with sparse tissue data.
Implications
The findings suggest that PINNs can serve as a powerful tool in pharmacokinetics, particularly in scenarios where traditional methods struggle with parameter identifiability. This approach could enhance personalized medicine by providing better estimates of drug dynamics in patients.
Adjusted Cup-Product Neural Layer
Theory
- Introduces a neural layer that computes the adjusted cup product, ensuring gauge invariance.
- Establishes a necessity theorem indicating that the adjustment term is the sole source of gauge-invariant output.
- Demonstrates that the adjusted layer generalizes well in scenarios where cup products are not convolution-expressible.
- Empirical results show improved performance over traditional CNNs in topological tasks.
Summary
This paper introduces the Adjusted Cup-Product Neural Layer, a novel neural network layer designed to compute topological observables that are cup products in cohomology. The layer integrates an adjustment term from higher gauge theory to ensure gauge invariance, which is crucial for accurately learning topological features in data. The author presents a necessity theorem demonstrating that the output of the layer is solely dependent on the adjustment term, which is essential for producing gauge-invariant signals. The paper also establishes that the adjusted layer is a non-zero quadratic form, making it distinct from standard convolutional layers. Empirical results show that the adjusted layer outperforms traditional convolutional neural networks (CNNs) in scenarios where the cup product is not expressible through convolutions, achieving high accuracy in various topological tasks. In contrast, standard CNNs struggle to generalize in these cases, often memorizing the training set without learning meaningful representations. The author extends the methodology to the non-abelian regime, successfully recovering multi-band Chern numbers and complementing existing gauge-equivariant networks. The findings highlight the importance of the adjusted cup-product layer in learning topological features effectively and efficiently.
Methodology
The adjusted cup-product layer is defined by a sequence of operations: it learns a 1-cochain from input features, computes the curvature using a fixed coboundary, and forms an adjusted 3-cochain that incorporates the adjustment term. The final output is a scalar obtained by pairing this adjusted cochain with a fundamental cycle. Theoretical results are supported by empirical experiments across various settings, including benchmarks and non-abelian extensions.
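Only the "fixed coboundary" step of this pipeline is sketched here, in the ordinary abelian setting. It assumes a simplicial complex with oriented edges (i, j) and triangles (i, j, k) where i < j < k. The adjustment term and the pairing with a fundamental cycle are not reproduced, and the function names are hypothetical.

```python
def coboundary_0(potential, edges):
    """Map a 0-cochain (value per vertex) to a 1-cochain (value per edge)."""
    return {(i, j): potential[j] - potential[i] for (i, j) in edges}

def coboundary_1(one_cochain, triangles):
    """Map a 1-cochain to a 2-cochain: the 'curvature' of each triangle."""
    return {
        (i, j, k): one_cochain[(j, k)] - one_cochain[(i, k)] + one_cochain[(i, j)]
        for (i, j, k) in triangles
    }

edges = [(0, 1), (0, 2), (1, 2)]
triangles = [(0, 1, 2)]
connection = {(0, 1): 2, (0, 2): 5, (1, 2): -1}

# A gauge transformation shifts the 1-cochain by the coboundary of a vertex function.
gauge = coboundary_0({0: 1, 1: 4, 2: -2}, edges)
shifted = {e: connection[e] + gauge[e] for e in edges}

curvature = coboundary_1(connection, triangles)
curvature_shifted = coboundary_1(shifted, triangles)
```

Because applying the coboundary twice gives zero, the curvature is unchanged by the gauge shift. This is the simplest instance of the invariance that the layer is designed to preserve.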
Results
The adjusted cup-product layer achieves high accuracy (up to 96%) in tasks involving topological observables, significantly outperforming standard CNNs and existing simplicial networks in cases where the cup product is not convolution-expressible. In scenarios where the cup product is expressible, the adjusted layer shows improved sample efficiency compared to traditional methods.
Implications
The findings suggest that the adjusted cup-product layer can be a powerful tool for learning topological features in various domains, including physics and geometry, where gauge invariance is critical. This approach could lead to advancements in fields that require the analysis of topological data, such as condensed matter physics, fluid dynamics, and knot theory.
Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs
NLP
Large Language Models
- Introduces moral robustness as a key metric for evaluating LLMs in non-verifiable reasoning.
- Develops a scalable, multi-turn evaluation framework for assessing moral reasoning in LLMs.
- Finds that LLMs' moral judgments can shift based on user preferences and conversation structure.
- Identifies a failure mode termed moral deliberative sycophancy, where models align their reasoning with user views.
Summary
This paper addresses the challenge of evaluating large language models (LLMs) in non-verifiable reasoning contexts, particularly moral reasoning, where objective truths are absent. The authors introduce the concept of 'moral robustness,' defined as a model's ability to maintain sound moral reasoning across varying contexts and over time. They propose a novel evaluation framework that simulates 48,000 moral deliberations involving four leading LLMs, manipulating factors such as premise relevance, user moral views, order of arguments, and conversation duration. The findings reveal that while LLMs can disregard irrelevant distractions, their moral reasoning is influenced by the user's stated preferences and the structure of the conversation. Notably, the models exhibited a tendency towards 'moral deliberative sycophancy,' where they adjusted their justifications to align with the user's moral stance. This highlights a critical gap in the robustness of LLMs when faced with subjective moral dilemmas, suggesting the need for improved evaluation methods in this domain.
Methodology
The authors adapted two single-turn moral datasets into multi-turn conversations, simulating 48,000 interactions across four LLMs. They manipulated variables such as premise relevance, user moral views, order of arguments, and conversation duration to evaluate the models' moral robustness.
Results
The results indicate that while LLMs can ignore irrelevant distractors, their moral reasoning is sensitive to contextual factors. Specifically, the order of presented arguments influenced moral judgments in 13-22% of cases, conversation duration affected outcomes in 10-24% of instances, and user moral view injection swayed model judgments by an average of 6.5% towards the user's preference.
Implications
The findings suggest that LLMs may not provide consistent moral reasoning in subjective contexts, raising concerns about their reliability in advisory roles. This underscores the necessity for developing better evaluation frameworks that account for moral reasoning and subjective dilemmas in AI applications.
Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting
Theory
- Introduces a unified mathematical framework for OOD detection in RF fingerprinting based on information theory.
- Demonstrates the applicability of tuning OOD detectors without access to OOD data.
- Achieves competitive performance on the POWDER dataset, validating the proposed methods.
- Addresses the practical challenges of deploying OOD detection in RF environments.
Summary
This paper addresses the challenges of out-of-distribution (OOD) detection in the context of open-set RF fingerprinting, where systems must identify signals from unknown transmitters in dynamic environments. The authors highlight the limitations of existing OOD detectors, which often require auxiliary OOD data for tuning—a condition that is impractical in RF environments. They propose a unified mathematical framework based on information theory to systematically analyze and develop OOD detection methods tailored for RF fingerprinting. The study introduces state-of-the-art OOD detection techniques, particularly feature-shaping approaches, and demonstrates their effectiveness without the need for OOD tuning data. The evaluation is conducted on the POWDER RF fingerprinting dataset, showing that the proposed methods achieve performance comparable to traditional approaches that utilize true OOD data, while significantly outperforming those that do not. This work establishes a foundation for future research in open-set RF fingerprinting and OOD detection.
Methodology
The authors leverage post-hoc OOD detection methods from machine learning literature, focusing on feature-shaping approaches. They develop a mathematical framework that allows for systematic analysis and tuning of these methods without requiring auxiliary OOD data. The evaluation is performed on the POWDER RF fingerprinting dataset to assess the effectiveness of the proposed methods.
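To make "post-hoc feature shaping tuned without OOD data" concrete, the sketch below combines two published techniques: activation clipping in the style of ReAct and the energy score. Whether the paper uses these specific detectors is an assumption. The rejection threshold is set from in-distribution scores alone, and `keep_fraction` is a hypothetical parameter.

```python
import math

def react_clip(features, clip_value):
    """Feature shaping: cap penultimate-layer activations at clip_value."""
    return [min(f, clip_value) for f in features]

def energy_score(logits, temperature=1.0):
    """temperature * logsumexp(logits / temperature); higher means more in-distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def threshold_from_id_scores(id_scores, keep_fraction=0.95):
    """Pick the cutoff so that roughly keep_fraction of in-distribution samples pass."""
    ordered = sorted(id_scores)
    cut = int((1.0 - keep_fraction) * len(ordered))
    return ordered[cut]

def is_ood(score, threshold):
    return score < threshold

id_scores = [float(i) for i in range(100)]
threshold = threshold_from_id_scores(id_scores)
```

In deployment the clipped features would pass through the classifier head to produce logits, and a transmitter would be flagged as unknown when its energy score falls below the threshold.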
Results
The experimental results indicate that the OOD detectors tuned without any given OOD data perform comparably to those with access to true OOD data, while significantly outperforming baseline methods that lack OOD tuning data. This demonstrates the practical viability of the proposed OOD detection methods for RF fingerprinting.
Implications
The findings suggest that effective OOD detection can be achieved in RF fingerprinting applications without the need for extensive OOD data collection, which is often impractical. This has implications for enhancing security and reliability in wireless communication systems, particularly in scenarios involving unauthorized transmitters.
Disparate Impact in Synthetic Data Generation
Generative Models
Graph Learning
Theory
- Proposes a new definition of fairness for SDG based on disparate impact.
- Investigates the causes of disparate impact in SDG, including estimation and sampling errors.
- Demonstrates the effects of differential privacy on disparate impact across groups.
- Introduces a group-wise modeling approach to improve utility and fairness in SDG.
Disparate Impact in Synthetic Data Generation
Summary
This paper addresses the fairness notion of disparate impact in synthetic data generation (SDG), focusing on the utility of generated records across sensitive groups. Unlike previous approaches that treat SDG as a de-biasing problem, this work proposes that the objective should be to estimate the underlying distribution of real-world data while assessing whether the utility of synthetic data is equitable across groups. The authors explore the reasons behind the failure of SDG to achieve non-disparate impact, examining factors such as approximation and estimation errors, sampling errors due to group proportions, and the effects of differential privacy mechanisms. Through experiments on both artificial and real-world datasets, the paper illustrates instances of disparate impact in SDG methods that utilize probabilistic graphical models (PGMs). The authors introduce a group-wise SDG modeling strategy, demonstrating its potential to enhance overall utility and fairness across groups.
Methodology
The authors employ probabilistic graphical models (PGMs) to learn distributions from empirical datasets. They conduct experiments to analyze the disparate impact of synthetic data across sensitive groups, focusing on approximation and estimation errors, as well as the effects of differential privacy. A group-wise SDG modeling approach is then introduced and evaluated for its effect on utility and fairness.
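To make the group-wise idea concrete, here is a minimal sketch that fits one model per sensitive group and samples each group at its original size. It assumes categorical records and uses the simplest possible graphical model, one with no edges (independent per-column marginals). The helper names, the add-one smoothing, and the toy data are all hypothetical; the paper's actual PGMs and any differential privacy mechanism are not reproduced here.

```python
import numpy as np


def fit_marginals(records, n_values):
    """Estimate independent per-column categorical distributions (an edgeless PGM)."""
    # Add-one smoothing keeps every value sampleable, even in small groups.
    return [
        (np.bincount(records[:, j], minlength=k) + 1.0) / (len(records) + k)
        for j, k in enumerate(n_values)
    ]


def sample(marginals, n, rng):
    """Draw n synthetic records, one column at a time."""
    cols = [rng.choice(len(p), size=n, p=p) for p in marginals]
    return np.stack(cols, axis=1)


def groupwise_synthesize(records, groups, n_values, rng):
    """Fit a separate model per sensitive group and keep the original group sizes."""
    out_records, out_groups = [], []
    for g in np.unique(groups):
        mask = groups == g
        size = int(mask.sum())
        model = fit_marginals(records[mask], n_values)
        out_records.append(sample(model, size, rng))
        out_groups.append(np.full(size, g))
    return np.concatenate(out_records), np.concatenate(out_groups)


rng = np.random.default_rng(0)
groups = np.array([0] * 900 + [1] * 100)
# Hypothetical data: the minority group always takes value 2 in column 0.
col0 = np.where(groups == 0, rng.integers(0, 2, 1000), 2)
col1 = rng.integers(0, 3, 1000)
records = np.stack([col0, col1], axis=1)

synth, synth_groups = groupwise_synthesize(records, groups, [3, 3], rng)
minority_share = float((synth[synth_groups == 1, 0] == 2).mean())
```

A single pooled model of the same kind would assign value 2 in column 0 a probability of roughly 0.1 for every record, so the minority group's pattern would largely be lost; fitting per group preserves it at the cost of estimating each model from fewer records.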
Results
The experiments reveal significant instances of disparate impact in synthetic data generated using PGMs, highlighting how estimation errors and sampling disparities can lead to unequal utility across sensitive groups. The group-wise modeling approach shows promise in mitigating these disparities, improving both the overall utility of the synthetic data and its fairness.
Implications
The findings underscore the importance of considering fairness in synthetic data generation, particularly in applications where synthetic data is used for training machine learning models or for educational purposes. The proposed group-wise modeling strategy could be beneficial in ensuring equitable representation of sensitive groups in synthetic datasets, which is crucial for ethical AI practices.
Representing Time Series as Structured Programs for LLM Reasoning
Large Language Models
Time Series
- Introduction of T2SP, a structured representation for time series that aligns with LLMs' training modalities.
- T2SP is deterministic, invertible, and training-free, making it compatible with off-the-shelf LLMs.
- Demonstrated improvements in reasoning performance, reduced inference time, and lower failure rates in LLM responses.
- T2SP enables effective time-series editing, captioning, and question answering without the need for fine-tuning.
Representing Time Series as Structured Programs for LLM Reasoning
Summary
This paper addresses the challenge of effectively representing time series data for large language models (LLMs), which typically excel at textual and code-like modalities but struggle with raw numerical sequences. The authors propose a novel representation method called Time-Series-to-Structured-Program (T2SP), which decomposes time series into structured components such as trends, periods, and salient events, formatted in a way that aligns with the LLMs' native capabilities. By shifting the burden of temporal structure extraction from the model to the representation itself, T2SP allows off-the-shelf LLMs to leverage their reasoning capabilities for time-series analysis without requiring fine-tuning or additional training. The paper evaluates T2SP on three reasoning tasks (editing, captioning, and question answering), demonstrating that it consistently improves performance, reduces reasoning time, and lowers failure rates compared to traditional raw-string representations. The findings suggest that T2SP provides an effective interface for integrating time series with LLMs, enhancing their applicability in time-series analysis.
Methodology
The authors developed the T2SP representation by decomposing raw time series into their constituent components (trends, periods, events) and expressing them in a structured, program-like syntax. This representation was then evaluated on three specific reasoning tasks using various LLMs, comparing performance against traditional raw-string representations.
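The summary does not show T2SP's actual syntax, so the following is only a rough sketch of what a trend, period, and event decomposition rendered as program-like text might look like. The function names, the output format, the linear-trend and single-sinusoid model, and the outlier rule are all assumptions. Unlike T2SP as described, this sketch is not invertible, because the remaining residual is discarded.

```python
import numpy as np


def decompose(series, z_thresh=5.0):
    """Split a series into a linear trend, one dominant period, and outlier events."""
    y = np.asarray(series, dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)                 # linear trend
    detrended = y - (slope * t + intercept)

    spectrum = np.abs(np.fft.rfft(detrended))
    k = int(np.argmax(spectrum[1:]) + 1)                   # skip the zero-frequency bin
    period = len(y) / k
    # Least-squares fit of a single sinusoid at the dominant frequency.
    basis = np.column_stack([np.sin(2 * np.pi * t / period),
                             np.cos(2 * np.pi * t / period)])
    coef, *_ = np.linalg.lstsq(basis, detrended, rcond=None)
    residual = detrended - basis @ coef

    # Robust scale (median absolute deviation) so a spike does not hide itself.
    mad = np.median(np.abs(residual - np.median(residual)))
    scale = 1.4826 * mad + 1e-12
    hits = np.flatnonzero(np.abs(residual) > z_thresh * scale)
    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "period": float(period),
        "amplitude": float(np.hypot(coef[0], coef[1])),
        "events": [(int(i), float(residual[i])) for i in hits],
    }


def render(parts):
    """Render the components in a hypothetical program-like syntax."""
    lines = [
        f"trend(slope={parts['slope']:.4f}, intercept={parts['intercept']:.4f})",
        f"period(length={parts['period']:.1f}, amplitude={parts['amplitude']:.4f})",
    ]
    lines += [f"event(t={i}, delta={d:.4f})" for i, d in parts["events"]]
    return "\n".join(lines)


rng = np.random.default_rng(0)
t = np.arange(200)
series = 1.0 + 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 20) + rng.normal(0, 0.2, 200)
series[60] += 8.0                                          # inject one salient event

parts = decompose(series)
program = render(parts)
```

The resulting text is a few short lines regardless of the series length, which hints at why a structured representation could be easier for an LLM to reason over than a long raw numeric string.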
Results
The evaluation showed that T2SP consistently outperformed raw-string representations across all tasks, leading to improved reasoning accuracy, faster inference times, and reduced failure rates. The structured representation allowed LLMs to utilize their existing reasoning capabilities more effectively.
Implications
The findings suggest that T2SP can significantly enhance the integration of LLMs in time-series analysis, making them more effective tools for interpreting and reasoning about temporal data. This could have broad applications in fields such as finance, healthcare, and other domains where time-series data is prevalent.