AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
67 papers today · 8h update frequency · 7 days of history
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
Reinforcement Learning
Theory
Robotics
- Introduction of the Rashomon set for safe policy updates in RL.
- Formal guarantees for safety during policy adaptation in non-stationary environments.
- Empirical validation in grid-world navigation tasks demonstrating superior performance over existing methods.
- Prevention of catastrophic forgetting of safety constraints during continual learning.
Summary
The paper introduces SafeAdapt, a novel approach to ensure safe policy updates in deep reinforcement learning (RL) for continual learning scenarios. The authors address the challenge of maintaining safety guarantees while adapting RL policies to non-stationary environments. Current methods often lack formal safety guarantees or only verify safety post hoc. SafeAdapt proposes the concept of the Rashomon set, a certified region in policy parameter space where all policies meet safety constraints based on demonstration data. The authors demonstrate that arbitrary RL algorithms can be updated safely by projecting their updates onto this Rashomon set. The method is empirically validated in grid-world navigation tasks, showing that it preserves safety during adaptation while outperforming regularization-based methods that suffer from catastrophic forgetting of safety constraints. This work represents a significant advancement in the field of safe reinforcement learning, providing a framework for ensuring safety during policy adaptation.
Methodology
The authors leverage the Local Invariant Domain (LID) framework to define the Rashomon set in policy parameter space. They formulate an optimization problem to compute this set using a differentiable safety surrogate and Interval Bound Propagation (IBP). The method combines Proximal Policy Optimization (PPO) updates with projected gradient descent to ensure that policy updates remain within the certified safe region throughout training.
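The projection step can be sketched in a few lines. Since the paper certifies the Rashomon set with IBP and a safety surrogate rather than a closed-form region, the safe set below is idealized as a simple L2 ball around a reference policy; `project_to_ball`, `safe_update`, and the radius are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def project_to_ball(theta, theta_ref, radius):
    """Project parameters onto an L2 ball around a reference point
    (an idealized stand-in for the certified Rashomon set)."""
    delta = theta - theta_ref
    norm = np.linalg.norm(delta)
    if norm <= radius:
        return theta
    return theta_ref + delta * (radius / norm)

def safe_update(theta, grad, lr, theta_ref, radius):
    """One projected-gradient step: apply the RL update, then project
    back into the (idealized) safe region."""
    return project_to_ball(theta - lr * grad, theta_ref, radius)

theta_ref = np.zeros(4)                      # certified reference policy
theta = np.array([0.1, 0.0, 0.0, 0.0])
grad = np.array([-10.0, 0.0, 0.0, 0.0])      # update that would leave the region
theta_new = safe_update(theta, grad, lr=1.0, theta_ref=theta_ref, radius=0.5)
print(np.linalg.norm(theta_new - theta_ref))  # 0.5: clipped to the boundary
```

The key property is that any base algorithm's update (PPO here, per the paper) can be wrapped this way, since the projection is applied after the gradient step.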
Results
The empirical results indicate that SafeAdapt successfully preserves source-task safety during policy adaptation in grid-world environments (Frozen Lake and Poisoned Apple), achieving competitive performance while preventing catastrophic forgetting of safety constraints, which is a common issue in regularization-based continual learning methods.
Implications
SafeAdapt has significant implications for deploying RL agents in safety-critical applications such as autonomous driving, medical treatment planning, and industrial control, where maintaining safety during policy updates is crucial. The formal guarantees provided by this method could enhance trust in RL systems operating in dynamic environments.
Using Synthetic Data for Machine Learning-based Childhood Vaccination Prediction in Narok, Kenya
Generative Models
Optimization
Theory
- Identification of children at risk of missing vaccinations can improve healthcare interventions.
- Synthetic data generation can protect patient privacy without sacrificing predictive accuracy.
- Machine learning models can effectively predict vaccination risks in low-resource settings.
- High performance metrics (recall, precision, F1-scores > 90%) were achieved for vaccination predictions.
Summary
This paper addresses the challenge of low childhood vaccination rates in Narok County, Kenya, particularly among nomadic populations like the Maasai, where data scarcity and privacy concerns hinder effective healthcare interventions. The authors aim to identify children at risk of missing vaccinations and to protect sensitive health data through innovative methods. They digitized eight years of vaccination records and employed machine learning models, including Logistic Regression and XGBoost, to predict vaccination risks. A novel synthetic data generation technique, TabSyn, was utilized to ensure patient privacy while maintaining model performance. The results indicate that the predictive models achieved high accuracy, with recall, precision, and F1-scores exceeding 90% for certain vaccines. Importantly, the use of synthetic data did not compromise predictive performance compared to real data. The findings suggest that synthetic data can enhance health informatics strategies in low-resource settings, enabling scalable and privacy-preserving forecasting for childhood immunization coverage.
Methodology
The authors digitized vaccination records from the MOH 510 registry, applying machine learning models such as Logistic Regression and XGBoost to identify at-risk children. They also developed a synthetic data generation method called TabSyn to ensure data privacy while training the models.
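As a rough illustration of the predictive setup (not the paper's pipeline or data), a logistic-regression risk model fit by gradient descent might look like the sketch below; the features, labels, and weights are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented tabular features per child (e.g. distance to clinic, number
# of antenatal visits, caregiver age) and a binary missed-dose label.
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)

# Plain logistic regression fit by gradient descent, standing in for
# the paper's Logistic Regression / XGBoost models.
w = np.zeros(d)
for _ in range(2000):
    w -= 0.1 * X.T @ (sigmoid(X @ w) - y) / n

accuracy = ((sigmoid(X @ w) > 0.5) == (y == 1)).mean()
print(round(accuracy, 2))
```

In practice the same interface applies whether the model is trained on real records or on TabSyn-generated synthetic rows, which is what makes the privacy comparison in the paper possible.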
Results
The machine learning models demonstrated high predictive performance, with recall, precision, and F1-scores exceeding 90% for some vaccines. The use of synthetic data did not lead to a loss in predictive performance compared to models trained on real data.
Implications
The findings support the integration of synthetic data in health informatics, particularly in low-resource settings, facilitating privacy-preserving and scalable approaches to improve childhood vaccination coverage and health equity.
Is your algorithm unlearning or untraining?
Theory
- Distinction between 'Unlearning' and 'Untraining' is crucial for clarity in research.
- Untraining removes the influence of specific examples, while Unlearning addresses broader distributions.
- Misuse of the term 'unlearning' can lead to inappropriate metrics and expectations in algorithm evaluation.
- Clarifying these concepts can accelerate progress in machine unlearning research.
Summary
The paper addresses the growing interest in 'machine unlearning', which refers to the ability to delete specific data points or behaviors from a trained model. The authors argue that the term 'unlearning' has been overloaded in the literature, leading to confusion and misinterpretation of various research efforts. They propose a critical distinction between two concepts: 'Unlearning' and 'Untraining'. Untraining focuses on reversing the effects of specific examples in a forget set, while Unlearning aims to remove the influence of the entire underlying distribution from which those examples are drawn. The paper reviews the background of these concepts, provides technical definitions, and maps existing literature to these two formulations. By clarifying these definitions, the authors hope to foster better understanding and progress in the field of unlearning, highlighting overlooked research questions and the importance of appropriate metrics and baselines for evaluating algorithms.
Methodology
The authors review existing literature on unlearning and untraining, establish technical definitions for both concepts, and map various problem settings to these definitions. They also discuss the implications of these distinctions through illustrative examples.
Results
The paper successfully delineates the differences between Unlearning and Untraining, providing a framework for understanding how these concepts have been conflated in previous research. This clarity is expected to enhance the evaluation of algorithms and guide future research directions.
Implications
The proposed distinctions have significant implications for the development of algorithms that can effectively 'forget' data, particularly in contexts requiring compliance with privacy regulations. Additionally, it opens avenues for research into safer AI systems by addressing harmful behaviors and concepts.
Efficient RL Training for LLMs with Experience Replay
Large Language Models
Reinforcement Learning
Efficient ML
- Experience Replay can enhance sample efficiency in LLM post-training, contrary to prevailing beliefs.
- Theoretical analysis provides a framework for optimizing replay buffer design based on compute efficiency and data diversity.
- Empirical results show that using a replay buffer can save up to 40% of compute budget while maintaining or improving model accuracy.
- The study emphasizes the importance of balancing data staleness and diversity for optimal training outcomes.
Summary
This paper investigates the application of Experience Replay in the post-training phase of Large Language Models (LLMs), challenging the conventional belief that on-policy data is essential for optimal performance. The authors present a systematic study of replay buffers, formalizing the trade-off between staleness-induced variance, sample diversity, and the computational cost of data generation. They demonstrate that strict on-policy sampling is suboptimal, especially when generation costs are high. Through theoretical analysis, they quantify the optimal design of replay buffers and empirically validate that a well-implemented replay buffer can significantly reduce inference costs without degrading, and sometimes even improving, model performance. The findings suggest that a balanced approach to data freshness and diversity can enhance compute efficiency in LLM training.
Methodology
The authors conducted a theoretical analysis of the bias-variance trade-off in stochastic gradient descent to formalize the optimal design of replay buffers. They performed extensive empirical experiments to analyze the impact of buffer hyperparameters on training efficiency and model performance, comparing results with standard on-policy training methods.
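A minimal replay buffer along these lines, with a hard staleness cap standing in for the paper's staleness/variance trade-off, could be sketched as follows; the class and parameter names are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO buffer of (sample, policy_version) pairs.

    Minimal sketch: staleness is the gap between the current policy
    version and the version that generated a sample, and overly stale
    samples are simply excluded at sampling time.
    """

    def __init__(self, capacity, max_staleness):
        self.buf = deque(maxlen=capacity)
        self.max_staleness = max_staleness

    def add(self, sample, version):
        self.buf.append((sample, version))

    def sample(self, k, current_version):
        fresh = [s for s, v in self.buf
                 if current_version - v <= self.max_staleness]
        return random.sample(fresh, min(k, len(fresh)))

random.seed(0)
buf = ReplayBuffer(capacity=100, max_staleness=2)
for step in range(6):                      # rollouts from policy versions 0..5
    buf.add(f"rollout-{step}", step)
batch = buf.sample(10, current_version=5)
print(sorted(batch))  # ['rollout-3', 'rollout-4', 'rollout-5']
```

The compute saving comes from reusing cached rollouts instead of regenerating them, at the cost of the off-policy bias the buffer parameters control.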
Results
The study found that implementing a well-designed replay buffer can lead to significant reductions in inference costs, achieving up to 40% savings in compute budget while preserving or enhancing model performance. The results indicated that aggressive reuse of samples could degrade performance, but a properly sized buffer stabilizes training and improves output diversity.
Implications
The findings suggest that incorporating Experience Replay into LLM training pipelines can lead to more efficient use of computational resources, making it feasible to train larger models or conduct more extensive experiments within the same budget. This approach could shift the focus in RL fine-tuning from maximizing performance per step to maximizing performance per unit of compute.
Finite-Sample Analysis of Nonlinear Independent Component Analysis: Sample Complexity and Identifiability Bounds
Theory
- Establishes the first complete characterization of finite-sample analysis for nonlinear ICA.
- Identifies sample complexity scaling laws that guide practitioners in determining sample sizes.
- Introduces a direct relationship between excess risk and identification error, improving convergence rates.
- Validates theoretical predictions through extensive simulations, confirming scaling laws with high accuracy.
Summary
This paper addresses the finite-sample statistical properties of Nonlinear Independent Component Analysis (ICA), a crucial unsupervised learning technique for separating mixed signals into independent sources. While previous work has established asymptotic identifiability guarantees, the finite-sample requirements for reliable source recovery remained unclear. The author presents a comprehensive analysis that characterizes the sample complexity required for ϵ-accurate source identification, revealing that it scales as n = Θ((d + log(1/δ))/(ϵ²∆)), where d is the latent dimension, δ is the confidence parameter, and ∆ represents the informativeness of auxiliary supervision. The findings highlight three scaling laws: sample size grows quadratically with the inverse error, linearly with dimension, and inversely with stronger auxiliary supervision. The paper introduces three key technical contributions: a direct relationship between excess risk and identification error, matching information-theoretic lower bounds confirming optimality, and an extension to practical SGD optimization. Extensive simulations validate the theoretical predictions, confirming the scaling laws and revealing a gap between theory and practice in finite-iteration SGD behavior. This gap underscores the challenges of observing asymptotic rates in neural network training and suggests future research directions.
Methodology
The paper employs theoretical analysis to derive sample complexity bounds and uses extensive simulation experiments to validate these theoretical predictions. It leverages neural network encoders and explores the relationship between optimization objectives and source identification.
Results
The study finds that the sample complexity for accurate source identification scales as n = Θ((d + log(1/δ))/(ϵ²∆)). The empirical results from simulations confirm the theoretical predictions with R² > 0.999, and the research identifies a significant difference in behavior between theoretical predictions and practical SGD outcomes.
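Treating the Θ(·) bound as an equality with an unknown constant c, the three scaling laws can be checked numerically; the function below is an illustrative reading of the stated bound, not code from the paper.

```python
import math

def sample_complexity(d, eps, delta, gap, c=1.0):
    """n = c * (d + log(1/delta)) / (eps**2 * gap); c is the unknown
    constant hidden inside the Theta(.) of the paper's bound."""
    return c * (d + math.log(1 / delta)) / (eps ** 2 * gap)

base = sample_complexity(d=10, eps=0.1, delta=0.05, gap=1.0)
print(sample_complexity(d=10, eps=0.05, delta=0.05, gap=1.0) / base)  # 4.0
print(sample_complexity(d=20, eps=0.1, delta=0.05, gap=1.0) / base)
print(sample_complexity(d=10, eps=0.1, delta=0.05, gap=2.0) / base)   # 0.5
```

Halving the error tolerance quadruples the required sample size, dimension enters roughly linearly, and doubling the supervision strength Δ halves it, matching the three scaling laws described above.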
Implications
The findings provide critical guidance for practitioners in determining appropriate sample sizes for nonlinear ICA applications, particularly in fields where data collection is costly. The insights into the theory-practice gap also pave the way for future research into optimizing neural network training.
Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations
Time Series
- A harmonized benchmark for dropout risk modeling enhances comparability across different predictive models.
- Temporal and behavioral signals are more predictive of dropout risk than static demographic factors.
- Calibration and interpretability are essential for practical applications of dropout risk predictions.
- Random Survival Forest and Poisson Piecewise-Exponential models show strong performance in their respective arms.
Summary
This paper addresses the challenge of student dropout prediction in Learning Analytics by introducing a survival-oriented benchmark that evaluates predictive models under a harmonized protocol. The study utilizes the Open University Learning Analytics Dataset (OULAD) to compare two approaches: a dynamic weekly model and a continuous-time model. The evaluation framework includes four analytical layers: predictive performance, ablation, explainability, and calibration. Results indicate that Random Survival Forest excels in the continuous-time arm, while Poisson Piecewise-Exponential performs best in the dynamic arm. Notably, the dominant predictive signals identified were temporal and behavioral rather than demographic or structural. The findings emphasize the importance of a multi-dimensional benchmark in Learning Analytics and suggest that dropout risk should be viewed as a temporal-behavioral process, advocating for intervention strategies that focus on engagement trajectories rather than static attributes.
Methodology
The study employs a survival-oriented benchmark using the OULAD dataset, comparing two model representations: a dynamic weekly arm and a continuous-time arm. It evaluates models across four dimensions: predictive performance (using metrics like Integrated Brier Score and C-index), ablation analysis, explainability of predictive drivers, and calibration of risk predictions.
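As a sketch of one of the named metrics, Harrell's C-index can be computed by counting concordant comparable pairs; the toy data below is invented, and production work would use a survival-analysis library rather than this O(n²) loop.

```python
from itertools import combinations

def c_index(times, events, risk):
    """Harrell's concordance index over comparable pairs.

    A pair is comparable when the earlier time is an observed event
    (not censored); it is concordant when the earlier-failing student
    has the higher predicted risk. Ties in risk count as 0.5; ties in
    time are skipped for simplicity.
    """
    num, den = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first = i if times[i] < times[j] else j
        other = j if first == i else i
        if not events[first]:
            continue  # censored before the other: not comparable
        den += 1
        if risk[first] > risk[other]:
            num += 1
        elif risk[first] == risk[other]:
            num += 0.5
    return num / den

times  = [2, 4, 6, 8]        # weeks until dropout or censoring (invented)
events = [1, 1, 0, 1]        # 1 = dropout observed, 0 = censored
risk   = [0.9, 0.7, 0.2, 0.4]
print(c_index(times, events, risk))  # 1.0: perfectly ranked
```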
Results
The Random Survival Forest model leads in discrimination and horizon-specific Brier scores in the continuous-time arm, while the Poisson Piecewise-Exponential model performs best in the dynamic arm. The analysis reveals that the primary predictive signals are temporal and behavioral, with calibration results indicating that most models align well with observed outcomes, except for XGBoost AFT, which showed systematic bias.
Implications
The findings suggest that dropout risk modeling should focus on dynamic and behavioral factors, leading to more effective student support interventions. The study advocates for the adoption of harmonized evaluation protocols in Learning Analytics to improve the reliability of dropout predictions.
From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
Federated Learning
- FEAT addresses the limitations of exemplar replay in federated continual learning by focusing on both sample selection and effective utilization.
- The Geometric Structure Alignment module enhances feature consistency across clients by aligning local representations with shared prototypes.
- The Energy-based Geometric Correction module mitigates prediction bias towards majority classes, improving sensitivity to minority classes.
- Experimental results show that FEAT consistently outperforms seven state-of-the-art methods in terms of accuracy across different datasets.
Summary
This paper addresses the challenges of catastrophic forgetting in federated continual learning (FCL) by proposing a novel method called Federated gEometry-Aware correcTion (FEAT). While existing approaches focus on selecting important samples for exemplar replay, they often neglect the effective utilization of these samples, particularly in environments characterized by continual dynamic heterogeneity. FEAT introduces two key components: the Geometric Structure Alignment module, which aligns feature representations with shared Equiangular Tight Frame prototypes to maintain geometric consistency across clients, and the Energy-based Geometric Correction module, which reduces bias towards majority classes during inference. This dual approach enhances the model's sensitivity to minority classes and improves robustness against class imbalances. Extensive experiments demonstrate that FEAT outperforms existing methods across various datasets, effectively addressing the issues of inter-client heterogeneity and class imbalance in FCL settings.
Methodology
The methodology involves two main components: (1) Geometric Structure Alignment, which distills relational geometry by aligning feature representations with globally shared prototypes, and (2) Energy-based Geometric Correction, which debiases feature embeddings to reduce overconfidence in majority classes during inference. The approach is designed to enhance representation learning consistency across clients in heterogeneous environments.
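The geometric ingredients can be illustrated with a simplex ETF and a cosine-alignment loss; this is a minimal sketch of the general idea only, with `simplex_etf` and `alignment_loss` as hypothetical stand-ins for the paper's modules.

```python
import numpy as np

def simplex_etf(k):
    """Rows are K equiangular unit vectors in R^K with pairwise inner
    product -1/(K-1): a simplex Equiangular Tight Frame."""
    return np.sqrt(k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)

def alignment_loss(features, labels, prototypes):
    """Mean (1 - cosine similarity) between each feature and its class
    prototype; a stand-in for aligning local representations with the
    globally shared ETF prototypes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(f * prototypes[labels], axis=1)))

P = simplex_etf(4)
G = P @ P.T                              # Gram matrix of the prototypes
feats = 5.0 * P[[0, 1]]                  # features pointing at their prototypes
loss = alignment_loss(feats, np.array([0, 1]), P)
print(round(G[0, 1], 4), abs(loss) < 1e-9)
```

Because the prototype geometry is fixed and shared, every client pulls its features toward the same equiangular targets, which is what keeps representations consistent across heterogeneous clients.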
Results
FEAT was evaluated on three datasets with varying levels of heterogeneity, demonstrating significant improvements in Top-1 accuracy compared to seven state-of-the-art methods. The results validate the effectiveness of the geometric distillation and debiasing techniques in enhancing model robustness under class-imbalanced distributions.
Implications
The findings suggest that FEAT can be applied in real-world federated learning scenarios where data distributions are dynamic and imbalanced, such as in healthcare and finance, where continual learning from diverse client data is essential. This approach may lead to more robust and fair machine learning models in federated settings.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Computer Vision
NLP
Generative Models
- ECHO achieves one-step-per-block parallel decoding for efficient CXR report generation.
- Introduces Direct Conditional Distillation (DCD) for improved inference speed with minimal quality loss.
- Response-Asymmetric Diffusion (RAD) adaptation reduces training complexity significantly.
- Outperforms existing autoregressive and diffusion-based models in clinical accuracy and speed.
Summary
The paper introduces ECHO, an efficient diffusion-based vision-language model (dVLM) designed for chest X-ray report generation (CXR-RG). Traditional autoregressive models face significant latency due to sequential token decoding, while diffusion models, although capable of parallel generation, typically require multiple denoising iterations that can compromise output coherence. ECHO addresses these challenges by employing a novel Direct Conditional Distillation (DCD) framework that allows for stable one-step-per-block inference, effectively mitigating the mean-field bias associated with token-factorized denoisers. Additionally, the Response-Asymmetric Diffusion (RAD) training strategy enhances training efficiency without sacrificing model performance. Experimental results indicate that ECHO outperforms state-of-the-art autoregressive methods, achieving significant improvements in report generation metrics while facilitating an 8× increase in inference speed, thus demonstrating its potential to alleviate the workload of radiologists in clinical settings.
Methodology
ECHO utilizes a Direct Conditional Distillation (DCD) framework to construct unfactorized supervision from on-policy diffusion trajectories, enabling the model to capture joint token dependencies. The Response-Asymmetric Diffusion (RAD) training strategy is employed to further enhance training efficiency. The model is built upon an enhanced autoregressive CXR-RG VLM, which is adapted into a block diffusion decoding paradigm.
Results
ECHO demonstrates a 64.33% improvement in RaTE and a 60.58% enhancement in SemScore compared to state-of-the-art autoregressive methods. The model achieves up to an 8× speedup in inference time while maintaining clinical accuracy.
Implications
The advancements presented in ECHO could significantly reduce the workload of radiologists by enabling faster and more efficient generation of chest X-ray reports, potentially improving diagnostic throughput in clinical environments.
Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
Multimodal
Interpretability
Optimization
- Introduces Tree-of-Evidence (ToE) for improved interpretability of multimodal models.
- Frames interpretability as a discrete optimization problem using a beam search strategy.
- Maintains high predictive performance while producing compact evidence sets.
- Demonstrates adaptability in evidence selection based on the context of the data.
Summary
The paper introduces Tree-of-Evidence (ToE), an innovative inference-time search algorithm designed to enhance the interpretability of Large Multimodal Models (LMMs) in high-stakes domains such as healthcare. Traditional interpretability methods often fail to accurately represent the decision-making processes of these complex models, particularly when integrating diverse data types like time-series and text. ToE addresses this by framing interpretability as a discrete optimization problem, utilizing lightweight Evidence Bottlenecks to score groups of data and employing a beam search to identify the essential evidence needed to justify model predictions. The authors evaluate ToE across six tasks from three datasets, demonstrating that it can produce auditable evidence traces while maintaining predictive performance, achieving over 98% AUROC with minimal evidence units. The qualitative analysis reveals that ToE adapts its search strategy based on the context, effectively balancing the use of vital signs and textual data. This approach not only enhances model transparency but also provides a practical mechanism for auditing multimodal models by clearly linking predictions to specific evidence units.
Methodology
ToE employs a multi-step deliberative search process that evaluates candidate evidence combinations using a beam search. It separates the multimodal space into Global Context and Searchable Evidence, scoring coarse data units through lightweight Evidence Bottlenecks. The inference process optimizes for decision agreement, probability stability, and evidence sparsity.
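The deliberative search can be sketched as a beam search over evidence subsets; the scoring function below is a toy stand-in for ToE's combined objective, and all unit names are illustrative.

```python
def beam_search_evidence(units, score, beam_width=2, max_size=3):
    """Beam search over subsets of evidence units, keeping the
    highest-scoring subsets at each size. `score` is assumed to reward
    agreement with the full-context prediction and penalize subset
    size, mirroring ToE's objectives (decision agreement, probability
    stability, evidence sparsity)."""
    beams = [frozenset()]
    best = beams[0]
    for _ in range(max_size):
        candidates = {b | {u} for b in beams for u in units if u not in b}
        if not candidates:
            break
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
        best = max([best] + beams, key=score)
    return best

def score(subset):  # toy: 'hr' + 'note' jointly explain the prediction
    value = 1.0 if {"hr", "note"} <= subset else 0.4 if "hr" in subset else 0.0
    return value - 0.1 * len(subset)    # sparsity penalty per unit

best = beam_search_evidence(["hr", "bp", "note", "spo2"], score)
print(sorted(best))  # ['hr', 'note']
```

The returned subset is the compact, auditable evidence trace: the smallest group of units that still justifies the model's prediction under the scorer.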
Results
ToE was evaluated on six tasks across three datasets, achieving over 98% AUROC with as few as five evidence units. It demonstrated higher decision agreement and lower probability fidelity error compared to existing methods, while also providing clear and auditable evidence traces.
Implications
The development of ToE has significant implications for the deployment of LMMs in critical applications such as healthcare, where interpretability and accountability are paramount. It enables practitioners to trace model predictions back to specific evidence, fostering trust and facilitating regulatory compliance.
From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales
Audio & Speech
Theory
Interpretability
- Introduction of the Spectral Sensitivity Theorem to explain hallucinations in ASR models.
- Identification of two regimes: Structural Disintegration in smaller models and Compression-Seeking Attractor in larger models.
- Validation of theoretical predictions through eigenspectral analysis of Whisper models under adversarial conditions.
- Demonstration of how model scaling impacts the dynamics of signal propagation and hallucination occurrences.
Summary
This paper addresses the critical issue of hallucinations in large Automatic Speech Recognition (ASR) models, particularly focusing on the Whisper model. The authors introduce the Spectral Sensitivity Theorem, which predicts a phase transition in deep networks from a dispersive regime, characterized by signal decay, to an attractor regime, where rank-1 collapse occurs. This transition is influenced by layer-wise gain and alignment. The study validates this theory by analyzing the eigenspectra of activation graphs in Whisper models of varying sizes under adversarial stress. The findings reveal that intermediate models experience Structural Disintegration (Regime I), marked by a 13.4% collapse in Cross-Attention rank, while larger models enter a Compression-Seeking Attractor state (Regime II), where Self-Attention compresses rank by 2.34% and hardens the spectral slope, leading to a decoupling from acoustic evidence. The research highlights the importance of understanding the internal dynamics of ASR models to mitigate hallucinations and improve their reliability.
Methodology
The authors developed a theoretical framework based on spectral graph theory to analyze the signal propagation through Transformer networks. They introduced the Spectral Propagation Instability (SPI) framework to characterize how acoustic information is preserved or suppressed. The analysis involved examining the eigenspectra of activation graphs in Whisper models of different sizes, focusing on the effects of adversarial stress on model performance.
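One standard way to quantify this kind of rank collapse is the effective rank, the exponential of the entropy of the normalized singular values; the sketch below uses that measure as a stand-in for the paper's eigenspectral analysis, with random matrices in place of real activation graphs.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """exp(entropy) of the normalized singular-value distribution,
    a common soft-rank measure; used here as a stand-in for the rank
    statistics behind the paper's collapse figures."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
full = rng.normal(size=(64, 64))                 # well-spread spectrum
collapsed = np.outer(rng.normal(size=64),        # rank-1 'attractor' state
                     rng.normal(size=64))
print(effective_rank(full), effective_rank(collapsed))  # high vs. ~1
```

A dispersive layer keeps the effective rank high; an attractor layer drives it toward 1, which is the rank-1 collapse the theorem predicts.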
Results
The study confirmed the Spectral Sensitivity Theorem, showing that smaller Whisper models (Regime I) suffer from significant structural disintegration, while larger models (Regime II) exhibit a compression-seeking behavior that leads to a reduction in rank and a hardening of the spectral slope. Specifically, the intermediate models showed a 13.4% collapse in Cross-Attention rank, while larger models showed a 2.34% reduction in Self-Attention rank.
Implications
The findings suggest that understanding the spectral dynamics of ASR models can lead to improved strategies for mitigating hallucinations, enhancing the reliability of speech recognition systems. This research could inform the design of future ASR architectures and contribute to the development of more robust machine learning models.
MIPT-SSM: Scaling Language Models with O(1) Inference Cache via Phase Transitions
NLP
Large Language Models
Theory
- Introduces a learned measurement rate to navigate between wave and particle phases in sequence modeling.
- Demonstrates a significant reduction in memory usage while improving accuracy over traditional Transformer models.
- Establishes a formal proof of the incompatibility between norm-preservation and selective forgetting in linear operators.
- Empirical validation across multiple tasks including text classification and language modeling.
Summary
The paper introduces MIPT-SSM, a novel neural sequence architecture inspired by Measurement-Induced Phase Transitions (MIPT). The architecture employs a learned measurement rate (p_t) that dynamically routes computation between two distinct phases: a wave phase (p_t approaching 0) for distributed information propagation and a particle phase (p_t approaching 1) for precise local storage. This approach addresses the inherent incompatibility of norm-preservation and selective forgetting in sequence modeling, as formally proven in the paper. MIPT-SSM is predicted to undergo a phase transition at a critical sequence length of approximately 1024, aligning with observed memory scaling behaviors. Empirical results demonstrate that MIPT-SSM outperforms traditional Transformers on AG News classification tasks, achieving 90.5% accuracy compared to 73.6% for Transformers, while also significantly reducing memory usage from 34,651 MB to 810 MB. The model also shows high accuracy in exact-recall tasks and competitive performance in language modeling tasks, indicating its potential for efficient memory utilization and effective information retrieval.
Methodology
The MIPT-SSM architecture utilizes a learned measurement rate (p_t) to determine the operational phase of information processing. The model employs a recurrence relation that allows for O(N log N) parallel training and O(1) inference per token. The architecture is grounded in phase transition theory, predicting behavior based on the information density ratio. A causal sparse key-value cache is implemented, dynamically populated based on learned p_t values, enhancing memory efficiency.
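The phase-routing idea can be caricatured as a gated recurrence. This is a sketch of the routing mechanism only, under the assumption (not from the paper) that the wave phase is a tanh recurrence and the particle phase freezes the state and writes the token to a cache; all names and parameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mipt_step(h, x, cache, W, w_gate, t):
    """One gated recurrence step. p_t near 0: wave phase, information
    keeps propagating through the hidden state. p_t near 1: particle
    phase, the token is written to a sparse key-value cache and the
    hidden state is largely frozen. Routing caricature only."""
    p_t = sigmoid(w_gate @ x)            # learned measurement rate
    if p_t > 0.5:
        cache[t] = x                     # particle phase: exact local storage
    h_new = (1 - p_t) * np.tanh(W @ h + x) + p_t * h
    return h_new, cache

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(4, 4))
w_gate = np.array([5.0, 0.0, 0.0, 0.0])  # gate keys on the first feature
h, cache = np.zeros(4), {}
h, cache = mipt_step(h, np.array([1.0, 0.0, 0.0, 0.0]), cache, W, w_gate, t=0)
h, cache = mipt_step(h, np.array([-1.0, 0.0, 0.0, 0.0]), cache, W, w_gate, t=1)
print(sorted(cache))  # [0]: only the high-p_t token was cached
```

Because only high-p_t tokens enter the cache, the inference-time memory stays sparse, which is the intuition behind the O(1) inference cache claim.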
Results
MIPT-SSM achieved 90.5% accuracy on AG News classification, outperforming Transformers by 16.9 percentage points. Memory requirements were reduced from 34,651 MB to 810 MB, a 42.8x reduction. In exact-recall tasks, the model reached 96.8% accuracy, and in language modeling on WikiText-103, it achieved a perplexity of 92.1, closely matching the Transformer’s performance while significantly reducing inference cache complexity.
Implications
The findings suggest that MIPT-SSM could lead to more efficient language models capable of handling long sequences with reduced memory requirements. This architecture may be applicable in various NLP tasks where memory efficiency and accuracy are critical, potentially influencing future designs of neural sequence models.
Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
NLP
Audio & Speech
Multimodal
- Development of a system for collecting and processing unstructured data from Telegram.
- Implementation of advanced speech-to-text models for audio data transcription.
- Evaluation of NER solutions, highlighting the effectiveness of transformer-based architectures.
- Introduction of anonymization metrics to balance data protection and analytical utility.
Summary
This paper addresses the challenge of creating datasets for cybercrime analysis while adhering to regulations like the GDPR. The authors propose a system for collecting data from the Telegram platform, which includes text, audio, and images. They implement speech-to-text transcription models with signal enhancement techniques and evaluate various Named Entity Recognition (NER) solutions, including Microsoft Presidio and transformer-based models. The study finds that Parakeet excels in audio transcription, while the proposed NER solutions yield high F1-scores in detecting sensitive information. Anonymization metrics are introduced to assess the preservation of data structure while ensuring personal information protection, thus supporting cybersecurity research within legal frameworks. The research aims to extract relationship structures indicative of fraudulent activities from unstructured content, facilitating the analysis of cybercrime patterns and their validation by law enforcement.
Methodology
The methodology involves collecting unstructured data from Telegram, applying speech-to-text transcription with signal enhancement, and evaluating NER solutions for identifying sensitive information. The authors compare the performance of different models and tools, including transformer-based architectures and Microsoft Presidio, to achieve optimal results in entity recognition and anonymization.
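The anonymization step can be illustrated with a stdlib-only sketch. This is not Microsoft Presidio or the paper's NER models; the entity patterns and placeholder format are illustrative assumptions, showing only the core idea of replacing detected entities with typed placeholders so the text keeps its analytical utility.

```python
import re

# Illustrative patterns only, not the paper's NER stack. Order matters: the
# IBAN rule must fire before the phone rule, or the IBAN's digit run would be
# mistaken for a phone number.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "IBAN":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def anonymize(text: str) -> str:
    # Replace each detected entity with a typed placeholder so the document's
    # structure survives anonymization and can still be analyzed.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

msg = "Contact john.doe@example.com or +49 170 1234567 about DE44500105175407324931."
print(anonymize(msg))
```

A production system would of course use learned NER models for names and organizations; regexes only cover entities with rigid surface forms.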
Results
The experimental results indicate that the Parakeet model provides the best performance in audio transcription tasks. The proposed NER solutions achieve high F1-scores in detecting sensitive information, demonstrating their effectiveness in identifying named entities while maintaining compliance with GDPR requirements. Anonymization metrics are developed to evaluate the structural coherence of data post-anonymization.
Implications
The findings have significant implications for cybersecurity research, particularly in the context of monitoring and analyzing cybercrime activities. The methodologies established can aid law enforcement agencies in detecting social engineering patterns while ensuring compliance with data protection regulations. This work contributes to the development of tools that can effectively analyze unstructured data in a legally compliant manner.
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Computer Vision
Interpretability
Theory
- Introduces two new metrics for evaluating model generalization based on internal mechanisms.
- Dependency Depth Bias (DDB) measures reliance on deep versus shallow features before deployment.
- Circuit Shift Score (CSS) predicts model performance under distribution shifts after deployment.
- Both metrics show improved correlation with OOD performance, outperforming existing proxies.
Read more
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Summary
This paper addresses the critical need for reliable generalization metrics in machine learning, particularly for vision transformers, in scenarios where labeled data is scarce. The authors propose a novel approach that leverages the internal workings of models, specifically their circuits, to create two new metrics: Dependency Depth Bias (DDB) for pre-deployment model selection and Circuit Shift Score (CSS) for post-deployment performance monitoring. DDB quantifies the reliance on deep versus shallow features, while CSS measures deviations in the model's circuit structure under distribution shifts. The study demonstrates that these metrics significantly improve correlation with out-of-distribution (OOD) performance compared to existing proxies, thus providing a more robust evaluation of model generalization capabilities. The findings indicate that models with strong generalization exhibit deeper inter-layer connections, and CSS can effectively detect silent failures in model performance, enhancing reliability in real-world applications.
Methodology
The authors utilize circuit discovery to extract causal interactions between internal representations of vision transformers. They analyze the structural patterns of these circuits to derive the DDB and CSS metrics, which are then validated across various tasks and datasets to assess their effectiveness in predicting generalization performance.
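The paper's exact DDB formula is not reproduced here, but the idea can be illustrated with a toy statistic, under the assumption that a discovered circuit is given as directed edges (source_layer, target_layer) feeding the output:

```python
def depth_bias(edges, num_layers):
    # Hypothetical depth-bias statistic (an illustrative stand-in for DDB):
    # mean source-layer depth of the circuit's edges, scaled to [0, 1].
    # Values near 1 indicate reliance on deep features, near 0 on shallow ones.
    if not edges:
        return 0.0
    return sum(src for src, _ in edges) / (len(edges) * (num_layers - 1))

shallow_circuit = [(0, 11), (1, 11), (2, 11)]  # output head reads early layers
deep_circuit = [(8, 11), (9, 11), (10, 11)]    # output head reads late layers

print(depth_bias(shallow_circuit, 12), depth_bias(deep_circuit, 12))
```

Under the paper's finding that strong generalizers exhibit deeper inter-layer connections, a statistic of this shape would score them higher before any OOD data is seen.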
Results
The proposed metrics, DDB and CSS, demonstrate a significant improvement in correlation with generalization performance, with increases of 13.4% and 34.1% respectively compared to existing metrics. Additionally, CSS provides a 45% gain in detection F1 score for early identification of silent failures in model performance.
Implications
The findings suggest that evaluating model generalization through internal mechanisms can lead to more reliable model selection and monitoring strategies, particularly in high-stakes applications where labeled data is limited. This approach could enhance the robustness of machine learning models in real-world scenarios, reducing the risk of performance degradation due to distribution shifts.
On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment
Multimodal
- First application of functional maps to multimodal neural representation alignment.
- Evidence that independently pretrained vision and language encoders develop similar spectral complexity.
- Identification of the spectral complexity–orientation gap in cross-modal representations.
- Introduction of three quantitative diagnostics for assessing representation compatibility.
Read more
On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment
Summary
This paper investigates cross-modal alignment between independently pretrained vision and language encoders using a functional map framework from computational geometry. The study highlights that while the functional map framework underperforms compared to traditional methods like Procrustes alignment in cross-modal retrieval tasks, it reveals significant structural properties of multimodal representations. The authors find that the Laplacian eigenvalue spectra of the vision and language encoders are quantitatively similar, indicating comparable intrinsic complexity. However, the functional map exhibits low diagonal dominance and high orthogonality error, suggesting that while the models capture similar structures, they do not organize them in a compatible manner. This phenomenon is termed the spectral complexity–orientation gap. The paper introduces three diagnostic metrics for assessing cross-modal representation compatibility: diagonal dominance, orthogonality deviation, and Laplacian commutativity error. The findings suggest that independently pretrained models can develop similar representation structures but may not align effectively in terms of their organization.
Methodology
The study employs a functional map framework to analyze cross-modal alignment by encoding samples from the Flickr30k dataset using a vision encoder (DINOv2) and a text encoder (MiniLM). It constructs k-nearest-neighbor graphs, computes normalized graph Laplacians, and extracts spectral bases. The functional map is derived by solving a regularized least-squares problem that penalizes Laplacian commutativity violations, and comparisons are made against traditional alignment methods.
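One of the diagnostics, the normalized spectral distance between the two Laplacian eigenvalue spectra, can be sketched as follows; the specific normalization is an assumption for this example, not necessarily the paper's, and the eigenvalues are toy numbers rather than DINOv2/MiniLM outputs:

```python
import math

def spectral_distance(spec_a, spec_b):
    # Relative L2 distance between two eigenvalue spectra: 0 means identical
    # spectra, i.e. comparable intrinsic complexity of the two graphs.
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(spec_a, spec_b)))
    den = (math.sqrt(sum(a * a for a in spec_a))
           + math.sqrt(sum(b * b for b in spec_b))) or 1.0
    return num / den

# Toy leading eigenvalues for each modality's graph Laplacian (not real data):
vision_spec = [0.00, 0.12, 0.25, 0.41, 0.55]
text_spec   = [0.00, 0.13, 0.24, 0.43, 0.52]
print(round(spectral_distance(vision_spec, text_spec), 4))
```

A small value here is exactly the situation the paper reports: similar complexity, which says nothing yet about whether the bases are compatibly oriented.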
Results
The functional map framework achieved a 2.2% image-to-text (i2t) Recall@1 at 100 anchors, significantly lower than Procrustes alignment (12.1%) and relative representations (13.4%). The spectral diagnostics revealed a normalized spectral distance of 0.043 between the Laplacian eigenvalue spectra of the two encoders, indicating similar intrinsic complexity, but the functional map showed near-zero diagonal dominance and a high orthogonality error of 70.15, highlighting the misalignment in their organizational structures.
Implications
The findings suggest that while independently pretrained models can capture similar structures, their organizational differences may hinder effective cross-modal alignment. This has implications for the development of more modular and effective multimodal systems that can leverage independently trained models without the need for extensive retraining.
On the Role of DAG topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach
Reinforcement Learning
Graph Learning
Optimization
- Introduces a GNN-based DRL approach for scheduling workflows in cloud environments.
- Identifies OOD conditions that cause performance degradation in GNN-based schedulers.
- Analyzes the impact of structural mismatches on message passing and policy generalization.
- Highlights the need for robust representations to improve scheduling performance under distribution shifts.
Read more
On the Role of DAG topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach
Summary
This paper addresses the challenge of scheduling workflows represented as directed acyclic graphs (DAGs) in cloud environments, focusing on minimizing both completion time and energy consumption. The authors propose a novel scheduling approach that utilizes Graph Neural Networks (GNNs) combined with Deep Reinforcement Learning (DRL) to adaptively assign heterogeneous compute resources to tasks. A significant contribution of the work is the identification of out-of-distribution (OOD) conditions that lead to performance failures in GNN-based schedulers. The authors provide a detailed analysis of how structural mismatches between training and deployment environments disrupt message passing and policy generalization, ultimately leading to degraded performance. Through controlled evaluations, they demonstrate the limitations of current GNN-based scheduling methods and emphasize the need for more robust representations to ensure reliable performance under varying conditions. The findings highlight the importance of considering DAG topology in the design of scheduling algorithms, particularly in the context of energy-aware cloud computing.
Methodology
The authors developed a GNN-based deep reinforcement learning scheduler that minimizes workflow completion time and energy usage. They conducted controlled evaluations to analyze the performance of the scheduler under various out-of-distribution conditions, focusing on the structural mismatches between training and deployment environments.
Results
The study revealed that GNN-based schedulers experience significant performance degradation when faced with OOD conditions due to structural mismatches. The analysis provided insights into the limitations of current scheduling methods and underscored the necessity for improved representations to ensure reliable performance across different scenarios.
Implications
The findings suggest that cloud scheduling algorithms need to incorporate more robust representations and consider the topology of DAGs to enhance energy efficiency and performance. This research could lead to more effective resource allocation strategies in cloud computing, particularly for AI workloads that demand high computational resources.
Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
Optimization
Theory
- Gradient descent on L-layer ReLU networks preserves L−1 conservation laws under continuous flow.
- Discrete gradient descent breaks these conservation laws, leading to a drift characterized by a power law.
- A closed-form spectral crossover formula for drift is derived and validated across multiple architectures.
- Cross-entropy loss induces exponential spectral compression in the Hessian, independent of dataset size.
Read more
Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
Summary
This paper addresses the paradox of why gradient descent effectively finds good solutions in non-convex neural network optimization, despite the NP-hard nature of the problem. The author demonstrates that gradient flow in L-layer ReLU networks preserves certain conservation laws, which confine optimization trajectories to lower-dimensional manifolds. However, when using discrete gradient descent, these conservation laws break, leading to a drift in the optimization process. The study provides a detailed spectral theory explaining this drift, characterized by a power law scaling with a non-integer exponent that varies based on the architecture and loss function. The author derives a closed-form spectral crossover formula for the drift, validates it across various network architectures, and identifies two distinct dynamical regimes in the optimization process. The findings suggest that cross-entropy loss induces a specific spectral compression in the Hessian, which is independent of the training set size, and that the breaking of conservation laws is maximized at the Edge of Stability, where training performance paradoxically improves.
Methodology
The author employs a theoretical approach to analyze the conservation laws in L-layer ReLU networks under gradient flow and discrete gradient descent. The drift is decomposed and characterized using spectral analysis, leading to the derivation of a closed-form formula for the drift exponent. Theoretical results are validated through extensive experiments across different network architectures.
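The continuous-vs-discrete gap is visible even on a two-parameter toy model, f(a, b) = a*b with squared loss: gradient flow conserves Q = a^2 - b^2 exactly, while discrete gradient descent shrinks it by a step-size-dependent amount. (This toy shows roughly linear drift in the step size; the non-integer exponents between 1.1 and 1.6 reported in the paper arise on real architectures.)

```python
# Toy "network": f(a, b) = a * b, loss = 0.5 * (a*b - 1)^2.
# Continuous gradient flow conserves Q = a^2 - b^2. Expanding the discrete
# update shows each step multiplies Q by (1 - eta^2 * r^2), so the
# conservation law breaks by an amount that vanishes as eta -> 0.
def drift(eta, steps=200):
    a, b = 2.0, 0.8
    q0 = a * a - b * b
    for _ in range(steps):
        r = a * b - 1.0                            # residual
        a, b = a - eta * b * r, b - eta * a * r    # simultaneous GD update
    return abs(a * a - b * b - q0)

for eta in (0.1, 0.05, 0.025):
    print(eta, drift(eta))   # drift shrinks as the step size shrinks
```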
Results
The study finds that the drift in conservation laws scales as η^α, with α approximately between 1.1 and 1.6, depending on various factors. The derived spectral crossover formula accurately predicts the drift behavior across different architectures. The research also confirms that cross-entropy loss leads to exponential compression of the Hessian spectrum, with a timescale independent of the training set size.
Implications
These findings provide deeper insights into the optimization dynamics of neural networks, particularly in understanding how gradient descent navigates non-convex landscapes. The results could inform the design of more efficient training algorithms and enhance the understanding of convergence behaviors in deep learning.
Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection
Time Series
- Creation of a large hybrid dataset combining experimental and simulated data for batch distillation.
- Development of an automated Python-based process simulator for generating consistent simulation data.
- The hybrid dataset is openly released and includes rich metadata and anomaly annotations.
- Addresses the challenges of obtaining annotated data from real chemical processes.
Read more
Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection
Summary
This paper addresses the challenge of anomaly detection (AD) in chemical processes, particularly batch distillation, by creating a large hybrid dataset that combines experimental and simulated data. The authors developed a novel Python-based process simulator that automates the generation of simulation data, which is calibrated to existing experimental records. This hybrid dataset includes comprehensive metadata and structured anomaly annotations, allowing for the simulation of both normal operations and various anomalies. The study emphasizes the importance of having large, diverse, and well-annotated datasets for training deep learning models for AD. The authors highlight that their dataset is the largest collection of consistent experimental and simulation data for dynamic chemical processes currently available, providing a unique resource for future research in deep AD methods and simulation-to-experiment style transfer. The paper also discusses the potential for generating pseudo-experimental data from simulation data, which could further enhance the development of machine learning-based AD methods.
Methodology
The authors developed a Python-based process simulator that employs a tailored index-reduction strategy for solving differential-algebraic equations. This simulator automates the generation of simulation data based on existing experimental records, allowing for the consistent creation of time-series data across various operational scenarios.
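As a heavily simplified stand-in for the paper's DAE-based simulator (which is not reproduced here), a batch distillation of a binary mixture can be sketched with the classical Rayleigh equation under constant relative volatility; all parameter values are illustrative:

```python
def simulate(x0=0.5, alpha=2.5, L0=1.0, dL=1e-4):
    # Boil off half the charge; track still composition x (light component).
    L, x = L0, x0
    trace = [(L, x)]
    while L > 0.5 * L0:
        y = alpha * x / (1.0 + (alpha - 1.0) * x)  # constant-volatility VLE
        x += (y - x) * (-dL) / L                   # Rayleigh equation, Euler step
        L -= dL
        trace.append((L, x))
    return trace

trace = simulate()
print(trace[0], trace[-1])  # the still depletes in the light component
```

The actual simulator solves full differential-algebraic systems with index reduction; this toy only conveys how a time series of compositions is generated per run.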
Results
The study successfully generated a comprehensive hybrid dataset that includes both normal and anomalous operating conditions for 119 batch distillation experiments. The dataset is noted for its size and consistency, making it a valuable resource for developing and benchmarking anomaly detection methods.
Implications
The hybrid dataset can significantly enhance the training and validation of machine learning models for anomaly detection in chemical processes. It also opens avenues for research in generating pseudo-experimental data, which can help in exploring unsafe or undesirable operating conditions without the risks associated with real experiments.
Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
NLP
Large Language Models
Theory
- LLMs exhibit miscalibrated self-assessments, leading to inconsistent escalation behavior.
- Escalation thresholds vary significantly across models and are not predicted by architecture or scale.
- Interventions such as supervised fine-tuning on chain-of-thought targets yield robust decision-making policies.
- Effective automation requires LLMs to explicitly reason about uncertainty and decision costs.
Read more
Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
Summary
This paper investigates the decision-making process of large language models (LLMs) regarding when to act on their predictions and when to escalate decisions to humans. The authors model this as a decision under uncertainty, where an LLM predicts outcomes, estimates its accuracy, and weighs the costs of acting versus escalating. The study evaluates eight models across five decision-making domains, revealing significant miscalibration in self-assessments of accuracy and variability in escalation behavior that is not linked to model architecture or size. The authors propose interventions to improve decision-making, including cost framing and supervised fine-tuning on chain-of-thought targets, which enhance model performance and generalization across various tasks. The findings emphasize the need for careful characterization of escalation behavior in LLMs before deployment, advocating for training that encourages explicit reasoning about uncertainty and decision costs.
Methodology
The authors model the escalation decision as a cost-benefit analysis where LLMs predict outcomes and estimate their accuracy. They evaluate eight models from four families on five decision-making tasks derived from human decision data. The study employs various interventions, including cost framing and supervised fine-tuning, to assess their impact on model performance.
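The act-or-escalate decision described above reduces to an expected-cost comparison; the specific cost values below are illustrative assumptions, not the paper's calibration:

```python
def should_escalate(p_correct, cost_error, cost_escalate):
    # Act when the expected cost of acting, (1 - p) * cost_error, is below
    # the fixed cost of handing the decision to a human; escalate otherwise.
    return (1.0 - p_correct) * cost_error > cost_escalate

# Self-assessed 90% accuracy: expected error cost 1.0 < 2.0, so act (False).
print(should_escalate(0.90, cost_error=10.0, cost_escalate=2.0))
# True accuracy only 60%: expected error cost 4.0 > 2.0, so escalate (True).
print(should_escalate(0.60, cost_error=10.0, cost_escalate=2.0))
```

The paper's miscalibration finding is exactly the gap between these two calls: a model that believes the first line while its true accuracy matches the second will systematically under-escalate.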
Results
The study finds that LLMs are often miscalibrated in their self-assessments, leading to overconfidence or underconfidence in their predictions. The escalation behavior varies widely among models, with some preferring to escalate and others to implement decisions. Interventions, particularly supervised fine-tuning on chain-of-thought targets, significantly improve the models' ability to make optimal escalation decisions across different datasets and scenarios.
Implications
The findings suggest that organizations deploying LLMs for automation should carefully evaluate and characterize the escalation behavior of these models. Training models to explicitly consider uncertainty and decision costs can enhance their effectiveness in real-world applications, reducing the risk of propagating errors and improving overall decision-making efficiency.
Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification
Multimodal
- Identifies the fragility of the assumption that evidence from different views is comparable.
- Proposes TMUR, which decouples evidence extraction from fusion arbitration.
- Introduces a unified router that generates sample-level expert weights based on global context.
- Demonstrates improved classification performance and reliability through extensive experiments.
Read more
Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification
Summary
This paper addresses the challenges in trusted multi-view classification, where independent views produce class evidence and uncertainty for predictions. The authors highlight that the conventional approach assumes that evidence from different views is numerically comparable, which is often not the case due to variations in feature space, noise levels, and semantic granularity. This leads to unreliable uncertainty assessments that can skew the fusion process. To overcome this issue, the authors propose a novel framework called Trusted Multi-view learning with Unified Routing (TMUR). TMUR decouples evidence extraction from fusion arbitration by employing view-private experts and a collaborative expert, along with a unified router that generates sample-level expert weights based on the global multi-view context. This approach encourages balanced expert utilization and enhances the reliability of predictions. The paper includes a theoretical analysis supporting the need for global routing over branch-local arbitration. Extensive experiments on 14 datasets demonstrate that TMUR consistently outperforms 15 recent baselines in terms of classification performance and reliability.
Methodology
The TMUR framework utilizes view-private expert networks for each view and a collaborative expert network. A unified router observes the global multi-view context to generate sample-specific weights for these experts, allowing for a more reliable fusion of evidence. The approach incorporates soft load-balancing and diversity regularization to enhance expert specialization and utilization.
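A toy version of the routing-and-fusion step looks like the following; names, shapes, and the weighted-sum fusion are illustrative assumptions (the paper's evidential fusion is richer than this):

```python
import math

def route_and_fuse(evidences, router_logits):
    # Softmax over router logits -> sample-level expert weights, so arbitration
    # uses global context rather than each view's own uncertainty estimate.
    z = max(router_logits)
    w = [math.exp(l - z) for l in router_logits]
    s = sum(w)
    weights = [wi / s for wi in w]
    # Fuse per-class evidence as a weighted sum across experts.
    num_classes = len(evidences[0])
    fused = [sum(wi * e[c] for wi, e in zip(weights, evidences))
             for c in range(num_classes)]
    return weights, fused

# Two view-private experts plus one collaborative expert, 3 classes:
evidences = [[4.0, 1.0, 0.2],   # view 1 expert
             [0.5, 0.4, 0.3],   # view 2 expert (noisy view)
             [3.0, 1.2, 0.4]]   # collaborative expert
logits = [1.2, -0.5, 0.8]       # router downweights the noisy view
weights, fused = route_and_fuse(evidences, logits)
print([round(wi, 3) for wi in weights], [round(fc, 2) for fc in fused])
```

The point of the sketch: the noisy view's raw evidence never competes on its own numeric scale; the router decides how much it counts for this sample.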
Results
TMUR was tested on 14 datasets and showed consistent improvements in classification performance and reliability compared to 15 recent baselines. The results validate the effectiveness of the unified routing mechanism in addressing the issues of evidence scale incomparability.
Implications
The findings suggest that TMUR can be applied in various multi-view classification scenarios, particularly where data heterogeneity and varying noise levels are present. This framework can enhance the robustness of predictions in real-world applications, such as image and text classification, where multiple data sources are integrated.
Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing
Theory
Generative Models
Optimization
- Current learned PDE solvers often rely on state prediction, which is inadequate for complex scientific problems.
- Flow learners parameterize transport vector fields, providing a more accurate representation of PDE dynamics.
- The proposed approach enhances uncertainty quantification and supports continuous-time predictions.
- A shift from regression-based models to transport-based learning is necessary for effective PDE solving.
Read more
Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing
Summary
This paper addresses the challenges of solving partial differential equations (PDEs) in scientific computing, emphasizing the limitations of current learned solvers. The authors propose a novel approach called 'flow learners,' which focuses on parameterizing transport vector fields and generating trajectories through integration, aligning more closely with the continuous dynamics of PDE evolution. The paper critiques existing paradigms such as physics-informed neural networks and neural operators for their reliance on state prediction, which often fails in complex scenarios involving uncertainty and long-term predictions. By shifting the focus from state regression to modeling transport over physically admissible futures, flow learners offer a more robust framework for learned PDE solving. The authors outline a research agenda that stems from this new perspective, advocating for a physics-to-physics alignment in solver design that enhances uncertainty quantification and continuous-time prediction capabilities.
Methodology
The authors introduce flow learners as a new class of models that parameterize transport vector fields instead of predicting states. This involves integrating or sampling the induced dynamics to generate trajectories that reflect the continuous evolution defined by PDEs. The paper critiques existing methodologies and proposes a framework that aligns solver structure with physical processes.
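The contrast with state prediction can be made concrete in a few lines: a flow learner outputs a vector field, and trajectories come from integrating it. The field below is a hand-written decay law standing in for a learned network:

```python
import math

def v(u, t):
    # Hand-written stand-in for a learned transport field: du/dt = -0.5 * u.
    return -0.5 * u

def rollout(u0, t_end=2.0, dt=0.01):
    # Trajectories are produced by integrating the field, not by regressing
    # the next state directly; any off-grid time is reachable by changing dt.
    u = u0
    steps = int(round(t_end / dt))
    for k in range(steps):
        u += dt * v(u, k * dt)   # explicit Euler integration
    return u

print(rollout(1.0), math.exp(-0.5 * 2.0))  # Euler estimate vs. exact solution
```

Because the model commits only to the instantaneous dynamics, continuous-time prediction and sampling over admissible futures fall out of the integration scheme rather than requiring a new model per horizon.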
Results
The paper does not present empirical results but argues for the theoretical advantages of flow learners over traditional learned solvers. It highlights the inadequacies of current approaches in handling uncertainty and long-term predictions, suggesting that flow learners can better capture the dynamics of PDEs.
Implications
The proposed flow learners could significantly improve the efficiency and accuracy of PDE solving in various scientific fields, including climate modeling, engineering simulations, and medical applications. This paradigm shift may enable real-time predictions and better decision-making under uncertainty.
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Reinforcement Learning
Generative Models
Robotics
- Introduction of Hierarchical Implicit Flow Q-Learning (HIFQL) for offline GCRL.
- Utilization of a goal-conditioned mean flow policy to enhance hierarchical policy expressiveness.
- Incorporation of LeJEPA loss for improved goal representation and generalization.
- Demonstrated strong performance on OGBench benchmark for both state-based and pixel-based tasks.
Read more
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning
Summary
This paper presents Hierarchical Implicit Flow Q-Learning (HIFQL), a novel approach for offline goal-conditioned reinforcement learning (GCRL) that addresses the challenges of long-horizon control. Traditional methods like Hierarchical Implicit Q-Learning (HIQL) struggle with the expressiveness of Gaussian policies and the generation of effective subgoals. HIFQL introduces a goal-conditioned mean flow policy that utilizes an average velocity field to enhance hierarchical policy modeling. This allows for efficient action generation through one-step sampling, improving the expressiveness of both high-level and low-level policies. Additionally, the authors propose a LeJEPA loss to enhance goal representation, encouraging more discriminative embeddings that improve generalization. The experimental results demonstrate that HIFQL outperforms existing methods on the OGBench benchmark across various state-based and pixel-based tasks, showcasing its effectiveness in long-horizon offline GCRL scenarios.
Methodology
HIFQL extends the HIQL framework by replacing unimodal Gaussian policies with expressive mean flow policies at both high and low levels. This method captures complex target distributions and enables efficient one-step action generation. The LeJEPA-based goal representation encoder is employed to learn semantically meaningful goal embeddings, enhancing the robustness of high-level policy learning.
Results
The experimental evaluation on the OGBench benchmark indicates that HIFQL achieves superior performance compared to existing offline GCRL methods, effectively handling both state-based and pixel-based tasks, particularly in long-horizon scenarios.
Implications
The proposed HIFQL method has significant implications for advancing offline goal-conditioned reinforcement learning, particularly in applications requiring long-horizon decision-making, such as robotics and autonomous systems. Its ability to generate effective subgoals and improve goal representation could enhance the performance of agents in complex environments.
Adaptive Candidate Point Thompson Sampling for High-Dimensional Bayesian Optimization
Optimization
- Introduction of Adaptive Candidate Thompson Sampling (ACTS) to improve Bayesian optimization in high dimensions.
- ACTS adaptively reduces the search space by generating candidate points in gradient-aligned subspaces.
- Demonstrated significant performance improvements over traditional TS methods in both synthetic and real-world scenarios.
- Maintains global consistency, ensuring effective exploration of the optimization landscape.
Read more
Adaptive Candidate Point Thompson Sampling for High-Dimensional Bayesian Optimization
Summary
This paper presents Adaptive Candidate Thompson Sampling (ACTS), a novel approach to enhance Bayesian optimization in high-dimensional spaces. Traditional Thompson sampling (TS) methods struggle with the curse of dimensionality, as the density of candidate points required for effective sampling grows exponentially with dimensionality. ACTS addresses this issue by adaptively reducing the search space during sampling. Instead of relying on a fixed set of candidate points, ACTS generates candidate points in subspaces aligned with the gradient of a surrogate model sample. This allows for a denser and more effective discretization of the candidate points, leading to improved sampling of maxima. The authors demonstrate that ACTS can be seamlessly integrated into existing TS methods and shows significant performance improvements across various synthetic and real-world benchmarks. The method is shown to maintain global consistency, ensuring that it can eventually query points arbitrarily close to the global maximizer, despite using local gradient information for candidate selection.
Methodology
The methodology involves sampling the gradient of a Gaussian process (GP) surrogate model at the incumbent point to identify an ascent direction. Candidate points are then generated within a smaller, gradient-aligned region, allowing for a denser candidate set. This approach is coupled with posterior sampling to yield higher values of the posterior sample paths.
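The gradient-aligned candidate step can be sketched as follows; the GP posterior sample is replaced by a known smooth function with an analytic gradient, which is an assumption made for compactness, not the paper's setup:

```python
import math
import random

def gradient_aligned_candidates(x, grad, radius=0.2, n=64, seed=0):
    # Draw candidates in a small box shifted along the ascent direction from
    # the incumbent, instead of uniformly over the full high-dimensional space.
    rng = random.Random(seed)
    g_norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    center = [xi + radius * g / g_norm for xi, g in zip(x, grad)]
    return [[ci + rng.uniform(-radius, radius) for ci in center]
            for _ in range(n)]

# Stand-in surrogate sample: f(x) = -sum(x_i^2), maximized at the origin (20-D).
f = lambda x: -sum(xi * xi for xi in x)
grad_f = lambda x: [-2.0 * xi for xi in x]

incumbent = [0.5] * 20
cands = gradient_aligned_candidates(incumbent, grad_f(incumbent))
best = max(cands, key=f)
print(f(incumbent), f(best))  # the best candidate improves on the incumbent
```

With the same candidate budget spread over the whole 20-dimensional box, most draws would land far from any ascent direction; concentrating the discretization near the incumbent is what restores sample density.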
Results
ACTS outperformed traditional TS methods in finding higher values of posterior sample paths across various high-dimensional benchmarks. The method showed significant improvements when combined with local or sparse Bayesian optimization strategies, matching or exceeding the performance of alternative methods.
Implications
The findings suggest that ACTS can be effectively applied in various fields requiring high-dimensional optimization, such as machine learning, scientific discovery, and robotics, where sample efficiency is crucial.
SCOT: Multi-Source Cross-City Transfer with Optimal-Transport Soft-Correspondence Objective
Graph Learning
Optimization
Theory
- SCOT addresses the central challenge of establishing explicit soft correspondences in cross-city transfer learning.
- The framework utilizes Sinkhorn-based entropic optimal transport to manage unequal region partitions effectively.
- An OT-weighted contrastive objective sharpens semantic separability and enhances transferability of learned embeddings.
- SCOT shows consistent improvements over strong baselines in real-world applications, demonstrating robustness under data heterogeneity.
Read more
SCOT: Multi-Source Cross-City Transfer with Optimal-Transport Soft-Correspondence Objective
Summary
The paper introduces SCOT, a novel framework for cross-city transfer learning that addresses the challenges of aligning region representations from different cities with incompatible partitions and no ground-truth correspondences. Traditional methods often rely on heuristic matching or distribution-level alignment, which can be unstable and sensitive to the choice of anchors. SCOT leverages Sinkhorn-based entropic optimal transport to learn explicit soft correspondences between unequal region sets. It incorporates an OT-weighted contrastive objective to enhance transferable structure and employs a cycle-style reconstruction regularizer to stabilize optimization. The framework aligns multiple source cities to a shared prototype hub, guided by a target-induced prior, thus preventing source domination. Experimental results demonstrate that SCOT significantly improves transfer accuracy and robustness across various urban computing tasks, including GDP, population, and CO2 estimation, while providing interpretable diagnostics of alignment quality.
Methodology
SCOT employs a Sinkhorn-based entropic optimal transport framework to establish soft correspondences between regions from different cities. It integrates an OT-weighted contrastive objective to enhance the quality of learned embeddings and utilizes a cycle reconstruction regularizer to ensure optimization stability. The framework is extended to multi-source transfer by aligning each source city to a shared prototype hub using balanced entropic transport, guided by a target-induced prior.
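The soft-correspondence step rests on entropic optimal transport, which a minimal Sinkhorn loop makes concrete; the cost matrix, regularization strength, and uniform marginals below are toy values, not those used in the paper:

```python
import math

def sinkhorn(cost, eps=0.1, iters=200):
    # Entropic OT between unequal region sets via alternating marginal scaling.
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m      # uniform marginals
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# 3 source regions vs. 4 target regions: no hard one-to-one matching exists,
# so mass is split softly according to the (toy) cost matrix.
cost = [[0.0, 0.9, 0.8, 0.7],
        [0.9, 0.1, 0.8, 0.9],
        [0.8, 0.9, 0.2, 0.3]]
P = sinkhorn(cost)
print([round(sum(row), 3) for row in P])                   # each row ~ 1/3
print([round(sum(r[j] for r in P), 3) for j in range(4)])  # each column 1/4
```

The resulting plan P is exactly the kind of explicit soft correspondence SCOT learns, and its entries double as interpretable diagnostics of alignment quality.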
Results
The experimental evaluation of SCOT on tasks such as GDP, population, and CO2 estimation reveals significant gains in transfer accuracy and robustness compared to existing methods. The results indicate that the improvements are primarily due to the alignment design rather than the capacity of the underlying encoders, confirming the effectiveness of the proposed approach.
Implications
The SCOT framework has potential applications in urban computing, particularly in scenarios where labeled data is scarce. It can enhance predictive modeling in various domains, including economic forecasting, urban planning, and environmental monitoring, by enabling more effective knowledge transfer across cities.
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
Reinforcement Learning
Theory
Optimization
- Prediction Arena benchmarks AI models in real prediction markets with actual capital.
- Cohort 1 models showed significant performance differences between platforms, with Kalshi yielding worse returns than Polymarket.
- The study identifies key drivers of performance, including initial prediction accuracy and the ability to capitalize on correct predictions.
- Computational efficiency does not correlate with performance, challenging assumptions about model complexity.
Read more
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
Summary
The paper introduces Prediction Arena, a novel benchmark designed to evaluate AI models' predictive accuracy and decision-making capabilities by allowing them to autonomously trade on real prediction markets using actual capital. This approach contrasts with traditional synthetic benchmarks, as it tests models in live environments where trades are executed on platforms like Kalshi and Polymarket, providing objective ground truth. The evaluation spans 57 days, tracking two cohorts of models: six frontier models engaged in live trading and four next-generation models in paper trading. Results indicate a performance hierarchy among the models, with returns on Kalshi ranging from -16.0% to -30.8%, while Polymarket showed less severe losses averaging -1.1%. Notably, the model grok-4-20-checkpoint achieved a 71.4% settlement win rate on Polymarket, highlighting the influence of platform design on model success. The study also examines computational efficiency, settlement accuracy, and trading behavior, revealing that performance is not solely dependent on computational effort. Overall, Prediction Arena provides a comprehensive framework for assessing AI models in real-world financial contexts, addressing limitations of existing benchmarks.
Methodology
The methodology involves deploying AI models as autonomous traders in live prediction markets, tracking their performance over a 57-day period. Two cohorts are evaluated: one with live trading on Kalshi and Polymarket, and another with paper trading. The models' returns, win rates, and computational efficiency metrics are analyzed to assess their predictive capabilities.
Results
Cohort 1 models experienced returns of -16.0% to -30.8% on Kalshi, while on Polymarket, they averaged -1.1%. The grok-4-20-checkpoint model achieved the highest win rate at 71.4% on Polymarket. Cohort 2's gemini-3.1-pro-preview model, which did not trade on Kalshi, achieved a +6.02% return on Polymarket in just three days.
Implications
The findings suggest that real-world trading environments significantly impact AI model performance, emphasizing the need for benchmarks that reflect genuine predictive capabilities. This research could influence the development of AI models for financial decision-making and forecasting across various domains.
Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
Theory
Efficient ML
NLP
- HKT introduces a multi-scale attention mechanism that processes sequences at different resolution levels.
- The model captures both local and long-range dependencies while maintaining lower computational costs compared to traditional attention.
- Theoretical contributions include a positive semidefinite kernel definition and a unique decomposition of the asymmetric score matrix.
- Empirical results show consistent performance improvements across multiple tasks, including synthetic ListOps and CIFAR-10.
Read more
Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
Summary
The paper introduces the Hierarchical Kernel Transformer (HKT), a novel multi-scale attention mechanism designed to address the limitations of standard self-attention in Transformer models. Traditional self-attention treats all token pairs equally, leading to inefficiencies in tasks requiring both short and long-range reasoning. HKT processes input sequences at multiple resolution levels, utilizing trainable causal downsampling to create compressed representations. Attention scores are computed independently at each level and combined through learned weights, allowing the model to capture both local patterns and long-range structures while maintaining a computational cost that is at most 4/3 times that of standard attention. The paper presents four theoretical contributions: (1) the hierarchical scoring function defines a positive semidefinite kernel; (2) the asymmetric score matrix can be decomposed into symmetric and antisymmetric components; (3) the approximation error can be interpreted through three components with a geometric decay bound; and (4) HKT subsumes standard attention and causal convolution in specific settings. Empirical results demonstrate that HKT consistently outperforms standard attention baselines across various tasks, achieving significant accuracy improvements with manageable computational overhead.
Methodology
The Hierarchical Kernel Transformer employs a multi-scale attention mechanism that processes input sequences at multiple resolution levels through trainable causal downsampling. Attention scores are computed independently at each level and combined using learned weights, allowing the model to effectively capture both local and long-range dependencies. Theoretical analysis includes the establishment of a positive semidefinite kernel and a decomposition of the asymmetric score matrix, while empirical evaluations are conducted on various tasks to assess performance improvements over standard attention mechanisms.
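The at-most-4/3 cost bound is a geometric series: if level l attends over n/2^l tokens, its score matrix costs (n/2^l)^2, and summing over levels gives n^2 · (1 + 1/4 + 1/16 + …) < (4/3)n^2. A quick check, assuming dyadic downsampling (the paper's exact downsampling factors may differ):

```python
def multiscale_cost(n, levels):
    """Attention cost summed over resolution levels: level l sees n / 2**l
    tokens, so its score matrix costs (n / 2**l)**2 operations, versus n**2
    for single-scale attention over the full sequence."""
    return sum((n // 2 ** l) ** 2 for l in range(levels))

n = 1024
ratio = multiscale_cost(n, 5) / n ** 2
# the partial geometric series 1 + 1/4 + 1/16 + ... stays below 4/3
```

The reported 1.31x empirical overhead sits just under this 4/3 bound, consistent with the per-level weights and bookkeeping adding little on top of the score computations.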
Results
HKT achieved notable improvements over standard attention baselines, with +4.77 percentage points on synthetic ListOps, +1.44 percentage points on sequential CIFAR-10, and +7.47 percentage points on IMDB character-level sentiment classification. These results were obtained with a computational overhead of only 1.31 times that of standard attention, demonstrating the efficiency and effectiveness of the proposed model.
Implications
The Hierarchical Kernel Transformer has the potential to enhance performance in various sequence modeling tasks, particularly those requiring both short and long-range reasoning. Its efficient multi-scale approach could be applied in areas such as natural language processing, computer vision, and other domains where attention mechanisms are critical. The theoretical insights provided may also inform future research on hierarchical models and attention mechanisms.
AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs
Large Language Models
Optimization
Time Series
- ALPHALAB automates the full experimental cycle in quantitative domains using frontier LLMs.
- The system operates through three phases: exploration, evaluation framework construction, and large-scale experimentation.
- ALPHALAB demonstrates significant performance improvements in CUDA kernel optimization, LLM pretraining, and traffic forecasting.
- The use of multiple LLMs allows for diverse solution discovery, enhancing the research process.
Read more
AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLMs
Summary
The paper introduces ALPHALAB, an autonomous research framework that utilizes advanced large language models (LLMs) to automate the experimental process in computationally intensive domains. ALPHALAB operates without human intervention through three main phases: (1) domain adaptation and data exploration, where it generates analysis code and a research report; (2) construction of an evaluation framework using an adversarial loop; and (3) execution of large-scale experiments via a Strategist/Worker model, which accumulates knowledge in a persistent playbook. The system is evaluated using two frontier LLMs (GPT-5.2 and Claude Opus 4.6) across three diverse domains: CUDA kernel optimization, LLM pretraining, and traffic forecasting. In CUDA optimization, ALPHALAB produces GPU kernels that outperform PyTorch’s compiler by an average of 4.4 times, with a maximum speedup of 91 times. For LLM pretraining, it achieves a 22% reduction in validation loss compared to a baseline. In traffic forecasting, it surpasses standard benchmarks by 23-25%. The results indicate that using multiple models provides complementary solutions across different tasks, highlighting the potential for enhanced research productivity through autonomous systems.
Methodology
ALPHALAB employs a multi-agent system that leverages frontier LLMs to autonomously adapt to different domains, explore datasets, construct evaluation frameworks, and execute experiments. It utilizes a Strategist/Worker loop for large-scale experimentation and maintains a persistent playbook for knowledge accumulation.
Results
In CUDA kernel optimization, ALPHALAB achieves an average speedup of 4.4 times over PyTorch, with a peak of 91 times. For LLM pretraining, it reduces validation loss by 22% compared to a single-shot baseline. In traffic forecasting, it outperforms standard baselines by 23-25%. The results demonstrate the effectiveness of the autonomous system across various optimization tasks.
Implications
The development of ALPHALAB suggests significant advancements in automating scientific research, potentially increasing research throughput and efficiency in various fields such as medicine, energy, and materials science. It highlights the role of AI in enhancing human-AI collaboration and optimizing experimental processes.
EngageTriBoost: Predictive Modeling of User Engagement in Digital Mental Health Intervention Using Explainable Machine Learning
Interpretability
- EngageTriBoost (ETB) is an explainable ensemble machine learning framework for predicting user engagement in DMHI.
- ETB achieved up to 84% accuracy in predicting user message posting, outperforming individual models.
- The study utilized Shapley Additive Explanations (SHAP) for interpretability, revealing key behavioral and demographic factors affecting engagement.
- The framework emphasizes the need for understanding multi-level user engagement rather than treating it as a static binary outcome.
Read more
EngageTriBoost: Predictive Modeling of User Engagement in Digital Mental Health Intervention Using Explainable Machine Learning
Summary
This study addresses the growing mental health challenges among young adults by developing EngageTriBoost (ETB), an explainable ensemble machine learning framework aimed at predicting user engagement in digital mental health interventions (DMHI). The framework was applied to data from 1,673 at-risk college students participating in the eBridge platform, which utilizes motivational interviewing for online counseling. ETB integrates multiple machine learning models (XGBoost, LightGBM, and CatBoost) with logistic regression as a meta-learner, focusing on interpretability rather than solely predictive performance. The study emphasizes the importance of understanding user engagement patterns, which are critical for improving DMHI efficacy. ETB achieved up to 84% accuracy in predicting user message posting, demonstrating superior performance in recall and calibration compared to individual models. The framework also provided insights into behavioral and demographic factors influencing engagement, such as chronic pain and stigma. Overall, the findings highlight the potential of explainable machine learning to enhance DMHI design and inform adaptive intervention strategies.
Methodology
The study employed a stacked ensemble approach combining XGBoost, LightGBM, and CatBoost with logistic regression as a meta-learner. The model was trained and evaluated on data from 1,673 college students, utilizing 108 baseline features related to user engagement. Cross-validation was used for hyperparameter tuning, and SHAP was applied for interpretability.
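The leakage-free heart of a stacked ensemble is generating out-of-fold predictions for the meta-learner. The sketch below shows that mechanic in pure Python, with a trivial threshold rule standing in for XGBoost/LightGBM/CatBoost; the data and base learner are entirely illustrative.

```python
import random

def out_of_fold_preds(X, y, fit, predict, k=5):
    """Stacking stage 1: each sample's meta-feature comes from a base model
    that never saw that sample, so the meta-learner trains without leakage."""
    n = len(X)
    preds = [0.0] * n
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            preds[i] = predict(model, X[i])
    return preds

# toy base learner: split at the feature median, predict the class mean on
# each side (a stand-in for a gradient-boosted tree ensemble)
def fit(Xs, ys):
    med = sorted(x[0] for x in Xs)[len(Xs) // 2]
    hi = [t for x, t in zip(Xs, ys) if x[0] >= med]
    lo = [t for x, t in zip(Xs, ys) if x[0] < med]
    return med, (sum(hi) / max(len(hi), 1), sum(lo) / max(len(lo), 1))

def predict(model, x):
    med, (hi_mean, lo_mean) = model
    return hi_mean if x[0] >= med else lo_mean

random.seed(0)
X = [[random.random()] for _ in range(100)]
y = [1 if x[0] > 0.5 else 0 for x in X]
meta_feature = out_of_fold_preds(X, y, fit, predict)
```

In the full pipeline, one such column per base model would feed the logistic-regression meta-learner.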
Results
ETB achieved an accuracy of up to 84% in predicting message posting, with improved recall and calibration compared to individual models. The framework demonstrated stable discrimination for message posting but more conservative performance for predicting initial logins, highlighting the challenges in predicting user uptake. SHAP analysis revealed significant behavioral and demographic associations with engagement.
Implications
The findings suggest that using explainable machine learning can significantly enhance the design and effectiveness of digital mental health interventions by providing insights into user engagement patterns. This can inform adaptive strategies to improve user adherence and overall intervention efficacy.
Robust Reasoning Benchmark
NLP
Large Language Models
Theory
- Introduction of a perturbation pipeline with 14 deterministic transformations for evaluating LLM robustness.
- Demonstration of significant accuracy degradation in open-weight models under perturbations.
- Identification of Intra-Query Attention Dilution, where prior reasoning steps negatively impact subsequent tasks.
- Call for future LLM architectures to integrate mechanisms for contextual resets to improve reasoning reliability.
Read more
Robust Reasoning Benchmark
Summary
The paper introduces the Robust Reasoning Benchmark (RRB), aimed at evaluating the robustness of Large Language Models (LLMs) in mathematical reasoning tasks. Despite high performance on standard benchmarks, LLMs exhibit brittle reasoning processes that are overly reliant on specific textual formats. The authors propose a perturbation pipeline consisting of 14 deterministic structural transformations that do not alter the meaning or difficulty of problems, allowing for a more accurate assessment of reasoning capabilities. The RRB is applied to the AIME 2024 dataset, where eight state-of-the-art models are evaluated. Results reveal that while some frontier models show resilience, open-weight models experience significant accuracy drops, indicating structural fragility. The study further isolates mechanical parsing failures from reasoning failures by requiring models to solve multiple problems sequentially within a single context window. Findings suggest that prior reasoning steps can degrade subsequent reasoning accuracy, highlighting a need for future architectures to incorporate explicit contextual resets within their reasoning processes. This work raises important questions about the optimal granularity of reasoning tasks and the architectural design of LLMs to enhance their reasoning reliability.
Methodology
The authors developed a perturbation pipeline that applies 14 deterministic structural transformations to mathematical problems without changing their meaning. They evaluated the performance of various LLMs on the AIME 2024 dataset, measuring accuracy on the last problem in a sequence of tasks to assess the impact of prior reasoning on subsequent performance.
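The paper's 14 transformations are not reproduced here, but two hypothetical examples show what "deterministic, meaning-preserving structural perturbation" can look like: the mathematical content is untouched while the surface layout changes.

```python
def split_sentences(problem: str) -> str:
    """Structural perturbation: one sentence per line (content unchanged)."""
    return ".\n".join(s.strip() for s in problem.rstrip(".").split(".")) + "."

def number_the_givens(problem: str) -> str:
    """Structural perturbation: present each sentence as a numbered given."""
    parts = [s.strip() for s in problem.rstrip(".").split(".")]
    return "\n".join(f"({i + 1}) {s}." for i, s in enumerate(parts))

problem = "Let x + y = 10. Let x - y = 2. Find x."
variants = [split_sentences(problem), number_the_givens(problem)]
```

Determinism matters for the benchmark: the same input always yields the same perturbed problem, so accuracy drops can be attributed to the transformation rather than to sampling noise.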
Results
The evaluation revealed that open-weight models, regardless of their size (7B to 120B parameters), suffered significant accuracy drops (up to 55% on average, and up to 100% on some perturbations). In contrast, some closed-source models demonstrated resilience. The study also found that accuracy on the last problem degraded as the context window became polluted by previous reasoning steps, indicating a fundamental limitation of current LLM architectures.
Implications
The findings suggest that current LLM architectures may not be adequately designed for robust reasoning tasks, highlighting the need for innovations that incorporate explicit working memory isolation and contextual resets. This could lead to more reliable reasoning capabilities in future models, with potential applications in complex problem-solving scenarios.
Uncertainty-Aware Transformers: Conformal Prediction for Language Models
NLP
Large Language Models
Interpretability
- Introduction of CONFIDE, a conformal prediction framework for transformer models.
- Achieves up to 4.09% improvement in test accuracy and higher correct efficiency over existing methods.
- Demonstrates better calibration and semantic representation in early and intermediate transformer layers.
- Provides instance-level explanations for predictions, enhancing interpretability.
Read more
Uncertainty-Aware Transformers: Conformal Prediction for Language Models
Summary
This paper introduces CONFIDE, a novel framework for uncertainty quantification in transformer-based language models, specifically designed to enhance interpretability and reliability in predictions. The authors highlight the limitations of traditional neural networks, particularly their black-box nature, which hinders trust in high-stakes applications. CONFIDE applies conformal prediction to the internal embeddings of encoder-only architectures like BERT and RoBERTa, allowing for hyper-parameter tuning and the construction of class-conditional nonconformity scores. This framework provides statistically valid prediction sets with instance-level explanations, improving test accuracy by up to 4.09% on BERT-tiny and achieving greater correct efficiency compared to existing methods such as NM2 and VanillaNN. The study emphasizes the importance of early and intermediate transformer layers in yielding better-calibrated representations for conformal prediction. The authors position CONFIDE as a robust and interpretable solution for resource-constrained models and high-stakes tasks, where traditional softmax-based uncertainty methods may fail.
Methodology
The methodology involves applying conformal prediction techniques to the internal embeddings of transformer architectures, specifically using either [CLS] token embeddings or flattened hidden states to derive class-conditional nonconformity scores. This allows for the construction of prediction sets that contain the true output with a predefined probability, enhancing interpretability and reliability.
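The generic split-conformal mechanics behind such prediction sets can be shown in a short pure-Python sketch. CONFIDE's actual nonconformity scores are class-conditional and embedding-based; the `1 - probability` score and the numbers below are illustrative only.

```python
import math

def conformal_threshold(cal_scores, alpha=0.3):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest
    calibration score, which guarantees >= 1 - alpha marginal coverage."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(class_scores, threshold):
    """Keep every class whose nonconformity score is within the threshold."""
    return {c for c, s in class_scores.items() if s <= threshold}

# toy calibration set: nonconformity = 1 - probability assigned to true class
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70, 0.80, 0.90]
q = conformal_threshold(cal_scores, alpha=0.3)
test_scores = {"pos": 0.12, "neg": 0.75}
S = prediction_set(test_scores, q)
```

The coverage guarantee is distribution-free: it depends only on exchangeability of calibration and test points, not on how well the underlying embeddings are calibrated.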
Results
CONFIDE demonstrated an absolute accuracy improvement of up to 4.09% on BERT-tiny and achieved a 5.40% increase in correct efficiency compared to softmax-based confidence baselines across GLUE and SuperGLUE benchmarks. It outperformed prior methods like NM2 and VanillaNN, particularly in scenarios where standard predictors exhibited undercoverage or skewed confidence.
Implications
The findings suggest that CONFIDE can be effectively utilized in high-stakes applications requiring reliable uncertainty quantification and interpretability, such as healthcare and finance. Its robustness in resource-constrained environments also opens avenues for deploying advanced language models in practical settings.
Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications
Optimization
Theory
Efficient ML
- Establishes lower bounds for regret in D-OCO with compressed communication.
- Introduces the D-FTFCL algorithm that optimally manages compression and projection errors.
- Achieves optimal regret bounds for both convex and strongly convex loss functions.
- Extends methods to offline stochastic optimization with domain constraints.
Read more
Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications
Summary
This paper addresses the challenges of Distributed Online Convex Optimization (D-OCO) in scenarios with significant communication costs due to large-scale applications. The authors investigate the impact of compressed communication on D-OCO, establishing lower bounds for regret in both convex and strongly convex loss functions. They propose an optimal algorithm, Distributed Follow-the-Fast-Compressed-Leader (D-FTFCL), which incorporates an error feedback mechanism within the Follow-the-Regularized-Leader framework to manage the coupling of compression and projection errors. This method achieves optimal regret bounds of O(δ^(-1/2) √T) and O(δ^(-1) log T) for convex and strongly convex loss functions, respectively. The paper also extends the applicability of their methods to offline stochastic settings, providing convergence rates and guarantees for distributed non-smooth optimization with compressed communication and domain constraints.
Methodology
The authors develop the Distributed Follow-the-Compressed-Leader (D-FTCL) and Distributed Follow-the-Fast-Compressed-Leader (D-FTFCL) algorithms. D-FTCL uses error feedback on both learner and server sides to facilitate bidirectional compressed communication, while D-FTFCL employs an online compression strategy to mitigate accumulated compression errors. The methods are analyzed rigorously to ensure convergence and optimal regret bounds.
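The error-feedback idea can be sketched with an illustrative top-1 compressor (the paper's D-FTFCL embeds the mechanism in a Follow-the-Regularized-Leader update, which is not reproduced here): whatever the compressor drops is carried in a memory vector and re-enters later rounds, so no gradient mass is permanently lost.

```python
def top1(v):
    """Biased compressor: transmit only the largest-magnitude coordinate."""
    i = max(range(len(v)), key=lambda j: abs(v[j]))
    out = [0.0] * len(v)
    out[i] = v[i]
    return out

def error_feedback_step(grad, memory, lr=0.1):
    """Compress grad plus carried-over error; stash what was dropped."""
    corrected = [g + m for g, m in zip(grad, memory)]
    sent = top1(corrected)
    new_memory = [c - s for c, s in zip(corrected, sent)]
    update = [-lr * s for s in sent]
    return update, new_memory

# the same gradient three times: the residual on suppressed coordinates
# accumulates until it wins the top-1 selection
memory = [0.0, 0.0, 0.0]
grads = [[0.3, 0.2, 0.1], [0.3, 0.2, 0.1], [0.3, 0.2, 0.1]]
sent_coords = []
for g in grads:
    update, memory = error_feedback_step(g, memory)
    sent_coords.append([u != 0.0 for u in update])
```

Note how the second round transmits coordinate 1 even though coordinate 0 has the largest fresh gradient: the accumulated residual forces suppressed directions through eventually, which is what makes biased compressors safe to use.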
Results
The proposed algorithms achieve regret bounds of O(δ^(-1/2) √T) and O(δ^(-1) log T) for convex and strongly convex loss functions, respectively. The established lower bounds for regret are Ω(δ^(-1/2) √T) and Ω(δ^(-1) log T), indicating that the proposed methods are optimal in terms of communication efficiency and regret performance.
Implications
The findings have significant implications for large-scale distributed systems, such as mobile applications, self-driving vehicles, and recommendation systems, where efficient communication and decision-making are critical. The methods can be applied to various domains requiring real-time data processing with limited communication bandwidth.
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
NLP
Large Language Models
Efficient ML
- Introduces a feedforward graph architecture using frozen LLMs as nodes communicating through a shared latent space.
- Achieves strong benchmark performance, outperforming the best single model and parameter-matched classifiers.
- Demonstrates tractable gradient flow through multiple frozen model boundaries.
- Emergent selective routing behavior is observed in the output node without explicit supervision.
Read more
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
Summary
This paper introduces a novel feedforward graph architecture that utilizes heterogeneous frozen large language models (LLMs) as computational nodes. These models communicate through a shared continuous latent space via learned linear projections, building on previous findings that demonstrate geometric compatibility between independently trained LLM latent spaces. The proposed architecture allows for end-to-end trainable multi-node graphs, where projection matrices are optimized jointly through backpropagation. The authors employ three smaller frozen models to encode inputs into a shared latent space, which is then injected into two larger frozen models, culminating in a lightweight cross-attention output node. The architecture is highly efficient, with only 17.6 million trainable parameters compared to approximately 12 billion frozen parameters. The results show significant performance improvements on various benchmarks, outperforming the best single constituent model and parameter-matched classifiers, while also demonstrating effective gradient flow and emergent selective routing behavior without explicit supervision.
Methodology
The authors leverage geometric compatibility between LLM latent spaces to create a differentiable communication medium for a feedforward graph of frozen models. They optimize projection matrices and a cross-attention output node using backpropagation, allowing for effective information aggregation across models.
Results
The proposed architecture achieves 87.3% accuracy on ARC-Challenge, 82.8% on OpenBookQA, and 67.2% on MMLU, surpassing the best single constituent model by 11.4, 6.2, and 1.2 percentage points respectively, and outperforming parameter-matched classifiers by 9.1, 5.2, and 6.7 points.
Implications
This work suggests a new paradigm for utilizing existing frozen LLMs, enabling more efficient model compositions that leverage the strengths of various architectures without the need for extensive retraining. It opens avenues for improved performance on narrow tasks and could influence future research in model aggregation and ensemble methods.
Introducing Echo Networks for Computational Neuroevolution
Audio & Speech
Efficient ML
Time Series
- Introduction of Echo Networks as a new type of recurrent neural network for neuroevolution.
- Echo Networks utilize a connection matrix for representing topology and weights, allowing for flexible architecture.
- Demonstrated effectiveness in classifying electrocardiography signals with minimal network size.
- Enhanced systematicity in mutation and recombination processes compared to traditional methods.
Read more
Introducing Echo Networks for Computational Neuroevolution
Summary
This paper presents Echo Networks, a novel type of recurrent neural network designed for computational neuroevolution, particularly suited for applications with stringent resource constraints, such as edge devices. Traditional neural networks, including feed-forward networks, RNNs, and CNNs, often struggle with systematic mutation and recombination when using direct genetic encoding, as seen in algorithms like NEAT. Echo Networks address this issue by representing the network solely through a connection matrix, where the source neurons are rows, destination neurons are columns, and weights are the matrix entries. This structure allows for flexible input and output assignments and the use of additional functions for tasks like binary classification. The authors evaluated Echo Networks on electrocardiography signal classification, achieving a notable accuracy of 0.684 with a minimal network of 21 neurons and 250 weights. The key advantage of Echo Networks lies in their genome representation as a single matrix, facilitating efficient matrix computations and factorization as mutation and recombination operators, which enhances the systematicity of the evolution process.
Methodology
The authors developed Echo Networks, which consist of a connection matrix representing the network's topology and weights. They employed evolutionary algorithms to evolve these networks, focusing on matrix computations for mutation and recombination, allowing for systematic variations in network architecture and weights.
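A minimal sketch of the connection-matrix update, assuming a tanh activation and fixed input/output neuron assignments (both illustrative; the paper's exact activation and assignment scheme may differ):

```python
import math

def step(state, W, inputs, input_idx):
    """One Echo Network update: row i of W holds the outgoing weights of
    source neuron i, column j collects the incoming weights of destination
    neuron j; external inputs overwrite designated neurons first."""
    n = len(W)
    s = list(state)
    for i, x in zip(input_idx, inputs):
        s[i] = x
    return [math.tanh(sum(s[i] * W[i][j] for i in range(n))) for j in range(n)]

def mutate(W, i, j, delta):
    """Mutation is just an entry-wise edit of the connection matrix."""
    W2 = [row[:] for row in W]
    W2[i][j] += delta
    return W2

# 3-neuron network: neuron 0 takes the input, neuron 2 is read as output
W = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 0.9],
     [0.0, 0.0, 0.0]]
state = [0.0, 0.0, 0.0]
for x in [1.0, 1.0, 1.0]:
    state = step(state, W, [x], [0])
output = state[2]
```

Because topology and weights live in one matrix, recombination and mutation reduce to matrix operations, which is the systematicity advantage the authors claim over graph-based encodings like NEAT's.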
Results
Echo Networks were successfully evaluated on the classification of electrocardiography signals, achieving an accuracy of 0.684 with a network comprising 21 neurons and 250 weights. This demonstrates the capability of minimal networks to perform effectively in specific tasks.
Implications
The introduction of Echo Networks could significantly impact the design of neural networks for edge applications, enabling efficient event detection and classification with minimal computational resources. Their systematic approach to neuroevolution may lead to more robust and adaptable network architectures.
MolPaQ: Modular Quantum-Classical Patch Learning for Interpretable Molecular Generation
Generative Models
Graph Learning
Interpretability
- MOLPAQ integrates quantum circuits as modular patch generators within a classical conditioning framework.
- The architecture enforces chemical realism through a constraint-aware molecular assembly process.
- MOLPAQ achieves high validity, novelty, and diversity in generated molecules.
- The framework allows for improved property control and interpretability in molecular generation.
Read more
MolPaQ: Modular Quantum-Classical Patch Learning for Interpretable Molecular Generation
Summary
MOLPAQ introduces a novel modular quantum-classical framework for molecular generation that addresses the trade-offs between validity, diversity, and property control in existing generative models. The architecture consists of three main components: a β-VAE pretrained on QM9 that learns a chemically aligned latent manifold, a reduced conditioner that maps molecular descriptors into this latent space, and a quantum patch generator that produces entangled node embeddings. These embeddings are then reconstructed into valid molecular graphs by a valence-aware aggregator. The framework allows for modular control and systematic analysis of molecular generation, enhancing interpretability and enabling targeted property control. The results demonstrate that MOLPAQ achieves 100% validity according to RDKit, 99.75% novelty, and a diversity score of 0.905, while also improving mean QED by approximately 2.3% and increasing the incidence of aromatic motifs by 10-12% compared to a classical generator.
Methodology
The MOLPAQ framework employs a modular approach where molecular generation is achieved through a combination of quantum and classical components. A β-VAE is used to learn a latent manifold, while a quantum circuit generates entangled patch embeddings. A classical aggregator then constructs valid molecular graphs based on these embeddings, ensuring adherence to chemical constraints.
Results
MOLPAQ achieved 100% validity, 99.75% novelty, and a diversity score of 0.905. Additionally, it improved the mean QED by approximately 2.3% and increased the incidence of aromatic motifs by 10-12% compared to a parameter-matched classical generator.
Implications
The modular quantum-classical architecture of MOLPAQ has significant implications for drug discovery and molecular design, allowing for systematic exploration of chemical space while ensuring the generation of valid and diverse compounds. Its interpretability and property control capabilities could enhance the efficiency of identifying novel drug candidates.
HiFloat4 Format for Language Model Pre-training on Ascend NPUs
Large Language Models
Efficient ML
- First study on low-precision LLM pre-training on energy-efficient NPU accelerators.
- HiFloat4 format achieves lower relative loss (≈1.0%) compared to MXFP4 (≈1.5%).
- HiF4 training requires fewer stabilization techniques than MXFP4.
- Stable training of both dense and Mixture-of-Experts LLM architectures is demonstrated.
Read more
HiFloat4 Format for Language Model Pre-training on Ascend NPUs
Summary
This paper investigates the HiFloat4 (HiF4) format for low-precision training of large language models (LLMs) on Huawei Ascend NPUs. The authors highlight the growing importance of reducing computational and memory costs in training LLMs, which have become central to modern machine learning. The study systematically compares HiF4 with existing 4-bit formats like MXFP4, demonstrating that HiF4 achieves lower relative loss and requires fewer stabilization techniques. The authors implement a training pipeline that performs the majority of computations in FP4 while maintaining stable optimization. They evaluate their approach on various LLM architectures, including dense models and mixture-of-experts (MoE) models, showing that approximately 90% of training can be conducted in FP4 with minimal loss compared to full-precision baselines. The findings suggest that HiF4 is a promising solution for efficient LLM pre-training on energy-constrained hardware.
Methodology
The authors developed a training pipeline for LLMs using the HiFloat4 format, which employs a hierarchical scaling scheme to enhance dynamic range and precision. They utilized stochastic rounding and Random Hadamard Transform techniques to mitigate quantization bias and gradient noise. The evaluation included various LLM architectures on Ascend NPU clusters, focusing on both dense and MoE models.
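Why stochastic rounding mitigates quantization bias can be seen on a toy low-precision grid (not the actual HiF4 code points): each value rounds up or down with probability proportional to proximity, so the rounded value is unbiased in expectation, unlike round-to-nearest.

```python
import random

def stochastic_round(x, grid):
    """Round x to one of its two neighboring grid values, choosing the upper
    neighbor with probability proportional to proximity (unbiased on average)."""
    lo = max(g for g in grid if g <= x)
    hi = min(g for g in grid if g >= x)
    if lo == hi:
        return lo
    p_hi = (x - lo) / (hi - lo)
    return hi if random.random() < p_hi else lo

# a toy 4-bit-style grid: sign * mantissa * 2**exponent (illustrative only)
grid = sorted({s * m * 2.0 ** e for s in (-1, 1)
               for m in (1.0, 1.5) for e in range(-2, 3)})
random.seed(0)
x = 0.8                       # sits between grid points 0.75 and 1.0
samples = [stochastic_round(x, grid) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Nearest rounding would map every occurrence of 0.8 to 0.75, a systematic bias that accumulates over millions of gradient updates; the stochastic scheme averages out to the true value.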
Results
The results indicate that HiFloat4 allows for approximately 90% of training computations to be executed in FP4 with a loss gap of about 1.0% compared to full-precision baselines. This demonstrates the feasibility of large-scale low-precision training while maintaining model performance.
Implications
The findings suggest that HiFloat4 can significantly reduce the computational and memory costs associated with training large language models, making it a viable option for energy-efficient AI applications. This could enable broader access to advanced LLMs and facilitate their deployment in resource-constrained environments.
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems
Computer Vision
Generative Models
Robotics
- Introduction of a decoupled offline–online fault injection framework for autonomous systems.
- Utilization of LLMs for generating fault scenarios and LDMs for synthesizing visual degradations.
- Real-time fault condition assessment on edge devices without heavy computational load.
- Significant robustness degradation observed under various fault scenarios, emphasizing the need for comprehensive testing.
Summary
This paper addresses the challenges of validating autonomous vision systems deployed on edge devices, particularly in the context of lane following. Traditional validation methods, which rely on static datasets or manual fault injection, are inadequate for capturing the diverse environmental hazards encountered in real-world scenarios. To overcome these limitations, the authors propose a decoupled offline–online fault injection framework. This framework separates the validation process into an offline phase, where Large Language Models (LLMs) generate structured fault scenarios and Latent Diffusion Models (LDMs) synthesize high-fidelity sensor degradations, and an online phase, where edge devices perform lightweight fault-aware inference using a pre-computed lookup table. The framework was validated on a ResNet18 lane-following model across 460 fault scenarios, revealing significant robustness degradation under generated faults, with RMSE increasing by up to 99% and localization accuracy dropping to as low as 31.0% under fog conditions. These results highlight the inadequacy of normal-data evaluations for real-world edge AI deployment and demonstrate the effectiveness of the proposed framework in ensuring safety in autonomous systems.
Methodology
The proposed framework consists of two phases: an offline phase where LLMs generate fault scenarios and LDMs create corresponding faulty images, and an online phase where an edge device queries a precomputed lookup table to assess fault conditions in real-time. This approach allows for comprehensive safety validation without the need for resource-intensive AI models to run on the edge device.
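The offline–online split can be sketched as a precomputed table plus a constant-time query. All scenario names, scores, and the threshold below are invented for illustration; the paper's actual tables are built from LDM-generated fault imagery:

```python
def build_lookup_table(scenarios, evaluate):
    """Offline phase: run the expensive evaluation once per fault scenario."""
    return {scenario: evaluate(scenario) for scenario in scenarios}

def online_assess(table, scenario, threshold=0.5):
    """Online phase: constant-time lookup; unknown scenarios count as unsafe."""
    return "safe" if table.get(scenario, 0.0) >= threshold else "degraded"

# Hypothetical offline scores (e.g. model R^2 under each fault condition).
table = build_lookup_table(
    ["clean", "fog", "rain"],
    evaluate=lambda s: {"clean": 0.85, "fog": 0.31, "rain": 0.60}[s],
)
print(online_assess(table, "fog"))    # degraded
print(online_assess(table, "clean"))  # safe
```

The point of the design is that all heavy computation (LLM scenario generation, LDM synthesis, model evaluation) happens offline, so the edge device pays only a dictionary lookup at inference time.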
Results
The validation of the framework on a ResNet18 lane-following model showed a baseline R² of approximately 0.85 on clean data. However, under generated fault conditions, the model's RMSE increased by up to 99%, and localization accuracy dropped to as low as 31.0% in fog conditions, indicating significant robustness issues that traditional evaluations fail to capture.
Implications
The proposed framework has the potential to enhance the safety and reliability of autonomous vehicles and edge-deployed robotic systems by enabling proactive evaluation of system robustness under diverse fault conditions. This could lead to more effective deployment of AI-driven systems in real-world environments, where safety is paramount.
Offline Local Search for Online Stochastic Bandits
Optimization
Theory
- Introduces a framework for converting offline local search algorithms into online stochastic bandit algorithms.
- Achieves O(log³ T) regret, improving upon existing methods that yield polynomial regret.
- Applies the framework to three combinatorial optimization problems, showcasing its versatility.
- Establishes the conditions under which local search algorithms can guarantee γ-regret.
Summary
This paper explores the conversion of offline local search algorithms into online stochastic bandit algorithms, focusing on combinatorial multi-armed bandits. The authors highlight the importance of minimizing regret, defined as the difference between the algorithm's cumulative cost and that of the optimal fixed action in hindsight. They propose a generic method for transforming offline local search algorithms that yield approximately optimal solutions into online algorithms with O(log³ T) regret, which is a significant improvement over existing frameworks that yield polynomial regret. The paper demonstrates the versatility of this approach by applying it to three distinct online stochastic combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid, and uncertain clustering. The authors establish that local search algorithms with (β, γ)-improving neighborhoods can achieve γ-regret with logarithmic dependence on T, thus providing a robust framework for online decision-making in complex environments.
Methodology
The authors develop a generic method for transforming offline local search algorithms into online stochastic bandit algorithms. They focus on local search methods that terminate in approximately optimal solutions and require the neighborhood of a solution to allow (β, γ)-improving moves. The algorithm iteratively selects actions based on minimizing costs while ensuring that the regret scales logarithmically with the number of time steps T.
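A minimal sketch of the idea under simplifying assumptions: replace exact offline costs with averaged bandit samples, and accept a move only when it improves by a clear margin. The toy objective, sample count, and margin are illustrative and stand in for the paper's (β, γ)-improving neighborhood machinery:

```python
import random

def noisy_cost(x):
    """Bandit feedback: true cost (x - 3)^2 observed with Gaussian noise."""
    return (x - 3) ** 2 + random.gauss(0, 1)

def estimate(x, samples=200):
    """Average repeated pulls to stand in for the exact offline cost."""
    return sum(noisy_cost(x) for _ in range(samples)) / samples

def local_search(x=0, domain=range(11), margin=0.1):
    """Move to a neighbor only when its estimated cost improves clearly."""
    while True:
        best, best_cost = x, estimate(x)
        for nb in (x - 1, x + 1):
            if nb in domain:
                c = estimate(nb)
                if c < best_cost - margin:
                    best, best_cost = nb, c
        if best == x:
            return x
        x = best

random.seed(1)
print(local_search())  # 3, the true minimizer
```

The margin plays the role of a confidence bound: with enough samples per estimate, the chance of a false improving move becomes negligible, which is the mechanism behind the logarithmic regret dependence on T.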
Results
The main result is that local search algorithms with (β, γ)-improving neighborhoods can achieve γ-regret with O(log³ T) dependence on T. This result is significant as it provides a more efficient regret bound compared to existing offline-to-online frameworks, which typically yield polynomial regret.
Implications
The findings suggest that local search methods can be effectively utilized in online learning scenarios, potentially leading to better decision-making strategies in various applications such as scheduling, resource allocation, and clustering. This work opens avenues for further research into the integration of offline algorithmic techniques into online learning frameworks.
GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback
Large Language Models
Graph Learning
- Introduces GNN-as-Judge framework to improve LLM performance on TAGs in low-resource settings.
- Addresses challenges of generating reliable pseudo labels and mitigating label noise during fine-tuning.
- Utilizes GNNs to provide structural insights for selecting influential unlabeled nodes.
- Demonstrates significant performance improvements over existing methods in few-shot semi-supervised learning.
Summary
This paper addresses the challenges of leveraging Large Language Models (LLMs) for few-shot semi-supervised learning on text-attributed graphs (TAGs), particularly in low-resource settings where labeled data is scarce. The authors identify two main issues: the difficulty in generating reliable pseudo labels for LLMs and the need to mitigate label noise during fine-tuning. To tackle these challenges, they propose a novel framework called GNN-as-Judge, which integrates the structural inductive bias of Graph Neural Networks (GNNs) to enhance the pseudo-labeling process. The framework employs a collaborative pseudo-labeling strategy that identifies influential unlabeled nodes and utilizes both agreement and disagreement patterns between LLMs and GNNs to generate reliable labels. Additionally, a weakly-supervised fine-tuning algorithm is developed to distill knowledge from informative pseudo labels while addressing potential label noise. Experimental results on multiple TAG datasets demonstrate that GNN-as-Judge significantly outperforms existing methods, especially in scenarios with limited labeled data, showcasing its effectiveness in enhancing LLM performance in graph learning tasks.
Methodology
The GNN-as-Judge framework employs a collaborative pseudo-labeling strategy that identifies influential unlabeled nodes and leverages both agreement and disagreement between LLMs and GNNs to generate reliable pseudo labels. A weakly-supervised fine-tuning algorithm is introduced to distill knowledge from these pseudo labels while mitigating label noise, allowing for effective learning from both easy and hard examples.
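A toy sketch of agreement/disagreement-based pseudo-labeling follows. The abstention rule and confidence threshold are hypothetical; the paper's influential-node selection and noise-aware fine-tuning are not reproduced here:

```python
def collaborative_pseudo_labels(llm_preds, gnn_preds, llm_conf, threshold=0.8):
    """Keep labels where LLM and GNN agree; on disagreement, trust a
    confident LLM, otherwise abstain (treating the node as likely noisy)."""
    labels = {}
    for node, llm_y in llm_preds.items():
        gnn_y = gnn_preds[node]
        if llm_y == gnn_y:
            labels[node] = llm_y          # agreement: reliable pseudo label
        elif llm_conf[node] >= threshold:
            labels[node] = llm_y          # confident disagreement: keep LLM
        # otherwise abstain: likely label noise
    return labels

llm = {"a": 1, "b": 0, "c": 1}
gnn = {"a": 1, "b": 1, "c": 0}
conf = {"a": 0.9, "b": 0.95, "c": 0.4}
print(collaborative_pseudo_labels(llm, gnn, conf))  # {'a': 1, 'b': 0}
```

Node "c" is dropped: the two models disagree and the LLM is unsure, so assigning either label would risk injecting noise into fine-tuning.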
Results
Experiments conducted on multiple TAG datasets indicate that GNN-as-Judge significantly outperforms existing methods, particularly in low-resource scenarios where labeled data is limited. The framework effectively enhances the performance of LLMs in few-shot semi-supervised learning tasks.
Implications
The findings suggest that integrating GNNs with LLMs can substantially improve graph learning tasks, particularly in environments with limited labeled data. This approach could be applied to various applications involving TAGs, such as citation networks, social media analysis, and e-commerce systems, enhancing the ability to classify and analyze complex graph structures.
Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing
Audio & Speech
Theory
Optimization
- Introduces IVON as a Bayesian learning method for SNNs, addressing weight uncertainty.
- Demonstrates that Bayesian learning can smooth the irregular predictive landscape of SNNs.
- Shows improved performance on speech recognition tasks using IVON compared to deterministic methods.
- Provides evidence through predictive, calibration, and local loss geometry analyses.
Summary
This paper investigates the application of Bayesian learning methods to Spiking Neural Networks (SNNs) for speech processing tasks. SNNs are advantageous for handling temporal data, but their threshold-based spike generation leads to irregular predictive landscapes, complicating training. The authors propose the Improved Variational Online Newton (IVON) approach as a practical Bayesian learning method that maintains a Gaussian posterior over the network weights, allowing for uncertainty in weight learning. The hypothesis is that Bayesian learning can smooth the irregularities in the predictive landscape caused by deterministic SNNs. The authors evaluate their method on the Heidelberg Digits and Speech Commands datasets, demonstrating that IVON improves performance metrics such as negative log-likelihood and Brier score, while also yielding a smoother predictive landscape compared to deterministic training. The findings suggest that Bayesian methods can enhance the robustness and reliability of predictions in SNNs, particularly in the context of speech recognition.
Methodology
The authors utilize the Improved Variational Online Newton (IVON) approach, which maintains a Gaussian posterior approximation over the weights of the SNNs. This method updates both the posterior mean and uncertainty during training, allowing for a distribution of plausible weights rather than a single deterministic estimate. The performance is evaluated using predictive metrics and one-dimensional slices of the weight space on two speech recognition datasets.
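The predictive side of a Gaussian weight posterior can be sketched for a one-parameter linear model. This shows posterior predictive averaging only; IVON's natural-gradient updates of the posterior mean and variance are not reproduced:

```python
import random

def predict_bayesian(x, mu, sigma, samples=1000):
    """Average predictions over weights drawn from a Gaussian posterior,
    returning both the predictive mean and its variance (uncertainty)."""
    preds = [random.gauss(mu, sigma) * x for _ in range(samples)]
    mean = sum(preds) / samples
    var = sum((p - mean) ** 2 for p in preds) / samples
    return mean, var

random.seed(0)
mean, var = predict_bayesian(x=2.0, mu=1.0, sigma=0.1)
print(round(mean, 2), round(var, 2))  # ≈ 2.0 and ≈ 0.04 (var = x² σ²)
```

Averaging over sampled weights is also what smooths the predictive landscape: each prediction is an expectation over a neighborhood in weight space rather than a single, possibly fragile, point estimate.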
Results
The application of IVON resulted in improved performance metrics, including lower negative log-likelihood and Brier scores, indicating better predictive accuracy. Additionally, the predictive landscape was found to be smoother and more regular when using Bayesian learning compared to deterministic training methods.
Implications
The findings suggest that incorporating Bayesian methods into SNN training can enhance the robustness and reliability of speech recognition systems. This approach may also be beneficial in other domains where uncertainty in model parameters is critical, potentially leading to more resilient AI systems.
GAN-based Domain Adaptation for Image-aware Layout Generation in Advertising Poster Design
Generative Models
Computer Vision
- Introduction of the CGL-Dataset with over 60,000 paired posters and 121,000 product images.
- Development of two GAN-based models: CGL-GAN and PDA-GAN, to address domain adaptation challenges.
- PDA-GAN employs a pixel-level discriminator for improved layout generation based on image content.
- Three novel content-aware metrics are proposed for evaluating layout generation quality.
Summary
This paper presents a novel approach to generating image-aware graphic layouts for advertising posters using Generative Adversarial Networks (GANs). The authors introduce the Content-aware Graphic Layout Dataset (CGL-Dataset), which comprises 60,548 paired inpainted posters and 121,000 clean product images. The challenge addressed is the domain gap between inpainted posters and clean images due to inpainting artifacts. To bridge this gap, two GAN-based models are proposed: CGL-GAN, which applies Gaussian blur to inpainted regions, and PDA-GAN, which incorporates unsupervised domain adaptation with a pixel-level discriminator to enhance layout generation based on the visual texture of input images. The paper also introduces three novel content-aware metrics to evaluate the relationship between graphic elements and image content. Experimental results demonstrate that PDA-GAN outperforms CGL-GAN, achieving state-of-the-art performance in generating high-quality image-aware layouts, thereby improving visual quality significantly across various metrics.
Methodology
The authors propose two GAN-based models for layout generation: CGL-GAN, which uses Gaussian blurring to reduce domain gaps, and PDA-GAN, which utilizes unsupervised domain adaptation with a pixel-level discriminator connected to shallow feature maps for fine-grained control over feature alignment. Additionally, they introduce new content-aware metrics to evaluate the generated layouts.
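The blur-the-inpainted-region trick can be sketched on a 1-D signal. The kernel and signal values are illustrative; CGL-GAN applies a 2-D Gaussian blur to image regions:

```python
def blur_masked(signal, mask, kernel=(0.25, 0.5, 0.25)):
    """Blur only the masked (inpainted) positions of a 1-D signal, a crude
    analogue of suppressing inpainting artifacts so they no longer
    distinguish the inpainted-poster domain from clean images."""
    out = list(signal)
    k = len(kernel) // 2
    for i, m in enumerate(mask):
        if m:
            acc = 0.0
            for j, w in enumerate(kernel):
                idx = min(max(i + j - k, 0), len(signal) - 1)  # clamp edges
                acc += w * signal[idx]
            out[i] = acc
    return out

signal = [0.0, 0.0, 10.0, 0.0, 0.0]   # a sharp inpainting artifact
mask   = [False, False, True, False, False]
print(blur_masked(signal, mask))  # spike at index 2 is smoothed to 5.0
```

PDA-GAN takes the alternative route: instead of destroying the artifact signal with blur, it trains a pixel-level discriminator so the generator's shallow features become domain-invariant.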
Results
PDA-GAN outperforms CGL-GAN, achieving relative improvements of 6.21% in background complexity, 17.5% in subject occlusion degree, 14.5% in product occlusion degree, and 19.07% in the content-aware Fréchet Inception Distance (cFID) metric, indicating enhanced visual quality of generated layouts.
Implications
The findings suggest that GAN-based models, particularly with domain adaptation techniques, can significantly improve the quality of graphic layout generation in advertising, potentially streamlining the design process for marketing materials and enhancing visual appeal.
Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection
Computer Vision
- Introduces RAS, a hyperparameter-free method for OoD detection that enhances performance without requiring retraining.
- Identifies the limitations of existing scaling-based methods and the impact of unrectified activations on performance.
- Demonstrates consistent performance across different datasets and architectures, preserving in-distribution accuracy.
- Analyzes the contributions of both inhibitory and excitatory activation shifts to the effectiveness of OoD detection.
Summary
This paper addresses the challenge of out-of-distribution (OoD) detection in machine learning models, particularly focusing on post-hoc methods that do not require retraining. The authors identify inconsistencies in the performance of existing methods, which often rely on intermediate layer activation editing. They attribute these inconsistencies to variations in activation distributions and highlight a failure mode in scaling-based methods when penultimate layer activations are unrectified. To overcome these issues, the authors propose a novel method called Ranked Activation Shift (RAS), which is hyperparameter-free and enhances OoD detection by replacing sorted activation magnitudes with a fixed in-distribution reference profile. RAS demonstrates strong and consistent performance across various datasets and architectures, maintaining in-distribution classification accuracy. The authors analyze the factors contributing to RAS's effectiveness, revealing that both inhibitory and excitatory activation shifts play a role in improving OoD discrimination. This work provides a robust solution for integrating OoD detection into existing machine learning pipelines with minimal computational overhead.
Methodology
The authors propose the Ranked Activation Shift (RAS) method, which replaces the sorted activation magnitudes of the penultimate layer with a fixed in-distribution reference profile. This approach is designed to enhance the discriminative power of OoD detection without the need for hyperparameter tuning. The method operates on both features and logits, making it versatile across different model architectures.
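One plausible reading of the rank-based substitution, sketched on a toy activation vector. Whether RAS preserves signs and positions exactly this way is an assumption, and the reference profile values are invented:

```python
def ranked_activation_shift(acts, reference):
    """Replace each activation's magnitude by the reference value of its
    rank, keeping the original sign and position. `reference` is a fixed,
    descending in-distribution magnitude profile."""
    order = sorted(range(len(acts)), key=lambda i: -abs(acts[i]))
    out = [0.0] * len(acts)
    for rank, i in enumerate(order):
        sign = 1.0 if acts[i] >= 0 else -1.0
        out[i] = sign * reference[rank]
    return out

acts = [0.2, -3.0, 1.1, 0.0]
reference = [2.0, 1.0, 0.5, 0.1]   # fixed ID profile, largest first
print(ranked_activation_shift(acts, reference))  # [0.5, -2.0, 1.0, 0.1]
```

Because the output magnitudes come from the fixed profile, only the *ranking* of an input's activations survives, which is what makes the score sensitive to distribution shift without any hyperparameter tuning.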
Results
The RAS method shows significant improvements in OoD detection performance compared to existing methods, achieving robust results across multiple datasets and architectures. The analysis indicates that the method effectively utilizes both inhibitory and excitatory shifts in activations to enhance discrimination between in-distribution and out-of-distribution samples.
Implications
The findings suggest that RAS can be effectively used in real-world applications where reliable OoD detection is critical, such as autonomous driving, medical imaging, and financial decision-making. Its hyperparameter-free nature allows for easy integration into existing systems, making it a practical solution for enhancing model safety and reliability.
EgoEverything: A Benchmark for Human Behavior Inspired Long Context Egocentric Video Understanding in AR Environment
Computer Vision
Multimodal
- EgoEverything incorporates human attention into question generation for long-context egocentric video understanding.
- The benchmark features over 5,000 question-answer pairs across 100+ hours of video, reflecting real-world AR interactions.
- A novel Visual Question Answering pipeline and attention-inspired sampling strategy are utilized to generate questions.
- Evaluation reveals that current vision-language models perform poorly on EgoEverything, highlighting their limitations in real-life scenarios.
Summary
EgoEverything is a novel benchmark designed to enhance long-context egocentric video understanding, particularly in augmented reality (AR) environments. Traditional benchmarks often overlook the significance of human attention in question generation, leading to a mismatch with real-world usage where queries are typically grounded in what users are observing. EgoEverything addresses this gap by incorporating human attention signals derived from gaze data into its question generation process. The benchmark consists of over 5,000 multiple-choice question-answer pairs derived from more than 100 hours of egocentric video footage. This dataset is generated through a unique Visual Question Answering (VQA) pipeline that employs multiple AI agents and an attention-inspired sampling strategy to create questions that reflect authentic human inquiry patterns. The results indicate that current vision-language models (VLMs) struggle with the complexities of real-life AR scenarios, as evidenced by their lower performance on EgoEverything compared to existing benchmarks. This work not only sets a new standard for evaluating long-context egocentric video understanding but also emphasizes the importance of aligning machine learning models with human behavior in AR applications.
Methodology
The methodology involves a Visual Question Answering (VQA) generation pipeline that utilizes multiple AI agents to produce questions based on simulated gaze data. An attention-inspired sampling strategy is employed to select question targets, ensuring both attention-driven and detail-oriented queries. The quality and reliability of the questions are further enhanced through human review.
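A hypothetical sketch of attention-inspired sampling: segments where gaze dwelled longer are favored as question targets, with a uniform floor so detail-oriented (low-attention) targets still occur. All names, weights, and the floor parameter are assumptions, not the paper's pipeline:

```python
import random

def sample_targets(dwell_times, n, floor=0.1, seed=0):
    """Sample question-target segments with probability mixing gaze dwell
    time (attention-driven) and a uniform floor (detail-oriented)."""
    rng = random.Random(seed)
    total = sum(dwell_times)
    weights = [floor / len(dwell_times) + (1 - floor) * d / total
               for d in dwell_times]
    return rng.choices(range(len(dwell_times)), weights=weights, k=n)

dwell = [5.0, 0.2, 0.2, 3.0, 0.1]       # seconds of gaze per video segment
picks = sample_targets(dwell, n=1000)
print(picks.count(0) > picks.count(1))  # True: attended segments dominate
```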
Results
Evaluation on several cutting-edge vision-language models shows consistently lower performance on the EgoEverything benchmark, indicating significant challenges for these models in handling real-life AR long-context egocentric video understanding tasks.
Implications
EgoEverything has the potential to advance research in long-context egocentric video understanding and improve the development of intelligent AR applications that can better assist users by aligning with their natural inquiry patterns.
Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
Graph Learning
Optimization
Theory
- Introduction of RIA, a method for improving OoD generalization under covariate shift.
- Formulation of adversarial label invariant data augmentations to create diverse training environments.
- Development of an alternating gradient descent-ascent algorithm for optimization.
- Extensive experimental validation showing superior performance over existing OoD methods.
Summary
This paper addresses the challenge of out-of-distribution (OoD) generalization in machine learning, particularly under covariate shift, where the input distribution changes while the underlying concept remains the same. The authors propose a novel method called RIA (Regularization for Invariance with Adversarial training) that utilizes adversarial label invariant data augmentations to create diverse training environments. This approach aims to prevent the model from collapsing to an empirical risk minimization (ERM) solution, which is common when training environments are limited. The methodology is inspired by Q-learning and involves an adversarial exploration of training data environments. The authors develop an alternating gradient descent-ascent algorithm to optimize the learning process. Extensive experiments demonstrate that RIA significantly improves OoD generalization performance on both synthetic and real-world graph classification tasks compared to existing methods.
Methodology
The authors propose RIA, which employs adversarial training techniques to generate counterfactual environments that challenge the model during training. This is achieved through adversarial label invariant data augmentations, which help maintain the model's robustness against distribution shifts. The optimization is performed using an alternating gradient descent-ascent algorithm to effectively navigate the search space of training environments.
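The alternating optimization can be sketched on the classic saddle objective f(x, y) = x² − y², a stand-in for RIA's actual minimax objective over models and augmented environments:

```python
def gda(x, y, lr=0.1, steps=200):
    """Alternating gradient descent-ascent: the learner descends in x
    while the adversary (shaping the environment) ascends in y."""
    for _ in range(steps):
        x = x - lr * (2 * x)   # descent step on df/dx =  2x
        y = y + lr * (-2 * y)  # ascent  step on df/dy = -2y
    return x, y

x, y = gda(1.0, 1.0)
print(round(x, 6), round(y, 6))  # 0.0 0.0: converges to the saddle point
```

Each player improving its own objective drives the pair to the saddle; in RIA the analogous equilibrium is a model that performs well across the adversarially generated label-invariant environments.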
Results
The experiments conducted on various synthetic and natural datasets demonstrate that RIA achieves higher accuracy in OoD graph classification tasks compared to existing baseline methods. The results indicate that RIA effectively mitigates the collapse to ERM solutions, showcasing its potential for improved generalization across varying environments.
Implications
The findings suggest that RIA can be a valuable tool for practitioners dealing with graph data in scenarios where distribution shifts are common. It highlights the importance of adversarial training in creating robust models that can generalize well to unseen environments, which is crucial for applications in fields such as computer vision, social network analysis, and biological data interpretation.
Mathematical analysis of one-layer neural network with fixed biases, a new activation function and other observations
Theory
- Rigorous proof of convergence for one-hidden-layer neural networks with fixed biases using gradient descent.
- Introduction of a new activation function, FReX, which maintains convergence properties.
- Establishment of the spectral bias property for the learning process.
- Discussion on the representability of functions in both continuous and discrete models.
Summary
This paper presents a mathematical analysis of a simple one-hidden-layer neural network utilizing ReLU activation functions with fixed biases, focusing on one-dimensional input and output. The authors rigorously demonstrate the convergence of the learning process using the L2 squared loss function and gradient descent, establishing the spectral bias property for the model. The analysis highlights the necessary properties that activation functions should possess and explores the relationship between the spectrum of certain operators and the learning process. A novel activation function, the full-wave rectified exponential function (FReX), is proposed, which also exhibits convergence in the gradient descent process. The study emphasizes the representability of L2 functions in continuous models and a natural class of representable functions in discrete models, suggesting that these simplified models can enhance understanding of neural network mechanisms. The authors note that while higher-dimensional models can be constructed, they are significantly more complex and will be addressed in future work.
Methodology
The authors analyze both continuous and discrete versions of a one-hidden-layer neural network model. They employ mathematical techniques from functional analysis and partial differential equations to rigorously prove convergence and spectral bias properties, as well as to derive insights into the role of activation functions.
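Because the biases are fixed, gradient descent on the outer weights minimizes a convex L2 loss, which is the structural reason convergence can be proved. A minimal numeric sketch of the discrete model f(x) = Σᵢ aᵢ·ReLU(x − bᵢ); the bias grid, data, and learning rate are illustrative choices, not the paper's:

```python
def relu(z):
    return max(z, 0.0)

biases = [0.0, 0.5]                     # fixed: never updated
a = [0.0, 0.0]                          # trainable outer weights
xs, ys = [0.25, 0.75], [0.25, 0.75]     # target function: y = x

def predict(x):
    return sum(ai * relu(x - bi) for ai, bi in zip(a, biases))

lr = 1.0
for _ in range(5000):
    grads = [sum(2 * (predict(x) - y) * relu(x - bi)
                 for x, y in zip(xs, ys)) / len(xs)
             for bi in biases]
    a = [ai - lr * g for ai, g in zip(a, grads)]

print(round(predict(0.5), 3))  # 0.5: converged (a ≈ [1, 0] fits y = x)
```

With the hidden layer frozen, the features ReLU(x − bᵢ) are fixed basis functions and training reduces to linear least squares, exactly the setting where spectral arguments about the learning dynamics apply.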
Results
The paper proves that the learning process converges to a unique minimum for the proposed neural network model with fixed biases and ReLU activation. The new activation function, FReX, is shown to also support convergence in the gradient descent process. The analysis reveals that the ReLU function is effective due to its properties as a fundamental solution to the one-dimensional Laplacian, and the proposed FReX function similarly exhibits beneficial characteristics.
Implications
The findings suggest that the structure of activation functions significantly impacts the learning dynamics of neural networks. The proposed models and insights may facilitate the design of more effective neural network architectures and contribute to a deeper understanding of neural network behavior, particularly in simpler configurations.
Loom: A Scalable Analytical Neural Computer Architecture
Theory
Efficient ML
Interpretability
- Loom implements a 22-opcode instruction set in 8 transformer layers, optimizing execution efficiency.
- The opcode-as-operand-routing technique reduces the complexity of execution layers, enabling a more compact architecture.
- The introduction of the STORE instruction significantly decreases the size of compiled programs, enhancing performance.
- A C compiler facilitates the translation of C programs to Loom's ISA, supporting both offline and in-browser execution.
Summary
This paper introduces Loom, a novel computer architecture designed to execute programs compiled from C within a looped transformer framework. Loom features a 22-opcode instruction set implemented across 8 transformer layers, allowing for efficient execution of various programs. The architecture operates on a fixed-size tensor that encapsulates the entire machine state, ensuring that each execution step has a constant computational cost, independent of the program's length. The key innovation lies in the opcode-as-operand-routing technique, which reduces the need for multiple execution layers by mapping all operations to a shared subtract core. Additionally, the introduction of the STORE instruction optimizes memory operations, significantly reducing the size of compiled code for programs with variable-index array writes. The paper also presents a C compiler that translates a subset of C into Loom's instruction set architecture (ISA), with implementations available in both Python and JavaScript. Demonstrations of Loom include various applications such as a sorting visualizer, a playable Snake game, and a Sudoku solver, showcasing its versatility and efficiency. The architecture's design is scale-independent, allowing it to function across different configurations without altering the source code.
Methodology
The Loom architecture utilizes a looped transformer with analytically derived weights to execute compiled C programs. It employs opcode-as-operand-routing to streamline operations and reduce execution layers, while a C compiler translates C code into Loom's instruction set. The architecture operates on a fixed-size state tensor, ensuring consistent computational costs across different program lengths.
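As a loose analogue of routing every operation through a shared subtract core, here is a tiny SUBLEQ ("subtract and branch if ≤ 0") interpreter, where each step reads and writes a fixed-size state regardless of program length. Loom's real 22-opcode ISA and analytically derived transformer weights are not reproduced:

```python
def subleq_step(mem, pc):
    """One constant-cost step: mem[b] -= mem[a]; jump to c if result <= 0.
    The entire machine state is (mem, pc), independent of program length."""
    a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
    mem[b] -= mem[a]
    return c if mem[b] <= 0 else pc + 3

def run(mem, pc=0, max_steps=1000):
    while 0 <= pc < len(mem) and max_steps > 0:
        pc = subleq_step(mem, pc)
        max_steps -= 1
    return mem

# Three instructions computing mem[10] += mem[9] via scratch cell 11.
mem = [9, 11, 3,   11, 10, 6,   11, 11, -1,   3, 4, 0]
print(run(mem)[10])  # 7 = 3 + 4
```

The per-step cost is constant because the transition function is the same bounded computation every time; Loom achieves the analogous property by running one transformer pass per machine step over a fixed-size state tensor.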
Results
Loom successfully executes various programs, including a sorting visualizer, a playable Snake game, and a 9×9 Sudoku solver, demonstrating its capability to handle complex tasks efficiently. The architecture's optimizations lead to a significant reduction in compiled code size, particularly for programs requiring indirect memory writes.
Implications
Loom's architecture could enable the integration of deterministic algorithmic logic directly into perception models, enhancing the reliability of safety-critical applications. Its design may also inspire future developments in scalable neural computing and program execution frameworks.
Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
NLP
Efficient ML
Theory
- EXPONA introduces a two-phase process for LF generation that balances diversity and reliability.
- The framework systematically explores multi-level label functions to improve coverage and quality.
- EXPONA achieved up to 98.9% label coverage and improved weak label quality by up to 87%.
- The method resulted in substantial downstream performance gains, with improvements of up to 46% in weighted F1 scores.
Summary
The paper presents EXPONA, an automated framework for programmatic labeling that addresses the challenges of generating high-quality labeled data for machine learning models. Traditional methods of automated label function (LF) generation often rely on large language models or model-based synthesis, which can lead to limited coverage and unreliable label quality. EXPONA formulates LF generation as a process that balances diversity and reliability by systematically exploring multi-level LFs—surface, structural, and semantic perspectives. It employs a two-phase approach: LF exploration generates diverse candidate LFs based on task descriptions and data characteristics, while LF exploitation evaluates and filters these candidates based on performance indicators. The framework was evaluated on eleven classification datasets, demonstrating that EXPONA outperformed existing methods by achieving nearly complete label coverage (up to 98.9%) and significantly improving weak label quality (up to 87%). This resulted in downstream performance gains of up to 46% in weighted F1 scores, indicating that EXPONA's approach effectively enhances label quality and model performance across various tasks.
Methodology
EXPONA employs a two-phase approach consisting of LF exploration and LF exploitation. In the exploration phase, it generates candidate LFs from multiple perspectives based on task descriptions and data characteristics. In the exploitation phase, these candidates are evaluated and filtered based on performance indicators to retain only the most reliable heuristics for weakly annotating unlabeled data.
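A toy sketch of the two phases, with label functions, thresholds, and data all invented: exploration proposes candidate LFs, and exploitation keeps only those whose accuracy and coverage on a small labeled set clear minimum bars:

```python
def lf_positive_word(text):
    return 1 if "great" in text else None      # None = abstain

def lf_negative_word(text):
    return 0 if "terrible" in text else None

def lf_always_positive(text):
    return 1                                    # high coverage, unreliable

def exploit(candidates, dev_texts, dev_labels, min_acc=0.7, min_cov=0.2):
    """Filter candidate LFs by accuracy (where they fire) and coverage."""
    kept = []
    for lf in candidates:
        votes = [lf(t) for t in dev_texts]
        fired = [(v, y) for v, y in zip(votes, dev_labels) if v is not None]
        cov = len(fired) / len(dev_texts)
        acc = sum(v == y for v, y in fired) / len(fired) if fired else 0.0
        if acc >= min_acc and cov >= min_cov:
            kept.append(lf.__name__)
    return kept

texts = ["great movie", "terrible plot", "great cast", "so terrible"]
labels = [1, 0, 1, 0]
print(exploit([lf_positive_word, lf_negative_word, lf_always_positive],
              texts, labels))  # ['lf_positive_word', 'lf_negative_word']
```

The always-positive LF is rejected despite full coverage because its accuracy is only 50%, illustrating why exploitation must weigh reliability against the diversity that exploration produces.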
Results
EXPONA was evaluated on eleven classification datasets, achieving nearly complete label coverage (up to 98.9%) and improving weak label quality by up to 87%. The framework led to downstream performance gains of up to 46% in weighted F1 scores compared to state-of-the-art automated LF generation methods.
Implications
The findings suggest that EXPONA can significantly enhance the efficiency and effectiveness of automated data annotation processes, making it a valuable tool for generating high-quality labeled datasets in various machine learning applications.
Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
Large Language Models
Efficient ML
Optimization
- Introduces a progressive QAT framework that enhances stability during low-bit training.
- Employs outlier channel splitting to mitigate quantization errors effectively.
- Achieves significant speed improvements with custom operators for low-bit configurations.
- Demonstrates superior performance on LLaMA-2/3 compared to existing QAT baselines.
Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs
Summary
The paper presents BIT-BY-BIT, a novel progressive quantization-aware training (QAT) framework designed to enhance the stability and efficiency of training large language models (LLMs) at ultra-low precision. Traditional low-bit QAT methods struggle with convergence instability and high training costs, particularly due to quantization noise from outlier channels and error accumulation across layers. BIT-BY-BIT addresses these challenges through three main innovations: (1) a block-wise progressive training approach that gradually reduces precision, ensuring stable initialization for low-bit optimization; (2) a nested structure of integer quantization grids that allows a single model to support multiple bit-widths without retraining; and (3) rounding-aware outlier channel splitting that reduces quantization errors while preserving output integrity. The framework also incorporates microscaling groups with E4M3 scales to align with industry standards. Custom operators for W2A2 and W2A16 configurations were developed, achieving significant speed improvements. Evaluations on LLaMA-2/3 demonstrate that BIT-BY-BIT outperforms existing QAT methods, achieving a minimal increase in perplexity compared to full-precision models, thus showcasing its effectiveness in ultra-low-bit regimes.
Methodology
The methodology involves a progressive training strategy that reduces precision in stages, starting with weights and followed by activations. The framework utilizes a nested quantization grid structure to facilitate multi-precision deployment without retraining. Rounding-aware outlier channel splitting is implemented to address quantization errors while maintaining output fidelity. Custom operators are developed to optimize performance for specific low-bit configurations.
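The nested-grid idea can be made concrete with fake quantization. The following is our illustration of the concept, not the paper's kernels: low-bit codes are chosen as a strict subset of the high-bit integer grid, so one set of weights serves multiple bit-widths without retraining. The scales and values are arbitrary.

```python
import numpy as np

def fake_quant(x, bits, scale):
    """Symmetric integer fake quantization: snap to the b-bit grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def nested_quant(x, low_bits, high_bits, scale):
    """Quantize on the high-bit grid, then snap codes to every 2^(hi-lo)-th level,
    so the low-bit grid is nested inside the high-bit one."""
    qmax_hi = 2 ** (high_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax_hi - 1, qmax_hi)
    step = 2 ** (high_bits - low_bits)
    q_low = np.clip(np.round(q / step),
                    -(2 ** (low_bits - 1)), 2 ** (low_bits - 1) - 1) * step
    return q_low * scale

x = np.array([0.73, -0.31, 0.04])
w4 = fake_quant(x, bits=4, scale=0.1)                      # 4-bit codes
w2 = nested_quant(x, low_bits=2, high_bits=4, scale=0.1)   # nested 2-bit codes
```

Because every 2-bit code is also a valid 4-bit code, a deployed model can be served at either precision from the same parameters, which is the property the nested grid structure is after.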
Results
BIT-BY-BIT significantly outperformed baseline methods like BitDistiller and EfficientQAT on LLaMA-2/3, achieving only a 2.25 increase in perplexity on WikiText2 compared to full-precision models. The framework also demonstrated up to an 11× speedup over BF16 configurations, indicating substantial efficiency gains.
Implications
The findings suggest that BIT-BY-BIT can enable more efficient training and deployment of large language models at ultra-low precision, making it feasible to leverage large models in resource-constrained environments. This approach could lead to broader applications of LLMs in real-time systems and edge computing.
Feature-Label Modal Alignment for Robust Partial Multi-Label Learning
Multimodal
- Introduction of modal alignment to improve feature-label consistency in PML.
- Development of a low-rank orthogonal decomposition method for robust pseudo-label generation.
- Implementation of a multi-peak class prototype learning mechanism to enhance discriminability.
- Demonstrated superior performance in classification accuracy and noise robustness compared to existing methods.
Feature-Label Modal Alignment for Robust Partial Multi-Label Learning
Summary
This paper addresses the challenge of partial multi-label learning (PML), where instances are associated with candidate labels that include both ground-truth and noisy labels. The authors propose a novel method called Feature-Label Modal Alignment (PML-MA) to enhance classification performance by restoring the correspondence between features and labels through systematic alignment. The method employs low-rank orthogonal decomposition to generate pseudo-labels that approximate the true label distribution, effectively filtering out noisy labels. It aligns features and pseudo-labels through global projection into a common subspace and local preservation of neighborhood structures. Additionally, a multi-peak class prototype learning mechanism is introduced, which utilizes pseudo-labels as soft membership weights to improve the discriminability of instances belonging to multiple categories. The proposed approach demonstrates significant improvements in classification accuracy and robustness against label noise across various real-world and synthetic datasets, outperforming state-of-the-art methods.
Methodology
The methodology involves three main steps: (1) generating pseudo-labels using low-rank orthogonal decomposition to approximate true label distributions, (2) aligning features and pseudo-labels through global and local alignment strategies, and (3) employing a multi-peak class prototype learning mechanism that uses pseudo-labels as soft membership weights to refine the classification process.
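Step (1) can be illustrated with a plain truncated SVD. This is our simplification, not the paper's exact solver or objective: the noisy candidate-label matrix is treated as approximately low rank, and its low-rank reconstruction serves as soft pseudo-labels that damp inconsistent noisy entries.

```python
import numpy as np

def low_rank_pseudo_labels(Y, rank):
    """Rank-`rank` reconstruction of the candidate label matrix, clipped to [0, 1]."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    P = U[:, :rank] * s[:rank] @ Vt[:rank]
    return np.clip(P, 0.0, 1.0)

# Two latent classes over four labels; one instance carries a noisy candidate label.
Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 1],   # last entry is a spurious candidate
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
P = low_rank_pseudo_labels(Y, rank=2)
```

With rank 2 the isolated noisy entry cannot be represented exactly and is shrunk toward the dominant block structure, which is the filtering effect the pseudo-label step relies on.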
Results
Extensive experiments show that PML-MA significantly outperforms existing PML methods, achieving higher classification accuracy and greater robustness against label noise across both real-world and synthetic datasets.
Implications
The findings suggest that the proposed PML-MA method can be effectively applied in scenarios where label noise is prevalent, such as in crowdsourced data annotations, improving the reliability of multi-label classification tasks in various domains including text classification, image recognition, and more.
Stability Enhanced Gaussian Process Variational Autoencoders
Generative Models
Robotics
Theory
- Introduction of SEGP-VAE for training LTI systems using video data.
- Derivation of mean and covariance functions from LTI system definitions.
- Unconstrained parametrization prevents numerical issues during training.
- Demonstrated effectiveness on spiraling particle video dataset.
Stability Enhanced Gaussian Process Variational Autoencoders
Summary
This paper introduces the Stability Enhanced Gaussian Process Variational Autoencoder (SEGP-VAE), a novel framework designed to train low-dimensional linear time-invariant (LTI) systems using high-dimensional video data. The SEGP prior's mean and covariance functions are derived from LTI system definitions, allowing the model to capture indirectly observed latent processes through a combination of probabilistic and interpretable physical modeling. The authors implement a complete and unconstrained parametrization that restricts the search space of LTI parameters to semi-contracting systems, thus avoiding numerical issues associated with non-Hurwitz state matrices. The SEGP-VAE is evaluated on a dataset of spiraling particles, demonstrating its capability to accurately predict latent states and underlying dynamics with low uncertainty when conditioned on observed video data. This approach extends previous physics-enhanced GP-VAEs by requiring less prior knowledge and accommodating unknown initial conditions, making it a robust tool for modeling dynamical systems.
Methodology
The SEGP-VAE integrates a stability-enhanced Gaussian process prior into the variational autoencoder framework. The mean and covariance functions are derived from LTI system principles, and the model employs an unconstrained parametrization to ensure stability and avoid numerical issues. The training process utilizes unconstrained optimization algorithms to learn the latent dynamics from high-dimensional video data.
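The two key ingredients, an unconstrained parametrization of a semi-contracting state matrix and a GP mean/covariance derived from the LTI definition, can be sketched as follows. This is our reconstruction of the idea from the summary, not the paper's code; the dimensions, priors, and the homemade matrix exponential are illustrative (SciPy's `expm` would do in practice).

```python
import numpy as np

def expm(A, order=6, squarings=10):
    """Matrix exponential via scaling-and-squaring with a short Taylor series."""
    B = A / (2.0 ** squarings)
    E, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, order + 1):
        term = term @ B / k
        E = E + term
    for _ in range(squarings):
        E = E @ E
    return E

def semi_contracting(M, L):
    """A = (M - M^T) - L L^T: the symmetric part -L L^T is negative semidefinite,
    so ||exp(A t)|| <= 1 for all t >= 0 and training never sees a blow-up."""
    return (M - M.T) - L @ L.T

def lti_gp(A, m0, P0):
    """GP mean/covariance of the latent state z(t) = exp(A t) z0, z0 ~ N(m0, P0)."""
    mean = lambda t: expm(A * t) @ m0
    cov = lambda t, s: expm(A * t) @ P0 @ expm(A * s).T
    return mean, cov

rng = np.random.default_rng(1)
M, L = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
A = semi_contracting(M, L)
mean, cov = lti_gp(A, m0=np.array([1.0, 0.0]), P0=np.eye(2))
```

Because M and L are unconstrained, standard optimizers can search over them freely while every resulting A stays in the semi-contracting class.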
Results
The SEGP-VAE successfully captures the latent dynamics of the spiraling particles, providing accurate predictions of the underlying processes with low uncertainty. The model's ability to incorporate stability through semi-contraction enhances its robustness and interpretability, making it suitable for applications in material science and other fields requiring dynamical system modeling.
Implications
The SEGP-VAE framework has significant implications for modeling complex dynamical systems in various scientific fields, particularly where direct measurements are challenging. Its ability to integrate physical principles with data-driven approaches can lead to improved understanding and control of systems in material science, robotics, and beyond.
SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion
Optimization
- Introduction of SPAMoE, a spectrum-aware framework for FWI.
- Utilization of a Spectral-Preserving DINO Encoder to maintain frequency balance.
- Implementation of an Adaptive Spectral Mixture-of-Experts for dynamic frequency band allocation.
- Significant performance improvement on OpenFWI benchmark datasets.
SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion
Summary
The paper introduces SPAMoE, a novel framework designed to enhance Full-Waveform Inversion (FWI) by addressing the computational intensity and ill-posed nature of traditional methods. SPAMoE employs a Spectral-Preserving DINO Encoder that maintains a balanced high-to-low frequency energy ratio, which mitigates high-frequency collapse and stabilizes frequency-domain modeling. Additionally, it features an Adaptive Spectral Mixture-of-Experts (MoE) that dynamically allocates frequency bands to a mixture of neural operators, thereby effectively decoupling multi-scale geological features. The framework is evaluated on the OpenFWI benchmark, demonstrating a 54.1% reduction in mean absolute error (MAE) compared to the best existing baseline. This work establishes a new architectural framework for learning-based FWI, showcasing improved recovery of complex geological structures and high-frequency details.
Methodology
The SPAMoE framework consists of a Spectral-Preserving DINO Encoder that enforces a lower bound on the high-to-low frequency energy ratio, and an Adaptive Spectral Mixture-of-Experts that includes frequency decomposition, routing, and operator modeling. This design allows for explicit decoupling of frequency information, facilitating better learning and inversion outcomes.
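The frequency decomposition and routing step can be sketched with an FFT and per-band experts. This is a toy of ours, not SPAMoE: the experts here are plain callables on the band signal (the paper routes bands to learned neural operators), and the uniform band edges are an assumption.

```python
import numpy as np

def spectral_moe(x, n_bands, experts):
    """Split the rFFT spectrum into contiguous bands, let expert k transform
    band k's reconstruction, and sum the results."""
    X = np.fft.rfft(x)
    edges = np.linspace(0, len(X), n_bands + 1).astype(int)
    out = np.zeros_like(x)
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        Xk = np.zeros_like(X)
        Xk[lo:hi] = X[lo:hi]                      # isolate one frequency band
        out = out + experts[k](np.fft.irfft(Xk, n=len(x)))
    return out

t = np.linspace(0, 1, 128, endpoint=False)
x = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)
gains = [lambda b: b] * 4          # identity experts: bands recompose the input
y = spectral_moe(x, 4, gains)
```

Because the bands partition the spectrum, identity experts reproduce the input exactly, and zeroing one expert removes exactly that band's content, which is the explicit frequency decoupling the design aims for.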
Results
SPAMoE achieved a 54.1% reduction in average MAE on the OpenFWI benchmark compared to the best reported baseline, indicating its effectiveness in reconstructing high-resolution subsurface velocity models and handling complex geological features.
Implications
The proposed framework has the potential to significantly enhance the efficiency and accuracy of seismic imaging and subsurface modeling in geophysics, paving the way for more effective exploration and resource management.
Cluster Attention for Graph Machine Learning
Graph Learning
- Introduces Cluster Attention (CLATT) to enhance graph learning models.
- CLATT allows nodes to attend to all nodes within their clusters, improving long-range dependency capture.
- Integrating CLATT with MPNNs and Graph Transformers leads to significant performance improvements.
- The approach leverages community detection algorithms for effective graph partitioning.
Cluster Attention for Graph Machine Learning
Summary
This paper introduces Cluster Attention (CLATT), a novel attention mechanism designed for Graph Machine Learning (GML). Traditional Message Passing Neural Networks (MPNNs) have a limited receptive field, which restricts their ability to capture long-range dependencies in graph-structured data. While Graph Transformers with global attention can overcome this limitation, they often neglect the graph's topology, which is crucial for effective learning. CLATT addresses these issues by partitioning graph nodes into clusters using community detection algorithms, allowing nodes to attend to all other nodes within their cluster. This approach retains the inductive biases of graph structure while enhancing the receptive field. The authors demonstrate that integrating CLATT into both MPNNs and Graph Transformers significantly boosts performance across various real-world graph datasets, including those from the GraphLand benchmark.
Methodology
The authors propose a new attention mechanism, CLATT, which clusters graph nodes using community detection algorithms. This allows nodes to exchange information with all other nodes in their respective clusters, thus enhancing the receptive field while maintaining graph-structure-based inductive biases. The methodology includes augmenting existing MPNNs and Graph Transformers with CLATT and evaluating their performance on various datasets.
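Cluster-restricted attention itself is compact enough to sketch. This is our minimal version: the cluster assignment is taken as given (in practice it would come from a community-detection algorithm such as Louvain or Leiden), and the projections are random stand-ins for learned weights.

```python
import numpy as np

def cluster_attention(H, clusters, Wq, Wk, Wv):
    """Scaled dot-product attention masked so nodes attend only within their cluster."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = clusters[:, None] == clusters[None, :]   # allow same-cluster pairs only
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))                  # 6 nodes, 4 features
clusters = np.array([0, 0, 0, 1, 1, 1])      # from a community-detection step
W = [rng.normal(size=(4, 4)) for _ in range(3)]
out = cluster_attention(H, clusters, *W)
```

The mask is what preserves the structural inductive bias: a node's output is provably independent of nodes outside its cluster, while its receptive field still spans the whole cluster rather than its one-hop neighborhood.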
Results
The experimental results show that models augmented with CLATT outperform traditional MPNNs and Graph Transformers on a diverse set of 12 real-world graph datasets, indicating that the cluster-based attention mechanism effectively captures long-range dependencies and leverages graph structure.
Implications
The findings suggest that incorporating cluster-based attention mechanisms can significantly enhance the performance of graph learning models, making them more effective for applications in social networks, biological systems, and other domains where graph structures are prevalent.
How does Chain of Thought decompose complex tasks?
NLP
Large Language Models
Theory
- Classification error in LLMs scales as a power law with the number of classes.
- Decomposing tasks into smaller classification problems can significantly reduce prediction error.
- There exists an optimal degree of decomposition that minimizes error, beyond which additional reasoning depth is counterproductive.
- The study formalizes reasoning as a structured decomposition of classification tasks, explaining empirical observations in LLM performance.
How does Chain of Thought decompose complex tasks?
Summary
This paper investigates the Chain of Thought (CoT) methodology in large language models (LLMs) and its effectiveness in decomposing complex classification tasks. The authors establish that classification error scales as a power law with respect to the number of classes, suggesting that task decomposition into smaller sub-problems can significantly reduce prediction error. They propose a tree-structured approach to CoT, where the model's reasoning process is broken down into a sequence of smaller classification tasks. The study identifies an optimal degree of task decomposition, beyond which further 'thinking' or depth in reasoning can lead to diminishing returns or even increased error. The findings highlight the importance of balancing the depth of reasoning with the number of classes to achieve minimal error, providing a theoretical framework for understanding the effectiveness of CoT in LLMs.
Methodology
The authors analyze the scaling laws of classification error in supervised learning, deriving a mathematical framework that relates error probability to the number of classes, data points, and input dimensionality. They explore the effects of task decomposition on error rates and identify optimal conditions for reasoning depth through theoretical analysis and empirical observations.
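The decomposition trade-off is easy to see with a worked toy calculation. The functional form below is our assumption for illustration, not the paper's exact law: if per-task error scales as eps(C) = a * C**beta and a C-class task is split into d sequential stages of C**(1/d) classes each, a union bound over stages gives total(d) ≈ d * a * C**(beta/d), which first falls and then rises in d.

```python
import math

def total_error(C, d, a=1e-3, beta=1.2):
    """Union-bound error of a d-stage decomposition of a C-class task
    under the assumed power law eps(C) = a * C**beta."""
    return d * a * C ** (beta / d)

C = 4096
errors = {d: total_error(C, d) for d in range(1, 13)}
best_d = min(errors, key=errors.get)   # optimal decomposition depth
```

Differentiating d * a * exp(beta * ln(C) / d) gives the stationary point d* = beta * ln(C), so for C = 4096 the optimum sits near d* ≈ 1.2 × 8.32 ≈ 10, and pushing depth beyond it makes the union-bound term dominate, matching the "excessive reasoning is counterproductive" finding.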
Results
The results indicate that the classification error can be minimized by balancing the number of classes in each sub-task with the depth of reasoning. Specifically, an optimal degree of decomposition is identified, which allows for effective reasoning without incurring excessive error. The study also confirms that excessive reasoning can be detrimental when the degree of task decomposition is below a critical threshold.
Implications
These findings have significant implications for the design and training of LLMs, suggesting that structured reasoning processes can enhance performance on complex tasks. The insights can guide future research on optimizing reasoning strategies in AI systems, potentially improving their capabilities in various applications such as mathematical reasoning and programming.
Toward World Models for Epidemiology
Time Series
Theory
Optimization
- Introduces a formal framework for epidemiological world models.
- Reframes epidemic decision-making to incorporate latent states and adaptive behaviors.
- Demonstrates the necessity of world models through empirical case studies.
- Highlights the limitations of traditional epidemiological models in capturing dynamic behaviors.
Toward World Models for Epidemiology
Summary
This paper presents a conceptual framework for integrating world models into computational epidemiology, highlighting the need for advanced modeling techniques to address the complexities of epidemic decision-making. The authors argue that traditional epidemiological models often overlook the dynamic nature of disease spread, particularly the latent states of epidemics, the noise in observational data, and the adaptive behaviors of individuals in response to interventions. By framing epidemics as controlled, partially observed dynamical systems, the paper emphasizes the importance of understanding the latent disease burden and the effects of interventions through a world model lens. The authors provide three case studies that illustrate the necessity of world modeling in epidemiology: (1) strategic misreporting in behavioral surveillance, (2) systematic delays in reporting signals like hospitalizations and deaths, and (3) counterfactual analysis of interventions. These case studies demonstrate how world models can enhance policy-relevant reasoning and improve decision-making under uncertainty.
Methodology
The authors develop a conceptual framework that models epidemics as controlled, partially observed dynamical systems. They analyze the dynamics of latent states, noisy observations, and the impact of interventions through a series of case studies that illustrate the application of world models in epidemiology.
Results
The case studies reveal that world models can significantly enhance the understanding of epidemic dynamics, improve the accuracy of policy interventions, and facilitate counterfactual reasoning. The findings suggest that incorporating world models into epidemiological research can lead to more informed and effective public health strategies.
Implications
The proposed framework for world models in epidemiology has the potential to transform how public health decisions are made, particularly in the context of emerging infectious diseases. It encourages the integration of machine learning techniques into epidemiological modeling, paving the way for more adaptive and responsive public health policies.
Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL
Reinforcement Learning
Graph Learning
Time Series
- Introduction of NetForge_RL, a continuous-time POSMDP simulator for cyber defense.
- Development of CT-GMARL, a novel architecture for asynchronous multi-agent defense.
- Empirical results show significant performance improvements over existing MARL baselines.
- CT-GMARL effectively restores more compromised services and validates Zero-Shot transfer capabilities.
Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL
Summary
This paper addresses the challenges of deploying Multi-Agent Reinforcement Learning (MARL) policies from simulated environments to real-world Security Operations Centers (SOCs), primarily due to the Sim2Real gap. The author introduces NetForge_RL, a high-fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP). This simulator incorporates cryptographic Zero-Trust Network Access (ZTNA) constraints and requires the processing of NLP-encoded telemetry derived from Windows Event XML logs. To effectively navigate this environment, the paper proposes a novel Continuous-Time Graph MARL (CT-GMARL) architecture, which utilizes Neural Ordinary Differential Equations (ODEs) to handle irregularly sampled alerts. The framework is empirically validated against LSTM-based baselines, showing significant improvements in performance metrics. The results indicate that CT-GMARL achieves a median Blue reward of 57,135, outperforming R-MAPPO and QMIX by 2.0× and 2.1×, respectively, and restores 12× more compromised services than traditional methods. Additionally, CT-GMARL demonstrates robust policy retention with a median reward of 98,026 during Zero-Shot transfer to live environments, effectively bridging the Sim2Real gap.
Methodology
The paper presents a dual-mode engine for NetForge_RL that allows high-throughput MARL training in a MockHypervisor and Zero-Shot evaluation in a DockerHypervisor. The CT-GMARL architecture employs fixed-step Neural ODEs to process irregularly sampled Security Information and Event Management (SIEM) alerts, enabling effective navigation of the continuous-time POSMDP framework.
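The event-driven core of the architecture can be sketched without any learned components. In this hedged toy of ours, the "ODE network" is a fixed decay and the "event encoder" is an additive fold; the structure, however, matches the fixed-step Neural ODE idea: drift the hidden state through each silent interval, then apply a discrete update when an alert arrives.

```python
import numpy as np

def evolve(h, dt, f, n_steps=16):
    """Fixed-step Euler integration of dh/dt = f(h) over a gap of length dt."""
    step = dt / n_steps
    for _ in range(n_steps):
        h = h + step * f(h)
    return h

def run(events, h0, f, g):
    """events: list of (timestamp, alert_vector); g folds an alert into the state."""
    h, t = h0, 0.0
    for t_i, x_i in events:
        h = evolve(h, t_i - t, f)       # drift through the irregular gap
        h = g(h, x_i)                   # discrete jump at the event
        t = t_i
    return h

decay = lambda h: -0.5 * h              # stand-in for a learned ODE network
fold = lambda h, x: h + x               # stand-in for a learned alert encoder
events = [(0.4, np.array([1.0, 0.0])), (2.0, np.array([0.0, 1.0]))]
h_final = run(events, np.zeros(2), decay, fold)
```

Because the integration uses the actual inter-arrival time, older alerts are discounted continuously rather than by a fixed recurrence step, which is what lets the model handle irregularly sampled SIEM streams.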
Results
CT-GMARL achieved a median Blue reward of 57,135, which is a 2.0× improvement over R-MAPPO and a 2.1× improvement over QMIX. It also restored 12× more compromised services compared to the strongest discrete baseline and achieved a median reward of 98,026 during Zero-Shot transfer to the live DockerHypervisor.
Implications
The findings suggest that the proposed framework can significantly enhance the effectiveness of autonomous cyber defense systems in real-world SOC environments, potentially reducing alert fatigue and improving response times to cyber threats. This work could lead to more robust and adaptable cybersecurity measures in operational settings.
Alleviating Community Fear in Disasters via Multi-Agent Actor-Critic Reinforcement Learning
Reinforcement Learning
Optimization
Theory
- Introduction of a control-affine extension to the CPS model for disaster resilience.
- Development of a three-player non-zero-sum differential game to optimize resource deployment.
- Application of online actor-critic reinforcement learning to solve the game and derive optimal policies.
- Empirical results show significant reductions in community fear during disasters.
Alleviating Community Fear in Disasters via Multi-Agent Actor-Critic Reinforcement Learning
Summary
This paper addresses the amplification of community fear during disasters due to cascading failures in power grids, communication networks, and social behaviors. Existing cyber-physical-social (CPS) models have been limited in their ability to prescribe active interventions. The authors extend a previous CPS resilience model by incorporating control channels for three agencies: communication, power, and emergency management. They formulate the system as a three-player non-zero-sum differential game, which is solved using an online actor-critic reinforcement learning approach. Simulations based on data from Hurricane Harvey demonstrate a mean fear reduction of approximately 70%, while cross-validation with Hurricane Irma shows a 50% reduction, indicating the model's generalizability. The study highlights the importance of integrating real-time feedback and intervention strategies in disaster management frameworks.
Methodology
The authors enhance the existing CPS model by adding an active control layer that includes three disaster-response agencies. They model the interactions as a three-player non-zero-sum differential game and utilize an online actor-critic reinforcement learning architecture to derive optimal resource deployment strategies. The model is validated through simulations based on real disaster data.
Results
The proposed approach resulted in a mean fear reduction of approximately 70% during Hurricane Harvey and about 50% during Hurricane Irma, confirming the effectiveness and generalizability of the intervention strategies across different disaster scenarios.
Implications
This research has significant implications for disaster management, suggesting that integrating real-time control mechanisms and reinforcement learning can enhance community resilience and reduce fear during crises. It opens avenues for further research in applying game-theoretic approaches to other complex systems.
Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Efficient ML
- Embedding-based distillation is more stable than logit-based methods for genomic models.
- The HelixNano-mRNA model achieves a 200-fold reduction in size while maintaining state-of-the-art performance.
- Utilizing intermediate latent representations is an effective strategy for distilling knowledge in biological models.
Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Summary
This paper presents a novel distillation framework aimed at transferring mRNA representations from a large genomic foundation model to a significantly smaller model, HelixNano-mRNA, which is specialized for mRNA sequences. The authors address the challenge of high computational costs associated with large genomic models, which can run to billions of parameters. By employing embedding-level distillation, the authors demonstrate that their approach is more stable and effective than traditional logit-based methods. The distilled model achieves state-of-the-art performance on the mRNA-bench benchmark, outperforming larger models while being 200 times smaller. This work highlights the potential of embedding-based distillation as a viable strategy for efficient and scalable sequence modeling in genomics, particularly in scenarios where computational resources are limited.
Methodology
The authors utilize a distillation approach that aligns intermediate embeddings from a large teacher model (Evo2-1B) with the hidden layers of a smaller student model (HelixNano-mRNA). They define a loss function that combines cosine and mean-square losses to optimize the embedding matching process. The training is conducted using a batch size of 32 and mixed precision on multiple GPUs, focusing on mRNA sequences.
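The combined loss is simple enough to state directly. This sketch is our reconstruction from the summary: the alpha/beta weighting and the epsilon guard are assumptions, and in practice the student hidden states would pass through a learned projection before matching.

```python
import numpy as np

def embed_match_loss(student, teacher, alpha=1.0, beta=1.0):
    """Per-token alpha * (1 - cosine similarity) + beta * MSE, averaged over tokens."""
    cos = np.sum(student * teacher, axis=-1) / (
        np.linalg.norm(student, axis=-1) * np.linalg.norm(teacher, axis=-1) + 1e-8)
    mse = np.mean((student - teacher) ** 2, axis=-1)
    return np.mean(alpha * (1.0 - cos) + beta * mse)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 8))              # 5 tokens, 8-dim teacher embeddings
aligned = embed_match_loss(teacher, teacher)   # perfectly matched student
off = embed_match_loss(teacher + 0.5, teacher)
```

The cosine term cares about direction and the MSE term about magnitude, so together they pin both the geometry and the scale of the student's representation to the teacher's.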
Results
The distilled HelixNano-mRNA model demonstrates state-of-the-art performance on mRNA-related tasks, outperforming larger models while being significantly smaller (200-fold reduction in parameters). The stability of embedding-based distillation is confirmed through consistent improvements in training outcomes compared to logit-based methods.
Implications
The findings suggest that embedding-based distillation can facilitate the development of compact and efficient genomic models, making it feasible to conduct large-scale genomic analyses without the prohibitive computational costs associated with larger models. This has potential applications in various genomic tasks, including sequence generation, classification, and regression.
Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
Reinforcement Learning
Generative Models
Robotics
- TRFP integrates flow matching into MaxEnt RL, overcoming challenges of likelihood computation and backpropagation instability.
- The framework employs a hybrid architecture for sampling, allowing for tractable optimization and stable training.
- Flow straightening regularization enables high-fidelity one-step inference while minimizing surrogate divergence error.
- Empirical results show TRFP achieves state-of-the-art performance on MuJoCo benchmarks and competes well with existing diffusion policies.
Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling
Summary
The paper introduces the Truncated Rectified Flow Policy (TRFP), a novel framework for maximum entropy reinforcement learning (MaxEnt RL) that addresses the limitations of traditional Gaussian policy parameterization, which is unimodal and struggles with complex multimodal action distributions. The authors highlight two primary challenges in integrating generative policies based on diffusion and flow matching into MaxEnt RL: the intractability of likelihood computation and the instability of multi-step sampling. TRFP employs a hybrid deterministic-stochastic architecture to facilitate tractable entropy-regularized optimization while enabling stable training and effective one-step sampling through gradient truncation and flow straightening. The empirical evaluation demonstrates that TRFP effectively captures multimodal behavior, outperforms strong baselines across various benchmarks, and maintains competitive performance with one-step sampling, showcasing its potential for real-time applications in complex decision-making environments.
Methodology
The TRFP framework combines a deterministic ordinary differential equation (ODE) for the prefix of the sampling process with a stochastic differential equation (SDE) for the tail. This design allows for the formulation of a tractable surrogate log-likelihood. Additionally, a gradient truncation mechanism is implemented to stabilize training by optimizing only during the stochastic refinement phase, while flow straightening regularization is used to enforce linear transport trajectories.
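The hybrid sampler can be sketched with stand-in fields. This toy of ours keeps the structure described above: deterministic Euler steps along the flow for the prefix, and Euler–Maruyama steps with injected noise for the stochastic tail; the velocity field, noise scale, and split point are all illustrative assumptions rather than the trained policy.

```python
import numpy as np

def sample_action(z, velocity, sigma, t_split=0.8, n_steps=10, rng=None):
    """Integrate the flow from t=0 to t=1; inject noise only past t_split."""
    rng = rng or np.random.default_rng()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        z = z + velocity(z, i * dt) * dt              # deterministic drift (ODE)
        if (i + 1) / n_steps > t_split:               # stochastic refinement tail (SDE)
            z = z + sigma * np.sqrt(dt) * rng.normal(size=z.shape)
    return z

v = lambda z, t: -z                # stand-in for the learned velocity field
a = sample_action(np.ones(2), v, sigma=0.1, rng=np.random.default_rng(0))
```

Gradient truncation in this picture would mean backpropagating only through the tail steps, and flow straightening would regularize `v` toward linear transport so the whole prefix collapses into the single step needed for one-step inference.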
Results
The empirical evaluation on a toy multigoal environment and 10 MuJoCo benchmarks indicates that TRFP effectively captures multimodal behavior and outperforms strong baselines in most scenarios. The framework also demonstrates competitive performance under one-step sampling, highlighting its efficiency and practicality for real-time applications.
Implications
The TRFP framework has significant implications for reinforcement learning applications in robotics and autonomous systems, where the ability to model complex action distributions and maintain real-time performance is critical. Its integration of generative policies into MaxEnt RL could enhance adaptability and robustness in dynamic environments.
CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
Computer Vision
NLP
Multimodal
- CORA provides a formalized safety mechanism for mobile GUI automation using VLMs.
- The framework employs a Guardian model for risk estimation and a Diagnostician for intervention recommendations.
- CORA introduces a user-tunable execute/abstain threshold based on Conformal Risk Control.
- The Phone-Harm benchmark is established to evaluate mobile safety violations in real-world settings.
CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
Summary
The paper introduces CORA (COnformal Risk-controlled GUI Agent), a framework designed to enhance the safety of mobile GUI automation powered by vision language models (VLMs). As VLMs transition from passive assistance to autonomous operation, the risk of harmful actions increases, leading to potential financial, privacy, or social harm. Existing safety measures are inadequate, relying on prompt engineering and heuristics without formal verification. CORA addresses these issues by implementing a post-policy, pre-action safeguarding mechanism that provides statistical guarantees on the risks of executed actions. It employs a Guardian model to estimate the risk of proposed actions and utilizes Conformal Risk Control to establish a user-defined execute/abstain threshold. Actions deemed high-risk are routed to a Diagnostician model, which offers interventions to minimize user burden. The framework is rigorously evaluated using a new benchmark, Phone-Harm, which assesses mobile safety violations. Experimental results demonstrate that CORA effectively improves the safety-helpfulness-interruption trade-off, providing a statistically grounded safety paradigm for autonomous GUI execution.
Methodology
CORA consists of three main components: a Guardian model that estimates action-conditional risk, a Conformal Calibration module that establishes a user-defined execute/abstain threshold, and a Diagnostician model that provides interventions for rejected actions. The framework is trained end-to-end, allowing for adaptive decision-making based on user intent and risk assessment.
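The Conformal Calibration module can be sketched for a bounded 0/1 harm loss. This is our simplification of Conformal Risk Control, not CORA's implementation: `scores` stands in for Guardian risk estimates, and the threshold search picks the most permissive execute threshold whose inflated calibration risk stays within the user budget alpha.

```python
import numpy as np

def calibrate_threshold(scores, harmful, alpha, grid=None):
    """scores: risk estimates in [0, 1]; harmful: 0/1 outcomes on calibration data.
    Executing when score <= lam incurs the loss harmful * execute; CRC inflates
    the empirical risk by B/(n+1) with B = 1 to guarantee expected risk <= alpha."""
    n = len(scores)
    grid = np.linspace(0, 1, 101) if grid is None else grid
    best = 0.0                                    # fall back to always abstaining
    for lam in grid:                              # risk is nondecreasing in lam
        execute = scores <= lam
        risk = np.mean(harmful * execute)
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            best = lam                            # keep the largest feasible lam
    return best

rng = np.random.default_rng(0)
scores = rng.uniform(size=200)
harmful = (rng.uniform(size=200) < scores).astype(float)  # risky-looking acts are riskier
lam = calibrate_threshold(scores, harmful, alpha=0.1)
```

At deployment, actions scoring above `lam` are abstained on and routed to the Diagnostician; the user tunes the safety/helpfulness trade-off by moving alpha rather than by re-engineering prompts.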
Results
CORA was evaluated on the Phone-Harm benchmark and public datasets, showing significant improvements in the safety-helpfulness-interruption trade-off compared to various baseline methods. The results indicate that CORA can effectively reduce harmful actions while maintaining user assistance.
Implications
The CORA framework has the potential to enhance the safety of autonomous mobile applications, making them more reliable for users. It can be applied in various domains where GUI automation is critical, such as finance, healthcare, and personal assistance, ensuring user safety and privacy.
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
Large Language Models
Efficient ML
Optimization
- Introduces the concept of activation budget for expert activations in MoE models.
- Presents Alloc-MoE, a unified framework optimizing expert allocation at both layer and token levels.
- Alloc-L uses sensitivity profiling and dynamic programming for layer-level allocation.
- Alloc-T dynamically redistributes expert activations based on routing scores.
Read more
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
Summary
The paper presents Alloc-MoE, a novel framework designed to optimize expert activation allocation in Mixture-of-Experts (MoE) models under a constrained activation budget. MoE architectures are known for their sparse activation mechanism, which enhances the scalability of large language models. However, the high number of expert activations can lead to significant latency during inference, particularly in resource-limited environments. The authors introduce the concept of an 'activation budget' to manage the number of expert activations effectively. Alloc-MoE operates at both the layer and token levels to minimize performance degradation while adhering to the budget. The layer-level method, Alloc-L, employs sensitivity profiling and dynamic programming to determine optimal expert allocation across layers. The token-level method, Alloc-T, redistributes activations based on routing scores, ensuring efficient budget utilization without increasing latency. Experimental results demonstrate that Alloc-MoE maintains model performance while achieving notable speedups in inference, specifically 1.15× during prefill and 1.34× during decoding on the DeepSeek-V2-Lite model, even when using only half of the original activation budget.
Methodology
The methodology involves two main components: Alloc-L and Alloc-T. Alloc-L focuses on layer-level expert activation allocation by profiling layer sensitivity and solving an optimization problem using dynamic programming. Alloc-T reallocates expert activations at the token level based on routing scores, prioritizing tokens with less concentrated routing distributions to optimize performance under a fixed budget.
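The layer-level allocation can be framed as a small knapsack-style dynamic program over profiled per-layer losses. A hedged sketch of that optimization, not Alloc-L's actual code; the loss table, `k_min` floor, and backtracking are assumptions:

```python
import numpy as np

def alloc_layers(loss, budget, k_min=1):
    """loss[l, k] = profiled performance drop when layer l activates k experts
    (columns k = 0..K). Returns per-layer expert counts minimizing the total
    drop subject to a global activation budget."""
    L, K1 = loss.shape
    K = K1 - 1
    dp = np.full((L + 1, budget + 1), np.inf)   # dp[l, b]: best drop, b used
    dp[0, 0] = 0.0
    choice = np.zeros((L + 1, budget + 1), dtype=int)
    for l in range(1, L + 1):
        for b in range(budget + 1):
            for k in range(k_min, min(K, b) + 1):
                cand = dp[l - 1, b - k] + loss[l - 1, k]
                if cand < dp[l, b]:
                    dp[l, b], choice[l, b] = cand, k
    b = int(np.argmin(dp[L]))                   # best achievable budget use
    alloc = []
    for l in range(L, 0, -1):                   # backtrack the chosen counts
        alloc.append(int(choice[l, b]))
        b -= alloc[-1]
    return alloc[::-1]
```

The DP gives more experts to layers whose profiled loss falls steeply with activation count and starves insensitive layers.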
Results
The results indicate that Alloc-MoE effectively sustains model performance under a restricted activation budget. Specifically, it achieves a 1.15× speedup in prefill and a 1.34× speedup in decoding on the DeepSeek-V2-Lite model while using only half of the original expert activation budget.
Implications
The findings suggest that Alloc-MoE can enhance the deployment of MoE models in real-world applications where resource constraints are a concern, allowing for efficient inference without significant performance loss. This could lead to broader adoption of MoE architectures in various NLP tasks.
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
Robotics
Interpretability
Reinforcement Learning
- Introduction of an event-centric framework for world modeling in autonomous agents.
- Utilization of memory-augmented retrieval for decision-making based on prior experiences.
- Integration of physics-informed knowledge to enhance maneuver selection consistency.
- Demonstrated effectiveness in UAV flight scenarios under real-time control constraints.
Read more
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
Summary
This paper presents an innovative framework for autonomous agents operating in dynamic environments, emphasizing decision-making that is both efficient and interpretable. The proposed event-centric world modeling framework utilizes memory-augmented retrieval to represent the environment as a structured set of semantic events. These events are encoded into a permutation-invariant latent representation, allowing agents to retrieve prior experiences associated with specific maneuvers. The decision-making process involves computing actions as a weighted combination of these retrieved solutions, fostering transparency and interpretability. The framework integrates physics-informed knowledge to ensure that selected maneuvers align with observed system dynamics. Experimental evaluations in UAV flight scenarios demonstrate the framework's capability to operate under real-time constraints while maintaining consistent and interpretable behavior. This approach addresses the limitations of existing end-to-end learning methods, which often lack interpretability and fail to ensure physical consistency in decision-making.
Methodology
The methodology involves four core stages: perception and feature extraction from sensory inputs, event abstraction into a structured event list, compression into a latent event code, and retrieval-based decision-making using a physics-informed knowledge bank. The framework employs case-based reasoning to derive decisions from previously observed experiences, enhancing interpretability.
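A minimal sketch of the retrieval stage, assuming the latent event code is a mean-pooled (hence permutation-invariant) embedding and that actions are blended by softmax similarity over the nearest stored cases; all names and the temperature are illustrative:

```python
import numpy as np

def encode_events(event_embeddings):
    """Mean pooling: a simple permutation-invariant latent event code."""
    return np.asarray(event_embeddings, float).mean(axis=0)

def retrieve_action(event_code, memory_codes, memory_actions, k=3, temp=0.1):
    """Case-based action selection: find the k most similar stored event
    codes and blend their recorded actions, weighted by softmax similarity.
    Returns (action, retrieved indices, weights) for interpretability."""
    d = np.linalg.norm(memory_codes - event_code, axis=1)
    idx = np.argsort(d)[:k]
    w = np.exp(-d[idx] / temp)
    w /= w.sum()
    return w @ memory_actions[idx], idx, w
```

Exposing the retrieved indices and weights is what makes each decision traceable back to concrete prior experiences.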
Results
The experimental results indicate that the proposed framework successfully operates within real-time control constraints while providing interpretable and consistent decision-making behavior in UAV flight scenarios. The integration of physics-informed knowledge further supports the selection of appropriate maneuvers.
Implications
The framework has significant implications for the development of autonomous systems in safety-critical environments, such as UAVs, where reliable and interpretable decision-making is essential. It can be applied to various domains requiring dynamic environment modeling and decision-making under uncertainty.
GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control
Reinforcement Learning
Generative Models
Robotics
- GIRL introduces a cross-modal grounding mechanism to maintain semantic consistency in imagined trajectories.
- An uncertainty-adaptive trust-region bottleneck is used to control imagination drift based on real-environment feedback.
- The framework shows a 38-61% reduction in latent rollout drift compared to DreamerV3.
- GIRL achieves higher asymptotic returns with 40-55% fewer environment steps on long-horizon tasks.
Read more
GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control
Summary
The paper introduces GIRL (Generative Imagination Reinforcement Learning), a novel framework designed to enhance model-based reinforcement learning (MBRL) by addressing the issue of imagination drift during long-horizon planning. The authors identify that traditional MBRL approaches, such as DreamerV3, suffer from compounded model errors that lead to unreliable value estimates and policies. To mitigate these issues, GIRL employs two key innovations: (1) a cross-modal grounding signal from a frozen foundation model (DINOv2) that ensures imagined trajectories remain semantically consistent, and (2) an uncertainty-adaptive trust-region bottleneck that constrains imagination drift based on a learned trust region calibrated by Expected Information Gain and Relative Performance Loss. The theoretical contributions include a re-derivation of the value-gap bound that connects the I-ELBO objective to real-environment regret. Empirical evaluations across multiple benchmark suites demonstrate that GIRL significantly reduces latent rollout drift and achieves higher returns with fewer environment interactions compared to existing methods.
Methodology
GIRL employs a latent world-model framework that integrates a cross-modal grounding vector from DINOv2 into the transition prior, penalizing physics-defying hallucinations. It also reformulates the KL regularizer as a Lagrange multiplier in a constrained optimization problem, allowing for adaptive trust regions based on Expected Information Gain and Relative Performance Loss signals. The architecture includes a recurrent state-space model and a generative model for observations and rewards.
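The adaptive trust region can be caricatured as dual ascent on the KL multiplier. The schedule tying the KL target to the EIG and RPL signals below is an assumed stand-in for the paper's calibrated trust region, not its actual rule:

```python
import math

def adapt_trust_region(beta, kl, eig, rpl, kl_base=1.0, lr=0.05):
    """Illustrative uncertainty-adaptive trust region: the KL target widens
    when expected information gain (eig) is high and shrinks when relative
    performance loss (rpl) is high; beta is the Lagrange multiplier on the
    imagination-KL constraint, updated by dual ascent in log space."""
    target = kl_base * (1.0 + eig) / (1.0 + rpl)   # assumed schedule
    beta = beta * math.exp(lr * (kl - target))      # tighten past the region
    return max(min(beta, 1e3), 1e-4), target        # clamp for stability
```

When measured rollout KL exceeds the target, the penalty grows and imagination is reined in; when rollouts stay well inside it, the penalty relaxes and exploration of the latent model is allowed.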
Results
Experiments demonstrate that GIRL reduces latent rollout drift by 38-61% relative to DreamerV3 across various tasks. It achieves higher asymptotic returns with 40-55% fewer environment steps on tasks with horizons of 500 or more. Additionally, GIRL outperforms TD-MPC2 in sparse-reward and high-contact settings, as measured by Interquartile Mean and Probability of Improvement metrics.
Implications
The advancements presented in GIRL could lead to more robust and efficient reinforcement learning systems, particularly in environments where long-horizon planning is critical. The framework's ability to maintain semantic consistency and adaptively manage uncertainty may enhance applications in robotics, autonomous systems, and complex decision-making tasks.
Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
NLP
Large Language Models
- Introduces the first systematic definition of uncertainty quantification for LLM-based multi-agent systems.
- Presents MATU, a tensor decomposition-based framework for holistic uncertainty estimation.
- Addresses unique sources of uncertainty in multi-agent systems, including tool usage and multi-step reasoning.
- Demonstrates the effectiveness of MATU through extensive experiments across diverse tasks.
Read more
Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
Summary
This paper addresses the critical issue of uncertainty quantification (UQ) in Large Language Model (LLM)-based Multi-Agent Systems (MAS), which outperform single-agent systems in complex tasks but face unique reliability challenges due to intricate communication dynamics and role dependencies. Existing UQ methods, primarily designed for single-turn outputs, are inadequate for the complexities of MAS, which include cascading uncertainties from multi-step reasoning, variability in inter-agent communication paths, and diverse communication topologies. To tackle these challenges, the authors introduce MATU, a novel framework that utilizes tensor decomposition to quantify uncertainty. MATU represents entire reasoning trajectories as embedding matrices and organizes multiple execution runs into a higher-order tensor, allowing for the disentanglement and quantification of distinct sources of uncertainty. The framework provides a comprehensive reliability measure applicable across various agent structures. The authors conduct extensive experiments demonstrating that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies, highlighting the importance of addressing UQ in MAS.
Methodology
The authors developed MATU, a framework that represents reasoning trajectories as embedding matrices and organizes execution runs into a higher-order tensor. Tensor decomposition is then applied to quantify distinct sources of uncertainty, enabling a comprehensive reliability assessment at both response and system levels.
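A toy reading of the idea, assuming trajectory embeddings are stacked into a (runs × steps × dims) tensor and the residual energy outside the leading singular direction of each unfolding is read as that mode's uncertainty; MATU's actual decomposition is richer than this sketch:

```python
import numpy as np

def matu_uncertainty(runs):
    """runs: tensor of shape (R, T, D) — R execution runs, each a T-step
    trajectory of D-dim embeddings. Low-rank structure in a mode's unfolding
    means the runs agree along that mode; the energy outside the top singular
    direction is returned as that mode's uncertainty score in [0, 1]."""
    X = np.asarray(runs, float)
    scores = {}
    for mode, name in enumerate(("run", "step", "embedding")):
        M = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        s = np.linalg.svd(M, compute_uv=False)
        scores[name] = 1.0 - s[0] ** 2 / (s ** 2).sum()
    return scores
```

Identical runs collapse every unfolding to rank one (all scores near zero); disagreement between runs shows up specifically in the run-mode score, which is the disentanglement the tensor view buys.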
Results
The experiments conducted show that MATU effectively estimates uncertainty across various tasks and communication topologies, providing robust reliability measures that are generalizable across different agent structures.
Implications
The findings suggest that MATU can significantly improve the reliability of LLM-based multi-agent systems in critical applications such as healthcare, education, and scientific discovery, where uncertainty can lead to severe consequences.
An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
Reinforcement Learning
Large Language Models
Theory
- RLVR is robust to noise in verifier accuracy, with noise rates up to 15% showing minimal impact on performance.
- The study emphasizes the importance of precision over recall in the context of verifier accuracy.
- Imperfect verification does not pose a fundamental barrier to effective RLVR training.
- Findings suggest that engineering efforts to improve verifier accuracy beyond a certain point yield diminishing returns.
Read more
An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
Summary
This paper investigates the robustness of Reinforcement Learning with Verifiable Rewards (RLVR) in the presence of noisy reward signals, particularly focusing on the implications for training large language models (LLMs). The authors highlight that while RLVR has gained traction as a method for enhancing LLMs, the verifiers used to assess model outputs are often imperfect, leading to questions about the necessary accuracy of these verifiers for effective training. Through experiments in code generation and scientific reasoning, the authors introduce noise into the RL training process, revealing that noise rates of up to 15% do not significantly degrade peak validation accuracy compared to a clean baseline. This robustness is consistent across various noise types and model families, suggesting that the accuracy of verifiers can be less than perfect without fundamentally hindering RLVR performance. The findings advocate for a focus on achieving moderate accuracy with high precision in verifiers rather than striving for perfection, indicating that an imperfect verifier can still be effective in the RLVR framework.
Methodology
The authors conducted controlled experiments by introducing varying levels of noise into the reward signals during RL training, focusing on coding tasks that mirror rubric-based evaluations. They assessed the impact of this noise on model performance across different model families (Qwen3, GLM4, Llama 3.1) and sizes (4B to 9B). The experiments included both controlled noise patterns and realistic noise scenarios.
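The controlled noise injection amounts to flipping the verifier's binary verdict with a fixed probability before it is used as the reward. A minimal sketch (function name and interface assumed):

```python
import random

def noisy_verifier(true_pass, flip_rate, rng):
    """Symmetric label noise on a binary verifier: with probability
    flip_rate the verdict is inverted, as in controlled noise-injection
    experiments with rates up to 0.15."""
    return (not true_pass) if rng.random() < flip_rate else bool(true_pass)
```

Passing an explicit seeded `random.Random` keeps the injected noise pattern reproducible across training runs, which matters when comparing noise types at matched rates.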
Results
The experiments demonstrated that RLVR maintains peak validation accuracy within 2 percentage points of the clean baseline even with noise rates up to 15%. This robustness was consistent across different types of noise and model families, indicating that the models can effectively learn despite the presence of noisy rewards.
Implications
The findings suggest that practitioners in the field of RLVR can prioritize developing verifiers that are moderately accurate but highly precise, rather than striving for perfect verification. This could lead to more efficient training processes and broader applicability of RLVR in various domains, including semi-verifiable areas like finance and law.
Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning
Reinforcement Learning
Graph Learning
Robotics
- CLOVER leverages realistic wireless communication channels to enhance cooperation in MARL.
- The GNN-based centralized mixer is conditioned on the communication graph, allowing for better credit assignment.
- The framework achieves significant performance improvements over traditional methods like VDN and QMIX.
- Agents learn effective communication strategies, adapting to varying channel conditions.
Read more
Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning
Summary
This paper introduces CLOVER, a novel framework for multi-agent reinforcement learning (MARL) that enhances cooperation through realistic wireless communication. Traditional MARL approaches often assume idealized communication channels, neglecting the stochastic nature of real-world wireless environments. CLOVER addresses this gap by conditioning a centralized value mixer on the inter-agent communication graph, which reflects the actual communication that occurs among agents. The framework employs a Graph Neural Network (GNN)-based mixer, where the weights are generated by a Permutation-Equivariant Hypernetwork (PEHypernet). This design allows for multi-hop propagation of agent utilities along the communication graph, enabling a more nuanced credit assignment based on who successfully communicated with whom. The paper proves that the proposed mixer is permutation invariant and more expressive than existing QMIX-style mixers. Additionally, it formulates an augmented Markov Decision Process (MDP) to isolate stochastic channel effects from agent computations, facilitating end-to-end differentiable training. Experimental results on benchmarks such as Predator-Prey and Lumberjacks demonstrate that CLOVER significantly improves convergence speed and task performance compared to existing methods, particularly in challenging scenarios. Behavioral analyses reveal that agents develop effective communication strategies, adapting their signaling and listening behaviors to channel conditions.
Methodology
The methodology involves the development of a GNN-based centralized value mixer that utilizes a Permutation-Equivariant Hypernetwork to condition the mixing process on the realized communication graph. An augmented MDP is formulated to separate channel stochasticity from agent computations, and a stochastic receptive field encoder is employed for handling variable-size message sets, enabling end-to-end differentiable training.
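A stripped-down stand-in for the graph-conditioned mixer: per-agent utilities are propagated along a row-normalized version of the realized communication graph before aggregation, so credit flows only along links that actually carried messages. The hypernetwork-generated mixing weights are replaced here by plain non-negative averaging:

```python
import numpy as np

def mix_utilities(q, adj, hops=2):
    """Propagate per-agent utilities along the realized communication graph
    for `hops` rounds, then aggregate into a joint value. adj[i, j] = 1 iff
    agent j's message reached agent i in this step."""
    h = np.asarray(q, float)
    A = np.asarray(adj, float) + np.eye(len(h))  # keep each agent's own utility
    A = A / A.sum(axis=1, keepdims=True)         # non-negative, row-normalized
    for _ in range(hops):
        h = A @ h                                # multi-hop credit propagation
    return float(h.sum())
```

With no communication the mixer degenerates to a VDN-style sum of utilities; denser graphs blend utilities across connected agents over multiple hops before summation.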
Results
CLOVER consistently outperformed existing MARL methods, such as VDN and QMIX, in terms of convergence speed and final task performance across various benchmarks. The framework showed larger gains in more complex environments, and behavioral analysis confirmed that agents effectively learned positive signaling and listening strategies.
Implications
The findings suggest that incorporating realistic communication models into MARL can significantly enhance agent cooperation and performance in complex environments, making CLOVER applicable to real-world scenarios such as robotics, search-and-rescue operations, and other multi-agent systems requiring reliable communication.
$p1$: Better Prompt Optimization with Fewer Prompts
NLP
Large Language Models
Reinforcement Learning
Optimization
- Prompt optimization's effectiveness is influenced by the variance among system prompts versus responses.
- Increasing the number of user prompts can reduce the effectiveness of prompt optimization.
- The proposed p1 method filters user prompts to enhance the optimization signal.
- p1 shows substantial improvements over traditional prompt optimization methods.
Read more
$p1$: Better Prompt Optimization with Fewer Prompts
Summary
This paper investigates the effectiveness of prompt optimization for language models, which enhances their performance without altering their weights by finding better system prompts. The authors analyze the variance in rewards across different system prompts, identifying two components: variance among responses and variance among system prompts. They find that prompt optimization is successful when the variance among system prompts is significant compared to the variance among responses. Interestingly, increasing the number of user prompts can diminish the variance among system prompts, particularly in heterogeneous datasets, leading to poorer optimization outcomes. To address this, the authors propose 'p1', a user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This approach enhances the optimization signal, allowing for more effective prompt optimization. Experimental results demonstrate that p1 significantly outperforms traditional methods, even when trained on a minimal number of prompts, and shows strong generalization across various reasoning benchmarks.
Methodology
The authors employ a reinforcement learning framework for prompt optimization, where a prompt-generation policy proposes candidate system prompts and the reward is based on the accuracy of a frozen language model. They analyze the variance in rewards to understand the conditions under which prompt optimization succeeds or fails. The p1 method is introduced as a filtering technique to select user prompts that maximize variance across system prompts, thereby strengthening the optimization signal.
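The filtering rule itself is simple to sketch: score each user prompt by the variance of rewards across candidate system prompts and keep the top k. The reward-matrix layout below is an assumption about how the statistics are tabulated:

```python
import numpy as np

def p1_filter(rewards, k):
    """rewards[i, j] = mean reward of candidate system prompt i on user
    prompt j. Keep the k user prompts whose reward varies most across
    system prompts — the ones that actually discriminate between candidates."""
    var = np.var(rewards, axis=0)        # variance over system prompts
    return np.argsort(var)[::-1][:k]     # indices of the k most informative
```

A user prompt every candidate answers equally well (or equally badly) has zero variance and contributes no optimization signal, so dropping it concentrates the gradient on prompts that separate good system prompts from bad ones.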
Results
The experiments reveal that the p1 method significantly enhances prompt optimization performance, particularly on reasoning benchmarks. Notably, training with just two prompts from a specific dataset resulted in a system prompt that effectively generalized to other benchmarks, outperforming strong baselines such as GEPA.
Implications
The findings suggest that optimizing prompts with fewer, more informative user prompts can lead to better performance in language models, potentially reducing computational costs and improving efficiency in prompt optimization tasks. This approach could be beneficial in various applications involving large language models, especially in heterogeneous task settings.
Shift- and stretch-invariant non-negative matrix factorization with an application to brain tissue delineation in emission tomography data
Time Series
- Introduction of a shift- and stretch-invariant NMF framework for dynamic neuroimaging data.
- The model estimates both temporal shifts and stretching effects, improving the accuracy of TAC analysis.
- Implementation in the frequency domain enhances computational efficiency.
- Validation on synthetic and real SPECT data shows improved delineation of brain tissue structures.
Read more
Shift- and stretch-invariant non-negative matrix factorization with an application to brain tissue delineation in emission tomography data
Summary
This paper introduces a novel framework for non-negative matrix factorization (NMF) that is both shift- and stretch-invariant, specifically designed to enhance the analysis of dynamic neuroimaging data, such as emission tomography. Traditional NMF methods often struggle with the challenges posed by temporal delays and stretching effects in time-activity curves (TACs) due to physiological variability. The proposed method estimates both integer and non-integer temporal shifts and incorporates temporal stretching, allowing for a more accurate representation of the underlying brain tissue dynamics. The authors implemented the model in the frequency domain, which improves computational efficiency and enables the handling of shifts as phase modifications and stretching through zero-padding or truncation. The effectiveness of the model was validated using synthetic data and real brain SPECT data, demonstrating its capability to provide a more detailed characterization of brain tissue structures compared to conventional methods.
Methodology
The proposed method combines shift-invariance and stretch-invariance in the NMF framework. It operates in the frequency domain, where temporal shifts are represented as phase modifications and stretching is handled through frequency scaling. The model optimizes a loss function that accounts for these adjustments, allowing for accurate alignment of TACs across different brain tissues.
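The frequency-domain handling of shifts can be illustrated with the DFT shift theorem: a (possibly non-integer) delay becomes a linear phase ramp, which keeps the shift differentiable in the delay parameter. This sketch covers only the shift part; stretching would additionally rescale the frequency axis:

```python
import numpy as np

def fractional_shift(x, delay):
    """Apply a (possibly non-integer) circular temporal shift to a signal by
    multiplying its DFT by a linear phase ramp, per the DFT shift theorem."""
    n = len(x)
    freqs = np.fft.fftfreq(n)                      # k/n for each DFT bin
    X = np.fft.fft(x)
    return np.real(np.fft.ifft(X * np.exp(-2j * np.pi * freqs * delay)))
```

For integer delays this reproduces a circular `np.roll`; for fractional delays it interpolates between samples, which is what lets the model align TACs with sub-frame precision.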
Results
The model demonstrated superior performance in characterizing brain tissue structures in both synthetic and real SPECT data, effectively accounting for the stretching and shifting of TACs that traditional methods could not manage. This led to improved delineation of brain regions and enhanced understanding of tracer kinetics.
Implications
The shift- and stretch-invariant NMF framework has significant implications for neuroimaging analysis, potentially leading to better diagnostic tools and insights into brain function. It can be applied in various medical imaging contexts where dynamic data is analyzed, improving the accuracy of regional signal extraction and tissue characterization.
Stochastic-Dimension Frozen Sampled Neural Network for High-Dimensional Gross-Pitaevskii Equations on Unbounded Domains
Optimization
Efficient ML
Theory
- Introduces SD-FSNN, a sampled neural network that is unbiased across dimensions and whose computational cost is independent of dimensionality.
- Random sampling of weights and biases leads to faster training and improved accuracy.
- Employs adaptive ODE solvers for effective temporal evolution and causality.
- Integrates mechanisms for mass normalization and energy conservation.
Read more
Stochastic-Dimension Frozen Sampled Neural Network for High-Dimensional Gross-Pitaevskii Equations on Unbounded Domains
Summary
This paper introduces the Stochastic-Dimension Frozen Sampled Neural Network (SD-FSNN) as a novel approach to solving high-dimensional Gross-Pitaevskii equations (GPEs) on unbounded domains. The SD-FSNN is designed to be unbiased across all dimensions, with a computational cost that remains independent of the dimensionality, thus avoiding the exponential growth in resource requirements typical of Hermite-basis discretizations. By randomly sampling the hidden weights and biases, the SD-FSNN significantly enhances training speed and accuracy compared to traditional iterative, gradient-based optimization methods. The methodology incorporates a space-time separation strategy, utilizing adaptive ordinary differential equation (ODE) solvers to update evolution coefficients while respecting temporal causality. To maintain the structural integrity of the GPEs, the network integrates a Gaussian-weighted ansatz to ensure exponential decay at infinity, includes a normalization projection layer for mass conservation, and applies an energy conservation constraint to reduce long-term numerical dissipation. Experimental results demonstrate that the SD-FSNN outperforms existing methods across various spatial dimensions and interaction parameters, achieving better accuracy and faster training times while specifically addressing the challenges of high-dimensional GPEs on unbounded domains.
Methodology
The SD-FSNN employs stochastic sampling of weights and biases, a space-time separation strategy with adaptive ODE solvers, and integrates a Gaussian-weighted ansatz for structural preservation of GPEs. It also includes a normalization projection layer and energy conservation constraints to mitigate numerical dissipation.
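The "frozen sampled" core is essentially a random-feature network: hidden weights and biases are drawn once and fixed, and only the linear output coefficients are solved for, replacing gradient-based training of the hidden layer. The Gaussian-weighted ansatz and conservation layers are omitted in this sketch, and all names are illustrative:

```python
import numpy as np

def fit_sampled_network(X, y, width=200, seed=0):
    """Sampled network: hidden weights W and biases b are drawn once and
    frozen; the output coefficients are obtained in closed form by least
    squares on the random tanh features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], width))
    b = rng.normal(size=width)
    H = np.tanh(X @ W + b)                        # frozen random features
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)  # linear solve, no SGD
    return lambda Z: np.tanh(Z @ W + b) @ coef
```

Because the only trained parameters sit in a single linear solve, fitting is fast and deterministic given the seed, which is the speed/accuracy advantage the paper contrasts with iterative gradient-based optimization.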
Results
The SD-FSNN shows significant improvements in training time and accuracy over existing methods, with comparative experiments indicating superior performance across various spatial dimensions and interaction parameters. The dependence of its computational complexity on the spatial dimension is reduced from linear to constant.
Implications
The SD-FSNN has potential applications in quantum simulations and modeling of ultracold atomic systems, providing a more efficient and reliable numerical method for high-dimensional GPEs, which could enhance the fidelity of simulations in experimental contexts.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
NLP
Large Language Models
Reinforcement Learning
- Introduction of the Guardian-as-an-Advisor (GaaA) framework for LLMs.
- Development of GuardSet, a large-scale dataset for training guardian models.
- GuardAdvisor model trained to provide risk labels and explanations, enhancing response quality.
- GaaA reduces unnecessary refusals while maintaining low latency.
Read more
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Summary
This paper introduces the Guardian-as-an-Advisor (GaaA) framework, which aims to enhance the trustworthiness of large language models (LLMs) by addressing the limitations of existing guardian models that often rely on hard-gated safety checks. The authors argue that traditional models either over-refuse queries or misalign with the intended model specifications, leading to reduced utility. GaaA employs a soft-gating mechanism where a guardian model predicts a binary risk label and provides a concise explanation, which is then prepended to the original user query for re-inference. This approach allows the base model to operate within its original specifications while improving the quality of responses. To support this framework, the authors constructed GuardSet, a multi-domain dataset containing over 208,000 examples of harmful and harmless cases, specifically targeting robustness and honesty. The GuardAdvisor model is trained using supervised fine-tuning followed by reinforcement learning to ensure consistency between risk labels and explanations. Experimental results demonstrate that GuardAdvisor achieves competitive detection accuracy, significantly reduces unnecessary refusals, and maintains low latency, thus preserving the model's adherence to specifications while enhancing output quality.
Methodology
The authors developed the GaaA framework, which utilizes a soft-gating mechanism to provide risk labels and explanations without blocking model outputs. They constructed the GuardSet dataset through a three-stage pipeline involving collection, processing, and validation. The GuardAdvisor model was trained using a two-stage approach: supervised fine-tuning followed by reinforcement learning to ensure label-explanation consistency.
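The soft gate itself reduces to prepending the guardian's verdict and rationale to the query before re-inference, rather than blocking the output. The advisory wording below is an assumption, not GaaA's actual template:

```python
def soft_gate(query, risk_label, explanation):
    """Soft gating: instead of refusing, prepend the guardian's binary risk
    label and a one-line rationale to the user query, then let the base
    model re-infer under its own specification."""
    if risk_label == "safe":
        return query                    # no intervention for benign queries
    advisory = (f"[Guardian advisory: this request was flagged as "
                f"'{risk_label}'. Rationale: {explanation} "
                f"Respond in line with your safety specification.]\n")
    return advisory + query
```

Since the base model still sees the original query verbatim, it stays within its own specification; the advisory only adds context, which is how the design avoids hard-gated over-refusal.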
Results
GuardAdvisor achieved detection performance comparable to proprietary models, significantly reduced unnecessary refusals, and added minimal latency (2-10% overhead) while maintaining adherence to the original model specifications.
Implications
The GaaA framework can improve the trustworthiness of LLMs in various applications, including search, coding, healthcare, and productivity, by providing clearer guidance on risks associated with user queries. This could lead to safer and more effective interactions with AI systems.