AI-generated summaries
Today's ML research,
without the noise.
Daily summaries of the latest machine learning papers from arXiv, processed every 8 hours.
48
Papers today
8h
Update frequency
7
Days of history
Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus
Theory
Efficient ML
Computer Vision
- Self-distillation can lead to significant improvements in representation learning even without complex mechanisms.
- A minimal setup with randomly initialized networks can outperform random baselines on tasks like CIFAR-10 classification.
- Learning dynamics are sensitive to hyperparameters such as learning rate and architecture.
- The proposed method avoids representational collapse, maintaining stability during training.
Read more
Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus
Summary
This paper investigates the role of self-distillation in self-supervised learning (SSL) by examining a minimal setup involving randomly initialized networks. The authors aim to isolate the effects of self-distillation by removing complex components typically found in state-of-the-art methods, such as projectors, predictors, and pretext tasks. The study reveals that even in this simplified framework, networks can learn meaningful representations that outperform a random baseline on downstream tasks like CIFAR-10 classification. The results indicate that the improvements in representation learning are influenced by hyperparameters such as learning rate and model architecture. The authors also provide insights into the nature of the learned representations and the stability of the learning mechanism, which avoids representational collapse. This work contributes to a better understanding of self-distillation dynamics and highlights the potential of peer-to-peer learning among randomly initialized networks.
Methodology
The authors developed a framework called DINOHerd, which consists of a group of randomly initialized neural networks. In each training batch, one network acts as a student while others serve as teachers, all processing the same input data. The student learns to minimize the difference between its output and that of the teachers using a loss function, with backpropagation occurring only through the student. This peer-to-peer dynamic allows for a flexible and simplified learning environment.
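A single peer-to-peer step can be sketched in a few lines. The linear "networks", batch size, temperature, and cross-entropy target below are illustrative stand-ins, not the paper's actual architectures or loss:

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Five randomly initialized linear "networks" over the same input batch.
n_peers, dim, n_out, batch = 5, 32, 10, 8
weights = [rng.normal(scale=0.1, size=(dim, n_out)) for _ in range(n_peers)]
x = rng.normal(size=(batch, dim))

# Peer 0 acts as the student this batch; the rest are frozen teachers.
student_logits = x @ weights[0]
teacher_probs = np.mean(
    [softmax(x @ w) for w in weights[1:]], axis=0  # consensus target
)

# Cross-entropy of the student against the teachers' averaged distribution;
# in the real method this loss is backpropagated through the student only.
student_probs = softmax(student_logits)
loss = -np.mean(np.sum(teacher_probs * np.log(student_probs + 1e-9), axis=-1))
```

Rotating which peer plays the student each batch gives the "herd" its peer-to-peer dynamic.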
Results
The experiments demonstrated that the randomly initialized networks could learn representations that significantly improved performance on downstream tasks compared to a random baseline. The effectiveness of the learning was found to vary with different hyperparameters, and the approach successfully avoided representational collapse, indicating stable learning dynamics.
Implications
The findings suggest that self-distillation can be effectively utilized in SSL without the need for complex architectures or mechanisms. This could lead to more efficient and interpretable models in various applications, particularly in domains where labeled data is scarce. The insights gained from this study may inform future research on simplifying SSL methods and enhancing their performance.
SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
Generative Models
Computer Vision
Multimodal
- SetFlow effectively models entire MIL bags, capturing intra-bag dependencies.
- The architecture combines flow matching with a Set Transformer design for permutation-invariant inputs.
- Evaluation on mammography data shows improved performance in classification tasks.
- Synthetic data generated by SetFlow can compete with real data, highlighting its utility in data-scarce scenarios.
Read more
SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
Summary
The paper introduces SetFlow, a novel generative architecture designed to enhance Multiple Instance Learning (MIL) by directly modeling entire bags of instances in the representation space. Traditional MIL methods often struggle with data scarcity and weak supervision, particularly in domains like mammography, where labels are provided only at the bag level. SetFlow addresses these challenges by leveraging a flow matching paradigm combined with a Set Transformer-inspired design, allowing it to capture intra-bag dependencies and generate coherent sets of representations. The architecture is conditioned on class labels and input scale, facilitating the generation of semantically consistent representations. Evaluations on a large-scale mammography benchmark demonstrate that SetFlow-generated samples closely match the original data distribution and improve downstream performance when used for data augmentation. Additionally, training solely on synthetic data yields competitive results, showcasing the potential of representation-space generative modeling in data-scarce and privacy-sensitive applications.
Methodology
SetFlow employs a generative architecture that integrates flow matching and Set Transformer principles. It processes both global and local streams of embeddings from mammography images, capturing marginal instance distributions and instance interactions within bags. The model is conditioned on class labels and input scales to generate structured sets of representations.
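The flow-matching objective underlying such models has a compact form: sample a time t, interpolate between noise and data, and regress a velocity field onto the straight-line target. The sketch below shows only this regression target for a toy bag of instance embeddings; the set-level network, label conditioning, and Set Transformer blocks are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Flow matching regresses a velocity field onto straight-line paths from
# noise x0 to data x1: x_t = (1 - t) * x0 + t * x1, target v = x1 - x0.
bag_size, embed_dim = 6, 16                   # one MIL bag of embeddings
x1 = rng.normal(size=(bag_size, embed_dim))   # "data" representations
x0 = rng.normal(size=(bag_size, embed_dim))   # Gaussian noise
t = rng.uniform(size=(bag_size, 1))

x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

# A trained set-level network v_theta(x_t, t, label) would be regressed
# onto v_target; a zero predictor gives the trivial baseline loss.
v_pred = np.zeros_like(v_target)
loss = np.mean((v_pred - v_target) ** 2)
```

At sampling time, integrating the learned velocity field from noise produces a whole bag of representations at once, which is what lets SetFlow capture intra-bag dependencies.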
Results
SetFlow was evaluated using a state-of-the-art MIL-PF classification pipeline on mammography data. The results indicated that the synthetic samples generated by SetFlow closely matched the original data distribution and led to improved classification performance when used for augmentation. Furthermore, training exclusively on synthetic data yielded competitive results, demonstrating the effectiveness of the proposed method.
Implications
The findings suggest that SetFlow can significantly enhance the performance of machine learning models in data-scarce environments, particularly in medical imaging applications like mammography. This approach could facilitate better utilization of weakly labeled data and improve diagnostic accuracy in clinical settings.
Forecasting Ionospheric Irregularities on GNSS Lines of Sight Using Dynamic Graphs with Ephemeris Conditioning
Graph Learning
Time Series
Optimization
- Introduces a dynamic graph model for ionospheric forecasting, addressing limitations of gridded data.
- Employs ephemeris conditioning to leverage predictable satellite trajectories for improved forecasting.
- Achieves significant performance improvements over traditional persistence models in predicting ionospheric irregularities.
- Demonstrates the model's robustness under simulated coverage dropout through spatial message passing.
Read more
Forecasting Ionospheric Irregularities on GNSS Lines of Sight Using Dynamic Graphs with Ephemeris Conditioning
Summary
This paper presents a novel approach to forecasting ionospheric irregularities affecting Global Navigation Satellite Systems (GNSS) by modeling the ionosphere as a dynamic graph over ionospheric pierce points (IPPs). Unlike traditional data-driven models that rely on gridded products, which can obscure the time-varying nature of satellite observations, the proposed method, IonoDGNN, leverages the predictable trajectories of satellites to construct a dynamic graph that evolves over time. This graph structure allows for 'ephemeris conditioning', enabling the model to predict irregularities on lines of sight that may only appear during the forecast horizon. The authors evaluate their framework using multi-GNSS data from a co-located receiver pair in Singapore, focusing on forecasting the Rate of TEC Index (ROTI)-defined irregularities at a 5-minute cadence up to 2 hours ahead. The results demonstrate significant improvements over persistence models, achieving a Brier Skill Score (BSS) of 0.49 and a precision-recall area under the curve (PR-AUC) of 0.75, with notable gains at longer lead times. The study highlights the importance of dynamic graph structures and ephemeris conditioning in enhancing predictive performance, especially in regions with sparse observation coverage.
Methodology
The authors developed IonoDGNN, a dynamic graph neural network that operates on GNSS lines of sight, where nodes represent IPPs and edges encode spatial relationships. The model incorporates ephemeris conditioning to predict future graph structures based on satellite trajectories, allowing for real-time forecasting of ionospheric irregularities.
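The core idea of ephemeris conditioning is that, because satellite orbits are known, the node set and graph structure at a future timestep can be computed before its observations exist. A minimal sketch with toy IPP coordinates and a hypothetical distance threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

def build_graph(ipp_positions, radius=2.0):
    """Adjacency over ionospheric pierce points: connect IPPs within
    `radius` (hypothetical threshold) of each other."""
    d = np.linalg.norm(
        ipp_positions[:, None, :] - ipp_positions[None, :, :], axis=-1
    )
    return (d < radius) & ~np.eye(len(ipp_positions), dtype=bool)

# Ephemeris conditioning: satellite trajectories are predictable, so the
# graph for a forecast-horizon timestep can be built in advance.
ipps_now = rng.uniform(0, 10, size=(8, 2))        # (lat, lon) per line of sight
drift = rng.normal(scale=0.5, size=(8, 2))        # known per-satellite IPP motion
ipps_future = ipps_now + drift                    # future node positions

adj_now = build_graph(ipps_now)
adj_future = build_graph(ipps_future)             # structure known ahead of time
```

Spatial message passing over these edges is what lets the model retain skill when some lines of sight drop out.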
Results
The IonoDGNN model achieved a Brier Skill Score (BSS) of 0.49 and a precision-recall area under the curve (PR-AUC) of 0.75, outperforming persistence models by 35% in BSS and 52% in PR-AUC. The model maintained predictive skill even under simulated coverage dropout, demonstrating its effectiveness in handling sparse data.
Implications
This research suggests that dynamic graph-based forecasting can serve as a viable alternative to traditional grid-based methods for ionospheric irregularity prediction, with potential applications in GNSS signal integrity monitoring and space weather forecasting.
Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models
Generative Models
Time Series
Interpretability
- Fused code-value tokenization yields significant improvements in clinical outcome predictions.
- Decile-based quantization is more effective than finer bins under a one-epoch training budget.
- Event order and admission-relative RoPE encoding can replace time tokens without loss of performance.
- CLIF remapping maintains model performance while providing a smaller, interpretable token set for multi-site use.
Read more
Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models
Summary
This paper investigates the impact of input representation choices on the performance of generative medical event models, particularly in the context of clinical predictions. The authors conduct a series of experiments using 28 matched transformer models trained on the MIMIC-IV dataset, focusing on how different tokenization strategies affect downstream predictions after a fixed one-epoch pretraining budget. The study evaluates three main aspects: quantization granularity and code-value fusion, value and temporal encoding, and the effect of remapping laboratory and vital codes to a standardized vocabulary (CLIF). The findings reveal that fused code-value tokenization significantly improves prediction accuracy for clinical outcomes, such as mortality and hospital length-of-stay. Additionally, the results indicate that certain encoding strategies can enhance model performance while maintaining efficiency. The paper emphasizes the importance of revisiting tokenization choices in clinical machine learning pipelines, suggesting that thoughtful representation decisions can lead to substantial improvements in predictive performance.
Methodology
The authors trained 28 matched transformer models on the MIMIC-IV dataset, conducting three experiments to evaluate the effects of different tokenization strategies, value encodings, and vocabulary remapping on the performance of generative medical event models. Each experiment assessed 29 clinical outcomes, focusing on how representation decisions influence predictive accuracy.
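Fused code-value tokenization with decile quantization can be illustrated in a few lines. The lab code, values, and token format below are hypothetical, not MIMIC-IV specifics:

```python
import numpy as np

def decile_bin(value, edges):
    """Map a lab value to a decile index 0-9 given precomputed cut points."""
    return int(np.searchsorted(edges, value, side="right"))

# Hypothetical lactate measurements from a training split.
train_values = np.array([0.5, 0.8, 1.0, 1.1, 1.3, 1.6, 2.0, 2.8, 4.0, 6.5])
edges = np.percentile(train_values, np.arange(10, 100, 10))  # 9 cut points

# Fused code-value tokenization: one token per (code, decile) pair,
# instead of separate "LAB_LACTATE" and "VALUE_Q7" tokens.
code = "LAB_LACTATE"
value = 2.4
fused_token = f"{code}|Q{decile_bin(value, edges)}"
```

The fused vocabulary is larger than a split one, but each event becomes a single token, which is the representation the paper found most predictive under a one-epoch budget.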
Results
The study found that fused code-value tokenization improved mortality AUROC from 0.891 to 0.915 and hospital length-of-stay AUROC from 0.763 to 0.788. The use of event order and admission-relative RoPE encoding shortened sequences by approximately 11% while maintaining performance. CLIF remapping preserved performance in a single-site setting and provided a clinically interpretable token set.
Implications
The findings suggest that careful consideration of input representation can lead to better predictive performance in clinical machine learning applications. This has implications for the design of generative models in healthcare, particularly in enhancing clinical decision support systems and facilitating multi-site research.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
Reinforcement Learning
Large Language Models
NLP
- HEAL addresses entropy collapse in few-shot RLVR, enhancing exploration diversity.
- The framework incorporates high-value general-domain data to improve reasoning patterns.
- Entropy Dynamics Alignment (EDA) aligns entropy dynamics between target and general domains.
- HEAL achieves performance comparable to full-shot RLVR with significantly fewer samples.
Read more
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
Summary
This paper addresses the challenges of Reinforcement Learning with Verifiable Reward (RLVR) in low-resource scenarios, where traditional methods suffer from entropy collapse, limiting exploration and reasoning performance. The authors propose a novel framework called Hybrid-domain Entropy dynamics ALignment (HEAL), which enhances exploration in few-shot RLVR by incorporating general-domain data and introducing a new reward mechanism termed Entropy Dynamics Alignment (EDA). HEAL selectively integrates high-value general-domain samples to promote diverse exploration and aligns trajectory-level entropy dynamics between target and general domains. This dual approach mitigates entropy collapse and encourages the policy to adopt diverse exploratory behaviors. Experimental results across multiple domains, including Medicine, Physics, Code, and Math, demonstrate that HEAL significantly improves few-shot RLVR performance, achieving results comparable to full-shot training with only 32 target-domain samples. The framework outperforms existing entropy regularization methods in low-resource settings, showcasing its effectiveness in enhancing reasoning capabilities in scenarios with limited training data.
Methodology
The HEAL framework consists of two main components: (1) the incorporation of high-value general-domain data to promote diverse exploration, and (2) the Entropy Dynamics Alignment (EDA) mechanism, which aligns trajectory-level entropy dynamics between target and general domains. EDA captures both the magnitude and variation of entropy, rewarding trajectories that exhibit inter-domain similarity to encourage exploration.
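A toy version of the alignment idea: summarize each trajectory's entropy sequence by its magnitude and variation, then reward inter-domain similarity. The exponential form and the particular summary statistics are assumptions for illustration, not the paper's exact EDA reward:

```python
import numpy as np

def entropy_dynamics(traj_entropies):
    """Summarize a trajectory's token-entropy sequence by its magnitude
    (mean) and variation (mean absolute step-to-step change)."""
    e = np.asarray(traj_entropies, dtype=float)
    return e.mean(), np.abs(np.diff(e)).mean()

def eda_reward(target_traj, general_traj, scale=1.0):
    """Hypothetical alignment reward: larger when the target-domain
    trajectory's entropy dynamics resemble the general-domain ones."""
    mt, vt = entropy_dynamics(target_traj)
    mg, vg = entropy_dynamics(general_traj)
    dist = abs(mt - mg) + abs(vt - vg)
    return float(np.exp(-scale * dist))

# A healthy trajectory tracks general-domain dynamics; a collapsed one
# (entropy near zero throughout) is penalized.
aligned = eda_reward([2.0, 1.8, 1.7, 1.5], [2.1, 1.9, 1.6, 1.5])
collapsed = eda_reward([0.2, 0.1, 0.05, 0.02], [2.1, 1.9, 1.6, 1.5])
```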
Results
Experiments show that HEAL consistently improves few-shot RLVR performance across various domains. Notably, with only 32 target-domain samples, HEAL matches or surpasses the performance of full-shot RLVR trained with 1K target-domain samples. The framework also demonstrates superior performance compared to existing entropy regularization methods in low-resource settings.
Implications
The findings suggest that HEAL can be effectively applied in real-world scenarios with limited training data, such as medical reasoning and specialized knowledge domains, enhancing the capabilities of reasoning-oriented large language models in low-resource environments.
Covariance-Based Structural Equation Modeling in Small-Sample Settings with p > n
Theory
- Introduces a novel estimation principle for covariance-based SEM in small-sample settings with p > n.
- Reformulates covariance structures into self-covariance and cross-covariance components.
- Demonstrates improved stability in estimating structural parameters' sign and direction.
- Validates the proposed method through experiments on synthetic and real-world data.
Read more
Covariance-Based Structural Equation Modeling in Small-Sample Settings with p > n
Summary
This paper addresses the challenges of Factor-based Structural Equation Modeling (SEM) in small-sample settings where the number of observed variables (p) exceeds the sample size (n). Traditional covariance-based SEM relies on a nonsingular sample covariance matrix, which becomes problematic in these scenarios. The authors propose a novel estimation principle that reformulates the covariance structure into self-covariance and cross-covariance components. This new framework defines a likelihood-based feasible set combined with a relative error constraint, facilitating stable estimation of structural parameters' sign and direction even when p > n. The methodology is validated through experiments on both synthetic and real-world datasets, demonstrating improved stability in recovering the structural parameters. The findings extend the applicability of covariance-based SEM to small-sample contexts, providing valuable directional insights for decision-making processes.
Methodology
The authors develop a new estimation framework that separates the covariance structure into self-covariance and cross-covariance components. This approach incorporates a likelihood-based feasible set with a relative error constraint, allowing for stable estimation in scenarios where the number of variables exceeds the sample size.
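Both the p > n failure mode and the block decomposition can be seen directly. The block sizes below are arbitrary, and the paper's likelihood-based feasible set and relative error constraint are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# p > n: 40 observed variables, 20 samples -> singular sample covariance.
p, n = 40, 20
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)
assert np.linalg.matrix_rank(S) < p   # classical ML fitting of SEM breaks here

# The reformulation (sketched): split the covariance of two indicator
# blocks into self-covariance and cross-covariance components.
p1 = 15                                # indicators of the first latent factor
S11 = S[:p1, :p1]                      # self-covariance, block 1
S22 = S[p1:, p1:]                      # self-covariance, block 2
S12 = S[:p1, p1:]                      # cross-covariance between blocks
```

Estimation then constrains each component separately rather than inverting the full (singular) matrix.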
Results
The experiments conducted show that the proposed method significantly enhances the stability of parameter estimation, particularly in recovering the sign and direction of structural parameters in small-sample settings with p > n. The results indicate that the new framework can effectively handle the singularity issues of the sample covariance matrix.
Implications
The findings suggest that the proposed method can be a practical tool for researchers and practitioners dealing with high-dimensional data in small-sample contexts, enabling more reliable decision-making based on structural equation models.
Detecting and Suppressing Reward Hacking with Gradient Fingerprints
Reinforcement Learning
Large Language Models
Interpretability
- Introduction of Gradient Fingerprint (GRIFT) for detecting reward hacking in RLVR.
- GRIFT outperforms existing methods like CoT Monitor and TRACE by over 25% in detection accuracy.
- Integration of GRIFT into training processes can suppress reward hacking and improve task performance.
- The method utilizes gradient-based representations to assess the quality of reasoning traces.
Read more
Detecting and Suppressing Reward Hacking with Gradient Fingerprints
Summary
This paper addresses the issue of reward hacking in reinforcement learning with verifiable rewards (RLVR), where models exploit loopholes in reward functions to achieve high scores without solving the intended tasks. The authors introduce a novel method called Gradient Fingerprint (GRIFT) that detects reward hacking by analyzing the internal computations of models rather than relying solely on their generated text. GRIFT computes gradients of the model's chain-of-thought (CoT) given a prompt, compressing these gradients into a compact representation that serves as a fingerprint for assessing whether the CoT reflects reward hacking behavior. The method is evaluated across various reasoning benchmarks, including math, code, and logical reasoning, demonstrating over 25% relative improvement in detecting reward hacking compared to existing baselines. Additionally, GRIFT can be integrated into a rejection fine-tuning training pipeline, effectively suppressing reward hacking and enhancing performance on the intended tasks. The findings suggest that gradient-level representations can be a reliable signal for evaluating the quality of reasoning traces in RLVR.
Methodology
The GRIFT method involves computing gradients of the model's chain-of-thought (CoT) conditioned on a given prompt. These gradients are then compressed into a compact representation (fingerprint) using lightweight adapters and random projection. The fingerprints are clustered to distinguish between reward-hacking and non-hacking behaviors, allowing for accurate detection of exploitative strategies.
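The fingerprinting step reduces to a random projection plus normalization, after which fingerprints can be compared or clustered. The toy gradient vectors below stand in for real CoT gradients, and all dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def fingerprint(grad_vec, proj):
    """Compress a high-dimensional gradient into a compact, comparable
    fingerprint via random projection and L2 normalization."""
    f = proj @ grad_vec
    return f / np.linalg.norm(f)

dim, k = 4096, 64                      # hypothetical sizes
proj = rng.normal(size=(k, dim)) / np.sqrt(k)

# Toy stand-ins: hacking traces share a gradient direction distinct
# from genuine reasoning traces.
g_honest = rng.normal(size=dim)
g_hack = rng.normal(size=dim)
g_hack2 = g_hack + 0.1 * rng.normal(size=dim)   # same cluster as g_hack

f_honest, f_hack, f_hack2 = (
    fingerprint(g, proj) for g in (g_honest, g_hack, g_hack2)
)
within = float(f_hack @ f_hack2)       # cosine similarity inside the hack cluster
across = float(f_hack @ f_honest)      # similarity across clusters
```

Random projection approximately preserves inner products, so clusters that are separated in gradient space remain separated in the compact fingerprint space.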
Results
GRIFT achieved over 25% relative improvement in detecting reward hacking across multiple reasoning tasks compared to strong baselines. When integrated into a rejection fine-tuning pipeline, GRIFT effectively reduced instances of reward hacking and improved performance on the true task objectives, narrowing the performance gap between models trained with and without access to reward exploits.
Implications
The findings suggest that leveraging gradient-level representations can enhance the robustness of reinforcement learning models against reward hacking. This approach could lead to more reliable training methodologies in RLVR, ultimately improving the performance and integrity of language models in various applications.
Efficient Federated RLHF via Zeroth-Order Policy Optimization
Reinforcement Learning
Federated Learning
Efficient ML
- Par-S2ZPO is designed for federated RLHF, focusing on efficiency in communication, computation, and memory.
- The algorithm employs zeroth-order optimization with binary perturbations to reduce resource requirements.
- Theoretical analysis establishes a convergence rate that is competitive with centralized methods.
- Empirical results show Par-S2ZPO outperforms traditional FedAvg-based RLHF methods in various tasks.
Read more
Efficient Federated RLHF via Zeroth-Order Policy Optimization
Summary
This paper presents a novel federated reinforcement learning from human feedback (RLHF) algorithm called Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S2ZPO). The algorithm is designed for resource-constrained agents, such as edge devices, and aims to optimize communication, computation, and memory efficiency. Par-S2ZPO utilizes zeroth-order optimization with binary perturbations to approximate gradient directions, significantly reducing the amount of data exchanged during training. The theoretical analysis shows that Par-S2ZPO has a convergence rate comparable to centralized methods in terms of sample complexity but converges faster in policy update iterations. Experimental results demonstrate that Par-S2ZPO outperforms a FedAvg-based RLHF approach across four MuJoCo RL tasks, highlighting its effectiveness in federated settings where resources are limited.
Methodology
Par-S2ZPO employs zeroth-order optimization with binary perturbations to approximate gradient directions. It partitions policy parameters among agents, allowing each to only update a subset of parameters. Agents communicate minimal binary feedback indicating the favorability of perturbations, thus reducing communication overhead. The algorithm's design ensures low memory and computation requirements, making it suitable for resource-constrained environments.
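The sign-based zeroth-order idea can be demonstrated on a stand-in objective: each update needs only one bit of feedback per binary perturbation, which is what keeps communication cheap. This sketch omits the parameter partitioning across agents:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    return float(np.sum((theta - 1.0) ** 2))   # stand-in objective

# Sign-based stochastic zeroth-order descent with binary (+/-1)
# perturbations: only the favorability of each probe is needed, i.e.
# a single bit per perturbation.
theta = np.zeros(16)
step, probe = 0.05, 0.01
history = [loss(theta)]
for _ in range(300):
    z = rng.choice([-1.0, 1.0], size=theta.shape)   # binary perturbation
    better = loss(theta + probe * z) < loss(theta - probe * z)  # 1-bit feedback
    theta += step * z if better else -step * z
    history.append(loss(theta))
```

The loss decreases steadily and then oscillates within a ball whose radius is set by the step size, the usual behavior of sign-based zeroth-order schemes.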
Results
In experiments conducted on four MuJoCo RL environments, Par-S2ZPO consistently outperformed the FedAvg-based RLHF algorithm, demonstrating superior performance in terms of both efficiency and effectiveness in training reinforcement learning models.
Implications
The findings suggest that Par-S2ZPO can significantly enhance the feasibility of deploying RLHF in federated learning settings, particularly in applications involving edge devices. This approach can facilitate more efficient training of models that rely on human feedback, potentially leading to advancements in areas such as robotics and interactive AI systems.
Evaluating Temporal and Structural Anomaly Detection Paradigms for DDoS Traffic
Time Series
- Proposes a decision framework for selecting between temporal and structural features in DDoS detection.
- Utilizes lag-1 autocorrelation and PCA cumulative explained variance as diagnostic tools.
- Demonstrates that structural features often outperform temporal features in DDoS detection.
- Focuses on the representation of traffic data rather than the choice of detection algorithms.
Read more
Evaluating Temporal and Structural Anomaly Detection Paradigms for DDoS Traffic
Summary
This paper addresses the challenge of unsupervised anomaly detection in Distributed Denial-of-Service (DDoS) attacks within cloud-native 5G networks. The authors propose a lightweight decision framework that evaluates whether to prioritize temporal or structural features before model training. This framework utilizes two diagnostic probes: lag-1 autocorrelation of an aggregated flow signal and cumulative explained variance from Principal Component Analysis (PCA). The study tests this framework on two datasets with distinct traffic characteristics, one exhibiting temporal dependence and the other being a high-dimensional 5G benchmark with weak temporal signals. The experiments involve three unsupervised learning methods: Isolation Forest, One-Class SVM, and KMeans. The findings indicate that structural features generally outperform temporal features, especially when temporal dependencies are weak. The contribution of this work lies not in developing a new detection algorithm but in providing a heuristic for selecting the appropriate feature representation prior to unsupervised modeling, thereby enhancing the effectiveness of DDoS detection in varying network conditions.
Methodology
The methodology involves a six-step pipeline: dataset selection and characterization, preprocessing, feature engineering under temporal and structural paradigms, unsupervised detection, evaluation against ground-truth labels, and cross-paradigm comparison. The framework uses two probes to guide the choice of feature representation before applying unsupervised learning techniques.
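Both diagnostic probes are short computations. A minimal sketch on synthetic signals, where a smooth series should score high on the lag-1 probe and white noise should not:

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of an aggregated flow signal."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float((x[:-1] @ x[1:]) / (x @ x))

def pca_cumulative_variance(X, k):
    """Fraction of variance captured by the top-k principal components."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return float(var[:k].sum() / var.sum())

rng = np.random.default_rng(0)
t = np.arange(500)
smooth = np.sin(t / 20) + 0.1 * rng.normal(size=t.size)  # strong temporal dependence
noisy = rng.normal(size=t.size)                          # weak temporal dependence

# High lag-1 autocorrelation favors temporal features; otherwise
# structural (PCA-style) features are preferred.
X = rng.normal(size=(200, 30))
cv = pca_cumulative_variance(X, 10)
```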
Results
The experiments reveal that structural features consistently match or exceed the performance of temporal features, particularly in scenarios where temporal dependencies are weak. The results underscore the importance of feature representation in enhancing DDoS detection efficacy.
Implications
This research has significant implications for improving DDoS detection strategies in modern network infrastructures, particularly in 5G environments. By providing a systematic approach to feature selection, it aids practitioners in optimizing anomaly detection systems based on the specific characteristics of network traffic.
LoRaQ: Optimized Low Rank Approximation for 4-bit Quantization
Generative Models
Efficient ML
Optimization
- LoRaQ enables fully sub-16 bit quantization, eliminating the need for high-precision branches.
- The proposed method uses a data-free calibration approach to optimize quantization error compensation.
- LoRaQ outperforms existing state-of-the-art quantization methods in terms of generative performance.
- The authors release an open-source PTQ library to support diverse quantization schemes.
Read more
LoRaQ: Optimized Low Rank Approximation for 4-bit Quantization
Summary
The paper introduces LoRaQ (Low-Rank Approximated Quantization), a novel approach to post-training quantization (PTQ) aimed at enhancing the deployment of large diffusion transformers on resource-constrained hardware. Traditional methods for 4-bit quantization often lead to significant performance degradation due to the high sensitivity of generative models to quantization errors. LoRaQ addresses this issue by proposing a data-free calibration method that optimizes quantization error compensation without the need for high-precision auxiliary branches, which are typically required in existing methods. This allows for the first fully sub-16 bit quantization pipeline, where both the low-rank branch and the main layer can be quantized effectively. The authors demonstrate that LoRaQ outperforms state-of-the-art methods, such as SVDQuant, across various datasets and models, while also exploring mixed-precision configurations that yield superior results. Additionally, they provide an open-source PTQ library to facilitate the implementation of diverse quantization schemes and multi-GPU calibration, addressing current limitations in available tools.
Methodology
LoRaQ decomposes linear layers into a low-rank branch and a residual branch, optimizing the low-rank matrices to minimize quantization error. The method employs a data-free calibration strategy, allowing for efficient model calibration without heavy data-dependent initialization. It also explores mixed-precision strategies for both weights and activations to enhance performance.
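The benefit of a low-rank branch for quantization can be reproduced on a synthetic weight matrix: absorbing the dominant directions before quantizing the residual shrinks the error. The 4-bit uniform quantizer and rank choice here are simplifications of the paper's scheme, which also quantizes the low-rank factors themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(W, bits=4):
    """Uniform symmetric quantization to 2**bits levels (a toy stand-in
    for real integer kernels)."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

# Weight matrix with a few dominant directions (typical of trained layers).
U = rng.normal(size=(64, 4))
V = rng.normal(size=(4, 64))
W = U @ V + 0.05 * rng.normal(size=(64, 64))

# Plain 4-bit quantization of W.
err_plain = np.linalg.norm(W - quantize(W))

# Low-rank branch absorbs the dominant, hard-to-quantize part;
# only the small residual is quantized.
u, s, vt = np.linalg.svd(W)
r = 4
L = u[:, :r] * s[:r] @ vt[:r]          # low-rank branch
err_lowrank = np.linalg.norm(W - (L + quantize(W - L)))
```

Because the residual has a much smaller dynamic range than W, its quantization step is finer and the reconstruction error drops sharply.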
Results
The experiments demonstrate that LoRaQ consistently outperforms state-of-the-art methods like SVDQuant across multiple metrics and datasets. The analysis of mixed-precision configurations reveals that setups such as W8A8, W6A6, and W4A8 yield superior results while maintaining a fully quantized architecture.
Implications
LoRaQ has significant implications for deploying large generative models on resource-constrained devices, enabling efficient inference without sacrificing performance. The open-source library can facilitate broader adoption of advanced quantization techniques in various applications, particularly in fields requiring high-quality generative outputs.
UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration
Time Series
- UniMamba is the first framework to combine state-space dynamics with spatial-temporal attention.
- The Mamba Variate–Channel Encoding Layer incorporates an FFT-Laplace transform and a TCN for efficient modeling.
- UniMamba consistently outperforms Transformer, MLP, and Mamba-based models in accuracy and scalability.
- The framework effectively captures both global temporal patterns and cross-variable relationships.
Read more
UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration
Summary
UniMamba presents a novel framework for multivariate time series forecasting, addressing the challenges posed by complex temporal dependencies and cross-variable interactions. Traditional Transformer-based methods, while effective in capturing temporal correlations, suffer from high computational costs, particularly in long sequences. Conversely, state-space models like Mamba excel in efficiency but lack the ability to recognize explicit temporal patterns. UniMamba integrates the strengths of both approaches by employing a Mamba Variate–Channel Encoding Layer, enhanced with Fast Fourier Transform (FFT) and Temporal Convolution Networks (TCN), to capture global temporal dependencies. Additionally, a Spatial Temporal Attention Layer models inter-variable correlations and temporal evolution, while a Feedforward Temporal Dynamics Layer fuses continuous and discrete contexts for improved forecasting accuracy. Extensive experiments on eight public benchmark datasets demonstrate that UniMamba outperforms existing state-of-the-art models in terms of forecasting accuracy and computational efficiency, establishing it as a robust solution for long-sequence multivariate time series prediction.
Methodology
The methodology involves a unified framework that integrates a Mamba Variate–Channel Encoding Layer with FFT-Laplace transform and TCN for dynamic feature propagation, a Spatial Temporal Attention Layer for modeling inter-variable and temporal dependencies, and a Feedforward Temporal Dynamics Layer for fusing global and local contexts.
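Two of the ingredients are easy to sketch: an FFT view of each variate for global periodic structure, and a causal convolution for local dynamics. This is a simplification; the Mamba state-space blocks, the Laplace transform, and the attention layers are not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Multivariate series: (timesteps, variables).
T, V = 256, 7
x = rng.normal(size=(T, V)) + np.sin(np.arange(T) / 10)[:, None]

# Frequency-domain view of each variate (global temporal patterns).
spectrum = np.abs(np.fft.rfft(x, axis=0))       # (T//2 + 1, V)
dominant = np.argmax(spectrum[1:], axis=0) + 1  # skip the DC component

# TCN-style causal convolution over time (local dynamics), one kernel
# shared across variates for brevity.
kernel = np.array([0.25, 0.25, 0.5])
pad = np.vstack([np.zeros((len(kernel) - 1, V)), x])
local = np.stack(
    [np.convolve(pad[:, v], kernel, mode="valid") for v in range(V)], axis=1
)
```

The shared sinusoid shows up as the same dominant frequency bin in every variate, which is the kind of global pattern the frequency branch is meant to expose.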
Results
UniMamba achieved state-of-the-art performance across multiple public datasets, demonstrating superior forecasting accuracy and computational efficiency compared to existing models.
Implications
The implications of this research include improved forecasting capabilities in various domains such as energy, finance, and environmental monitoring, where accurate long-term predictions are critical. The framework's scalability also suggests potential applications in real-time forecasting systems.
Federated Learning with Quantum Enhanced LSTM for Applications in High Energy Physics
Federated Learning
Theory
Efficient ML
- Introduction of a hybrid quantum-classical LSTM model (QLSTM) for efficient learning in HEP applications.
- Implementation of a federated learning framework to distribute the learning workload across local servers.
- Demonstrated significant performance improvements with reduced data and resource requirements compared to baseline models.
- Achieved results comparable to classical deep learning methods with a model of fewer than 300 parameters.
Read more
Federated Learning with Quantum Enhanced LSTM for Applications in High Energy Physics
Summary
This paper presents a novel approach to machine learning in High Energy Physics (HEP) by integrating federated learning with a quantum-enhanced long short-term memory (LSTM) model, termed QLSTM. The authors address the challenges of training complex models on large-scale datasets while minimizing energy consumption and resource requirements. The proposed QLSTM model leverages the strengths of quantum computing to capture intricate relationships in data while maintaining the temporal learning capabilities of LSTMs. The federated learning framework allows for distributed training across multiple nodes, enhancing data privacy and reducing the computational load on individual systems. The effectiveness of the proposed model is empirically validated through experiments on the Supersymmetry (SUSY) dataset, which contains 5 million rows. The results indicate that the QLSTM outperforms existing quantum machine learning techniques and achieves performance comparable to classical deep learning benchmarks, with significantly fewer parameters and data points required for training.
Methodology
The authors developed a hybrid quantum-classical model (QLSTM) that combines quantum-enhanced representation capabilities with LSTM's temporal learning features. They employed a federated learning setup to enable collaborative training across distributed nodes, allowing for efficient resource sharing and workload distribution. The model was tested on the SUSY dataset to evaluate its classification performance.
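The federated workflow above can be sketched as a FedAvg-style round: broadcast the global parameters, run local updates on each node's private data, and average the results. This is an illustrative sketch only; the toy objective, the `local_update` rule, and the two-element parameter vector are hypothetical stand-ins for local QLSTM training.

```python
# Illustrative FedAvg-style round: each node trains a local copy of a small
# parameter vector, then the server averages the updates. A sketch of the
# federated workflow described above, not the paper's QLSTM code.

def local_update(params, data, lr=0.1):
    """One gradient step on a toy squared-error objective.
    Hypothetical stand-in for local (Q)LSTM training on a node."""
    grads = [2 * (p - x) for p, x in zip(params, data)]
    return [p - lr * g for p, g in zip(params, grads)]

def federated_round(global_params, node_datasets):
    """Broadcast global params, run local updates, average the results."""
    local_models = [local_update(list(global_params), d) for d in node_datasets]
    n = len(local_models)
    return [sum(ps) / n for ps in zip(*local_models)]

if __name__ == "__main__":
    params = [0.0, 0.0]               # shared global model (the paper's has < 300 parameters)
    nodes = [[1.0, 2.0], [3.0, 4.0]]  # each node's private data stays local
    for _ in range(50):
        params = federated_round(params, nodes)
    print([round(p, 2) for p in params])  # converges toward the mean of node targets
```

Only parameters cross the node boundary here, which is the privacy property the summary attributes to the federated setup.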
Results
The QLSTM model demonstrated superior performance compared to existing quantum machine learning techniques, achieving results within ±1% of classical deep learning benchmarks. The model required only 20,000 data points to perform comparably and used fewer than 300 parameters, a significant reduction in both data requirements and computational resources.
Implications
This research suggests that integrating quantum computing with federated learning can enhance machine learning capabilities in data-intensive fields like HEP. The proposed framework could facilitate more efficient data analysis across distributed environments, potentially leading to advancements in scientific discovery and collaboration in high-energy physics research.
ECG-Lens: Benchmarking ML & DL Models on PTB-XL Dataset
Time Series
- Comparison of traditional ML and advanced DL models for ECG classification.
- Use of raw ECG signals and SWT for data augmentation to improve model performance.
- ECG-Lens model achieved 80% accuracy and 90% ROC-AUC, outperforming traditional methods.
- Demonstrates the potential of deep learning in enhancing automated ECG analysis.
Summary
This paper presents a comprehensive evaluation of various machine learning (ML) and deep learning (DL) models for the automated classification of electrocardiogram (ECG) signals using the PTB-XL dataset. The study compares three traditional ML algorithms—Decision Tree Classifier, Random Forest Classifier, and Logistic Regression—with three DL models: a Simple Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and a Complex CNN termed ECG-Lens. The models were trained on raw ECG signals, leveraging the ability of DL architectures to automatically extract features. To enhance model performance, data augmentation was applied using the Stationary Wavelet Transform (SWT). The models were evaluated based on multiple metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. The ECG-Lens model outperformed the traditional ML methods, achieving an accuracy of 80% and a ROC-AUC of 90%. The findings indicate that complex CNNs significantly enhance the classification of raw 12-lead ECG data, providing a benchmark for future automated ECG classification efforts and guiding the development of condition-specific models.
Methodology
The study employed a comparative analysis of three traditional ML algorithms and three DL models, training them on the PTB-XL dataset of 12-lead ECG recordings. Data augmentation was performed using the Stationary Wavelet Transform to enhance the diversity of training samples. The models were evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
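The evaluation metrics listed above can all be computed from a confusion matrix; the following is a minimal sketch for the binary case, with hypothetical labels rather than PTB-XL outputs.

```python
# Minimal sketch of the metrics named above (accuracy, precision, recall, F1)
# for binary labels. The labels here are hypothetical, not from PTB-XL.

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(m)
```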
Results
The ECG-Lens model achieved the highest performance with 80% classification accuracy and a ROC-AUC of 90%. The deep learning models, particularly the complex CNN, significantly outperformed traditional machine learning methods in classifying ECG signals.
Implications
The results suggest that deep learning architectures can greatly improve the accuracy and efficiency of ECG classification, which could lead to better diagnostic tools for cardiovascular diseases. This research provides a foundation for developing more specialized models for specific cardiac conditions.
Scalable Neighborhood-Based Multi-Agent Actor-Critic
Reinforcement Learning
- Introduction of MADDPG-K, a scalable extension of MADDPG.
- Critic input is limited to k nearest agents, reducing computational complexity.
- Empirical results demonstrate competitive performance and faster convergence.
- Method shows better runtime scaling with an increasing number of agents.
Summary
This paper introduces MADDPG-K, a scalable variant of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, designed to overcome the computational challenges associated with centralized critic architectures in multi-agent reinforcement learning (MARL). Traditional centralized critics, which utilize the observations and actions of all agents, face scalability issues as their input size increases linearly with the number of agents, driving up training costs. MADDPG-K addresses this limitation by restricting each agent's critic to only consider the k nearest agents based on Euclidean distance, thus maintaining a constant input size regardless of the total number of agents. The authors analyze the complexity of this approach, demonstrating that the quadratic cost is primarily due to simple scalar distance calculations rather than the costly neural network operations that typically hinder standard MADDPG. Empirical validation across various cooperative and adversarial environments shows that MADDPG-K achieves competitive or superior performance compared to MADDPG, exhibits faster convergence in cooperative scenarios, and scales better with increasing agent numbers. The paper concludes with directions for future research in enhancing multi-agent learning efficiency.
Methodology
The MADDPG-K algorithm modifies the centralized critic approach by limiting the input to each agent's critic to the k closest agents based on Euclidean distance. This method retains the locality of relevant information while ensuring a fixed-size input for the critic, thus improving scalability. The authors conduct a complexity analysis and validate the algorithm through experiments in various multi-agent environments.
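The neighborhood restriction described above can be sketched directly: rank peers by Euclidean distance and build the critic input from the k closest only, so its size is independent of the total agent count. The positions, observations, and actions below are hypothetical toy values, not the paper's environments.

```python
# Sketch of the fixed-size critic input: each agent's critic sees only the
# k nearest agents by Euclidean distance. Toy values, hypothetical.

import math

def k_nearest_neighbors(positions, agent_idx, k):
    """Return indices of the k agents closest to `agent_idx` (excluding itself)."""
    me = positions[agent_idx]
    dists = [(math.dist(me, pos), j)
             for j, pos in enumerate(positions) if j != agent_idx]
    dists.sort()
    return [j for _, j in dists[:k]]

def critic_input(observations, actions, positions, agent_idx, k):
    """Concatenate own (obs, action) with those of the k nearest agents only,
    so the input size stays constant as the agent count grows."""
    idxs = [agent_idx] + k_nearest_neighbors(positions, agent_idx, k)
    feats = []
    for j in idxs:
        feats.extend(observations[j])
        feats.extend(actions[j])
    return feats

positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 1.5)]
obs = [[0.1], [0.2], [0.3], [0.4]]
acts = [[1.0], [2.0], [3.0], [4.0]]
print(k_nearest_neighbors(positions, 0, 2))  # [1, 3]: the two closest to agent 0
```

The neighbor search itself is the quadratic scalar-distance computation the authors identify as cheap relative to the neural network passes.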
Results
MADDPG-K outperforms or matches the performance of traditional MADDPG in multiple environments, achieving faster convergence in cooperative settings and demonstrating improved runtime efficiency as the number of agents increases. The empirical results validate the effectiveness of the proposed method in addressing the scalability issues of centralized critics.
Implications
The findings suggest that MADDPG-K can be effectively applied in large-scale multi-agent systems, such as robotics, traffic control, and complex game environments, where efficient learning and decision-making are crucial. The approach may also inspire further research into scalable algorithms in multi-agent reinforcement learning.
DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy
Large Language Models
Theory
- DPrivBench is a novel benchmark for assessing LLMs' reasoning on differential privacy.
- The benchmark includes 720 instances, covering both foundational and advanced DP topics.
- Current LLMs perform well on basic DP mechanisms but struggle with advanced algorithms.
- Integrating external references can improve LLM accuracy in DP reasoning.
Summary
This paper introduces DPrivBench, a benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) regarding differential privacy (DP). The authors highlight the challenges faced by non-experts in designing and verifying DP algorithms, which typically require substantial expertise. DPrivBench consists of 720 instances divided into two categories: foundational sensitivity-based DP mechanisms and advanced DP algorithms from the literature. The benchmark aims to cover a broad range of DP topics and resist shortcut reasoning. Experiments reveal that while top-performing models excel at basic DP mechanisms, they struggle with more complex algorithms, indicating significant gaps in current LLM capabilities. The authors also explore ways to enhance LLM reasoning, such as integrating explicit references from curated DP knowledge bases and conducting failure-mode analyses. Overall, DPrivBench serves as a foundational tool for advancing automated DP reasoning and offers a new testbed for mathematical reasoning.
Methodology
The authors developed DPrivBench by curating instances that require LLMs to determine whether specific algorithms satisfy stated DP guarantees. The benchmark was designed to ensure broad topic coverage, diverse difficulty levels, and resistance to trivial reasoning. Evaluations were conducted using various state-of-the-art LLMs to assess their performance on the benchmark.
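As an example of the foundational sensitivity-based DP mechanisms the benchmark targets, the classic Laplace mechanism adds noise with scale sensitivity/epsilon to a query's true answer. The sketch below uses a hypothetical counting query and is an illustration of the textbook mechanism, not material from the benchmark itself.

```python
# Textbook Laplace mechanism: release query(dataset) plus Laplace noise with
# scale sensitivity/epsilon. Query and data are hypothetical illustrations.

import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via inverse transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def laplace_mechanism(dataset, query, sensitivity, epsilon, rng=None):
    """Satisfies epsilon-DP when `sensitivity` bounds the query's L1 sensitivity."""
    rng = rng or random.Random()
    return query(dataset) + laplace_noise(sensitivity / epsilon, rng)

def count(d):  # counting query: L1 sensitivity is 1
    return sum(d)

data = [0, 1, 1, 0, 1]
noisy = laplace_mechanism(data, count, sensitivity=1.0, epsilon=1.0,
                          rng=random.Random(0))
print(noisy)  # true count 3 plus calibrated noise
```

Deciding whether a given algorithm's noise is correctly calibrated to its stated sensitivity is exactly the kind of verification task DPrivBench poses to LLMs.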
Results
The evaluation showed that the strongest models achieved high accuracy on foundational DP mechanisms, while all models exhibited significant errors on advanced DP algorithms. The study also found that providing explicit references improved model performance, highlighting potential directions for enhancing automated DP reasoning.
Implications
DPrivBench can facilitate the development of tools that automate DP reasoning, making it more accessible for non-experts. It also serves as a valuable resource for researchers in privacy and mathematical reasoning, potentially leading to improved methodologies in the field.
Tabular foundation models for in-context prediction of molecular properties
Efficient ML
- TFMs enable in-context learning for molecular property prediction without task-specific fine-tuning.
- Combining TFMs with CheMeleon embeddings yields significant performance improvements.
- Molecular representation is crucial for TFM effectiveness, outperforming traditional fingerprints.
- TFMs reduce computational costs compared to conventional fine-tuning methods.
Summary
This paper explores the use of Tabular Foundation Models (TFMs) for predicting molecular properties in scenarios characterized by low to medium data availability. Traditional molecular foundation models often require task-specific fine-tuning and machine learning expertise, which can limit their practical application. In contrast, TFMs leverage in-context learning to make predictions without the need for fine-tuning, thus simplifying the process and reducing computational costs. The authors evaluate TFMs against both standardized pharmaceutical benchmarks and chemical engineering datasets, comparing frozen molecular representations and classical descriptors. The results indicate that TFMs, particularly when combined with CheMeleon embeddings, achieve high predictive performance, with win rates of up to 100% on 30 MoleculeACE tasks. The study highlights the importance of molecular representation in TFM performance, showing that both molecular foundation model embeddings and 2D descriptor sets significantly outperform traditional molecular fingerprints. Overall, the findings suggest that TFMs present a highly accurate and cost-effective alternative for molecular property prediction in practical applications.
Methodology
The authors employed TFMs pre-trained on synthetic tabular datasets to perform in-context predictions on molecular property tasks. They combined TFMs with various molecular representations, including classical descriptors and embeddings from molecular foundation models, and benchmarked their performance against established classical machine learning models on multiple datasets.
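The in-context interface described above (labelled context rows passed at inference time, no weight updates) can be sketched with a 1-nearest-neighbour rule standing in for the TFM; the feature vectors and labels are hypothetical, not CheMeleon embeddings.

```python
# Sketch of the fit-free, in-context prediction interface: the context set is
# supplied at inference and no model is trained. A 1-NN rule is a hypothetical
# stand-in for the tabular foundation model.

def in_context_predict(context_X, context_y, query_x):
    """Predict the label of `query_x` from the context set alone (no fitting)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(context_X)), key=lambda i: sqdist(context_X[i], query_x))
    return context_y[best]

# Hypothetical molecular embeddings and activity labels
context_X = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
context_y = ["inactive", "inactive", "active"]
print(in_context_predict(context_X, context_y, [4.5, 4.8]))  # active
```

The point of the interface is that swapping in a new property task only means swapping the context rows, which is why no task-specific fine-tuning is needed.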
Results
The study demonstrated that TFMs achieved excellent predictive performance across various benchmarks, with win rates of up to 100% on MoleculeACE tasks. The combination of TFMs with CheMeleon embeddings and other descriptors provided substantial gains over traditional molecular fingerprints, indicating the effectiveness of the approach in low to medium data scenarios.
Implications
The findings suggest that TFMs can significantly enhance molecular property prediction in drug discovery and chemical engineering, particularly in data-limited situations. This could lead to more efficient and cost-effective approaches in these fields, reducing the reliance on extensive labeled datasets and machine learning expertise.
The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason
NLP
Large Language Models
Interpretability
- Large language models exhibit spectral phase transitions during reasoning tasks.
- Instruction tuning reverses the spectral geometry of reasoning in models.
- A taxonomy of generation dynamics categorizes models into expansion, compression, and equilibrium.
- Spectral properties can predict reasoning correctness with high accuracy.
Summary
This paper investigates the spectral properties of large language models (LLMs) during reasoning tasks compared to factual recall. Through a comprehensive spectral analysis of 11 models across 5 architecture families, the author identifies seven key phenomena related to spectral phase transitions in hidden activation spaces. These include Reasoning Spectral Compression, where most models exhibit lower spectral alpha (α) during reasoning tasks, and Instruction Tuning Spectral Reversal, which shows that instruction-tuned models reorganize their representations for reasoning. The study also categorizes generation dynamics into expansion, compression, and equilibrium based on architecture. Furthermore, it establishes a Spectral Scaling Law indicating that larger models access lower α representations for reasoning. The analysis reveals a Token-Level Spectral Cascade, where synchronization of spectral dynamics decays with layer distance, and identifies reasoning step boundaries through phase transitions in the alpha gradient. Notably, the paper demonstrates that spectral α can predict answer correctness with high accuracy, achieving an AUC of 1.000 in one model and a mean AUC of 0.893 across six models. These findings contribute to a spectral theory of reasoning in transformers, highlighting the universal and architecture-specific aspects of the geometry of thought.
Methodology
The author conducted a systematic spectral analysis of hidden activations across 11 large language models from 5 different architecture families. This involved examining spectral properties during reasoning and factual recall tasks, analyzing token-level dynamics, and identifying phase transitions in the alpha gradient.
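A common way to estimate a spectral exponent α of the kind analyzed here is a least-squares fit in log-log space to an eigenvalue spectrum. The sketch below uses a synthetic spectrum with a known exponent; it illustrates the general estimation idea, not the paper's specific estimator.

```python
# Fit a power law lambda_i ~ i^(-alpha) to a (synthetic) eigenvalue spectrum
# by least squares in log-log space. Illustrative, not the paper's method.

import math

def fit_spectral_alpha(eigenvalues):
    """Return alpha from a linear fit of log(lambda_i) against log(i)."""
    xs = [math.log(i + 1) for i in range(len(eigenvalues))]
    ys = [math.log(ev) for ev in eigenvalues]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope  # negated slope is the power-law exponent

# Synthetic spectrum with known alpha = 1.5
spectrum = [(i + 1) ** -1.5 for i in range(100)]
print(round(fit_spectral_alpha(spectrum), 3))  # 1.5
```

On this reading, "lower α during reasoning" means a flatter eigenvalue decay, i.e. variance spread over more directions of the activation space.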
Results
The analysis revealed seven phenomena: Reasoning Spectral Compression, Instruction Tuning Spectral Reversal, a three-category generation taxonomy, a Spectral Scaling Law, a Token-Level Spectral Cascade, Reasoning Step Spectral Punctuation, and Perfect Spectral Correctness Prediction. The findings indicate that reasoning induces universal spectral phase transitions and that spectral features can accurately predict reasoning correctness.
Implications
These insights could enhance the interpretability of large language models, improve reasoning capabilities through better training paradigms, and inform the design of future models by understanding the underlying geometric structures of thought.
Univariate Channel Fusion for Multivariate Time Series Classification
Time Series
Efficient ML
- Introduction of Univariate Channel Fusion (UCF) as a lightweight, classifier-agnostic method for MTSC.
- UCF transforms multivariate time series into a univariate format using simple fusion techniques.
- Demonstrated competitive accuracy and high efficiency in five diverse real-world case studies.
- UCF is particularly effective in scenarios with high inter-channel correlation.
Summary
This paper addresses the challenge of multivariate time series classification (MTSC), which is critical in fields such as biomedical signal analysis and motion monitoring. Traditional deep learning approaches for MTSC are often computationally intensive, making them unsuitable for real-time applications or deployment on low-cost hardware. The authors propose a novel method called Univariate Channel Fusion (UCF), which simplifies the classification process by transforming multivariate time series into a univariate representation using straightforward fusion strategies like mean, median, or dynamic time warping barycenter. This transformation allows for the application of any univariate classifier, thus providing a more efficient and flexible alternative to complex models. The effectiveness of UCF is evaluated through five case studies across various domains, including chemical monitoring, brain-computer interfaces, and human activity analysis. The results indicate that UCF not only achieves competitive accuracy compared to baseline methods and state-of-the-art algorithms but also significantly enhances computational efficiency, particularly in scenarios with high inter-channel correlation.
Methodology
The UCF method involves transforming multivariate time series into a univariate representation through channel fusion strategies such as mean, median, or dynamic time warping barycenter. This allows for the use of any classifier designed for univariate time series, making the approach both flexible and computationally efficient.
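The mean and median fusion strategies described above reduce to a per-timestep aggregation across channels; here is a minimal sketch with hypothetical toy channels (the dynamic time warping barycenter variant is omitted).

```python
# Sketch of UCF-style channel fusion: collapse a multivariate series
# (channels x time) into one univariate series via per-timestep mean or
# median. Toy signals, hypothetical.

import statistics

def fuse_channels(series, strategy="mean"):
    """series: list of channels, each a list of equal length over time."""
    fns = {"mean": statistics.fmean, "median": statistics.median}
    fuse = fns[strategy]
    return [fuse(values) for values in zip(*series)]

mts = [
    [1.0, 2.0, 3.0],  # channel 1
    [3.0, 2.0, 1.0],  # channel 2
    [2.0, 8.0, 2.0],  # channel 3
]
print(fuse_channels(mts, "mean"))    # [2.0, 4.0, 2.0]
print(fuse_channels(mts, "median"))  # [2.0, 2.0, 2.0]
```

The fused output can then be handed to any univariate classifier, which is what makes the approach classifier-agnostic.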
Results
The experimental evaluation of UCF across five case studies showed that it often outperformed baseline methods and state-of-the-art algorithms tailored for MTSC, achieving substantial gains in computational efficiency.
Implications
The UCF method has significant implications for real-time applications in various domains, particularly in low-cost hardware environments such as IoT devices and wearable systems. It enables efficient processing of multivariate time series data, which is crucial for timely responses in applications like brain-computer interfaces and health monitoring.
Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
NLP
Large Language Models
- Introduction of reasoning-targeted jailbreak attacks that compromise the reasoning process of LRMs without changing final answers.
- Development of the PRJA Framework, which combines semantic analysis and psychological principles to manipulate reasoning steps.
- Demonstration of high attack success rates (83.6%) against several commercial LRMs, indicating serious security concerns.
- Emphasis on the importance of safeguarding the reasoning process in sensitive applications like healthcare and education.
Summary
This paper addresses a novel type of vulnerability in Large Reasoning Models (LRMs) by introducing reasoning-targeted jailbreak attacks, which aim to inject harmful content into the reasoning steps of LRMs without altering the final answers. While previous studies have focused on the safety of the final outputs, this work highlights the importance of the reasoning process itself, especially in high-stakes applications like healthcare and education. The authors propose the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework, which consists of two main components: a Semantic-based Trigger Selection module that identifies manipulative reasoning triggers through semantic analysis, and a Psychology-based Instruction Generation module that employs psychological theories to create persuasive instructions. This framework effectively navigates the challenges of maintaining the integrity of the final answer while embedding harmful content into the reasoning process. Extensive experiments demonstrate that PRJA achieves an average attack success rate of 83.6% across multiple commercial LRMs, revealing significant vulnerabilities in their reasoning capabilities.
Methodology
The PRJA Framework employs two main modules: the Semantic-based Trigger Selection module, which uses semantic analysis to identify harmful keywords that align with the original question-answer pair, and the Psychology-based Instruction Generation module, which crafts persuasive instructions based on psychological theories to enhance compliance with harmful content generation.
Results
The proposed PRJA Framework achieved an average attack success rate of 83.6% across five question-answering datasets against commercial LRMs, including DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini, showcasing its effectiveness in executing reasoning-targeted jailbreak attacks.
Implications
The findings highlight critical vulnerabilities in LRMs, particularly in their reasoning processes, which could have severe implications for their deployment in high-stakes environments. This research underscores the need for improved safety mechanisms that protect not only final outputs but also the integrity of reasoning chains.
The Topological Trouble With Transformers
NLP
Large Language Models
Theory
- Transformers' feedforward architecture limits their ability to track dynamic states effectively.
- State tracking is crucial for language understanding and reasoning, yet transformers often fail in this regard.
- The authors propose a taxonomy for recurrent and continuous-thought transformer architectures.
- Dynamic depth models and externalized state representations are computationally inefficient solutions.
Summary
This paper discusses the limitations of transformers in maintaining dynamic state tracking due to their feedforward architecture. While transformers excel at encoding sequences with a long contextual history, they struggle with the iterative updating of latent variables that reflect an evolving environment. This limitation leads to the exhaustion of the model's depth as state representations are pushed deeper into the layer stack, making them less accessible. The authors argue that effective state tracking requires a shift from explicit thought traces to implicit activation dynamics, which can be better achieved through recurrent architectures. They propose a taxonomy of recurrent and continuous-thought transformer architectures based on their recurrence axis and the ratio of input tokens to recurrence steps. The paper also outlines promising research directions, such as enhanced state-space models and coarse-grained recurrence, to improve state tracking in modern foundation models.
Methodology
The authors analyze the limitations of transformers through theoretical exploration and propose a taxonomy for recurrent architectures. They discuss the implications of these limitations on state tracking and suggest potential research directions.
Results
The paper highlights the inadequacies of transformers in maintaining coherent state tracking over time, leading to inconsistencies in multi-turn conversations and reasoning tasks. It emphasizes the need for recurrent architectures to address these issues effectively.
Implications
Improving state tracking in transformers could enhance their performance in language understanding, reasoning, and multi-agent interactions, leading to more coherent and contextually aware AI systems.
TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation
Computer Vision
- TwinTrack provides a framework for post-hoc calibration of segmentation probabilities in ambiguous medical imaging tasks.
- The method utilizes isotonic regression to align predictions with the empirical mean human response (MHR).
- TwinTrack outperforms traditional calibration methods in terms of calibration metrics and segmentation accuracy.
- The approach is robust to inter-rater disagreement, providing meaningful probabilistic interpretations of segmentation outputs.
Summary
The paper presents TwinTrack, a novel framework for post-hoc multi-rater calibration in medical image segmentation, specifically targeting the segmentation of pancreatic ductal adenocarcinoma (PDAC) on contrast-enhanced CT scans. The authors highlight the inherent ambiguity in PDAC segmentation due to inter-rater disagreement among experts, which reflects genuine uncertainty rather than mere annotation noise. Traditional deep learning approaches typically assume a single ground truth, leading to poorly calibrated outputs that can misrepresent certainty. TwinTrack addresses this issue by calibrating ensemble segmentation probabilities to the empirical mean human response (MHR), which represents the fraction of expert annotators labeling a voxel as tumor. This calibration process is simple and requires only a small multi-rater calibration set. The methodology involves a two-stage segmentation process where a low-resolution nnU-Net localizes the pancreas, followed by a high-resolution ensemble of nnU-Nets that refines the predictions. The central contribution is the use of isotonic regression to align voxel-wise tumor probabilities with the MHR, allowing for a more interpretable probabilistic output that reflects the expected fraction of raters labeling a voxel as tumor. The results demonstrate that TwinTrack significantly improves calibration metrics and segmentation performance compared to standard methods, making it a promising approach for handling ambiguity in medical image segmentation tasks.
Methodology
TwinTrack employs a two-stage segmentation approach where a low-resolution nnU-Net first identifies a region of interest, followed by a high-resolution ensemble of nnU-Nets that refines the segmentation. The key innovation is a post-hoc calibration step using isotonic regression to align the predicted tumor probabilities with the mean human response (MHR) from multiple expert annotations.
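The calibration step described above rests on isotonic regression. Below is a minimal pool-adjacent-violators sketch that maps predictions, already sorted by predicted probability, onto MHR-style targets; the prediction/MHR pairs are hypothetical, and this is not the paper's implementation.

```python
# Pool-adjacent-violators sketch of the isotonic calibration step: fit a
# non-decreasing function to MHR targets ordered by predicted probability.
# Values are hypothetical.

def isotonic_fit(targets, weights=None):
    """Non-decreasing least-squares fit to `targets` (assumed already sorted
    by the model's predicted probability)."""
    weights = weights or [1.0] * len(targets)
    blocks = [[t, w] for t, w in zip(targets, weights)]  # [mean, weight]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:  # monotonicity violation: pool
            m1, w1 = blocks[i]
            m2, w2 = blocks[i + 1]
            blocks[i] = [(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2]
            del blocks[i + 1]
            if i > 0:
                i -= 1                       # re-check against previous block
        else:
            i += 1
    out = []                                 # expand pooled blocks per point
    for mean, w in blocks:
        out.extend([mean] * int(w))
    return out

# MHR values (fraction of raters labelling each voxel as tumour), in order of
# increasing predicted probability:
mhr_sorted = [0.0, 0.5, 0.25, 0.75, 1.0]
print(isotonic_fit(mhr_sorted))  # [0.0, 0.375, 0.375, 0.75, 1.0]
```

The fitted monotone map is what lets a calibrated output be read as the expected fraction of raters who would label a voxel as tumour.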
Results
The evaluation on the CURVAS–PDACVI benchmark shows that TwinTrack achieves the highest performance across various metrics, including TDSC (Thresholding Dice Score), ECE (Expected Calibration Error), and CRPS (Continuous Ranked Probability Score). It significantly improves calibration metrics compared to uncalibrated and single-rater approaches, demonstrating its effectiveness in handling inter-rater variability.
Implications
TwinTrack's approach to multi-rater calibration can enhance the reliability and interpretability of medical image segmentation models, particularly in scenarios with high inter-rater disagreement. This has potential applications in clinical settings where accurate tumor delineation is critical for treatment planning and patient outcomes.
Applications of deep generative models to DNA reaction kinetics and to cryogenic electron microscopy
Generative Models
Graph Learning
Multimodal
- Introduction of ViDa, a deep learning framework for DNA reaction kinetics analysis.
- Development of Struc2mapGAN for synthesizing high-fidelity cryo-EM density maps.
- Proposal of improved evaluation metrics for protein structure modeling from cryo-EM maps.
- Integration of structural embeddings with cryo-EM data using CryoSAMU.
Summary
This dissertation investigates the application of deep generative models to enhance the analysis of complex biological problems, specifically focusing on DNA reaction kinetics and cryogenic electron microscopy (cryo-EM). The first part introduces ViDa, a biophysics-informed deep learning framework that utilizes variational autoencoders (VAEs) and geometric scattering transforms to create biophysically plausible embeddings of DNA reaction kinetics simulations. These embeddings are visualized in a two-dimensional Euclidean space, facilitating the interpretation of DNA hybridization and toehold-mediated three-way strand displacement reactions. By clustering trajectory ensembles into reaction pathways, ViDa provides new insights into reaction mechanisms. The second part addresses challenges in cryo-EM density map interpretation and protein structure modeling. A comprehensive review of existing deep learning methods for protein structure modeling is presented, along with improved evaluation metrics for assessing their performance. The dissertation introduces Struc2mapGAN, a generative adversarial network that synthesizes high-fidelity cryo-EM density maps from protein structures, and CryoSAMU, a multimodal U-Net that enhances cryo-EM density maps by integrating density features with structural embeddings through cross-attention mechanisms. Overall, the research demonstrates the potential of deep generative models to advance the understanding of DNA reaction mechanisms and improve cryo-EM density map analysis.
Methodology
The research employs deep generative models, specifically variational autoencoders (VAEs) and generative adversarial networks (GANs), to analyze DNA reaction kinetics and cryo-EM density maps. ViDa uses geometric scattering transforms to create embeddings, while Struc2mapGAN and CryoSAMU leverage GANs and U-Net architectures, respectively, to synthesize and enhance cryo-EM maps.
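For reference, the VAE objective underlying ViDa's embeddings is the standard evidence lower bound, stated here in textbook form rather than the dissertation's notation:

```latex
% Standard VAE evidence lower bound (textbook form)
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

Maximizing this objective trades off reconstruction quality against keeping the latent embedding close to the prior, which is what makes the resulting low-dimensional space suitable for visualization.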
Results
The ViDa framework successfully visualizes and interprets DNA reaction pathways, revealing new insights into reaction mechanisms. Struc2mapGAN produces high-fidelity cryo-EM density maps, while CryoSAMU enhances the quality of intermediate-resolution maps, leading to improved protein structure modeling.
Implications
The findings suggest that deep generative models can significantly improve the interpretation of complex biological processes and enhance the accuracy of protein structure modeling from cryo-EM data, potentially impacting fields such as molecular biology and bioinformatics.
Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion
Large Language Models
Theory
Efficient ML
- Establishes a unified framework linking data-centric and model-centric optimization methods.
- Identifies three key correspondences: geometric, low-rank, and security-privacy.
- Demonstrates that cooperative optimization can outperform isolated approaches.
- Encourages collaboration between data and parameter research communities.
Summary
This paper addresses the historical divide between data-centric and model-centric approaches in optimizing large language models (LLMs). It proposes a unified framework that establishes a data-parameter correspondence, revealing that operations in both paradigms are manifestations of the same geometric structure on the statistical manifold. The author identifies three fundamental correspondences: (1) Geometric correspondence, where data pruning and parameter sparsification reduce manifold volume equivalently; (2) Low-rank correspondence, linking in-context learning and low-rank adaptation as explorations of identical subspaces; and (3) Security-privacy correspondence, highlighting the interplay between data poisoning and parameter backdoors. The framework aims to facilitate cross-community methodology transfer, suggesting that integrating data and parameter optimization can enhance efficiency, robustness, and privacy in LLMs.
Methodology
The paper employs mathematical formalizations grounded in information geometry, particularly the Fisher-Rao metric and Legendre duality, to explore the relationships between data manipulation techniques and parameter adjustments in LLMs.
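For reference, the Fisher-Rao metric invoked above is, in its standard textbook form, the Fisher information metric on the statistical manifold:

```latex
% Fisher information metric (standard definition, not the paper's notation)
g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[
  \frac{\partial \log p_\theta(x)}{\partial \theta_i}\,
  \frac{\partial \log p_\theta(x)}{\partial \theta_j}
\right]
```

Under this metric, both data-side and parameter-side operations act on the same geometry of the model distribution, which is the basis for the paper's claimed correspondences.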
Results
The findings suggest that recognizing the intrinsic links between data and parameter modalities can lead to improved efficiency, robustness, and privacy in LLMs, with preliminary evidence supporting the proposed correspondences.
Implications
This framework could lead to more effective LLM optimization strategies by fostering collaboration between data-centric and model-centric researchers, ultimately enhancing model performance and security.
A Systematic Survey and Benchmark of Deep Learning for Molecular Property Prediction in the Foundation Model Era
Graph Learning
Multimodal
Theory
- The paper categorizes molecular property prediction methods into four paradigms: Quantum, Descriptor ML, Geometric Deep Learning, and Foundation Models.
- It highlights the need for improved benchmark designs to address challenges in data curation and evaluation protocols.
- The authors propose three forward-looking research directions to enhance molecular property prediction methodologies.
- A comprehensive meta-analysis of over one hundred deep architectures reveals trends in performance across different molecular property prediction tasks.
Summary
This paper presents a systematic survey and benchmark of deep learning methodologies for molecular property prediction, integrating insights from quantum chemistry, cheminformatics, and AI. It categorizes the evolution of molecular property prediction into four paradigms: Quantum, Descriptor Machine Learning, Geometric Deep Learning, and Foundation Models. The authors propose a unified taxonomy linking molecular representations, model architectures, and interdisciplinary applications. They conduct benchmark analyses using various datasets, addressing challenges such as inconsistent stereochemistry and reproducibility issues. The survey emphasizes the need for modernized benchmark designs and proposes three future research directions: physics-aware learning that incorporates quantum consistency, uncertainty-calibrated foundation models for reliable predictions, and the development of multimodal benchmark ecosystems that integrate computational and experimental data. The findings highlight the strengths and limitations of existing methodologies, advocating for a cohesive approach to molecular AI that leverages the advantages of each paradigm.
Methodology
The authors conducted a meta-analysis of existing deep learning architectures for molecular property prediction, reviewing over one hundred models. They categorized these models based on their underlying methodologies and applications, and performed benchmark analyses using various datasets to evaluate performance and identify trends.
Results
The survey identifies that geometric graph neural networks excel in quantum property estimation, while transformer-based architectures perform well in binding-affinity prediction. The analysis also reveals that hybrid and quantum-informed designs can offer computational efficiencies for complex systems. The proposed taxonomy and benchmark methodologies aim to improve the reliability and transparency of molecular property predictions.
Implications
The findings of this survey have significant implications for the fields of drug discovery, materials science, and computational chemistry. By establishing a unified framework and addressing existing challenges, the research can facilitate the development of more accurate and efficient predictive models, ultimately leading to advancements in molecular design and discovery.
Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
NLP
Large Language Models
Generative Models
- Hallucination in language models is linked to asymmetric attractor dynamics.
- Same-prompt bifurcation isolates trajectory dynamics from prompt-level confounds.
- Activation patching reveals significant asymmetry in the ability to corrupt versus correct trajectories.
- The study identifies a strong correlation (r = 0.776) between step-0 residual states and per-prompt hallucination rates.
Summary
This paper investigates the phenomenon of hallucination in autoregressive language models, proposing that it arises from early trajectory commitments influenced by asymmetric attractor dynamics. The author employs a novel methodological approach called same-prompt bifurcation, which allows for the isolation of trajectory dynamics by repeatedly sampling identical inputs. The study focuses on the Qwen2.5-1.5B model across 61 prompts from six categories, revealing that 44.3% of prompts exhibit bifurcation, where factual and hallucinated outputs diverge at the first token generation. Activation patching experiments demonstrate a significant causal asymmetry: corrupting a correct trajectory with a hallucinated activation is successful in 87.5% of trials, while correcting a hallucinated trajectory succeeds only 33.3% of the time. The findings suggest that hallucination operates as a locally stable attractor basin, characterized by easy entry and difficult exit, with implications for understanding and potentially mitigating hallucination in language models.
Methodology
The study employs same-prompt bifurcation to analyze trajectory dynamics by generating multiple completions from identical prompts. Activation patching is used to manipulate model activations and assess the causal effects on output correctness. The analysis includes statistical measures to evaluate the significance of findings.
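The patching operation can be mimicked with a toy attractor system (purely illustrative; the paper patches residual-stream activations in Qwen2.5-1.5B, and the stand-in "model" below is an assumption for exposition):

```python
# Toy attractor system for illustrating activation patching; the "hidden
# state" is a hypothetical stand-in for a residual-stream activation.

def step(hidden, token):
    # Sticky dynamics: the sign of the hidden state tends to persist,
    # mimicking a locally stable attractor basin.
    return 0.9 * hidden + (0.1 if token == "fact" else -0.1)

def generate(initial_hidden, n_steps=10, patch=None):
    """Run the toy model; patch=(t, donor) overwrites the hidden state at
    step t with a donor activation, as activation patching does."""
    hidden, tokens = initial_hidden, []
    for t in range(n_steps):
        if patch is not None and patch[0] == t:
            hidden = patch[1]  # splice in the donor activation
        tokens.append("fact" if hidden >= 0 else "halluc")
        hidden = step(hidden, tokens[-1])
    return tokens

factual = generate(+0.5)
hallucinated = generate(-0.5)
# Corrupt the factual run with the hallucinated run's step-0 state:
corrupted = generate(+0.5, patch=(0, -0.5))
```

Splicing the donor step-0 state commits the whole trajectory to the other basin. The toy is symmetric, unlike the 87.5% vs 33.3% asymmetry the paper reports, but it shows operationally what "patching" and "trajectory commitment" mean.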
Results
Out of 61 prompts tested, 27 (44.3%) bifurcated, showing divergence between factual and hallucinated outputs at the first token. Activation patching experiments indicated that corrupting a correct trajectory was successful in 87.5% of cases, while correcting a hallucinated trajectory was successful in only 33.3%. The correlation between step-0 residual states and per-prompt hallucination rates was strong (r = 0.776, p < 0.001).
Implications
The findings provide insights into the mechanisms of hallucination in language models, suggesting potential strategies for improving model reliability and truthfulness. Understanding the dynamics of trajectory commitment may inform future model designs and interventions to reduce hallucination rates.
Beyond Distribution Sharpening: The Importance of Task Rewards
Reinforcement Learning
Large Language Models
Optimization
- Introduces a controlled framework for comparing distribution sharpening and task-reward optimization in RL.
- Demonstrates that task-reward optimization leads to significant performance improvements, especially on difficult tasks.
- Challenges the notion that RL fine-tuning primarily enhances existing model preferences through sharpening.
- Highlights the importance of task rewards in developing new capabilities in large language models.
Summary
This paper investigates the impact of task-reward-based reinforcement learning (RL) on the capabilities of frontier models, contrasting it with the concept of distribution sharpening. The authors argue that while distribution sharpening may enhance a model's confidence in its existing preferences, it does not necessarily lead to the acquisition of new skills. Through a systematic comparison of both paradigms using a KL-regularized RL framework, the study reveals that task-reward optimization consistently outperforms distribution sharpening, particularly on challenging tasks where base model performance is subpar. The findings suggest that the benefits of RL fine-tuning extend beyond mere confidence amplification, emphasizing the critical role of task rewards in shaping model behavior and capabilities. This work provides a clearer understanding of the mechanisms behind post-training improvements and highlights the importance of designing effective reward signals for scaling model capabilities.
Methodology
The authors utilize a KL-regularized RL framework to compare task-reward optimization and distribution sharpening. They implement various configurations of reward signals and KL regularization to isolate the effects of each approach. The study employs mathematical reasoning tasks of varying difficulty to evaluate the performance of different training paradigms.
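The KL-regularized objective has a well-known closed-form optimum, which makes the sharpening-versus-task-reward contrast easy to see in a toy setting (the three-answer task and all numbers below are hypothetical, not from the paper):

```python
import math

def kl_regularized_optimum(pi_ref, reward, beta):
    """Closed-form solution of max_pi E_pi[r] - beta * KL(pi || pi_ref):
    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnorm = {y: p * math.exp(reward[y] / beta) for y, p in pi_ref.items()}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}

# Hypothetical three-answer task where the base model prefers a wrong answer.
pi_ref = {"right": 0.2, "wrong_a": 0.5, "wrong_b": 0.3}

# Task reward: 1 for the correct answer, 0 otherwise.
pi_task = kl_regularized_optimum(
    pi_ref, {"right": 1.0, "wrong_a": 0.0, "wrong_b": 0.0}, beta=0.3)

# "Sharpening": reward the base model's own log-probability, which can
# only amplify its existing preference.
pi_sharp = kl_regularized_optimum(
    pi_ref, {y: math.log(p) for y, p in pi_ref.items()}, beta=0.3)
```

Under a task reward, probability mass moves onto the correct answer even though the base model preferred `wrong_a`; sharpening instead concentrates mass on `wrong_a`, which mirrors the paper's argument that sharpening cannot create new capability.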
Results
The experiments reveal that models trained with task-reward optimization achieve robust performance improvements compared to those relying solely on distribution sharpening. The gains are particularly pronounced on more challenging tasks, indicating that task-reward signals are essential for effective learning and capability enhancement.
Implications
The findings suggest that the design of reward signals is crucial for the development of advanced AI systems. By understanding the limitations of distribution sharpening, researchers can focus on creating more effective RL strategies that leverage task-reward optimization to enhance model capabilities, particularly in complex problem-solving scenarios.
Prior-Fitted Functional Flow: In-Context Generative Models for Pharmacokinetics
Generative Models
Time Series
- Introduction of Prior-Fitted Functional Flows (PFF) for pharmacokinetics.
- PFF enables zero-shot predictions and individual forecasting from sparse data.
- A new open-access dataset was created to calibrate physiological plausibility.
- PFF outperforms traditional NLME models and the AICMET model in predictive accuracy.
Summary
This paper presents Prior-Fitted Functional Flows (PFF), a novel generative model designed for pharmacokinetics (PK) that facilitates zero-shot population synthesis and individual forecasting without the need for manual parameter tuning. The authors address the challenge of inferring drug concentration dynamics from sparse and irregular data by learning functional vector fields conditioned on the entire study population. PFF employs conditional optimal-transport flow matching to generate coherent virtual cohorts and forecast patient trajectories with calibrated uncertainty. The model is pre-trained on a new open-access dataset derived from clinical bioequivalence studies, which enhances the physiological plausibility of the generated priors. The results demonstrate that PFF achieves state-of-the-art predictive accuracy on real-world datasets, outperforming existing methods such as nonlinear mixed-effects models (NLME) and the previous AICMET model. This work significantly advances the field of pharmacokinetics by providing a robust framework for modeling drug concentration profiles in data-scarce environments.
Methodology
The authors developed PFF by leveraging conditional optimal-transport flow matching to learn the distribution over drug concentration functions directly in function space. The model is pre-trained on simulated pharmacokinetic studies, allowing it to efficiently infer concentration profiles from sparse data. It incorporates continuum-attention neural operators to parameterize the transport vector field, facilitating queries at arbitrary time points.
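Conditional optimal-transport flow matching, in its generic textbook form (the paper's continuum-attention parameterization is more elaborate), builds regression targets by straight-line interpolation between noise and data:

```python
import random

def flow_matching_pair(x0, x1, t):
    """Straight-line (optimal-transport) interpolation used as a regression
    target: x_t = (1 - t) * x0 + t * x1, with velocity target x1 - x0.
    A network v_theta(x_t, t, context) would be trained to predict v_target."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

random.seed(0)
# x0: Gaussian noise; x1: a hypothetical drug-concentration curve sampled
# at three time points (the functional datum PFF models).
x0 = [random.gauss(0.0, 1.0) for _ in range(3)]
x1 = [0.0, 4.2, 1.1]
x_t, v_target = flow_matching_pair(x0, x1, t=0.5)
```

At generation time the learned field is integrated from noise to a sample; conditioning the field on the observed study population is what gives PFF its in-context behaviour.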
Results
PFF demonstrated superior predictive accuracy compared to NLME and AICMET models on extensive real-world pharmacokinetic datasets. The model effectively generated coherent virtual cohorts and accurately forecasted patient trajectories, showcasing its capability to handle sparse and irregular data.
Implications
The development of PFF has significant implications for pharmacokinetic modeling, particularly in early drug development stages where data is often limited. This approach can enhance the efficiency of drug studies by enabling accurate predictions without extensive manual tuning, potentially accelerating the drug development process.
SCRIPT: Implementing an Intelligent Tutoring System for Programming in a German University Context
Generative Models
NLP
Theory
- SCRIPT is designed to support Python programming education while conforming to European regulatory standards.
- The system functions as both a teaching tool and a research platform for ITS development.
- It employs a four-model architecture to facilitate personalized learning experiences.
- Initial implementation has been successfully used in a data mining course for exam preparation.
Summary
The paper presents SCRIPT, a novel Intelligent Tutoring System (ITS) designed to enhance programming education, specifically in Python, for advanced undergraduate computer science students at a German university. The authors highlight the need for an ITS that not only provides individualized hints and feedback but also adheres to stringent regulatory requirements, including GDPR and the European AI Act. SCRIPT aims to serve dual purposes: as an educational tool for teaching programming and as a research platform for developing and testing various ITS components, such as knowledge tracing models and feedback mechanisms utilizing large language models (LLMs). The system architecture is built on a four-model approach, encompassing a Pedagogical Model, Learner Model, Domain Model, and User Interface, all integrated within a web application hosted on the university's server infrastructure. The initial implementation has already been utilized for voluntary exam preparation in a data mining course, demonstrating its adaptability and effectiveness in real educational settings. Future directions include expanding the system's capabilities and further aligning it with educational and research needs.
Methodology
The authors adopted a four-model architecture for the ITS, including a Pedagogical Model, Learner Model, Domain Model, and User Interface. The system is implemented as a web application using FastAPI and MongoDB, allowing for flexible integration of various educational components and data collection for research purposes. The architecture supports A/B testing and the use of LLMs for generating programming hints.
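The four-model split can be sketched in a few lines; the class names echo the paper's terminology, but the update rule and hint policy below are hypothetical stand-ins, not SCRIPT's actual components:

```python
class LearnerModel:
    """Tracks per-skill mastery estimates (a toy stand-in for the
    knowledge-tracing models the platform is meant to host)."""
    def __init__(self):
        self.mastery = {}

    def update(self, skill, correct):
        p = self.mastery.get(skill, 0.3)          # assumed prior mastery
        self.mastery[skill] = p + 0.2 * ((1.0 if correct else 0.0) - p)

class PedagogicalModel:
    """Chooses a hint granularity from the learner's mastery estimate;
    in SCRIPT the hint text itself would come from an LLM."""
    def hint_level(self, learner, skill):
        return "detailed" if learner.mastery.get(skill, 0.3) < 0.5 else "brief"

learner = LearnerModel()
learner.update("recursion", correct=False)
level = PedagogicalModel().hint_level(learner, "recursion")
```

The value of the separation is that each component can be swapped independently, e.g. for the A/B tests the platform supports.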
Results
The initial deployment of SCRIPT has shown promise in providing individualized support for programming students, facilitating data collection for research on learner models and feedback mechanisms. The system's design allows for easy adaptation to different programming courses and instructional needs.
Implications
SCRIPT has the potential to significantly enhance programming education by providing personalized learning experiences and supporting research in intelligent tutoring systems. Its compliance with regulatory standards also sets a precedent for future educational technologies in similar contexts.
FL-MHSM: Spatially-adaptive Fusion and Ensemble Learning for Flood-Landslide Multi-Hazard Susceptibility Mapping at Regional Scale
Theory
Interpretability
Optimization
- Introduces a spatially adaptive modeling approach for multi-hazard susceptibility mapping.
- Combines Early Fusion and Late Fusion techniques with a Mixture of Experts model.
- Demonstrates improved predictive performance for flood and landslide susceptibility in diverse regions.
- Highlights the importance of spatial heterogeneity in hazard susceptibility analysis.
Summary
This paper addresses the limitations of existing multi-hazard susceptibility mapping (MHSM) studies, which often rely on spatially uniform models and treat hazards independently. The authors propose a novel deep learning workflow, FL-MHSM, which integrates spatially adaptive modeling techniques for joint flood and landslide susceptibility mapping. The methodology includes a two-level spatial partitioning strategy to account for ecological and physiographic heterogeneity, a probabilistic Early Fusion (EF) model to capture inter-hazard correlations, and a Mixture of Experts (MoE) model for ensemble learning. The study evaluates the performance of EF and Late Fusion (LF) models, demonstrating that EF can enhance flood recall and reduce Brier scores compared to LF. The MoE model outperforms both EF and LF in predicting flood and landslide susceptibility, achieving high AUC-ROC scores and favorable recall and F1-scores in both Kerala and Nepal. Additionally, the GeoDetector analysis reveals that the factors influencing susceptibility vary significantly across regions, highlighting the importance of spatial heterogeneity in hazard modeling. Overall, the proposed FL-MHSM framework offers a robust approach for multi-hazard susceptibility mapping in diverse landscapes.
Methodology
The FL-MHSM workflow employs a two-level spatial partitioning strategy to create contextual zones, utilizes a probabilistic Early Fusion model based on a multivariate Gaussian formulation, and implements a Mixture of Experts model for ensemble learning. The study compares the performance of Early Fusion and Late Fusion models, with Late Fusion implemented as an eXtreme Gradient Boosting baseline.
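The Mixture-of-Experts step reduces to a gated convex combination of zone-specific predictions; a minimal sketch (the gate values and expert outputs below are hypothetical, and the paper's gate is itself learned from spatial context):

```python
def moe_predict(expert_probs, gate_weights):
    """Mixture-of-Experts combination (generic sketch): a gate, conditioned
    on location, weighs each zone-specific expert's susceptibility estimate."""
    assert abs(sum(gate_weights) - 1.0) < 1e-9, "gate must be a distribution"
    return sum(w * p for w, p in zip(gate_weights, expert_probs))

# Hypothetical grid cell near a zone boundary: the gate leans on the local
# expert (0.75) but still borrows from the neighbouring one (0.25).
flood_prob = moe_predict(expert_probs=[0.8, 0.4], gate_weights=[0.75, 0.25])
```

This soft weighting is what lets the ensemble adapt across heterogeneous zones instead of applying one spatially uniform model.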
Results
In Kerala, the Early Fusion model improved flood recall from 0.816 to 0.840 and reduced the Brier score from 0.092 to 0.086. The Mixture of Experts model achieved an AUC-ROC of 0.905, a recall of 0.930, and an F1-score of 0.722. In Nepal, Early Fusion improved flood recall from 0.820 to 0.858 and reduced the Brier score from 0.057 to 0.049, while the Mixture of Experts model achieved an AUC-ROC of 0.914, a recall of 0.901, and an F1-score of 0.559.
Implications
The findings suggest that the FL-MHSM framework can significantly enhance the accuracy of multi-hazard susceptibility mapping, which is crucial for disaster risk assessment and management. The spatially adaptive approach allows for better representation of localized hazard behaviors, potentially informing policy and planning in regions prone to compound hazards.
Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning
Reinforcement Learning
Large Language Models
- Introduction of PieceHint, a framework for strategic hint injection in RL training.
- Focus on identifying critical reasoning steps to enhance model learning.
- Progressive withdrawal of hints promotes independent reasoning capabilities.
- Experimental validation shows competitive performance against larger models.
Summary
The paper addresses a critical challenge in reinforcement learning (RL) for enhancing large language model (LLM) reasoning, specifically the dilemma of training on easy versus hard problems. Training on easy problems can lead to overfitting and degradation in performance metrics like pass@k, while hard problems often yield sparse rewards. The authors propose a novel framework called PieceHint, which strategically identifies and injects critical reasoning steps as hints during training. This method aims to bridge reasoning bottlenecks that typically hinder model performance. By scoring the importance of different reasoning steps and selectively providing hints based on problem difficulty, PieceHint allows models to transition from guided learning to independent reasoning. The framework also incorporates a progressive withdrawal of hints to foster self-sufficiency in reasoning. Experimental results on six mathematical reasoning benchmarks demonstrate that the 1.5B parameter model using PieceHint achieves performance comparable to larger 32B models while maintaining better parameter efficiency and preserving diversity in exploration.
Methodology
The authors developed the PieceHint framework, which employs a scoring mechanism to evaluate the importance of reasoning steps. Hints are selectively allocated based on the difficulty of the problems and the model's capabilities. The framework also includes a curriculum for progressively withdrawing hints to encourage models to develop independent reasoning skills.
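The allocation logic can be sketched as below, with a hypothetical scoring vector and withdrawal schedule (the paper's actual importance scoring and curriculum are more involved):

```python
def hints_to_inject(step_scores, difficulty, progress, base_k=3):
    """Hypothetical PieceHint-style allocation: rank reasoning steps by an
    importance score, give more hints on harder problems (difficulty in
    [0, 1]), and progressively withdraw them as training progresses
    (progress in [0, 1])."""
    k = round(base_k * difficulty * (1.0 - progress))
    ranked = sorted(range(len(step_scores)), key=lambda i: -step_scores[i])
    return sorted(ranked[:k])  # indices of steps to prepend as hints

scores = [0.1, 0.9, 0.4, 0.7]   # assumed importance of each reasoning step
early = hints_to_inject(scores, difficulty=1.0, progress=0.0)  # full guidance
late = hints_to_inject(scores, difficulty=1.0, progress=0.9)   # withdrawn
```

Early in training the hardest problems receive their most critical steps as hints; by the end the schedule returns nothing, forcing independent reasoning.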
Results
The experiments conducted on six mathematical reasoning benchmarks revealed that the 1.5B parameter model utilizing PieceHint achieved average performance levels comparable to those of 32B models. Additionally, it demonstrated superior parameter efficiency and maintained diversity in exploration across various k values in pass@k metrics.
Implications
The PieceHint framework has significant implications for improving the training of reinforcement learning models, particularly in contexts where reasoning and problem-solving are critical. It suggests a more nuanced approach to hint provision that could enhance the efficiency and effectiveness of training large language models in various applications, including education, automated reasoning, and complex decision-making tasks.
Hybrid Spectro-Temporal Fusion Framework for Structural Health Monitoring
Time Series
- Introduction of a Hybrid Spectro-Temporal Fusion framework for SHM.
- Integration of Spectro-Temporal Alignment and Hybrid Spectro-Temporal Fusion for improved vibration analysis.
- Demonstrated superior performance of the proposed framework over conventional methods.
- Temporal resolution significantly impacts the performance of machine learning models.
Summary
This paper presents a novel Hybrid Spectro-Temporal Fusion framework aimed at enhancing structural health monitoring (SHM) through improved vibration-based damage detection. The proposed framework integrates arrival-time interval descriptors with spectral features to effectively capture both fine-scale and coarse-scale dynamics of vibrations. The authors introduce two key components: Spectro-Temporal Alignment (STA), which reformats vibration sequences using interpretable metrics to maintain essential dynamic behavior, and Hybrid Spectro-Temporal Fusion (HSTF), which combines arrival-time intervals and spectral descriptors to create richer time-frequency representations. Experimental validation using data from an electrodynamic shaker demonstrates that the proposed methods significantly outperform traditional input formulations, particularly in deep learning contexts. The results indicate that a temporal resolution of Δτ = 0.008 optimally enhances deep learning performance, while Δτ = 0.02 is more suitable for traditional machine learning models. Additionally, a comprehensive stability analysis reveals that the hybrid framework achieves higher accuracy with lower variability compared to baseline methods, underscoring its robustness and reliability for SHM applications.
Methodology
The methodology involves the development of two main components: Spectro-Temporal Alignment (STA) for reformatting vibration sequences using interpretable metrics, and Hybrid Spectro-Temporal Fusion (HSTF) for encoding arrival-time intervals and spectral descriptors. The framework is validated through experiments using data from an electrodynamic shaker, comparing the performance of the proposed methods against traditional input formulations.
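The two descriptor families being fused can be illustrated on a toy signal (a simplified sketch; the paper's STA and HSTF encodings are considerably richer than threshold crossings and a naive DFT):

```python
import cmath

def arrival_intervals(signal, threshold):
    """Arrival-time interval descriptor (simplified): time gaps between
    successive upward threshold crossings."""
    arrivals = [i for i in range(1, len(signal))
                if signal[i - 1] < threshold <= signal[i]]
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

def spectral_magnitudes(signal):
    """Naive DFT magnitudes as the spectral descriptor."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

# Hybrid fusion (sketch): concatenate both views into one feature vector.
sig = [0, 1, 0, 1, 0, 1, 0, 1]
features = arrival_intervals(sig, 0.5) + spectral_magnitudes(sig)
```

The intervals capture fine-scale timing, the magnitudes coarse frequency content; the fused vector is what a downstream classifier would consume.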
Results
The experimental results indicate that the proposed spectro-temporal representations significantly outperform conventional input methods, particularly in deep learning applications. The optimal temporal resolution of Δτ = 0.008 enhances deep learning performance, while Δτ = 0.02 favors traditional machine learning models. The hybrid framework consistently achieves higher accuracy and lower variability in classification tasks compared to baseline and alignment-only approaches.
Implications
The proposed framework offers a robust and reliable solution for vibration-based structural health monitoring, with potential applications in civil and mechanical engineering for real-time damage detection and structural integrity assessment. Its ability to enhance classification accuracy and stability can lead to improved safety and maintenance strategies for infrastructure.
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
NLP
Large Language Models
Theory
- Introduces sequential KV compression, addressing the limitations of per-vector compression methods.
- Proposes a two-layer architecture: probabilistic prefix deduplication and predictive delta coding.
- Achieves a theoretical compression ratio of approximately 914,000× over TurboQuant at the Shannon limit.
- Demonstrates that compression efficiency improves with increasing context length.
Summary
This paper addresses the limitations of existing key-value (KV) cache compression methods in transformer models, particularly focusing on the sequential nature of KV caches. Previous work, such as TurboQuant, has approached the Shannon entropy limit for per-vector compression but fails to consider the structured predictability of the tokens processed by the model. The author introduces a two-layer architecture for sequential KV compression that leverages this predictability. The first layer employs probabilistic prefix deduplication to identify and eliminate semantically equivalent shared prefixes across sessions, while the second layer utilizes predictive delta coding to store only the residuals of KV vectors based on the model's predictions. The paper proves that this approach achieves a per-token entropy bound significantly lower than that of TurboQuant, with a theoretical compression ratio of approximately 914,000× at the Shannon limit. Even under pessimistic conditions, the compression ratio remains around 914×, demonstrating that the proposed method improves compression efficiency as context length increases. The two layers are designed to be orthogonal and can be integrated with existing per-vector quantization methods, enhancing their performance.
Methodology
The methodology consists of a two-layer architecture for sequential KV cache compression. The first layer, probabilistic prefix deduplication, uses the trie metric from Probabilistic Language Tries to identify shared prefixes across sessions. The second layer, predictive delta coding, captures the residuals of KV vectors based on the model's predictions, thereby reducing the amount of data stored.
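The second layer follows a classic predictive-coding pattern: store residuals against a shared predictor, and reconstruction is exact as long as encoder and decoder make identical predictions. A minimal sketch with a hypothetical last-vector predictor (the paper predicts from the model itself):

```python
def delta_encode(kv_vectors, predict):
    """Store only residuals between each KV vector and the prediction
    made from the already-seen prefix."""
    residuals, prefix = [], []
    for v in kv_vectors:
        pred = predict(prefix)
        residuals.append([a - b for a, b in zip(v, pred)])
        prefix.append(v)
    return residuals

def delta_decode(residuals, predict):
    """Invert delta_encode; exact because both sides share the predictor."""
    prefix = []
    for r in residuals:
        pred = predict(prefix)
        prefix.append([a + b for a, b in zip(r, pred)])
    return prefix

# Hypothetical predictor: repeat the previous KV vector (zeros at start).
predict = lambda prefix: prefix[-1] if prefix else [0.0, 0.0]

kv = [[1.0, 2.0], [1.5, 2.0], [1.5, 2.25]]
residuals = delta_encode(kv, predict)   # mostly near-zero, hence compressible
assert delta_decode(residuals, predict) == kv
```

The better the predictor, the closer the residuals sit to zero, and the lower the entropy that any downstream per-vector quantizer has to pay for.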
Results
The proposed method achieves a per-token entropy bound of 3.3–4.3 bits on average, far below TurboQuant's 3 bits per vector component, a cost that compounds across every component of every token's KV vectors. The theoretical compression ratio is approximately 914,000× at the Shannon limit, and even under pessimistic assumptions it maintains a ratio of around 914× over TurboQuant.
Implications
The findings suggest that sequential KV cache compression can drastically reduce memory requirements for large-scale transformer models, potentially enabling more efficient inference and deployment of language models in resource-constrained environments. This could lead to advancements in applications requiring real-time processing and lower latency.
Wasserstein Distributionally Robust Risk-Sensitive Estimation via Conditional Value-at-Risk
Optimization
Theory
Time Series
- Introduces a distributionally robust approach to risk-sensitive estimation using CVaR.
- Establishes a framework for minimizing worst-case CVaR over a type-2 Wasserstein ambiguity set.
- Derives a tractable semidefinite programming formulation for computing affine estimators.
- Demonstrates improved performance in electricity price forecasting compared to traditional methods.
Summary
This paper presents a novel approach to risk-sensitive estimation of an unknown signal from an observed signal by employing a distributionally robust framework. The authors model the unknown signal and observation as random vectors with an ambiguous joint probability distribution, which is constrained within a type-2 Wasserstein ball. The performance of the estimator is evaluated using the conditional value-at-risk (CVaR) of the squared estimation error, focusing on minimizing the worst-case CVaR across all distributions in the ambiguity set. The main contribution is the derivation of affine estimators that can be computed exactly through a tractable semidefinite program when the nominal distribution is finitely supported. The proposed method is tested on a wholesale electricity price forecasting task, demonstrating its effectiveness in reducing out-of-sample CVaR of squared error compared to existing estimation techniques.
Methodology
The authors formulate the estimation problem as a minimization of the worst-case CVaR of squared estimation error over a type-2 Wasserstein ambiguity set. They utilize duality principles from optimal transport theory to reformulate the problem into a convex optimization problem, specifically a semidefinite program, which allows for efficient computation of the estimators.
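The inner quantity being robustified is the CVaR of squared error; a minimal empirical version (this discrete sketch ignores the fractional-atom correction in the exact definition, and the paper then takes the worst case of this quantity over a Wasserstein ball via an SDP):

```python
def cvar(losses, alpha):
    """Empirical conditional value-at-risk: mean of the worst
    (1 - alpha) fraction of losses."""
    losses = sorted(losses, reverse=True)
    k = max(1, int(round((1 - alpha) * len(losses))))
    return sum(losses[:k]) / k

# Hypothetical squared forecast errors for ten electricity-price predictions.
sq_errors = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 1.0, 2.0, 4.0, 9.0]
tail_risk = cvar(sq_errors, alpha=0.8)   # mean of the worst 20%
```

Minimizing this tail mean, rather than the plain mean squared error, is what makes the estimator risk-sensitive to the rare large errors that matter in electricity markets.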
Results
The numerical experiments conducted on real market data for electricity price forecasting indicate that the proposed estimators achieve a lower out-of-sample CVaR of squared error compared to existing estimation methods, highlighting the effectiveness of the distributionally robust approach.
Implications
The findings suggest that incorporating distributional robustness in risk-sensitive estimation can significantly enhance prediction models in applications where large errors are critical, such as in power systems and financial trading. This approach could lead to more reliable forecasting methods that better manage tail risks.
Learning Affine-Equivariant Proximal Operators
Optimization
Computer Vision
Theory
- Introduction of AE-LPNs that compute exact proximal operators while being equivariant to shifts and scaling.
- Demonstration of the importance of equivariance in enhancing robustness to noise and affine transformations.
- Development of conditions for ensuring affine-equivariance in neural network architectures.
- Validation of AE-LPNs through both synthetic examples and real-world denoising tasks.
Summary
This paper introduces Affine-Equivariant Learned Proximal Networks (AE-LPNs), a novel framework for learning proximal operators that are equivariant to shifts and scaling. Proximal operators are crucial in various applications, particularly in solving ill-posed inverse problems such as image denoising and deblurring. The authors build on the concept of Learned Proximal Networks (LPNs), which provide a means to compute exact proximal operators using neural networks. However, traditional LPNs do not account for structural properties like equivariance, which can enhance robustness in real-world applications. The authors demonstrate how to enforce affine-equivariance in the design of neural networks, ensuring that the learned proximal operators maintain their equivariance structure. They validate their approach through synthetic examples and real data applications, showing that AE-LPNs significantly improve performance in denoising tasks, particularly in out-of-distribution scenarios.
Methodology
The authors propose a framework for AE-LPNs by enforcing conditions that ensure the learned functions are affine-equivariant. This is achieved through specific input transformations and architectural designs in neural networks. The methodology builds on the existing LPN framework, which parameterizes proximal operators as gradients of convex potentials, ensuring that the resulting mappings are both proximal operators and maintain equivariance properties.
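The object being learned is the proximal operator itself; a brute-force sketch of its definition, checked against the closed-form prox of the l1 penalty (standard background to fix ideas, not the AE-LPN architecture):

```python
def prox_numeric(f, v, grid):
    """Proximal operator by grid search over candidates:
    prox_f(v) = argmin_x 0.5 * (x - v)**2 + f(x)."""
    return min(grid, key=lambda x: 0.5 * (x - v) ** 2 + f(x))

def soft_threshold(v, lam):
    """Closed-form prox of f(x) = lam * |x| (soft-thresholding)."""
    return max(abs(v) - lam, 0.0) * (1 if v > 0 else -1)

grid = [i / 1000 for i in range(-3000, 3001)]
lam, v = 0.5, 1.3
approx = prox_numeric(lambda x: lam * abs(x), v, grid)
exact = soft_threshold(v, lam)
```

An LPN replaces the hand-written `f` with the gradient of a learned convex potential; AE-LPNs further constrain that network so the resulting prox commutes with shifts and scalings of its input.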
Results
The AE-LPNs were shown to effectively compute proximal operators while preserving affine-equivariance. The experiments demonstrated that AE-LPNs outperformed traditional LPNs in denoising tasks, particularly when faced with noise distributions and affine shifts that were not present during training. The results indicate that incorporating equivariance into learned proximal operators enhances their robustness and generalization capabilities.
Implications
The findings suggest that incorporating structural properties like equivariance into learned models can significantly improve their performance in practical applications, particularly in signal processing and machine learning tasks involving inverse problems. This approach could lead to more robust and adaptable models in various domains, including image processing and beyond.
Impact of Nonlinear Power Amplifier on Massive MIMO: Machine Learning Prediction Under Realistic Radio Channel
Theory
Optimization
Efficient ML
- Nonlinear effects of power amplifiers in M-MIMO systems are significant and often overlooked in existing literature.
- The paper proposes a statistical model and a machine learning model to predict Signal to Distortion Ratio (SDR) under realistic conditions.
- The ML-based power allocation scheme demonstrates a 12% median gain in user throughput compared to fixed operating point schemes.
- 3D-Ray Tracing simulations reveal the inadequacy of traditional channel models in accurately capturing nonlinear distortion effects.

Summary
This paper investigates the effects of nonlinear power amplifiers (PAs) on Massive Multiple-Input Multiple-Output (M-MIMO) systems, particularly under realistic radio channel conditions. While M-MIMO technology is essential for enhancing spectral and energy efficiency in wireless networks, most existing studies assume linear behavior of PAs, which neglects the significant nonlinear distortions that can occur in practical scenarios, especially in multicarrier systems like Orthogonal Frequency Division Multiplexing (OFDM). The authors first provide a theoretical characterization of nonlinear distortion in M-MIMO systems using common radio channel models, such as Rayleigh and Line of Sight (LoS). They then demonstrate the inadequacy of these models through 3D-Ray Tracing (3D-RT) simulations. To address this, two new models are proposed: a statistical model utilizing the Generalized Extreme Value (GEV) distribution for modeling Signal to Distortion Ratio (SDR) and a machine learning (ML)-based model that predicts SDR based on spatial characteristics of the radio channel and PA operation points. The ML model facilitates PA-aware per-user power allocation, leading to improved user throughput. The results indicate a median throughput gain of approximately 12% when using the proposed ML-based power allocation scheme compared to traditional fixed operating point methods.
Methodology
The authors conducted a theoretical analysis of nonlinear distortion in M-MIMO systems using common radio channel models. They employed 3D-Ray Tracing software to validate these models and proposed two new approaches: a statistical model based on the Generalized Extreme Value distribution and a machine learning model that predicts SDR based on spatial characteristics and PA operation points. The performance of the proposed ML-based power allocation scheme was compared against traditional methods.
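As a rough illustration of the statistical branch, a GEV distribution can be fitted to SDR-like samples with `scipy.stats.genextreme`; the data and parameter values below are invented, not the paper's measurements (note scipy's shape `c` is the negative of the usual GEV ξ convention):

```python
import numpy as np
from scipy.stats import genextreme

# Invented SDR-like samples (dB) standing in for per-user measurements.
true_c, true_loc, true_scale = -0.1, 20.0, 3.0
sdr_db = genextreme.rvs(true_c, loc=true_loc, scale=true_scale,
                        size=5000, random_state=42)

# Maximum-likelihood fit of the three GEV parameters.
c_hat, loc_hat, scale_hat = genextreme.fit(sdr_db)

# A fitted model then answers tail questions, e.g. P(SDR < 15 dB):
p_low = genextreme.cdf(15.0, c_hat, loc=loc_hat, scale=scale_hat)
print(f"shape={c_hat:.2f} loc={loc_hat:.2f} scale={scale_hat:.2f} "
      f"P(<15dB)={p_low:.3f}")
```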
Results
The proposed machine learning-based power allocation scheme achieved a median throughput gain of approximately 12% over the state-of-the-art fixed operating point scheme. The analysis also revealed that traditional models inadequately capture the nonlinear distortion effects in M-MIMO systems.
Implications
The findings suggest that incorporating nonlinear behavior in M-MIMO system designs can significantly enhance performance, particularly in future wireless networks like 6G. The proposed models can be utilized for more efficient power allocation strategies in practical deployments.
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
NLP
Large Language Models
Optimization
Efficient ML
- AdaLeZO improves ZO optimization by addressing layer sensitivity and computational inefficiencies.
- The framework uses a Multi-Armed Bandit approach for dynamic perturbation allocation.
- Inverse Probability Weighting is employed to ensure unbiased gradient estimation.
- Extensive experiments show significant speedups in wall-clock time without sacrificing accuracy.
Summary
This paper addresses the challenges of Zeroth-Order (ZO) optimization, particularly in the context of fine-tuning Large Language Models (LLMs). The authors identify that traditional ZO methods suffer from slow convergence and high variance due to inefficient exploration strategies that do not account for the varying sensitivity of different layers in deep networks. To overcome these issues, they propose AdaLeZO, an Adaptive Layer-wise ZO optimization framework that formulates the layer selection process as a non-stationary Multi-Armed Bandit (MAB) problem. This approach allows for dynamic allocation of perturbation budgets to the most sensitive parameters, significantly improving computational efficiency. The authors also introduce an Inverse Probability Weighting mechanism to ensure unbiased gradient estimation while reducing variance. Extensive experiments on models like LLaMA and OPT demonstrate that AdaLeZO achieves substantial wall-clock speedups of 1.7× to 3.0× compared to existing methods, while maintaining or improving accuracy. AdaLeZO is presented as a universal plug-and-play module that enhances the efficiency of existing ZO optimizers without additional memory overhead.
Methodology
The authors analyze the runtime characteristics of ZO algorithms and identify inefficiencies in perturbation generation and parameter updates. They propose AdaLeZO, which utilizes a Multi-Armed Bandit framework for adaptive layer selection and an Inverse Probability Weighting mechanism for unbiased gradient estimation. This combination allows for efficient and targeted optimization.
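The two mechanisms can be sketched together in a toy zeroth-order loop. Here the layer probabilities are fixed by hand rather than chosen by the bandit, and the model is a bare quadratic, so this is only a schematic of the estimator, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": two layers of parameters; the loss is a plain quadratic.
layers = {"w1": rng.normal(size=8), "w2": rng.normal(size=8)}

def loss(params):
    return sum(float(np.sum(v ** 2)) for v in params.values())

def zo_step(params, probs, lr=0.01, eps=1e-3):
    """One SPSA-style zeroth-order step with layer-wise sampling.
    Each layer is perturbed only with probability probs[name]; a picked
    layer's update is divided by that probability (inverse probability
    weighting), which keeps the gradient estimator unbiased."""
    picked = {k: rng.random() < probs[k] for k in params}
    z = {k: rng.normal(size=v.shape) if picked[k] else np.zeros_like(v)
         for k, v in params.items()}
    plus = {k: v + eps * z[k] for k, v in params.items()}
    minus = {k: v - eps * z[k] for k, v in params.items()}
    g_scale = (loss(plus) - loss(minus)) / (2 * eps)  # two forward passes only
    for k in params:
        if picked[k]:
            params[k] -= lr * g_scale * z[k] / probs[k]  # IPW correction

probs = {"w1": 0.9, "w2": 0.3}  # hand-set "sensitivities" (the bandit learns these)
start = loss(layers)
for _ in range(2000):
    zo_step(layers, probs)
print(loss(layers) < start)
```

Skipping a layer saves the cost of generating and applying its perturbation, which is where the wall-clock gain comes from; the IPW factor is what prevents the sampling from biasing the descent direction.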
Results
AdaLeZO achieves wall-clock speedups of 1.7× to 3.0× over state-of-the-art ZO optimization methods on LLaMA and OPT models ranging from 6.7B to 30B parameters, while maintaining or improving model accuracy.
Implications
The proposed AdaLeZO framework can significantly enhance the efficiency of fine-tuning large-scale language models, making it feasible to perform such tasks on consumer-grade hardware. This could lead to broader accessibility and application of LLMs in various domains.
Demystifying the unreasonable effectiveness of online alignment methods
Theory
Reinforcement Learning
Efficient ML
- Introduces temperature-zero regret as a criterion focusing on the top-ranked response.
- Proves that greedy online alignment methods achieve bounded (O(1)) cumulative temperature-zero regret.
- Clarifies that prior logarithmic-regret results are driven by policy randomization rather than failure to identify the best response.
- Demonstrates the effectiveness of greedy alignment methods in practical applications.
Summary
This paper addresses the effectiveness of iterative alignment methods, particularly those based on greedy updates, which show remarkable empirical performance even though existing theory guarantees only O(log T) KL-regularized regret. The author argues that this discrepancy arises from the regret criterion itself, which conflates the statistical cost of learning with the exploratory randomization from softened training policies. To clarify this, the paper introduces the temperature-zero regret criterion, which focuses solely on the top-ranked response during inference. The author proves that standard greedy online alignment methods, such as online RLHF and online DPO, achieve constant (O(1)) cumulative regret under this criterion. By distinguishing the cost of identifying the best response from the stochasticity introduced by regularization, the findings provide a clearer theoretical understanding of the efficiency of greedy alignment methods. The paper further formalizes the interaction model for iterative alignment, describes the greedy alignment procedures, and presents both theoretical proofs and empirical validation of the results through controlled simulations.
Methodology
The paper formalizes the temperature-zero regret criterion and analyzes the standard greedy alignment loop. It includes theoretical proofs of the cumulative regret results and empirical validation through controlled simulations to demonstrate the effectiveness of greedy alignment methods.
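The distinction between the two regret notions can be seen in a toy bandit; this is an invented setup, far simpler than the paper's alignment model, but it shows the same effect: the greedy argmax stops paying regret once the best response is identified, while a softened policy keeps paying for its own randomization:

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.2, 0.5, 0.8])  # hypothetical quality of 3 responses
est, counts = np.zeros(3), np.zeros(3)
t0_regret = soft_regret = 0.0

for t in range(1000):
    # Online loop: a short forced-exploration phase, then greedy exploitation.
    arm = t % 3 if t < 30 else int(np.argmax(est))
    r = true_means[arm] + 0.1 * rng.normal()
    counts[arm] += 1
    est[arm] += (r - est[arm]) / counts[arm]

    # Temperature-zero regret: pay only when the top-ranked response is wrong.
    t0_regret += true_means.max() - true_means[int(np.argmax(est))]
    # A softened (temperature > 0) policy pays for its own randomization forever.
    p = np.exp(est / 0.3)
    p /= p.sum()
    soft_regret += true_means.max() - p @ true_means

print(f"temperature-zero regret: {t0_regret:.1f}, softened regret: {soft_regret:.1f}")
```

The cumulative temperature-zero regret plateaus once the argmax is correct, while the softened policy's regret grows linearly in the number of rounds, mirroring the O(1) versus O(log T)-type gap discussed above.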
Results
The main results indicate that greedy online alignment methods can achieve constant cumulative regret when evaluated under the temperature-zero criterion. This contrasts with the previously established O(log T) KL-regularized regret, which is shown to be influenced by the randomization of the KL-regularized policy rather than the learner's inability to identify the best response.
Implications
The findings suggest that greedy alignment methods are more efficient than previously understood, which could influence the design and evaluation of online learning systems and preference-based training methods in machine learning applications.
Modern Structure-Aware Simplicial Spatiotemporal Neural Network
Graph Learning
Time Series
Theory
- First approach to utilize high-dimensional simplicial complexes for spatiotemporal modeling.
- Combines spatiotemporal random walks with Temporal Convolutional Networks for improved efficiency.
- Demonstrates effectiveness across diverse real-world datasets in energy, environmental, and transportation sectors.
- Achieves competitive performance in both prediction and data imputation tasks.
Summary
The paper introduces the Modern Structure-Aware Simplicial Spatiotemporal Neural Network (ModernSASST), a novel approach to spatiotemporal modeling that leverages simplicial complex structures to capture higher-order topological relationships in data. Traditional graph neural networks (GNNs) primarily focus on pairwise relationships, which limits their effectiveness in modeling complex systems. ModernSASST addresses this limitation by employing spatiotemporal random walks on high-dimensional simplicial complexes and integrating parallelizable Temporal Convolutional Networks (TCNs) to enhance computational efficiency. The authors conduct extensive experiments across three real-world datasets in energy, environmental, and transportation domains, demonstrating the model's effectiveness in both spatiotemporal prediction and data imputation. The results indicate that ModernSASST outperforms existing methods, showcasing its potential for applications that require understanding complex interactions in dynamic systems.
Methodology
The methodology involves the use of spatiotemporal random walks on simplicial complexes to capture complex topological dependencies. The model integrates Temporal Convolutional Networks to allow for parallel processing, enhancing computational efficiency while maintaining the ability to model intricate spatiotemporal relationships.
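The kind of higher-order walk involved can be illustrated on a tiny complex. This sketch (not the authors' sampler) walks over edges (1-simplices), stepping to edges that share a vertex or lie in a common triangle:

```python
import random

# Triangles (2-simplices) and their edges (1-simplices) in a tiny complex.
triangles = [(0, 1, 2), (1, 2, 3)]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

def neighbors(e):
    """Edges reachable from e: those sharing a vertex (lower adjacency)
    or lying in a common triangle (upper adjacency)."""
    lower = [f for f in edges if f != e and set(e) & set(f)]
    upper = [f for f in edges if f != e and any(
        set(e) <= set(t) and set(f) <= set(t) for t in triangles)]
    return list(dict.fromkeys(lower + upper))  # dedupe, keep order

def walk(start, steps, seed=0):
    """Uniform random walk over 1-simplices; a spatiotemporal variant
    would additionally advance a time index at each hop."""
    rng = random.Random(seed)
    path, e = [start], start
    for _ in range(steps):
        e = rng.choice(neighbors(e))
        path.append(e)
    return path

print(walk((0, 1), steps=5))
```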
Results
The experiments conducted on three diverse datasets show that ModernSASST significantly outperforms traditional GNN-based models in spatiotemporal prediction tasks and also excels in data imputation, indicating its versatility and effectiveness in handling complex spatiotemporal data.
Implications
The implications of this research extend to various domains where understanding complex interactions over time is crucial, such as smart cities, environmental monitoring, and transportation systems. The ability to model higher-dimensional relationships can lead to more accurate predictions and insights in these fields.
QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
NLP
Large Language Models
Time Series
- Introduction of QuantSightBench as a benchmark for evaluating LLMs in quantitative forecasting.
- Proposed use of prediction intervals as a more rigorous evaluation format compared to point estimates.
- Evaluation of multiple LLMs shows none achieve the 90% coverage target for prediction intervals.
- Identified systematic overconfidence and calibration issues across evaluated models.
Summary
This paper addresses the limitations of existing evaluations of large language models (LLMs) in the context of quantitative forecasting, which is crucial for decision-making across various domains such as economics and public health. Current benchmarks primarily focus on binary or multiple-choice tasks, failing to capture the complexity of numerical estimates that require explicit uncertainty representation. The authors propose a new evaluation format using prediction intervals, which provide a bounded numerical range at a specified confidence level, aligning with how humans naturally reason about uncertain quantities. They introduce QuantSightBench, a benchmark designed to assess LLMs' forecasting capabilities through prediction intervals across diverse domains. The evaluation includes frontier and open-weight models under various settings, revealing that none of the 11 models evaluated meet the 90% coverage target for prediction intervals, with the best performers falling significantly short. The study highlights systematic overconfidence in model predictions, particularly at extreme magnitudes, indicating a need for improved calibration in LLMs.
Methodology
The authors developed QuantSightBench to evaluate LLMs' forecasting capabilities using prediction intervals. They assessed 11 models across different settings (zero-shot, context-grounded, and agentic) and analyzed their performance in terms of empirical coverage and interval sharpness, focusing on calibration and scale sensitivity.
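The two headline metrics, empirical coverage and interval sharpness, are straightforward to compute. A minimal sketch with invented forecasts and outcomes:

```python
import numpy as np

# Invented forecasts: each row is a (low, high) interval at nominal 90%
# confidence, scored against the realized value.
truths = np.array([12.0, 105.0, 3.4, 88.0, 41.0])
intervals = np.array([[8.0, 15.0],    # covers 12.0
                      [90.0, 100.0],  # misses 105.0 -> overconfident
                      [2.0, 5.0],     # covers 3.4
                      [70.0, 95.0],   # covers 88.0
                      [30.0, 40.0]])  # misses 41.0

covered = (intervals[:, 0] <= truths) & (truths <= intervals[:, 1])
coverage = covered.mean()                               # empirical coverage
sharpness = (intervals[:, 1] - intervals[:, 0]).mean()  # mean interval width
print(f"coverage={coverage:.0%} (target 90%), mean width={sharpness:.1f}")
```

Coverage below the nominal level with narrow intervals is the overconfidence signature the benchmark reports; a model could trivially hit 90% coverage with absurdly wide intervals, which is why sharpness is tracked alongside it.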
Results
The evaluation revealed that the top-performing models—Gemini 3.1 Pro, Grok 4, and GPT-5.4—achieved only 79.1%, 76.4%, and 75.3% coverage, respectively, all falling short of the 90% target. The analysis indicated a significant degradation in calibration at extreme magnitudes, highlighting a consistent pattern of overconfidence among the models.
Implications
The findings suggest that current LLMs may not be reliable for high-stakes forecasting tasks due to their poor calibration and overconfidence. This highlights the necessity for improved evaluation frameworks that better capture uncertainty in predictions, which is crucial for informed decision-making in various fields.
Generalization Boundaries of Fine-Tuned Small Language Models for Graph Structural Inference
Graph Learning
Large Language Models
NLP
- Fine-tuned SLMs exhibit strong ordinal consistency across different graph families.
- Structural reasoning performance degrades gracefully with increasing graph size, with architecture-specific degradation profiles.
- Adjacency-list serialization is more effective than edge-list encoding, especially for larger graphs.
- Node-level properties are estimated most reliably, while global properties pose significant inference challenges.
Summary
This paper investigates the generalization capabilities of fine-tuned small language models (SLMs) for graph structural inference, focusing on their performance beyond training conditions. The study systematically explores two generalization axes: graph size and graph family distribution, using a controlled experimental setup with three instruction-tuned models in the 3-4B parameter range and two graph serialization formats. The findings reveal that fine-tuned SLMs maintain strong ordinal consistency across structurally distinct graph families, indicating a genuine understanding of structural properties rather than mere memorization. Additionally, structural reasoning degrades gracefully with increasing graph size, with distinct degradation profiles observed for different architectures. The adjacency-list serialization format consistently outperforms edge-list encoding, particularly at larger graph sizes. The analysis also shows a locality gradient in structural understanding, with node-level properties estimated most reliably, followed by local-structural properties, while global combinatorial properties represent a significant inference boundary. These results provide empirical grounding for the use of fine-tuned SLMs in graph-based reasoning tasks and clarify their limitations.
Methodology
The research employs a controlled experimental setup to evaluate the performance of fine-tuned small language models on graph property estimation tasks. It systematically varies graph size and family distribution, using both adjacency-list and edge-list serialization formats to assess the models' generalization capabilities.
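The two serialization formats being compared look like this on a small graph (the authors' exact prompt templates may differ):

```python
# A 4-node graph in the two text formats compared by the study.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

def edge_list(edges):
    """One line per edge; a model must re-scan every pair to reason
    about any single node's neighborhood."""
    return "\n".join(f"{u} -- {v}" for u, v in edges)

def adjacency_list(edges, n):
    """One line per node; each neighborhood is locally grouped, the
    property the paper associates with better scaling to larger graphs."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return "\n".join(f"{i}: {sorted(adj[i])}" for i in range(n))

print(edge_list(edges))
print(adjacency_list(edges, 4))
```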
Results
The results demonstrate that fine-tuned SLMs maintain strong ordinal consistency across different graph families and can rank graphs by structural properties even when presented with larger graphs than those seen during training. The study identifies distinct degradation profiles for different architectures and highlights the superiority of adjacency-list serialization over edge-list encoding.
Implications
The findings suggest that fine-tuned small language models can be effectively utilized for graph-based reasoning tasks, providing a foundation for their application in scenarios where specialized graph libraries are impractical. However, the identified limitations also caution against over-reliance on these models for certain types of graph properties.
Training Time Prediction for Mixed Precision-based Distributed Training
Efficient ML
- Floating-point precision significantly affects training time, with variations up to 2.4x.
- Existing prediction methods fail to account for precision variations, leading to high prediction errors.
- The proposed precision-aware predictor achieves a MAPE of 9.8%, significantly improving accuracy.
- The methodology incorporates operator-level precision and communication overheads for better predictions.
Summary
This paper addresses the challenge of accurately predicting training time in distributed deep learning, particularly when using mixed precision settings. The authors highlight that the choice of floating-point precision can lead to significant variations in training time, with differences of up to 2.4 times observed. Existing prediction methods often rely on static computation graphs that do not account for these variations, resulting in high prediction errors, sometimes exceeding 147.85% in mean absolute percentage error (MAPE). To overcome this limitation, the authors propose a precision-aware training time predictor that dynamically adjusts to various precision settings, including mixed precision. By partitioning the model computation graph and incorporating communication overheads, their approach achieves a robust prediction accuracy with a MAPE of just 9.8%. This advancement is crucial for optimizing resource allocation, job scheduling, and cost estimation in distributed training environments.
Methodology
The authors developed a distributed training time predictor that supports arbitrary precision settings. They partition the model computation graph to identify operator-level precision and incorporate communication overheads from various parallelism strategies (data, tensor, and pipeline parallelism). The prediction model is formulated to account for computation time and communication delays, enabling accurate training time estimation across different precision configurations.
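The structure of such a predictor can be sketched as a back-of-the-envelope model: per-operator compute time depends on each operator's assigned precision, and communication is added on top. The throughput numbers below are illustrative hardware specs, and the paper's predictor is calibrated rather than this closed form:

```python
# Illustrative A100-style peak throughputs (TFLOP/s); a real predictor
# would calibrate achieved, not peak, rates per operator.
PEAK_TFLOPS = {"fp32": 19.5, "fp16": 312.0, "bf16": 312.0}

def op_time_ms(flops, precision):
    """Compute time for one operator at its assigned precision."""
    return flops / (PEAK_TFLOPS[precision] * 1e12) * 1e3

def step_time_ms(ops, comm_ms):
    """Step time = sum of per-operator compute, each at its own precision,
    plus communication overhead from the parallelism strategy."""
    return sum(op_time_ms(f, p) for f, p in ops) + comm_ms

# The same workload with its big matmul in fp16 vs kept entirely in fp32:
mixed = [(2e12, "fp16"), (5e11, "fp32")]
full  = [(2e12, "fp32"), (5e11, "fp32")]
print(step_time_ms(mixed, comm_ms=4.0), step_time_ms(full, comm_ms=4.0))
```

Even this crude model reproduces the headline observation: which operators run at which precision changes the step time by a large factor, so a precision-blind predictor cannot be accurate.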
Results
The proposed predictor demonstrated a mean absolute percentage error (MAPE) of 9.8% across various precision settings, a significant improvement over existing methods like NeuSight and vTrain, which exhibited prediction errors of 130.55% and 147.85%, respectively, when faced with mixed precision settings.
Implications
The findings of this research have important implications for optimizing distributed training processes in deep learning. By providing accurate training time predictions, the proposed method can enhance resource allocation, improve job scheduling efficiency, and reduce costs associated with distributed training, ultimately facilitating the training of larger models more effectively.
Late Fusion Neural Operators for Extrapolation Across Parameter Space in Partial Differential Equations
Theory
Interpretability
Optimization
- Introduction of Late Fusion Neural Operator architecture for improved extrapolation in PDEs.
- Separation of state dynamics and parameter effects enhances generalization capabilities.
- Significant performance improvements over existing neural operator methods.
- Comprehensive benchmarking across diverse PDE problems.
Summary
This paper addresses the challenge of accurately predicting the behavior of systems governed by partial differential equations (PDEs) across unseen parameter regimes, which is crucial for robust generalization in scientific and engineering applications. The authors introduce the Late Fusion Neural Operator, an architecture that disentangles the learning of state dynamics from parameter effects, thereby enhancing predictive performance both within and beyond the training distribution. By combining neural operators for latent state representation with sparse regression for structured parameter incorporation, the proposed method demonstrates significant improvements in extrapolation capabilities. The authors evaluate their approach on four benchmark PDEs, including advection and reaction-diffusion equations, and find that it consistently outperforms existing methods like the Fourier Neural Operator and CAPE-FNO. The Late Fusion Neural Operators achieve an average RMSE reduction of 72.9% in-domain and 71.8% out-of-domain compared to the second-best method, showcasing strong generalization across parameter regimes. The study also includes ablation studies and interpretation analyses, revealing insights into the relationships between system dynamics and governing parameters.
Methodology
The authors propose a Late Fusion Neural Operator that processes parameter information separately from the evolving state. This architecture utilizes sparse regression to explicitly incorporate parameter dependencies in a structured manner. The method is evaluated on various PDEs, focusing on in- and out-of-distribution extrapolation without relying on temporal history or prior knowledge of governing equations.
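The late-fusion idea can be caricatured in a few lines: the state is encoded without seeing the parameter, which then enters through an explicit low-order dictionary in the spirit of sparse regression. All names, shapes, and the dictionary below are invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_state(u, W):
    """State-only branch: stands in for the neural-operator encoder."""
    return np.tanh(u @ W)

def late_fusion_step(u, theta, W, coeffs):
    z = encode_state(u, W)  # latent state, computed without the parameter
    # The parameter enters late, through an explicit dictionary
    # [1, theta, theta^2] whose coefficients a sparse regression selects:
    scale = coeffs @ np.array([1.0, theta, theta ** 2])
    return u + scale * z    # parameter modulates, rather than entangles, the update

W = rng.normal(scale=0.1, size=(16, 16))
coeffs = np.array([0.0, 1.0, 0.0])  # suppose the dependence is linear in theta
u = rng.normal(size=16)
u_next = late_fusion_step(u, theta=0.5, W=W, coeffs=coeffs)
print(u_next.shape)
```

Because the parameter dependence is an explicit, structured function rather than something absorbed into the encoder weights, extrapolating to an unseen theta amounts to evaluating that function outside its training range, which is the mechanism behind the improved out-of-distribution behavior.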
Results
The Late Fusion Neural Operators consistently outperform the Fourier Neural Operator and CAPE-FNO across four benchmark PDEs, achieving an average RMSE reduction of 72.9% in-domain and 71.8% out-of-domain compared to the second-best method. The results demonstrate strong generalization capabilities across both in-domain and out-of-domain parameter regimes.
Implications
The findings suggest that the Late Fusion Neural Operator can significantly enhance the predictive modeling of complex systems governed by PDEs, making it a valuable tool for scientific and engineering applications where parameter variations are common. The structured approach to parameter incorporation may also facilitate dynamic discovery in related fields.
Evaluating quality in synthetic data generation for large tabular health datasets
Generative Models
- Introduces a systematic evaluation framework for synthetic data generation in health datasets.
- Evaluates seven machine learning models on four datasets with varying scales.
- Proposes a methodology for assessing fidelity in synthesized joint distributions.
- Highlights challenges in maintaining medical domain adherence during data synthesis.
Summary
This paper addresses the lack of consensus on metrics for evaluating the quality of synthetic data generated from large health datasets, particularly historical epidemiological data. The authors evaluate seven recent models from major machine learning families across four datasets of varying scales. They systematically tune hyperparameters for each model to ensure fair comparisons and propose a new methodology for assessing the fidelity of synthesized joint distributions. This methodology aligns various metrics with visualizations, making it applicable to any dataset. The study highlights the challenges faced by models in adhering to medical domain requirements and aims to provide a foundational framework for stakeholders involved in synthetic data generation. The analysis emphasizes the importance of quality in synthetic datasets for privacy preservation and data augmentation, particularly in the context of health data.
Methodology
The authors conducted a comparative analysis of seven machine learning models, systematically tuning hyperparameters for each model and dataset. They developed a methodology for evaluating the fidelity of synthesized data, which includes visualizations and a limited set of metrics to rank model performance.
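One simple ingredient of such a fidelity assessment is a per-column distance between the real and synthetic marginals; the sketch below uses total-variation distance over a shared histogram (the paper's full methodology also covers joint distributions and medical-domain constraints):

```python
import numpy as np

def marginal_tv(real, synth, bins=10):
    """Total-variation distance between one real and one synthetic marginal:
    0 for identical histograms, 1 for disjoint support."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(0)
real = rng.normal(50, 10, size=5000)  # e.g. an age-like column
good = rng.normal(50, 10, size=5000)  # faithful generator
bad = rng.normal(65, 4, size=5000)    # mode-shifted generator
print(marginal_tv(real, good), marginal_tv(real, bad))
```

Scores like this can be aligned with the corresponding histogram plots, which is the metric-plus-visualization pairing the proposed methodology emphasizes.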
Results
The study found that the evaluated models faced significant challenges in maintaining fidelity to the medical domain, particularly with complex categorical data. The proposed evaluation methodology provided a clearer framework for assessing the quality of synthetic datasets, revealing variances in model performance across different datasets.
Implications
The findings suggest that the proposed evaluation framework can assist researchers and practitioners in selecting appropriate synthetic data generation methods, ensuring better quality and fidelity in privacy-preserving health data applications. This could enhance the utility of synthetic datasets in research and public health without compromising individual privacy.
Hierarchical Active Inference using Successor Representations
Reinforcement Learning
Robotics
Theory
- Introduces a hierarchical model of active inference that leverages successor representations for efficient planning.
- Demonstrates the ability to learn higher-level abstract states and actions from lower-level representations.
- Validates the approach on multiple planning and reinforcement learning tasks, showing improved efficiency.
- Addresses scalability challenges in active inference by utilizing a state-action hierarchy.
Summary
This paper presents a novel approach to hierarchical active inference, integrating the concept of successor representations to enhance planning in complex environments. Active inference, grounded in the free energy principle (FEP), models agents as continuously engaged in a perception-action loop, aiming to minimize surprise and free energy. The authors propose a hierarchical model that allows for efficient planning by clustering states into macro states and learning macro actions that facilitate navigation between these states. The methodology demonstrates how lower-level successor representations can inform higher-level abstract states and actions, thereby improving planning efficiency. The approach is validated through various tasks, including the four rooms task, key-based navigation, and the Mountain Car problem, showcasing its ability to handle large-scale problems effectively. This work represents a significant advancement in applying learned hierarchical abstractions to active inference, contributing to a deeper understanding of decision-making processes in dynamic environments.
Methodology
The authors developed a hierarchical active inference model that combines a hierarchical representation of the environment with successor representations. This involves clustering the successor matrix to identify macro states and learning macro actions for navigation. The model allows for planning at both lower and higher levels of abstraction, significantly reducing computational costs associated with traditional active inference methods.
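The successor representation itself is easy to compute for a fixed policy, and its rows already expose the room structure that clustering picks up. A hand-built "two rooms" example (illustrative only; the paper clusters a learned SR):

```python
import numpy as np

# Two "rooms" {0,1,2} and {3,4,5} joined by a single doorway edge 2-3,
# under a uniform random-walk policy.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
P = np.zeros((6, 6))
for s, nbrs in adj.items():
    P[s, nbrs] = 1.0 / len(nbrs)

# Successor representation under a fixed policy: M = (I - gamma*P)^-1,
# so M[s, t] is the expected discounted number of visits to t from s.
gamma = 0.95
M = np.linalg.inv(np.eye(6) - gamma * P)

# Rows are more similar within a room than across the doorway, which is
# why clustering the SR recovers macro states (rooms), with doorway
# transitions emerging as natural macro actions between them.
within = np.linalg.norm(M[0] - M[1])
across = np.linalg.norm(M[0] - M[4])
print(within < across)
```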
Results
The proposed model was tested on several tasks, including a variant of the four rooms task, key-based navigation, and the Mountain Car problem. Results indicated that the hierarchical approach facilitated efficient planning and learning of abstract actions, demonstrating improved performance over traditional methods in complex environments.
Implications
This research has potential implications for enhancing decision-making processes in artificial agents, particularly in robotics and reinforcement learning. The hierarchical active inference framework could lead to more efficient algorithms for navigating complex environments, thereby improving the applicability of active inference in real-world scenarios.
How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
NLP
Large Language Models
Theory
- Conjectured product depth lower bound for k-hop reasoning in Transformers.
- Establishment of a bandwidth barrier that limits depth lower bounds in high-precision settings.
- Two-regime error analysis revealing significant differences in performance between adaptive and oblivious cache strategies.
- Identification of an open problem related to closing the gap between conjectured and proven bounds.
Summary
This paper investigates the memory bottleneck posed by the key-value (KV) cache during Transformer inference, particularly focusing on how much cache is necessary for effective multi-step reasoning. The study is framed around k-hop pointer chasing under a shared KV cache, analyzing parameters such as cache size, attention dimensions, and precision. The author presents three main results: (1) a conjectured product depth lower bound for Transformers, which suggests that the depth required for k-hop reasoning scales with both cache reachability and per-window bandwidth; (2) a bandwidth barrier theorem indicating that depth lower bounds based on per-window distinguishability cannot exceed a certain threshold in high-precision regimes; and (3) a detailed error analysis showing distinct behaviors of success probabilities under adaptive versus oblivious cache selections. The findings highlight the limitations of current proof techniques in establishing depth requirements and emphasize the need for new methodologies to close existing gaps in theoretical understanding.
Methodology
The paper employs theoretical analysis of k-hop pointer chasing under a shared KV cache, utilizing concepts from probability and combinatorial arguments to derive bounds on depth requirements. It also introduces a bandwidth barrier theorem and conducts a two-regime error analysis to differentiate between adaptive and oblivious cache strategies.
Results
The main results include a conjectured product depth lower bound (L = Ω(⌈k/s⌉·⌈log2 n/(Hmp)⌉)), an unconditional max-bound (L = O(min(k,⌈k/s⌉·⌈log2(2s)⌉)·⌈log2 n/(mp)⌉)), and a bandwidth barrier theorem indicating that depth lower bounds cannot exceed ⌈k/s⌉ when Hmp ≥ log2 n. Additionally, the paper presents a two-regime characterization of success probabilities under different cache strategies, highlighting the advantages of adaptive caches.
Implications
The findings suggest that effective cache management is crucial for maintaining reasoning capabilities in large language models, especially as context windows expand. The results may influence future research on cache compression techniques and their impact on model performance, particularly in multi-hop reasoning tasks.
PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs
Theory
Optimization
Efficient ML
- PINNACLE is an open-source framework that integrates classical and quantum PINNs.
- The framework supports advanced training strategies and multi-GPU acceleration.
- A comprehensive benchmark study quantifies the impact of various architectural and training enhancements.
- The results highlight the sensitivity of PINNs to design choices and their computational costs compared to classical methods.
Summary
The paper introduces PINNACLE, an open-source computational framework designed for physics-informed neural networks (PINNs). This framework integrates advanced training strategies, multi-GPU acceleration, and hybrid quantum-classical architectures into a cohesive modular workflow. It facilitates systematic evaluation of PINN performance across various benchmark problems, including 1D hyperbolic conservation laws, incompressible flows, and electromagnetic wave propagation. The framework supports numerous architectural and training enhancements, such as Fourier feature embeddings, random weight factorization, strict boundary condition enforcement, adaptive loss balancing, curriculum training, and second-order optimization strategies, while allowing for extensibility to additional methods. A comprehensive benchmark study is presented, quantifying the effects of these enhancements on convergence, accuracy, and computational cost, alongside an analysis of distributed data parallel scaling regarding runtime and memory efficiency. The framework also extends to hybrid quantum-classical PINNs, providing a formal estimate for circuit-evaluation complexity under parameter-shift differentiation. The results underscore the sensitivity of PINNs to architectural and training choices, confirm their higher computational costs compared to classical solvers, and identify scenarios where hybrid quantum models demonstrate improved parameter efficiency. Overall, PINNACLE serves as a foundational tool for benchmarking physics-informed learning methods and guiding future advancements through quantitative assessments of their trade-offs.
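Fourier feature embeddings, one of the enhancements listed above, map low-dimensional inputs through random sinusoidal projections before they reach the network, which helps PINNs fit high-frequency solution components. The sketch below shows the generic technique in NumPy; the function name and the bandwidth scale are illustrative, not PINNACLE's actual code:

```python
import numpy as np

def fourier_features(x, B):
    """Project inputs x of shape (N, d) through a fixed random matrix B of
    shape (d, m), returning [sin, cos] features of shape (N, 2m).
    Generic sketch of the technique; PINNACLE's implementation may differ."""
    proj = 2.0 * np.pi * x @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

rng = np.random.default_rng(0)
B = rng.normal(scale=10.0, size=(1, 64))   # scale controls the frequency bandwidth
x = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
phi = fourier_features(x, B)
print(phi.shape)  # (100, 128)
```

The embedded features then serve as input to an ordinary MLP; the bandwidth scale is a hyperparameter typically tuned per problem.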
Methodology
The authors developed PINNACLE by integrating various training strategies and architectural enhancements into a modular framework. They conducted a benchmark study across multiple physics problems to evaluate the performance of PINNs, analyzing the effects of different methods on convergence and computational efficiency. The framework also incorporates multi-GPU support and extends to hybrid quantum-classical architectures.
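The circuit-evaluation estimate for parameter-shift differentiation mentioned in the summary follows a simple accounting: the standard parameter-shift rule costs two shifted circuit evaluations per trainable parameter per gradient. The function below is an illustrative count under that assumption, not the paper's exact formula:

```python
def parameter_shift_evaluations(num_params, num_points, num_steps):
    """Total circuit evaluations for training, assuming the standard
    parameter-shift rule (2 shifted evaluations per parameter per gradient),
    with one gradient per collocation point per optimization step.
    Illustrative accounting only; the paper's formal estimate may differ."""
    evals_per_gradient = 2 * num_params
    return evals_per_gradient * num_points * num_steps

# Hypothetical run: 20 circuit parameters, 256 collocation points, 1000 steps.
print(parameter_shift_evaluations(20, 256, 1000))  # 10240000
```

The linear growth in parameter count is one reason hybrid quantum PINNs are evaluated on parameter efficiency rather than raw speed.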
Results
The benchmark results demonstrated that the architectural and training enhancements significantly influence the convergence and accuracy of PINNs. The study confirmed that while PINNs generally incur higher computational costs than classical solvers, there are specific conditions under which hybrid quantum models can achieve better parameter efficiency.
Implications
PINNACLE provides a robust platform for researchers and practitioners to benchmark and develop physics-informed learning methods. Its modular design allows for easy integration of new techniques, potentially accelerating advancements in both classical and quantum PINNs. The findings may influence future research directions in computational physics and machine learning applications.
Neural Garbage Collection: Learning to Forget while Learning to Reason
NLP
Large Language Models
Reinforcement Learning
Efficient ML
- Introduces Neural Garbage Collection (NGC) for efficient KV cache management in language models.
- Enables end-to-end learning of memory management and reasoning through reinforcement learning.
- Compresses peak KV cache size by 2-3x while maintaining strong accuracy.
- Eliminates the need for supervised fine-tuning or proxy objectives in training.
Neural Garbage Collection: Learning to Forget while Learning to Reason
Summary
The paper introduces Neural Garbage Collection (NGC), a novel approach that allows language models to learn to forget while reasoning, addressing the limitations of traditional KV cache management in chain-of-thought reasoning. Current methods rely on fixed heuristics for cache management, which can hinder model efficiency and scalability. NGC enables end-to-end learning by allowing the model to decide which KV cache entries to evict during reasoning, optimizing both memory management and reasoning capabilities through reinforcement learning. The model is trained solely on outcome-based task rewards, eliminating the need for supervised fine-tuning or auxiliary objectives. Experimental results demonstrate that NGC achieves significant reductions in peak KV cache size (2-3x compression) while maintaining strong accuracy on tasks such as Countdown, AMC, and AIME, outperforming existing eviction baselines. This work suggests a paradigm shift where efficiency can be treated as a model capability, paving the way for more scalable and capable language models.
Methodology
The methodology involves training a language model to manage its KV cache by periodically deciding which entries to evict during reasoning. This is achieved through reinforcement learning, where both reasoning and cache eviction decisions are treated as discrete actions sampled from the model. The training is based solely on outcome-based task rewards, allowing for joint optimization of reasoning and memory management.
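Treating eviction as a sampled discrete action, as described above, can be sketched as follows; the function name, interface, and sigmoid parameterization are assumptions for illustration, not NGC's actual API:

```python
import numpy as np

def sample_evictions(eviction_logits, budget, rng):
    """Sample which KV-cache entries to drop, weighted by per-entry eviction
    probabilities, until the cache fits the budget. Returns indices of
    surviving entries. Illustrative sketch, not NGC's actual implementation."""
    n = len(eviction_logits)
    n_evict = max(0, n - budget)
    if n_evict == 0:
        return np.arange(n)
    probs = 1.0 / (1.0 + np.exp(-np.asarray(eviction_logits)))  # sigmoid
    drop = rng.choice(n, size=n_evict, replace=False, p=probs / probs.sum())
    return np.setdiff1d(np.arange(n), drop)

rng = np.random.default_rng(0)
keep = sample_evictions(np.zeros(10), budget=6, rng=rng)
print(len(keep))  # 6
```

In an RL setup like the one described, the sampled eviction indices would be logged alongside token actions so that the outcome-based reward can propagate to both.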
Results
NGC maintains strong accuracy on various tasks while achieving significant reductions in peak KV cache size, with 2-3x compression observed. It outperforms baseline eviction methods, demonstrating the effectiveness of the proposed approach in managing memory during reasoning.
Implications
The findings suggest that language models can be designed to optimize both reasoning and memory management simultaneously, leading to more efficient and capable models. This approach could enhance the scalability of language models, enabling them to tackle more complex tasks without being constrained by memory limitations.
ProtoTTA: Prototype-Guided Test-Time Adaptation
Computer Vision
NLP
Interpretability
- ProtoTTA enhances robustness of prototype-based models during distribution shifts.
- The framework minimizes entropy of prototype-similarity distributions for confident activations.
- Geometric filtering is used to stabilize updates by focusing on reliable samples.
- Experiments show improved performance across diverse benchmarks compared to standard methods.
ProtoTTA: Prototype-Guided Test-Time Adaptation
Summary
The paper introduces ProtoTTA, a novel framework designed for prototype-based deep learning models to enhance their robustness during test-time adaptation (TTA). Traditional prototype models, while interpretable, struggle with distribution shifts that can degrade performance. ProtoTTA leverages intermediate prototype signals to minimize the entropy of prototype-similarity distributions, encouraging confident activations on shifted data. The framework employs geometric filtering to stabilize updates by focusing on samples with reliable prototype activations, regulated by prototype-importance weights and model-confidence scores. Experiments across various benchmarks in fine-grained vision, histopathology, and NLP demonstrate that ProtoTTA outperforms standard entropy minimization methods, restoring semantic focus in prototype activations. Additionally, the authors introduce new interpretability metrics and a vision-language model (VLM) evaluation framework, confirming that ProtoTTA aligns with human reasoning and enhances the interpretability of TTA dynamics.
Methodology
ProtoTTA employs a framework that minimizes the entropy of prototype activations to promote confident and specific prototype matching. It uses geometric filtering to select reliable samples for updates, ensuring that only those with strong prototype activations are considered. The method integrates prototype-importance weights and model-confidence scores to maintain stability during adaptation.
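The entropy objective and filtering step described above can be sketched in a few lines; the entropy threshold and function names below are illustrative simplifications, not ProtoTTA's actual loss or geometric filtering criterion:

```python
import numpy as np

def prototype_entropy(similarities):
    """Entropy of the softmax over each sample's prototype similarities.
    Low entropy means one prototype dominates, i.e. a confident activation."""
    shifted = similarities - similarities.max(axis=1, keepdims=True)
    p = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def select_reliable(similarities, threshold):
    """Simplified stand-in for geometric filtering: keep only samples whose
    prototype-similarity entropy falls below a threshold."""
    return np.where(prototype_entropy(similarities) < threshold)[0]

sims = np.array([[10.0, 0.0, 0.0],    # one dominant prototype -> low entropy
                 [0.1, 0.1, 0.1]])    # near-uniform similarities -> high entropy
print(select_reliable(sims, threshold=0.5))  # [0]
```

Minimizing the entropy of the selected samples' distributions would then push shifted inputs toward confident, specific prototype matches, which is the adaptation signal the paper describes.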
Results
The experimental results indicate that ProtoTTA significantly improves robustness over traditional output entropy minimization techniques. It successfully restores semantic focus in prototype activations, leading to better performance across multiple datasets in fine-grained vision, histopathology, and NLP tasks.
Implications
ProtoTTA has potential applications in critical domains such as healthcare, where model interpretability and robustness are essential. By enhancing the adaptability of prototype-based models, it can improve decision-making processes in high-stakes environments.