gistml

By James Asher

Daily summaries of the latest Machine Learning research papers from Arxiv.

2025-12-06 • Found 24 papers

ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

Kerry Luo, Michael Fu, Joshua Peguero, Husnain Malik, Anvay Patil, Joyce Lin, Megan Van Overborg, Ryan Sarmiento, Kevin Zhu
  • ASCIIBench is the first publicly available benchmark for evaluating LLMs on ASCII art classification and generation tasks.
  • The dataset consists of 5,315 curated ASCII images across 752 classes, emphasizing structural and positional reasoning challenges.
  • Fine-tuned CLIP embeddings were used to evaluate generation fidelity and representation quality, revealing bottlenecks in multimodal embeddings.
  • Vision-only models consistently outperformed text-only and text+vision models in classification tasks, with GPT-4o achieving the highest macro accuracy of 82.2%.
  • The study positions ASCII art as a unique stress test for multimodal representations, motivating the development of tailored embedding methods.
Abstract
This paper introduces ASCIIBench, a novel benchmark designed to evaluate large language models (LLMs) on tasks involving ASCII art, a symbolic medium that combines text and visual structure. The benchmark includes a curated dataset of 5,315 class-labeled ASCII images across 752 categories, sourced ethically from ascii.co.uk. The study explores both classification and generation tasks, leveraging fine-tuned CLIP embeddings to assess the fidelity of LLM-generated ASCII art. Results reveal that current multimodal models struggle with precise spatial and positional reasoning required for ASCII art understanding, with vision-only models outperforming text-only and text+vision approaches. The paper highlights limitations in representation quality, particularly in separating low-variance classes, and suggests ASCII art as a stress test for multimodal embeddings. All resources, including the dataset and fine-tuned CLIP weights, are publicly available to encourage further research in symbolic visual modalities.
Methodology
The authors curated a high-quality dataset of ASCII art, removing noise such as signatures and tags. They evaluated multiple LLMs (e.g., GPT-3.5, GPT-4, GPT-5-mini, Claude-3.5 Sonnet) on classification and generation tasks using text-only, vision-only, and text+vision modalities. Fine-tuned CLIP embeddings were employed to assess generation fidelity and representation quality, using metrics like cosine similarity, alignment, and uniformity. Classification accuracy was measured using macro and micro accuracy, while generation fidelity was evaluated via ROC-AUC scores.
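The evaluation pipeline is only described at a high level here; the sketch below shows one way generation fidelity could be scored with CLIP cosine similarity, by rendering a generated ASCII string to an image and comparing its embedding against reference images of the target class. The render_ascii helper, the base openai/clip-vit-base-patch32 checkpoint, and the averaging over references are illustrative assumptions rather than the authors' exact setup (the paper uses fine-tuned CLIP weights).

import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

# Base CLIP weights as a stand-in; the paper fine-tunes CLIP on ASCII data.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def render_ascii(art: str, size=(224, 224)) -> Image.Image:
    """Rasterize an ASCII string onto a white canvas (simple monospace rendering assumed)."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).multiline_text((4, 4), art, fill="black")
    return img

@torch.no_grad()
def clip_embed(images):
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def generation_fidelity(generated_art: str, reference_images) -> float:
    """Mean cosine similarity between a generated ASCII render and reference class images."""
    gen = clip_embed([render_ascii(generated_art)])   # (1, d)
    refs = clip_embed(reference_images)               # (n, d)
    return (gen @ refs.T).mean().item()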
Results
Vision-only models consistently outperformed text-only and text+vision models in classification tasks, with GPT-4o achieving the highest macro accuracy of 82.2%. Fine-tuned CLIP embeddings improved alignment and uniformity metrics, confirming representation quality but highlighting challenges in separating low-variance classes. Generation fidelity, measured via CLIP cosine similarity, showed stable performance but revealed limitations in embedding space representation.
Implications
The findings underscore the need for improved multimodal embedding methods tailored to symbolic visual modalities like ASCII art. ASCIIBench provides a valuable resource for testing and developing models capable of precise spatial and positional reasoning. Potential applications include enhancing multimodal AI systems for structured data understanding, such as tables, diagrams, and symbolic art.
View on arXiv

Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems

Zehao Fan, Zhenyu Liu, Yunzhen Liu, Yayue Hou, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu
  • Introduces a context-aware MoE inference system that uses prefill-stage activation statistics to guide expert placement and precision allocation.
  • Dynamically pins hot experts to GPU memory and maps cold experts to CXL-NDP devices, reducing memory transfer overhead.
  • Implements mixed-precision quantization for NDP-resident experts, allocating bitwidths based on prefill-stage importance and quantization loss.
  • Achieves up to 8.7× improvement in decoding throughput compared to state-of-the-art methods, with only a 0.13% average accuracy drop.
  • Demonstrates the potential of CXL-enabled NDP systems for efficient large-scale model inference.
Abstract
This paper addresses the memory and computational bottlenecks in Mixture-of-Experts (MoE) model inference, particularly when expert weights exceed GPU memory capacity. The authors propose a novel context-aware system for MoE inference that leverages Compute Express Link (CXL)-enabled near-data processing (NDP) systems. By dynamically analyzing activation statistics during the prefill stage, the system optimizes expert placement and precision allocation. Hot experts are pinned to GPU memory in full precision, while cold experts are executed on CXL-NDP devices with mixed-precision quantization. This approach minimizes costly weight transfers between GPU and external memory, reduces bandwidth contention, and improves overall system efficiency. The proposed system achieves significant improvements in decoding throughput with minimal accuracy loss, making it a practical solution for scaling large language models.
Methodology
The authors developed a context-aware MoE inference system that collects activation statistics during the prefill stage of each input sequence. These statistics are used to guide expert placement, dynamically assigning hot experts to GPU memory and cold experts to CXL-NDP devices. Additionally, the system employs mixed-precision quantization for NDP-resident experts, using a precomputed quantization loss table to allocate bitwidths (1–4 bits) based on expert importance. The system overlaps GPU and NDP execution to further optimize performance.
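The placement logic is summarized only in prose above, so the following is a toy sketch of how prefill activation counts might drive hot/cold expert placement plus a greedy mixed-precision (1-4 bit) assignment from a precomputed quantization-loss table; the function names, greedy rule, and budget handling are assumptions, not the paper's implementation.

def place_experts(prefill_activation_counts, gpu_capacity, loss_table, bit_budget):
    """prefill_activation_counts: {expert_id: tokens routed during prefill}
    loss_table: {(expert_id, bits): quantization loss}, precomputed offline (assumed format)."""
    ranked = sorted(prefill_activation_counts, key=prefill_activation_counts.get, reverse=True)
    hot = set(ranked[:gpu_capacity])                 # pinned to GPU memory in full precision
    cold = [e for e in ranked if e not in hot]       # executed on CXL-NDP devices

    # Greedy mixed-precision assignment: spend the remaining bit budget where it
    # reduces quantization loss the most.
    bits = {e: 1 for e in cold}                      # start every cold expert at 1 bit
    budget = bit_budget - len(cold)
    while budget > 0:
        best = min(
            (e for e in cold if bits[e] < 4),
            key=lambda e: loss_table[(e, bits[e] + 1)] - loss_table[(e, bits[e])],
            default=None,
        )
        if best is None:
            break
        bits[best] += 1
        budget -= 1
    return hot, bits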
Results
The proposed system achieves up to 8.7× improvement in decoding throughput compared to state-of-the-art methods, while maintaining an average accuracy drop of only 0.13%. This demonstrates the effectiveness of context-aware expert placement and mixed-precision quantization in reducing memory transfer overhead and improving computational efficiency.
Implications
This work has significant implications for scaling large language models in resource-constrained environments. By leveraging CXL-enabled NDP systems and context-aware optimizations, the proposed approach enables efficient inference for memory-intensive models, potentially reducing the cost and energy consumption of deploying large-scale AI systems. It could be applied to various domains, including natural language processing, recommendation systems, and other applications requiring large-scale model inference.
View on arXiv

Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models

Manh Nguyen, Sunil Gupta, Hung Le
  • Introduces Radial Dispersion Score (RDS), a simple and model-agnostic uncertainty estimation metric based on radial dispersion geometry.
  • RDS avoids semantic clustering, internal model states, and calibration, making it applicable to both black-box and open-weight LLMs.
  • A probability-weighted variant, RDSw, incorporates token probabilities for improved performance.
  • RDS naturally supports per-sample uncertainty scoring, enabling applications like best-of-N selection and confidence-based filtering.
  • Achieves state-of-the-art performance in hallucination detection and answer selection across multiple datasets and LLMs.
Abstract
This paper introduces the Radial Dispersion Score (RDS), a novel, simple, and model-agnostic uncertainty estimation metric for large language models (LLMs). Unlike existing methods that rely on semantic clustering, internal model states, or calibration, RDS measures the radial dispersion of sampled generations in an embedding space, providing a clean geometric interpretation of uncertainty. A probability-weighted variant, RDSw, incorporates token probabilities from the LLM when available, further enhancing performance. RDS is parameter-free, scalable, and applicable to both black-box APIs and open-weight models. The method also supports per-sample uncertainty scoring, enabling applications such as best-of-N selection and confidence-based filtering. Across four free-form QA datasets and multiple LLMs, RDS and RDSw achieve state-of-the-art performance in hallucination detection and answer selection, demonstrating robustness and scalability.
Methodology
The authors propose RDS, which measures the total radial dispersion of sampled generations embedded on a unit hypersphere. RDS calculates the ℓ1 distance of each embedding from the empirical centroid. A probability-weighted variant, RDSw, incorporates token-level probabilities when available. The method is model-agnostic, parameter-free, and does not rely on semantic clustering or internal model states. It is evaluated on four free-form QA datasets using multiple LLMs, comparing against nine strong baselines.
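As a concrete reading of the description above, here is a minimal NumPy sketch of RDS (unit-normalize the sampled generations' embeddings, take the empirical centroid, sum ℓ1 distances from it) together with a probability-weighted variant; the exact weighting used for RDSw is an assumption.

import numpy as np

def rds(embeddings: np.ndarray) -> float:
    """embeddings: (n_samples, d) embeddings of n sampled generations for one prompt."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # project onto unit hypersphere
    centroid = z.mean(axis=0)
    return float(np.abs(z - centroid).sum())   # total L1 dispersion: higher -> more uncertain

def rds_weighted(embeddings: np.ndarray, seq_logprobs: np.ndarray) -> float:
    """RDSw sketch: weight each sample by its normalized sequence probability.
    The paper's exact weighting scheme may differ."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = np.exp(seq_logprobs - seq_logprobs.max())
    w = w / w.sum()
    centroid = (w[:, None] * z).sum(axis=0)
    return float((w * np.abs(z - centroid).sum(axis=1)).sum())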
Results
RDS and RDSw deliver state-of-the-art performance in hallucination detection and answer selection tasks. The methods outperform nine baselines, including semantic entropy and geometric methods like EigenScore, across four challenging QA datasets. RDS demonstrates robustness to sample size and embedding choice, while RDSw further improves accuracy when token probabilities are available. The per-sample scoring capability of RDS also enhances best-of-N selection and confidence-based filtering applications.
Implications
The proposed RDS and RDSw metrics provide a robust and scalable solution for uncertainty estimation in LLMs, applicable to both black-box APIs and open-weight models. These methods can improve the reliability of LLM-based systems by enabling better hallucination detection, answer selection, and confidence-based filtering. The simplicity and model-agnostic nature of RDS make it a practical tool for a wide range of applications, including QA systems, content generation, and safety-critical AI deployments.
View on arXiv

Dual-Path Region-Guided Attention Network for Ground Reaction Force and Moment Regression

Xuan Li, Samuel Bello
  • Introduces DP-RGNet, a dual-path attention-based architecture for GRF/GRM regression from high-density plantar pressure data.
  • Incorporates anatomy-informed spatial priors and dynamic positional encodings to improve feature extraction and prediction accuracy.
  • Outperforms baseline models like CNN and CNN-LSTM on both custom and public datasets, achieving state-of-the-art results.
  • Demonstrates generalizability across different sensor systems and datasets, highlighting its robustness for real-world applications.
  • Addresses limitations of traditional force plates and low-density wearable sensors, enabling portable and accurate gait analysis.
Abstract
This paper introduces the Dual-Path Region-Guided Attention Network (DP-RGNet), a novel deep learning architecture designed to estimate three-dimensional ground reaction forces (GRFs) and moments (GRMs) from high-density plantar pressure data. Accurate GRF/GRM estimation is critical for biomechanics research, clinical rehabilitation, and sports science. Traditional methods rely on force plates, which are accurate but limited to controlled lab environments. Wearable sensors, such as instrumented insoles, offer a portable alternative but often suffer from low sensor density and limited spatial resolution. DP-RGNet addresses these challenges by leveraging a dual-path architecture: one path incorporates anatomy-informed spatial priors to organize local feature aggregation, while the other captures global context without explicit regional constraints. The model also integrates spatial and dynamic positional encodings to enhance its ability to model complex spatiotemporal dependencies. Evaluated on both a custom high-density insole dataset and a public walking dataset, DP-RGNet outperformed baseline models, including CNN and CNN-LSTM architectures, achieving significantly lower normalized root mean square error (NRMSE) in GRF/GRM predictions. This work demonstrates the potential of combining anatomy-inspired priors with deep learning for robust, real-world gait analysis.
Methodology
The proposed DP-RGNet architecture consists of two complementary paths: (1) a region-guided path that uses anatomy-informed spatial priors and region mask attention mechanisms to aggregate local features, and (2) a global path that captures unconstrained context from the entire sensor field. The model also integrates spatial positional encoding (based on sensor coordinates) and dynamic positional encoding (based on the center of pressure) to enhance spatiotemporal modeling. The architecture was trained and evaluated on two datasets: a custom high-density insole dataset and a public walking dataset, using normalized root mean square error (NRMSE) as the primary evaluation metric.
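For reference, the evaluation metric can be written out directly; the sketch below assumes NRMSE normalized by the range of the ground-truth signal, which is one common convention and may differ from the paper's normalization.

import numpy as np

def nrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error normalized by the ground-truth range (assumed convention)."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))

# Six-component average over the three GRF and three GRM channels:
# np.mean([nrmse(gt[:, c], pred[:, c]) for c in range(6)])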
Results
DP-RGNet achieved the lowest six-component average NRMSE of 5.78% on the custom insole dataset and 1.42% for vertical GRF on the public dataset, outperforming strong baseline models such as CNN and CNN-LSTM. These results demonstrate the model's ability to accurately predict GRFs and GRMs across different datasets and sensor systems.
Implications
The proposed DP-RGNet has significant implications for biomechanics, clinical rehabilitation, and sports science. By enabling accurate and portable GRF/GRM estimation, it can facilitate real-world gait analysis, improve clinical assessments for neurological and musculoskeletal conditions, and optimize athletic performance. Additionally, the integration of anatomy-inspired priors with deep learning offers a promising direction for future wearable sensor-based applications.
View on arXiv

Evaluating Long-Context Reasoning in LLM-Based WebAgents

Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai
  • Introduced a benchmark to evaluate long-context reasoning in WebAgents using sequentially dependent subtasks and noisy historical trajectories.
  • Observed dramatic performance degradation in LLMs as context length increases, with success rates dropping below 10% in extended contexts.
  • Identified key failure modes, including agents getting stuck in loops and losing track of task objectives during long-context tasks.
  • Proposed an implicit Retrieval-Augmented Generation (iRAG) approach, which improves task success rates but does not fully resolve long-context reasoning challenges.
  • Highlighted the need for more robust architectures to support coherent task execution in realistic, long-term user interaction scenarios.
Abstract
This paper investigates the ability of large language model (LLM)-based WebAgents to reason across long interaction histories in realistic web environments. The authors introduce a novel benchmark designed to evaluate long-context reasoning capabilities through sequentially dependent subtasks, where irrelevant task trajectories are injected to simulate extended interaction histories. Context lengths range from 25,000 to 150,000 tokens, reflecting real-world scenarios. Four popular LLMs—Claude-3.7, GPT-4.1, Llama 4, and o4-mini—are evaluated, revealing significant performance degradation as context length increases. Success rates drop from 40-50% in baseline conditions to below 10% in long-context scenarios. Error analysis identifies key failure modes, including agents getting stuck in loops and losing track of task objectives. The authors propose an implicit Retrieval-Augmented Generation (iRAG) approach that generates task-relevant summaries, yielding modest improvements but failing to address fundamental limitations in long-context reasoning. The findings underscore critical challenges in deploying WebAgents for long-term, personalized user interactions and highlight the need for more robust architectures.
Methodology
The authors developed a benchmark simulating multi-session user interactions by injecting irrelevant task trajectories into the context. Context lengths ranged from 25,000 to 150,000 tokens. Four LLM-based WebAgents were evaluated on sequentially dependent subtasks, and error analysis was conducted to identify reasoning failures. Additionally, an implicit Retrieval-Augmented Generation (iRAG) approach was tested to improve task success rates by summarizing relevant information from noisy contexts.
Results
The evaluation revealed significant performance degradation in long-context scenarios, with success rates dropping from 40-50% in baseline conditions to less than 10% in extended contexts. Agents frequently failed due to looping behavior and loss of task objectives. The iRAG approach provided modest improvements but could not fully address the challenges of long-context reasoning.
Implications
The findings highlight critical limitations in current LLM-based WebAgents for long-term, personalized interactions. This research provides valuable insights for developing more robust architectures capable of maintaining coherent reasoning and task execution across extended contexts. Applications include improving digital assistants for tasks requiring long-term memory and personalization, such as e-commerce, research, and entertainment.
View on arXiv

Exploiting ftrace's function_graph Tracer Features for Machine Learning: A Case Study on Encryption Detection

Kenan Begovic, Abdulaziz Al-Ali, Qutaibah Malluhi
  • Introduced a novel methodology for extracting graph-based features from ftrace's function_graph tracer for ML applications.
  • Achieved 99.28% accuracy in detecting encryption activities using the proposed approach.
  • Validated the method through a multi-label classification task for identifying running programs based on trace data.
  • Created a curated dataset, 'Is It Encrypted?', to support future research in system behavior and security analysis.
  • Highlighted the potential of integrating kernel-level tracing with ML for security-sensitive applications like ransomware detection.
Abstract
This paper explores the use of the Linux kernel's ftrace framework, specifically the function_graph tracer, to generate system-level data for machine learning (ML) applications. The authors focus on a case study for detecting encryption activities, leveraging graph-based features extracted from kernel function call traces. The study introduces a novel methodology for preprocessing raw trace data and extracting features such as function call sequences, invocation durations, and inter-function relationships. These features are utilized in ML models to achieve high accuracy in detecting cryptographic operations. The paper also validates the approach through a secondary experiment involving multi-label classification to identify running programs based on trace data. The results demonstrate the effectiveness of the proposed method, achieving an accuracy of 99.28% in encryption detection. This work bridges the gap between system tracing and ML, offering new possibilities for security analytics, performance monitoring, and anomaly detection.
Methodology
The authors utilized the Linux kernel's ftrace function_graph tracer to capture system-level function call traces. These traces were preprocessed to extract graph-based features such as centrality measures, call durations, and inter-function relationships. The processed data was then used to train ML models for encryption detection and program identification tasks. The methodology included feature engineering, dataset creation, and evaluation across multiple learning algorithms.
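The feature set is only named above, so the following sketch illustrates the general idea of turning parsed function_graph call records into graph-based features (degree centrality, call counts, mean durations) with networkx; the parsing format, field names, and feature choice are assumptions rather than the authors' pipeline.

import networkx as nx
from collections import defaultdict

def build_features(calls):
    """calls: iterable of (caller, callee, duration_us) tuples parsed from a function_graph trace."""
    g = nx.DiGraph()
    durations = defaultdict(list)
    for caller, callee, dur in calls:
        g.add_edge(caller, callee)
        durations[callee].append(dur)

    centrality = nx.degree_centrality(g)
    features = {}
    for fn in g.nodes:
        durs = durations.get(fn, [0.0])
        features[fn] = {
            "degree_centrality": centrality[fn],
            "mean_duration_us": sum(durs) / len(durs),
            "call_count": len(durations.get(fn, [])),
        }
    return features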
Results
The proposed approach achieved a high accuracy of 99.28% in detecting cryptographic operations. Additionally, the method was validated in a multi-label classification task, demonstrating its capability to identify running programs based on trace data. These results underscore the effectiveness of using graph-based features derived from the function_graph tracer for ML applications.
Implications
This work opens new avenues for applying ML to system-level tracing data, with potential applications in security analytics, such as ransomware detection and anomaly detection, as well as performance monitoring and application identification. The curated dataset and methodologies provide a foundation for future research in integrating kernel analytics with machine learning.
View on arXiv

Feature Engineering vs. Deep Learning for Automated Coin Grading: A Comparative Study on Saint-Gaudens Double Eagles

Tanmay Dogra, Eric Ngo, Mohammad Alam, Jean-Paul Talavera, Asim Dahal
  • Feature engineering with domain-specific knowledge outperforms deep learning in tasks with limited and imbalanced datasets.
  • The feature-based ANN achieved 86% exact grade matches and 98% accuracy within a 3-grade tolerance, significantly outperforming CNN and SVM models.
  • The hybrid CNN approach, despite leveraging transfer learning and engineered features, struggled with class imbalance and failed to predict specific grades accurately.
  • The study underscores the importance of incorporating expert knowledge into feature extraction for niche applications like numismatic grading.
  • This approach has broader implications for quality assessment tasks in domains with scarce labeled data.
Abstract
This paper investigates the effectiveness of feature engineering versus deep learning for automated grading of Saint-Gaudens Double Eagle gold coins on the Sheldon scale. The authors compare a feature-based Artificial Neural Network (ANN), a hybrid Convolutional Neural Network (CNN) leveraging EfficientNetV2, and a Support Vector Machine (SVM) baseline. Using a dataset of 1,785 coins with severe class imbalance, the study demonstrates that the feature-based ANN achieves superior performance, with 86% exact grade matches and 98% accuracy within a 3-grade tolerance. In contrast, the CNN and SVM models perform poorly, achieving only 31% and 30% exact matches, respectively. The feature extraction pipeline incorporates Sobel edge detection, HSV color analysis, and wedge-based spatial analysis to encode domain-specific numismatic expertise. The findings highlight the limitations of deep learning in specialized tasks with limited data and emphasize the importance of domain knowledge in feature design for quality assessment applications.
Methodology
The study utilized two approaches: (1) a feature-based pipeline extracting 192 custom features using Sobel edge detection, HSV color analysis, and wedge-based spatial analysis, followed by training an ANN and SVM; (2) a hybrid CNN combining EfficientNetV2 for raw image processing with engineered features. The dataset consisted of 1,785 coins graded on the Sheldon scale, with images captured under standardized conditions. SMOTE was applied to address class imbalance during training.
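To make the handcrafted pipeline concrete, here is a rough OpenCV sketch of the kind of features described: Sobel edge strength and HSV color statistics aggregated over angular wedges around the coin center. The wedge count, the specific statistics, and all thresholds are illustrative assumptions; the paper's 192-feature pipeline is not reproduced here.

import cv2
import numpy as np

def coin_features(bgr: np.ndarray, n_wedges: int = 8) -> np.ndarray:
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    edges = np.hypot(cv2.Sobel(gray, cv2.CV_64F, 1, 0), cv2.Sobel(gray, cv2.CV_64F, 0, 1))

    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    angles = np.arctan2(yy - h / 2, xx - w / 2)            # angle of each pixel about the image center
    wedge_ids = ((angles + np.pi) / (2 * np.pi) * n_wedges).astype(int) % n_wedges

    feats = []
    for wid in range(n_wedges):
        mask = wedge_ids == wid
        feats.append(edges[mask].mean())                   # edge strength per wedge (detail/wear proxy)
        feats.extend(hsv[mask].mean(axis=0))               # mean H, S, V per wedge
    return np.asarray(feats, dtype=np.float32)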
Results
The feature-based ANN achieved 86% exact grade matches and 98% accuracy within a 3-grade tolerance, outperforming the CNN (31% exact matches) and SVM (30% exact matches). The CNN performed better on broader tolerance metrics but failed to predict specific grades accurately due to regression averaging effects. The study highlights the robustness of feature engineering in handling class imbalance and limited data.
Implications
The findings suggest that feature engineering is a viable alternative to deep learning for specialized tasks with limited data, such as automated coin grading. This approach can be extended to other quality assessment applications in domains where expert knowledge is critical and data scarcity is a challenge. It also raises questions about the suitability of deep learning for niche applications with highly imbalanced datasets.
View on arXiv

Federated Learning for Anomaly Detection in Maritime Movement Data

Anita Graser, Axel Weißenfeld, Clemens Heistracher, Melitta Dragaschnig, Peter Widhalm
  • Introduces M³fed, a federated learning model for anomaly detection in maritime movement data.
  • Builds on the centralized M³ model, which uses Gaussian Mixture Models for spatially explicit anomaly detection.
  • Implements M³fed using the Flower framework, enabling decentralized training with privacy preservation.
  • Demonstrates comparable anomaly detection performance to centralized models while reducing communication costs.
  • Addresses the lack of federated learning applications in spatial and mobility data science, particularly in the maritime domain.
Abstract
This paper introduces M³fed, a novel federated learning (FL) model designed for anomaly detection in maritime movement data. The model builds upon the centralized M³ (Massive Movement Model), which uses a spatially explicit Gaussian Mixture Model (GMM) to detect anomalies in movement patterns. M³fed adapts this approach to a federated learning framework, enabling decentralized training across multiple clients while preserving data privacy and reducing communication costs. The authors implement M³fed using the Flower FL framework and evaluate its performance on maritime Automatic Identification System (AIS) data. The study compares the federated model to the centralized M³ model in terms of anomaly detection accuracy and communication efficiency. Results demonstrate that M³fed achieves comparable anomaly detection performance to the centralized model while significantly reducing communication overhead, making it a viable solution for privacy-sensitive and bandwidth-constrained environments.
Methodology
The authors adapt the centralized M³ model to a federated learning framework using the Flower library. M³fed divides geographic space into a grid and models movement patterns in each grid cell using Gaussian Mixture Models. Federated training involves local model updates on client devices, which are aggregated by a central server to create a global model. Anomalies are detected by comparing new movement data to the trained model using statistical tests. The study evaluates M³fed on maritime AIS data, comparing its performance to the centralized M³ model in terms of anomaly detection accuracy and communication efficiency.
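The per-cell modeling step can be sketched directly; below is a conceptual, non-federated version of grid-cell GMM anomaly scoring in the spirit of M³ (the grid resolution, the speed/course feature choice, and the unseen-cell handling are assumptions). In M³fed, the locally fitted parameters would additionally be aggregated by a Flower server, which is not shown.

import numpy as np
from sklearn.mixture import GaussianMixture

class GridGMM:
    def __init__(self, cell_size_deg=0.1, n_components=3):
        self.cell_size = cell_size_deg
        self.n_components = n_components
        self.models = {}

    def _cell(self, lat, lon):
        return (int(lat / self.cell_size), int(lon / self.cell_size))

    def fit(self, records):
        """records: array of (lat, lon, speed, course) rows from local AIS data."""
        by_cell = {}
        for row in records:
            by_cell.setdefault(self._cell(row[0], row[1]), []).append(row[2:])
        for cell, feats in by_cell.items():
            gmm = GaussianMixture(n_components=min(self.n_components, len(feats)))
            gmm.fit(np.asarray(feats))
            self.models[cell] = gmm

    def anomaly_score(self, lat, lon, speed, course):
        gmm = self.models.get(self._cell(lat, lon))
        if gmm is None:
            return np.inf                                    # movement in an unseen cell is maximally surprising
        return -gmm.score_samples([[speed, course]])[0]      # negative log-likelihood as anomaly score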
Results
M³fed achieves anomaly detection performance comparable to the centralized M³ model while significantly reducing communication costs. The federated approach preserves data privacy by keeping training data on local devices and only sharing model updates with the central server. The results highlight the potential of federated learning for privacy-sensitive and bandwidth-constrained applications in the maritime domain.
Implications
M³fed provides a scalable and privacy-preserving solution for anomaly detection in maritime movement data, addressing challenges related to data sensitivity and communication constraints. The approach can be extended to other domains involving mobility and spatial data, such as transportation, logistics, and environmental monitoring. By reducing communication costs, M³fed also enables deployment in resource-constrained environments, such as remote or maritime settings with limited connectivity.
View on arXiv

GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers

Malyaban Bal, Abhronil Sengupta
  • GRASP introduces a grouped parameterization approach, reducing trainable parameters from O(n × D) to O(n × K), where K ≪ D.
  • The method achieves competitive or superior performance compared to established PEFT techniques like LoRA and BitFit on GLUE and E2E NLG tasks.
  • StochGRASP extends GRASP by learning Gaussian distributions for parameter updates, improving robustness to hardware-induced noise.
  • GRASP's grouped modulation preserves task-specific expressivity despite aggressive parameter compression.
  • The framework is particularly suited for deployment on energy-efficient, noise-prone edge AI hardware.
Abstract
This paper introduces GRASP (GRouped Activation Shared Parameterization), a novel parameter-efficient fine-tuning (PEFT) framework for large pre-trained transformers. GRASP reduces the number of trainable parameters by partitioning the D-dimensional hidden representations of selected layers into K groups (where K ≪ D) and learning shared scaling and shifting parameters for each group. This approach significantly reduces computational overhead while maintaining task-specific expressivity. The authors also propose StochGRASP, a stochastic extension of GRASP that learns Gaussian distributions for parameter updates instead of deterministic values. This probabilistic formulation, combined with a noise-aware loss function, enhances robustness under hardware-induced noise, making it suitable for deployment on energy-efficient, noise-prone edge AI hardware. GRASP achieves competitive or superior performance compared to existing PEFT methods like LoRA and BitFit on benchmark tasks (GLUE and E2E NLG) while requiring orders-of-magnitude fewer trainable parameters. StochGRASP further demonstrates improved robustness under varying noise conditions, addressing challenges in deploying transformers on emerging hardware platforms.
Methodology
GRASP partitions the D-dimensional hidden representations of transformer layers into K groups and learns shared scaling and shifting parameters for each group, drastically reducing the number of trainable parameters. StochGRASP extends this by learning Gaussian distributions for these parameters, enabling robust inference under hardware noise. A noise-aware loss function is introduced to regularize the learned standard deviations toward a target noise profile. The methods are evaluated on GLUE (RoBERTa-base/large) and E2E NLG (GPT-2 Medium) tasks, comparing performance and parameter efficiency against existing PEFT methods.
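The core parameterization is simple enough to write down; the PyTorch sketch below applies one shared scale and shift per group of hidden dimensions, so a layer contributes 2K trainable parameters instead of 2D. Contiguous grouping and the module interface are assumptions; the paper's grouping scheme and insertion points may differ.

import torch
import torch.nn as nn

class GroupedScaleShift(nn.Module):
    def __init__(self, d_model: int, n_groups: int):
        super().__init__()
        assert d_model % n_groups == 0
        self.group_size = d_model // n_groups
        self.scale = nn.Parameter(torch.ones(n_groups))    # 2K trainable parameters instead of 2D
        self.shift = nn.Parameter(torch.zeros(n_groups))

    def forward(self, h: torch.Tensor) -> torch.Tensor:    # h: (..., d_model)
        scale = self.scale.repeat_interleave(self.group_size)
        shift = self.shift.repeat_interleave(self.group_size)
        return h * scale + shift

In use, the pre-trained transformer weights would be frozen and only these grouped parameters, inserted at selected layers, would be trained; StochGRASP would replace the point estimates with learned Gaussian parameters.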
Results
GRASP achieves comparable or superior performance to state-of-the-art PEFT methods like LoRA and BitFit while using significantly fewer trainable parameters. On GLUE and E2E NLG tasks, GRASP demonstrates competitive accuracy with an order-of-magnitude reduction in parameter count. StochGRASP further improves robustness under hardware noise, outperforming deterministic variants in noisy inference scenarios.
Implications
The proposed GRASP and StochGRASP frameworks enable efficient fine-tuning of large transformers with minimal computational and memory overhead, making them highly suitable for resource-constrained environments. StochGRASP's robustness to hardware noise positions it as a promising solution for deploying transformers on emerging energy-efficient AI hardware, such as non-volatile memory-based accelerators, where noise and variability are significant challenges.
View on arXiv

Gradient Descent with Provably Tuned Learning-rate Schedules

Dravyansh Sharma
  • Develops a framework for tuning learning rates in gradient descent that applies to non-convex and non-smooth functions, including neural networks.
  • Achieves sample complexity bounds of Õ(H^3/ε^2) for learning rate tuning, matching prior results for convex functions up to logarithmic factors.
  • Extends the methodology to tune learning rate schedules, momentum parameters, and initialization vectors.
  • Provides guarantees for both convergence speed and the quality of the final iterate in gradient-based optimization.
  • Demonstrates applicability to widely used optimizers like momentum-based methods (e.g., Adam).
Abstract
This paper addresses the challenge of tuning hyperparameters, particularly the learning rate, in gradient-based optimization methods, which are foundational to machine learning. Unlike prior work that relies on strong assumptions such as convexity and smoothness of the objective function, this work develops a framework that extends to non-convex and non-smooth functions, including neural networks with commonly used activation functions like ReLU, sigmoid, and tanh. The author provides theoretical guarantees for the sample complexity of learning the step size in gradient descent, achieving bounds comparable to those for convex functions but applicable to a broader class of functions. The framework is further extended to tune multiple hyperparameters, such as learning rate schedules, momentum, and initialization parameters, and is shown to optimize both convergence speed and the quality of the final iterate. This work bridges the gap between theoretical guarantees and practical applicability in gradient-based optimization, particularly for deep learning tasks.
Methodology
The paper builds on the data-driven algorithm design paradigm, analyzing the pseudo-dimension of the function class associated with gradient descent step-size tuning. The author relaxes traditional assumptions like convexity and smoothness, focusing instead on piecewise-polynomial and piecewise-Pfaffian functions, which encompass common neural network activation functions. The framework is extended to handle multiple hyperparameters and non-convergence scenarios by assigning a fixed cost for non-convergent cases. Theoretical bounds are derived for sample complexity in tuning learning rates, schedules, and initialization parameters.
Results
The author establishes sample complexity bounds of Õ(H^3/ε^2) for tuning the learning rate in non-convex, non-smooth settings, matching prior results for convex functions up to logarithmic factors. For learning rate schedules, the bound is extended to Õ(H^4/ε^2). The framework is shown to generalize across tasks, ensuring small gaps between training and test errors. Additionally, the methodology is demonstrated to optimize both convergence speed and the quality of the final iterate, with applications to momentum-based optimizers like Adam.
Implications
This work has significant implications for deep learning and other machine learning applications where non-convex and non-smooth functions are common. By providing a principled approach to hyperparameter tuning with theoretical guarantees, it can improve the efficiency and performance of gradient-based optimization methods. The framework's applicability to multi-task learning and meta-learning scenarios further enhances its utility in real-world applications, such as neural network training and transfer learning.
View on arXiv

GraphBench: Next-generation graph learning benchmarking

Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris
  • GraphBench addresses the fragmentation in graph learning benchmarks by integrating diverse datasets and tasks into a unified framework.
  • It supports node-, edge-, graph-level, and generative tasks across domains such as social networks, chip design, and weather forecasting.
  • The suite includes standardized evaluation protocols, domain-specific metrics, and OOD generalization tests to ensure robust and reproducible research.
  • GraphBench benchmarks modern GNNs and graph transformers, providing strong baselines and insights into model performance across tasks.
  • The framework emphasizes real-world impact by including datasets and tasks relevant to industry and emerging research areas.
Abstract
GraphBench is a comprehensive benchmarking suite designed to address the limitations of current graph learning benchmarks. Existing benchmarks often focus on narrow, task-specific datasets and inconsistent evaluation protocols, which hinder reproducibility and broader progress in graph machine learning. GraphBench spans diverse domains, including social networks, chip design, combinatorial optimization, and weather forecasting, and supports multiple prediction tasks (node-level, edge-level, graph-level, and generative). It introduces standardized evaluation protocols, including consistent dataset splits, domain-specific metrics, and out-of-distribution (OOD) generalization tests. Additionally, GraphBench provides a unified hyperparameter tuning framework and benchmarks using modern graph neural networks (GNNs) and graph transformers, offering strong baselines and insights into architectural trade-offs. By unifying heterogeneous tasks and datasets under one framework, GraphBench aims to foster reproducible, robust, and impactful research in graph learning, paving the way for the development of graph foundation models.
Methodology
GraphBench introduces a unified benchmarking framework that integrates diverse datasets and tasks across multiple domains. It provides standardized dataset splits, domain-specific evaluation metrics, and scripts for hyperparameter tuning. The framework also includes OOD generalization tests to evaluate model robustness. Modern graph learning architectures, including message-passing neural networks (MPNNs) and graph transformers, are benchmarked to establish strong baselines and analyze architectural trade-offs.
Results
GraphBench demonstrates its utility by benchmarking state-of-the-art GNNs and graph transformers across a wide range of tasks and datasets. The results highlight the strengths and weaknesses of different architectures, providing insights into their performance in diverse scenarios. The inclusion of OOD generalization tests further emphasizes the importance of robust model evaluation.
Implications
GraphBench has the potential to standardize graph learning research, enabling more reproducible and impactful studies. Its focus on real-world datasets and tasks can drive progress in industrial applications such as chip design, weather forecasting, and combinatorial optimization. Additionally, the framework's emphasis on OOD generalization could inspire the development of more robust and generalizable graph learning models.
View on arXiv

Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity

Noa Rubin, Orit Davidovich, Zohar Ringel
  • Introduces a heuristic scaling framework to predict feature learning patterns and sample complexity in deep learning models.
  • Simplifies the analytical complexity of existing theories while reproducing known scaling exponents.
  • Extends theoretical insights to complex architectures, including three-layer non-linear networks and attention mechanisms.
  • Derives lower bounds on sample complexity and predicts the threshold sample size for effective learning.
  • Provides a fully analytic computation of sample complexity using variational approximations and heuristic scaling relations.
Abstract
This paper addresses the challenge of understanding feature learning and sample complexity in deep learning models, particularly in the 'rich regime' where networks exhibit complex feature learning behaviors. The authors propose a novel heuristic framework based on scaling arguments to predict the data and width scales at which feature learning patterns emerge. This approach simplifies the analytical complexity of existing theories, which often rely on computationally intensive numerical solutions or are limited to shallow or linear models. The proposed framework enables the derivation of scaling laws for sample complexity and feature learning, extending theoretical insights to more complex architectures, such as three-layer non-linear networks and attention mechanisms. The authors use a Bayesian perspective, leveraging stochastic gradient Langevin dynamics (SGLD), to derive lower bounds on sample complexity and predict the conditions under which learning becomes feasible. Their method re-derives known results for two-layer networks and expands the scope to previously intractable models, providing insights into layer-wise learning and neuron specialization.
Methodology
The authors adopt a Bayesian framework to analyze feature learning and sample complexity. They derive lower bounds on sample complexity by bounding the test mean squared error (MSE) and leveraging variational approximations. The framework uses heuristic scaling relations to predict how features propagate across layers and employs stochastic gradient Langevin dynamics (SGLD) as a proxy for network training. The approach is validated by re-deriving known results for two-layer networks and extending the analysis to more complex architectures.
Results
The proposed framework successfully reproduces known scaling laws for two-layer networks, including the benefits of rich feature learning and Grokking transitions. It also predicts sample complexity, layer-wise learning dynamics, and neuron specialization in three-layer non-linear networks. The method demonstrates the ability to generalize theoretical insights to more complex architectures, bridging the gap between toy models and real-world deep learning systems.
Implications
This work provides a scalable and analytically tractable approach to understanding feature learning and sample complexity in deep learning. It has potential applications in designing more efficient neural network architectures, optimizing training protocols, and improving interpretability in complex models. The framework could also inform hyperparameter tuning and model selection by offering theoretical predictions for optimal configurations.
View on arXiv

Network of Theseus

Vighnesh Subramaniam, Colin Conwell, Boris Katz, Andrei Barbu, Brian Cheung
  • Introduces Network of Theseus (NoT), a method for part-by-part architectural conversion while preserving performance.
  • Demonstrates successful conversions across diverse architectures, such as CNNs to MLPs and transformers to RNNs.
  • Finds that untrained guide networks can effectively transfer inductive biases, achieving comparable performance to trained guides.
  • Proposes a new similarity metric, differentiable mutual nearest neighbors (D-MNN), for representational alignment.
  • Enables decoupling of training and inference architectures, expanding the design space for efficient deployment.
Abstract
The paper introduces the Network of Theseus (NoT), a novel method for progressively converting one neural network architecture into another while preserving the original network's performance. This approach challenges the traditional assumption in deep learning that the architecture used during training must be the same as the one used during inference. NoT achieves this by incrementally replacing components of a 'guide' network with modules from a 'target' architecture, aligning their representations using similarity metrics. This decoupling of training and inference architectures allows for greater flexibility in selecting architectures optimized for deployment constraints, such as efficiency or hardware compatibility. The authors demonstrate NoT's effectiveness across a variety of architectural transformations, including converting convolutional networks to multilayer perceptrons (MLPs) and transformers to recurrent neural networks (RNNs). Surprisingly, they find that even untrained guide networks can transfer useful inductive biases, achieving performance comparable to trained guides. The method has significant implications for architectural design, enabling exploration of accuracy-efficiency trade-offs and expanding the space of viable inference-time architectures.
Methodology
NoT progressively replaces components of a guide network with modules from a target architecture. At each stage, the new module is optimized to align its representations with the guide network using similarity metrics like centered kernel alignment (CKA) and the newly proposed differentiable mutual nearest neighbors (D-MNN). After the conversion, the target architecture is fine-tuned on the downstream task to ensure performance preservation.
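Since the alignment objective is central to the conversion, here is a short sketch of linear CKA between guide and target activations, which a newly inserted module could be trained to maximize; the paper's D-MNN metric is not reproduced, and the surrounding training loop is an assumption.

import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x, y: (n_samples, d_x) and (n_samples, d_y) activations from guide and target modules."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(y.T @ x, ord="fro") ** 2
    return cross / (torch.linalg.norm(x.T @ x, ord="fro") * torch.linalg.norm(y.T @ y, ord="fro"))

# During conversion, a replaced module would be optimized to maximize linear_cka
# (equivalently, minimize 1 - linear_cka) against the corresponding guide-layer activations.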
Results
The authors demonstrate that NoT can successfully convert between vastly different architectures, such as CNNs to MLPs, vision transformers to token-wise MLPs, and transformers to RNNs, while preserving the original performance. They also show that untrained guide networks can transfer inductive biases effectively. Additionally, NoT enables conversions from larger to smaller architectures, achieving efficient inference without significant performance loss.
Implications
NoT has practical implications for deploying neural networks in resource-constrained environments, as it allows for the use of efficient architectures optimized for specific deployment constraints. It also facilitates exploration of architectural design by decoupling training and inference, enabling researchers to test architectures with desirable properties without retraining from scratch. The method could lead to advancements in model compression, neural architecture search, and the development of novel architectures tailored for specific tasks or hardware.
View on arXiv

On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference

Yue Yu, Qiwei Di, Quanquan Gu, Dongruo Zhou
  • Parallel TTC methods, such as Best-of-N (BoN), are shown to be statistically suboptimal under realistic assumptions about LLM pretraining data.
  • The proposed Reward-Filtered Sequential Best-of-n (RF-SeqBoN) method selectively incorporates high-reward generations into the input context, improving inference quality.
  • RF-SeqBoN achieves better sample complexity guarantees and test-time budget efficiency compared to parallel TTC methods.
  • Extensive experiments across multiple benchmarks demonstrate consistent performance improvements with RF-SeqBoN.
  • Theoretical and empirical results highlight the advantages of sequential TTC over traditional parallel approaches.
Abstract
This paper investigates the limitations of Test-Time Compute (TTC) methods, particularly focusing on the widely used Best-of-N (BoN) sampling strategy, and proposes a novel sequential approach called Reward-Filtered Sequential Best-of-n (RF-SeqBoN). TTC methods aim to enhance the performance of large language models (LLMs) during inference without additional training. The authors analyze the theoretical limitations of parallel TTC methods like BoN and demonstrate that they are suboptimal under realistic modeling assumptions. To address these limitations, the authors introduce RF-SeqBoN, a sequential inference method that selectively incorporates high-reward generations into the input context, progressively refining the model's output distribution. Theoretical analysis shows that RF-SeqBoN achieves better sample complexity and test-time budget efficiency compared to parallel TTC methods. Empirical evaluations across diverse benchmarks and LLMs confirm the practical effectiveness of the proposed method, showing consistent improvements in performance and efficiency.
Methodology
The authors formalize TTC with reward models as a decision problem under a mixture-of-reference-policies model. They derive theoretical lower bounds on the test-time budget required for parallel TTC methods like BoN and develop the RF-SeqBoN algorithm, which iteratively refines the generation distribution by feeding high-reward outputs back into the model's input. Theoretical guarantees are provided for the improved performance of RF-SeqBoN, and its effectiveness is validated through experiments on diverse benchmarks and LLM architectures.
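The sequential loop is easy to schematize: generate candidates, keep the highest-reward ones, and fold them back into the prompt before the next round. In the sketch below, the generate and reward callables, the filtering rule, and the context format are placeholders, not the paper's exact RF-SeqBoN procedure.

def rf_seq_bon(prompt, generate, reward, rounds=3, n_per_round=4, keep_top=1):
    """Reward-filtered sequential sampling (schematic)."""
    context = prompt
    best_answer, best_score = None, float("-inf")
    for _ in range(rounds):
        candidates = [generate(context) for _ in range(n_per_round)]
        scored = sorted(((reward(prompt, c), c) for c in candidates), reverse=True)
        for score, cand in scored[:keep_top]:
            context += f"\n\nA promising previous attempt:\n{cand}"   # feed high-reward output back into the input
            if score > best_score:
                best_score, best_answer = score, cand
    return best_answer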
Results
Theoretical analysis shows that RF-SeqBoN achieves strictly better sample complexity and budget-performance trade-offs compared to parallel TTC methods. Empirical results across various benchmarks and LLMs demonstrate consistent improvements in test-time budget efficiency and task performance, confirming the practical advantages of the proposed method.
Implications
The findings suggest that sequential TTC methods like RF-SeqBoN can significantly enhance the efficiency and effectiveness of LLM inference, particularly in scenarios with limited test-time computational budgets. This approach could be applied to improve performance in real-world applications such as natural language processing, decision-making systems, and other tasks requiring high-quality outputs from LLMs.
View on arXiv

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen, Jiaxing Li, Guoxu Zhou
  • PSCA introduces orthogonal class prototypes to improve class-level semantic alignment and inter-class separability.
  • Semantic consistency alignment adaptively weights pseudo-labels based on geometric proximity and semantic predictions, reducing error propagation.
  • Enhanced feature reconstruction circumvents direct quantization, improving hash coding quality.
  • Domain-specific quantization functions ensure unified binary hash codes across domains under mutual approximation constraints.
  • PSCA achieves state-of-the-art performance in domain adaptive retrieval tasks across diverse datasets.
Abstract
This paper introduces the Prototype-Based Semantic Consistency Alignment (PSCA) framework to address key limitations in domain adaptive retrieval (DAR), which involves transferring knowledge from a labeled source domain to an unlabeled target domain for effective cross-domain retrieval. Existing DAR methods often focus excessively on pair-wise sample alignment, neglecting class-level semantic alignment, and fail to adequately handle pseudo-label reliability or mitigate quantization errors caused by domain shifts. PSCA is a two-stage framework designed to overcome these challenges. In the first stage, orthogonal class prototypes are established in a shared subspace to maximize inter-class separability and gather intra-class samples. Semantic consistency alignment evaluates pseudo-label reliability by combining geometric proximity and semantic predictions, adaptively weighting pseudo-labels to reduce error propagation. The resulting membership matrix and prototypes reconstruct enhanced features, improving hash coding quality. In the second stage, domain-specific quantization functions process reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments demonstrate PSCA's superior performance in retrieval tasks across multiple datasets.
Methodology
The PSCA framework operates in two stages: (1) Prototype learning establishes orthogonal class prototypes in a shared subspace, guided by semantic consistency alignment that evaluates pseudo-label reliability using geometric proximity and semantic predictions. This stage generates a membership matrix and reconstructed features. (2) Domain-specific quantization functions process reconstructed features under mutual approximation constraints, producing unified binary hash codes for cross-domain retrieval.
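Two of the ingredients above can be sketched directly: orthogonal class prototypes (via a QR factorization) and an adaptive pseudo-label weight that combines geometric proximity to the prototypes with classifier confidence. The specific combination rule is an illustrative assumption, not the paper's formulation.

import torch
import torch.nn.functional as F

def orthogonal_prototypes(n_classes: int, dim: int) -> torch.Tensor:
    """Mutually orthogonal class prototypes; assumes dim >= n_classes."""
    q, _ = torch.linalg.qr(torch.randn(dim, n_classes))
    return q.T                                             # (n_classes, dim), orthonormal rows

def pseudo_label_weights(features, prototypes, cls_probs):
    """features: (n, dim) target-domain features; cls_probs: (n, C) semantic predictions."""
    sim = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).T   # geometric proximity
    geo = sim.softmax(dim=1)
    agreement = (geo.argmax(dim=1) == cls_probs.argmax(dim=1)).float()
    confidence = geo.max(dim=1).values * cls_probs.max(dim=1).values
    return agreement * confidence                          # low weight when geometry and semantics disagree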
Results
PSCA outperforms existing domain adaptive retrieval methods in terms of retrieval accuracy and hash coding quality across multiple benchmark datasets. The framework effectively mitigates domain discrepancies and improves semantic alignment, leading to superior retrieval performance.
Implications
PSCA has significant implications for real-world applications such as cross-domain image retrieval in e-commerce platforms, where product images and user queries often differ in distribution. The framework's ability to generate unified hash codes across domains can enhance retrieval efficiency and accuracy in scenarios involving domain shifts.
View on arXiv

QoSDiff: An Implicit Topological Embedding Learning Framework Leveraging Denoising Diffusion and Adversarial Attention for Robust QoS Prediction

Guanchen Du, Jianlong Xu, Wei Wei
  • QoSDiff eliminates the dependency on explicit user-service interaction graphs, addressing scalability and noise issues in traditional graph-based methods.
  • The framework uses a denoising diffusion probabilistic model (DELM) to recover latent structures from noisy and sparse data.
  • An adversarial attention mechanism (AAIM) dynamically identifies high-order user-service interactions while filtering out noise.
  • QoSDiff demonstrates superior robustness to data sparsity and noise, outperforming state-of-the-art baselines in QoS prediction tasks.
  • The framework exhibits strong cross-dataset generalization capabilities, making it suitable for diverse real-world applications.
Abstract
The paper introduces QoSDiff, a novel framework for Quality of Service (QoS) prediction that addresses the limitations of traditional graph-based methods. QoSDiff eliminates the need for explicit user-service interaction graph construction, which is often infeasible in large-scale, sparse, or noisy environments. The framework leverages two key components: a Diffusion-based Embedding Learning Module (DELM) and an Adversarial Attention-based Interaction Module (AAIM). DELM employs a denoising diffusion probabilistic model to recover latent user-service embeddings from noisy data, bypassing the need for explicit graph topology. AAIM incorporates a bidirectional hybrid attention mechanism within an adversarial learning paradigm to dynamically distinguish meaningful patterns from noise. Extensive experiments on two large-scale real-world datasets demonstrate that QoSDiff achieves state-of-the-art performance, showing superior robustness to data sparsity, noise, and cross-dataset generalization challenges. The proposed approach has significant implications for improving service selection, recommendation, and composition in dynamic web service environments.
Methodology
QoSDiff consists of two main components: (1) The Diffusion-based Embedding Learning Module (DELM), inspired by denoising diffusion probabilistic models, progressively denoises latent embeddings from noisy data without relying on explicit graph structures. (2) The Adversarial Attention-based Interaction Module (AAIM) employs a bidirectional hybrid attention mechanism within an adversarial learning framework to capture high-order interactions and separate informative patterns from noise. Together, these components enable robust QoS prediction in sparse and noisy environments.
Results
QoSDiff significantly outperforms state-of-the-art QoS prediction models on two large-scale real-world datasets. It demonstrates superior robustness to data sparsity and noise, achieving higher prediction accuracy and better generalization across datasets. The framework also reduces the computational overhead associated with explicit graph construction, making it scalable for large-scale service environments.
Implications
QoSDiff has the potential to transform QoS prediction in service computing by enabling robust and scalable predictions in dynamic and noisy environments. Its applications include improving service selection, recommendation systems, and service composition in cloud computing and IoT ecosystems. The framework's ability to generalize across datasets also makes it suitable for cross-domain applications, enhancing user satisfaction and optimizing service delivery.
View on arXiv

RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting

Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu, Depei Qian
  • RLHFSpec integrates speculative decoding into RLHF generation for the first time, addressing performance bottlenecks.
  • A workload-aware drafting strategy selection mechanism dynamically optimizes speculative decoding strategies based on workload changes.
  • Sample reallocation improves GPU utilization by redistributing workloads across instances, addressing load imbalances caused by varying response lengths.
  • RLHFSpec achieves higher throughput in the generation stage and accelerates overall RLHF execution.
  • The system is particularly effective in scenarios with long-tailed response distributions, such as Chain of Thought reasoning.
Abstract
This paper addresses the efficiency bottleneck in Reinforcement Learning from Human Feedback (RLHF) training, specifically focusing on the generation stage, which dominates the overall execution time due to its autoregressive decoding process. The authors propose RLHFSpec, a novel system that integrates speculative decoding into RLHF generation to accelerate execution. RLHFSpec introduces two key innovations: a workload-aware drafting strategy selection mechanism that dynamically optimizes speculative decoding strategies based on workload changes, and a sample reallocation mechanism that mitigates GPU resource underutilization caused by varying response lengths. These techniques collectively improve throughput in the generation stage and alleviate inefficiencies in RLHF training. Experimental results demonstrate that RLHFSpec achieves significant performance speedups compared to state-of-the-art methods, enhancing both generation and overall RLHF execution efficiency.
Methodology
The authors leverage speculative decoding, a technique traditionally used in online serving, and adapt it for RLHF generation by introducing a workload-aware drafting strategy selection mechanism. This mechanism dynamically adjusts speculative decoding strategies based on the verification cost and the number of accepted tokens. Additionally, they propose a sample reallocation policy with a lightweight migration mechanism to redistribute workloads across GPU instances, ensuring optimal resource utilization despite dynamic workloads and varying response lengths.
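As a rough illustration of what "workload-aware" selection could look like, the sketch below scores each candidate drafting configuration by an estimated accepted-tokens-per-millisecond under the current batch and picks the best one; the cost model, numbers, and names are hypothetical, not RLHFSpec's.

    # Hypothetical workload-aware drafting-strategy selection: choose the
    # speculative-decoding configuration with the best estimated throughput
    # for the current batch. The cost model is illustrative only.
    from dataclasses import dataclass

    @dataclass
    class DraftStrategy:
        name: str
        draft_len: int          # speculative tokens proposed per step
        accept_rate: float      # empirical per-token acceptance rate
        draft_cost_ms: float    # time to draft one token
        verify_cost_ms: float   # target-model verification time per sample

    def expected_throughput(s: DraftStrategy, batch_size: int) -> float:
        """Accepted tokens per millisecond for one speculative step,
        assuming independent per-token acceptance (rough model)."""
        accepted = sum(s.accept_rate ** k for k in range(1, s.draft_len + 1))
        step_ms = s.draft_len * s.draft_cost_ms + s.verify_cost_ms * batch_size
        return accepted * batch_size / step_ms

    def select_strategy(strategies, batch_size: int) -> DraftStrategy:
        return max(strategies, key=lambda s: expected_throughput(s, batch_size))

    candidates = [
        DraftStrategy("short-draft", 2, 0.8, 0.4, 3.0),
        DraftStrategy("long-draft", 6, 0.7, 0.6, 3.0),
    ]
    print(select_strategy(candidates, batch_size=1).name)  # short-draft: drafting overhead dominates
    print(select_strategy(candidates, batch_size=8).name)  # long-draft: verification is amortized

The point of the toy is only that the best drafting configuration shifts as the workload (here, batch size) changes, which is what the selection mechanism tracks.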
Results
Experimental evaluations show that RLHFSpec significantly improves throughput in the generation stage compared to existing methods. The system achieves higher GPU utilization and reduces execution time for RLHF training, particularly in scenarios with long-tailed response distributions. Overall, RLHFSpec demonstrates substantial performance speedups in RLHF execution, making it more efficient and scalable.
Implications
RLHFSpec has the potential to enhance the efficiency of RLHF training for large language models, enabling faster fine-tuning and deployment of models in real-world applications. Its ability to handle dynamic workloads and optimize GPU utilization makes it particularly suitable for scenarios involving complex reasoning tasks, such as Chain of Thought reasoning. This work could pave the way for more efficient training methodologies in the development of advanced AI systems.
View on arXiv

RNNs perform task computations by dynamically warping neural representations

Arthur Pellegrino, Angus Chadwick
  • The authors derive a relationship between the topology of input data manifolds and the neural activity manifolds of dynamical systems, showing that the latter is constrained to an (m+1)-dimensional manifold for inputs on an m-dimensional manifold.
  • A pullback Riemannian metric is introduced to characterize the geometry of neural activity manifolds, reflecting the computations performed by the system.
  • In a contextual decision-making task, RNNs dynamically compress irrelevant input information over time, demonstrating task-specific warping of the neural manifold.
  • In a sequential working memory task, RNN activity is shown to lie on a hyper-torus, with its geometry dynamically warping to retrieve different memories at different times.
  • The framework extends beyond attractor manifolds, providing a formal mathematical approach to study dynamic neural representations in RNNs and other dynamical systems.
Read More
Abstract
This paper explores how recurrent neural networks (RNNs) perform computations by dynamically warping their internal neural representations of task variables. The authors propose a novel Riemannian geometric framework to analyze the relationship between the topology and geometry of input data manifolds and the neural activity manifolds of dynamical systems. They demonstrate that RNNs dynamically adapt their internal representations to solve tasks, compress irrelevant information, and retrieve task-relevant features over time. The framework provides a mathematical basis for understanding the computations performed by RNNs beyond traditional fixed-point attractor analysis, offering insights into the dynamic nature of neural representations in time-dependent tasks.
Methodology
The authors develop a Riemannian geometric framework to analyze the topology and geometry of neural activity manifolds in dynamical systems. They derive a pullback Riemannian metric: the map from the input manifold into the neural state-space pulls the ambient metric back onto the input manifold, characterizing the geometry of the induced neural activity manifold. This framework is applied to RNNs trained on specific tasks, such as contextual decision-making and sequential working memory, to study how their internal representations evolve dynamically.
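For readers who want the construction spelled out: if neural activity at time t is a smooth function \phi(\cdot, t) of a point u on the input manifold, the pullback of the Euclidean metric of the n-dimensional neural state-space is the standard object below; its dependence on t is the "dynamic warping" (notation here is generic and not necessarily the paper's).

    g_{ij}(u, t) \;=\; \sum_{a=1}^{n} \frac{\partial \phi^{a}(u,t)}{\partial u^{i}}\,
                       \frac{\partial \phi^{a}(u,t)}{\partial u^{j}}
                 \;=\; \big( J(u,t)^{\top} J(u,t) \big)_{ij},
    \qquad J_{ai} \;=\; \frac{\partial \phi^{a}(u,t)}{\partial u^{i}},

so lengths and angles measured with g_{ij}(u,t) on the input manifold coincide with those measured along the neural activity manifold at time t.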
Results
The study reveals that RNNs dynamically warp their internal representations to solve tasks. In the contextual decision-making task, the neural manifold compresses irrelevant input information over time. In the sequential working memory task, the RNN's activity forms a hyper-torus, with its geometry dynamically adapting to retrieve different memories at different times. These findings highlight the dynamic and task-specific nature of RNN computations.
Implications
This work provides a new mathematical framework for understanding the computations performed by RNNs and other dynamical systems, offering insights into how neural networks process time-varying inputs. The framework could be applied to improve interpretability in machine learning models, enhance the design of RNN-based systems, and deepen our understanding of neural computations in biological systems. Additionally, it may inform the development of more efficient architectures for tasks involving dynamic, time-dependent data.
View on arXiv

Realizable Abstractions: Near-Optimal Hierarchical Reinforcement Learning

Roberto Cipollone, Luca Iocchi, Matteo Leonetti
  • Introduces Realizable Abstractions, a formal framework linking low-level MDPs to high-level decision processes with near-optimality guarantees.
  • Proposes RARL, a new HRL algorithm that learns compositional policies through constrained MDPs and iterative refinement of abstractions.
  • Demonstrates PAC guarantees, sample efficiency, and robustness to inaccuracies in abstractions and optimistic rewards.
  • Addresses non-Markovianity issues commonly found in HRL and provides conditions for reducing abstraction horizons.
  • Extends prior frameworks like MDP homomorphisms and stochastic bisimulations to improve expressive power and applicability.
Read More
Abstract
This paper introduces Realizable Abstractions, a novel framework for defining hierarchical abstractions in reinforcement learning (HRL) that avoids non-Markovianity issues and provides near-optimality guarantees. The authors propose a formal relation between low-level Markov Decision Processes (MDPs) and high-level decision processes, enabling efficient compositional policy learning without requiring specific knowledge of the ground MDP. They develop RARL (Realizable Abstractions Reinforcement Learning), an algorithm that learns compositional, near-optimal policies by leveraging constrained MDPs (CMDPs) to solve the realization problem. RARL is shown to be sample-efficient, Probably Approximately Correct (PAC), and robust to inaccuracies in abstractions. The paper provides theoretical insights into abstraction design, including conditions for reducing the effective horizon, and demonstrates the algorithm's ability to iteratively refine abstractions while driving exploration in the ground MDP.
Methodology
The authors define Realizable Abstractions as a relation between low-level MDPs and high-level decision processes, avoiding non-Markovian dependencies. They cast the realization problem as a Constrained MDP (CMDP) and solve it using online RL algorithms for CMDPs. RARL iteratively refines abstractions by sampling in the ground MDP and uses the current abstraction solution to guide exploration.
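The compositional structure (an abstract plan realized by low-level controllers) can be shown on a toy chain MDP; the CMDP-based realization, exploration, and PAC machinery of RARL are deliberately omitted, so the sketch below is only a structural illustration with made-up names.

    # Toy illustration of compositional hierarchical control: a 9-state chain
    # is abstracted into 3 blocks, a plan is computed over blocks, and each
    # abstract step is "realized" by a low-level controller that drives the
    # agent into the requested block. Not RARL itself.
    N_GROUND = 9

    def block(s: int) -> int:                 # abstraction map: ground -> abstract
        return s // 3

    def high_level_plan(start_block: int, goal_block: int):
        """Trivial planner over abstract states: move one block at a time."""
        step = 1 if goal_block >= start_block else -1
        return list(range(start_block + step, goal_block + step, step))

    def realize(state: int, target_block: int) -> int:
        """Low-level controller: take ground actions (+1/-1) until the current
        abstract state matches the block requested by the high-level plan."""
        while block(state) != target_block:
            state += 1 if target_block > block(state) else -1
        return state

    state = 0
    for abstract_target in high_level_plan(block(state), goal_block=2):
        state = realize(state, abstract_target)
    print("reached ground state", state, "in block", block(state))   # block 2

In RARL, the "realize" step is where the constrained MDP appears: the low-level policy must satisfy the realization requirement of the abstract transition rather than simply reach a block, and the collected samples are then used to refine the abstraction.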
Results
RARL achieves compositional and near-optimal policies for low-level MDPs, with formal PAC guarantees and polynomial sample complexity. The algorithm is robust to inaccuracies in abstractions and overly optimistic abstract rewards, and it effectively reduces the abstraction horizon under specific conditions.
Implications
The framework and algorithm have significant implications for improving efficiency and scalability in hierarchical reinforcement learning. They enable better policy reuse, compositionality, and abstraction design for complex decision-making tasks, with potential applications in robotics, autonomous systems, and large-scale planning problems.
View on arXiv

Reliable Statistical Guarantees for Conformal Predictors with Small Datasets

Miguel Sánchez-Domínguez, Lucas Lacasa, Javier de Vicente, Gonzalo Rubio, Eusebio Valero
  • The paper identifies limitations of traditional conformal prediction guarantees in small dataset scenarios, where coverage variability can undermine reliability.
  • A new statistical guarantee is proposed, providing probabilistic coverage information for individual conformal predictors, even with limited calibration data.
  • The proposed framework converges to standard CP guarantees as dataset size increases, ensuring consistency across different data regimes.
  • The methodology is validated through illustrative examples and made accessible via an open-source software tool compatible with existing CP libraries.
  • The work emphasizes the importance of reliable uncertainty quantification for surrogate models in safety-critical applications.
Read More
Abstract
This paper addresses the challenge of providing reliable statistical guarantees for conformal predictors (CP) when working with small calibration datasets, a common scenario in surrogate modeling for science and engineering applications. Conformal prediction is a framework that provides uncertainty quantification with statistical guarantees, but its traditional guarantees are marginal and may fail to provide reliable coverage for small datasets due to high variability. The authors propose a novel statistical guarantee that offers probabilistic information about the coverage of individual conformal predictors, even with limited data. This new framework converges to the standard CP guarantees for large datasets while maintaining reliability for small datasets. The paper includes a detailed introduction to uncertainty quantification for machine learning practitioners, validates the proposed methodology through pedagogical examples, and provides an open-source software tool to facilitate adoption. The work is particularly relevant for safety-critical applications where reliable uncertainty quantification is essential.
Methodology
The authors extend the conformal prediction framework by introducing a new statistical guarantee that accounts for small calibration dataset sizes. They derive theoretical results showing the convergence of their method to standard CP guarantees for large datasets. The methodology is validated using simple, pedagogical examples to demonstrate its practical utility and reliability. Additionally, they provide an open-source software implementation to facilitate adoption by practitioners.
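The small-sample issue being addressed is easy to see in standard split conformal prediction: conditional on a calibration set of size n, the realized coverage is itself random and, for continuous exchangeable scores, follows a Beta distribution. The sketch below builds an ordinary split-conformal interval and quantifies that variability; it does not implement the paper's new guarantee, and the toy model and numbers are illustrative.

    # Split conformal prediction with a small calibration set, plus the
    # classical Beta law for the coverage conditional on that set. Illustrates
    # the variability the paper addresses; not the paper's refined guarantee.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, n_cal = 0.1, 20

    # Toy surrogate model: predict y = x, with Gaussian noise as the residual.
    x_cal = rng.uniform(-1, 1, n_cal)
    y_cal = x_cal + rng.normal(scale=0.3, size=n_cal)
    scores = np.abs(y_cal - x_cal)                    # nonconformity scores

    # Conformal quantile: k-th smallest score with k = ceil((n+1)(1-alpha)).
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[k - 1]
    print(f"interval half-width: {q_hat:.3f}")        # prediction set: y_hat +/- q_hat

    # Conditional coverage given this calibration set follows Beta(k, n+1-k)
    # for continuous, exchangeable scores, so small n means high variability.
    cov = stats.beta(k, n_cal + 1 - k)
    print(f"mean coverage: {cov.mean():.3f}")
    print(f"P(coverage < {1 - alpha}): {cov.cdf(1 - alpha):.3f}")

With n = 20 the probability that a given predictor undercovers is far from negligible, which is exactly the regime where a per-predictor probabilistic guarantee of the kind proposed here becomes useful.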
Results
The proposed framework successfully provides reliable coverage guarantees for individual conformal predictors, even with small datasets. Theoretical analysis confirms that the new guarantees converge to traditional CP guarantees as dataset size increases. Validation on illustrative examples demonstrates the practical applicability and robustness of the method.
Implications
This work has significant implications for safety-critical applications in science and engineering, where reliable uncertainty quantification is essential. By addressing the limitations of conformal prediction with small datasets, the proposed framework enables more robust deployment of surrogate models in fields such as fluid dynamics, structural mechanics, and medicine. The open-source software further lowers the barrier to adoption, making the methodology accessible to a broader audience of machine learning practitioners.
View on arXiv

Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective

Bowen Zheng, Ran Cheng
  • The Generalized Decoupled Knowledge Distillation (GDKD) loss introduces a hierarchical partitioning mechanism for logits, enhancing the flexibility and effectiveness of logit-based distillation.
  • Empirical analysis reveals that partitioning logits based on the top logit improves the interrelationships among non-top logits, and focusing on non-top logits enhances knowledge extraction.
  • The proposed GDKD algorithm efficiently processes multimodal predictive distributions, balancing accuracy and training speed without additional computational overhead.
  • GDKD demonstrates superior performance compared to existing logit-based and feature-based knowledge distillation methods across diverse benchmarks.
  • The approach simplifies the distillation process while maintaining high efficacy, making it suitable for resource-constrained environments.
Read More
Abstract
This paper revisits Decoupled Knowledge Distillation (DKD), a logit-based knowledge distillation method, from the perspective of predictive distributions. The authors propose a novel Generalized Decoupled Knowledge Distillation (GDKD) loss, which refines and extends the logit partitioning mechanism to improve the transfer of knowledge from teacher models to student models. By analyzing the predictive distributions of teacher models, the study uncovers two critical insights: the importance of partitioning logits based on the top logit to enhance relationships among non-top logits, and the benefits of amplifying the focus on non-top logits during distillation. Building on these insights, the authors introduce a streamlined GDKD algorithm designed to efficiently handle multimodal predictive distributions, achieving a balance between accuracy and training efficiency. Extensive experiments across multiple benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate that GDKD outperforms both DKD and other leading knowledge distillation methods without requiring additional parameters.
Methodology
The authors developed the GDKD loss by introducing a hierarchical partitioning mechanism for logits, allowing for a more versatile decoupling of teacher and student predictive distributions. They conducted empirical analyses to understand the impact of predictive distributions on distillation gradients and designed a streamlined GDKD algorithm to optimize the utilization of logits, particularly in multimodal predictive distributions. The approach was evaluated on multiple datasets using standard benchmarks.
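For reference, the baseline DKD decomposition that GDKD generalizes splits the distillation KL into a target/non-target term (TCKD) and a renormalized non-target term (NCKD); a PyTorch sketch of that baseline is below, with illustrative hyperparameters. GDKD's hierarchical partition of the logits is not reproduced here.

    # Baseline Decoupled Knowledge Distillation (DKD) loss: the two-term split
    # that GDKD generalizes with a hierarchical logit partition. Written from
    # the published DKD formulation; alpha, beta, and T are illustrative.
    import torch
    import torch.nn.functional as F

    def dkd_loss(logits_s, logits_t, target, alpha=1.0, beta=8.0, T=4.0):
        gt = F.one_hot(target, logits_s.shape[1]).bool()

        p_s = F.softmax(logits_s / T, dim=1)
        p_t = F.softmax(logits_t / T, dim=1)

        # TCKD: KL between binary (target vs. non-target) distributions.
        b_s = torch.stack([p_s[gt], 1 - p_s[gt]], dim=1)
        b_t = torch.stack([p_t[gt], 1 - p_t[gt]], dim=1)
        tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean") * T**2

        # NCKD: KL over the non-target classes, renormalized by masking out
        # the target logit before the softmax.
        nt_s = F.log_softmax(logits_s / T - 1000.0 * gt, dim=1)
        nt_t = F.softmax(logits_t / T - 1000.0 * gt, dim=1)
        nckd = F.kl_div(nt_s, nt_t, reduction="batchmean") * T**2

        return alpha * tckd + beta * nckd

The empirical insights above act on this structure: the partition point (here, the single ground-truth/top logit) and the weight on the non-top term are exactly the knobs GDKD generalizes.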
Results
The GDKD method consistently outperformed the original DKD and other state-of-the-art knowledge distillation techniques across benchmarks such as CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes. It achieved superior accuracy and training efficiency without introducing additional parameters, demonstrating its effectiveness in both performance and computational cost.
Implications
The proposed GDKD method provides a more efficient and effective approach to knowledge distillation, making it particularly suitable for scenarios where computational resources are limited, such as mobile and embedded systems. Its ability to handle multimodal predictive distributions and focus on non-top logits opens new avenues for improving model compression and knowledge transfer in deep learning applications.
View on arXiv

Score Matching for Estimating Finite Point Processes

Haoqun Cao, Yixuan Zhang, Feng Zhou
  • The paper establishes a formal framework for score matching on finite point processes using Janossy measures, addressing theoretical gaps in prior work.
  • A weighted score-matching (WSM) estimator is introduced, with proven consistency and convergence rates in parametric settings.
  • For nonparametric models, the authors identify normalization issues that prevent SM from uniquely identifying the ground-truth distribution and propose a survival-classification augmentation as a remedy.
  • The proposed methods are integration-free, making them computationally efficient and scalable for spatio-temporal point process models.
  • Extensive experiments validate the theoretical claims and show performance comparable to MLE on both synthetic and real-world datasets.
Read More
Abstract
This paper addresses the limitations of existing score matching (SM) methods for estimating finite point processes, which are statistical models used to characterize event occurrences in bounded spaces. Traditional maximum likelihood estimation (MLE) for point processes is computationally expensive due to the need for normalizing constants, especially in high-dimensional settings. While prior works have applied SM to point processes, they lack rigorous theoretical guarantees and fail to recover the ground-truth intensity in many cases. The authors propose a formal framework for SM on finite point processes using Janossy measures and introduce a weighted score-matching (WSM) estimator. They analyze its statistical properties under parametric settings and extend it to nonparametric models. To address the non-identifiability of SM in nonparametric cases, they propose a survival-classification augmentation, which provides a complete, integration-free training objective for intensity-based point process models. Experiments on synthetic and real-world datasets demonstrate that the proposed methods achieve accuracy comparable to MLE while being computationally efficient and scalable.
Methodology
The authors develop a formal framework for score matching on finite point processes using Janossy measures. They propose a weighted score-matching (WSM) estimator for parametric models and analyze its statistical properties. For nonparametric models, they address the non-identifiability of SM by introducing a survival-classification augmentation, which creates a complete, integration-free training objective. The methods are evaluated on synthetic and real-world temporal and spatio-temporal datasets.
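As background for the estimator above (not the paper's exact objective): classical score matching fits an unnormalized model q_θ by minimizing the Fisher divergence, which integration by parts turns into an expression free of both the normalizing constant and the unknown data score; weighted variants insert a weight function into the divergence, and the paper formulates such an objective at the level of Janossy densities of a finite point process.

    J(\theta)
      \;=\; \tfrac{1}{2}\,\mathbb{E}_{x \sim p}\!\left[\,\lVert \nabla_x \log q_\theta(x) - \nabla_x \log p(x) \rVert^2 \right]
      \;=\; \mathbb{E}_{x \sim p}\!\left[\, \tfrac{1}{2}\lVert \nabla_x \log q_\theta(x) \rVert^2 + \Delta_x \log q_\theta(x) \right] + \mathrm{const},

where Δ_x denotes the Laplacian; only derivatives of log q_θ appear, so the intractable normalizer never has to be computed.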
Results
The proposed WSM estimator achieves consistency and convergence rates in parametric settings. For nonparametric models, the survival-classification augmentation resolves normalization issues, enabling accurate recovery of ground-truth intensities. Experiments show that the methods achieve accuracy comparable to MLE while being more computationally efficient and scalable.
Implications
The proposed methods have significant implications for modeling and analyzing point processes in fields such as seismology, finance, criminology, and neuroscience. By eliminating the need for computationally expensive normalizing constants, these methods enable efficient and scalable training of intensity-based point process models, particularly in high-dimensional spatio-temporal settings.
View on arXiv

SmartAlert: Implementing Machine Learning-Driven Clinical Decision Support for Inpatient Lab Utilization Reduction

April S. Liang, Fatemeh Amrollahi, Yixing Jiang, Conor K. Corbin, Grace Y.E. Kim, David Mui, Trevor Crowell, Aakash Acharya, Sreedevi Mony, Soumya Punnathanam, Jack McKeown, Margaret Smith, Steven Lin, Arnold Milstein, Kevin Schulman, Jason Hom, Michael A. Pfeffer, Tho D. Pham, David Svec, Weihan Chu, Lisa Shieh, Christopher Sharp, Stephen P. Ma, Jonathan H. Chen
  • SmartAlert is a machine learning-based CDS system integrated into the EHR to reduce repetitive inpatient lab testing.
  • The system uses a probabilistic regression model to predict stable lab results and provides real-time alerts to clinicians.
  • A randomized controlled pilot across two hospitals showed a 15% reduction in repetitive CBC testing without adverse safety outcomes.
  • Successful implementation required stakeholder alignment, governance processes, and careful user interface design.
  • The study demonstrates the feasibility of deploying ML-driven tools to optimize clinical decision-making and reduce healthcare costs.
Read More
Abstract
This paper introduces SmartAlert, a machine learning-driven clinical decision support (CDS) system designed to reduce unnecessary inpatient laboratory testing, specifically targeting repetitive complete blood count (CBC) tests. The system integrates into the electronic health record (EHR) and uses a probabilistic regression model to predict the likelihood of stable lab results, providing clinicians with real-time guidance on whether repeat testing is necessary. The study describes the implementation of SmartAlert in a randomized controlled pilot across two hospitals, involving 9270 admissions over seven months. Results demonstrated a 15% reduction in repetitive CBC testing within 52 hours of alert display, without compromising patient safety. Key challenges addressed include stakeholder engagement, governance for deploying ML models in clinical settings, and user interface design. The findings highlight the potential of ML-driven CDS systems to improve healthcare efficiency and reduce costs while maintaining safety.
Methodology
The study deployed SmartAlert in two hospitals using a randomized controlled design. The system architecture leveraged the DEPLOYR framework, integrating real-time patient data from the EHR via FHIR APIs. A probabilistic regression model predicted the likelihood of stable CBC results, and alerts were displayed to clinicians when repetitive lab tests were ordered. The pilot study included 9270 admissions over seven months, and outcomes were measured in terms of CBC testing rates and secondary safety metrics.
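The alerting logic described above can be illustrated with a simple decision rule: when a repeat CBC is ordered within the look-back window, score it with the stability model and interrupt only if a stable result is very likely. Everything in the sketch (feature names, coefficients, threshold, and the stand-in model) is hypothetical; only the 52-hour window comes from the study.

    # Hypothetical illustration of the alert trigger: score a repeat CBC order
    # with a stability model and alert when a stable result is very likely.
    # Features, coefficients, and the 0.9 threshold are placeholders; only the
    # 52-hour window is taken from the study description.
    import math
    from dataclasses import dataclass

    @dataclass
    class LabOrder:
        hours_since_last_cbc: float
        last_hemoglobin: float      # g/dL
        hemoglobin_trend: float     # change between the last two results, g/dL

    def predicted_stability(order: LabOrder) -> float:
        """Stand-in for the probabilistic regression model: P(next CBC stable)."""
        z = (3.0
             - 0.03 * order.hours_since_last_cbc
             - 0.8 * abs(order.hemoglobin_trend)
             + 0.1 * (order.last_hemoglobin - 10.0))
        return 1.0 / (1.0 + math.exp(-z))

    def should_alert(order: LabOrder, window_h: float = 52.0, thresh: float = 0.9) -> bool:
        return order.hours_since_last_cbc <= window_h and predicted_stability(order) >= thresh

    order = LabOrder(hours_since_last_cbc=18, last_hemoglobin=12.4, hemoglobin_trend=-0.1)
    print(should_alert(order))      # True: the repeat CBC is likely uninformative

In the deployed system the prediction is computed on live EHR data pulled via FHIR, and the threshold and alert wording are the kinds of design choices the governance and interface work described above had to settle.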
Results
The implementation of SmartAlert led to a 15% relative reduction in repetitive CBC testing (1.54 vs. 1.82 tests within 52 hours, p < 0.01). Importantly, there were no adverse effects on secondary safety outcomes, demonstrating that the system safely reduced unnecessary testing. Qualitative feedback from clinicians highlighted the importance of clear alert design and alignment with clinical workflows.
Implications
SmartAlert demonstrates the potential of machine learning-driven CDS systems to optimize clinical workflows, reduce unnecessary healthcare costs, and improve patient outcomes. The study provides a framework for deploying ML models in real-world clinical settings, emphasizing the importance of stakeholder engagement, governance, and user-centered design. This approach could be extended to other areas of healthcare to address inefficiencies and improve decision-making.
View on arXiv

The Initialization Determines Whether In-Context Learning Is Gradient Descent

Shifeng Xie, Rui Yuan, Simone Rossi, Thomas Hannagan
  • Multi-head linear self-attention (LSA) cannot replicate one-step gradient descent (GD) when regression weights have a non-zero mean, highlighting a fundamental limitation of the ICL-GD correspondence.
  • The initialization of the query's prediction (yq) is a decisive factor; misalignment between yq and the prior induces a persistent performance gap.
  • The authors propose yq-LSA, a simple extension of LSA with a trainable initialization vector, which restores equivalence with GD even in the non-zero mean setting.
  • Theoretical analysis and experiments confirm the effectiveness of yq-LSA in linear regression tasks, demonstrating its ability to close the performance gap with GD.
  • Introducing explicit initial guesses in large language models (LLMs) improves their in-context learning performance on tasks like semantic similarity.
Read More
Abstract
This paper investigates the connection between in-context learning (ICL) in large language models (LLMs) and gradient descent (GD), focusing on the role of initialization in linear self-attention (LSA). Previous studies have shown that LSA can approximate GD under restrictive assumptions, such as zero-mean Gaussian priors and zero initialization for GD. However, these assumptions are unrealistic in practical settings. The authors extend this analysis to more realistic conditions by incorporating non-zero Gaussian prior means in linear regression tasks. They demonstrate that multi-head LSA cannot replicate one-step GD under these conditions, even with an arbitrarily large number of heads. To address this limitation, the authors propose yq-LSA, an extension of LSA that introduces a trainable initialization vector for the query (yq). This modification restores the equivalence between LSA and GD in the non-zero mean setting. The paper provides theoretical proofs and experimental validation, showing that yq-LSA bridges the gap between ICL and GD. Additionally, the authors explore the practical implications of their findings by demonstrating that introducing explicit initial guesses improves ICL performance in LLMs on semantic similarity tasks.
Methodology
The authors analyze the relationship between ICL and GD in the context of linear regression with linear self-attention (LSA). They extend the LSA embedding matrix by introducing an initial guess for the query (yq) and derive theoretical bounds on the number of attention heads required for ICL to approximate GD. They then propose yq-LSA, which incorporates a trainable initialization vector, and validate its performance through theoretical proofs and experiments on linear regression tasks. Finally, they test the practical utility of their findings by applying yq-LSA-inspired modifications to LLMs in semantic similarity tasks.
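yq-LSA itself is not reimplemented here, but the role of the initialization is already visible on the GD side of the correspondence: one gradient step on the in-context least-squares loss from a starting weight w0 predicts y_q via w1 = w0 − η∇L(w0), and when task weights have a non-zero mean, starting from w0 = 0 (the implicit choice of standard LSA constructions) leaves a persistent error that initializing at the prior mean greatly reduces. A NumPy sketch with illustrative dimensions and step size:

    # One step of gradient descent on an in-context linear-regression loss,
    # started either from zero or from the prior mean of the task weights.
    # Illustrative numbers only; this is the GD side of the ICL-GD
    # correspondence, not yq-LSA.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_ctx, n_tasks, eta = 5, 20, 2000, 0.4
    mu = np.full(d, 1.5)                       # non-zero prior mean of task weights

    def one_step_gd_prediction(X, y, x_q, w0):
        """Predict y_q after one GD step on (1/2N) * sum_i (w^T x_i - y_i)^2."""
        grad = X.T @ (X @ w0 - y) / len(y)
        w1 = w0 - eta * grad
        return w1 @ x_q

    errs = {"zero init": [], "prior-mean init": []}
    for _ in range(n_tasks):
        w_star = mu + rng.normal(size=d)       # task weights ~ N(mu, I)
        X = rng.normal(size=(n_ctx, d))
        y = X @ w_star
        x_q = rng.normal(size=d)
        y_q = w_star @ x_q
        errs["zero init"].append((one_step_gd_prediction(X, y, x_q, np.zeros(d)) - y_q) ** 2)
        errs["prior-mean init"].append((one_step_gd_prediction(X, y, x_q, mu) - y_q) ** 2)

    for name, e in errs.items():
        print(f"{name:>16}: MSE = {np.mean(e):.3f}")

Since the query's initial prediction corresponds to w0 applied to x_q, a trainable initialization vector for yq plays the role of the prior-mean start in this picture, which is the intuition behind yq-LSA.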
Results
The study proves that multi-head LSA cannot replicate one-step GD under non-zero mean conditions, regardless of the number of heads. The proposed yq-LSA successfully restores the equivalence between LSA and GD by introducing a trainable initialization vector. Experimental results confirm that yq-LSA closes the performance gap in linear regression tasks. Additionally, incorporating explicit initial guesses in LLMs improves their performance on semantic similarity tasks, demonstrating the practical relevance of the proposed approach.
Implications
This work provides a deeper understanding of the mechanisms underlying in-context learning in LLMs and highlights the importance of initialization in achieving GD-like behavior. The proposed yq-LSA framework offers a principled way to improve ICL performance in both theoretical and practical settings. These findings could inform the design of more effective transformer architectures and prompting strategies for tasks requiring in-context learning, such as few-shot learning and semantic similarity.
View on arXiv