Abstract
This paper introduces the Radial Dispersion Score (RDS), a novel, simple, and model-agnostic uncertainty estimation metric for large language models (LLMs). Unlike existing methods that rely on semantic clustering, internal model states, or calibration, RDS measures the radial dispersion of sampled generations in an embedding space, providing a clean geometric interpretation of uncertainty. A probability-weighted variant, RDSw, incorporates token probabilities from the LLM when available, further enhancing performance. RDS is parameter-free, scalable, and applicable to both black-box APIs and open-weight models. The method also supports per-sample uncertainty scoring, enabling applications such as best-of-N selection and confidence-based filtering. Across four free-form QA datasets and multiple LLMs, RDS and RDSw achieve state-of-the-art performance in hallucination detection and answer selection, demonstrating robustness and scalability.
Methodology
The authors propose RDS, which measures the total radial dispersion of sampled generations embedded on a unit hypersphere. RDS calculates the ℓ1 distance of each embedding from the empirical centroid. A probability-weighted variant, RDSw, incorporates token-level probabilities when available. The method is model-agnostic, parameter-free, and does not rely on semantic clustering or internal model states. It is evaluated on four free-form QA datasets using multiple LLMs, comparing against nine strong baselines.
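The core computation can be sketched in a few lines. The snippet below is an illustrative reconstruction from the description above, not the authors' reference implementation: it assumes embeddings are normalized onto the unit hypersphere, that dispersion is aggregated by averaging the per-sample ℓ1 distances from the centroid (the paper says "total", so a sum would be an equivalent choice up to scale), and that RDSw uses normalized sequence probabilities as weights.

```python
import numpy as np

def rds(embeddings: np.ndarray) -> float:
    """Radial Dispersion Score (sketch): mean l1 distance of
    unit-normalized sample embeddings from their empirical centroid."""
    # Project each sampled generation's embedding onto the unit hypersphere.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    # l1 (Manhattan) distance of each point from the centroid, averaged.
    return float(np.abs(unit - centroid).sum(axis=1).mean())

def rds_weighted(embeddings: np.ndarray, seq_probs: np.ndarray) -> float:
    """Probability-weighted variant (RDSw, assumed form): weight each
    sample's dispersion by its normalized sequence probability."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = seq_probs / seq_probs.sum()
    centroid = (w[:, None] * unit).sum(axis=0)
    return float((w * np.abs(unit - centroid).sum(axis=1)).sum())
```

With uniform weights, RDSw reduces to RDS; samples the model assigns low probability contribute less to the weighted score.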
Results
RDS and RDSw deliver state-of-the-art performance in hallucination detection and answer selection tasks. The methods outperform nine baselines, including semantic entropy and geometric methods like EigenScore, across four challenging QA datasets. RDS demonstrates robustness to sample size and embedding choice, while RDSw further improves accuracy when token probabilities are available. The per-sample scoring capability of RDS also enhances best-of-N selection and confidence-based filtering applications.
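The per-sample scoring mentioned above lends itself to best-of-N selection: score each candidate by its distance from the centroid and keep the most central one. This is a hypothetical illustration of that use, under the same assumptions as before (unit-normalized embeddings, ℓ1 distance); the paper's exact selection rule may differ.

```python
import numpy as np

def best_of_n(embeddings: np.ndarray) -> int:
    """Return the index of the candidate whose unit-normalized embedding
    lies closest (in l1 distance) to the empirical centroid."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    per_sample = np.abs(unit - centroid).sum(axis=1)
    # The lowest-dispersion sample is treated as the most reliable answer.
    return int(per_sample.argmin())
```

The same per-sample distances can drive confidence-based filtering by thresholding instead of taking the argmin.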
Implications
The proposed RDS and RDSw metrics provide a robust and scalable solution for uncertainty estimation in LLMs, applicable to both black-box APIs and open-weight models. These methods can improve the reliability of LLM-based systems by enabling better hallucination detection, answer selection, and confidence-based filtering. The simplicity and model-agnostic nature of RDS make it a practical tool for a wide range of applications, including QA systems, content generation, and safety-critical AI deployments.