# UQLM

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
UQLM is a Python library for Large Language Model (LLM) hallucination detection using state-of-the-art uncertainty quantification techniques.
## Installation

The latest version can be installed from PyPI:

```bash
pip install uqlm
```
## Hallucination Detection
UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into different types:
| Scorer Type | Added Latency | Added Cost | Compatibility | Off-the-Shelf / Effort |
|---|---|---|---|---|
| Black-Box Scorers | ⏱️ Medium-High (multiple generations & comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal (works with any LLM) | ✅ Off-the-shelf |
| White-Box Scorers | ⚡ Minimal* (token probabilities already returned) | ✔️ None* (no extra LLM calls) | 🔒 Limited (requires access to token probabilities) | ✅ Off-the-shelf |
| LLM-as-a-Judge Scorers | ⏳ Low-Medium (additional judge calls add latency) | 💵 Low-High (depends on number of judges) | 🌍 Universal (any LLM can serve as judge) | ✅ Off-the-shelf |
| Ensemble Scorers | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | ✅ Off-the-shelf (beginner-friendly); 🛠️ Can be tuned (best for advanced users) |
| Long-Text Scorers | ⏱️ High-Very high (multiple generations & claim-level comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal | ✅ Off-the-shelf |
<sup>*Does not apply to multi-generation white-box scorers, which have higher cost and latency.</sup>
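As a minimal sketch of how these confidence scores might be consumed downstream (the 0.5 cutoff is an arbitrary assumption for illustration, not a UQLM default):

```python
def flag_likely_hallucinations(scores, threshold=0.5):
    """Return indices of responses whose confidence score falls below
    the chosen threshold (lower score = higher hallucination risk)."""
    return [i for i, s in enumerate(scores) if s < threshold]

# Flag the second and fourth responses for human review
flag_likely_hallucinations([0.92, 0.31, 0.77, 0.08])  # returns [1, 3]
```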
Below we provide illustrative code snippets and details about available scorers for each type.
### Black-Box Scorers (Consistency-Based)
These scorers assess uncertainty by measuring the consistency of multiple responses generated from the same prompt. They are compatible with any LLM, intuitive to use, and don't require access to internal model states or token probabilities.
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/images/black_box_graphic_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/images/black_box_graphic.png"> <img src="assets/images/black_box_graphic.png" alt="Black Box Graphic" /> </picture> </p>

**Example Usage:**

The following snippet illustrates how to use the `BlackBoxUQ` class to conduct hallucination detection.
```python
from langchain_openai import ChatOpenAI

from uqlm import BlackBoxUQ

llm = ChatOpenAI(model="gpt-4o-mini")

# prompts: a list of prompt strings defined elsewhere
bbuq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"], use_best=True)
results = await bbuq.generate_and_score(prompts=prompts, num_responses=5)
results.to_df()
```
<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/uqlm/main/assets/images/black_box_output4.png" />
</p>
Above, `use_best=True` implements mitigation so that the uncertainty-minimized response is selected. Note that although we use `ChatOpenAI` in this example, any LangChain Chat Model may be used. For a more detailed demo, refer to our Black-Box UQ Demo.
**Available Scorers:**
- Discrete Semantic Entropy (Farquhar et al., 2024; Bouchard & Chauhan, 2025)
- Number of Semantic Sets (Lin et al., 2024; Vashurin et al., 2025; Kuhn et al., 2023)
- Non-Contradiction Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
- Entailment Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
- Exact Match (Cole et al., 2023; Chen & Mueller, 2023)
- BERTScore (Manakul et al., 2023; Zheng et al., 2020)
- Cosine Similarity (Shorinwa et al., 2024; HuggingFace)
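To build intuition for the consistency idea behind these scorers, here is a toy exact-match rate across sampled responses (a simplified stand-in for illustration, not UQLM's implementation):

```python
from collections import Counter

def exact_match_consistency(responses):
    """Fraction of sampled responses that agree (after whitespace/case
    normalization) with the most common response; higher = more consistent,
    so higher = lower estimated hallucination risk."""
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][1] / len(responses)

exact_match_consistency(["Paris", "paris", "Lyon", " Paris "])  # returns 0.75
```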
### White-Box Scorers (Token-Probability-Based)
These scorers leverage token probabilities to estimate uncertainty. They offer single-generation scoring, which is significantly faster and cheaper than black-box methods, but require access to the LLM's internal probabilities, meaning they are not necessarily compatible with all LLMs/APIs.
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/images/white_box_graphic_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/images/white_box_graphic.png"> <img src="assets/images/white_box_graphic.png" alt="White Box Graphic"/> </picture> </p>

**Example Usage:**

The following snippet illustrates how to use the `WhiteBoxUQ` class to conduct hallucination detection.
```python
from langchain_google_vertexai import ChatVertexAI

from uqlm import WhiteBoxUQ

llm = ChatVertexAI(model="gemini-2.5-pro")

# prompts: a list of prompt strings defined elsewhere
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
results = await wbuq.generate_and_score(prompts=prompts)
results.to_df()
```
<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/uqlm/main/assets/images/white_box_output2.png" />
</p>
Again, any LangChain Chat Model may be used in place of `ChatVertexAI`. For more detailed examples, refer to our demo notebooks on Single-Generation White-Box UQ and Multi-Generation White-Box UQ.
**Single-Generation Scorers** (minimal latency, zero extra cost):
- Minimum token probability (Manakul et al., 2023)
- Length-Normalized Sequence Probability (Malinin & Gales, 2021)
- Sequence Probability (Vashurin et al., 2024)
- Mean Top-K Token Negentropy (Scalena et al., 2025; Manakul et al., 2023)
- Min Top-K Token Negentropy (Scalena et al., 2025; Manakul et al., 2023)
- Probability Margin (Farr et al., 2024)
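The first two single-generation scorers above can be sketched directly from a response's token log-probabilities (variable names are illustrative; this is a sketch of the cited formulas, not UQLM's internals):

```python
import math

def min_token_probability(token_logprobs):
    """Minimum token probability: exp of the smallest token log-prob."""
    return math.exp(min(token_logprobs))

def length_normalized_seq_prob(token_logprobs):
    """Geometric mean of the token probabilities, i.e. the sequence
    probability normalized for response length."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
min_token_probability(logprobs)  # returns 0.5, the least confident token
```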
**Self-Reflection Scorers** (one additional generation per response):
- P(True) (Kadavath et al., 2022)
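The P(True) idea asks the model to judge its own answer and reads off the probability it assigns to "True". A sketch of the prompt construction (the template wording here is our assumption, not UQLM's exact prompt):

```python
def p_true_prompt(question: str, proposed_answer: str) -> str:
    """Build a self-reflection prompt; the scorer is the probability the
    model assigns to the token 'True' at the answer position."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer true? Answer 'True' or 'False'.\n"
        "Answer:"
    )
```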
**Multi-Generation Scorers** (several additional generations per response):
- Monte Carlo Sequence Probability
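One plausible reading of a Monte Carlo sequence-probability scorer is to average a sequence-level probability over several sampled generations for the same prompt (a sketch of that estimate under our assumptions, not necessarily UQLM's exact formula):

```python
import math

def monte_carlo_sequence_probability(sampled_logprob_seqs):
    """Average the length-normalized sequence probability across several
    sampled generations; more samples give a smoother estimate."""
    per_sample = [math.exp(sum(lp) / len(lp)) for lp in sampled_logprob_seqs]
    return sum(per_sample) / len(per_sample)
```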