# UQLM

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
UQLM is a Python library for Large Language Model (LLM) hallucination detection using state-of-the-art uncertainty quantification techniques.
## Installation

The latest version can be installed from PyPI:

```bash
pip install uqlm
```
## Hallucination Detection
UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into different types:
| Scorer Type | Added Latency | Added Cost | Compatibility | Off-the-Shelf / Effort |
|---|---|---|---|---|
| Black-Box Scorers | ⏱️ Medium-High (multiple generations & comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal (works with any LLM) | ✅ Off-the-shelf |
| White-Box Scorers | ⚡ Minimal* (token probabilities already returned) | ✔️ None* (no extra LLM calls) | 🔒 Limited (requires access to token probabilities) | ✅ Off-the-shelf |
| LLM-as-a-Judge Scorers | ⏳ Low-Medium (additional judge calls add latency) | 💵 Low-High (depends on number of judges) | 🌍 Universal (any LLM can serve as judge) | ✅ Off-the-shelf |
| Ensemble Scorers | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | ✅ Off-the-shelf (beginner-friendly); 🛠️ Can be tuned (best for advanced users) |
| Long-Text Scorers | ⏱️ High-Very high (multiple generations & claim-level comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal | ✅ Off-the-shelf |
<sup>*Does not apply to multi-generation white-box scorers, which have higher cost and latency.</sup>
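As a minimal sketch of how these confidence scores might be consumed downstream (the 0.5 cutoff is an arbitrary assumption for illustration, not a UQLM default):

```python
def flag_likely_hallucinations(scores, threshold=0.5):
    """Return indices of responses whose confidence score falls below
    the chosen threshold (lower score = higher hallucination risk)."""
    return [i for i, s in enumerate(scores) if s < threshold]

# Flag the second and fourth responses for human review
flag_likely_hallucinations([0.92, 0.31, 0.77, 0.08])  # returns [1, 3]
```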
Below we provide illustrative code snippets and details about available scorers for each type.
### Black-Box Scorers (Consistency-Based)
These scorers assess uncertainty by measuring the consistency of multiple responses generated from the same prompt. They are compatible with any LLM, intuitive to use, and don't require access to internal model states or token probabilities.
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/images/black_box_graphic_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/images/black_box_graphic.png"> <img src="assets/images/black_box_graphic.png" alt="Black Box Graphic" /> </picture> </p>

**Example Usage:**

The following snippet illustrates how to use the `BlackBoxUQ` class to conduct hallucination detection.
```python
from langchain_openai import ChatOpenAI

from uqlm import BlackBoxUQ

llm = ChatOpenAI(model="gpt-4o-mini")

# prompts: a list of prompt strings defined elsewhere
bbuq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"], use_best=True)
results = await bbuq.generate_and_score(prompts=prompts, num_responses=5)
results.to_df()
```
<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/uqlm/main/assets/images/black_box_output4.png" />
</p>
Above, `use_best=True` implements mitigation so that the uncertainty-minimized response is selected. Note that although we use `ChatOpenAI` in this example, any LangChain Chat Model may be used. For a more detailed demo, refer to our Black-Box UQ Demo.
**Available Scorers:**
- Discrete Semantic Entropy (Farquhar et al., 2024; Bouchard & Chauhan, 2025)
- Number of Semantic Sets (Lin et al., 2024; Vashurin et al., 2025; Kuhn et al., 2023)
- Non-Contradiction Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
- Entailment Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
- Exact Match (Cole et al., 2023; Chen & Mueller, 2023)
- BERTScore (Manakul et al., 2023; Zheng et al., 2020)
- Cosine Similarity (Shorinwa et al., 2024; HuggingFace)
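To build intuition for the consistency idea behind these scorers, here is a toy exact-match rate across sampled responses (a simplified stand-in for illustration, not UQLM's implementation):

```python
from collections import Counter

def exact_match_consistency(responses):
    """Fraction of sampled responses that agree (after whitespace/case
    normalization) with the most common response; higher = more consistent,
    so higher = lower estimated hallucination risk."""
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][1] / len(responses)

exact_match_consistency(["Paris", "paris", "Lyon", " Paris "])  # returns 0.75
```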
### White-Box Scorers (Token-Probability-Based)
These scorers leverage token probabilities to estimate uncertainty. They offer single-generation scoring, which is significantly faster and cheaper than black-box methods, but require access to the LLM's internal probabilities, meaning they are not necessarily compatible with all LLMs/APIs.
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/images/white_box_graphic_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/images/white_box_graphic.png"> <img src="assets/images/white_box_graphic.png" alt="White Box Graphic"/> </picture> </p>

**Example Usage:**

The following snippet illustrates how to use the `WhiteBoxUQ` class to conduct hallucination detection.
```python
from langchain_google_vertexai import ChatVertexAI

from uqlm import WhiteBoxUQ

llm = ChatVertexAI(model="gemini-2.5-pro")

# prompts: a list of prompt strings defined elsewhere
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
results = await wbuq.generate_and_score(prompts=prompts)
results.to_df()
```
<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/uqlm/main/assets/images/white_box_output2.png" />
</p>
Again, any LangChain Chat Model may be used in place of `ChatVertexAI`. For more detailed examples, refer to our demo notebooks on Single-Generation White-Box UQ and Multi-Generation White-Box UQ.
**Single-Generation Scorers** (minimal latency, zero extra cost):
- Minimum token probability (Manakul et al., 2023)
- Length-Normalized Sequence Probability (Malinin & Gales, 2021)
- Sequence Probability (Vashurin et al., 2024)
- Mean Top-K Token Negentropy (Scalena et al., 2025; Manakul et al., 2023)
- Min Top-K Token Negentropy (Scalena et al., 2025; Manakul et al., 2023)
- Probability Margin (Farr et al., 2024)
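The first two single-generation scorers above can be sketched directly from a response's token log-probabilities (variable names are illustrative; this is a sketch of the cited formulas, not UQLM's internals):

```python
import math

def min_token_probability(token_logprobs):
    """Minimum token probability: exp of the smallest token log-prob."""
    return math.exp(min(token_logprobs))

def length_normalized_seq_prob(token_logprobs):
    """Geometric mean of the token probabilities, i.e. the sequence
    probability normalized for response length."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
min_token_probability(logprobs)  # returns 0.5, the least confident token
```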
**Self-Reflection Scorers** (one additional generation per response):
- P(True) (Kadavath et al., 2022)
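The P(True) idea asks the model to judge its own answer and reads off the probability it assigns to "True". A sketch of the prompt construction (the template wording here is our assumption, not UQLM's exact prompt):

```python
def p_true_prompt(question: str, proposed_answer: str) -> str:
    """Build a self-reflection prompt; the scorer is the probability the
    model assigns to the token 'True' at the answer position."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer true? Answer 'True' or 'False'.\n"
        "Answer:"
    )
```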
**Multi-Generation Scorers** (several additional generations per response):
- Monte Carlo Sequence Probability
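One plausible reading of a Monte Carlo sequence-probability scorer is to average a sequence-level probability over several sampled generations for the same prompt (a sketch of that estimate under our assumptions, not necessarily UQLM's exact formula):

```python
import math

def monte_carlo_sequence_probability(sampled_logprob_seqs):
    """Average the length-normalized sequence probability across several
    sampled generations; more samples give a smoother estimate."""
    per_sample = [math.exp(sum(lp) / len(lp)) for lp in sampled_logprob_seqs]
    return sum(per_sample) / len(per_sample)
```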