<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/images/uqlm_flow_ds_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/images/uqlm_flow_ds.png"> <img src="assets/images/uqlm_flow_ds.png" alt="UQLM Flow Diagram" /> </picture> </p> <h1 align="center">uqlm: Uncertainty Quantification for Language Models</h1> <p align="center"> <a href="https://github.com/cvs-health/uqlm/actions"> <img src="https://github.com/cvs-health/uqlm/actions/workflows/ci.yaml/badge.svg" alt="Build Status"> </a> <a href="https://pypi.org/project/uqlm/"> <img src="https://badge.fury.io/py/uqlm.svg" alt="PyPI version"> </a> <a href="https://cvs-health.github.io/uqlm/latest/index.html"> <img src="https://img.shields.io/badge/docs-latest-blue.svg" alt="Documentation Status"> </a> <a href="https://pypi.org/project/uqlm/"> <img src="https://img.shields.io/badge/python-3.10%2B-blue" alt="Python Versions"> </a> <a href="https://opensource.org/licenses/Apache-2.0"> <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License"> </a> <a href="https://github.com/astral-sh/uv"> <img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json" alt="uv"> </a> <a href="https://github.com/astral-sh/ruff"> <img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff"> </a> </p> <p align="center"> <a href="https://www.jmlr.org/papers/v27/25-1557.html"> <img src="https://img.shields.io/badge/JMLR-Published-112467?style=flat&style=for-the-badge&logo=semantic-scholar&logoColor=white" alt="JMLR Publication"> </a> <a href="https://openreview.net/pdf?id=WOFspd4lq5"> <img src="https://img.shields.io/badge/TMLR-Published-4FA1CA?style=flat&logo=semantic-scholar&logoColor=white" alt="TMLR Publication"> </a> <a href="https://arxiv.org/abs/2602.17431"> <img 
src="https://img.shields.io/badge/arXiv-LongTextUQ-B31B1B?logo=arXiv&logoColor=white" alt="arXiv"> </a> </p>

UQLM is a Python library for Large Language Model (LLM) hallucination detection using state-of-the-art uncertainty quantification techniques.

Installation

The latest version can be installed from PyPI:

pip install uqlm

Hallucination Detection

UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into different types:

| Scorer Type | Added Latency | Added Cost | Compatibility | Off-the-Shelf / Effort |
|------------------------|----------------------------------------------------|------------------------------------------|-----------------------------------------------------------|---------------------------------------------------------|
| Black-Box Scorers | ⏱️ Medium-High (multiple generations & comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal (works with any LLM) | ✅ Off-the-shelf |
| White-Box Scorers | ⚡ Minimal* (token probabilities already returned) | ✔️ None* (no extra LLM calls) | 🔒 Limited (requires access to token probabilities) | ✅ Off-the-shelf |
| LLM-as-a-Judge Scorers | ⏳ Low-Medium (additional judge calls add latency) | 💵 Low-High (depends on number of judges) | 🌍 Universal (any LLM can serve as judge) | ✅ Off-the-shelf |
| Ensemble Scorers | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | ✅ Off-the-shelf (beginner-friendly); 🛠️ Can be tuned (best for advanced users) |
| Long-Text Scorers | ⏱️ High-Very High (multiple generations & claim-level comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal | ✅ Off-the-shelf |

<sup><sup> *Does not apply to multi-generation white-box scorers, which have higher cost and latency. </sup></sup>
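
Since every scorer returns a confidence score between 0 and 1, one simple way to act on the scores is to flag responses that fall below a cutoff. The sketch below is plain Python and does not use UQLM itself; the function name `flag_low_confidence`, the example scores, and the 0.5 threshold are all illustrative assumptions, and a suitable threshold would normally be chosen per use case.

```python
def flag_low_confidence(scores, threshold=0.5):
    """Return the indices of responses whose confidence is below `threshold`.

    `scores` is assumed to be a list of confidence scores in [0, 1],
    where higher means the response is less likely to contain errors.
    """
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"confidence score out of range [0, 1]: {s}")
    return [i for i, s in enumerate(scores) if s < threshold]


# Hypothetical confidence scores for four responses
scores = [0.92, 0.35, 0.78, 0.10]
flagged = flag_low_confidence(scores, threshold=0.5)
print(flagged)  # [1, 3]
```

Flagged responses could then be routed to manual review or regenerated, depending on the application.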

Below we provide illustrative code snippets and details about available scorers for each type.

Black-Box Scorers (Consistency-Based)

These scorers assess uncertainty by measuring the consistency of multiple responses generated from the same prompt. They are compatible with any LLM, intuitive to use, and don't require access to internal model states or token probabilities.

<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/images/black_box_graphic_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/images/black_box_graphic.png"> <img src="assets/images/black_box_graphic.png" alt="Black Box Graphic" /> </picture> </p>
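
To illustrate the intuition behind consistency-based scoring, here is a minimal sketch that measures how often resampled responses agree with the original response. It is a toy stand-in, not UQLM's implementation: the helper names `normalize` and `agreement_rate` are hypothetical, and string matching after normalization is a much cruder notion of agreement than a semantic comparison.

```python
import string


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def agreement_rate(original, samples):
    """Fraction of sampled responses that match the original after normalization."""
    if not samples:
        raise ValueError("need at least one sampled response")
    target = normalize(original)
    matches = sum(normalize(s) == target for s in samples)
    return matches / len(samples)


# Hypothetical responses to the same prompt
original = "Paris"
samples = ["Paris.", "paris", "Lyon", "Paris"]
print(agreement_rate(original, samples))  # 0.75
```

A rate near 1 means the model answers the same way every time; a low rate signals that the original response may be unreliable.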

Example Usage: Below is a sample of code illustrating how to use the BlackBoxUQ class to conduct hallucination detection.

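from langchain_openai import ChatOpenAI
from uqlm import BlackBoxUQ

# Any LangChain Chat Model can be substituted here
llm = ChatOpenAI(model="gpt-4o-mini")

# use_best=True selects the uncertainty-minimized response for each prompt
bbuq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"], use_best=True)

# `prompts` is assumed to be a list of prompt strings defined beforehand.
# Top-level `await` works in a notebook; in a script, call this from an async function.
results = await bbuq.generate_and_score(prompts=prompts, num_responses=5)
results.to_df()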
<p align="center"> <img src="https://raw.githubusercontent.com/cvs-health/uqlm/main/assets/images/black_box_output4.png" /> </p>

Above, use_best=True applies mitigation: for each prompt, the uncertainty-minimized response is selected. Note that although we use ChatOpenAI in this example, any LangChain Chat Model may be used. For a more detailed demo, refer to our Black-Box UQ Demo.
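
The idea of selecting a low-uncertainty response can be sketched as a majority vote over candidate responses. This is only an illustration under stated assumptions: the function `select_most_consistent` is hypothetical, candidates are grouped by simple case-insensitive string equality, and UQLM's actual selection rule for `use_best=True` may differ.

```python
from collections import Counter


def select_most_consistent(candidates):
    """Return (response, support): the most frequent candidate and its share of votes."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # Group candidates by a normalized key (assumption: trimmed, lowercased text)
    keys = [c.strip().lower() for c in candidates]
    top_key, count = Counter(keys).most_common(1)[0]
    # Return the first candidate, in its original form, that belongs to the top group
    best = candidates[keys.index(top_key)]
    return best, count / len(candidates)


# Hypothetical candidate responses to one prompt
candidates = ["1889", "1887", "1889", "1889 ", "1890"]
best, support = select_most_consistent(candidates)
print(best, support)  # 1889 0.6
```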

Available Scorers:

White-Box Scorers (Token-Probability-Based)

These scorers leverage token probabilities to estimate uncertainty. They offer single-generation scoring, which is significantly faster and cheaper than black-box methods, but require access to the LLM's internal probabilities, meaning they are not necessarily compatible with all LLMs/APIs.

<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/images/white_box_graphic_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/images/white_box_graphic.png"> <img src="assets/images/white_box_graphic.png" alt="White Box Graphic"/> </picture> </p>

Example Usage: Below is a sample of code illustrating how to use the WhiteBoxUQ class to conduct hallucination detection.

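from langchain_google_vertexai import ChatVertexAI
from uqlm import WhiteBoxUQ

# White-box scoring requires a chat model that returns token probabilities
llm = ChatVertexAI(model="gemini-2.5-pro")

# Configure the white-box scorer(s) to compute
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])

# `prompts` is assumed to be a list of prompt strings defined beforehand.
# Top-level `await` works in a notebook; in a script, call this from an async function.
results = await wbuq.generate_and_score(prompts=prompts)
results.to_df()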
<p align="center"> <img src="https://raw.githubusercontent.com/cvs-health/uqlm/main/assets/images/white_box_output2.png" /> </p>

Again, any LangChain Chat Model may be used in place of ChatVertexAI. For more detailed examples, refer to our demo notebooks on Single-Generation White-Box UQ and/or Multi-Generation White-Box UQ.
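
To make token-probability scoring concrete, the sketch below computes two simple statistics from a list of per-token log-probabilities: the minimum token probability and the length-normalized (geometric mean) probability. It assumes that the scorer name `min_probability` refers to a minimum over token probabilities; the function names and example values are illustrative and are not taken from UQLM's code.

```python
import math


def min_token_probability(logprobs):
    """Probability of the least likely token in the response."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(min(logprobs))


def length_normalized_probability(logprobs):
    """Geometric mean of the token probabilities (exp of the mean log-probability)."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(logprobs) / len(logprobs))


# Hypothetical token probabilities for a four-token response
token_probs = [0.9, 0.8, 0.2, 0.95]
logprobs = [math.log(p) for p in token_probs]

print(round(min_token_probability(logprobs), 3))          # 0.2
print(round(length_normalized_probability(logprobs), 3))  # 0.608
```

Both values lie in (0, 1], so they can be read as confidence scores: a single very unlikely token pulls the minimum down sharply, while the geometric mean reflects the response as a whole.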

Single-Generation Scorers (minimal latency, zero extra cost):

Self-Reflection Scorers (one additional generation per response):

Multi-Generation Scorers (several additional generations per response):

  • Monte carlo seq
