<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/guidellm/main/docs/assets/guidellm-logo-light.png"> <img alt="GuideLLM Logo" src="https://raw.githubusercontent.com/vllm-project/guidellm/main/docs/assets/guidellm-logo-dark.png" width=55%> </picture> </p> <h3 align="center"> SLO-aware Benchmarking and Evaluation Platform for Optimizing Real-World LLM Inference </h3>


Overview

<p> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/guidellm/main/docs/assets/guidellm-user-flows-dark.png"> <img alt="GuideLLM User Flows" src="https://raw.githubusercontent.com/vllm-project/guidellm/main/docs/assets/guidellm-user-flows-light.png"> </picture> </p>

GuideLLM is a platform for evaluating how language models perform under real workloads and configurations. It simulates end-to-end interactions with OpenAI-compatible and vLLM-native servers, generates workload patterns that reflect production usage, and produces detailed reports that help teams understand system behavior, resource needs, and operational limits. GuideLLM supports real and synthetic datasets, multimodal inputs, and flexible execution profiles, giving engineering and ML teams a consistent framework for assessing model behavior, tuning deployments, and planning capacity as their systems evolve.

Why GuideLLM?

GuideLLM gives teams a clear picture of performance, efficiency, and reliability when deploying LLMs in production-like environments.

  • Captures complete latency and token-level statistics for SLO-driven evaluation, including full distributions for TTFT, ITL, and end-to-end behavior.
  • Generates realistic, configurable traffic patterns across synchronous, concurrent, and rate-based modes, including reproducible sweeps to identify safe operating ranges.
  • Supports both real and synthetic multimodal datasets, enabling controlled experiments and production-style evaluations in one framework.
  • Produces standardized, exportable reports for dashboards, analysis, and regression tracking, ensuring consistency across teams and workflows.
  • Delivers high-throughput, extensible benchmarking with multiprocessing, threading, async execution, and a flexible CLI/API for customization or quickstarts.
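The Poisson rate-based mode mentioned above models arrivals as a memoryless process: inter-request gaps are exponentially distributed around the target rate. The sketch below illustrates that idea in plain Python; the function name and structure are illustrative, not GuideLLM's actual scheduler.

```python
import random

def poisson_arrival_times(rate_per_sec: float, duration_sec: float,
                          seed: int = 0) -> list[float]:
    """Illustrative sketch: draw request start times for a Poisson arrival
    process by accumulating exponentially distributed inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)  # mean gap = 1 / rate
        if t >= duration_sec:
            return times
        times.append(t)

# Example: roughly 10 requests/sec over a 30-second window
times = poisson_arrival_times(10.0, 30.0)
```

Because arrivals are bursty rather than evenly spaced, Poisson traffic tends to stress queueing behavior more realistically than a constant-rate schedule.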

Comparisons

Many tools benchmark endpoints, not models, and miss the details that matter for LLMs. GuideLLM focuses exclusively on LLM-specific workloads, measuring TTFT, ITL, output distributions, and dataset-driven variation. It fits into everyday engineering tasks by using standard Python interfaces and HuggingFace datasets instead of custom formats or research-only pipelines. It is also built for performance, supporting high-rate load generation and accurate scheduling far beyond simple scripts or example benchmarks. The table below highlights how this approach compares to other options.

| Tool | CLI | API | High Perf | Full Metrics | Data Modalities | Data Sources | Profiles | Backends | Endpoints | Output Types |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GuideLLM | ✅ | ✅ | ✅ | ✅ | Text, Image, Audio, Video | HuggingFace, Files, Synthetic, Custom | Synchronous, Concurrent, Throughput, Constant, Poisson, Sweep | OpenAI-compatible | /completions, /chat/completions, /audio/translation, /audio/transcription | console, json, csv, html |
| inference-perf | ✅ | ❌ | ✅ | ❌ | Text | Synthetic, Specific Datasets | Concurrent, Constant, Poisson, Sweep | OpenAI-compatible | /completions, /chat/completions | json, png |
| genai-bench | ✅ | ❌ | ❌ | ❌ | Text, Image, Embedding, ReRank | Synthetic, File | Concurrent | OpenAI-compatible, Hosted Cloud | /chat/completions, /embeddings | console, xlsx, png |
| llm-perf | ❌ | ❌ | ✅ | ❌ | Text | Synthetic | Concurrent | OpenAI-compatible, Hosted Cloud | /chat/completions | json |
| ollama-benchmark | ✅ | ❌ | ❌ | ❌ | Text | Synthetic | Synchronous | Ollama | /completions | console, json |
| vllm/benchmarks | ✅ | ❌ | ❌ | ❌ | Text | Synthetic, Specific Datasets | Synchronous, Throughput, Constant, Sweep | OpenAI-compatible, vLLM API | /completions, /chat/completions | console, png |

What's New

This section summarizes the newest capabilities available to users and outlines the current areas of development. It helps readers understand how the platform is evolving and what to expect next.

Recent Additions

  • New refactored architecture enabling high-rate load generation at scale and a more extensible interface for additional backends, data pipelines, load generation schedules, benchmarking constraints, and output formats.
  • Added multimodal benchmarking support for image, video, and audio workloads across chat completions, transcription, and translation APIs.
  • Broader metrics collection, including richer statistics for visual, audio, and text inputs such as image sizes, audio lengths, video frame counts, and word-level data.

Active Development

  • Generation of synthetic multimodal datasets for controlled experimentation across images, audio, and video.
  • Extended prefixing options for testing system-prompt and user-prompt variations.
  • Multi-turn conversation capabilities for benchmarking chat agents and dialogue systems.
  • Speculative decoding specific views and outputs.

Quick Start

The Quick Start shows how to install GuideLLM, launch a server, and run your first benchmark in a few minutes.

Install GuideLLM

Before installing, ensure you have the following prerequisites:

  • OS: Linux or macOS
  • Python: 3.10 - 3.13

Install the latest GuideLLM release from PyPI using pip:

pip install "guidellm[recommended]"

Or install from source:

pip install git+https://github.com/vllm-project/guidellm.git

Or run the latest container from ghcr.io/vllm-project/guidellm:

podman run \
  --rm -it \
  -v "./results:/results:rw" \
  -e GUIDELLM_TARGET=http://localhost:8000 \
  -e GUIDELLM_PROFILE=sweep \
  -e GUIDELLM_MAX_SECONDS=30 \
  -e GUIDELLM_DATA="prompt_tokens=256,output_tokens=128" \
  ghcr.io/vllm-project/guidellm:latest

Launch an Inference Server

Start any OpenAI-compatible endpoint. For vLLM:

vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"

Verify the server is running at http://localhost:8000.
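Beyond opening the URL in a browser, you can confirm the endpoint speaks the OpenAI-compatible API by querying /v1/models, a standard route on such servers. The snippet below is a small sketch; the sample response body is illustrative, not captured from a real server.

```python
import json
from urllib.request import urlopen

def served_model_ids(base_url: str = "http://localhost:8000") -> list[str]:
    """Query an OpenAI-compatible /v1/models endpoint and return model IDs."""
    with urlopen(f"{base_url}/v1/models") as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload["data"]]

# The same parsing applied to a sample response body (illustrative):
sample = ('{"object": "list", "data": [{"id": '
          '"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16", '
          '"object": "model"}]}')
ids = [m["id"] for m in json.loads(sample)["data"]]
```

If the returned list includes the model you launched, the server is ready to benchmark.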

Run Your First Benchmark

Run a sweep that identifies the model's maximum throughput and the range of request rates it can sustain:

guidellm benchmark \
  --target "http://localhost:8000" \
  --profile sweep \
  --max-seconds 30 \
  --data "prompt_tokens=256,output_tokens=128"

You will see progress updates and per-benchmark summaries during the run, as shown below:

<img src="https://raw.githubusercontent.com/vllm-project/guidellm/main/docs/assets/sample-benchmarks.gif"/>

Inspect Outputs

After the benchmark completes, GuideLLM saves all results into the output directory you specified (default: the current directory). You'll see a summary printed in the console along with a set of file locations (.json, .csv, .html) that contain the full results.
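The exported files can be post-processed for SLO checks. The exact output schema is not shown here, so the snippet below only sketches the percentile arithmetic on a plain list of TTFT samples; the sample values and the 300 ms threshold are hypothetical.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Hypothetical TTFT samples in milliseconds
ttft_ms = [112.0, 95.0, 130.0, 101.0, 99.0, 240.0, 105.0, 98.0, 110.0, 97.0]
p50 = percentile(ttft_ms, 50)
p95 = percentile(ttft_ms, 95)
slo_ok = p95 <= 300.0  # e.g., "95% of requests see first token within 300 ms"
```

Working from full distributions rather than averages is what makes the reports useful for SLO-driven decisions: a healthy mean can hide a tail that violates the target.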
