<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0 -->

AIPerf


AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. It provides detailed metrics using a command line display as well as extensive benchmark performance reports.

<img width="1724" height="670" alt="AIPerf UI Dashboard" src="https://github.com/user-attachments/assets/7eb40867-b1c1-4ebe-bd57-7619f2154bba" />

Quick Start

This quick start guide uses Ollama via Docker Desktop.

Setting up a Local Server

To set up an Ollama server and pull the granite4:350m model, run the following commands:

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ollama/ollama:latest
docker exec -it ollama ollama pull granite4:350m
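Before benchmarking, it is worth confirming the server is reachable and the model was pulled. The sketch below (a convenience, not part of AIPerf) queries Ollama's `/api/tags` endpoint, which lists locally available models:

```python
import json
from urllib import error, request

def model_names(tags_payload):
    # Ollama's /api/tags response has the shape {"models": [{"name": ...}, ...]}
    return [m["name"] for m in tags_payload.get("models", [])]

def check_server(url="http://localhost:11434"):
    """Return the list of pulled models, or None if Ollama is unreachable."""
    try:
        with request.urlopen(url + "/api/tags", timeout=5) as resp:
            names = model_names(json.load(resp))
            print("Ollama is up; available models:", names)
            return names
    except (error.URLError, OSError):
        print("Ollama is not reachable at", url)
        return None

if __name__ == "__main__":
    check_server()
```

If `granite4:350m` does not appear in the list, re-run the `ollama pull` command above.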

Basic Usage

Create a virtual environment and install AIPerf:

python3 -m venv venv
source venv/bin/activate
pip install aiperf

To run a simple benchmark against your Ollama server:

aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434
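With `--endpoint-type chat`, AIPerf drives an OpenAI-compatible chat completions endpoint, which Ollama exposes at `/v1/chat/completions`. As a rough illustration of the kind of request involved (not AIPerf's internal code), a single streaming request looks like:

```python
import json
from urllib import request

def build_chat_payload(model, prompt, stream=True):
    # Body of an OpenAI-compatible chat completions request;
    # stream=True corresponds to aiperf's --streaming flag.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def send_chat(base_url, payload):
    # POST to the OpenAI-compatible endpoint Ollama serves.
    req = request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req, timeout=30)

payload = build_chat_payload("granite4:350m", "Say hello in one sentence.")
```

Timing when the first and subsequent stream chunks arrive is what produces metrics like Time to First Token and Inter Token Latency.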

Example with Custom Configuration

aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434 \
  --concurrency 5 \
  --request-count 10

Example output:

NOTE: The example performance is reflective of a CPU-only run and does not represent an official benchmark.

                                               NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃                               Metric ┃       avg ┃      min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃      std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│             Time to First Token (ms) │  7,463.28 │ 7,125.81 │  9,484.24 │  9,295.48 │  7,596.62 │  7,240.23 │   677.23 │
│            Time to Second Token (ms) │     68.73 │    32.01 │    102.86 │    102.55 │     99.80 │     67.37 │    24.95 │
│      Time to First Output Token (ms) │  7,463.28 │ 7,125.81 │  9,484.24 │  9,295.48 │  7,596.62 │  7,240.23 │   677.23 │
│                 Request Latency (ms) │ 13,829.40 │ 9,029.36 │ 27,905.46 │ 27,237.77 │ 21,228.48 │ 11,338.31 │ 5,614.32 │
│             Inter Token Latency (ms) │     65.31 │    53.06 │     81.31 │     81.24 │     80.64 │     63.79 │     9.09 │
│     Output Token Throughput Per User │     15.60 │    12.30 │     18.85 │     18.77 │     18.08 │     15.68 │     2.05 │
│                    (tokens/sec/user) │           │          │           │           │           │           │          │
│      Output Sequence Length (tokens) │     95.20 │    29.00 │    295.00 │    283.12 │    176.20 │     63.00 │    77.08 │
│       Input Sequence Length (tokens) │    550.00 │   550.00 │    550.00 │    550.00 │    550.00 │    550.00 │     0.00 │
│ Output Token Throughput (tokens/sec) │      6.85 │      N/A │       N/A │       N/A │       N/A │       N/A │      N/A │
│    Request Throughput (requests/sec) │      0.07 │      N/A │       N/A │       N/A │       N/A │       N/A │      N/A │
│             Request Count (requests) │     10.00 │      N/A │       N/A │       N/A │       N/A │       N/A │      N/A │
└──────────────────────────────────────┴───────────┴──────────┴───────────┴───────────┴───────────┴───────────┴──────────┘

CLI Command: aiperf profile --model 'granite4:350m' --streaming --endpoint-type 'chat' --tokenizer 'ibm-granite/granite-4.0-micro' --url 'http://localhost:11434'
Benchmark Duration: 138.89 sec
CSV Export: /home/user/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/profile_export_aiperf.csv
JSON Export: /home/user/Code/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/profile_export_aiperf.json
Log File: /home/user/Code/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/logs/aiperf.log
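The summary metrics above are roughly self-consistent, which makes a useful sanity check on any run. Because AIPerf reports averages of per-request metrics, these identities hold only approximately; the sketch below checks them against the example numbers (a reading of the table, not an AIPerf feature):

```python
# Averages from the example run above (concurrency 1, CPU-only).
avg_ttft_ms = 7463.28      # Time to First Token
avg_itl_ms = 65.31         # Inter Token Latency
avg_osl = 95.20            # Output Sequence Length (tokens)
avg_latency_ms = 13829.40  # Request Latency
reported_user_tps = 15.60  # Output Token Throughput Per User
reported_req_tps = 0.07    # Request Throughput

def close(a, b, tol=0.10):
    return abs(a - b) / b <= tol

# Request latency ~= TTFT + (OSL - 1) * inter-token latency.
est_latency_ms = avg_ttft_ms + (avg_osl - 1) * avg_itl_ms

# Per-user token throughput ~= 1000 / inter-token latency (tokens/sec).
est_user_tps = 1000.0 / avg_itl_ms

# Little's Law at concurrency 1: request throughput ~= 1 / avg latency.
est_req_tps = 1000.0 / avg_latency_ms

assert close(est_latency_ms, avg_latency_ms)
assert close(est_user_tps, reported_user_tps)
assert close(est_req_tps, reported_req_tps)
```

All three estimates land within a few percent of the reported values, which is what you would expect when requests are fairly homogeneous.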

Features

  • Scalable multiprocess architecture with 9 services communicating via ZMQ
  • 3 UI modes: dashboard (real-time TUI), simple (progress bars), none (headless)
  • Multiple benchmarking modes: concurrency, request-rate, request-rate with max concurrency, trace replay
  • Extensible plugin system for endpoints, datasets, transports, and metrics
  • Public dataset support including ShareGPT and custom formats

Supported APIs

  • OpenAI chat completions, completions, embeddings, audio, images
  • NIM embeddings, rankings

Tutorials and Feature Guides

  • Getting Started
  • Load Control and Timing
  • Workloads and Data
  • Endpoint Types
  • Analysis and Monitoring

Documentation

| Document | Purpose |
|----------|---------|
| Architecture | Three-plane architecture, core components, credit system, data flow |
| CLI Options | Complete command and option reference |
| Metrics Reference | All metric definitions, formulas, and requirements |
| Environment Variables | All AIPERF_* config |
