Benchmarks
🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.
Check out our release blog to learn more.
🥽 A quick glance at the performance benchmarks
Take a first glance at Mistral 7B v0.1 Instruct and Llama 2 7B Chat performance metrics across different precisions and inference engines. Here is the run specification that generated these benchmark reports.
Environment:
- Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat
- CUDA Version: 12.1
- Batch size: 1
Command:
./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'
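Each throughput figure reported below is a mean ± standard deviation over the benchmark repetitions. A minimal sketch of how such numbers can be aggregated (the function name here is illustrative, not the actual benchmark code):

```python
import statistics

def summarize_throughput(token_counts, latencies_s):
    """Compute tokens/second per repetition, then report mean and stdev.

    token_counts: tokens generated in each repetition
    latencies_s:  wall-clock seconds for each repetition
    """
    rates = [tokens / secs for tokens, secs in zip(token_counts, latencies_s)]
    return statistics.mean(rates), statistics.stdev(rates)

# e.g. 3 repetitions of 512 generated tokens with slightly varying latencies
mean_tps, std_tps = summarize_throughput([512] * 3, [5.9, 6.0, 6.1])
print(f"{mean_tps:.2f} ± {std_tps:.2f} tokens/second")
```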
Mistral 7B v0.1 Instruct
Performance Metrics: Token Throughput (unit: tokens/second)
| Engine                 | float32       | float16       | int8          | int4          |
| ---------------------- | ------------- | ------------- | ------------- | ------------- |
| transformers (pytorch) | 39.61 ± 0.65  | 37.05 ± 0.49  | 5.08 ± 0.01   | 19.58 ± 0.38  |
| AutoAWQ                | -             | -             | -             | 63.12 ± 2.19  |
| AutoGPTQ               | 39.11 ± 0.42  | 42.94 ± 0.80  | -             | -             |
| DeepSpeed              | -             | 79.88 ± 0.32  | -             | -             |
| ctransformers          | -             | -             | 86.14 ± 1.40  | 87.22 ± 1.54  |
| llama.cpp              | -             | -             | 88.27 ± 0.72  | 95.33 ± 5.54  |
| ctranslate             | 43.17 ± 2.97  | 68.03 ± 0.27  | 45.14 ± 0.24  | -             |
| PyTorch Lightning      | 32.79 ± 2.74  | 43.01 ± 2.90  | 7.75 ± 0.12   | -             |
| Nvidia TensorRT-LLM    | 117.04 ± 2.16 | 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 |
| vllm                   | 84.91 ± 0.27  | 84.89 ± 0.28  | -             | 106.03 ± 0.53 |
| exllamav2              | -             | -             | 114.81 ± 1.47 | 126.29 ± 3.05 |
| onnx                   | 15.75 ± 0.15  | 22.39 ± 0.14  | -             | -             |
| Optimum Nvidia         | 50.77 ± 0.85  | 50.91 ± 0.19  | -             | -             |
Performance Metrics: GPU Memory Consumption (unit: MB)
| Engine                 | float32  | float16  | int8     | int4     |
| ---------------------- | -------- | -------- | -------- | -------- |
| transformers (pytorch) | 31071.4  | 15976.1  | 10963.91 | 5681.18  |
| AutoGPTQ               | 13400.80 | 6633.29  | -        | -        |
| AutoAWQ                | -        | -        | -        | 6572.47  |
| DeepSpeed              | -        | 80097.34 | -        | -        |
| ctransformers          | -        | -        | 10255.07 | 6966.74  |
| llama.cpp              | -        | -        | 9141.49  | 5880.41  |
| ctranslate             | 32602.32 | 17523.8  | 10074.72 | -        |
| PyTorch Lightning      | 48783.95 | 18738.05 | 10680.32 | -        |
| Nvidia TensorRT-LLM    | 79536.59 | 78341.21 | 77689.0  | 77311.51 |
| vllm                   | 73568.09 | 73790.39 | -        | 74016.88 |
| exllamav2              | -        | -        | 21483.23 | 9460.25  |
| onnx                   | 33629.93 | 19537.07 | -        | -        |
| Optimum Nvidia         | 79563.85 | 79496.74 | -        | -        |
*(Data updated: 30th April 2024)*
Llama 2 7B Chat
Performance Metrics: Token Throughput (unit: tokens/second)
| Engine                 | float32       | float16       | int8          | int4          |
| ---------------------- | ------------- | ------------- | ------------- | ------------- |
| transformers (pytorch) | 36.65 ± 0.61  | 34.20 ± 0.51  | 6.91 ± 0.14   | 17.83 ± 0.40  |
| AutoAWQ                | -             | -             | -             | 63.59 ± 1.86  |
| AutoGPTQ               | 34.36 ± 0.51  | 36.63 ± 0.61  | -             | -             |
| DeepSpeed              | -             | 84.60 ± 0.25  | -             | -             |
| ctransformers          | -             | -             | 85.50 ± 1.00  | 86.66 ± 1.06  |
| llama.cpp              | -             | -             | 89.90 ± 2.26  | 97.35 ± 4.71  |
| ctranslate             | 46.26 ± 1.59  | 79.41 ± 0.37  | 48.20 ± 0.14  | -             |
| PyTorch Lightning      | 38.01 ± 0.09  | 48.09 ± 1.12  | 10.68 ± 0.43  | -             |
| Nvidia TensorRT-LLM    | 104.07 ± 1.61 | 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 |
| vllm                   | 89.40 ± 0.22  | 89.43 ± 0.19  | -             | 115.52 ± 0.49 |
| exllamav2              | -             | -             | 125.58 ± 1.23 | 159.68 ± 1.85 |
| onnx                   | 14.28 ± 0.12  | 19.42 ± 0.08  | -             | -             |
| Optimum Nvidia         | 53.64 ± 0.78  | 53.82 ± 0.11  | -             | -             |
Performance Metrics: GPU Memory Consumption (unit: MB)
| Engine                 | float32  | float16  | int8     | int4     |
| ---------------------- | -------- | -------- | -------- | -------- |
| transformers (pytorch) | 29114.76 | 14931.72 | 8596.23  | 5643.44  |
| AutoAWQ                | -        | -        | -        | 7149.19  |
| AutoGPTQ               | 10718.54 | 5706.35  | -        | -        |
| DeepSpeed              | -        | 80105.13 | -        | -        |
| ctransformers          | -        | -        | 9774.83  | 6889.14  |
| llama.cpp              | -        | -        | 8797.55  | 5783.95  |
| ctranslate             | 29951.52 | 16282.29 | 9470.74  | -        |
| PyTorch Lightning      | 42748.35 | 14736.69 | 8028.16  | -        |
| Nvidia TensorRT-LLM    | 79421.24 | 78295.07 | 77642.86 | 77256.98 |
| vllm                   | 77928.07 | 77928.07 | -        | 77768.69 |
| exllamav2              | -        | -        | 16582.18 | 7201.62  |
| onnx                   | 33072.09 | 19180.55 | -        | -        |
| Optimum Nvidia         | 79429.63 | 79295.41 | -        | -        |
*(Data updated: 30th April 2024)*
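GPU memory figures like those above are typically sampled externally with `nvidia-smi` (which reports MiB; the tables label the unit MB). A minimal sketch, assuming the standard `--query-gpu`/`--format` flags; the polling wrapper is illustrative, not the actual benchmark code:

```python
import subprocess

# Standard nvidia-smi query: one "memory used" value per GPU, CSV, no units
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]

def parse_memory_mb(output: str) -> list[int]:
    """Parse nvidia-smi CSV output into a list of used-memory values (MB)."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]

def gpu_memory_used_mb() -> list[int]:
    """Poll current per-GPU memory usage; requires an NVIDIA driver."""
    return parse_memory_mb(subprocess.check_output(QUERY, text=True))

# Parsing a sample two-GPU reading:
print(parse_memory_mb("15976\n31071\n"))  # → [15976, 31071]
```

Sampling the peak of such readings while generation runs gives a consumption number comparable to the tables above.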
Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct, and runs only on an A100 80 GB GPU, since our primary focus is enterprise deployments. Previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the archive.md file. Note that all of these engines are actively and continuously improved, so the archived numbers may be somewhat outdated.
🛳 ML Engines
Several ML engines are available in the current market. Here is a quick glance at the engines used for this benchmark, along with a summary of their support matrix. You can find details about the nuances here.
| Engine     | Float32 | Float16 | Int8  | Int4  | CUDA  | ROCM  | Mac M1/M2 | Training |
| ---------- | :-----: | :-----: | :---: | :---: | :---: | :---: | :-------: | :------: |
| candle     | ⚠️      | ✅      | ⚠️    | ⚠️    | ✅    | ❌    | 🚧        | ❌       |
| llama.cpp  | ❌      | ❌      | ✅    | ✅    | ✅    | 🚧    | 🚧        | ❌       |
| ctranslate | ✅      | ✅      | ✅    | ❌    | ✅    | ❌    | 🚧        | ❌       |
| onnx       | ✅
