Benchmarks
🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.
Check out our release blog to learn more.
🥽 A quick glance at the performance benchmarks
Take a first glance at Mistral 7B v0.1 Instruct and Llama 2 7B Chat performance metrics across different precisions and inference engines. Here is the run specification that generated these benchmark reports.
Environment:
- Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat
- CUDA Version: 12.1
- Batch size: 1
Command:
./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'
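Each throughput figure reported below is a mean ± standard deviation over the benchmark repetitions. A minimal sketch of how such numbers can be aggregated (the function name here is illustrative, not the actual benchmark code):

```python
import statistics

def summarize_throughput(token_counts, latencies_s):
    """Compute tokens/second per repetition, then report mean and stdev.

    token_counts: tokens generated in each repetition
    latencies_s:  wall-clock seconds for each repetition
    """
    rates = [tokens / secs for tokens, secs in zip(token_counts, latencies_s)]
    return statistics.mean(rates), statistics.stdev(rates)

# e.g. 3 repetitions of 512 generated tokens with slightly varying latencies
mean_tps, std_tps = summarize_throughput([512] * 3, [5.9, 6.0, 6.1])
print(f"{mean_tps:.2f} ± {std_tps:.2f} tokens/second")
```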
Mistral 7B v0.1 Instruct
Performance Metrics: Token Throughput (unit: tokens/second)
| Engine                 | float32       | float16       | int8          | int4          |
| ---------------------- | ------------- | ------------- | ------------- | ------------- |
| transformers (pytorch) | 39.61 ± 0.65  | 37.05 ± 0.49  | 5.08 ± 0.01   | 19.58 ± 0.38  |
| AutoAWQ                | -             | -             | -             | 63.12 ± 2.19  |
| AutoGPTQ               | 39.11 ± 0.42  | 42.94 ± 0.80  | -             | -             |
| DeepSpeed              | -             | 79.88 ± 0.32  | -             | -             |
| ctransformers          | -             | -             | 86.14 ± 1.40  | 87.22 ± 1.54  |
| llama.cpp              | -             | -             | 88.27 ± 0.72  | 95.33 ± 5.54  |
| ctranslate             | 43.17 ± 2.97  | 68.03 ± 0.27  | 45.14 ± 0.24  | -             |
| PyTorch Lightning      | 32.79 ± 2.74  | 43.01 ± 2.90  | 7.75 ± 0.12   | -             |
| Nvidia TensorRT-LLM    | 117.04 ± 2.16 | 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 |
| vllm                   | 84.91 ± 0.27  | 84.89 ± 0.28  | -             | 106.03 ± 0.53 |
| exllamav2              | -             | -             | 114.81 ± 1.47 | 126.29 ± 3.05 |
| onnx                   | 15.75 ± 0.15  | 22.39 ± 0.14  | -             | -             |
| Optimum Nvidia         | 50.77 ± 0.85  | 50.91 ± 0.19  | -             | -             |
Performance Metrics: GPU Memory Consumption (unit: MB)
| Engine                 | float32  | float16  | int8     | int4     |
| ---------------------- | -------- | -------- | -------- | -------- |
| transformers (pytorch) | 31071.4  | 15976.1  | 10963.91 | 5681.18  |
| AutoGPTQ               | 13400.80 | 6633.29  | -        | -        |
| AutoAWQ                | -        | -        | -        | 6572.47  |
| DeepSpeed              | -        | 80097.34 | -        | -        |
| ctransformers          | -        | -        | 10255.07 | 6966.74  |
| llama.cpp              | -        | -        | 9141.49  | 5880.41  |
| ctranslate             | 32602.32 | 17523.8  | 10074.72 | -        |
| PyTorch Lightning      | 48783.95 | 18738.05 | 10680.32 | -        |
| Nvidia TensorRT-LLM    | 79536.59 | 78341.21 | 77689.0  | 77311.51 |
| vllm                   | 73568.09 | 73790.39 | -        | 74016.88 |
| exllamav2              | -        | -        | 21483.23 | 9460.25  |
| onnx                   | 33629.93 | 19537.07 | -        | -        |
| Optimum Nvidia         | 79563.85 | 79496.74 | -        | -        |
*(Data updated: 30th April 2024)*
Llama 2 7B Chat
Performance Metrics: Token Throughput (unit: tokens/second)
| Engine                 | float32       | float16       | int8          | int4          |
| ---------------------- | ------------- | ------------- | ------------- | ------------- |
| transformers (pytorch) | 36.65 ± 0.61  | 34.20 ± 0.51  | 6.91 ± 0.14   | 17.83 ± 0.40  |
| AutoAWQ                | -             | -             | -             | 63.59 ± 1.86  |
| AutoGPTQ               | 34.36 ± 0.51  | 36.63 ± 0.61  | -             | -             |
| DeepSpeed              | -             | 84.60 ± 0.25  | -             | -             |
| ctransformers          | -             | -             | 85.50 ± 1.00  | 86.66 ± 1.06  |
| llama.cpp              | -             | -             | 89.90 ± 2.26  | 97.35 ± 4.71  |
| ctranslate             | 46.26 ± 1.59  | 79.41 ± 0.37  | 48.20 ± 0.14  | -             |
| PyTorch Lightning      | 38.01 ± 0.09  | 48.09 ± 1.12  | 10.68 ± 0.43  | -             |
| Nvidia TensorRT-LLM    | 104.07 ± 1.61 | 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 |
| vllm                   | 89.40 ± 0.22  | 89.43 ± 0.19  | -             | 115.52 ± 0.49 |
| exllamav2              | -             | -             | 125.58 ± 1.23 | 159.68 ± 1.85 |
| onnx                   | 14.28 ± 0.12  | 19.42 ± 0.08  | -             | -             |
| Optimum Nvidia         | 53.64 ± 0.78  | 53.82 ± 0.11  | -             | -             |
Performance Metrics: GPU Memory Consumption (unit: MB)
| Engine                 | float32  | float16  | int8     | int4     |
| ---------------------- | -------- | -------- | -------- | -------- |
| transformers (pytorch) | 29114.76 | 14931.72 | 8596.23  | 5643.44  |
| AutoAWQ                | -        | -        | -        | 7149.19  |
| AutoGPTQ               | 10718.54 | 5706.35  | -        | -        |
| DeepSpeed              | -        | 80105.13 | -        | -        |
| ctransformers          | -        | -        | 9774.83  | 6889.14  |
| llama.cpp              | -        | -        | 8797.55  | 5783.95  |
| ctranslate             | 29951.52 | 16282.29 | 9470.74  | -        |
| PyTorch Lightning      | 42748.35 | 14736.69 | 8028.16  | -        |
| Nvidia TensorRT-LLM    | 79421.24 | 78295.07 | 77642.86 | 77256.98 |
| vllm                   | 77928.07 | 77928.07 | -        | 77768.69 |
| exllamav2              | -        | -        | 16582.18 | 7201.62  |
| onnx                   | 33072.09 | 19180.55 | -        | -        |
| Optimum Nvidia         | 79429.63 | 79295.41 | -        | -        |
*(Data updated: 30th April 2024)*
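GPU memory figures like those above are typically sampled externally with `nvidia-smi` (which reports MiB; the tables label the unit MB). A minimal sketch, assuming the standard `--query-gpu`/`--format` flags; the polling wrapper is illustrative, not the actual benchmark code:

```python
import subprocess

# Standard nvidia-smi query: one "memory used" value per GPU, CSV, no units
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]

def parse_memory_mb(output: str) -> list[int]:
    """Parse nvidia-smi CSV output into a list of used-memory values (MB)."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]

def gpu_memory_used_mb() -> list[int]:
    """Poll current per-GPU memory usage; requires an NVIDIA driver."""
    return parse_memory_mb(subprocess.check_output(QUERY, text=True))

# Parsing a sample two-GPU reading:
print(parse_memory_mb("15976\n31071\n"))  # → [15976, 31071]
```

Sampling the peak of such readings while generation runs gives a consumption number comparable to the tables above.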
Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct, and runs only on an A100 80 GB GPU, since our primary focus is enterprise deployments. Previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the archive.md file. Note that all of these engines are actively and continuously improved, so the archived numbers may be somewhat outdated.
🛳 ML Engines
Several ML engines are available in the current market. Here is a quick glance at the engines used for this benchmark, along with a summary of their support matrix. You can find details about the nuances here.
| Engine     | Float32 | Float16 | Int8  | Int4  | CUDA  | ROCM  | Mac M1/M2 | Training |
| ---------- | :-----: | :-----: | :---: | :---: | :---: | :---: | :-------: | :------: |
| candle     | ⚠️      | ✅      | ⚠️    | ⚠️    | ✅    | ❌    | 🚧        | ❌       |
| llama.cpp  | ❌      | ❌      | ✅    | ✅    | ✅    | 🚧    | 🚧        | ❌       |
| ctranslate | ✅      | ✅      | ✅    | ❌    | ✅    | ❌    | 🚧        | ❌       |
| onnx       | ✅
