<div align="center"> <h1 align="center">🕹️ Benchmarks</h1> <p align="center">A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models</p> </div>


<br> <div align="center">

Check out our release blog to learn more.

</div> <details> <summary>Table of Contents</summary> <ol> <li><a href="#-quick-glance">Quick glance towards performance metrics</a></li> <li><a href="#-ml-engines">ML Engines</a></li> <li><a href="#-why-benchmarks">Why Benchmarks</a></li> <li><a href="#-usage-and-workflow">Usage and workflow</a></li> <li><a href="#-contribute">Contribute</a></li> </ol> </details>

🥽 Quick glance towards performance benchmarks

Take a first glance at Mistral 7B v0.1 Instruct and Llama 2 7B Chat performance metrics across different precisions and inference engines. Here is the run specification that generated these benchmark reports.

Environment:

  • Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat
  • CUDA Version: 12.1
  • Batch size: 1

Command:

./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'
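The same invocation can be scripted over both models. A minimal dry-run sketch, assuming the `benchmark.sh` flag interface shown above (it only prints the commands, nothing is executed):

```shell
# Dry-run sketch: print one benchmark.sh invocation per model.
# Assumes the flag interface shown above; nothing is executed here.
gen_cmds() {
  for model in mistral llama; do
    printf "./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model %s --prompt 'Write an essay about the transformer model architecture'\n" "$model"
  done
}
gen_cmds
```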

Mistral 7B v0.1 Instruct

Performance Metrics: (unit: Tokens/second)

| Engine                 | float32       | float16       | int8          | int4          |
| ---------------------- | ------------- | ------------- | ------------- | ------------- |
| transformers (pytorch) | 39.61 ± 0.65  | 37.05 ± 0.49  | 5.08 ± 0.01   | 19.58 ± 0.38  |
| AutoAWQ                | -             | -             | -             | 63.12 ± 2.19  |
| AutoGPTQ               | 39.11 ± 0.42  | 42.94 ± 0.80  |               |               |
| DeepSpeed              |               | 79.88 ± 0.32  |               |               |
| ctransformers          | -             | -             | 86.14 ± 1.40  | 87.22 ± 1.54  |
| llama.cpp              | -             | -             | 88.27 ± 0.72  | 95.33 ± 5.54  |
| ctranslate             | 43.17 ± 2.97  | 68.03 ± 0.27  | 45.14 ± 0.24  | -             |
| PyTorch Lightning      | 32.79 ± 2.74  | 43.01 ± 2.90  | 7.75 ± 0.12   | -             |
| Nvidia TensorRT-LLM    | 117.04 ± 2.16 | 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 |
| vllm                   | 84.91 ± 0.27  | 84.89 ± 0.28  | -             | 106.03 ± 0.53 |
| exllamav2              | -             | -             | 114.81 ± 1.47 | 126.29 ± 3.05 |
| onnx                   | 15.75 ± 0.15  | 22.39 ± 0.14  | -             | -             |
| Optimum Nvidia         | 50.77 ± 0.85  | 50.91 ± 0.19  | -             | -             |

Performance Metrics: GPU Memory Consumption (unit: MB)

| Engine                 | float32  | float16  | int8     | int4     |
| ---------------------- | -------- | -------- | -------- | -------- |
| transformers (pytorch) | 31071.4  | 15976.1  | 10963.91 | 5681.18  |
| AutoGPTQ               | 13400.80 | 6633.29  |          |          |
| AutoAWQ                | -        | -        | -        | 6572.47  |
| DeepSpeed              |          | 80097.34 |          |          |
| ctransformers          | -        | -        | 10255.07 | 6966.74  |
| llama.cpp              | -        | -        | 9141.49  | 5880.41  |
| ctranslate             | 32602.32 | 17523.8  | 10074.72 | -        |
| PyTorch Lightning      | 48783.95 | 18738.05 | 10680.32 | -        |
| Nvidia TensorRT-LLM    | 79536.59 | 78341.21 | 77689.0  | 77311.51 |
| vllm                   | 73568.09 | 73790.39 | -        | 74016.88 |
| exllamav2              | -        | -        | 21483.23 | 9460.25  |
| onnx                   | 33629.93 | 19537.07 | -        | -        |
| Optimum Nvidia         | 79563.85 | 79496.74 | -        | -        |

*(Data updated: 30th April 2024)*
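The "mean ± std" figures in these tables are aggregates over the benchmark repetitions. A minimal sketch of that aggregation with awk, using made-up sample values (the repository's own scripts may aggregate differently):

```shell
# Compute "mean ± std" (population std) from per-repetition tokens/sec values.
# The five sample numbers below are illustrative, not real benchmark output.
printf '%s\n' 39.2 40.1 39.5 40.3 39.0 |
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; printf "%.2f ± %.2f\n", m, sqrt(ss / n - m * m) }'
# prints: 39.62 ± 0.50
```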

Llama 2 7B Chat

Performance Metrics: (unit: Tokens/second)

| Engine                 | float32       | float16       | int8          | int4          |
| ---------------------- | ------------- | ------------- | ------------- | ------------- |
| transformers (pytorch) | 36.65 ± 0.61  | 34.20 ± 0.51  | 6.91 ± 0.14   | 17.83 ± 0.40  |
| AutoAWQ                | -             | -             | -             | 63.59 ± 1.86  |
| AutoGPTQ               | 34.36 ± 0.51  | 36.63 ± 0.61  |               |               |
| DeepSpeed              |               | 84.60 ± 0.25  |               |               |
| ctransformers          | -             | -             | 85.50 ± 1.00  | 86.66 ± 1.06  |
| llama.cpp              | -             | -             | 89.90 ± 2.26  | 97.35 ± 4.71  |
| ctranslate             | 46.26 ± 1.59  | 79.41 ± 0.37  | 48.20 ± 0.14  | -             |
| PyTorch Lightning      | 38.01 ± 0.09  | 48.09 ± 1.12  | 10.68 ± 0.43  | -             |
| Nvidia TensorRT-LLM    | 104.07 ± 1.61 | 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 |
| vllm                   | 89.40 ± 0.22  | 89.43 ± 0.19  | -             | 115.52 ± 0.49 |
| exllamav2              | -             | -             | 125.58 ± 1.23 | 159.68 ± 1.85 |
| onnx                   | 14.28 ± 0.12  | 19.42 ± 0.08  | -             | -             |
| Optimum Nvidia         | 53.64 ± 0.78  | 53.82 ± 0.11  | -             | -             |

Performance Metrics: GPU Memory Consumption (unit: MB)

| Engine                 | float32  | float16  | int8     | int4     |
| ---------------------- | -------- | -------- | -------- | -------- |
| transformers (pytorch) | 29114.76 | 14931.72 | 8596.23  | 5643.44  |
| AutoAWQ                | -        | -        | -        | 7149.19  |
| AutoGPTQ               | 10718.54 | 5706.35  |          |          |
| DeepSpeed              |          | 80105.13 |          |          |
| ctransformers          | -        | -        | 9774.83  | 6889.14  |
| llama.cpp              | -        | -        | 8797.55  | 5783.95  |
| ctranslate             | 29951.52 | 16282.29 | 9470.74  | -        |
| PyTorch Lightning      | 42748.35 | 14736.69 | 8028.16  | -        |
| Nvidia TensorRT-LLM    | 79421.24 | 78295.07 | 77642.86 | 77256.98 |
| vllm                   | 77928.07 | 77928.07 | -        | 77768.69 |
| exllamav2              | -        | -        | 16582.18 | 7201.62  |
| onnx                   | 33072.09 | 19180.55 | -        | -        |
| Optimum Nvidia         | 79429.63 | 79295.41 | -        | -        |

*(Data updated: 30th April 2024)*

Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct. It benchmarks only on an A100 80 GB GPU, because our primary focus is enterprise workloads. Previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the archive.md file. Please note that all the engines are continuously maintained and improved, so those older numbers may be somewhat outdated.

🛳 ML Engines

Several ML engines are available in the current market. Here is a quick glance at the engines used in this benchmark and a summary of their support matrix. You can find details about the nuances here.

| Engine     | Float32 | Float16 | Int8  | Int4  | CUDA  | ROCM  | Mac M1/M2 | Training |
| ---------- | :-----: | :-----: | :---: | :---: | :---: | :---: | :-------: | :------: |
| candle     | ⚠️ | ✅ | ⚠️ | ⚠️ | ✅ | ❌ | 🚧 | ❌ |
| llama.cpp  | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ |
| ctranslate | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ❌ |
| onnx | ✅
