# iGniter
iGniter is an interference-aware GPU resource provisioning framework for achieving predictable performance of DNN inference in the cloud.
## Prototype of iGniter
Our iGniter framework comprises three modules: an inference performance predictor, a GPU resource allocator, and an inference workload placer. With the profiled model coefficients, the inference performance predictor first estimates the inference latency using our performance model. It then guides the GPU resource allocator and the inference workload placer to identify, for each inference workload, an appropriate GPU device from the candidate GPUs with the least performance interference and guaranteed SLOs. According to the cost-efficient GPU resource provisioning plan generated by our algorithm, the GPU device launcher finally builds a GPU cluster and launches a Triton inference serving process for each DNN inference workload on the provisioned GPU devices.
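The placement logic described above can be sketched roughly as follows. Note that `Workload`, `GPU`, `predict_latency`, and the toy interference penalty are all illustrative stand-ins, not the actual iGniter API:

```python
from dataclasses import dataclass, field

@dataclass
class GPU:
    name: str
    workloads: list = field(default_factory=list)  # co-located inference workloads

@dataclass
class Workload:
    model: str
    slo_ms: float  # latency SLO for this inference workload

def predict_latency(workload, gpu):
    # Placeholder for the inference performance predictor: in iGniter this
    # uses the profiled model coefficients and the interference-aware model.
    base = {"alexnet": 5.0, "resnet50": 20.0}.get(workload.model, 10.0)
    return base * (1 + 0.1 * len(gpu.workloads))  # toy interference penalty

def place(workloads, gpus):
    """Greedy placement: for each workload, pick the GPU with the lowest
    predicted latency among those that still satisfy its SLO."""
    plan = {}
    for w in workloads:
        candidates = [(predict_latency(w, g), g) for g in gpus]
        feasible = [(t, g) for t, g in candidates if t <= w.slo_ms]
        if not feasible:
            raise RuntimeError(f"no GPU can guarantee the SLO of {w.model}")
        _, best = min(feasible, key=lambda x: x[0])
        best.workloads.append(w)
        plan[w.model] = best.name
    return plan
```

The real placer additionally accounts for the cost of provisioned GPUs; this sketch only captures the "least interference under guaranteed SLOs" selection rule.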

## Model the Inference Performance
The execution of DNN inference on the GPU can be divided into three sequential steps: data loading, GPU execution, and result feedback. Accordingly, the DNN inference latency can be calculated by summing up the data loading latency, the GPU execution latency, and the result feedback latency, which is formulated as
$t_{inf}^{ij}=t_{load}^{i}+t_{gpu}^{ij}+t_{feedback}^{i}$
To improve GPU resource utilization, mainstream DNN inference servers (e.g., Triton) overlap the data loading phase with the GPU execution and result feedback phases. Accordingly, we estimate the DNN inference throughput as
$h^{ij}=b^{i} / (t_{gpu}^{ij}+t_{feedback}^{i})$
We calculate the data loading latency and the result feedback latency as
$t_{load}^{i}=(d_{load}^{i} \cdot b^{i}) / B_{pcie} \quad$ and $\quad t_{feedback}^{i}=(d_{feedback}^{i} \cdot b^{i}) / B_{pcie}$
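As a sanity check, the latency and throughput formulas above can be written out directly. The numbers below are illustrative, not profiled coefficients:

```python
B_PCIE = 15.75e9  # illustrative PCIe bandwidth in bytes/s (PCIe 3.0 x16)

def load_latency(d_load, b, bandwidth=B_PCIE):
    """t_load^i = (d_load^i * b^i) / B_pcie."""
    return d_load * b / bandwidth

def feedback_latency(d_feedback, b, bandwidth=B_PCIE):
    """t_feedback^i = (d_feedback^i * b^i) / B_pcie."""
    return d_feedback * b / bandwidth

def inference_latency(t_load, t_gpu, t_feedback):
    """t_inf^ij = t_load^i + t_gpu^ij + t_feedback^i."""
    return t_load + t_gpu + t_feedback

def throughput(b, t_gpu, t_feedback):
    """h^ij = b^i / (t_gpu^ij + t_feedback^i): data loading is excluded
    because it overlaps with GPU execution and result feedback."""
    return b / (t_gpu + t_feedback)

# Example: a batch of 4 images of 3*224*224 float32 values each.
d_load = 3 * 224 * 224 * 4        # bytes per request
t_load = load_latency(d_load, 4)  # a fraction of a millisecond over PCIe
```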
The GPU execution phase consists of the GPU scheduling delay and the kernels running on the allocated SMs. Furthermore, co-locating inference workloads can reduce the GPU frequency, which inevitably prolongs the GPU execution phase. Accordingly, we formulate the GPU execution latency as
$t_{gpu}^{ij}=(t_{sch}^{ij}+t_{act}^{ij}) / (f^{j} / F)$
The GPU scheduling delay is roughly linear in the number of kernels of a DNN inference workload, and the performance interference on the GPU resource scheduler adds an extra per-kernel delay. It can thus be estimated as
$t_{sch}^{ij}=\left(k_{sch}^{i}+\Delta_{sch}^{j}\right) \cdot n_{k}^{i}$
Given the fixed supply of L2 cache space on a GPU device, a higher GPU L2 cache utilization (i.e., demand) indicates more severe contention on the GPU L2 cache space, thereby resulting in a longer GPU active time. Accordingly, we estimate the GPU active time as
$t_{act}^{ij}=k_{act}^{i} \cdot\left(1+\alpha_{cache}^{i} \cdot \sum_{i' \in \mathcal{I} \backslash \{i\}}\left(c^{i'} \cdot v^{i'j}\right)\right)$
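Putting the three equations together, the GPU execution latency model can be sketched as below; all coefficient values are made up for illustration, not profiled ones:

```python
def scheduling_delay(k_sch, delta_sch, n_kernels):
    """t_sch^ij = (k_sch^i + delta_sch^j) * n_k^i: roughly linear in the
    kernel count, plus an interference term on the GPU scheduler."""
    return (k_sch + delta_sch) * n_kernels

def active_time(k_act, alpha_cache, colocated_cache_demands):
    """t_act^ij = k_act^i * (1 + alpha_cache^i * sum of the L2 cache
    demands of the workloads co-located on GPU j)."""
    return k_act * (1 + alpha_cache * sum(colocated_cache_demands))

def gpu_execution_latency(t_sch, t_act, f, F):
    """t_gpu^ij = (t_sch^ij + t_act^ij) / (f^j / F): a frequency f^j below
    the full frequency F stretches the whole GPU execution phase."""
    return (t_sch + t_act) / (f / F)

# Made-up example: 100 kernels at 10 us each plus 2 us of interference,
# 2 ms of solo active time, and two co-located workloads contending
# for the L2 cache (all times in milliseconds).
t_sch = scheduling_delay(0.010, 0.002, 100)  # 1.2 ms
t_act = active_time(2.0, 0.5, [0.2, 0.3])    # 2.5 ms
t_gpu = gpu_execution_latency(t_sch, t_act, 1200, 1530)
```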
## Dependencies and Requirements
- Description of required hardware resources:
- We set up a GPU cluster of 10 p3.2xlarge EC2 instances, each equipped with 1 NVIDIA V100 GPU card, 8 vCPUs, and 61 GB memory.
- Description of the required operating system:
- Ubuntu 18.04
- Required software libraries: Triton, NVIDIA Driver, cuDNN, CUDA, Python 3, TensorRT, Docker, NVIDIA Container Toolkit, torchvision, torch, pandas, scikit-image, NumPy, SciPy, Pillow.
- Input dataset required for executing code or generating input data: ImageNet dataset and VOC2012 dataset.
## Installation and Deployment Process
- Installation of Libraries and Software:
  - Install Python3 pip.
    ```shell
    sudo apt-get update
    sudo apt install python3-pip
    python3 -m pip install --upgrade pip
    ```
  - Install the NVIDIA driver.
    ```shell
    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt-get update
    sudo apt install nvidia-driver-470  # install version 470 of the driver
    sudo reboot
    ```
  - Install CUDA (note: do not install the driver bundled with the CUDA installer).
    ```shell
    wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
    sudo sh cuda_11.3.0_465.19.01_linux.run
    ```
  - Install TensorRT.
    ```shell
    version="8.0.1.6"
    tar xzvf tensorrt-8.0.1.6.linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz
    pos=$PWD
    cd ${pos}/TensorRT-${version}/python
    sudo pip3 install tensorrt-*-cp36-none-linux_x86_64.whl  # cp36 corresponds to Python 3.6
    cd ${pos}/TensorRT-${version}/uff
    sudo pip3 install uff-0.6.5-py2.py3-none-any.whl  # refer to the actual file name in this directory
    cd ${pos}/TensorRT-${version}/graphsurgeon
    sudo pip3 install graphsurgeon-0.4.1-py2.py3-none-any.whl  # refer to the actual file name in this directory
    ```
  - Install cuDNN.
    ```shell
    tar -xzvf cudnn-11.3-linux-x64-v8.2.0.53.tgz  # refer to the actual file name in this directory
    cudnnversion="11.3"
    sudo cp cuda/include/cudnn.h /usr/local/cuda-${cudnnversion}/include
    sudo cp cuda/lib64/libcudnn* /usr/local/cuda-${cudnnversion}/lib64
    sudo chmod a+r /usr/local/cuda-${cudnnversion}/include/cudnn.h
    sudo chmod a+r /usr/local/cuda-${cudnnversion}/lib64/libcudnn*
    ```
  - Configure `~/.bashrc`.
    ```shell
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.3/lib64:/root/TensorRT-8.0.1.6/lib
    export PATH=$PATH:/usr/local/cuda-11.3/bin:/root/TensorRT-8.0.1.6/bin
    export CUDA_HOME=$CUDA_HOME:/usr/local/cuda-11.3
    ```
  - Test whether TensorRT is installed successfully by running `trtexec`.
    ```shell
    trtexec
    ```
  - Install Docker.
    ```shell
    sudo apt install docker.io
    ```
  - Install the NVIDIA Container Toolkit.
    ```shell
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
      && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
      && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    sudo apt-get update
    sudo apt-get install -y nvidia-docker2
    sudo systemctl restart docker
    sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
    ```
- Deployment of Code:
  - Pull the iGniter code.
    ```shell
    git clone https://github.com/Echozqn/igniter.git
    cd igniter/i-Gniter
    pip install -r requirements.txt
    ```
  - Pull the Triton images.
    ```shell
    docker pull nvcr.io/nvidia/tritonserver:21.07-py3
    docker pull nvcr.io/nvidia/tritonserver:21.07-py3-sdk
    ```
  - Convert the models.
    - If you are using a V100, we have prepared the model files required for your experiments. Download them as follows.
      ```shell
      cd i-Gniter/Launch/model/
      ./fetch_models.sh
      ```
    - Otherwise, follow the steps below to generate the model files needed for the experiments.
      - Use the `trtexec` tool to convert the ONNX models into TensorRT engine models.
        ```shell
        cd i-Gniter/Launch/model/
        ./fetch_models.sh
        cd i-Gniter/Profile
        python3 model_onnx.py  # generate the ONNX models
        sh onnxTOtrt.sh        # convert the models to TensorRT engines
        ```
      - Replace `model.plan` in the `$path/igniter/i-Gniter/Launch/model/model` directory with the generated model. The directory structure is shown below.
        ```
        .
        └── igniter
            └── i-Gniter
                └── Launch
                    └── model
                        └── model
                            ├── alexnet_dynamic
                            │   ├── 1
                            │   │   └── model.plan
                            │   └── config.pbtxt
                            ├── resnet50_dynamic
                            │   ├── 1
                            │   │   └── model.plan
                            │   └── config.pbtxt
                            ├── ssd_dynamic
                            │   ├── 1
                            │   │   └── model.plan
                            │   └── config.pbtxt
                            └── vgg19_dynamic
                                ├── 1
                                │   └── model.plan
                                └── config.pbtxt
        ```
      - Modify the input and output names in the configuration files (`config.pbtxt`) of the four models: rename the inputs to `actual_input_1` and the outputs to `output1`. Here is the `config.pbtxt` file for the alexnet model.
        ```
        name: "alexnet_dynamic"
        platform: "tensorrt_plan"
        max_batch_size: 5
        dynamic_batching {
          preferred_batch_size: [5]
          max_queue_delay_microseconds: 100000
        }
        input [
          {
            name: "actual_input_alexnet"
            data_type: TYPE_FP32
            dims: [3, 224, 224]
          }
        ]
        output [
          {
            name: "output_alexnet"
            data_type: TYPE_FP32
            dims: [1000]
          }
        ]
        ```
## Getting Started
### Profiler
The profiler has only been tested on the T4 and V100 so far. To use it on other GPUs, you may need to re-profile hardware parameters such as `activetime_2`, `activetime_1`, and `idletime_1`. If your GPU is a V100, you can skip this part: we already provide a config file profiled on the V100.
Initializing:
```shell
source start.sh
```
Profiling hardware parameters:
```shell
cd tools
python3 computeBandwidth.py
./power_t_freq 1530  # 1530 MHz is the highest frequency of the V100
```
