# iGniter
iGniter is an interference-aware GPU resource provisioning framework for achieving predictable performance of DNN inference in the cloud.
## Prototype of iGniter
Our iGniter framework comprises three modules: an inference performance predictor, a GPU resource allocator, and an inference workload placer. With the profiled model coefficients, the inference performance predictor first estimates the inference latency using our performance model. It then guides the GPU resource allocator and the inference workload placer to identify, for each inference workload, an appropriate GPU device from the candidate GPUs with the least performance interference and guaranteed SLOs. According to the cost-efficient GPU resource provisioning plan generated by our algorithm, the GPU device launcher finally builds a GPU cluster and launches a Triton inference serving process for each DNN inference workload on the provisioned GPU devices.
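The placement logic described above can be sketched roughly as follows. Note that `Workload`, `GPU`, `predict_latency`, and the toy interference penalty are all illustrative stand-ins, not the actual iGniter API:

```python
from dataclasses import dataclass, field

@dataclass
class GPU:
    name: str
    workloads: list = field(default_factory=list)  # co-located inference workloads

@dataclass
class Workload:
    model: str
    slo_ms: float  # latency SLO for this inference workload

def predict_latency(workload, gpu):
    # Placeholder for the inference performance predictor: in iGniter this
    # uses the profiled model coefficients and the interference-aware model.
    base = {"alexnet": 5.0, "resnet50": 20.0}.get(workload.model, 10.0)
    return base * (1 + 0.1 * len(gpu.workloads))  # toy interference penalty

def place(workloads, gpus):
    """Greedy placement: for each workload, pick the GPU with the lowest
    predicted latency among those that still satisfy its SLO."""
    plan = {}
    for w in workloads:
        candidates = [(predict_latency(w, g), g) for g in gpus]
        feasible = [(t, g) for t, g in candidates if t <= w.slo_ms]
        if not feasible:
            raise RuntimeError(f"no GPU can guarantee the SLO of {w.model}")
        _, best = min(feasible, key=lambda x: x[0])
        best.workloads.append(w)
        plan[w.model] = best.name
    return plan
```

The real placer additionally accounts for the cost of provisioned GPUs; this sketch only captures the "least interference under guaranteed SLOs" selection rule.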

## Model the Inference Performance
The execution of DNN inference on the GPU can be divided into three sequential steps: data loading, GPU execution, and result feedback. Accordingly, the DNN inference latency can be calculated by summing up the data loading latency, the GPU execution latency, and the result feedback latency, which is formulated as
$t_{inf}^{ij}=t_{load}^{i}+t_{gpu}^{ij}+t_{feedback}^{i}$
To improve GPU resource utilization, mainstream DNN inference servers (e.g., Triton) overlap the data loading phase with the GPU execution and result feedback phases. Accordingly, we estimate the DNN inference throughput as
$h^{ij}=b^{i} / (t_{gpu}^{ij}+t_{feedback}^{i})$
We calculate the data loading latency and the result feedback latency as
$t_{load}^{i}=(d_{load}^{i} \cdot b^{i}) / B_{pcie} \quad$ and $\quad t_{feedback}^{i}=(d_{feedback}^{i} \cdot b^{i}) / B_{pcie}$
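As a sanity check, the latency and throughput formulas above can be written out directly. The numbers below are illustrative, not profiled coefficients:

```python
B_PCIE = 15.75e9  # illustrative PCIe bandwidth in bytes/s (PCIe 3.0 x16)

def load_latency(d_load, b, bandwidth=B_PCIE):
    """t_load^i = (d_load^i * b^i) / B_pcie."""
    return d_load * b / bandwidth

def feedback_latency(d_feedback, b, bandwidth=B_PCIE):
    """t_feedback^i = (d_feedback^i * b^i) / B_pcie."""
    return d_feedback * b / bandwidth

def inference_latency(t_load, t_gpu, t_feedback):
    """t_inf^ij = t_load^i + t_gpu^ij + t_feedback^i."""
    return t_load + t_gpu + t_feedback

def throughput(b, t_gpu, t_feedback):
    """h^ij = b^i / (t_gpu^ij + t_feedback^i): data loading is excluded
    because it overlaps with GPU execution and result feedback."""
    return b / (t_gpu + t_feedback)

# Example: a batch of 4 images of 3*224*224 float32 values each.
d_load = 3 * 224 * 224 * 4        # bytes per request
t_load = load_latency(d_load, 4)  # a fraction of a millisecond over PCIe
```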
The GPU execution phase consists of the GPU scheduling delay and the kernels running on the allocated SMs. Furthermore, co-locating inference workloads can reduce the GPU frequency, which inevitably prolongs the GPU execution phase. Accordingly, we formulate the GPU execution latency as
$t_{gpu}^{ij}=(t_{sch}^{ij}+t_{act}^{ij}) / (f^{j} / F)$
The GPU scheduling delay is roughly linear in the number of kernels of a DNN inference workload, and the performance interference on the GPU resource scheduler adds an extra per-kernel delay. It can thus be estimated as
$t_{sch}^{ij}=\left(k_{sch}^{i}+\Delta_{sch}^{j}\right) \cdot n_{k}^{i}$
Given the fixed supply of L2 cache space on a GPU device, a higher GPU L2 cache utilization (i.e., demand) indicates more severe contention on the GPU L2 cache space, thereby resulting in a longer GPU active time. Accordingly, we estimate the GPU active time as
$t_{act}^{ij}=k_{act}^{i} \cdot\left(1+\alpha_{cache}^{i} \cdot \sum_{i' \in \mathcal{I} \backslash \{i\}}\left(c^{i'} \cdot v^{i'j}\right)\right)$
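Putting the three equations together, the GPU execution latency model can be sketched as below; all coefficient values are made up for illustration, not profiled ones:

```python
def scheduling_delay(k_sch, delta_sch, n_kernels):
    """t_sch^ij = (k_sch^i + delta_sch^j) * n_k^i: roughly linear in the
    kernel count, plus an interference term on the GPU scheduler."""
    return (k_sch + delta_sch) * n_kernels

def active_time(k_act, alpha_cache, colocated_cache_demands):
    """t_act^ij = k_act^i * (1 + alpha_cache^i * sum of the L2 cache
    demands of the workloads co-located on GPU j)."""
    return k_act * (1 + alpha_cache * sum(colocated_cache_demands))

def gpu_execution_latency(t_sch, t_act, f, F):
    """t_gpu^ij = (t_sch^ij + t_act^ij) / (f^j / F): a frequency f^j below
    the full frequency F stretches the whole GPU execution phase."""
    return (t_sch + t_act) / (f / F)

# Made-up example: 100 kernels at 10 us each plus 2 us of interference,
# 2 ms of solo active time, and two co-located workloads contending
# for the L2 cache (all times in milliseconds).
t_sch = scheduling_delay(0.010, 0.002, 100)  # 1.2 ms
t_act = active_time(2.0, 0.5, [0.2, 0.3])    # 2.5 ms
t_gpu = gpu_execution_latency(t_sch, t_act, 1200, 1530)
```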
## Dependencies and Requirements
- Description of required hardware resources:
- We set up a GPU cluster of 10 p3.2xlarge EC2 instances, each equipped with 1 NVIDIA V100 GPU card, 8 vCPUs, and 61 GB memory.
- Description of the required operating system:
- Ubuntu 18.04
- Required software libraries: Triton, NVIDIA Driver, cuDNN, CUDA, Python 3, TensorRT, Docker, NVIDIA Container Toolkit, torchvision, torch, pandas, scikit-image, NumPy, SciPy, Pillow.
- Input dataset required for executing code or generating input data: ImageNet dataset and VOC2012 dataset.
## Installation and Deployment Process
- Installation of Libraries and Software:
  - Install Python3 pip.
    ```shell
    sudo apt-get update
    sudo apt install python3-pip
    python3 -m pip install --upgrade pip
    ```
  - Install the NVIDIA driver.
    ```shell
    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt-get update
    sudo apt install nvidia-driver-470  # install version 470 of the driver
    sudo reboot
    ```
  - Install CUDA (note: do not install the driver bundled with the CUDA installer).
    ```shell
    wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
    sudo sh cuda_11.3.0_465.19.01_linux.run
    ```
  - Install TensorRT.
    ```shell
    version="8.0.1.6"
    tar xzvf tensorrt-8.0.1.6.linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz
    pos=$PWD
    cd ${pos}/TensorRT-${version}/python
    sudo pip3 install tensorrt-*-cp36-none-linux_x86_64.whl  # cp36 corresponds to Python 3.6
    cd ${pos}/TensorRT-${version}/uff
    sudo pip3 install uff-0.6.5-py2.py3-none-any.whl  # refer to the actual file name in this directory
    cd ${pos}/TensorRT-${version}/graphsurgeon
    sudo pip3 install graphsurgeon-0.4.1-py2.py3-none-any.whl  # refer to the actual file name in this directory
    ```
  - Install cuDNN.
    ```shell
    tar -xzvf cudnn-11.3-linux-x64-v8.2.0.53.tgz  # refer to the actual file name in this directory
    cudnnversion="11.3"
    sudo cp cuda/include/cudnn.h /usr/local/cuda-${cudnnversion}/include
    sudo cp cuda/lib64/libcudnn* /usr/local/cuda-${cudnnversion}/lib64
    sudo chmod a+r /usr/local/cuda-${cudnnversion}/include/cudnn.h
    sudo chmod a+r /usr/local/cuda-${cudnnversion}/lib64/libcudnn*
    ```
  - Configure `~/.bashrc`.
    ```shell
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.3/lib64:/root/TensorRT-8.0.1.6/lib
    export PATH=$PATH:/usr/local/cuda-11.3/bin:/root/TensorRT-8.0.1.6/bin
    export CUDA_HOME=$CUDA_HOME:/usr/local/cuda-11.3
    ```
  - Test whether TensorRT is installed successfully by running `trtexec`.
    ```shell
    trtexec
    ```
  - Install Docker.
    ```shell
    sudo apt install docker.io
    ```
  - Install the NVIDIA Container Toolkit.
    ```shell
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
      && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
      && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    sudo apt-get update
    sudo apt-get install -y nvidia-docker2
    sudo systemctl restart docker
    sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
    ```
- Deployment of Code:
  - Pull the iGniter code.
    ```shell
    git clone https://github.com/Echozqn/igniter.git
    cd igniter/i-Gniter
    pip install -r requirements.txt
    ```
  - Pull the Triton images.
    ```shell
    docker pull nvcr.io/nvidia/tritonserver:21.07-py3
    docker pull nvcr.io/nvidia/tritonserver:21.07-py3-sdk
    ```
  - Convert the models.
    - If you are using a V100, we have prepared the model files required for your experiments. Download them as follows.
      ```shell
      cd i-Gniter/Launch/model/
      ./fetch_models.sh
      ```
    - Otherwise, follow the steps below to generate the model files needed for the experiments.
      - Use the `trtexec` tool to convert the ONNX models into TensorRT engine models.
        ```shell
        cd i-Gniter/Launch/model/
        ./fetch_models.sh
        cd i-Gniter/Profile
        python3 model_onnx.py  # generate the ONNX models
        sh onnxTOtrt.sh        # convert the models to TensorRT engines
        ```
      - Replace `model.plan` in the `$path/igniter/i-Gniter/Launch/model/model` directory with the generated model. The directory structure is shown below.
        ```
        .
        └── igniter
            └── i-Gniter
                └── Launch
                    └── model
                        └── model
                            ├── alexnet_dynamic
                            │   ├── 1
                            │   │   └── model.plan
                            │   └── config.pbtxt
                            ├── resnet50_dynamic
                            │   ├── 1
                            │   │   └── model.plan
                            │   └── config.pbtxt
                            ├── ssd_dynamic
                            │   ├── 1
                            │   │   └── model.plan
                            │   └── config.pbtxt
                            └── vgg19_dynamic
                                ├── 1
                                │   └── model.plan
                                └── config.pbtxt
        ```
      - Modify the input and output names in the configuration files (`config.pbtxt`) of the four models: rename the inputs to `actual_input_1` and the outputs to `output1`. Here is the `config.pbtxt` file for the alexnet model.
        ```
        name: "alexnet_dynamic"
        platform: "tensorrt_plan"
        max_batch_size: 5
        dynamic_batching {
          preferred_batch_size: [5]
          max_queue_delay_microseconds: 100000
        }
        input [
          {
            name: "actual_input_alexnet"
            data_type: TYPE_FP32
            dims: [3, 224, 224]
          }
        ]
        output [
          {
            name: "output_alexnet"
            data_type: TYPE_FP32
            dims: [1000]
          }
        ]
        ```
## Getting Started
### Profiler
The profiler has only been tested on the T4 and V100 so far. To use it on other GPUs, you may need to re-profile hardware parameters such as `activetime_2`, `activetime_1`, and `idletime_1`. If your GPU is a V100, you can skip this part: we already provide a config file profiled on the V100.
Initializing:
```shell
source start.sh
```
Profiling hardware parameters:
```shell
cd tools
python3 computeBandwidth.py
./power_t_freq 1530  # 1530 MHz is the highest frequency of the V100
```
