iGniter

iGniter, an interference-aware GPU resource provisioning framework for achieving predictable performance of DNN inference in the cloud.

Prototype of iGniter

Our iGniter framework comprises three modules: an inference performance predictor, a GPU resource allocator, and an inference workload placer. With the profiled model coefficients, the inference performance predictor first estimates the inference latency using our performance model. It then guides the GPU resource allocator and the inference workload placer to identify, for each inference workload, an appropriate GPU device among the candidates with the least performance interference and guaranteed SLOs. According to the cost-efficient GPU resource provisioning plan generated by our algorithm, the GPU device launcher finally builds a GPU cluster and launches a Triton inference serving process for each DNN inference workload on the provisioned GPU devices.
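
The placement loop described above can be sketched as follows. This is a hypothetical illustration, not the actual iGniter implementation: `place_workloads` and `predict_latency` are stand-in names, and the number of already co-located workloads is used as a simple interference proxy.

```python
# Hypothetical sketch of the provisioning loop: for every workload, the
# performance predictor scores each candidate GPU, and the placer picks
# the GPU that satisfies the latency SLO with the least interference.
# Here the count of already co-located workloads is a stand-in
# interference proxy, not the actual iGniter interference metric.

def place_workloads(workloads, gpus, predict_latency):
    """Return a {workload name: GPU id} provisioning plan."""
    plan = {}
    for w in workloads:
        candidates = []
        for g in gpus:
            latency = predict_latency(w, g)   # inference performance predictor
            if latency <= w["slo"]:           # keep only SLO-guaranteed GPUs
                candidates.append((len(g["placed"]), latency, g))
        if not candidates:
            raise RuntimeError("no candidate GPU can guarantee the SLO for " + w["name"])
        # least interference first, then lowest predicted latency
        _, _, best = min(candidates, key=lambda c: (c[0], c[1]))
        best["placed"].append(w["name"])
        plan[w["name"]] = best["id"]
    return plan
```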

Model the Inference Performance

The execution of DNN inference on the GPU can be divided into three sequential steps: data loading, GPU execution, and result feedback. Accordingly, the DNN inference latency can be calculated by summing up the data loading latency, the GPU execution latency, and the result feedback latency, which is formulated as

$t_{inf}^{ij} = t_{load}^{i} + t_{gpu}^{ij} + t_{feedback}^{i}$

To improve GPU resource utilization, the data loading phase overlaps with the GPU execution and result feedback phases in mainstream DNN inference servers (e.g., Triton). Accordingly, we estimate the DNN inference throughput as

$h^{ij} = b^{i} / (t_{gpu}^{ij} + t_{feedback}^{i})$

We calculate the data loading latency and the result feedback latency as

$t_{load}^{i}=(d_{load}^{i} \cdot b^{i}) / B_{pcie} \quad$ and $\quad t_{feedback}^{i}=(d_{feedback}^{i} \cdot b^{i}) / B_{pcie}$
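
As a quick worked example with hypothetical (not profiled) values: for a batch of $b^{i} = 8$ requests, each carrying $d_{load}^{i} = 0.6$ MB of input data over a PCIe link with an effective bandwidth of $B_{pcie} = 12$ GB/s,

$t_{load}^{i} = (0.6\,\text{MB} \cdot 8) / (12\,\text{GB/s}) = 0.4\,\text{ms}$

so for typical vision models the transfer latencies are small but not negligible relative to millisecond-scale GPU execution.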

The GPU execution phase consists of the GPU scheduling delay and the kernels running on the allocated SMs. Furthermore, performance interference can be caused by a reduction of the GPU frequency due to inference workload co-location, which inevitably prolongs the GPU execution phase. Accordingly, we formulate the GPU execution latency as

$t_{gpu}^{ij} = (t_{sch}^{ij} + t_{act}^{ij}) / (f^{j} / F)$

The GPU scheduling delay is roughly linear in the number of kernels of a DNN inference workload, with an additional per-kernel delay caused by performance interference on the GPU resource scheduler. We estimate it as

$t_{sch}^{ij} = \left(k_{sch}^{i} + \Delta_{sch}^{j}\right) \cdot n_{k}^{i}$

Given a fixed supply of L2 cache space on a GPU device, a higher aggregate GPU L2 cache utilization (i.e., demand) indicates more severe contention on the GPU L2 cache space, thereby resulting in a longer GPU active time. Accordingly, we estimate the GPU active time as

$t_{act}^{ij} = k_{act}^{i} \cdot \left(1 + \alpha_{cache}^{i} \cdot \sum_{i' \in \mathcal{I} \setminus \{i\}} c^{i'} \cdot v^{i' j}\right)$
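
Putting the equations above together, the whole latency/throughput model can be sketched in a few lines of Python. Every numeric coefficient used in the usage example (`k_sch`, `k_act`, `alpha_cache`, etc.) is a made-up placeholder, not a value profiled in the paper.

```python
# Illustrative implementation of the performance model above.
# Coefficients passed in must come from offline profiling; the numbers
# used in any example call are placeholders, not real measurements.

def predict_inference(b, d_load, d_feedback, bw_pcie,
                      k_sch, delta_sch, n_kernels,
                      k_act, alpha_cache, colocated_cache_demand,
                      f, f_max):
    """Return (t_inf, throughput) for workload i on GPU j."""
    t_load = d_load * b / bw_pcie            # t_load = d_load * b / B_pcie
    t_feedback = d_feedback * b / bw_pcie    # t_feedback = d_feedback * b / B_pcie
    t_sch = (k_sch + delta_sch) * n_kernels  # linear in the kernel count
    # L2 cache contention from co-located workloads inflates the active time
    t_act = k_act * (1 + alpha_cache * sum(colocated_cache_demand))
    t_gpu = (t_sch + t_act) / (f / f_max)    # frequency drop stretches the GPU phase
    t_inf = t_load + t_gpu + t_feedback
    # data loading overlaps with execution and feedback, so it is excluded
    throughput = b / (t_gpu + t_feedback)
    return t_inf, throughput
```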

Dependencies and Requirements

  • Description of required hardware resources:
    • We set up a GPU cluster of 10 p3.2xlarge EC2 instances, each equipped with 1 NVIDIA V100 GPU card, 8 vCPUs, and 61 GB memory.
  • Description of the required operating system:
    • Ubuntu 18.04
  • Required software libraries: Triton, NVIDIA Driver, cuDNN, CUDA, Python3, TensorRT, Docker, NVIDIA Container Toolkit, Torchvision, Torch, Pandas, Scikit-image, Numpy, Scipy, Pillow.
  • Input dataset required for executing code or generating input data: ImageNet dataset and VOC2012 dataset.

Installation and Deployment Process

  • Installation of Libraries and Software:

    • Install Python3 pip.

      apt-get update
      sudo apt install python3-pip
      python3 -m pip install --upgrade pip
      
    • Install NVIDIA Driver.

      sudo add-apt-repository ppa:graphics-drivers/ppa 
      sudo apt-get update 
      sudo apt install nvidia-driver-470 # Install version 470 driver 
      sudo reboot 
      
    • Install CUDA [Note: Do Not Install Driver].

      wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
      sudo sh cuda_11.3.0_465.19.01_linux.run
      
    • Install TensorRT.

      version="8.0.1.6"
      tar xzvf tensorrt-8.0.1.6.linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz
      pos=$PWD
      cd ${pos}/TensorRT-${version}/python
      sudo pip3 install tensorrt-*-cp36-none-linux_x86_64.whl # Because I downloaded python version 3.6, it is cp36
      cd ${pos}/TensorRT-${version}/uff
      sudo pip3 install uff-0.6.5-py2.py3-none-any.whl # Refer to the specific name of the file in this directory
      cd ${pos}/TensorRT-${version}/graphsurgeon
      sudo pip3 install graphsurgeon-0.4.1-py2.py3-none-any.whl # Refer to the specific name of the file in this directory
      
    • Install cuDNN.

      tar -xzvf cudnn-11.3-linux-x64-v8.2.0.53.tgz # Refer to the specific name of the file in this directory
      cudnnversion="11.3"
      sudo cp cuda/include/cudnn.h /usr/local/cuda-${cudnnversion}/include
      sudo cp cuda/lib64/libcudnn* /usr/local/cuda-${cudnnversion}/lib64
      sudo chmod a+r /usr/local/cuda-${cudnnversion}/include/cudnn.h 
      sudo chmod a+r /usr/local/cuda-${cudnnversion}/lib64/libcudnn*
      
    • Configure ~/.bashrc .

      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.3/lib64:/root/TensorRT-8.0.1.6/lib
      export PATH=$PATH:/usr/local/cuda-11.3/bin:/root/TensorRT-8.0.1.6/bin
      export CUDA_HOME=$CUDA_HOME:/usr/local/cuda-11.3
      
    • Test whether trtexec installed successfully.

      trtexec
      
    • Install docker.

      apt install docker.io
      
    • Install NVIDIA Container Toolkit.

      distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
         && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
         && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
      curl -s -L https://nvidia.github.io/nvidia-container-runtime/experimental/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
      sudo apt-get update
      sudo apt-get install -y nvidia-docker2
      sudo systemctl restart docker
      sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
      
  • Deployment of Code:

    • Pull iGniter code.

      git clone https://github.com/Echozqn/igniter.git
      cd igniter/i-Gniter
      pip install -r requirements.txt
      
    • Pull triton images.

      docker pull nvcr.io/nvidia/tritonserver:21.07-py3
      docker pull nvcr.io/nvidia/tritonserver:21.07-py3-sdk
      
    • Convert model.

      • If you are using V100, we have prepared the model files required for your experiments. Please perform the following actions.

        • Download the model file.

          cd i-Gniter/Launch/model/
          ./fetch_models.sh
          
      • If not, you need to follow the steps below to generate the model files needed for the experiment.

        • Use the trtexec tool to convert the ONNX to a trt engine model.

          cd i-Gniter/Launch/model/
          ./fetch_models.sh
          cd i-Gniter/Profile
          python3 model_onnx.py # Generate Model
          sh onnxTOtrt.sh # Convert model
          
        • Replace model.plan in the $path/igniter/i-Gniter/Launch/model/model directory with the generated model; the directory structure is shown below.

          .
          └── igniter
              └── i-Gniter
                  └── Launch
                      └── model
                          └── model
                              ├── alexnet_dynamic
                              │   ├── 1
                              │   │   └── model.plan
                              │   └── config.pbtxt
                              ├── resnet50_dynamic
                              │   ├── 1
                              │   │   └── model.plan
                              │   └── config.pbtxt
                              ├── ssd_dynamic
                              │   ├── 1
                              │   │   └── model.plan
                              │   └── config.pbtxt
                              └── vgg19_dynamic
                                  ├── 1
                                  │   └── model.plan
                                  └── config.pbtxt
          
        • Modify the input and output names in the configuration files (config.pbtxt) of the four models. The input name is modified to actual_input_1 and the output name to output1.

          Here is the config.pbtxt file for the alexnet model.

          name: "alexnet_dynamic"
          platform: "tensorrt_plan"
          max_batch_size: 5
          dynamic_batching {
          preferred_batch_size: [5]
            max_queue_delay_microseconds: 100000
          }
          input [
          {
            name: "actual_input_alexnet"
            data_type: TYPE_FP32
            dims: [3, 224, 224]
          }
          ]
          output [
          {
            name: "output_alexnet"
            data_type: TYPE_FP32
            dims: [1000]
          }
          ]
          

Getting Started

Profiler

The profiler has currently been tested only on T4 and V100 GPUs. If you want to use it on other GPUs, you may need to adjust hardware parameters such as activetime_2, activetime_1, and idletime_1. If your GPU is a V100, you can skip this part; we have already provided a config file profiled on the V100.

Initializing:

source start.sh

Profiling hardware parameters:

cd tools
python3 computeBandwidth.py
./power_t_freq 1530 # 1530 is the highest frequency of the V100 