TensorRT-7 Network Lib

Introduction

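Python ===> Onnx ===> tensorRT ===> .h/.so

- Supports FP32, FP16, and INT8 quantization.
- Supports serialize and deserialize.
- Uses a thread pool for multi-threaded concurrency, which speeds up pre-processing and post-processing.
- Rewrites or fuses some OpenCV operators to improve cache utilization and avoid unnecessary scans.
- Supports running the GPU and CPU sides asynchronously during inference to hide latency.
- Supports pruning, distillation, quantization, and swapping in a lightweight backbone.
- Recommended for use together with https://github.com/Syencil/mobile-yolov5-pruning-distillation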

Model Zoo

|Model|Training git|Infer Time|Total Time|
|----|----|----|----|
|Yolov5x|https://github.com/ultralytics/yolov5 https://github.com/Syencil/mobile-yolov5-pruning-distillation|32.5ms|58ms|
|PANNet(Pse++)|https://github.com/WenmuZhou/PAN.pytorch|18.5ms|45ms|
|PSENet|https://github.com/WenmuZhou/PSENet.pytorch|22ms|48ms|
|Yolov3|https://github.com/YunYang1994/tensorflow-yolov3|14.5ms|29.5ms|
|Retinaface|https://github.com/biubug6/Pytorch_Retinaface https://github.com/Syencil/Pytorch_Retinaface|2.3ms|12.3ms|
|Retinanet|mmdetection + configs/nas_fpn/retinanet_r50_fpn_crop640_50e_coco.py|22.9ms|333ms|
|Fcos|mmdetection + configs/fcos/fcos_r50_caffe_fpn_4x4_1x_coco.py|-|-|
|ResNet|-|-|-|
|Hourglass|https://github.com/Syencil/Keypoints|28ms|37ms|
|SimplePose|https://github.com/microsoft/human-pose-estimation.pytorch|3ms|7ms|

Test environment: Tesla P40 + 4 CPU threads.

Quick Start

Code -> Onnx

| |git|Convert|
|----|----|---|
|tensorflow|https://github.com/onnx/tensorflow-onnx|python -m tf2onnx.convert|
|pytorch|-|torch.onnx.export(model, img, weights, verbose=False, opset_version=11, input_names=['images'], output_names=['output'])|
|Onnx|onnx-simplifier|python3 -m onnxsim in.onnx out.onnx|

C++

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make + project_lib
make + project_name
./bin/project_name

Tips

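The Onnx model must be exported with a fully specified input size. In practice trt does not offer ideal dynamic input either, so the input size has to be fixed at the freeze stage.
When building a new project, you usually only need to inherit from one of the DetectionTRT/SegmentationTRT/KeypointTRT classes under the TensorRT class and implement postProcess. The interface exposed to callers consists of two methods, initSession and predOneImage.
ONNX and TRT operator support is fairly complete by now, so most of the time only the post-processing has to be implemented. An unsupported operator can usually be replaced with a trick in the python code; if that fails, consider writing a custom plugin.
On the CHW and HWC data layouts:
- CHW: better for the GPU. When inference or post-processing runs in CUDA, CHW lets threads read memory in a coalesced way, which matters because of how the DRAM hardware works. For a detailed performance comparison, see Programming_Massively_Parallel_Processors.
- HWC: better for the CPU. When processing on the CPU, HWC keeps the data handled by a single thread at contiguous memory addresses. CPU caches exploit spatial locality, so this greatly improves efficiency.
- In summary: if post-processing decodes on the CPU, export the onnx output in HWC; if it decodes on the GPU, export it in CHW. The input side has not been specifically tested. The working assumption is that tensorflow, although it follows the HWC layout, has optimized its operations for it.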
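The layout argument above can be illustrated with plain index arithmetic. This is a minimal sketch, not code from this library: the helper names `chw_offset` and `hwc_offset` and the toy sizes are assumptions made for illustration. It prints the flat memory offsets touched by a CPU-style access (one thread reads every channel of one pixel) and by a GPU-style access (adjacent threads read the same channel of adjacent pixels).

```python
def chw_offset(c, h, w, C, H, W):
    """Flat index of element (c, h, w) in a CHW buffer."""
    return (c * H + h) * W + w


def hwc_offset(c, h, w, C, H, W):
    """Flat index of element (c, h, w) in an HWC buffer."""
    return (h * W + w) * C + c


C, H, W = 3, 4, 5  # toy sizes, chosen only for illustration

# CPU-style access: one thread decodes all channels of pixel (h=1, w=2).
hwc_pixel = [hwc_offset(c, 1, 2, C, H, W) for c in range(C)]
chw_pixel = [chw_offset(c, 1, 2, C, H, W) for c in range(C)]

# GPU-style access: adjacent threads read channel 0 of adjacent pixels in row 1.
chw_row = [chw_offset(0, 1, w, C, H, W) for w in range(W)]
hwc_row = [hwc_offset(0, 1, w, C, H, W) for w in range(W)]

print(hwc_pixel)  # [21, 22, 23]         contiguous: cache friendly on the CPU
print(chw_pixel)  # [7, 27, 47]          strided by H * W
print(chw_row)    # [5, 6, 7, 8, 9]      contiguous: coalesced reads on the GPU
print(hwc_row)    # [15, 18, 21, 24, 27] strided by C
```

Whichever side does the decode should get the layout in which its access pattern produces the contiguous offsets.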

Darknet

Overview

Location: yolov3_darknet_main.cpp

Notes

Convert the darknet model to onnx with pytorch-yolov4 before using it.

SimplePose

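Overview

Location: simplePose_main.cpp
Python training code git: https://github.com/microsoft/human-pose-estimation.pytorch

Notes

After exporting to onnx, set a larger temporary CUDA workspace when parsing the onnx; otherwise parsing the deconv layers reports an error.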

StreamProcess

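Overview

Location: stream_main.cpp
This project is a simple yolov5-based demo that separates the GPU side from the CPU side and overlaps them to hide latency.
The base example runs inference and rendering on a video; preFunc and postFunc can be freely changed or rewritten to meet different needs.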
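The latency-hiding idea can be sketched as a three-stage pipeline in which each stage runs in its own thread and hands work to the next through a bounded queue. This is an illustrative Python analogy, not the C++ code in stream_main.cpp: `run_pipeline`, `infer_func`, and `depth` are hypothetical names, and `pre_func` / `post_func` merely stand in for the project's preFunc and postFunc.

```python
import queue
import threading


def run_pipeline(frames, pre_func, infer_func, post_func, depth=2):
    """Run pre-processing, inference and post-processing as overlapping stages."""
    to_infer = queue.Queue(maxsize=depth)  # pre-processing  -> inference
    to_post = queue.Queue(maxsize=depth)   # inference       -> post-processing
    results = []
    stop = object()  # sentinel that marks the end of the stream

    def pre_worker():
        for frame in frames:
            to_infer.put(pre_func(frame))
        to_infer.put(stop)

    def infer_worker():
        while True:
            item = to_infer.get()
            if item is stop:
                to_post.put(stop)
                return
            to_post.put(infer_func(item))

    def post_worker():
        while True:
            item = to_post.get()
            if item is stop:
                return
            results.append(post_func(item))

    workers = [threading.Thread(target=fn)
               for fn in (pre_worker, infer_worker, post_worker)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# Toy stages: real code would decode a frame, run the engine, and draw boxes.
print(run_pipeline(range(5), lambda x: x + 1, lambda x: x * 2, lambda x: x - 1))
# [1, 3, 5, 7, 9]
```

Because each stage has exactly one worker and the queues are FIFO, results come out in frame order. While frame N is in the inference stage, frame N+1 can already be pre-processed and frame N-1 post-processed.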

PanNet (PseNet V2)

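Overview

Location: psenetv2_main.cpp
Original python training code git: https://github.com/WenmuZhou/PAN.pytorch
Code modified for TensorRT git: https://github.com/Syencil/PAN.pytorch

Notes

The pan and pse code are highly similar. For the export method, refer either to PseNet or to the modified code in my fork.
The onnx exported from the pan network produces output that has not passed through sigmoid (trying to add it in post-processing).
Sigmoid is fairly expensive to compute on the CPU; see fast-sigmoid-algorithm. CPU performance comparison for 100000 calls: sigmoid ==> 2.81878ms, fast sigmoid ==> 0.589737ms. On the GPU the difference between the two is negligible.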

    fast_sigmoid(x) = (x / (1 + |x|)) * 0.5 + 0.5
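A minimal sketch of the formula above next to the exact sigmoid, with a small timing loop. This is not the project's C++ implementation, and the timings it prints depend on the machine and on the Python interpreter, so they will not match the figures quoted above.

```python
import math
import timeit


def sigmoid(x):
    """Exact logistic function (math.exp overflows for x below about -709)."""
    return 1.0 / (1.0 + math.exp(-x))


def fast_sigmoid(x):
    """Approximation from the formula above: no exponential needed."""
    return (x / (1.0 + abs(x))) * 0.5 + 0.5


if __name__ == "__main__":
    xs = [i / 100.0 - 5.0 for i in range(1000)]  # inputs in [-5, 5)
    for name, fn in (("sigmoid", sigmoid), ("fast_sigmoid", fast_sigmoid)):
        seconds = timeit.timeit(lambda: [fn(x) for x in xs], number=100)
        print(f"{name}: {seconds * 1000:.2f} ms for {100 * len(xs)} calls")
```

The two functions are not numerically equal away from zero (at x = 2 the exact value is about 0.881 while the approximation gives about 0.833). Both are monotonically increasing and both equal 0.5 at x = 0, so a threshold of 0.5 selects the same pixels; any other threshold has to be re-tuned for the approximation.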

PseNet

Overview

Location: psenet_main.cpp
Original python training code git: https://github.com/WenmuZhou/PSENet.pytorch

Notes

The torch-to-onnx code can be added to predict.py: just add one member function to the Pytorch_model class.

    # Requires `import os` and `import torch` at the top of predict.py.
    def export(self, onnx_path, input_size):
        # input_size is (height, width); the exported graph has a fixed input shape.
        assert isinstance(input_size, (list, tuple))
        self.net.export = True  # switches on the sigmoid branch in PSENet.forward
        img = torch.zeros((1, 3, input_size[0], input_size[1])).to(self.device)
        with torch.no_grad():
            torch.onnx.export(self.net, img, onnx_path, verbose=True, opset_version=11, export_params=True, do_constant_folding=True)
        print("Onnx Simplify...")
        # Simplify the exported graph in place with onnx-simplifier.
        os.system("python3 -m onnxsim {} {}".format(onnx_path, onnx_path))
        print('Export complete. ONNX model saved to %s\nView with https://github.com/lutzroeder/netron' % onnx_path)

To make processing in trt easier, I moved the sigmoid into the torch code. Modify the forward method of PSENet in models/model.py, and add a member variable export=False in __init__ to control it:

        if self.export:
            x = torch.sigmoid(x)
        return x

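Converting the onnx to trt may fail with "This version of TensorRT only supports asymmetric". The bilinear upsampling mode can be the cause. The fix is to set align_corners=True in every F.interpolate call, modify the corresponding cpp in onnx-tensorrt, then recompile and replace the trt lib.
To inspect the feature map of each kernel, just uncomment the relevant code in psenet.cpp.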

Yolov5

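Overview

Location: yolov5_main.cpp
Original python training code git: https://github.com/ultralytics/yolov5
Model compression and acceleration git: https://github.com/Syencil/mobile-yolov5-pruning-distillation

Notes

The trt decode expects the BxHxWxAC layout (which makes it easy to parallelize along the height dimension and to integrate on other embedded targets). The onnx exported by the original yolov5 is BxAxHxWxC, so line 28 of models/yolo.py needs to be changed to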

            if self.export:
                x[i] = x[i].permute(0, 2, 3, 1).contiguous()
            else:
                x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
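The difference between the two layouts can be checked with plain index arithmetic. This is an illustrative sketch with assumed helper names and toy sizes, not code from yolov5 or from this library. It shows that in BxHxWxAC every image row occupies one contiguous block of memory, which is what makes splitting the decode by height straightforward, while in BxAxHxWxC the same row is scattered across the anchors.

```python
def offset_bhw_ac(b, a, h, w, c, A, H, W, C):
    """Flat index in the B x H x W x (A*C) layout expected by the trt decode."""
    return ((b * H + h) * W + w) * (A * C) + a * C + c


def offset_bahwc(b, a, h, w, c, A, H, W, C):
    """Flat index in the original B x A x H x W x C layout."""
    return (((b * A + a) * H + h) * W + w) * C + c


A, H, W, C = 3, 2, 2, 4  # toy sizes: anchors, height, width, values per anchor


def row_offsets(offset_fn, h):
    """Sorted flat offsets of everything that belongs to image row h (batch 0)."""
    return sorted(offset_fn(0, a, h, w, c, A, H, W, C)
                  for a in range(A) for w in range(W) for c in range(C))


def is_contiguous(offsets):
    """True if the offsets form one unbroken run of addresses."""
    return offsets == list(range(offsets[0], offsets[0] + len(offsets)))


print(is_contiguous(row_offsets(offset_bhw_ac, 1)))  # True
print(is_contiguous(row_offsets(offset_bahwc, 1)))   # False
```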

RetinaFace

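Overview

Location: retinaface_main.cpp
Original python training code git: https://github.com/biubug6/Pytorch_Retinaface

Notes

When running convert_to_onnx.py, change the arguments to opset_version=11 and verbose=True.
This project does not need keypoints, so the landmark decode part was removed.
Boxes are filtered directly with a threshold of 0.6 (the original uses 0.02 + topK) and then passed to NMS.
Multi-threaded operation is supported.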
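The filter-then-NMS step described above can be sketched in a few lines. This is a generic greedy NMS written for illustration, not the C++ post-processing in retinaface_main.cpp. The 0.6 score threshold comes from the note above; the function names and the `iou_thresh=0.4` default are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, score_thresh=0.6, iou_thresh=0.4):
    """Return indices of kept boxes, highest score first."""
    # Step 1: drop low-confidence boxes before any pairwise comparison.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    # Step 2: greedy NMS, keeping a box only if it does not overlap a kept one.
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
print(filter_and_nms(boxes, scores))  # [0, 2]
```

Thresholding at 0.6 first keeps the candidate list short, so the quadratic NMS loop stays cheap without a separate topK step.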

Yolov3

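Overview

Location: yolov3_main.cpp
Original python training code git: https://github.com/YunYang1994/tensorflow-yolov3
Code modified for TensorRT git: https://github.com/Syencil/tensorflow-yolov3

Notes

Training is the same as in the original git. The main changes are that freeze uses a fixed-size input and that the python decode implementation was modified: a decode_full_shape class method is added to core/yolov3.py.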

    def decode_full_shape(self, conv_output, anchors, stride):
        """
        return tensor of shape [batch_size, output_size, output_size, anchor_per_scale, 5 + num_classes]
               contains (x, y, w, h, score, probability)
        """
        conv_shape = conv_output.get_shape().as_list()
        batch_size = conv_shape[0]
        output_size = conv_shape[1]
        anchor_per_scale = len(anchors)

        conv_output = tf.reshape(conv_output, (batch_size, output_size, output_size, anchor_per_scale, 5 + self.num_class), name="reshape")

        conv_raw_dxdy = conv_output[:, :, :, :, 0:2]
        conv_raw_dwdh = conv_output[:, :, :, :, 2:4]
        conv_raw_conf = conv_output[:, :, :, :, 4:5]
        conv_raw_prob = conv_output[:, :, :, :, 5:]

        y_np = np.tile(np.arange(output_size, dtype=np.int32)[..., np.newaxis], [1, output_size])
        x_np = np.tile(np.arange(output_size, dtype=np.int32)[np.newaxis, ...], [output_size, 1])

        xy_grid_np = np.concatenate([np.reshape(x_np, [np.shape(x_np)[0], np.shape(x_np)[1], 1]), np.reshape(y_np, [np.shape(y_np)[0], np.shape(y_np)[1], 1])], axis=2)
        xy_grid_np = np.tile(np.reshape(xy_grid_np, [1, np.shape(xy_grid_np)[0], np.shape(xy_grid_np)[1], 1, np.shape(xy_grid_np)[2]]), [batch_size, 1, 1, anchor_per_scale, 1])

        anchor_np = np.tile(np.reshape(anchors, [1, 1, 1, -1]), [batch_size, output_size, output_size, 1])

        xy_grid = tf.constant(xy_grid_np, dtype=tf.float32)
        stride_tf = tf.constant(shape=[batch_size, output_size, output_size, anchor_per_scale * 2], value=stride, dtype=tf.float32)
        anchor_tf = tf.constant(anchor_np, dtype=tf.float32)

        pred_xy = tf.sigmoid(conv_raw_dxdy)
        pred_wh = tf.exp(conv_raw_dwdh)

        pred_xy = tf.reshape(pred_xy, [batch_size, output_size, output_size, anchor_per_scale * 2])
        pred_wh = tf.reshape(pred_wh, [batch_size, output_size, output_size, anchor_per_scale * 2])
        xy_grid = tf.reshape(xy_grid, [batch_size, output_size, output_size, anchor_per_scale * 2])

        pred_xy = tf.add(pred_xy, xy_grid)
        pred_xy = tf.multiply(pred_xy, stride_tf)
        pred_wh = tf.multiply(pred_wh, anchor_tf)
        pred_wh = tf.multiply(pred_wh, stride_tf)

        pred_xy = tf.reshape(pred_xy, [batch_size, output_size, output_size, anchor_per_scale, 2])
        pred_wh = tf.reshape(pred_wh, [batch_size, output_size, output_size, anchor_per_scale, 2])

        pred_xywh = tf.concat([pred_xy, pred_wh], axis=4)

        pred_conf = tf.sigmoid(conv_raw_conf)
        pred_prob = tf.sigmoid(conv_raw_prob)

        return tf.concat([pred_xywh, pred_conf, pred_prob], axis=4, name="decode")
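The arithmetic that decode_full_shape applies to the whole tensor can be followed on a single grid cell and anchor. This is a plain-Python sketch for illustration: the function name `decode_cell` and the sample numbers are assumptions. As in the method above, the anchor is multiplied by the stride, so it is expressed in grid units.

```python
import math


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_cell(raw, col, row, anchor, stride):
    """Decode one (cell, anchor) vector: [dx, dy, dw, dh, conf, class probs...]."""
    dx, dy, dw, dh, conf = raw[:5]
    # Centre: sigmoid offset inside the cell, plus the cell index, scaled to pixels.
    x = (_sigmoid(dx) + col) * stride
    y = (_sigmoid(dy) + row) * stride
    # Size: exponential scale of the anchor, scaled to pixels.
    w = math.exp(dw) * anchor[0] * stride
    h = math.exp(dh) * anchor[1] * stride
    return [x, y, w, h, _sigmoid(conf)] + [_sigmoid(p) for p in raw[5:]]


# All-zero raw output for the cell at column 3, row 2 of a stride-8 feature map.
print(decode_cell([0.0] * 6, col=3, row=2, anchor=(1.5, 2.0), stride=8))
# [28.0, 20.0, 12.0, 16.0, 0.5, 0.5]
```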
