# RepVGG: Making VGG-style ConvNets Great Again (CVPR-2021) (PyTorch)
## Highlights (Sep. 1st, 2022)
RepVGG and the methodology of re-parameterization have been used in YOLOv6 (paper, code) and YOLOv7 (paper, code).
I have re-organized this repository and released the RepVGGplus-L2pse model with 84.06% ImageNet top-1 accuracy. More RepVGGplus models will be released this month.
## Introduction
RepVGG is a super simple VGG-like ConvNet architecture that achieves over 84% top-1 accuracy on ImageNet! This repo contains the pretrained models, the code for building the models, training, converting a training-time model into the inference-time structure, and an example of using RepVGG for semantic segmentation.
TensorRT implementation with C++ API by @upczww. Great work!
Another PyTorch implementation by @zjykzj. He also presented detailed benchmarks here. Nice work!
Included in a famous PyTorch model zoo https://github.com/rwightman/pytorch-image-models.
Objax implementation and models by @benjaminjellis. Great work!
Included in the MegEngine Basecls model zoo.
Citation:

```
@inproceedings{ding2021repvgg,
  title={Repvgg: Making vgg-style convnets great again},
  author={Ding, Xiaohan and Zhang, Xiangyu and Ma, Ningning and Han, Jungong and Ding, Guiguang and Sun, Jian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13733--13742},
  year={2021}
}
```
## From RepVGG to RepVGGplus
We have released an improved architecture named RepVGGplus on top of the original version presented in the CVPR-2021 paper.
- RepVGGplus is deeper.
- RepVGGplus has auxiliary classifiers during training, which can be removed for inference.
- (Optional) RepVGGplus uses Squeeze-and-Excitation blocks to further improve the performance.
With a top-1 accuracy of 84.06% and higher throughput, RepVGGplus outperformed several recent vision transformers. Our training script is based on the codebase of Swin Transformer, and the throughput is tested with the Swin codebase as well. We would like to thank the authors of Swin for their clean and well-structured code.
| Model | Train image size | Test size | ImageNet top-1 | Throughput (examples/second, 320, batch size 128, 2080Ti) |
| ------------- |:-------------:| -----:| -----:| -----:|
| RepVGGplus-L2pse | 256 | 320 | 84.06% | 147 |
| Swin Transformer | 320 | 320 | 84.0% | 102 |
("pse" means Squeeze-and-Excitation blocks after ReLU.)
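For intuition, here is a minimal numpy sketch of a Squeeze-and-Excitation block (a simplified illustration, not the repo's implementation; RepVGGplus places the SE operation after ReLU, and the weight names below are made up):

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation on a single feature map x of shape (C, H, W).

    w1: (C // r, C) and w2: (C, C // r) are the two FC layers of the
    excitation MLP (r is the reduction ratio); the names are illustrative.
    """
    s = x.mean(axis=(1, 2))                       # squeeze: global average pool -> (C,)
    h = np.maximum(w1 @ s + b1, 0.0)              # excite: FC + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # FC + sigmoid -> per-channel gate in (0, 1)
    return x * gate[:, None, None]                # rescale each channel by its gate

# Example: 8 channels, reduction ratio 4
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))
w1, b1 = rng.standard_normal((2, 8)), np.zeros(2)
w2, b2 = rng.standard_normal((8, 2)), np.zeros(8)
y = se_block(x, w1, b1, w2, b2)
```

Each channel is scaled by a learned gate in (0, 1), so SE recalibrates channel importance at negligible cost relative to the convolutions.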
Download this model: Google Drive or Baidu Cloud.
To train or finetune it, slightly change your training code like this:

```python
# Build the model and data loader as usual
for samples, targets in train_data_loader:
    # ......
    outputs = model(samples)                 # Your original code
    if isinstance(outputs, dict):
        # A training-time RepVGGplus outputs a dict. The items are:
        #   'main' : the output of the final layer
        #   '*aux*': the outputs of the auxiliary classifiers
        loss = 0
        for name, pred in outputs.items():
            if 'aux' in name:
                loss += 0.1 * criterion(pred, targets)   # Assume "criterion" is cross-entropy for classification
            else:
                loss += criterion(pred, targets)
    else:
        loss = criterion(outputs, targets)   # Your original code
    # Backward as usual
    # ......
```
To use it for downstream tasks like semantic segmentation, just discard the aux classifiers and the final FC layer.
Please note that the custom weight-decay trick I described last year turned out to be insignificant in our recent experiments (84.16% ImageNet accuracy and negligible improvements on other tasks), so I decided to stop using it as a feature of RepVGGplus. You may still try it optionally on your task. Please refer to the last part of this page for details.
## Use our pretrained model
You may download all of the ImageNet-pretrained models reported in the paper from Google Drive (https://drive.google.com/drive/folders/1Avome4KvNp0Lqh2QwhXO6L5URQjzCjUq?usp=sharing) or Baidu Cloud (https://pan.baidu.com/s/1nCsZlMynnJwbUBKn0ch7dQ, access code "rvgg"). For ease of transfer learning on other tasks, they are all training-time models (with identity and 1x1 branches). You may test the accuracy by running
```
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12349 main.py --arch [model name] --data-path [/path/to/imagenet] --batch-size 32 --tag test --eval --resume [/path/to/weights/file] --opts DATA.DATASET imagenet DATA.IMG_SIZE [224 or 320]
```
The valid model names include:

```
RepVGGplus-L2pse, RepVGG-A0, RepVGG-A1, RepVGG-A2, RepVGG-B0, RepVGG-B1, RepVGG-B1g2, RepVGG-B1g4, RepVGG-B2, RepVGG-B2g2, RepVGG-B2g4, RepVGG-B3, RepVGG-B3g2, RepVGG-B3g4
```
## Convert a training-time RepVGG into the inference-time structure
For a RepVGG model or a model with RepVGG as one of its components (e.g., the backbone), you can convert the whole model by simply calling `switch_to_deploy` on every RepVGG block. This is the recommended way. Examples are shown in tools/convert.py and example_pspnet.py.
```python
for module in model.modules():
    if hasattr(module, 'switch_to_deploy'):
        module.switch_to_deploy()
```
We have also released a script for the conversion. For example,

```
python convert.py RepVGGplus-L2pse-train256-acc84.06.pth RepVGGplus-L2pse-deploy.pth -a RepVGGplus-L2pse
```
Then you may build the inference-time model with `--deploy`, load the converted weights, and test:

```
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12349 main.py --arch RepVGGplus-L2pse --data-path [/path/to/imagenet] --batch-size 32 --tag test --eval --resume RepVGGplus-L2pse-deploy.pth --deploy --opts DATA.DATASET imagenet DATA.IMG_SIZE [224 or 320]
```
Except for the final conversion after training, you may want to get the equivalent kernel and bias during training in a differentiable way at any time (get_equivalent_kernel_bias in repvgg.py). This may help training-based pruning or quantization.
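To make the conversion concrete, here is a self-contained numpy sketch of the underlying math (a simplified illustration, not the repo's actual `get_equivalent_kernel_bias`): each branch's BN is folded into its conv, the 1x1 and identity branches are rewritten as 3x3 kernels, and the three branches are summed into a single 3x3 conv.

```python
import numpy as np

def conv2d(x, w, b, pad):
    # Naive stride-1 convolution: x (C_in, H, W), w (C_out, C_in, k, k), b (C_out,)
    c_out, _, k, _ = w.shape
    h, wd = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(w[o] * xp[:, i:i + k, j:j + k]) + b[o]
    return out

def fuse_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    # Fold BatchNorm(conv(x)) into an equivalent conv kernel and bias
    std = np.sqrt(var + eps)
    return kernel * (gamma / std).reshape(-1, 1, 1, 1), beta - mean * gamma / std

C = 2
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 5, 5))

# Training-time branches: 3x3 conv + BN, 1x1 conv + BN, identity + BN
k3 = rng.standard_normal((C, C, 3, 3))
k1 = rng.standard_normal((C, C, 1, 1))
kid = np.zeros((C, C, 3, 3))            # identity expressed as a 3x3 conv
for c in range(C):
    kid[c, c, 1, 1] = 1.0

def rand_bn():
    # gamma, beta, running mean, running var (var kept positive)
    return (rng.standard_normal(C), rng.standard_normal(C),
            rng.standard_normal(C), rng.random(C) + 0.5)

bn3, bn1, bnid = rand_bn(), rand_bn(), rand_bn()

# Training-time output: the sum of the three branches
y_train = (conv2d(x, *fuse_bn(k3, *bn3), pad=1)
           + conv2d(x, *fuse_bn(k1, *bn1), pad=0)
           + conv2d(x, *fuse_bn(kid, *bnid), pad=1))

# Merge: pad the fused 1x1 kernel to 3x3, then sum kernels and biases
fk3, fb3 = fuse_bn(k3, *bn3)
fk1, fb1 = fuse_bn(k1, *bn1)
fkid, fbid = fuse_bn(kid, *bnid)
fk1_padded = np.zeros_like(fk3)
fk1_padded[:, :, 1:2, 1:2] = fk1
merged_k = fk3 + fk1_padded + fkid
merged_b = fb3 + fb1 + fbid

# Inference-time output: one 3x3 conv reproduces the whole block
y_deploy = conv2d(x, merged_k, merged_b, pad=1)
assert np.allclose(y_train, y_deploy)
```

Because every step of the fusion is differentiable, the same math can be evaluated during training, which is what makes training-based pruning or quantization on the equivalent kernel possible.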
## Train from scratch

### Reproduce RepVGGplus-L2pse (not presented in the paper)
To train the recently released RepVGGplus-L2pse from scratch, activate mixup and use AUG.PRESET raug15 for RandAugment:

```
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12349 main.py --arch RepVGGplus-L2pse --data-path [/path/to/imagenet] --batch-size 32 --tag train_from_scratch --output-dir /path/to/save/the/log/and/checkpoints --opts TRAIN.EPOCHS 300 TRAIN.BASE_LR 0.1 TRAIN.WEIGHT_DECAY 4e-5 TRAIN.WARMUP_EPOCHS 5 MODEL.LABEL_SMOOTHING 0.1 AUG.PRESET raug15 AUG.MIXUP 0.2 DATA.DATASET imagenet DATA.IMG_SIZE 256 DATA.TEST_SIZE 320
```
### Reproduce the original RepVGG results reported in the paper

To reproduce the models reported in the CVPR-2021 paper, use neither mixup nor RandAugment:

```
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12349 main.py --arch [model name] --data-path [/path/to/imagenet] --batch-size 32 --tag train_from_scratch --output-dir /path/to/save/the/log/and/checkpoints --opts TRAIN.EPOCHS 300 TRAIN.BASE_LR 0.1 TRAIN.WEIGHT_DECAY 1e-4 TRAIN.WARMUP_EPOCHS 5 MODEL.LABEL_SMOOTHING 0.1 AUG.PRESET weak AUG.MIXUP 0.0 DATA.DATASET imagenet DATA.IMG_SIZE 224
```
The original RepVGG models were trained for 120 epochs with cosine learning-rate decay from 0.1 to 0. We used 8 GPUs, a global batch size of 256, a weight decay of 1e-4 (no weight decay on fc.bias, bn.bias, rbr_dense.bn.weight and rbr_1x1.bn.weight; weight decay on rbr_identity.weight makes little difference, and it is better to use it in most cases), and the same simple data preprocessing as the PyTorch official example:
```python
trans = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
```
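The no-weight-decay rule above can be sketched as a simple filter over parameter names; the names below are illustrative, and the actual training script may organize this differently:

```python
# Sketch of the no-weight-decay rule described above. The suffixes follow the
# rule stated in the text; adapt them to your model's actual parameter names.
NO_DECAY_SUFFIXES = ('fc.bias', 'bn.bias', 'rbr_dense.bn.weight', 'rbr_1x1.bn.weight')

def split_decay_groups(param_names):
    """Split parameter names into (with weight decay, without weight decay)."""
    decay, no_decay = [], []
    for name in param_names:
        if any(name.endswith(suffix) for suffix in NO_DECAY_SUFFIXES):
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay

# Hypothetical parameter names for illustration
names = ['stage1.0.rbr_dense.conv.weight',
         'stage1.0.rbr_dense.bn.weight',
         'stage1.0.rbr_1x1.bn.weight',
         'fc.weight',
         'fc.bias']
decay, no_decay = split_decay_groups(names)
```

The two groups would then be passed to the optimizer as separate parameter groups, e.g. one with weight_decay=1e-4 and one with weight_decay=0.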
## Other released models not presented in the paper
Apr 25, 2021 A deeper RepVGG model achieves 83.55% top-1 accuracy on ImageNet with SE blocks and an input resolution of 320x320 (and a wider version achieves 83.67% accuracy without SE). Note that it is trained with 224x224 but tested with 320x320, so that it is still trainable with a global batch size of 256 on a single machine with 8 1080Ti GPUs. If you test it with 224x224, the top-1 accuracy will be 81.82%. It has 1, 8, 14, 24, 1 layers in the 5 stages respectively. The width multipliers are a=2.5 and b=5 (the same as RepVGG-B2). The model name is "RepVGG-D2se". The code for building the model (repvgg.py) and testing with 320x320 (the testing example below) has been updated and the weights have been released at Google Drive and Baidu Cloud. Please check the links below.
