VarifocalNet: An IoU-aware Dense Object Detector

This repo hosts the code for VarifocalNet, as presented in our CVPR 2021 oral paper (https://arxiv.org/abs/2008.13367):

@inproceedings{zhang2020varifocalnet,
  title={VarifocalNet: An IoU-aware Dense Object Detector},
  author={Zhang, Haoyang and Wang, Ying and Dayoub, Feras and S{\"u}nderhauf, Niko},
  booktitle={CVPR},
  year={2021}
}

Introduction

Accurately ranking the vast number of candidate detections is crucial for dense object detectors to achieve high performance. In this work, we propose to learn IoU-aware classification scores (IACS) that simultaneously represent the object presence confidence and the localization accuracy, so as to produce a more accurate ranking of detections in dense object detectors. In particular, we design a new loss function, named Varifocal Loss (VFL), for training a dense object detector to predict the IACS, and a new efficient star-shaped bounding box feature representation (the features at nine yellow sampling points) for estimating the IACS and refining coarse bounding boxes. Combining these two new components with a bounding box refinement branch, we build a new IoU-aware dense object detector based on the FCOS+ATSS architecture, which we call VarifocalNet, or VFNet for short. Extensive experiments on the MS COCO benchmark show that our VFNet consistently surpasses the strong baseline by ~2.0 AP with different backbones. Our best model, VFNet-X-1200 with Res2Net-101-DCN, reaches a single-model single-scale AP of 55.1 on COCO test-dev, achieving state-of-the-art performance among various object detectors.

<div align="center"> <img src="VFNet.png" width="600px" /> <p>Learning to Predict the IoU-aware Classification Score.</p> </div>
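The Varifocal Loss can be illustrated with a minimal sketch for a single prediction. It assumes the form given in the paper: for a positive (target IACS q > 0, where q is the IoU between the predicted box and its ground truth), a binary cross-entropy against the soft target q, weighted by q; for a negative (q = 0), a focal-style term down-weighted by alpha * p^gamma. The function name, the clamping constant `eps`, and the defaults `alpha=0.75` and `gamma=2.0` are illustrative assumptions here; this is not the training code in this repo, which operates on batched tensors.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Sketch of the Varifocal Loss for one (location, class) pair.

    p: predicted IoU-aware classification score (IACS), in [0, 1].
    q: target score; the IoU with the ground-truth box for the
       ground-truth class of a positive sample, and 0 otherwise.
    alpha, gamma, eps: assumed illustrative values.
    """
    # Clamp p so that log() stays finite at 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    if q > 0:
        # Positive: cross-entropy against the soft target q, weighted by q,
        # so high-IoU positives contribute more to the loss.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative: only negatives are down-weighted by alpha * p**gamma.
    return -alpha * (p ** gamma) * math.log(1.0 - p)


if __name__ == "__main__":
    print(varifocal_loss(0.9, 0.9))  # positive, score matches its IoU
    print(varifocal_loss(0.1, 0.9))  # positive, score far below its IoU
    print(varifocal_loss(0.1, 0.0))  # easy negative
```

The asymmetry is the point of the design: positives are weighted by their target IoU rather than down-weighted, while easy negatives are suppressed.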

Updates

2021.03.05 Our VarifocalNet is accepted to CVPR 2021 as an oral presentation. Thanks to the reviewers and ACs.
2021.03.04 Update to MMDetection v2.10.0, add more results and training scripts, and update the arXiv paper.
2021.01.09 Add SWA training.
2021.01.07 Update to MMDetection v2.8.0.
2020.12.24 We release a new VFNet-X model that can achieve a single-model single-scale 55.1 AP on COCO test-dev at 4.2 FPS.
2020.12.02 Update to MMDetection v2.7.0.
2020.10.29 VarifocalNet has been merged into the official MMDetection repo. Many thanks to @yhcao6, @RyanXLi and @hellock!
2020.10.29 This repo has been refactored so that users can pull the latest updates from the upstream official MMDetection repo. The previous one can be found in the old branch.

Installation

This VarifocalNet implementation is based on MMDetection. Therefore, the installation is the same as for the original MMDetection.
Please check get_started.md for installation. Note that you should change the PyTorch and CUDA versions to match yours when installing mmcv in step 3, and clone this repo instead of MMDetection in step 4.

If you run into problems with pycocotools, please install it by:

pip install "git+https://github.com/open-mmlab/cocoapi.git#subdirectory=pycocotools"

A Quick Demo

Once the installation is done, you can follow the steps below to run a quick demo.

Download the model and put it into one folder under the root directory of this project, say, checkpoints/.
Go to the root directory of this project in terminal and activate the corresponding virtual environment.

Run

python demo/image_demo.py demo/demo.jpg configs/vfnet/vfnet_r50_fpn_1x_coco.py checkpoints/vfnet_r50_1x_41.6.pth

and you should see an image with detections.
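The same demo can also be run from Python. The sketch below assumes the MMDetection v2.x high-level API (`init_detector`, `inference_detector`, and the detector's `show_result` method) and reuses the config and checkpoint paths from the command above. The wrapper name `run_vfnet_demo`, the output path `result.jpg`, and the `0.3` score threshold are hypothetical choices for illustration.

```python
def run_vfnet_demo(
    img="demo/demo.jpg",
    config="configs/vfnet/vfnet_r50_fpn_1x_coco.py",
    checkpoint="checkpoints/vfnet_r50_1x_41.6.pth",
    device="cuda:0",
    out_file="result.jpg",  # hypothetical output path
    score_thr=0.3,          # assumed visualization threshold
):
    """Run VFNet on one image and save the visualized detections."""
    # Imported lazily so this file can be loaded without MMDetection installed.
    from mmdet.apis import inference_detector, init_detector

    # Build the detector from the config and load the trained weights.
    model = init_detector(config, checkpoint, device=device)
    # Returns per-class detections for the image.
    result = inference_detector(model, img)
    # Draw the detections above the score threshold and write them to out_file.
    model.show_result(img, result, score_thr=score_thr, out_file=out_file)
    return result


if __name__ == "__main__":
    run_vfnet_demo()
```

Pass `device="cpu"` if no GPU is available; whether every operator runs on CPU depends on how mmcv was built.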

Usage of MMDetection

Please see exist_data_model.md for the basic usage of MMDetection. The MMDetection team also provides a Colab tutorial for beginners.

For troubleshooting, please refer to faq.md.

Results and Models

For your convenience, we provide the following trained models. These models are trained with a mini-batch size of 16 images on 8 Nvidia V100 GPUs (2 images per GPU).

| Backbone | Style | DCN | MS <br> train | Lr <br> schd | Inf time <br> (fps) | box AP <br> (val) | box AP <br> (test-dev) | Download |
|:------------:|:---------:|:-------:|:-------------:|:------------:|:------------------:|:-----------------:|:----------------------:|:--------------------------------------:|
| R-50 | pytorch | N | N | 1x | 19.4 | 41.6 | 41.6 | model \| log |
| R-50 | pytorch | N | Y | 2x | 19.3 | 44.5 | 44.8 | model \| log |
| R-50 | pytorch | Y | Y | 2x | 16.3 | 47.8 | 48.0 | model \| log |
| R-101 | pytorch | N | N | 1x | 15.5 | 43.0 | 43.6 | model \| log |
| R-101 | pytorch | N | N | 2x | 15.6 | 43.5 | 43.9 | model \| log |
| R-101 | pytorch | N | Y | 2x | 15.6 | 46.2 | 46.7 | model \| log |
| R-101 | pytorch | Y | Y | 2x | 12.6 | 49.0 | 49.2 | model \| log |
| X-101-32x4d | pytorch | N | Y | 2x | 13.1 | 47.4 | 47.6 | model \| log |
| X-101-32x4d | pytorch | Y | Y | 2x | 10.1 | 49.7 | 50.0 | model \| log |
| X-101-64x4d | pytorch | N | Y | 2x | 9.2 | 48.2 | 48.5 | model \| log |
| X-101-64x4d | pytorch | Y | Y | 2x | 6.7 | 50.4 | 50.8 | model \| log |
| R2-101 | pytorch | N | Y | 2x | 13.0 | 49.2 | 49.3 | model \| log |
| R2-101 | pytorch | Y | Y | 2x | 10.3 | 51.1 | 51.3 | model \| log |

Notes:

The MS-train scale range is 1333x[480:960] (range mode): the long edge is capped at 1333 and the short edge is sampled between 480 and 960. The inference scale is kept at 1333x800.
The R2-101 backbone is Res2Net-101.
DCN means using DCNv2 in both backbone and head.
The inference speed is tested with an Nvidia V100 GPU on HPC (log file).
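The multi-scale training note above can be sketched as follows. This is a simplified illustration of range-mode sampling, not the MMDetection implementation: it assumes a target scale is drawn per image with the long edge fixed at 1333 and the short edge drawn uniformly from [480, 960], after which the image is resized with its aspect ratio preserved to fit within that scale. The names `MS_TRAIN_SCALES`, `TEST_SCALE`, and `sample_train_scale` are hypothetical.

```python
import random

# Two (long_edge, short_edge) endpoints describing the 1333x[480:960] range.
MS_TRAIN_SCALES = [(1333, 480), (1333, 960)]
# Fixed scale used at inference time.
TEST_SCALE = (1333, 800)


def sample_train_scale(scales=MS_TRAIN_SCALES, rng=random):
    """Sample a (long_edge, short_edge) target scale in range mode."""
    long_edges = [max(s) for s in scales]
    short_edges = [min(s) for s in scales]
    # Both bounds are inclusive; the long edge is constant for these endpoints.
    long_edge = rng.randint(min(long_edges), max(long_edges))
    short_edge = rng.randint(min(short_edges), max(short_edges))
    return long_edge, short_edge


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_train_scale(rng=rng))
```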

We also provide the models of RetinaNet, FoveaBox, RepPoints and ATSS trained with the Focal Loss (FL) and our Varifocal Loss (VFL).

| Method | Backbone | MS train | Lr schd | box AP (val) | Download |
|:---------------:|:--------:|:--------:|:-------:|:------------:|:--------:|
| RetinaNet + FL | R-50 | N | 1x | 36.5 | model \| [log](https://drive.go
