Fast R-CNN Object Detection Tutorial for Microsoft Cognitive Toolkit (CNTK)

+ Update v2.0.1 (June 2017):
+ Updated documentation to include Visual Object Tagging Tool as an annotation option.
+ Update v2 (June 2017):
+ Updated code to be compatible with the CNTK 2.0.0 release.
+ Update v1 (Feb 2017):
+ This tutorial was updated to use CNTK's python wrappers. Now all processing happens in-memory during scoring. See script 6_runSingleImage for an example. Furthermore, we switched to a much more accurate and faster implementation of Selective Search.
+ Note that, at the time of writing, CNTK does not support Python 2. If you need Python 2 then please refer to the [previous version](https://github.com/Azure/ObjectDetectionUsingCntk/tree/7edd3276a189bad862dc54e9f73b7cfcec5ae562) of this tutorial.

DESCRIPTION

Object Detection is one of the main problems in Computer Vision. Traditionally, it required expert knowledge to identify and implement so-called "features" that highlight the position of objects in an image. Starting in 2012 with the famous AlexNet paper, Deep Neural Networks have been used to find these features automatically. This has led to huge improvements in the field for a large range of problems.

This tutorial uses Microsoft Cognitive Toolkit's (CNTK) Fast R-CNN implementation (see the Fast R-CNN section for a description), which was shown to produce state-of-the-art results for Pascal VOC, one of the main object detection challenges in the field.

GOALS

The goal of this tutorial is to show how to train and test your own Deep Learning object detection model using Microsoft Cognitive Toolkit (CNTK). Example data and annotations are provided, but the reader can also bring their own images and train their own unique object detector.

The tutorial is split into four parts:

  • Part 1 shows how to train an object detection model for the example data without retraining the provided Neural Network, but instead training an external classifier on its output. This approach works particularly well with small datasets, and does not require expertise with deep learning.
  • Part 2 extends this approach to refine the Neural Network directly without the need for an external classifier.
  • Part 3 illustrates how to annotate your own images and use these to train an object detection model for your specific use case.
  • Part 4 covers how to reproduce published results on the Pascal VOC dataset.

Previous experience with Machine Learning is not required to complete this tutorial, but it is very helpful for understanding the underlying principles. More information on the topic can be found on CNTK's Fast R-CNN page.

PREREQUISITES

This tutorial was tested using CNTK v2.0.0, and assumes that CNTK was installed with the (default) Anaconda Python interpreter. Note that the code will only run on v2.0 due to breaking changes in other versions.

CNTK can be easily installed by following the instructions on the script-driven installation page. This will also automatically add an Anaconda Python distribution. At the time of writing, the default Python version is 3.5.

A dedicated GPU is not required, but recommended for retraining of the Neural Network (part 2). If you lack a strong GPU, don't want to install CNTK yourself, or want to train a model using multiple GPUs, then consider using Azure's Data Science Virtual Machine. See the Cortana Intelligence Gallery for a 1-click deployment solution.

<!-- The only change needed to instead install Python 3.4 is by adding the string '-PyVersion 34' when starting the installation: ````bash *./install.ps1 -execute -PyVersion 34 ```` In the following, we assume that the python interpreter is in *C:/local/Anaconda3-4.1.1-Windows-x86_64/* and the CNTK root directory is *C:/local/CNTK-2-0-rc1/*. -->

Several Python packages are required to execute the Python scripts. These libraries can be installed easily using the provided Python wheels, by opening a command prompt and running:

```bash
c:/local/CNTK-2-0/cntk/Scripts/cntkpy35.bat
cd resources/python35_64bit_requirements/
pip.exe install -r requirements.txt
```

In the code snippet above, we assume that the CNTK root directory is C:/local/CNTK-2-0/. The Python wheels were originally downloaded from this page.

Finally, the file AlexNet.model is too big to be hosted on GitHub and hence needs to be downloaded manually from here and placed at /resources/cntk/AlexNet.model.

FOLDER STRUCTURE

|Folder|Description|
|---|---|
|/|Root directory|
|/data/|Directory containing images for different object recognition projects|
|/data/grocery/|Example data for grocery item detection in refrigerators|
|/data/grocery/positives/|Images and annotations to train the model|
|/data/grocery/negatives/|Images used as negatives during model training|
|/data/grocery/testImages/|Test images used to evaluate model accuracy|
|/doc/|Resources such as images for this readme page|
|/fastRCNN/|Slightly modified code used in R-CNN publications|
|/resources/|All provided resources are in here|
|/resources/cntk/|CNTK configuration file and pre-trained AlexNet model|
|/resources/python35_64_bit_requirements/|Python wheels and requirements file for 64-bit Python version 3.5|

All scripts used in this tutorial are located in the root folder.

PART 1

In the first part of this tutorial we will train a classifier which uses, but does not modify, a pre-trained deep neural network. See the Fast R-CNN section for details of the employed approach. As example data, 25 images of grocery items inside refrigerators are provided: 20 images are used for training and the remaining 5 images as the test set. The training images contain a total of 180 annotated objects of the following classes:

Egg box, joghurt, ketchup, mushroom, mustard, orange, squash, and water.

Note that 20 training images is a very low number and too few to train a high-accuracy detector. Nevertheless, even this small dataset is sufficient to produce plausible detections, as can be seen in step 5.
Every step has to be executed in order, and we recommend inspecting after each step which files are written, where they are written to, and what their content is (mostly the content is written as text files).

STEP 1: Computing Regions of Interest

Script: 1_computeRois.py

Regions of interest (ROIs) are computed for each image independently using a 3-step approach: first, Selective Search is used to generate hundreds of ROIs per image. These ROIs often fit tightly around some objects but miss other objects in the image (see the Selective Search section). Many of the ROIs are bigger, smaller, etc. than the typical grocery item in our dataset; hence, in a second step, these ROIs are discarded, as are ROIs that are too similar to one another. Finally, to complement the ROIs detected by Selective Search, ROIs that uniformly cover the image are added at different scales and aspect ratios.
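The filtering and grid-generation steps above can be sketched as follows. This is a minimal illustration only: the function names, thresholds, and scale/aspect-ratio choices are assumptions for exposition, not the parameters used by `1_computeRois.py`.

```python
def filter_rois(rois, img_w, img_h, min_size=20, max_ratio=4.0):
    """Keep only ROIs whose size and aspect ratio are plausible for our objects."""
    kept = []
    for (x, y, w, h) in rois:
        if w < min_size or h < min_size:
            continue  # too small
        if w > img_w or h > img_h:
            continue  # too big
        if max(w / h, h / w) > max_ratio:
            continue  # too elongated
        kept.append((x, y, w, h))
    return kept

def grid_rois(img_w, img_h, scales=(0.25, 0.5), aspect_ratios=(1.0, 2.0)):
    """Generate ROIs that uniformly cover the image at several scales and ratios."""
    rois = []
    for s in scales:
        for ar in aspect_ratios:
            w = int(img_w * s)
            h = int(img_w * s / ar)
            if h > img_h:
                continue
            # slide the window with 50% overlap
            for x in range(0, img_w - w + 1, w // 2):
                for y in range(0, img_h - h + 1, h // 2):
                    rois.append((x, y, w, h))
    return rois
```

In the actual script, the output of `filter_rois` applied to the Selective Search proposals is concatenated with the grid ROIs to form the final set.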

The final ROIs are written for each image separately to the files [imageName].roi.txt in the proc/grocery/rois/ folder.

For the grocery dataset, Selective Search typically generates around 1000 ROIs per image, plus on average another 2000 ROIs sampled uniformly from the image. A higher number of ROIs typically leads to better object detection performance, though at the expense of longer running time. Hence the parameter cntk_nrRois can be used to keep only a subset of the ROIs (e.g. if cntk_nrRois = 2000 then typically all ROIs from Selective Search are preserved, plus the 1000 largest ROIs generated using uniform sampling).
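The cntk_nrRois capping described above can be sketched like this (an illustrative helper, not the tutorial's actual code; the selection heuristic of preferring all Selective Search ROIs and then the largest grid ROIs follows the description in the text):

```python
def cap_rois(ss_rois, uniform_rois, nr_rois):
    """Keep all Selective Search ROIs (up to nr_rois), then fill the
    remaining budget with the largest uniformly sampled ROIs."""
    kept = list(ss_rois[:nr_rois])
    budget = nr_rois - len(kept)
    if budget > 0:
        # sort uniform ROIs by area (w * h), largest first
        by_area = sorted(uniform_rois, key=lambda r: r[2] * r[3], reverse=True)
        kept.extend(by_area[:budget])
    return kept
```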

The quality of these ROIs can be measured by counting how many of the ground-truth annotated objects in the image are covered by at least one ROI, where "covered" is defined as having an overlap greater than a given threshold. Script B1_evaluateRois.py outputs these counts at different threshold values. For example, at a threshold of 0.5 with 2000 ROIs, the recall is around 98%, while with 200 ROIs the recall is around 85%. It is important that the recall at a threshold of 0.5 is close to 100%, since even a perfect classifier cannot find an object in the image if it is not covered by at least one ROI.
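The overlap measure here is the standard intersection-over-union (IoU), and the recall computation reduces to a simple count. A self-contained sketch (function names are illustrative; the actual logic lives in B1_evaluateRois.py):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def recall_at(gt_boxes, rois, threshold=0.5):
    """Fraction of ground-truth boxes covered by at least one ROI."""
    covered = sum(1 for gt in gt_boxes
                  if any(iou(gt, roi) >= threshold for roi in rois))
    return covered / len(gt_boxes)
```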

ROIs computed using Selective Search (left); ROIs from the image above after discarding ROIs that are too small, too big, etc. (middle); Final set of ROIs after adding ROIs that uniformly cover the image (right).

<p align="center"> <img src="doc/0.ss.roi.jpg" alt="alt text" height="300"/> <img src="doc/0.filter.roi.jpg" alt="alt text" height="300"/> <img src="doc/0.grid.roi.jpg" alt="alt text" height="300"/> </p>

STEP 2: Computing CNTK inputs

Script: 2_cntkGenerateInputs.py

Each ROI generated in the last step has to be run through the CNTK model to compute its 4,096-dimensional Deep Neural Network representation (see the Fast R-CNN section). This requires three CNTK-specific input files to be generated for the training and the test set:

  • {train,test}.txt: each row contains the path to an image.
  • {train,test}.rois.txt: each row contains all ROIs for an image in relative (x,y,w,h) coordinates.
  • {train,test}.roilabels.txt: each row contains the labels for the ROIs in one-hot-encoding.
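The two transformations these files rely on are converting absolute pixel boxes to relative coordinates and one-hot encoding class labels. A minimal sketch of both (helper names are assumptions; the real work is done by 2_cntkGenerateInputs.py):

```python
def to_relative(rois, img_w, img_h):
    """Convert absolute (x, y, w, h) boxes to coordinates relative to image size."""
    return [(x / img_w, y / img_h, w / img_w, h / img_h)
            for (x, y, w, h) in rois]

def one_hot(label, n_classes):
    """One-hot encode a class index (e.g. index 0 could be the background class)."""
    vec = [0] * n_classes
    vec[label] = 1
    return vec
```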

An in-depth understanding of how these files are structured is not necessary for this tutorial. However, two points are worth noting:

  • CNTK’s fa
