ImageSimilarityUsingCntk
Deep Neural Network based Image Similarity ranking on Azure using CNTK
Note: this repository is no longer maintained. However, state-of-the-art image similarity is supported in this GitHub repo: https://github.com/microsoft/computervision-recipes
Image Similarity Ranking using Microsoft Cognitive Toolkit (CNTK)
DESCRIPTION
A large number of problems in the computer vision domain can be solved by ranking images according to their similarity. For example, retail companies want to show customers products which are similar to ones bought in the past, and companies with large amounts of data want to organize and search their images effectively.
This tutorial will address solving such problems. We will show how to train and test your own image similarity ranker using Microsoft Cognitive Toolkit (CNTK). Example images and annotations are provided, but the reader can also bring their own dataset and train their own unique ranker.
Traditionally, computer vision solutions required expert knowledge to identify and implement so-called "features" which highlight desired information in images. This changed in 2012 with the famous AlexNet paper, and at present, Deep Neural Networks (DNNs) are used to automatically find these features. DNNs have led to a huge improvement in the field and can be used for a large range of problems, including image classification, object detection, and image similarity.
The tutorial is split into three parts:
- Part 1 shows how to train and evaluate an image similarity ranking system using the provided example data.
- Part 2 covers how to annotate your own dataset by scraping images from the net, how to use these images to train an image similarity model for your specific use case, and what parameters and design decisions are important to achieve good results.
- Part 3 illustrates how to publish the trained model as a web service or REST API on Azure.
While previous experience with machine learning and CNTK is not required, it is very helpful for understanding the underlying principles covered in this tutorial. Note that accuracy numbers, training time, etc. reported in this tutorial are only for reference, and the actual values you might see when running the code may differ.
PREREQUISITES
This code was tested using CNTK 2.0.0, and assumes that CNTK was installed with the (default) Anaconda Python interpreter using the script-driven installation. Note that the code will not run on previous CNTK versions due to breaking changes.
A dedicated GPU, while not strictly required, is recommended for refining the DNN. If you lack a strong GPU, do not want to install CNTK yourself, or want to train on multiple GPUs, consider using Azure's Data Science Virtual Machine. See the Deep Learning Toolkit for a 1-click deployment solution.
CNTK can be easily installed by following the instructions on the setup page. This will also automatically install an Anaconda Python distribution. At the time of writing, the default Python version is 3.5, 64 bit.
Anaconda comes with many packages already pre-installed. The only missing packages are opencv and scikit-learn. These can be installed easily using pip by opening a command prompt and running:
```
C:/local/CNTK-2-0/cntk/Scripts/cntkpy35.bat  # activate CNTK's Python environment
cd resources/python35_64bit_requirements/
pip.exe install -r requirements.txt
```
In the code snippet above, we assumed that the CNTK root directory is C:/local/CNTK-2-0/. The opencv python wheel was originally downloaded from this page.
Troubleshooting:
- The error "Batch normalization training on CPU is not yet implemented" is thrown when trying to train the DNN on a CPU-only system. This is because CNTK currently only supports running, but not training, batch-normalized networks on the CPU. To skip training and hence avoid this error, set rf_maxEpochs to zero in PARAMETERS.py.
FOLDER STRUCTURE
| Folder | Description |
|---|---|
| / | Root directory |
| /doc/ | Plots, diagrams, etc. used in this readme page |
| /resources/ | Directory containing all provided resources |
| /resources/cntk/ | Pre-trained ResNet model |
| /resources/python35_64_bit_requirements/ | Python wheels and requirements file for 64bit Python version 3.5 |
All scripts used in this tutorial are located in the root directory.
PART 1
In the first part of this tutorial we train and evaluate an image similarity ranking system using the provided example data. First, our approach is explained and the provided image set introduced. This is followed by a description of the scripts which do the actual work. These scripts should be executed in order, and we recommend inspecting after each step which files are written and where they are written to.
Note that all important parameters are specified, and a short explanation provided, in a single place: the PARAMETERS.py file.
Overview
The approach we follow in this tutorial is visualized at a high-level in this diagram:
<p align="center"> <img src="doc/pipeline.jpg" alt="alt text" width="800"/> </p>

- Given two images on the left, our goal is to measure the visual distance between them. Note that we use the terms distance and similarity interchangeably, since similarity can be computed e.g. by simply taking the negative of the distance.
- In a pre-processing step, one can first detect the object-of-interest in the image and then crop the image to just that area. This step is optional and not described in this tutorial. Please refer to our Fast R-CNN Object Detection tutorial for an in-depth description of how to train and evaluate such a model.
- Each (possibly cropped) image is then represented using the output of a DNN which was pre-trained on millions of images. The input to the DNN is simply the image itself, and the output is the penultimate layer, which for the ResNet-18 model used in this tutorial consists of 512 floats.
- These 512-float image representations are then scaled to each have an L2 norm of one, and the visual distance between two images is defined as a function of their respective 512-float vectors. Possible distance metrics include the L1 or L2 norm, or the cosine similarity (which technically is not a metric). The advantage of these metrics is that they are non-parametric and therefore do not require any training. The disadvantage, however, is that each of the 512 dimensions is assumed to carry the same amount of information, which in practice is not true. Hence we use a linear SVM to train a weighted L2 distance, which can give a higher weight to informative dimensions and down-weight dimensions with little or no information.
- Finally, the output of the system is the desired visual similarity for the given image pair.
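As a concrete sketch of these distance computations (numpy only; the 512-dimensional vectors below are random stand-ins for real DNN features, and the weights of the weighted L2 distance would in practice come from the trained linear SVM, not be set by hand):

```python
import numpy as np

def l2_normalize(v):
    # Scale a feature vector to unit L2 norm.
    return v / np.linalg.norm(v)

def l1_dist(a, b):
    return np.abs(a - b).sum()

def l2_dist(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # For unit-norm vectors this is simply the dot product.
    return np.dot(a, b)

def weighted_l2_dist(a, b, w):
    # Weighted L2 distance: informative dimensions get a larger weight w_i.
    # With uniform weights this reduces to the plain L2 distance.
    return np.sqrt(np.sum(w * (a - b) ** 2))

rng = np.random.default_rng(0)
f1 = l2_normalize(rng.standard_normal(512))  # stand-in for a 512-float DNN output
f2 = l2_normalize(rng.standard_normal(512))
w = np.ones(512)                             # placeholder weights
```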
Representing images as the output of a DNN is a powerful approach which has been shown to give good results on a wide variety of tasks. The best results, however, are achieved by fine-tuning the network before computing the image representations. This approach is also known as transfer learning: starting with an original model which was trained on a very large number of generic images, we use a small set of images from our own dataset to fine-tune this network. A more detailed description of transfer learning can be found on Microsoft Cognitive Toolkit's (CNTK) Transfer Learning Tutorial page (we actually borrow some code from there), or in the Embarrassingly Parallel Image Classification blog.
Image data
Script: 0_downloadImages.py
This tutorial uses as its running example a small upper-body clothing texture dataset consisting of up to 428 images, where each image is annotated as one of 3 different textures (dotted, striped, leopard). All images were scraped using Bing Image Search and hand-annotated, as explained in Part 2. The image URLs with their respective attributes are listed in the /resources/fashionTextureUrls.tsv file.
The script 0_downloadImages.py needs to be executed once and downloads all images to the /data/fashionTexture/ directory. Note that some of the 428 URLs are likely broken - this is not an issue, and just means that we have slightly fewer images for training and testing.
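A minimal sketch of what such a download step does is given below. The exact column layout assumed for the TSV file (URL first, attribute label second) is an assumption for illustration; the provided 0_downloadImages.py script is the authoritative version.

```python
import os
import urllib.request

def parse_tsv_line(line):
    # Assumed layout: <image url> TAB <attribute label>.
    url, label = line.rstrip("\n").split("\t")[:2]
    return url, label

def download_images(tsv_path, out_root="data/fashionTexture"):
    # Download each listed image into a per-label subdirectory,
    # silently skipping broken URLs (some of the 428 likely are).
    with open(tsv_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            url, label = parse_tsv_line(line)
            out_dir = os.path.join(out_root, label)
            os.makedirs(out_dir, exist_ok=True)
            try:
                urllib.request.urlretrieve(url, os.path.join(out_dir, "%04d.jpg" % i))
            except Exception:
                pass
```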
The figure below shows examples for the attributes dotted (left two columns), striped (middle two columns), and leopard (right two columns). Note that annotations were done according to the upper body clothing item, and the classifier needs to learn to focus on the relevant part of the image and to ignore all other areas (e.g. pants, shoes).
<p align="center"> <img src="doc/examples_dotted.jpg" alt="alt text" height="300"/> <img src="doc/examples_striped.jpg" alt="alt text" height="300"/> <img src="doc/examples_leopard.jpg" alt="alt text" height="300"/> </p>

STEP 1: Data preparation
Script: 1_prepareData.py
This
