Real-time Hand-Detection using Neural Networks (SSD) on Tensorflow.

This repo documents the steps and scripts used to train a hand detector using Tensorflow (Object Detection API). As with any DNN-based task, the most expensive (and riskiest) part of the process is finding or creating the right (annotated) dataset. I was interested mainly in detecting hands on a table (egocentric viewpoint). I experimented first with the Oxford Hands Dataset (the results were not good). I then tried the Egohands Dataset, which was a much better fit for my requirements.

The goal of this repo/post is to demonstrate how neural networks can be applied to the (hard) problem of tracking hands (egocentric and other views). Better still, it provides code that can be adapted to other use cases.

If you use this tutorial or models in your research or project, please cite this.

Here is the detector in action.

<img src="images/hand1.gif" width="33.3%"><img src="images/hand2.gif" width="33.3%"><img src="images/hand3.gif" width="33.3%"> Real-time detection on a video stream from a webcam.

<img src="images/chess1.gif" width="33.3%"><img src="images/chess2.gif" width="33.3%"><img src="images/chess3.gif" width="33.3%"> Detection on a YouTube video.

Both examples above were run on a MacBook Pro CPU (i7, 2.5GHz, 16GB). Some FPS numbers are:

| FPS | Image Size | Device | Comments |
| --- | --- | --- | --- |
| 21 | 320 * 240 | Macbook pro (i7, 2.5GHz, 16GB) | Run without visualizing results |
| 16 | 320 * 240 | Macbook pro (i7, 2.5GHz, 16GB) | Run while visualizing results (image above) |
| 11 | 640 * 480 | Macbook pro (i7, 2.5GHz, 16GB) | Run while visualizing results (image above) |
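
For readers who want to produce comparable numbers on their own hardware, here is a minimal sketch of one way to compute an average FPS figure. It is not the measurement code behind the table: `frames_per_second` and `measure` are hypothetical helper names, and `process_frame` stands in for whatever per-frame detection call you use.

```python
import time


def frames_per_second(timestamps):
    """Average FPS from a list of per-frame timestamps (in seconds)."""
    if len(timestamps) < 2:
        return 0.0
    elapsed = timestamps[-1] - timestamps[0]
    if elapsed <= 0:
        return 0.0
    # N timestamps bound N - 1 frame intervals.
    return (len(timestamps) - 1) / elapsed


def measure(process_frame, num_frames=50):
    """Call `process_frame` (hypothetical per-frame callable) num_frames times
    and return the average FPS over those calls."""
    stamps = [time.perf_counter()]
    for _ in range(num_frames):
        process_frame()
        stamps.append(time.perf_counter())
    return frames_per_second(stamps)
```

Averaging over many frames smooths out per-frame jitter, which is likely why figures with and without visualization differ in a stable way.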

Note: The code in this repo is written and tested with Tensorflow 1.4.0-rc0. Using a different version may result in some errors. You may need to generate your own frozen model graph using the model checkpoints in the repo to fit your TF version.

The tensorflow object detection repo has a python file for exporting a checkpoint to a frozen graph here. You can copy it to the current directory and use it as follows:

python3 export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path model-checkpoint/ssd_mobilenet_v1_pets.config \
    --trained_checkpoint_prefix model-checkpoint/model.ckpt-200002 \
    --output_directory hand_inference_graph

Content of this document

Motivation - Why Track/Detect hands with Neural Networks
Data preparation and network training in Tensorflow (Dataset, Import, Training)
Training the hand detection Model
Using the Detector to Detect/Track hands
Thoughts on Optimizations.

P.S. If you are using or have used the models provided here, feel free to reach out on Twitter (@vykthur) and share your work!

Update 3/5/19 - You can now use the model in the Browser using Handtrack.js

I exported the model using the Tensorflow.js converter and wrapped it in an easy-to-use JavaScript library - Handtrack.js. You can do hand tracking in 3 lines of code, with no installation and no model training, all in the browser.

Learn more below

Blog Post: Hand Tracking Interactions in the Browser using Tensorflow.js and 3 lines of code.
Github: Handtrack.js Github Repo
Live Demo: Handtrack.js Examples in the Browser

Update 15/09/2021 - Android example using a TFLite model


The trained model checkpoints are converted to the TensorFlow Lite format so that they can be used in both Android and iOS apps.

The Android app which uses the hand tracking model from this repo is available here -> shubham0204/Hand_Detection_TFLite_Android

Also, a step-by-step guide on how to convert the model checkpoints to a TFLite model (.tflite) is available as an IPYNB notebook (open it in Google Colab) -> shubham0204/Google_Colab_Notebooks/Hand_Tracking_Model_TFLite_Conversion.ipynb

Motivation - Why Track/Detect hands with Neural Networks?

There are several existing approaches to tracking hands in the computer vision domain. Many of these approaches are rule based (e.g. extracting the background based on texture and boundary features, or distinguishing between hands and background using color histograms and HOG classifiers), which makes them not very robust. For example, these algorithms might get confused if the background is unusual, if sharp changes in lighting conditions cause sharp changes in skin color, or if the tracked object becomes occluded. (See here for a review paper on hand pose estimation from the HCI perspective.)

With sufficiently large datasets, neural networks provide an opportunity to train models that perform well and address the challenges faced by existing object tracking/detection algorithms - varied/poor lighting, noisy environments, diverse viewpoints and even occlusion. The main drawbacks to their use for real-time tracking/detection are that they can be complex, that they are relatively slow compared to tracking-only algorithms, and that it can be quite expensive to assemble a good dataset. But things are changing with advances in fast neural networks.

Furthermore, this entire area of work has been made more approachable by deep learning frameworks (such as the tensorflow object detection api) that simplify the process of training a model for custom object detection. More importantly, the advent of fast neural network models like ssd, faster r-cnn and rfcn (see here) makes neural networks an attractive candidate for real-time detection (and tracking) applications. Hopefully, this repo demonstrates this.

If you are not interested in the process of training the detector, you can skip straight to applying the pretrained model I provide in detecting hands.

Training a model is a multi-stage process (assembling the dataset, cleaning it, splitting it into training/test partitions and generating an inference graph). While I lightly touch on the details of these parts, there are a few other tutorials that cover training a custom object detector using the tensorflow object detection api in more detail [see here and here]. I recommend you walk through those if you are interested in training a custom object detector from scratch.

Data preparation and network training in Tensorflow (Dataset, Import, Training)

The Egohands Dataset

The hand detector model is built using data from the Egohands Dataset. This dataset works well for several reasons. It contains high quality, pixel-level annotations (>15000 ground truth labels) of where hands are located across 4800 images. All images are captured from an egocentric view (Google Glass) across 48 different environments (indoor, outdoor) and activities (playing cards, chess, jenga, solving puzzles etc).

If you will be using the Egohands dataset, you can cite them as follows:

Bambach, Sven, et al. "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions." Proceedings of the IEEE International Conference on Computer Vision. 2015.

The Egohands dataset (zip file with labelled data) contains 48 folders of locations where video data was collected (100 images per folder).

-- LOCATION_X
  -- frame_1.jpg
  -- frame_2.jpg
  ...
  -- frame_100.jpg
  -- polygons.mat  // contains annotations for all 100 images in current folder
-- LOCATION_Y
  -- frame_1.jpg
  -- frame_2.jpg
  ...
  -- frame_100.jpg
  -- polygons.mat  // contains annotations for all 100 images in current folder
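
Because every folder in this layout reuses the same frame names, files collide once they are pooled into a single directory. Here is a minimal sketch of how unique names might be derived by prefixing the folder name. The exact naming scheme is an assumption for illustration, not necessarily what the repo's script produces, and `unique_name` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def unique_name(image_path):
    """Prefix a frame's file name with its parent folder name.

    The "<folder>_<file>" scheme is an assumption made for this sketch.
    """
    path = PurePosixPath(image_path)
    return f"{path.parent.name}_{path.name}"


# Two frames with the same file name in different folders no longer clash.
print(unique_name("LOCATION_X/frame_1.jpg"))
print(unique_name("LOCATION_Y/frame_1.jpg"))
```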

Converting data to Tensorflow Format

Some initial work needs to be done on the Egohands dataset to transform it into the format (tfrecord) which Tensorflow needs to train a model. This repo contains egohands_dataset_clean.py, a script that helps generate the csv label files from which tfrecords are built. The script does the following:

Downloads the egohands datasets
Renames all files to include their directory names to ensure each filename is unique
Splits the dataset into train (80%), test (10%) and eval (10%) folders.
Reads in polygons.mat for each folder, generates bounding boxes and visualizes them to ensure correctness (see image above).
Once the script is done running, you should have an images folder containing three folders - train, test and eval. Each of these folders should also contain a csv label document (e.g. train_labels.csv, test_labels.csv) that can be used to generate tfrecords.

python egohands_dataset_clean.py
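
Two of the steps listed above, turning a polygon annotation into a bounding box and splitting the data 80/10/10, can be sketched in plain Python. This is an illustration under stated assumptions rather than the script's actual code: `polygon_to_box` and `split_dataset` are hypothetical helpers, polygons are assumed to be lists of (x, y) points already read from polygons.mat, and the seeded shuffle is an assumption.

```python
import random


def polygon_to_box(points):
    """Return (xmin, ymin, xmax, ymax) enclosing a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def split_dataset(items, train_pct=80, test_pct=10, seed=0):
    """Shuffle items and split into (train, test, eval) lists.

    Percentages are integers to avoid float rounding; eval receives
    whatever remains after the train and test slices.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return train, test, evaluation
```

Taking the minimum and maximum of the polygon's coordinates gives the tightest axis-aligned box around a hand, which is the form a bounding-box detector trains on.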

Note: While the egohands dataset provides four separate labels for hands (own left, own right, other left, and other
