
<p align="center"> <img src="demo/logo.png" width="200" height="100"> </p>

LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking

Update 4/19/2020:

The paper will appear in the CVPR 2020 Workshop on Towards Human-Centric Image/Video Synthesis and the 4th Look Into Person (LIP) Challenge.

Update 5/16/2019: Add Camera Demo

[Project Page] [Paper] [GitHub]

With the provided code, you can easily:

  • Perform online pose tracking on a live webcam.
  • Perform online pose tracking on arbitrary videos.
  • Replicate ablation study experiments on PoseTrack'18 Validation Set.
  • Train models on your own.
  • Replace pose estimators or improve data association modules for future research.

Real-life Application Scenarios:


Overview

LightTrack is an effective, light-weight framework for human pose tracking that is truly online and generic for top-down pose tracking. The code for the paper includes the LightTrack framework as well as its replaceable component modules (detector, pose estimator, and matcher), which largely borrow from or adapt Cascaded Pyramid Networks [[1]], PyTorch-YOLOv3, st-gcn, and OpenSVAI [[3]].


In contrast to Visual Object Tracking (VOT) methods, in which the visual features are implicitly represented by kernels or CNN feature maps, we track each human pose by recursively updating the bounding box and its corresponding pose in an explicit manner. The bounding box region of a target is inferred from the explicit features, i.e., the human keypoints, which can be regarded as a set of special visual features. The advantages of using pose as explicit features include:

  • (1) The explicit features are human-related and interpretable, and have a very strong and stable relationship with the bounding box position. Human pose enforces a direct constraint on the bounding box region.

  • (2) The task of pose estimation and tracking requires human keypoints to be predicted in the first place. Reusing the predicted keypoints to track the ROI region is therefore almost free, and this mechanism is what makes online tracking possible.

  • (3) It naturally keeps the identity of the candidates, which greatly alleviates the burden of data association in the system. Even when data association is necessary, we can re-use the pose features for skeleton-based pose matching. (Here we adopt Siamese Graph Convolutional Networks (SGCN) for efficient identity association.)

Single Pose Tracking (SPT) and Single Visual Object Tracking (VOT) are thus incorporated into one unified functioning entity, easily implemented by a replaceable single-person human pose estimation module. Below is a simple step-by-step explanation of how the LightTrack framework works.
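The recursive bbox-from-keypoints update can be sketched as follows: given the keypoints estimated in frame t, the region to crop in frame t+1 is the keypoints' bounding box enlarged by a relative margin. This is a minimal illustration only; the `enlarge` factor and the exact enlargement rule used in the repository may differ.

```python
def bbox_from_keypoints(keypoints, enlarge=0.2):
    """Infer a tracking bbox from 2D keypoints by taking their
    bounding box and enlarging it by a relative margin.

    keypoints: list of (x, y) tuples for one person.
    Returns (x_min, y_min, x_max, y_max).
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    w, h = x_max - x_min, y_max - y_min
    return (x_min - enlarge * w, y_min - enlarge * h,
            x_max + enlarge * w, y_max + enlarge * h)

# Frame t: estimate the pose inside the current bbox, then derive
# the bbox used to crop frame t+1 -- no detector call needed.
```

This is why detection is only required at keyframes: between keyframes, the pose estimator's own output keeps the tracklet alive.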

(1) Detection only at the 1st frame. Blue bboxes indicate tracklets inferred from keypoints.

(2) Detection every 10 frames. The red bbox indicates a keyframe detection.

(3) Detection every 10 frames for multi-person tracking:

  • At non-keyframes, IDs are naturally kept for each person;
  • At keyframes, IDs are associated via spatial consistency.
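Spatial consistency at keyframes can be implemented as greedy IoU matching between the keyframe detections and the tracked boxes from the previous frame. The sketch below is illustrative only; the threshold and tie-breaking are hypothetical, not the repository's exact logic.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_ids(tracked, detections, thresh=0.3):
    """Greedily assign existing track IDs to keyframe detections.

    tracked: dict {track_id: bbox} from the previous frame.
    detections: list of bboxes detected at the keyframe.
    Returns a list of (detection_index, track_id or None);
    None means a new identity should be created.
    """
    free = dict(tracked)  # track IDs not yet claimed
    result = []
    for i, det in enumerate(detections):
        best_id, best_iou = None, thresh
        for tid, box in free.items():
            score = iou(det, box)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is not None:
            del free[best_id]  # each ID is assigned at most once
        result.append((i, best_id))
    return result
```

When spatial overlap is ambiguous (e.g. after occlusion), the SGCN pose matcher described above takes over for identity association.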

For more technical details, please refer to our arXiv paper.

Prerequisites

  • Set up a Python3 environment with provided anaconda environment file.
    # This anaconda environment should contain everything needed, including tensorflow, pytorch, etc.
    conda env create -f environment.yml
    

(Optional: set up the environment on your own)

  • Install PyTorch 1.0.0 (or higher) and TorchVision. (Required by the Siamese Graph Convolutional Network.)
  • Install TensorFlow 1.12; TensorFlow 2.0 has not been tested yet. (Required by the human pose estimator.)
  • Install some other packages:
    pip install cython opencv-python pillow matplotlib
    

Getting Started

  • Clone this repository and enter the ~~dragon~~ lighttrack folder:
    git clone https://github.com/Guanghan/lighttrack.git;
    
    # build some necessities
    cd lighttrack/lib;
    make;
    
    cd ../graph/torchlight;
    python setup.py install
    
    # enter lighttrack
    cd ../../
    
    
  • If you'd like to train LightTrack, first download the COCO dataset and the PoseTrack dataset. Note that the COCO script will take a while and dump 21 GB of files into ./data/coco. With the PoseTrack dataset you can replicate our ablation experiment results on the validation set; to submit test results to the evaluation server, you will need to register at the official website and create entries.
    sh data/download_coco.sh
    sh data/download_posetrack17.sh
    sh data/download_posetrack18.sh
    

Demo on Live Camera

| PoseTracking Framework | Keyframe Detector | Keyframe ReID Module | Pose Estimator | FPS |
|:----------------------:|:-----------------:|:--------------------:|:------------------:|:---------:|
| LightTrack | YOLOv3 | Siamese GCN | MobileNetv1-Deconv | 220* / 15 |

  • Download weights.

    cd weights;
    bash ./download_weights.sh  # download weights for backbones (only for training), detectors, pose estimators, pose matcher, etc.
    cd -;
    
  • Perform pose tracking demo on your Webcam.

    # access virtual environment
    source activate py36;
    
    # Perform LightTrack demo (on camera) with light-weight detector and pose estimator
    python demo_camera_mobile.py
    

Demo on Arbitrary Videos

| PoseTracking Framework | Keyframe Detector | Keyframe ReID Module | Pose Estimator | FPS |
|:----------------------:|:-----------------:|:--------------------:|:------------------:|:---------:|
| LightTrack | YOLOv3 | Siamese GCN | MobileNetv1-Deconv | 220* / 15 |

  • Download demo video.

    cd data/demo;
    bash ./download_demo_video.sh  # download the video for demo; you could later replace it with your own video for fun
    cd -;
    
  • Perform online tracking demo.

    # access virtual environment
    source activate py36;
    
    # Perform LightTrack demo (on arbitrary video) with light-weight detector and pose estimator
    python demo_video_mobile.py
    
  • After processing, pose tracking results are stored in standardized OpenSVAI-format JSON files, located at data/demo/jsons/.

  • Visualized images and videos are written to data/demo/visualize/ and data/demo/videos/. Note that by default the video is output at the actual average framerate; you can hardcode a faster or slower rate for different purposes.

  • Some statistics will also be reported, including FPS, number of persons encountered, etc. Below are the statistics for the provided video, using YOLOv3 as the detector and MobileNetv1-Deconv as the pose estimator.

total_time_ALL: 19.99s
total_time_DET: 1.32s
total_time_POSE: 18.63s
total_time_LIGHTTRACK: 0.04s
total_num_FRAMES: 300
total_num_PERSONS: 600

Average FPS: 15.01fps
Average FPS excluding Pose Estimation: 220.08fps
Average FPS excluding Detection: 16.07fps
Average FPS for framework only: 7261.90fps
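The averages follow from the timing totals above (approximately, since the printed totals are rounded): 300 frames over 19.99 s gives ~15.01 FPS overall, and subtracting a component's time from the total gives the FPS excluding that component.

```python
# Reproduce the reported averages from the timing totals above.
frames = 300
t_all, t_det, t_pose, t_track = 19.99, 1.32, 18.63, 0.04

fps_all = frames / t_all                 # overall throughput
fps_no_pose = frames / (t_all - t_pose)  # excluding pose estimation
fps_no_det = frames / (t_all - t_det)    # excluding detection
fps_framework = frames / t_track         # LightTrack association only

print(round(fps_all, 2), round(fps_no_pose, 2),
      round(fps_no_det, 2), round(fps_framework, 2))
```

The pose estimator clearly dominates the runtime, which is why swapping in a lighter estimator is the most effective way to speed up the pipeline.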

You can replace the demo video with your own for fun. You can also try different detectors or pose estimators.

Validate on PoseTrack 2018

Pose estimation models have been provided; they should already be in the ./weights folder after running the ./download_weights.sh script. We provide two alternatives, CPN101 and MSRA152, pre-trained with ResNet-101 and ResNet-152 backbones, respectively.

| Image Size | Pose Estimator | Weights |
|:----------:|:--------------:|------------------------|
| 384x288 | CPN101 [[1]] | CPN_snapshot_293.ckpt |
| 384x288 | MSRA152 [[2]] | MSRA_snapshot_285.ckpt |

Detections for the PoseTrack'18 validation set have been pre-computed. We use the same detections from [[3]] in our experiments. Two options are available.
