Optical Flow Prediction with TensorFlow
This repo provides a TensorFlow-based implementation of the wonderful paper "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," by Deqing Sun et al. (CVPR 2018).
There are already a few attempts at implementing PWC-Net using TensorFlow out there. However, they either use outdated architectures of the paper's CNN networks, only provide TF inference (no TF training), only work on Linux platforms, or do not support multi-GPU training.
This implementation provides both TF-based training and inference. It is portable: because it doesn't use any dynamically loaded CUDA-based TensorFlow user ops, it works on Linux and Windows. It also supports multi-GPU training (the notebooks and results shown here were collected on a GTX 1080 Ti paired with a Titan X). The code also allows for mixed-precision training.
Finally, as shown in the "Links to pre-trained models" section, we achieve better results than the ones reported in the official paper on the challenging MPI-Sintel 'final' dataset.
Table of Contents
- Background
- Environment Setup
- Links to pre-trained models
- PWC-Net
- Datasets
- References
- Acknowledgments
Background
The purpose of optical flow estimation is to generate a dense 2D real-valued (u,v vector) map of the motion occurring from one video frame to the next. This information can be very useful when trying to solve computer vision problems such as object tracking, action recognition, video object segmentation, etc.
Figure [2017a] (a) below shows training pairs (black and white frames 0 and 1) from the Middlebury Optical Flow dataset, as well as their color-coded optical flow ground truth. Figure (b) indicates the color coding used for easy visualization of the (u,v) flow fields. Usually, vector orientation is represented by color hue while vector length is encoded by color saturation:

The most common measures used to evaluate the quality of optical flow estimation are angular error (AE) and endpoint error (EPE). The angular error between two optical flow vectors (u<sub>0</sub>, v<sub>0</sub>) and (u<sub>1</sub>, v<sub>1</sub>) is conventionally computed in homogeneous coordinates, as the angle arccos((u<sub>0</sub>, v<sub>0</sub>, 1) . (u<sub>1</sub>, v<sub>1</sub>, 1)) between the unit-normalized extended vectors. The endpoint error measures the distance between the endpoints of two optical flow vectors (u<sub>0</sub>, v<sub>0</sub>) and (u<sub>1</sub>, v<sub>1</sub>) and is defined as sqrt((u<sub>0</sub> - u<sub>1</sub>)<sup>2</sup> + (v<sub>0</sub> - v<sub>1</sub>)<sup>2</sup>).
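As a quick illustration, here is how these two metrics can be computed with NumPy. This is a sketch, not the repo's evaluation code; the function names `epe` and `angular_error` are ours, and AE is computed in the conventional homogeneous (u, v, 1) space so that zero-length flow vectors are well defined:

```python
import numpy as np

def epe(uv0, uv1):
    """Average endpoint error between two flow fields of shape (H, W, 2)."""
    return float(np.mean(np.linalg.norm(uv0 - uv1, axis=-1)))

def angular_error(uv0, uv1):
    """Average angular error (radians), computed between the unit-normalized
    homogeneous vectors (u, v, 1) so zero-length flows don't divide by zero."""
    u0, v0 = uv0[..., 0], uv0[..., 1]
    u1, v1 = uv1[..., 0], uv1[..., 1]
    num = 1.0 + u0 * u1 + v0 * v1
    den = np.sqrt(1.0 + u0**2 + v0**2) * np.sqrt(1.0 + u1**2 + v1**2)
    # Clip to [-1, 1] to guard against floating-point drift before arccos.
    return float(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))

# Identical flow fields yield zero error under both metrics.
gt = np.ones((4, 4, 2), dtype=np.float32)
print(epe(gt, gt))  # 0.0
```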
Environment Setup <a name="environment-setup"></a>
The code in this repo was developed and tested using Anaconda3 v.5.2.0. To reproduce our conda environment, please refer to the following files:
On Ubuntu:
On Windows:
Links to pre-trained models <a name="links"></a>
Pre-trained models can be found here. They come in two flavors: "small" (sm, with 4,705,064 learned parameters) models don't use dense or residual connections; "large" (lg, with 14,079,050 learned parameters) models do. They are all built with a 6-level pyramid, upsample the level-2 prediction by 4 in each dimension to generate the final prediction, and construct an 81-channel cost volume at each level from a search range (maximum displacement) of 4.
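To see where the 81 channels come from: a search range of d = 4 compares each feature vector in the first map against the (2·4+1)² = 81 displaced feature vectors of the second map. The brute-force NumPy sketch below illustrates the idea; the `cost_volume` helper is ours, not the repo's actual (vectorized, TF-based) correlation layer:

```python
import numpy as np

def cost_volume(f1, f2, search_range=4):
    """Brute-force correlation cost volume between two (H, W, C) feature maps.
    Returns shape (H, W, (2*search_range+1)**2) -- 81 channels for d=4."""
    h, w, c = f1.shape
    d = search_range
    padded = np.pad(f2, ((d, d), (d, d), (0, 0)))
    out = np.zeros((h, w, (2 * d + 1) ** 2), dtype=f1.dtype)
    k = 0
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = padded[dy:dy + h, dx:dx + w, :]
            # Per-pixel correlation: mean of the channel-wise product.
            out[..., k] = np.mean(f1 * shifted, axis=-1)
            k += 1
    return out

f1 = np.random.rand(8, 8, 16).astype(np.float32)
print(cost_volume(f1, f1).shape)  # (8, 8, 81)
```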
Please note that we trained these models using slightly different dataset and learning rate schedules. The official multistep schedule discussed in [2018a] is as follows: S<sub>long</sub> (1.2M iterations of training at batch size 8) followed by S<sub>fine</sub> (500k iterations of finetuning at batch size 4). Ours is S<sub>long</sub> only (1.2M iterations, batch size 8), on a mix of FlyingChairs and FlyingThings3DHalfRes. FlyingThings3DHalfRes is our own version of FlyingThings3D in which every input image pair and ground-truth flow has been downsampled by two in each dimension. We also use a different set of augmentation techniques.
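One detail worth noting when building a dataset like FlyingThings3DHalfRes: when the image pairs and ground-truth flows are downsampled by two, the flow *values* must also be halved, since displacements are measured in pixels at the new resolution. A minimal sketch (the `half_res` helper and 2x2 average pooling are our assumptions; the repo's actual resampling filter may differ):

```python
import numpy as np

def half_res(image, flow):
    """Downsample an (H, W, C) image and its (H, W, 2) ground-truth flow by 2
    in each dimension via 2x2 average pooling. Flow magnitudes are scaled by
    0.5 to stay consistent with the new pixel grid. Assumes even H and W."""
    h, w = image.shape[0], image.shape[1]
    img_small = image.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))
    flow_small = flow.reshape(h // 2, 2, w // 2, 2, 2).mean(axis=(1, 3)) * 0.5
    return img_small, flow_small
```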
Model performance
| Model name | Notebooks | FlyingChairs (384x512) AEPE | Sintel clean (436x1024) AEPE | Sintel final (436x1024) AEPE |
| :---: | :---: | :---: | :---: | :---: |
| pwcnet-lg-6-2-multisteps-chairsthingsmix | train | 1.44 (notebook) | 2.60 (notebook) | 3.70 (notebook) |
| pwcnet-sm-6-2-multisteps-chairsthingsmix | train | 1.71 (notebook) | 2.96 (notebook) | 3.83 (notebook) |
As a reference, here are the official, reported results:

Model inference times
We also measured the following MPI-Sintel (436 x 1024) inference times on a few GPUs:
| Model name | Titan X | GTX 1080 | GTX 1080 Ti |
| :---: | :---: | :---: | :---: |
| pwcnet-lg-6-2-cyclic-chairsthingsmix | 90ms | 81ms | 68ms |
| pwcnet-sm-6-2-cyclic-chairsthingsmix | 68.5ms | 64.4ms | 53.8ms |
A few clarifications about the numbers above...
First, please note that this implementation is, by design, portable, i.e., it doesn't use any user-defined CUDA kernels, whereas the official NVIDIA implementation does. Ours will work on any OS and any hardware configuration (even one without a GPU) that can run TensorFlow.
Second, the timing numbers we report are the inference times of the models trained on FlyingChairs and FlyingThings3DHalfRes. These are models you can train for longer, or finetune on an additional dataset, should you want to do so. In other words, these graphs haven't been frozen yet.
In a typical production environment, you would freeze the model after final training/finetuning and optimize the graph for whatever platform(s) you need to deploy it on, using TensorFlow XLA or TensorRT. In that important context, the inference numbers we report on unoptimized graphs are rather meaningless.
PWC-Net <a name="pwc-net"></a>
Basic Idea <a name="pwc-net-basic-idea"></a>
Per [2018a], PWC-Net improves on FlowNet2 [2016a] by adding domain knowledge to the design of the network. The basic idea behind optical flow estimation is that a pixel will retain most of its brightness over time despite a positional change from one frame to the next ("brightness constancy"). We can grab a small patch around a pixel in video frame 1 and look for the small patch in video frame 2 that maximizes some similarity function (e.g., normalized cross-correlation) of the two patches. Sliding that patch over the entire frame 2, looking for a peak, generates what's called a cost volume (the C in PWC). This technique is fairly robust (invariant to color change) but expensive to compute. In some cases, you may need a fairly large patch to reduce the number of false matches in frame 2, raising the complexity even more.
To alleviate the cost of generating the cost volume, the first optimization is to use pyramidal processing (the P in PWC). Using a lower-resolution image lets you slide a smaller patch from frame 1 over a smaller version of frame 2, yielding a smaller motion vector, then use that estimate as a hint to perform a more targeted search at the next resolution level of the pyramid. That multiscale motion estimation can be performed in the image domain or in the feature domain (i.e., using the downscaled feature maps generated by a convnet). In practice, PWC-Net warps (the W in PWC) the features of frame 2 toward frame 1 using an upsampled version of the flow estimated at the lower resolution, so that only a small motion increment has to be searched for at the next higher resolution level of the pyramid (hence allowing for a smaller search range). Here's a screenshot of a talk given by Deqing Sun that illustrates this process using a 2-level pyramid:

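The coarse-to-fine loop described above can be sketched as a toy NumPy version. Here, `estimate_residual` is a hypothetical stand-in for the per-level flow decoder (cost volume + CNN), and nearest-neighbor sampling replaces the network's differentiable bilinear warping:

```python
import numpy as np

def upsample_flow2x(flow):
    """Nearest-neighbor 2x upsampling; flow values double with the resolution."""
    return np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1) * 2.0

def warp(feat, flow):
    """Warp a (H, W, C) feature map by a (H, W, 2) flow. Nearest-neighbor
    sampling for brevity; the real network uses bilinear sampling."""
    h, w, _ = feat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return feat[y, x]

def coarse_to_fine(pyr1, pyr2, estimate_residual):
    """pyr1/pyr2: feature pyramids for frames 1 and 2, coarsest level first.
    estimate_residual(f1, f2_warped) -> (H, W, 2) residual flow (hypothetical
    hook standing in for the cost volume + decoder at each level)."""
    h, w, _ = pyr1[0].shape
    flow = np.zeros((h, w, 2), dtype=np.float32)
    for lvl, (f1, f2) in enumerate(zip(pyr1, pyr2)):
        if lvl > 0:
            flow = upsample_flow2x(flow)               # hint from coarser level
        f2_warped = warp(f2, flow)                     # W: warp toward frame 1
        flow = flow + estimate_residual(f1, f2_warped) # small residual search
    return flow
```

With a residual estimator that always returns zero, the loop simply doubles the flow resolution at each level; the interesting work happens inside the per-level decoder.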
Note that none of the three optimizations used here (P/W/C) are unique to PWC-Net. These are techniques that were also used in SpyNet [2016b] and FlowNet2 [2016a]. However, here, they are used on the CNN features, rather than on an image pyramid:

The authors also note that careful data augmentation (e.g., adding horizontal flipping) was necessary to reach the best performance. To improve robustness, they also recommend training on multiple datasets (Sintel+KITTI+HD1K, for example), with careful rebalancing to compensate for dataset imbalance.
Since this algorithm only works on two c
