Evaluating Weakly Supervised Object Localization Methods Right (CVPR 2020)

CVPR 2020 paper | TPAMI paper

Junsuk Choe1,3*, Seong Joon Oh2*, Seungho Lee1, Sanghyuk Chun3, Zeynep Akata4, Hyunjung Shim1
* Equal contribution

1 School of Integrated Technology, Yonsei University
2 Clova AI Research, LINE Plus Corp. 3 Clova AI Research, NAVER Corp. 4 University of Tübingen

Weakly-supervised object localization (WSOL) has gained popularity in recent years for its promise to train localization models with only image-level labels. Since the seminal WSOL work of class activation mapping (CAM), the field has focused on how to expand the attention regions to cover objects more broadly and localize them better. However, these strategies rely on full localization supervision to validate hyperparameters and select models, which is in principle prohibited under the WSOL setup. In this paper, we argue that the WSOL task is ill-posed with only image-level labels, and propose a new evaluation protocol where full supervision is limited to a small held-out set that does not overlap with the test set. We observe that, under our protocol, the five most recent WSOL methods have not made a major improvement over the CAM baseline. Moreover, we report that existing WSOL methods have not reached the few-shot learning baseline, where the full supervision at validation time is used for model training instead. Based on our findings, we discuss some future directions for WSOL.

Overview of WSOL performances 2016-2019. The figure shows that recent improvements in WSOL are illusory due to (1) different amounts of implicit full supervision through validation and (2) a fixed score-map threshold to generate object boxes. Under our evaluation protocol, with the same validation set size and an oracle threshold for each method, CAM is still the best. In fact, our few-shot learning baseline, i.e., using the validation supervision (10 samples/class) at training time, outperforms existing WSOL methods.
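To make the difference between a fixed and an oracle score-map threshold concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the released evaluation code: the function names are hypothetical, the box is simply the tightest rectangle around all above-threshold pixels (the actual implementation may extract boxes differently), and the IoU criterion of 0.5 is an assumed default.

```python
def box_from_score_map(score_map, threshold):
    """Tightest inclusive box (x0, y0, x1, y1) around pixels scoring >= threshold."""
    coords = [(x, y)
              for y, row in enumerate(score_map)
              for x, score in enumerate(row)
              if score >= threshold]
    if not coords:
        return None  # nothing survives the threshold
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def iou(box_a, box_b):
    """Intersection over union of two inclusive pixel boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    return inter / (area_a + area_b - inter)


def box_accuracy(score_maps, gt_boxes, threshold, iou_threshold=0.5):
    """Fraction of images whose predicted box overlaps the ground truth enough."""
    hits = 0
    for score_map, gt_box in zip(score_maps, gt_boxes):
        box = box_from_score_map(score_map, threshold)
        if box is not None and iou(box, gt_box) >= iou_threshold:
            hits += 1
    return hits / len(gt_boxes)


def oracle_box_accuracy(score_maps, gt_boxes, thresholds):
    """Best (accuracy, threshold) pair over all candidate thresholds."""
    scored = [(box_accuracy(score_maps, gt_boxes, t), t) for t in thresholds]
    return max(scored, key=lambda pair: pair[0])


# Toy 4x4 score map with a 2x2 object in the middle.
toy_map = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.3],
    [0.1, 0.1, 0.3, 0.1],
]
toy_gt = (1, 1, 2, 2)
```

On this toy input a fixed threshold of 0.2 yields a box that is too large and counts as a miss, while searching over thresholds recovers the correct box. This is why comparing methods at one fixed threshold can understate or overstate a method whose score maps are calibrated differently.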

Updates

9 Jul, 2020: Journal submission available.
27 Mar, 2020: New WSOL evaluations with MaxBoxAccV2 are added.
28 Feb, 2020: New box evaluation (MaxBoxAccV2) is available.
22 Jan, 2020: Initial upload.

1. Our dataset contribution
- The dataset splits
2. Dataset downloading and license
3. Code dependencies
4. WSOL evaluation
5. Library of WSOL methods
6. WSOL training and evaluation
7. Code license
8. How to cite

1. Our dataset contribution

WSOL is an ill-posed problem when only image-level labels are available (see the paper for the argument). To solve the WSOL task, a certain amount of full supervision is inevitable, and prior WSOL approaches have used different amounts of implicit and explicit full supervision (usually through validation). We propose to use a fixed amount of full supervision per method by carefully designing validation splits (called train-fullsup in the paper), so that different methods use the same amount of localization-labelled validation data.

In this section, we explain how each dataset is split, and introduce our data contributions (image collections and new annotations) on the way.

The dataset splits

split         | ImageNet         | CUB                  | OpenImages
--------------|------------------|----------------------|----------------------
train-weaksup | ImageNet "train" | CUB-200-2011 "train" | OpenImages30k "train"
train-fullsup | ImageNetV2       | CUBV2                | OpenImages30k "val"
test          | ImageNet "val"   | CUB-200-2011 "test"  | OpenImages30k "test"

We propose three disjoint splits for every dataset: train-weaksup, train-fullsup, and test. The train-weaksup split contains images with weak supervision (image-level labels). The train-fullsup split contains images with full supervision (either bounding boxes or binary masks). The user is free to use it for hyperparameter search, model selection, ablation studies, or even model fitting. The test split contains images with full supervision; it must be used only for the final performance report. For example, checking the test results multiple times with different model configurations violates the protocol, because the learner implicitly uses more full supervision than allowed. The splits and their roles are explained more extensively in the paper.
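The split roles can be written down as a small lookup with a guard against misuse. This is only a sketch: the dictionary names, the purpose strings, and `resolve_split` are hypothetical and do not correspond to anything in the released code. The source names are taken from the table of dataset splits.

```python
# Which original data backs each proposed split (from the table of dataset splits).
SPLIT_SOURCES = {
    "ImageNet": {
        "train-weaksup": 'ImageNet "train"',
        "train-fullsup": "ImageNetV2",
        "test": 'ImageNet "val"',
    },
    "CUB": {
        "train-weaksup": 'CUB-200-2011 "train"',
        "train-fullsup": "CUBV2",
        "test": 'CUB-200-2011 "test"',
    },
    "OpenImages": {
        "train-weaksup": 'OpenImages30k "train"',
        "train-fullsup": 'OpenImages30k "val"',
        "test": 'OpenImages30k "test"',
    },
}

# Hypothetical purpose labels encoding the protocol described in the text.
ALLOWED_PURPOSES = {
    "train-weaksup": {"training"},
    "train-fullsup": {"hyperparameter search", "model selection",
                      "ablation", "training"},
    "test": {"final report"},
}


def resolve_split(dataset, split, purpose):
    """Return the data source for a split, refusing uses the protocol forbids."""
    if purpose not in ALLOWED_PURPOSES[split]:
        raise ValueError(f"{split!r} must not be used for {purpose!r}")
    return SPLIT_SOURCES[dataset][split]
```

For instance, `resolve_split("CUB", "train-fullsup", "model selection")` returns `"CUBV2"`, whereas requesting the test split for model selection raises a `ValueError`.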

ImageNet
- "train" and "val" splits of original ImageNet are treated as our train-weaksup and test.
- ImageNetV2 is treated as our train-fullsup. Note that we have annotated bounding boxes on ImageNetV2.
CUB
- "train" and "test" splits of original CUB-200-2011 are treated as our train-weaksup and test.
- We contribute images and annotations that are similar to the original CUB, namely CUBV2.
OpenImages
- We curate the existing OpenImagesV5 for the task of WSOL.
- We have randomly selected images from the original "train", "val", and "test" splits of the instance segmentation subset.

2. Dataset downloading and license

For the original ImageNet and CUB datasets, please follow the common procedure to download them. In this section, we only explain how to obtain the less commonly used (or never used before) datasets. We also provide the license status for each dataset. This section is for those who are interested in the full data for each dataset. If you only need the data for WSOL evaluation and/or training, please follow the links below:

Evaluation only
Training + evaluation

ImageNetV2

Download images

We use the 10,000 images in the Threshold0.7 split of ImageNetV2 as our train-fullsup split. We have annotated bounding boxes on those images. The box labels are available here and are licensed by NAVER Corp. under Attribution 2.0 Generic (CC-BY-2.0).

CUBV2

Download images

We have collected and annotated CUBV2 ourselves as the train-fullsup split. We have ensured that the data distribution follows the original CUB dataset and that there are no duplicate images. We have collected 5 images per class (1,000 images in total) from Flickr. The box labels and license files for all images are available here. Both class and box labels are licensed by NAVER Corp. under Attribution 2.0 Generic (CC-BY-2.0).
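A collection like this can be sanity-checked with two simple tests: no two files are identical, and every class has the expected number of images (5 per class and 1,000 in total implies 200 classes, matching CUB-200-2011). The sketch below is an assumption about how such a check could look, not the authors' procedure; the function names are hypothetical, and hashing only catches byte-identical files, not resized or re-encoded copies.

```python
import hashlib
from collections import Counter, defaultdict


def find_exact_duplicates(images):
    """Group file names whose contents are byte-identical.

    `images` maps a file name to its raw bytes.
    """
    by_digest = defaultdict(list)
    for name, data in images.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def classes_with_wrong_count(labels, per_class=5):
    """Return {class: count} for classes that do not have `per_class` images.

    `labels` maps a file name to its class label.
    """
    counts = Counter(labels.values())
    return {cls: n for cls, n in counts.items() if n != per_class}
```

Both functions return an empty container when the collection passes the check, so they can be used directly in assertions.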

OpenImages30k

Download images
Download segmentation masks

The WSOL community has relied on the ImageNet and CUB datasets for at least the last three years. It is perhaps time for us to move on. We provide a WSOL benchmark based on the OpenImages30k dataset to offer a new perspective on the generalizability of past and future WSOL methods. To make it suitable for the WSOL task, we use 100 classes to ensure a minimum number of single-class samples for each class. We have randomly selected 29,819, 2,500, and 5,000 images from the original "train", "val", and "test" splits of OpenImagesV5, respectively. The corresponding metadata can be found here. The annotations are licensed by Google LLC under Attribution 4.0 International (CC-BY-4.0). The images are listed as having an Attribution 2.0 Generic (CC-BY-2.0) license.
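One way to read the class-selection criterion is as a filter: keep only the classes that appear alone in enough images. The exact selection procedure is not spelled out here, so the sketch below is an illustrative assumption; the function names and the `min_samples` parameter are hypothetical.

```python
from collections import defaultdict


def single_class_images(image_classes):
    """Map each class to the images in which it is the only annotated class.

    `image_classes` maps an image id to the set of classes annotated in it.
    """
    per_class = defaultdict(list)
    for image_id, classes in image_classes.items():
        if len(classes) == 1:
            (cls,) = classes  # unpack the single element
            per_class[cls].append(image_id)
    return per_class


def select_classes(image_classes, min_samples):
    """Classes with at least `min_samples` single-class images, sorted by name."""
    per_class = single_class_images(image_classes)
    return sorted(cls for cls, ids in per_class.items()
                  if len(ids) >= min_samples)
```

With four toy images, two containing only a cat, one only a dog, and one both, `select_classes(..., min_samples=2)` keeps only the cat class.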

Dataset statistics

The tables below summarize the dataset statistics.
