Cut and Learn for Unsupervised Image & Video Object Detection and Instance Segmentation

Cut-and-LEaRn (CutLER) is a simple approach for training object detection and instance segmentation models without human annotations. It outperforms previous SOTA by 2.7 times for AP50 and 2.6 times for AR on 11 benchmarks.

Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra
FAIR, Meta AI; UC Berkeley
CVPR 2023

[project page] [arxiv] [colab] [bibtex]

Unsupervised video instance segmentation (VideoCutLER) is also supported. We demonstrate that video instance segmentation models can be learned without using any human annotations, without relying on natural videos (ImageNet data alone is sufficient), and even without motion estimation! The code is available here.

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation
Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell
UC Berkeley; FAIR, Meta AI
CVPR 2024

[code] [PDF] [arxiv] [bibtex]

Features

We propose the MaskCut approach to generate pseudo-masks for multiple objects in an image.
CutLER can learn unsupervised object detectors and instance segmentors solely on ImageNet-1K.
CutLER exhibits strong robustness to domain shifts when evaluated on 11 different benchmarks across domains like natural images, video frames, paintings, sketches, etc.
CutLER can serve as a pretrained model for fully/semi-supervised detection and segmentation tasks.
We also propose VideoCutLER, a surprisingly simple unsupervised video instance segmentation (UVIS) method that does not rely on optical flow. ImageNet-1K is all we need for training a SOTA UVIS model!

Installation

See installation instructions.

Dataset Preparation

See Preparing Datasets for CutLER.

Method Overview

<img src="docs/pipeline.jpg" width=55%> Cut-and-Learn has two stages: 1) generating pseudo-masks with MaskCut and 2) learning unsupervised detectors from pseudo-masks of unlabeled data.

1. MaskCut

MaskCut can be used to provide segmentation masks for multiple instances of each image.
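
As a rough illustration of the "cut" step, the sketch below runs a single Normalized Cut bipartition on a patch affinity graph, the classical technique that MaskCut builds on according to the CutLER paper. It is not the code in maskcut.py: the function name ncut_bipartition, the input feature array, and the eps value are assumptions made for illustration. The real pipeline works on self-supervised ViT patch features and repeats the cut up to N times to find multiple objects, which this sketch omits.

```python
import numpy as np

def ncut_bipartition(feats, tau=0.15, eps=1e-5):
    """One Normalized Cut step over patch features (illustrative only).

    feats: (num_patches, dim) array of patch features (hypothetical input).
    Returns a boolean array assigning each patch to one of two groups.
    """
    # Cosine similarity between all pairs of patches.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    # Binarize the affinity with threshold tau; eps keeps the graph connected.
    W = np.where(sim >= tau, 1.0, eps)
    d = W.sum(axis=1)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(d)
    lap = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; use the second smallest.
    _, vecs = np.linalg.eigh(lap)
    # Map back to the generalized problem (D - W) x = lambda D x.
    fiedler = d_inv_sqrt * vecs[:, 1]
    # Split patches at the mean of the eigenvector.
    return fiedler > fiedler.mean()
```

Which of the two groups is the foreground is a separate decision that this sketch does not make.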

MaskCut Demo

Try out the MaskCut demo using Colab (no GPU needed):

Try out the web demo: (thanks to @hysts!)

If you want to run MaskCut locally, we provide demo.py, which visualizes the pseudo-masks produced by MaskCut. Run it with:

cd maskcut
python demo.py --img-path imgs/demo2.jpg \
  --N 3 --tau 0.15 --vit-arch base --patch-size 8 \
  [--other-options]

We provide a few demo images in maskcut/imgs/. To run demo.py on CPU, add "--cpu" to the command. For imgs/demo4.jpg, use "--N 6" to segment all six instances in the image. Below are some visualizations of the pseudo-masks on the demo images.

Generating Annotations for ImageNet-1K with MaskCut

To generate pseudo-masks for ImageNet-1K using MaskCut, first set up the ImageNet-1K dataset according to the instructions in datasets/README.md, then execute the following command:

cd maskcut
python maskcut.py \
--vit-arch base --patch-size 8 \
--tau 0.15 --fixed_size 480 --N 3 \
--num-folder-per-job 1000 --job-index 0 \
--dataset-path /path/to/dataset/traindir \
--out-dir /path/to/save/annotations

Generating pseudo-masks for all 1.3 million images in 1,000 folders takes a significant amount of time, so we recommend splitting the work across multiple runs. Each run processes a smaller number of image folders, selected with "--num-folder-per-job" and "--job-index". Once all runs are completed, merge the resulting json files with the following command:

python merge_jsons.py \
--base-dir /path/to/save/annotations \
--num-folder-per-job 2 --fixed-size 480 \
--tau 0.15 --N 3 \
--save-path imagenet_train_fixsize480_tau0.15_N3.json

The "--num-folder-per-job", "--fixed-size", "--tau" and "--N" of merge_jsons.py should match the ones used to run maskcut.py.
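
The job split described above can be sketched as follows. This is a minimal helper, assuming that job i processes the i-th consecutive block of "--num-folder-per-job" folders; that mapping is an assumption about maskcut.py, and the function name maskcut_job_commands is hypothetical. The flags and placeholder paths are copied from the commands above.

```python
import math

def maskcut_job_commands(num_folders=1000, folders_per_job=2):
    """Build one maskcut.py command line per job (illustrative only)."""
    # Enough jobs to cover every folder, including a partial last block.
    num_jobs = math.ceil(num_folders / folders_per_job)
    return [
        "python maskcut.py --vit-arch base --patch-size 8 "
        "--tau 0.15 --fixed_size 480 --N 3 "
        f"--num-folder-per-job {folders_per_job} --job-index {i} "
        "--dataset-path /path/to/dataset/traindir "
        "--out-dir /path/to/save/annotations"
        for i in range(num_jobs)
    ]

if __name__ == "__main__":
    for cmd in maskcut_job_commands()[:3]:
        print(cmd)
```

With 1,000 folders and two folders per job, this yields 500 commands with job indices 0 through 499.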

We also provide a submitit script to launch the pseudo-mask generation process with multiple nodes.

cd maskcut
bash run_maskcut_with_submitit.sh

After that, you can use "merge_jsons.py" to merge all these json files as described above.
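
To make the merge step concrete, here is a generic sketch of combining COCO-style annotation shards into one file. It assumes each per-job json is a dict with "images" and "annotations" lists whose ids may collide across shards; that layout is an assumption, and merge_jsons.py may handle ids and file naming differently. The function name merge_coco_shards is hypothetical.

```python
def merge_coco_shards(shards):
    """Merge COCO-style dicts, re-numbering image and annotation ids."""
    merged = {"images": [], "annotations": []}
    next_img_id, next_ann_id = 1, 1
    for shard in shards:
        # Map each shard-local image id to a new globally unique id.
        id_map = {}
        for img in shard["images"]:
            id_map[img["id"]] = next_img_id
            merged["images"].append({**img, "id": next_img_id})
            next_img_id += 1
        for ann in shard["annotations"]:
            merged["annotations"].append(
                {**ann, "id": next_ann_id, "image_id": id_map[ann["image_id"]]}
            )
            next_ann_id += 1
    return merged
```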

2. CutLER

Inference Demo for CutLER with Pre-trained Models

Try out the CutLER demo using Colab (no GPU needed):

Try out the web demo: (thanks to @hysts!)

Try out Replicate demo and the API:

To run the CutLER demos locally:

Pick a model and its config file from the model zoo, for example, model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml.
Run demo.py, which can demo the built-in configs:

cd cutler
python demo/demo.py --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN_demo.yaml \
  --input demo/imgs/*.jpg \
  [--other-options] \
  --opts MODEL.WEIGHTS /path/to/cutler_w_cascade_checkpoint

The configs are made for training, so MODEL.WEIGHTS must point to a model from the model zoo for evaluation. This command runs inference and shows visualizations in an OpenCV window.

To run on cpu, add MODEL.DEVICE cpu after --opts.
To save outputs to a directory (for images) or a file (for webcam or video), use --output.

Below are some visualizations of the model predictions on the demo images.

Unsupervised Model Learning

Before training the detector, it is necessary to use MaskCut to generate pseudo-masks for all ImageNet data. You can either use the pre-generated json file directly by downloading it from here and placing it under "DETECTRON2_DATASETS/imagenet/annotations/", or generate your own pseudo-masks by following the instructions in MaskCut.
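
A quick way to check the placement is to build the expected path and test whether the file exists. The sketch below assumes the directory layout quoted above and falls back to a local "datasets" directory when DETECTRON2_DATASETS is unset, which is detectron2's default. The file name is passed in by the caller because the name the training configs look for is not stated here; expected_annotation_path is a hypothetical helper.

```python
import os
from pathlib import Path

def expected_annotation_path(filename, root=None):
    """Return <root>/imagenet/annotations/<filename> (illustrative only)."""
    # Use the explicit root if given, else the environment variable,
    # else detectron2's default relative "datasets" directory.
    root = root or os.environ.get("DETECTRON2_DATASETS", "datasets")
    return Path(root) / "imagenet" / "annotations" / filename

if __name__ == "__main__":
    # File name taken from the merge_jsons.py example; adjust as needed.
    path = expected_annotation_path("imagenet_train_fixsize480_tau0.15_N3.json")
    print(path, "exists" if path.is_file() else "is missing")
```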

We provide a script, train_net.py, that can train all the configs provided in CutLER. To train a model with "train_net.py", first set up the ImageNet-1K dataset following datasets/README.md, then run:

cd cutler
export DETECTRON2_DATASETS=/path/to/DETECTRON2_DATASETS/
python train_net.py --num-gpus 8 \
  --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml

If you want to train a model using multiple nodes, you may need to adjust some model parameters and some SBATCH command options in "tools/train-1node.sh" and "tools/single-node_run.sh", then run:

cd cutler
sbatch tools/train-1node.sh \
  --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml \
  MODEL.WEIGHTS /path/to/dino/d2format/model \
  OUTPUT_DIR output/

You can also convert a pre-trained DINO model to detectron2's format yourself by following this link.

Self-training

We further improve performance by self-training the model on its predictions.

First, get model predictions on ImageNet by running:

# MODEL.WEIGHTS: checkpoint from the previous stage/round
# OUTPUT_DIR: path to save model predictions
python train_net.py --num-gpus 8 \
  --config-file model_zoo/configs/CutLER-ImageNet/cascade_mask_rcnn_R_50_FPN.yaml \
  --test-dataset imagenet_train \
  --eval-only TEST.DETECTIONS_PER_IMAGE 30 \
  MODEL.WEIGHTS output/model_final.pth \
  OUTPUT_DIR output/
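
Before building the self-training annotations, it can help to inspect the saved predictions. The sketch below assumes the predictions file is a COCO-style results list, where each record carries at least "image_id" and "score". It groups confident predictions by image. The 0.7 threshold and both function names are illustrative assumptions, and this is not the logic of tools/get_self_training_ann.py.

```python
import json
from collections import defaultdict

def load_predictions(path):
    """Load a COCO-style results file (a JSON list of prediction records)."""
    with open(path) as f:
        return json.load(f)

def confident_predictions(preds, min_score=0.7):
    """Group predictions scoring at least min_score by their image_id."""
    per_image = defaultdict(list)
    for pred in preds:
        if pred["score"] >= min_score:
            per_image[pred["image_id"]].append(pred)
    return dict(per_image)
```

For example, confident_predictions(load_predictions("output/inference/coco_instances_results.json")) returns a dict keyed by image id, omitting images with no confident prediction.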

Second, run the following command to generate the json file for the first round of self-training:

python tools/get_self_training_ann.py \
  --new-pred output/inference/coco_instances_results.json \ #
