Scan2Cap
[CVPR 2021] Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
<p align="center"><img src="demo/Scan2Cap.gif" width="600px"/></p>

Introduction
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU improvement).
Please also check out the project website here.
For additional detail, please see the Scan2Cap paper:
"Scan2Cap: Context-aware Dense Captioning in RGB-D Scans"
by Dave Zhenyu Chen, Ali Gholami, Matthias Nießner and Angel X. Chang
from Technical University of Munich and Simon Fraser University.
News
- [08/22/2022] We launched the Scan2Cap Dense Captioning Benchmark. Come check it out!
- [08/22/2022] We released a new implementation of Scan2Cap with 1) 8x faster training; 2) revised evaluation metrics; 3) a benchmark toolbox. Please see more details in the faster-captioning repo.
:star2: Benchmark Challenge :star2:
We provide the Scan2Cap Benchmark Challenge for benchmarking your model automatically on the hidden test set! Learn more at our benchmark challenge website.
After training the model, please download the benchmark data and put the unzipped ScanRefer_filtered_test.json under data/. Then, you can run the following script to generate predictions:
python benchmark/predict.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10
Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json. The generated predictions are stored in outputs/<folder_name>/pred.json.
For submitting the predictions, please compress the pred.json as a .zip or .7z file and follow the instructions to upload your results.
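The compression step can also be scripted. The following is a minimal sketch using Python's standard zipfile module; the helper name zip_predictions is hypothetical, and storing pred.json at the root of the archive is an assumption, so please check the benchmark instructions for the expected layout.

```python
import zipfile
from pathlib import Path


def zip_predictions(pred_path, zip_path=None):
    """Compress a prediction file into a .zip archive and return the archive path.

    Hypothetical helper for illustration; not part of the Scan2Cap repository.
    """
    pred_path = Path(pred_path)
    if not pred_path.is_file():
        raise FileNotFoundError(f"prediction file not found: {pred_path}")
    # Default: write pred.zip next to pred.json.
    zip_path = Path(zip_path) if zip_path else pred_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Assumption: the file is stored at the archive root, without parent folders.
        archive.write(pred_path, arcname=pred_path.name)
    return zip_path
```

For example, zip_predictions("outputs/<folder_name>/pred.json") would write pred.zip into the same folder.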
Local Benchmarking on Val Set
Before submitting the results on the test set to the official benchmark, you can also benchmark the performance on the val set. Run the following script to generate the ground truth (GT) for the val set first:
python scripts/build_benchmark_gt.py --split val
NOTE: don't forget to change the DATA_ROOT in scripts/build_benchmark_gt.py
Generate the predictions on the val set:
python benchmark/predict.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --test_split val
Evaluate the predictions on the val set:
python benchmark/eval.py --split val --path <path to predictions> --verbose
(Optional) Compile accelerated generalized IoU for faster evaluation:
python cython_compile.py build_ext --inplace
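The val-set steps above (GT generation, prediction, evaluation) can be chained from a small driver script. This is only a sketch: the function names are hypothetical, the flags are copied from the commands above and must match your own training run, and pred_path is left as a parameter because the location of the val predictions should be checked against your output folder.

```python
import subprocess
import sys

# Flags copied from the commands above; they must match the training run.
MODEL_FLAGS = [
    "--use_multiview", "--use_normal", "--use_topdown", "--use_relation",
    "--num_graph_steps", "2", "--num_locals", "10",
]


def val_benchmark_commands(output_folder, pred_path):
    """Return the three val-set commands in the order they should run."""
    python = sys.executable
    return [
        [python, "scripts/build_benchmark_gt.py", "--split", "val"],
        [python, "benchmark/predict.py", "--folder", output_folder,
         *MODEL_FLAGS, "--test_split", "val"],
        [python, "benchmark/eval.py", "--split", "val",
         "--path", pred_path, "--verbose"],
    ]


def run_val_benchmark(output_folder, pred_path):
    """Run the commands from the repository root, stopping at the first failure."""
    for command in val_benchmark_commands(output_folder, pred_path):
        subprocess.run(command, check=True)
```

Keeping the model flags in one list makes it harder for the prediction flags to drift away from the ones used for training.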
Data
ScanRefer
If you would like access to the ScanRefer dataset, please fill out this form. Once your request is accepted, you will receive an email with the download link.
Note: In addition to language annotations in ScanRefer dataset, you also need to access the original ScanNet dataset. Please refer to the ScanNet Instructions for more details.
Download the dataset by simply executing the wget command:
wget <download_link>
Scan2CAD
Learning the relative object orientations in the relational graph requires the CAD model alignment annotations from Scan2CAD; please refer to the Scan2CAD official release (you need ~8MB on your disk). Once the data is downloaded, extract the zip file under data/ and change the path to the Scan2CAD annotations (CONF.PATH.SCAN2CAD) in lib/config.py. As Scan2CAD doesn't cover all instances in ScanRefer, please download the mapping file and place it under CONF.PATH.SCAN2CAD. Parse the raw Scan2CAD annotations with the following command:
python scripts/Scan2CAD_to_ScanNet.py
Setup
Please execute the following command to install PyTorch 1.8:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
Install the necessary packages listed in requirements.txt:
pip install -r requirements.txt
And don't forget to refer to PyTorch Geometric to install the graph support.
After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:
cd lib/pointnet2
python setup.py install
Before moving on to the next step, please don't forget to set CONF.PATH.BASE in lib/config.py to the project root path.
Data preparation
- Download the ScanRefer dataset and unzip it under data/.
  - You might want to run `python scripts/organize_scanrefer.py` to organize the data a bit.
- Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
- Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.

- Pre-process the ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB of free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py
<!-- 5. (Optional) Download the preprocessed [multiview features (~36GB)](http://kaldir.vc.in.tum.de/enet_feats.hdf5) and put it under `data/scannet/scannet_data/`. -->
After this step, you can check if the processed scene data is valid by running:
python visualize.py --scene_id scene0000_00
- (Optional) Pre-process the multiview features from ENet.
  a. Download the ENet pretrained weights (1.4MB) and put them under data/.
  b. Download and decompress the extracted ScanNet frames (~13GB).
  c. Change the data paths in config.py marked with TODO accordingly.
  d. Extract the ENet features:
python scripts/compute_multiview_features.py
  e. Project the ENet features from the ScanNet frames to the point clouds; you need ~36GB to store the generated HDF5 database:
python scripts/project_multiview_features.py --maxpool
  You can check whether the projections make sense by projecting the semantic labels from the images to the target point cloud:
python scripts/project_multiview_labels.py --scene_id scene0000_00 --maxpool
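Before starting the pre-processing, it can help to confirm that the scene folders are where the scripts expect them. Below is a minimal sketch that lists the scene folders under data/scannet/scans/; the helper name is hypothetical, and the naming pattern (scene, four digits, underscore, two digits) is an assumption inferred from the example name scene0000_00.

```python
import re
from pathlib import Path

# Assumed naming pattern, inferred from the example "scene0000_00".
SCENE_NAME = re.compile(r"^scene\d{4}_\d{2}$")


def list_scene_dirs(scans_root="data/scannet/scans"):
    """Return the sorted names of sub-folders that look like ScanNet scene IDs."""
    root = Path(scans_root)
    if not root.is_dir():
        return []
    # is_dir() follows symlinks, so a linked scans/ folder is handled as well.
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and SCENE_NAME.match(entry.name)
    )


if __name__ == "__main__":
    scenes = list_scene_dirs()
    print(f"found {len(scenes)} scene folders")
```

An empty result usually means that scans/ was placed or linked at a different path than data/scannet/scans/.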
Usage
End-to-End training for 3D dense captioning
Run the following script to start the end-to-end training of the Scan2Cap model using the multiview features and normals. For more training options, please run scripts/train.py -h:
python scripts/train.py --use_multiview --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50
The trained model as well as the intermediate results will be dumped into outputs/<output_folder>. To evaluate the model (@0.5IoU), please run the following script with <output_folder> changed accordingly; note that the arguments must match the ones used for training:
python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_caption --min_iou 0.5
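To make the @0.5IoU threshold concrete, here is a minimal sketch of the IoU between two axis-aligned 3D boxes. It is an illustration only, not the repository's evaluation code; the function names and the (xmin, ymin, zmin, xmax, ymax, zmax) box layout are assumptions made for this example.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes (assumed layout above)."""
    intersection = 1.0
    for axis in range(3):
        # Overlap interval along this axis.
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            # No overlap along one axis means no overlap at all.
            return 0.0
        intersection *= high - low
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union


# Two 2x2x2 boxes shifted by 1 along x: overlap volume 4, union 12, IoU = 1/3.
iou = box_iou_3d((0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2))
print(iou >= 0.5)  # False
```

Under this reading, a predicted box counts at --min_iou 0.5 only when its overlap with the ground-truth box reaches the threshold; whether the comparison is strict is not stated here, so treat that detail as an assumption.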
Evaluating the detection performance:
python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection
You can even evaluate the pretrained object detection backbone:
python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection --eval_pretrained
If you want to visualize the results, please run this script to generate bounding boxes and descriptions for scene <scene_id> to outputs/<output_folder>:
python scripts/visualize.py --folder <output_folder> --scene_id <scene_id> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10
Note that you need to run `python scripts/export_scannet_axis_aligned_me
