<p align="center"> <h1 align="center">Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding</h1> <p align="center"> Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li </p> <h2 align="center">CVPR 2024</h2> <p align="center"> <a href='https://arxiv.org/abs/2311.15383'> <img src='https://img.shields.io/badge/Paper-PDF-red?style=flat&logo=arXiv&logoColor=red' alt='Paper PDF'> </a> <a href='https://curryyuan.github.io/ZSVG3D/' style='padding-left: 0.5rem;'> <img src='https://img.shields.io/badge/Project-Page-blue?style=flat&logo=Google%20chrome&logoColor=blue' alt='Project Page'> </a> </p> </p>  <p align="center"> <img src="docs/static/images/figure_1.png" alt="Logo" width="80%"> </p>

Comparative overview of two 3DVG approaches. (a) Supervised 3DVG takes 3D scans and text queries as input and is guided by object-text pair annotations. (b) Zero-shot 3DVG identifies the location of target objects using a programmatic representation generated by LLMs, i.e., target category, anchor category, and relation grounding. This highlights its strength in decoding spatial relations and object identifiers within a given space. For example, the location of the keyboard (outlined in green) can be retrieved based on the distance between the keyboard and the door (outlined in blue).
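
The keyboard and door example in the caption can be read as a small three-step program: locate the target category, locate the anchor category, then ground the relation. The snippet below is only an illustrative sketch of that idea, not the code in this repository. The scene data and the `loc` and `closest` helpers are hypothetical, and the relation is reduced to Euclidean distance between object centers.

```python
import math

# Hypothetical scene: (label, center_xyz) pairs. Real scenes would come from
# 3D instance segments; these values are made up for illustration.
scene = [
    ("keyboard", (1.0, 0.5, 0.8)),
    ("keyboard", (4.0, 3.0, 0.8)),
    ("door", (0.0, 0.0, 1.0)),
    ("desk", (1.2, 0.6, 0.4)),
]


def loc(objects, category):
    """Return the indices of all objects with the given category label."""
    return [i for i, (label, _) in enumerate(objects) if label == category]


def closest(objects, targets, anchors):
    """Pick the target index whose center is nearest to any anchor center."""
    def dist_to_anchors(t):
        return min(math.dist(objects[t][1], objects[a][1]) for a in anchors)
    return min(targets, key=dist_to_anchors)


# "The keyboard closest to the door", written as a three-step program.
targets = loc(scene, "keyboard")          # target category
anchors = loc(scene, "door")              # anchor category
answer = closest(scene, targets, anchors)  # relation grounding
print(answer, scene[answer])
```

Splitting the query this way keeps each step simple: category lookup handles object identity, and the relation function handles spatial reasoning.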

Instructions

Environment

The code is tested on Ubuntu 18.04 with the following packages, but it should also work with other versions.

python==3.9.12
pytorch==1.11.0
pytorch3d==0.7.2
langchain==0.2.1
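
To see how a local environment compares with the tested versions above, a small check like the following may help. It is a sketch, not part of the repository: the distribution names `torch`, `pytorch3d`, and `langchain` are assumed to match how the packages were installed, and local build tags such as `+cu113` are ignored when comparing.

```python
from importlib import metadata

# Versions listed in this README. The distribution names are assumptions
# about how the packages were installed (e.g. PyTorch is published as "torch").
TESTED_VERSIONS = {
    "torch": "1.11.0",
    "pytorch3d": "0.7.2",
    "langchain": "0.2.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu113" when comparing.
        base = found.split("+")[0] if found else None
        report[name] = (want, found, base == want)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in compare_versions(TESTED_VERSIONS).items():
        status = "ok" if ok else "differs"
        print(f"{name}: tested {want}, installed {found} ({status})")
```

A mismatch is not necessarily a problem, since other versions may also work; the report only shows where an environment differs from the tested one.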

Zero-shot evaluation

Evaluation on Nr3d

Download our preprocessed 3D features from here and place them under the data/scannet folder.

Run the following command:

python visprog_nr3d.py --prog_path data/nr3d_val.json

Evaluation on ScanRefer

Download our preprocessed Mask3D predictions from here.

Run the following command:

python visprog_scanrefer.py --prog_path data/scanrefer_val.json

Using BLIP2 for LOC module

By default, only the 3D-only LOC module is used. To use the BLIP2 models for the LOC module, change the loc argument of ProgramInterpreter to LOC_BLIP in visprog_nr3d.py and to LOC_BLIP_pred in visprog_scanrefer.py.

You also need to download our preprocessed cropped images from GT instance and Mask3D prediction, and change image_path to the path where you downloaded them.

Visual Programming Generation

We provide a script for visual program generation. Replace the OpenAI key with your own before running it.

python gen_visprog.py
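
If you would rather not hard-code the key, one common pattern is to read it from the `OPENAI_API_KEY` environment variable, which the OpenAI Python client also looks for by default. The helper below is a hypothetical sketch; `read_openai_key` is not a function in this repository, and how gen_visprog.py consumes the key is not shown here.

```python
import os


def read_openai_key(env=os.environ):
    """Return the key stored in OPENAI_API_KEY, or raise with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```

Failing early with an explicit error avoids a less obvious authentication failure in the middle of program generation.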

Data Preparation

You can also process the features yourself.

First, install the dependencies:

cd ./models/pointnext/PointNeXt
bash install.sh

Prepare ScanNet 2D data following OpenScene and 3D data following vil3dref.

Then, run the following scripts:

python preprocess/process_feat_3d.py
python preprocess/process_feat_2d.py

You can refer to preprocess/process_mask3d.ipynb for processing 3D instance segments.
