
<p align="center"> <h1 align="center"> OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation</h1> <p align="center"> <a href="https://zheninghuang.github.io/"><strong>Zhening Huang</strong></a> · <a href="https://xywu.me"><strong>Xiaoyang Wu</strong></a> · <a href="https://xavierchen34.github.io/"><strong>Xi Chen</strong></a> · <a href="https://hszhao.github.io"><strong>Hengshuang Zhao</strong></a> · <a href="https://sites.google.com/site/indexlzhu/home"><strong>Lei Zhu</strong></a> · <a href="http://sigproc.eng.cam.ac.uk/Main/JL"><strong>Joan Lasenby</strong></a> </p> <h3 align="center"><a href="https://arxiv.org/abs/2309.00616">Paper</a> | <a href="https://www.youtube.com/watch?v=kwlMJkEfTyY">Video</a> | <a href="https://zheninghuang.github.io/OpenIns3D/">Project Page</a></h3> <div align="center"></div> </p>


<p align="center"> <strong> TL;DR: OpenIns3D proposes a "mask-snap-lookup" scheme to achieve 2D-input-free 3D open-world scene understanding, which attains SOTA performance across datasets, even with fewer input prerequisites. 🚀✨ </strong> </p> <table> <tr> <td><img src="assets/demo_1.gif" width="100%"/></td> <td><img src="assets/demo_2.gif" width="100%"/></td> <td><img src="assets/demo_3.gif" width="100%"/></td> </tr> <tr> <td align='center' width='24%'>device to watch BBC news</td> <td align='center' width='24%'>furniture that is capable of producing music</td> <td align='center' width='24%'>Ma Long's domain of excellence</td> </tr> <tr> <td><img src="assets/demo_4.gif" width="100%"/></td> <td><img src="assets/demo_5.gif" width="100%"/></td> <td><img src="assets/demo_6.gif" width="100%"/></td> </tr> <tr> <td align='center' width='24%'>most comfortable area to sit in the room</td> <td align='center' width='24%'>penciling down ideas during brainstorming</td> <td align='center' width='24%'>furniture offers recreational enjoyment with friends</td> </tr> </table> <br> <!-- # OpenIns3D pipeline <img src="assets/general_pipeline_updated.png" width="100%"/> -->
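The "lookup" step of the mask-snap-lookup scheme can be pictured as a vote: each 3D mask proposal is projected into several snapped views, the 2D open-vocabulary detector's label in each view casts a vote, and the majority label wins. The sketch below is a toy, self-contained illustration of that idea with made-up data, not the project's actual API:

```python
from collections import Counter

# Toy sketch of the "lookup" step: a mask's per-view detector labels are
# aggregated by majority vote. All names and data here are illustrative.

def lookup(per_view_labels):
    """Aggregate one mask's per-view detector labels by majority vote."""
    votes = Counter(l for l in per_view_labels if l is not None)
    return votes.most_common(1)[0][0] if votes else "unknown"

# Hypothetical votes for two masks across three snapped views:
mask_votes = {
    "mask_0": ["chair", "chair", None],      # visible in only two views
    "mask_1": ["table", "desk", "table"],    # detectors disagree in one view
}
labels = {m: lookup(v) for m, v in mask_votes.items()}
print(labels)  # {'mask_0': 'chair', 'mask_1': 'table'}
```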

Highlights

  • 2 Aug, 2024: Major update 🔥: We have released optimized, easy-to-use code for OpenIns3D that reproduces all the results in the paper and demo.
  • 1 Jul, 2024: OpenIns3D has been accepted at ECCV 2024 🎉. We will release more code for the various experiments soon.
  • 6 Jan, 2024: We have released a major revision, incorporating the S3DIS and ScanNet benchmark code. Try out the latest version.
  • 31 Dec, 2023: We released the batch inference code on ScanNet.
  • 31 Dec, 2023: We released the zero-shot inference code; try it on your own data!
  • Sep, 2023: OpenIns3D was released on arXiv, alongside an explanatory video and project page. The code will be released at the end of this year.

Overview

Installation

Please check the installation file to install OpenIns3D for:

  1. reproducing all results in the paper, or
  2. testing on your own dataset.

Reproducing Results

🗂️ Replica

🔧 Data Preparation:

  1. Execute the following commands to set up the Replica dataset, including scene .ply files, predicted masks, and ground truth:
sh scripts/prepare_replica.sh
sh scripts/prepare_yoloworld.sh 

📊 Open Vocabulary Instance Segmentation:

python openins3d/main.py --dataset replica --task OVIS --detector yoloworld

📈 Results Log:

| Task | AP | AP50 | AP25 | Log |
|--------------------------|:----:|:----:|:----:|:---:|
| Replica OVIS (in paper)  | 13.6 | 18.0 | 19.7 |     |
| Replica OVIS (this Code) | 15.4 | 19.5 | 25.2 | log |
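AP50 and AP25 in the tables are average precision at mask-IoU thresholds of 0.50 and 0.25. As a rough intuition for what those thresholds mean, here is a self-contained sketch of greedy matching of predicted instance masks to ground truth at a given IoU threshold. This is a simplification for illustration only, not the official ScanNet-style evaluator used for the benchmarks (which also sweeps confidence scores to compute AP):

```python
import numpy as np

# Simplified precision/recall at an IoU threshold for binary instance masks.
def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_at_iou(preds, gts, thresh=0.5):
    """Greedily match predictions (sorted by confidence) to unmatched GT masks."""
    matched_gt, tp = set(), 0
    for p in preds:
        best_iou, best_j = 0.0, None
        for j, g in enumerate(gts):
            if j in matched_gt:
                continue
            iou = mask_iou(p, g)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= thresh:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall

# Toy example: two predictions against two ground-truth masks over 6 points.
gt1 = np.array([1, 1, 1, 0, 0, 0], bool)
gt2 = np.array([0, 0, 0, 1, 1, 0], bool)
pred1 = np.array([1, 1, 0, 0, 0, 0], bool)   # IoU 2/3 with gt1
pred2 = np.array([0, 0, 0, 0, 1, 1], bool)   # IoU 1/3 with gt2
print(match_at_iou([pred1, pred2], [gt1, gt2], 0.5))   # (0.5, 0.5)
print(match_at_iou([pred1, pred2], [gt1, gt2], 0.25))  # (1.0, 1.0)
```

Note how the looser 0.25 threshold accepts the second, sloppier mask, which is why AP25 is always at least as high as AP50.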

🗂️ ScanNet

🔧 Data Preparation:

  1. Make sure you have completed the form on ScanNet to obtain access.
  2. Place the download-scannet.py script into the scripts directory.
  3. Run the following command to download all _vh_clean_2.ply files for the validation set, as well as the instance ground truth, GT masks, and detected masks:
sh scripts/prepare_scannet.sh

📊 Open Vocabulary Object Recognition:

python openins3d/main.py --dataset scannet --task OVOR --detector odise

📈 Results Log:

| Task | Top-1 Accuracy | Log |
|--------------------------|:----:|:---:|
| ScanNet_OVOR (in paper)  | 60.4 |     |
| ScanNet_OVOR (this Code) | 64.2 | log |
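Top-1 accuracy for recognition (OVOR) simply counts how often the top-ranked predicted vocabulary label matches the ground-truth class. A minimal sketch with made-up labels:

```python
# Top-1 accuracy: fraction of objects whose top-ranked predicted label
# equals the ground-truth label. Labels below are illustrative only.
def top1_accuracy(predictions, ground_truth):
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

preds = ["chair", "table", "sofa", "bed"]
gts   = ["chair", "desk",  "sofa", "bed"]
print(top1_accuracy(preds, gts))  # 0.75
```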

📊 Open Vocabulary Object Detection:

python openins3d/main.py --dataset scannet --task OVOD --detector odise

📊 Open Vocabulary Instance Segmentation:

python openins3d/main.py --dataset scannet --task OVIS --detector odise

📈 Results Log:

| Task | AP | AP50 | AP25 | Log |
|--------------------------|:----:|:----:|:----:|:---:|
| ScanNet_OVOD (in paper)  | 17.8 | 28.3 | 36.0 |     |
| ScanNet_OVOD (this Code) | 20.7 | 29.9 | 39.7 | log |
| ScanNet_OVIS (in paper)  | 19.9 | 28.7 | 38.9 |     |
| ScanNet_OVIS (this Code) | 23.3 | 34.6 | 42.6 | log |
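The three ScanNet evaluations above differ only in the `--task` flag, so they can be scripted in one loop. The sketch below only builds the exact commands shown in this section; set `RUN = True` to actually execute them (which requires the data preparation steps above to have completed):

```python
import subprocess

# Build the three ScanNet commands from this README; execution is off by
# default so the script is safe to run without the prepared data.
RUN = False
tasks = ["OVOR", "OVOD", "OVIS"]
commands = [
    ["python", "openins3d/main.py", "--dataset", "scannet",
     "--task", task, "--detector", "odise"]
    for task in tasks
]
for cmd in commands:
    print(" ".join(cmd))
    if RUN:
        subprocess.run(cmd, check=True)
```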


🗂️ S3DIS

🔧 Data Preparation:

  1. Make sure you have completed the form on S3DIS to obtain access.
  2. Then, run the following command to acquire scene .ply files, predicted masks, and ground truth:
sh scripts/prepare_s3dis.sh

📊 Open Vocabulary Instance Segmentation:

python openins3d/main.py --dataset s3dis --task OVIS --detector odise

📈 Results Log:

| Task | AP | AP50 | AP25 | Log |
|------------------------|:----:|:----:|:----:|:---:|
| S3DIS OVIS (in paper)  | 21.1 | 28.3 | 29.5 |     |
| S3DIS OVIS (this Code) | 22.9 | 29.0 | 31.4 | log |


🗂️ STPLS3D

🔧 Data Preparation:

  1. Make sure you have completed the form on STPLS3D to gain access.
  2. Then, run the following command to obtain scene .ply files, predicted masks, and ground truth:
sh scripts/prepare_stpls3d.sh

📊 Open Vocabulary Instance Segmentation:

python openins3d/main.py --dataset stpls3d --task OVIS --detector odise

📈 Results Log:

| Task | AP | AP50 | AP25 | Log |
|--------------------------|:----:|:----:|:----:|:---:|
| STPLS3D OVIS (in paper)  | 11.4 | 14.2 | 17.2 |     |
| STPLS3D OVIS (this Code) | 15.3 | 17.3 | 17.4 | log |


Replacing Snap with RGBD

We also evaluate the performance of OpenIns3D when the Snap module is replaced with the original RGBD images, keeping the rest of the design intact.
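Using real RGBD frames instead of snapped images means projecting the 3D masks into the dataset's camera views with the recorded intrinsics and poses. A self-contained pinhole-projection sketch (the intrinsics fx, fy, cx, cy are standard camera parameters; the values below are made up for illustration):

```python
import numpy as np

# Project 3D points (already in camera coordinates) onto the image plane
# with a pinhole model: u = fx * x / z + cx, v = fy * y / z + cy.
def project(points_cam, fx, fy, cx, cy):
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

pts = np.array([[0.0, 0.0, 2.0],   # straight ahead -> lands on principal point
                [1.0, 0.5, 2.0]])
uv = project(pts, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(uv)  # [[320. 240.] [570. 365.]]
```

Pixels covered by a mask's projected points are then matched against the 2D detections, exactly as with snapped views.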

🗂️ Replica

🔧 Data Preparation

  1. Download the Replica dataset and RGBD images:
sh scripts/prepare_replica.sh
sh scripts/prepare_replica2d.sh
sh scripts/prepare_yoloworld.sh 

📊 Open Vocabulary Instance Segmentation

python openins3d/main.py --dataset replica --task OVIS --detector yoloworld --use_2d true

📈 Results Log
| Task | AP | AP50 | AP25 | Log |
|------------|:----:|:----:|:----:|:---:|
| OpenMask3D | 13.1 | 18.4 | 24.2 |     |
| Open3DIS   | 18.5 | 24.5 | 28.2 |     |
| OpenIns3D  | 21.1 | 26.2 | 30.6 | log |

Zero-Shot Inference with Single Vocabulary

We demonstrate how to perform single-vocabulary instance segmentation, similar to the teaser image in the paper. The key new feature is a CLIP ranking-and-filtering module that reduces false-positive results. (This works best with RGBD input but also performs well with Snap.)
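The ranking-and-filtering idea can be pictured as: embed each mask's crop and the query text, rank masks by cosine similarity, and drop masks below a score threshold. The sketch below is library-free, with random-looking toy vectors standing in for real CLIP features; the function name and threshold are illustrative, not the repository's API:

```python
import numpy as np

# Rank candidate masks by cosine similarity between their (stand-in) image
# embeddings and a text embedding, then filter out low-scoring masks.
# Real embeddings would come from a CLIP model; the vectors here are toy.
def rank_and_filter(mask_embs, text_emb, threshold=0.5):
    mask_embs = mask_embs / np.linalg.norm(mask_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    scores = mask_embs @ text_emb                  # cosine similarities
    order = np.argsort(-scores)                    # best match first
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]

text = np.array([1.0, 0.0, 0.0])
masks = np.array([[0.9, 0.1, 0.0],   # close to the query
                  [0.0, 1.0, 0.0],   # unrelated -> filtered out
                  [0.7, 0.7, 0.0]])  # borderline
print(rank_and_filter(masks, text, threshold=0.5))
```

Raising the threshold trades recall for fewer false positives, which is the point of the filtering step for single-vocabulary queries.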

Quick Start:

  1. 📥 Download the demo dataset by running:

    sh scripts/prepare_demo_single.sh 
    
  2. 🚀 Run the model by executing:

    python zero_sho
    