XMask3D

[NeurIPS 2024] XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Created by Ziyi Wang*, Yanbo Wang*, Xumin Yu, Jie Zhou, Jiwen Lu.

This repository is a PyTorch implementation of our NeurIPS 2024 paper XMask3D.

XMask3D is a framework for open vocabulary 3D semantic segmentation that improves fine-grained boundary delineation by aligning 3D features with a 2D-text embedding space at the mask level. Using a mask generator based on a pre-trained diffusion model, it enables precise textual control over dense pixel representations, enhancing the versatility of generated masks. By integrating 3D global features into a 2D denoising UNet, XMask3D adds 3D geometry awareness to mask generation. The resulting 2D masks align 3D representations with vision-language features, yielding competitive segmentation performance across benchmarks.
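To make "alignment at the mask level" concrete, here is a minimal toy sketch. It is not the XMask3D architecture (there is no diffusion model, UNet, or learned 3D backbone here); it only illustrates the general idea of pooling per-point features inside each mask and matching the pooled vector against category text embeddings, instead of classifying every point on its own. All names (`mean_pool`, `classify_masks`) and the 2-D toy features are assumptions made for illustration.

```python
import math


def mean_pool(features, mask):
    """Average the feature vectors of the points selected by a binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no points")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_masks(point_features, masks, text_embeddings):
    """Label each mask with the category whose text embedding is most similar."""
    labels = []
    for mask in masks:
        pooled = mean_pool(point_features, mask)
        best = max(text_embeddings,
                   key=lambda name: cosine(pooled, text_embeddings[name]))
        labels.append(best)
    return labels


# Toy data (hypothetical): four points, two masks, two category embeddings.
point_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
text_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}

print(classify_masks(point_features, masks, text_embeddings))
```

Because the category list is just a dictionary of text embeddings, new category names can be added at inference time without retraining, which is the sense in which such a pipeline is open vocabulary.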

[arXiv]

Installation

Follow installation.md to install the packages required for training and evaluation.

Data Preparation

A download link for the processed dataset is provided here. To download the dataset, run the command below.

sh scripts/download_datasets.sh

Pre-trained Model Preparation

This project requires the pre-trained CLIP model and the Stable Diffusion model. Because the official download links can be unreliable, we provide alternative download options below:

# CLIP ViT-Large Patch14
cd /path/to/your/workspace
wget -O openai.tar.gz https://cloud.tsinghua.edu.cn/f/3890f1df1c5248a7a6e8/?dl=1
tar -xzvf openai.tar.gz
# Stable Diffusion v1.3 Checkpoint
wget -O sd_model.tar.gz https://cloud.tsinghua.edu.cn/f/8dce9b137f574e6eb57c/?dl=1
tar -xzvf sd_model.tar.gz
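The commands above do not say which directories the archives unpack into, so it can help to inspect them before extracting. The following is a small sketch using only the Python standard library; `top_level_entries` is a hypothetical helper, and the two archive names are simply the `-O` targets used in the `wget` commands.

```python
import tarfile
from pathlib import Path, PurePosixPath


def top_level_entries(archive):
    """Return the sorted top-level file/directory names inside a .tar.gz archive."""
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
    roots = set()
    for name in names:
        # Drop a leading "/" so absolute member names are handled like relative ones.
        parts = [p for p in PurePosixPath(name).parts if p != "/"]
        if parts:
            roots.add(parts[0])
    return sorted(roots)


# Only inspect archives that have actually been downloaded.
for archive in ("openai.tar.gz", "sd_model.tar.gz"):
    if Path(archive).is_file():
        print(archive, "->", top_level_entries(archive))
```

Run it from the workspace directory used for the downloads; it reads the archive index only and does not extract anything.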

Usage

Training

sh run/train.sh --exp_dir=<EXPERIMENT_DIRECTORY> --config=<CONFIG_FILE>

For example, to train on the ScanNet B15N4 benchmark, run:

sh run/train.sh --exp_dir=out/exp_b15n4 --config=config/scannet/xmask3d_scannet_B15N4.yaml

Resume

sh run/resume.sh --exp_dir=<EXPERIMENT_DIRECTORY> --config=<CONFIG_FILE>

For example, to resume training from the last checkpoint on the ScanNet B15N4 benchmark, run:

sh run/resume.sh --exp_dir=out/exp_b15n4 --config=config/scannet/xmask3d_scannet_B15N4.yaml

Inference

sh run/infer.sh --exp_dir=<EXPERIMENT_DIRECTORY> --config=<CONFIG_FILE> --ckpt_name=<CKPT_NAME>

For example, to run inference using the checkpoint b15n4.pth.tar on the ScanNet B15N4 benchmark, execute the following command:

sh run/infer.sh --exp_dir=out/exp_b15n4 --config=config/scannet/xmask3d_scannet_B15N4.yaml --ckpt_name=b15n4.pth.tar
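When launching several experiments from a script, the three commands above can be assembled programmatically. The sketch below is one way to do that; `build_command` is a hypothetical helper, not part of the repository, and it only reproduces the flags shown in this section (`--exp_dir`, `--config`, and, for inference, `--ckpt_name`).

```python
import shlex


def build_command(script, exp_dir, config, ckpt_name=None):
    """Return the argv list for one of the run/*.sh launchers shown above."""
    argv = ["sh", script, f"--exp_dir={exp_dir}", f"--config={config}"]
    if ckpt_name is not None:
        # Only run/infer.sh is documented as taking --ckpt_name.
        argv.append(f"--ckpt_name={ckpt_name}")
    return argv


train_cmd = build_command(
    "run/train.sh",
    "out/exp_b15n4",
    "config/scannet/xmask3d_scannet_B15N4.yaml",
)
infer_cmd = build_command(
    "run/infer.sh",
    "out/exp_b15n4",
    "config/scannet/xmask3d_scannet_B15N4.yaml",
    ckpt_name="b15n4.pth.tar",
)

# Print the commands for inspection instead of executing them.
print(shlex.join(train_cmd))
print(shlex.join(infer_cmd))
```

To actually run a command, pass the list to `subprocess.run(argv, check=True)` from the repository root, so that the relative `run/` and `config/` paths resolve.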

Checkpoint

| Benchmark | hIoU / mIoU<sub>b</sub> / mIoU<sub>n</sub> | Download Link |
|-----------------------|-----------------------------------------------|--------------------------|
| ScanNet B15N4 | 70.0 / 69.8 / 70.2 | [Tsinghua Cloud] [Google] |
| ScanNet B12N7 | 61.7 / 70.2 / 55.1 | [Tsinghua Cloud] [Google] |
| ScanNet B10N9 | 55.7 / 76.5 / 43.8 | [Tsinghua Cloud] [Google] |
| ScanNet B170N30 | 18.0 / 27.8 / 13.3 | [Tsinghua Cloud] [Google] |
| ScanNet B150N50 | 15.5 / 24.4 / 11.4 | [Tsinghua Cloud] [Google] |
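The three columns of numbers are related. Assuming the usual convention for these base/novel benchmarks, hIoU is the harmonic mean of the base-class mIoU (mIoU<sub>b</sub>) and the novel-class mIoU (mIoU<sub>n</sub>); this README does not state the definition, but the reported values are consistent with it. The sketch below recomputes hIoU from the other two columns as a sanity check for your own evaluation runs; `harmonic_iou` is a hypothetical helper name.

```python
def harmonic_iou(miou_base, miou_novel):
    """Harmonic mean of base-class and novel-class mIoU (both in percent)."""
    total = miou_base + miou_novel
    if total == 0:
        return 0.0
    return 2.0 * miou_base * miou_novel / total


# (benchmark, reported hIoU, mIoU_b, mIoU_n), copied from the checkpoint table.
REPORTED = [
    ("B15N4", 70.0, 69.8, 70.2),
    ("B12N7", 61.7, 70.2, 55.1),
    ("B10N9", 55.7, 76.5, 43.8),
    ("B170N30", 18.0, 27.8, 13.3),
    ("B150N50", 15.5, 24.4, 11.4),
]

for name, reported, base, novel in REPORTED:
    recomputed = harmonic_iou(base, novel)
    print(f"{name}: reported {reported:.1f}, recomputed {recomputed:.2f}")
```

Because the harmonic mean is dominated by the smaller of the two terms, a drop in novel-class mIoU lowers hIoU sharply even when base-class mIoU stays high, as in the B10N9 row.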

Citation

If you find our work useful in your research, please consider citing:

@article{wang2024xmask3d,
  title={XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation},
  author={Wang, Ziyi and Wang, Yanbo and Yu, Xumin and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2411.13243},
  year={2024}
}
