ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance
Official implementation of 'ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance'.
The paper has been accepted by ICCV 2023.
News
- We release the GPT-expanded Sr3D dataset and the training code of ViewRefer 📌.
- [2023.9] We release 'Point-PEFT' (accepted by AAAI 2024), which adapts 3D pre-trained models to downstream tasks with only 1% of the parameters.
- [2024.4] We release 'Any2Point', which adapts pre-trained models of any modality to 3D downstream tasks with only 1% of the parameters, achieving SOTA performance.
Introduction
ViewRefer is a multi-view framework for 3D visual grounding that grasps multi-view knowledge to alleviate the challenging view-discrepancy issue. For the text and 3D modalities, we respectively introduce LLM-expanded grounding texts and a fusion transformer for capturing multi-view information. We further present multi-view prototypes to provide high-level guidance to our framework, which contributes to superior 3D grounding performance.
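For intuition only, below is a schematic sketch of the multi-view idea described above. It is not the repository's actual model: the module names, dimensions, and the exact prototype interaction are illustrative assumptions.

```python
# Schematic sketch (illustrative only): rotate-and-encode the scene into
# several views, fuse them with a transformer, and let learnable multi-view
# prototypes guide the per-view grounding scores.
import torch
import torch.nn as nn

class MultiViewGrounder(nn.Module):  # hypothetical module, not the repo's class
    def __init__(self, dim=768, num_views=4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_views, dim))  # multi-view prototypes
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)     # inter-view fusion

    def forward(self, view_feats, text_feat):
        # view_feats: (B, num_views, dim) per-view scene features
        # text_feat:  (B, dim) feature of the (GPT-expanded) grounding text
        fused = self.fusion(view_feats)                              # exchange multi-view cues
        weights = torch.softmax(fused @ self.prototypes.T, dim=-1)   # (B, V, V) prototype affinity
        guided = fused + weights @ self.prototypes                   # prototype-guided features
        scores = (guided * text_feat.unsqueeze(1)).sum(-1)           # (B, V) per-view scores
        return scores.mean(dim=1)                                    # aggregate across views

model = MultiViewGrounder()
out = model(torch.randn(2, 4, 768), torch.randn(2, 768))  # -> shape (2,)
```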
<div align="center"> <img src="pipeline.png"/> </div>

Requirements
Please refer to referit3d for the installation and data preparation.
We adopt a pre-trained BERT from Hugging Face. Please install the related packages:
pip install transformers
Download the pre-trained BERT weights and put them into a folder, denoted below as PATH_OF_BERT.
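If you do not have the weights locally, here is a minimal sketch for fetching and saving them with the transformers package. It assumes the bert-base-uncased checkpoint; substitute whichever BERT variant the project expects.

```python
# Minimal sketch: download a BERT checkpoint and save it to PATH_OF_BERT.
# Assumption: bert-base-uncased is the intended backbone; adjust if needed.
from transformers import BertModel, BertTokenizer

PATH_OF_BERT = './referit3d_3dvg/data/bert'  # matches the paths used in the scripts below

BertTokenizer.from_pretrained('bert-base-uncased').save_pretrained(PATH_OF_BERT)
BertModel.from_pretrained('bert-base-uncased').save_pretrained(PATH_OF_BERT)
```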
Download the GPT-expanded Sr3D dataset and put it into the './data' folder.
Getting Started
Training
- To train on the Sr3D dataset, run:
SR3D_GPT='./referit3d_3dvg/data/Sr3D_release.csv'
PATH_OF_SCANNET_FILE='./referit3d_3dvg/data/keep_all_points_with_global_scan_alignment.pkl'
PATH_OF_REFERIT3D_FILE=${SR3D_GPT}
PATH_OF_BERT='./referit3d_3dvg/data/bert'
VIEW_NUM=4
EPOCH=100
DATA_NAME=SR3D
EXT=ViewRefer
DECODER=4
NAME=${DATA_NAME}_${VIEW_NUM}view_${EPOCH}ep_${EXT}
TRAIN_FILE=train_referit3d
python -u ./referit3d_3dvg/scripts/${TRAIN_FILE}.py \
-scannet-file ${PATH_OF_SCANNET_FILE} \
-referit3D-file ${PATH_OF_REFERIT3D_FILE} \
--bert-pretrain-path ${PATH_OF_BERT} \
--log-dir logs/results/${NAME} \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 24 \
--n-workers 8 \
--max-train-epochs ${EPOCH} \
--encoder-layer-num 3 \
--decoder-layer-num ${DECODER} \
--decoder-nhead-num 8 \
--view_number ${VIEW_NUM} \
--rotate_number 4 \
--label-lang-sup True
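Before launching, a quick sanity check that the paths above resolve can save a failed run. A small optional sketch (the paths mirror the variables in the script above):

```python
# Pre-flight check: verify that the data files referenced above exist.
import os

paths = {
    'GPT-expanded Sr3D csv': './referit3d_3dvg/data/Sr3D_release.csv',
    'ScanNet pickle': './referit3d_3dvg/data/keep_all_points_with_global_scan_alignment.pkl',
    'pre-trained BERT dir': './referit3d_3dvg/data/bert',
}
for name, path in paths.items():
    status = 'ok' if os.path.exists(path) else 'MISSING'
    print(f'{name:24s} {status}  ({path})')
```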
- Refer to this link for the checkpoint and training log of ViewRefer on the Sr3D dataset.
Test
- To test on the Sr3D dataset, run:
SR3D_GPT='./referit3d_3dvg/data/Sr3D_release.csv'
PATH_OF_SCANNET_FILE='./referit3d_3dvg/data/keep_all_points_with_global_scan_alignment.pkl'
PATH_OF_REFERIT3D_FILE=${SR3D_GPT}
PATH_OF_BERT='./referit3d_3dvg/data/bert'
VIEW_NUM=4
EPOCH=100
DATA_NAME=SR3D
EXT=ViewRefer_test
DECODER=4
NAME=${DATA_NAME}_${VIEW_NUM}view_${EPOCH}ep_${EXT}
TRAIN_FILE=train_referit3d
python -u ./referit3d_3dvg/scripts/${TRAIN_FILE}.py \
--mode evaluate \
-scannet-file ${PATH_OF_SCANNET_FILE} \
-referit3D-file ${PATH_OF_REFERIT3D_FILE} \
--bert-pretrain-path ${PATH_OF_BERT} \
--log-dir logs/results/${NAME} \
--resume-path "./checkpoints/best_model.pth" \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 24 \
--n-workers 8 \
--max-train-epochs ${EPOCH} \
--encoder-layer-num 3 \
--decoder-layer-num ${DECODER} \
--decoder-nhead-num 8 \
--view_number ${VIEW_NUM} \
--rotate_number 4 \
--label-lang-sup True
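Note that, aside from the experiment name (EXT), evaluation reuses the training script: the substantive differences from the training command above are --mode evaluate and the --resume-path flag, which points at the trained checkpoint.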
Acknowledgement
This repo benefits from ReferIt3D and MVT-3DVG. Thanks for their wonderful work.
Citation
@article{guo2023viewrefer,
title={ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance},
author={Guo, Ziyu and Tang, Yiwen and Zhang, Renrui and Wang, Dong and Wang, Zhigang and Zhao, Bin and Li, Xuelong},
journal={arXiv preprint arXiv:2303.16894},
year={2023}
}
Contact
If you have any questions about this project, please feel free to contact tangyiwen@pjlab.org.cn.