# DSVT

[CVPR2023] Official implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
DSVT: an efficient and deployment-friendly sparse backbone for large-scale point clouds
This repo is the official implementation of the CVPR paper DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets, as well as its follow-ups. DSVT achieves state-of-the-art performance on the large-scale Waymo Open Dataset with real-time inference speed (27 Hz). We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.
<div align="center"> <img src="assets/Figure2.png" width="500"/> </div>

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
Haiyang Wang*, Chen Shi*, Shaoshuai Shi $^\dagger$, Meng Lei, Sen Wang, Di He, Bernt Schiele, Liwei Wang $^\dagger$
- Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn )
## News
- [24-08-12] 🔥 GiT was accepted by ECCV2024 with Oral presentation. Hope you enjoy the success of plain transformer family.
- [24-07-01] 🔥 Our GiT was accepted by ECCV2024. If you find it helpful, please give it a star. 🤗
- [24-03-15] 🔥 GiT, the first successful general vision model only using a ViT is released. Corresponding to Potential Research, we attempted to address problems with the general model on the vision side. Combining DSVT, UniTR and GiT to construct an LLM-like unified model suitable for autonomous driving scenarios is an intriguing direction.
- [24-01-20] DSVT (Waymo) has been merged into mmdetection3d, thanks to the community's implementation.
- [23-09-30] A few brief thoughts on potential research directions for 3D perception. I would be delighted if they prove helpful for your research; please refer here.
- [23-09-25] Code of UniTR has been released here.
- [23-09-25] 🚀 UniTR, built upon DSVT, has been accepted at ICCV2023. It is the first unified multi-modal transformer backbone for 3D perception, and we hope it can serve as a prerequisite for a 3D Vision Foundation Model.
- [23-08-22] Thanks to jingyue202205 for his diligent efforts: DSVT has been implemented with TensorRT in an end-to-end manner, referring to here.
- [23-08-15] Bug Alert: we used incorrect position embeddings in DSVTBlock (see issue#50).
- [23-07-09] Bug fixed: the wrong dynamic shape used in trtexec has been corrected (see issue#43 and the deploy guidance). Before: PyTorch (36.0ms) -> TRT-fp16 (32.9ms); after: PyTorch (36.0ms) -> TRT-fp16 (13.8ms).
- [23-06-30] 🔥 DSVT (Waymo) has been merged into OpenPCDet.
- [23-06-23] 🔥 Code of Deployment is released.
- [23-06-03] Code of NuScenes is released (SOTA).
- [23-03-30] Code of Waymo is released (SOTA).
- [23-02-28] 🔥 DSVT is accepted at CVPR 2023.
- [23-01-15] DSVT is released on arXiv.
## Overview
- Todo
- Introduction
- Main Results
- Installation
- Quick Start
- TensorRT Deployment
- Possible Issues
- Citation
- Potential Research
## TODO
- [x] Release the arXiv version.
- [x] SOTA performance of 3D object detection (Waymo & NuScenes) and BEV map segmentation (NuScenes).
- [x] Clean up and release the code of Waymo.
- [x] Release code of NuScenes.
- [x] Release code of Deployment.
- [x] Merge DSVT to OpenPCDet.
- [ ] Release the Waymo Multi-Frames Configs.
## Introduction
Dynamic Sparse Voxel Transformer (DSVT) is an efficient and deployment-friendly 3D transformer backbone for outdoor 3D object detection. It partitions a series of local regions in each window according to their sparsity and then computes the features of all regions in a fully parallel manner. Moreover, to allow cross-set connections, it adopts a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers.
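The partitioning idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not the repo's API: within one window, sparse voxels are sorted along one axis, chunked into equal-size sets for parallel attention, and the sort axis alternates ("rotates") in the next layer. All names here are hypothetical.

```python
import numpy as np

def partition_sets(coords, set_size, axis):
    """Sort voxel indices along `axis`, then chunk them into fixed-size sets.

    coords: (N, 2) integer voxel coordinates inside one window.
    Returns a list of index arrays, each of length <= set_size.
    """
    other = 1 - axis
    # np.lexsort sorts by the LAST key first, so `axis` is the primary key.
    order = np.lexsort((coords[:, other], coords[:, axis]))
    return [order[i:i + set_size] for i in range(0, len(order), set_size)]

# Five sparse voxels in a single window.
coords = np.array([[0, 0], [0, 3], [1, 1], [2, 0], [2, 2]])
sets_x = partition_sets(coords, set_size=2, axis=0)  # layer 1: x-major sets
sets_y = partition_sets(coords, set_size=2, axis=1)  # layer 2: y-major sets
```

Because the two layers group different voxels together (`sets_x` and `sets_y` differ), information propagates across set boundaries without any explicit cross-set attention, which is the point of the rotation.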
DSVT achieves state-of-the-art performance on large-scale Waymo one-sweep 3D object detection: 78.2 mAPH L1 and 72.1 mAPH L2 in the one-stage setting, and 78.9 mAPH L1 and 72.8 mAPH L2 in the two-stage setting, surpassing previous models by a large margin. Moreover, in the multi-sweep settings (2, 3, and 4 sweeps), our model reaches 74.6, 75.0, and 75.6 mAPH L2 with the one-stage framework and 75.1, 75.5, and 76.2 mAPH L2 with the two-stage framework, outperforming the previous best multi-frame methods by a large margin. Note that our model is not specifically designed for multi-frame detection and only takes concatenated point clouds as input.

## Main Results
We provide the pillar and voxel versions of one-stage DSVT. The two-stage versions with CT3D are also listed below.
### 3D Object Detection (on Waymo validation)
We run training three times and report the average metrics across all runs. Regrettably, we are unable to provide the pre-trained model weights due to the Waymo Dataset License Agreement; however, we can provide the training logs.
#### One-Sweeps Setting
| Model | #Sweeps | mAP/H_L1 | mAP/H_L2 | Veh_L1 | Veh_L2 | Ped_L1 | Ped_L2 | Cyc_L1 | Cyc_L2 | Log |
|---------|---------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| DSVT(Pillar) | 1 | 79.5/77.1 | 73.2/71.0 | 79.3/78.8 | 70.9/70.5 | 82.8/77.0 | 75.2/69.8 | 76.4/75.4 | 73.6/72.7 | Log |
| DSVT(Voxel) | 1 | 80.3/78.2 | 74.0/72.1 | 79.7/79.3 | 71.4/71.0 | 83.7/78.9 | 76.1/71.5 | 77.5/76.5 | 74.6/73.7 | Log |
| DSVT(Pillar-TS) | 1 | 80.6/78.2 | 74.3/72.1 | 80.2/79.7 | 72.0/71.6 | 83.7/78.0 | 76.1/70.7 | 77.8/76.8 | 74.9/73.9 | Log |
| D
