# DSVT

[CVPR2023] Official implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
DSVT: an efficient and deployment-friendly sparse backbone for large-scale point clouds
This repo is the official implementation of the CVPR paper DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets, as well as its follow-ups. DSVT achieves state-of-the-art performance on the large-scale Waymo Open Dataset with real-time inference speed (27 Hz). We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.
<div align="center"> <img src="assets/Figure2.png" width="500"/> </div>

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
Haiyang Wang*, Chen Shi*, Shaoshuai Shi $^\dagger$, Meng Lei, Sen Wang, Di He, Bernt Schiele, Liwei Wang $^\dagger$
- Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn )
## News
- [24-08-12] 🔥 GiT was accepted by ECCV2024 with Oral presentation. Hope you enjoy the success of plain transformer family.
- [24-07-01] 🔥 Our GiT was accepted by ECCV2024. If you find it helpful, please give it a star. 🤗
- [24-03-15] 🔥 GiT, the first successful general vision model only using a ViT is released. Corresponding to Potential Research, we attempted to address problems with the general model on the vision side. Combining DSVT, UniTR and GiT to construct an LLM-like unified model suitable for autonomous driving scenarios is an intriguing direction.
- [24-01-20] DSVT (Waymo) has been merged into mmdetection3d, thanks to the community's implementation.
- [23-09-30] A few brief thoughts on potential research directions for 3D perception. I would be delighted if they prove helpful for your research; please refer here.
- [23-09-25] Code of UniTR has been released here.
- [23-09-25] 🚀 UniTR, built upon DSVT, has been accepted at ICCV2023. It is the first unified multi-modal transformer backbone for 3D perception, and we hope it can serve as a prerequisite for a 3D Vision Foundation Model.
- [23-08-22] Thanks to jingyue202205 for his diligent efforts: DSVT has been implemented with TensorRT in an end-to-end manner, referring to here.
- [23-08-15] Bug Alert: we used incorrect position embeddings in DSVTBlock (see issue#50).
- [23-07-09] Bug fixed: the wrong dynamic shape used in trtexec has been corrected (see issue#43 and the deploy guidance). Before: PyTorch (36.0ms) -> TRT-fp16 (32.9ms); after: PyTorch (36.0ms) -> TRT-fp16 (13.8ms).
- [23-06-30] 🔥 DSVT (Waymo) has been merged into OpenPCDet.
- [23-06-23] 🔥 Code of Deployment is released.
- [23-06-03] Code of NuScenes is released (SOTA).
- [23-03-30] Code of Waymo is released (SOTA).
- [23-02-28] 🔥 DSVT is accepted at CVPR 2023.
- [23-01-15] DSVT is released on arXiv.
## Overview
- Todo
- Introduction
- Main Results
- Installation
- Quick Start
- TensorRT Deployment
- Possible Issues
- Citation
- Potential Research
## TODO
- [x] Release the arXiv version.
- [x] SOTA performance of 3D object detection (Waymo & NuScenes) and BEV map segmentation (NuScenes).
- [x] Clean up and release the code of Waymo.
- [x] Release code of NuScenes.
- [x] Release code of Deployment.
- [x] Merge DSVT to OpenPCDet.
- [ ] Release the Waymo Multi-Frames Configs.
## Introduction
Dynamic Sparse Voxel Transformer (DSVT) is an efficient and deployment-friendly 3D transformer backbone for outdoor 3D object detection. It partitions a series of local regions in each window according to their sparsity and then computes the features of all regions in a fully parallel manner. Moreover, to allow cross-set connections, it adopts a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers.
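The partitioning idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not the repo's API: within one window, sparse voxels are sorted along one axis, chunked into equal-size sets for parallel attention, and the sort axis alternates ("rotates") in the next layer. All names here are hypothetical.

```python
import numpy as np

def partition_sets(coords, set_size, axis):
    """Sort voxel indices along `axis`, then chunk them into fixed-size sets.

    coords: (N, 2) integer voxel coordinates inside one window.
    Returns a list of index arrays, each of length <= set_size.
    """
    other = 1 - axis
    # np.lexsort sorts by the LAST key first, so `axis` is the primary key.
    order = np.lexsort((coords[:, other], coords[:, axis]))
    return [order[i:i + set_size] for i in range(0, len(order), set_size)]

# Five sparse voxels in a single window.
coords = np.array([[0, 0], [0, 3], [1, 1], [2, 0], [2, 2]])
sets_x = partition_sets(coords, set_size=2, axis=0)  # layer 1: x-major sets
sets_y = partition_sets(coords, set_size=2, axis=1)  # layer 2: y-major sets
```

Because the two layers group different voxels together (`sets_x` and `sets_y` differ), information propagates across set boundaries without any explicit cross-set attention, which is the point of the rotation.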
DSVT achieves state-of-the-art performance on large-scale Waymo one-sweep 3D object detection: 78.2 mAPH L1 and 72.1 mAPH L2 in the one-stage setting, and 78.9 mAPH L1 and 72.8 mAPH L2 in the two-stage setting, surpassing previous models by a large margin. Moreover, in the multi-sweep settings (2, 3, and 4 sweeps), our model reaches 74.6, 75.0, and 75.6 mAPH L2 with the one-stage framework and 75.1, 75.5, and 76.2 mAPH L2 with the two-stage framework, outperforming the previous best multi-frame methods by a large margin. Note that our model is not specifically designed for multi-frame detection and only takes concatenated point clouds as input.

## Main Results
We provide the pillar and voxel versions of one-stage DSVT. The two-stage versions with CT3D are also listed below.
### 3D Object Detection (on Waymo validation)
We run training three times and report the average metrics across all runs. Regrettably, we are unable to provide the pre-trained model weights due to the Waymo Dataset License Agreement; however, we can provide the training logs.
#### One-Sweeps Setting
| Model | #Sweeps | mAP/H_L1 | mAP/H_L2 | Veh_L1 | Veh_L2 | Ped_L1 | Ped_L2 | Cyc_L1 | Cyc_L2 | Log |
|---------|---------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| DSVT(Pillar) | 1 | 79.5/77.1 | 73.2/71.0 | 79.3/78.8 | 70.9/70.5 | 82.8/77.0 | 75.2/69.8 | 76.4/75.4 | 73.6/72.7 | Log |
| DSVT(Voxel) | 1 | 80.3/78.2 | 74.0/72.1 | 79.7/79.3 | 71.4/71.0 | 83.7/78.9 | 76.1/71.5 | 77.5/76.5 | 74.6/73.7 | Log |
| DSVT(Pillar-TS) | 1 | 80.6/78.2 | 74.3/72.1 | 80.2/79.7 | 72.0/71.6 | 83.7/78.0 | 76.1/70.7 | 77.8/76.8 | 74.9/73.9 | Log |
| D
