OccSora

OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

Paper | Project Page

Lening Wang*, Wenzhao Zheng*†, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu

* Equal contribution. † Project leader.

With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving.

News

[2024/05/31] Training, evaluation, and visualization code released.
[2024/05/31] Paper released on arXiv.

Demo

Trajectory-aware Video Generation:

[Figure: demo of trajectory-aware video generation]

Scene Video Generation:

[Figure: demo of scene video generation]

Overview

[Figure: overview of OccSora]

Unlike most existing world models, which adopt an autoregressive framework to perform next-token prediction, we propose OccSora, a diffusion-based 4D occupancy generation model that models long-term temporal evolution more efficiently. We employ a 4D scene tokenizer to obtain compact, discrete spatial-temporal representations of the 4D occupancy input and to achieve high-quality reconstruction of long-sequence occupancy videos. We then train a diffusion transformer on these spatial-temporal representations to generate 4D occupancy conditioned on a trajectory prompt. OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
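To make the "discrete representation" idea above concrete, here is a minimal sketch of generic nearest-neighbor vector quantization, the basic operation behind VQ-style tokenizers. It is an illustration only, not the OccSora tokenizer: the function name `quantize`, the toy codebook, and the feature sizes are all hypothetical, and plain Python lists stand in for tensors.

```python
def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: list of N vectors (hypothetical continuous latents).
    codebook: list of K vectors of the same length (hypothetical codes).
    Returns (indices, quantized): the chosen code index per feature and
    a copy of the corresponding code vectors.
    """
    indices = []
    for feat in features:
        # Squared Euclidean distance from this feature to every code.
        dists = [sum((a - b) ** 2 for a, b in zip(feat, code))
                 for code in codebook]
        indices.append(dists.index(min(dists)))
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


if __name__ == "__main__":
    # Toy 2-D codebook with three entries (assumed values).
    codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    features = [[0.9, 0.1], [0.1, 0.8]]
    indices, quantized = quantize(features, codebook)
    print(indices, quantized)
```

The integer indices are the "tokens" in this toy setting. A diffusion model would then be trained on a representation derived from them rather than on raw occupancy grids.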

Getting Started

Installation

Create a conda environment with Python 3.8.0.
Install all the packages listed in environment.yaml.
Refer to the mmdetection3d documentation for instructions on installing mmdetection3d.

Preparing

Create a soft link at data/nuscenes that points to your_nuscenes_path.
Prepare the gts semantic occupancy ground truth introduced in [Occ3d].
Download the generated train/val pickle files and put them in data/:

[nuscenes_infos_train_temporal_v3_scene.pkl]

[nuscenes_infos_val_temporal_v3_scene.pkl]
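The soft-link step above can also be scripted. The following is a small sketch using only the Python standard library. The helper name `link_nuscenes` and its `repo_root` parameter are hypothetical and are not part of the OccSora code. The script refuses to overwrite an existing data/nuscenes entry.

```python
import os
from pathlib import Path


def link_nuscenes(nuscenes_path, repo_root="."):
    """Create <repo_root>/data/nuscenes as a symlink to nuscenes_path."""
    target = Path(nuscenes_path).resolve()
    if not target.is_dir():
        raise FileNotFoundError(f"nuScenes directory not found: {target}")

    link = Path(repo_root) / "data" / "nuscenes"
    # Make sure the data/ directory exists before linking into it.
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists; remove it first")

    os.symlink(target, link, target_is_directory=True)
    return link
```

For example, `link_nuscenes("your_nuscenes_path")`, run from the repository root, creates data/nuscenes as a link to the dataset directory.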

The dataset should be organized as follows:

OccSora/data
    nuscenes                 -    downloaded from www.nuscenes.org
        lidarseg
        maps
        samples
        sweeps
        v1.0-trainval
        gts                  -    downloaded from Occ3d
    nuscenes_infos_train_temporal_v3_scene.pkl
    nuscenes_infos_val_temporal_v3_scene.pkl
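Before training, it can help to confirm that the layout above is in place. The sketch below checks for each expected entry. The relative paths are copied from the tree above, while the names `EXPECTED` and `missing_entries` are hypothetical and do not come from the repository.

```python
from pathlib import Path

# Relative paths taken from the directory tree in this README.
EXPECTED = [
    "nuscenes/lidarseg",
    "nuscenes/maps",
    "nuscenes/samples",
    "nuscenes/sweeps",
    "nuscenes/v1.0-trainval",
    "nuscenes/gts",
    "nuscenes_infos_train_temporal_v3_scene.pkl",
    "nuscenes_infos_val_temporal_v3_scene.pkl",
]


def missing_entries(data_root="data"):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Dataset layout looks complete.")
```

The check only tests for existence. It does not validate the contents of the pickle files or the occupancy ground truth.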

Training

Train the VQVAE on an A100 GPU with 80 GB of memory:

python train_1.py --py-config config/train_vqvae.py --work-dir out/vqvae

Generate the training token data using the trained VQVAE:

python step02.py --py-config config/train_vqvae.py --work-dir out/vqvae

Train OccSora on A100 GPUs with 80 GB of memory each:

torchrun --nnodes=1 --nproc_per_node=8 train_2.py --model DiT-XL/2 --data-path /path
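For context on the command above: torchrun launches one worker process per GPU and exposes each worker's position through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. With --nnodes=1 and --nproc_per_node=8, the world size is 8. The sketch below shows how a script might read these values. The helper `read_dist_env` is hypothetical, and whether train_2.py reads the variables this way is an assumption.

```python
import os


def read_dist_env(env=None):
    """Read the process layout that torchrun exports to each worker.

    Falls back to single-process defaults when the variables are unset,
    e.g. when the script is run directly with `python`.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global worker index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total worker count
    }


if __name__ == "__main__":
    print(read_dist_env())
```

On a single node, the local rank typically selects which GPU a worker uses.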

Evaluation

Evaluate the model on an A100 GPU with 80 GB of memory.

The tokens are obtained by denoising noise (samples_array.npy):

python sample.py --model DiT-XL/2 --image-size 256 --ckpt "/results/001-DiT-XL-2/checkpoints/1200000.pt"
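To illustrate what "denoising" means here, the following is a toy sketch of textbook DDPM ancestral sampling in pure Python. It is not the sampler used by sample.py: the linear beta schedule, the step count, the fixed variance choice, and the placeholder noise predictor are all assumptions, and trajectory conditioning is omitted. In practice, a trained diffusion transformer would take the place of `predict_noise`.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise schedule (assumed values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def ddpm_sample(predict_noise, dim, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise it step by step.

    predict_noise(x, t) must return a list of `dim` floats estimating
    the noise contained in x at timestep t.
    """
    rng = random.Random(seed)
    betas = linear_betas(num_steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # pure noise
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Re-inject noise on every step except the last.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


if __name__ == "__main__":
    # Placeholder predictor that always returns zeros (not a trained model).
    sample = ddpm_sample(lambda x, t: [0.0] * len(x), dim=4)
    print(sample)
```

With the placeholder predictor the output is still noise. The sketch only shows the control flow of the reverse process.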

Visualization

python visualize_demo.py --py-config config/train_vqvae.py --work-dir out/vqvae

Related Projects

Our code is based on OccWorld and DiT.

We also thank these excellent open-source repos: TPVFormer, MagicDrive, and BEVFormer.

Citation

If you find this project helpful, please consider citing the following paper:

@article{wang2024occsora,
    title={OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving},
    author={Wang, Lening and Zheng, Wenzhao and Ren, Yilong and Jiang, Han and Cui, Zhiyong and Yu, Haiyang and Lu, Jiwen},
    journal={arXiv preprint arXiv:2405.20337},
    year={2024}
}
