OccSora
OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
Paper | Project Page
Lening Wang*, Wenzhao Zheng* $\dagger$, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu
* Equal contribution $\dagger$ Project leader
With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving.
News
- [2024/05/31] Training, evaluation, and visualization code release.
- [2024/05/31] Paper released on arXiv.
Demo
Trajectory-aware Video Generation:

Scene Video Generation:

Overview

Unlike most existing world models, which adopt an autoregressive framework to perform next-token prediction, we propose OccSora, a diffusion-based 4D occupancy generation model, to model long-term temporal evolution more efficiently. We employ a 4D scene tokenizer to obtain compact, discrete spatio-temporal representations of the 4D occupancy input and achieve high-quality reconstruction of long-sequence occupancy videos. We then train a diffusion transformer on these representations and generate 4D occupancy conditioned on a trajectory prompt. OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
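To make the idea of a "compact discrete" representation concrete, here is a minimal sketch of nearest-neighbour vector quantization, the lookup step at the heart of a VQVAE-style tokenizer. It is an illustration only, not the OccSora tokenizer: the function name `quantize`, the toy codebook, and all array sizes are assumptions chosen for readability.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) array of continuous latent vectors.
    codebook: (K, D) array of code vectors.
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    diff = features[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # The discrete token for each feature is the index of the closest code.
    indices = np.argmin(dist, axis=1)
    return indices, codebook[indices]


# Toy data; the sizes are illustrative and not those used by OccSora.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
indices, quantized = quantize(features, codebook)
print(indices)  # one integer token per feature vector
```

In a pipeline of this kind, a generative model is trained on such token indices (or their embeddings) rather than on the raw occupancy grid, which is what keeps long sequences tractable.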
Getting Started
Installation
- Create a conda environment with Python version 3.8.0
- Install all the packages in environment.yaml
- Refer to the mmdetection3d documentation to install mmdetection3d
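After installation, a quick sanity check along the following lines can confirm that the interpreter and key packages are visible. This is a sketch under assumptions: `check_environment` is a hypothetical helper, `mmdet3d` is the import name of mmdetection3d, and `torch` is assumed to be required because the training step uses `torchrun`.

```python
import importlib.util
import sys


def check_environment(required=("torch", "mmdet3d")):
    """Report whether the Python version and required packages look right."""
    report = {"python_3_8": sys.version_info[:2] == (3, 8)}
    for name in required:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'MISSING'}")
```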
Preparing
- Create a soft link from data/nuscenes to your_nuscenes_path
- Prepare the gts semantic occupancy introduced in [Occ3d]
- Download the generated train/val pickle files and put them in data/
  - [nuscenes_infos_train_temporal_v3_scene.pkl]
  - [nuscenes_infos_val_temporal_v3_scene.pkl]
The dataset should be organized as follows:
OccSora/data
    nuscenes                 - downloaded from www.nuscenes.org
        lidarseg
        maps
        samples
        sweeps
        v1.0-trainval
    gts                      - downloaded from Occ3d
    nuscenes_infos_train_temporal_v3_scene.pkl
    nuscenes_infos_val_temporal_v3_scene.pkl
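Before training, it can help to verify that the layout above is in place. The following is a minimal sketch, not part of the repository: `missing_entries` is a hypothetical helper, and it assumes the script is run from the OccSora root so that `data` resolves to the directory shown above.

```python
from pathlib import Path

# Entries taken from the layout listed above, relative to the data directory.
EXPECTED = (
    "nuscenes/lidarseg",
    "nuscenes/maps",
    "nuscenes/samples",
    "nuscenes/sweeps",
    "nuscenes/v1.0-trainval",
    "gts",
    "nuscenes_infos_train_temporal_v3_scene.pkl",
    "nuscenes_infos_val_temporal_v3_scene.pkl",
)


def missing_entries(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    # Path.exists() follows symlinks, so a dangling soft link counts as missing.
    return [rel for rel in EXPECTED if not (root / rel).exists()]


for rel in missing_entries("data"):
    print("missing:", rel)
```

An empty result means every listed directory and pickle file was found; it does not validate their contents.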
Training
Train the VQVAE on an A100 GPU with 80 GB of memory.
python train_1.py --py-config config/train_vqvae.py --work-dir out/vqvae
Generate the training token data using the trained VQVAE.
python step02.py --py-config config/train_vqvae.py --work-dir out/vqvae
Train OccSora on A100 GPUs with 80 GB of memory each.
torchrun --nnodes=1 --nproc_per_node=8 train_2.py --model DiT-XL/2 --data-path /path
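The three commands above depend on one another: the token-generation step needs the trained VQVAE, and the diffusion training needs the generated tokens. One way to keep that order explicit is a small driver script such as the sketch below. `run_pipeline` and its `dry_run` flag are hypothetical names; the command strings are copied verbatim from above, including the `/path` placeholder, which must be replaced with a real data path before an actual run.

```python
import shlex
import subprocess

# Commands copied from the steps above; "/path" is a placeholder in the source.
STEPS = (
    "python train_1.py --py-config config/train_vqvae.py --work-dir out/vqvae",
    "python step02.py --py-config config/train_vqvae.py --work-dir out/vqvae",
    "torchrun --nnodes=1 --nproc_per_node=8 train_2.py --model DiT-XL/2 --data-path /path",
)


def run_pipeline(steps=STEPS, dry_run=True):
    """Run each stage in order; with dry_run=True only print the commands."""
    commands = [shlex.split(step) for step in steps]
    for argv in commands:
        print("$", " ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError, so later stages never run
            # on top of a failed earlier stage.
            subprocess.run(argv, check=True)
    return commands


run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would launch the stages for real, which requires the GPUs and data described above.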
Evaluation
Evaluate the model on an A100 GPU with 80 GB of memory.
The tokens are obtained by denoising noise samples (samples_array.npy).
python sample.py --model DiT-XL/2 --image-size 256 --ckpt "/results/001-DiT-XL-2/checkpoints/1200000.pt"
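To get a quick look at samples_array.npy, something like the following sketch can be used. It assumes the file is a plain NumPy array written with `np.save` and located in the current working directory; neither its shape nor its exact role is stated here, so the hypothetical helper `describe_array` only reports basic statistics.

```python
import os

import numpy as np


def describe_array(path):
    """Load a .npy file and summarize its shape, dtype, and value range."""
    arr = np.load(path)
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "min": float(arr.min()),
        "max": float(arr.max()),
    }


path = "samples_array.npy"  # file name as mentioned above; location assumed
if os.path.exists(path):
    print(describe_array(path))
else:
    print(f"{path} not found in {os.getcwd()}")
```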
Visualization
python visualize_demo.py --py-config config/train_vqvae.py --work-dir out/vqvae
Related Projects
Our code is based on OccWorld and DiT.
Thanks also to these excellent open-source repos: TPVFormer, MagicDrive, and BEVFormer.
Citation
If you find this project helpful, please consider citing the following paper:
@article{wang2024occsora,
title={OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving},
author={Wang, Lening and Zheng, Wenzhao and Ren, Yilong and Jiang, Han and Cui, Zhiyong and Yu, Haiyang and Lu, Jiwen},
journal={arXiv preprint arXiv:2405.20337},
year={2024}
}
