DrivingDiffusion
[ECCV 2024] Official implementation of the paper "DrivingDiffusion: Layout-Guided Multi-View Driving Scene Video Generation with Latent Diffusion Model".
The first multi-view driving scene video generator.
Project Page | Paper
[DrivingDiffusion] Training Pipeline

<img width="907" alt="main" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/main.png">

Consistency Module & Local Prompt

<img width="500" alt="main" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/detail.png">

[DrivingDiffusion] Long Video Generation Pipeline

<img width="907" alt="main" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/inference.png">

[DrivingDiffusion-Future] Future Generation Pipeline

<img width="780" alt="main" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/future_pipe2.png">

Abstract
With the increasing popularity of autonomous driving built on the powerful and unified bird's-eye-view (BEV) representation, high-quality, large-scale multi-view video data with accurate annotations is urgently needed. However, such large-scale multi-view data is hard to obtain due to expensive collection and annotation costs. To alleviate this problem, we propose DrivingDiffusion, a spatial-temporally consistent diffusion framework that generates realistic multi-view videos controlled by 3D layout. Synthesizing multi-view videos from a 3D layout poses three challenges: how to keep 1) cross-view consistency and 2) cross-frame consistency, and 3) how to guarantee the quality of the generated instances. DrivingDiffusion solves the problem by cascading a multi-view single-frame image generation step, a single-view video generation step shared by multiple cameras, and a post-processing stage that handles long video generation. In the multi-view model, the consistency of multi-view images is ensured by information exchange between adjacent cameras. In the temporal model, subsequent frames mainly query the information they need from the multi-view images of the first frame. We also introduce a local prompt to effectively improve the quality of generated instances. In post-processing, we further enhance the cross-view consistency of subsequent frames and extend the video length by employing a temporal sliding-window algorithm. Without any extra cost, our model can generate large-scale, realistic multi-camera driving videos of complex urban scenes, fueling downstream driving tasks. The code will be made publicly available.

<img width="907" alt="abs" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/intro.png">
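The sliding-window idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: `generate_window` is a hypothetical stand-in for the temporal diffusion model, and the frames are just integers so the stitching logic is visible. Each new window is conditioned on the last few frames of the previous one, which is how overlapping windows keep the extended video temporally consistent.

```python
# Minimal sketch of long-video generation with a temporal sliding window.
# `generate_window` is a hypothetical placeholder for the temporal model.

def generate_window(cond_frames, length):
    """Placeholder generator: continues the sequence from its conditioning frames."""
    start = cond_frames[-1] + 1 if cond_frames else 0
    return list(range(start, start + length))

def sliding_window_video(total_frames, window=8, overlap=2):
    """Stitch windows of `window` frames, reusing `overlap` tail frames as conditioning."""
    video = generate_window([], window)   # first window: no conditioning
    while len(video) < total_frames:
        cond = video[-overlap:]           # tail frames condition the next window
        video.extend(generate_window(cond, window - overlap))
    return video[:total_frames]

print(sliding_window_video(20, window=8, overlap=2))  # → [0, 1, ..., 19]
```

In the real pipeline the conditioning frames would be the previously generated multi-view images rather than integers, but the window/overlap bookkeeping is the same.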
News & Logs
- [2023/8/15] Single-View future generation.
- [2023/5/08] Multi-View video generation controlled by 3D Layout.
- [2023/3/01] Multi-View image generation controlled by 3D Layout.
- [2023/3/01] Single-View image generation controlled by 3D Layout.
- [2023/2/03] Single-View image generation controlled by Laneline Layout.
Usage
Setup Environment
```shell
conda create -n dridiff python=3.8
conda activate dridiff
pip install -r requirements.txt
```
DrivingDiffusion is trained on 8 A100 GPUs.
Weights
We use the stable-diffusion-v1-4 initial weights and base structure. Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. For more information about how Stable Diffusion works, please have a look at 🤗's Stable Diffusion with 🧨 Diffusers blog post on HuggingFace.
Data Preparation
nuScenes
Custom Dataset
Training
Coming soon...
Inference
Coming soon...
Results
Visualization of Multi-View Image Generation

<div align="center"> <img width="750" alt="abs" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/multiview_img.png"> </div>

Visualization of Temporal Generation

<div align="center"> <img width="750" alt="abs" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/temporal_img_f.png"> </div>

Visualization of Control Capability

<div align="center"> <img width="907" alt="abs" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/control.png"> </div>

Multi-View Video Generation of Driving Scenes Controlled by 3D Layout

Ability to Construct the Future

Future video generation controlled by a text description of road conditions:

<div align="center"> <img width="907" alt="abs" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/future.png"> </div>

Future video generation without a text description of road conditions:

<div align="center"> <img width="907" alt="abs" src="https://github.com/DrivingDiffusion/DrivingDiffusion.github.io/blob/main/static/images/future_unc.png"> </div>

Citation
If DrivingDiffusion is useful or relevant to your research, please kindly recognize our contributions by citing our paper:
```
@article{li2023drivingdiffusion,
  title={DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model},
  author={Li, Xiaofan and Zhang, Yifu and Ye, Xiaoqing},
  journal={arXiv preprint arXiv:2310.07771},
  year={2023}
}
```
