AVD2: Accident Video Diffusion for Accident Video Description
2025 IEEE International Conference on Robotics & Automation (ICRA 2025)
The First Work to Generate Accident Videos:

This repository is an official implementation of AVD2: Accident Video Diffusion for Accident Video Description.
Created by:
Cheng Li<sup>[1,2,*]</sup>, Keyuan Zhou<sup>[1,3,*]</sup>, Tong Liu<sup>[1,4,*]</sup>, Yu Wang<sup>[1,5,*]</sup>, Mingqiao Zhuang<sup>[6]</sup>,
Huan-ang Gao<sup>[1]</sup>, Bu Jin<sup>[1]</sup>, and Hao Zhao<sup>[1,7,8,†]</sup>
* Indicates equal contribution.
† The corresponding author.
Affiliations:
- Institute for AI Industry Research (AIR), Tsinghua University.
- Academy of Interdisciplinary Studies, the Hong Kong University of Science and Technology.
- College of Communication Engineering, Jilin University.
- School of Cyber Science and Engineering, Nanjing University of Science and Technology.
- School of Automation, Beijing Institute of Technology.
- College of Foreign Language and Literature, Fudan University.
- Beijing Academy of Artificial Intelligence (BAAI).
- Lightwheel AI.
Our System Framework:

Our AVD2 project video is available at: https://youtu.be/iGdSIofB_k8
Introduction
We propose a novel framework, AVD2 (Accident Video Diffusion for Accident Video Description), which enhances transparency and explainability in autonomous driving systems by providing detailed natural language narrations and reasoning for accident scenarios. AVD2 jointly tackles both the accident description and prevention tasks, offering actionable insights through a shared video representation. This repository includes (to be released soon) the full implementation of AVD2, along with the training and evaluation setups, the generated accident dataset EMM-AU, and the conda environment.
Note
- We have uploaded the required conda environment for our AVD2 system.
- We have released the whole raw EMM-AU dataset (including the raw MM-AU dataset and the raw generated videos).
- We have released the whole processed EMM-AU dataset.
- We have released the instructions and code for data augmentation (including the super-resolution code and the instructions for Open-Sora fine-tuning).
- We have released the checkpoint file of our fine-tuned, improved Open-Sora 1.2 model.
- We have released the data preprocessing code ("/root/src/prepro/") and the model evaluation code ("/root/src/evalcap/" & "/root/evaluation/") of the project.
Getting Started: Environment
Create conda environment:
conda create --name AVD2 python=3.8
Install torch:
pip install torch==1.13.1+cu117 torchaudio==0.13.1+cu117 torchvision==0.14.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html
Install apex:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ..
rm -rf apex
Install mpi4py:
conda install -c conda-forge mpi4py openmpi
Install other dependencies and packages:
pip install -r requirements.txt
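After installation, it can be worth confirming that the pinned versions above actually landed in the environment. The sketch below is an optional, hypothetical sanity check (not part of the AVD2 codebase); it compares only the release segment of each version so CUDA suffixes like `+cu117` don't cause false mismatches.

```python
# Hypothetical sanity check for the AVD2 environment; not part of the repo.
from importlib.metadata import PackageNotFoundError, version

# Pinned versions from the pip command above (release segment only).
EXPECTED = {"torch": "1.13.1", "torchvision": "0.14.1", "torchaudio": "0.13.1"}

def version_matches(installed: str, expected: str) -> bool:
    """Compare versions while ignoring local suffixes such as '+cu117'."""
    return installed.split("+")[0] == expected

def check_environment() -> list:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for pkg, expected in EXPECTED.items():
        try:
            installed = version(pkg)
            if not version_matches(installed, expected):
                problems.append(f"{pkg}: found {installed}, expected {expected}")
        except PackageNotFoundError:
            problems.append(f"{pkg}: not installed")
    return problems

if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```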
More Details about Our System
Our AVD2 framework is based on the Action-aware Driving Caption Transformer (ADAPT) and Self-Critical Sequence Training (SCST).
The code and more information about ADAPT and SCST can be found here:
ADAPT: https://arxiv.org/pdf/2302.00673
ADAPT codes: https://github.com/jxbbb/ADAPT/tree/main?tab=MIT-1-ov-file
SCST: https://arxiv.org/abs/1612.00563
SCST codes: https://github.com/ruotianluo/self-critical.pytorch
Dataset
This part covers the dataset preprocessing code, the raw dataset (the whole EMM-AU dataset), the code and steps for data augmentation, and the processed dataset.
Dataset Preprocessing
Before running, change the names and locations of the train/val/test datasets to match your setup.
cd src
cd prepro
sh preprocess.sh
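Since the split names and locations must match what `preprocess.sh` expects, a quick check before running it can save a failed pass. The sketch below is a hypothetical helper (the `data/EMM-AU` root and the `train.json`/`val.json`/`test.json` naming are assumptions; adapt them to the paths your copy of the script actually uses).

```python
# Hypothetical pre-flight check before running preprocess.sh.
# The directory layout and split-file names below are assumptions.
from pathlib import Path

DATA_ROOT = Path("data/EMM-AU")  # assumed dataset root

def check_splits(root: Path, splits=("train", "val", "test")) -> list:
    """Return the split annotation files that are missing under `root`."""
    missing = []
    for split in splits:
        candidate = root / f"{split}.json"  # assumed naming convention
        if not candidate.exists():
            missing.append(candidate)
    return missing

if __name__ == "__main__":
    for path in check_splits(DATA_ROOT):
        print(f"missing split file: {path}")
```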
Raw Dataset Download
EMM-AU (Enhanced MM-AU Dataset) contains the "Raw MM-AU Dataset" and the "Enhanced Generated Videos".
| Parts | Download |
|-------------------|----------------------|
| Raw MM-AU Dataset | Official Github Page |
| Our Enhanced Generated Videos | HuggingFace |
Data Augmentation
We used Open-Sora 1.2 to generate the "Enhanced Part" of EMM-AU. You can refer to the official Open-Sora GitHub page for installation instructions.
Fine-tuning for Open-Sora 1.2
Before fine-tuning, you need to prepare a CSV file describing the training clips.
An example ready for training:
path, text, num_frames, width, height, aspect_ratio
/absolute/path/to/image1.jpg, caption, 1, 720, 1280, 0.5625
/absolute/path/to/video1.mp4, caption, 120, 720, 1280, 0.5625
/absolute/path/to/video2.mp4, caption, 20, 256, 256, 1
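A CSV in the shape of the example above can be produced with a short script. The sketch below is a minimal, hypothetical generator using only the standard library; the clip entries are placeholders, and in practice you would read each video's frame count and resolution from the file itself (e.g. with ffprobe). The aspect ratio is computed as width/height, matching the 720×1280 → 0.5625 row in the example.

```python
# Hypothetical helper for building an Open-Sora-style training CSV.
# Clip metadata below is placeholder data, not real dataset entries.
import csv

FIELDS = ["path", "text", "num_frames", "width", "height", "aspect_ratio"]

def write_training_csv(rows, out_path="train_data.csv"):
    """rows: iterable of (path, text, num_frames, width, height) tuples."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        for path, text, num_frames, width, height in rows:
            # aspect_ratio = width / height, as in the example rows above
            writer.writerow(
                [path, text, num_frames, width, height, round(width / height, 4)]
            )

clips = [
    ("/absolute/path/to/video1.mp4", "an accident caption", 120, 720, 1280),
]
write_training_csv(clips)
```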
Then use the following commands to train a new model or fine-tune an existing one (based on YOUR_PRETRAINED_CKPT).
You can also change the training config in "configs/opensora-v1-2/train/stage3.py".
# one node
torchrun --standalone --nproc_per_node 8 scripts/train.py \
configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
# multiple nodes
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
Inference with Open-Sora 1.2
You can download our pretrained model for accident video generation.
# text to video
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
--num-frames 4s --resolution 720p --aspect-ratio 9:16 \
--prompt "a beautiful waterfall"
# batch generation (requires a txt file with one prompt per line)
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
--num-frames 4s --resolution 720p --aspect-ratio 9:16 \
--num-sampling-steps 30 --flow 5 --aes 6.5 \
--prompt-path YOUR_PROMPT.TXT \
--batch-size 1 \
--loop 1 \
--save-dir YOUR_SAVE_DIR \
--ckpt-path YOUR_CHECKPOINT
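The `--prompt-path` file is just a plain text file with one prompt per line. The sketch below is a hypothetical helper for assembling it from a list of captions (the caption strings are placeholders; in practice you might load them from your own annotation file); it collapses internal newlines so each prompt stays on a single line.

```python
# Hypothetical helper for building the --prompt-path file (one prompt per line).
def write_prompt_file(captions, out_path="prompts.txt"):
    with open(out_path, "w") as f:
        for caption in captions:
            # collapse any internal whitespace/newlines into single spaces
            f.write(" ".join(caption.split()) + "\n")

# Placeholder captions for illustration only.
captions = [
    "a vehicle changes lanes without giving way and collides with another car",
    "an ego vehicle brakes suddenly on a wet road",
]
write_prompt_file(captions)
```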
RRDBNet Super-Resolution
The conda environment for the super-resolution part can be installed as:
conda create --name S_R python=3.8
source activate S_R
cd src/Super_resolution
pip install -r requirements.txt
Also, you may need to install these two codebases:
The first one:
pip install git+https://github.com/XPixelGroup/BasicSR.git
The second one:
pip install git+https://github.com/xinntao/Real-ESRGAN.git
Then run the RRDBNet model code within the Real-ESRGAN framework to perform the super-resolution steps on the dataset:
python Super_Resolution.py
Processed Dataset Download
You can download the Processed_EMM-AU_Dataset from our HuggingFace.
All of the captions (annotations) for the 2000 generated videos have been released in "root/Process_Dataset/generated_2000videos_captions.json".
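The exact schema of the released captions file is not shown here, so the loader below is a hedged sketch: it assumes a flat JSON object mapping a video filename to its caption string. Inspect the released JSON and adapt the accessor if the file is structured differently.

```python
# Hedged sketch: load the released captions JSON. The flat
# {video_filename: caption} schema below is an assumption.
import json

def load_captions(path):
    with open(path) as f:
        data = json.load(f)
    return {str(video_id): caption for video_id, caption in data.items()}

# Example (path from the README; uncomment to run against the real file):
# captions = load_captions("Process_Dataset/generated_2000videos_captions.json")
# print(len(captions))
```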
Download Our Fine-tuned Open-Sora 1.2 model for Video Generation
You can download the checkpoint of the pretrained_model_for_video_generation from our HuggingFace. This is our improved Open-Sora 1.2 model, obtained through two-step fine-tuning of the original official pretrained Open-Sora.
Train the Basic Model
conda activate AVD2
sh scripts/BDDX_multitask.sh
Testing/Evaluation
You can download the output from "/root/output/checkpoint".
To evaluate the output, you first need to convert the data format:
cd evaluation
python tsv2coco.py
python json2coco.py
Here, we provide the correctly transformed data formats ("/root/evaluation/ground_truth_captions1", "/root/evaluation/ground_truth_captions2", "/root/evaluation/generated_captions1", "/root/evaluation/generated_captions2").
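For orientation, the conversion scripts target the COCO caption layout that pycocoevalcap consumes: ground truth as an `{"images": [...], "annotations": [...]}` document, and generated results as a flat list of `{"image_id", "caption"}` records. The sketch below illustrates that layout under those assumptions; it follows the general COCO captions convention, not necessarily the exact output of tsv2coco.py / json2coco.py.

```python
# Illustrative sketch of the COCO caption format used by pycocoevalcap.
# Field layout follows the COCO captions convention (an assumption here).
def to_coco_ground_truth(captions):
    """captions: dict mapping an integer image/video id to its caption."""
    return {
        "images": [{"id": i} for i in captions],
        "annotations": [
            {"id": i, "image_id": i, "caption": c} for i, c in captions.items()
        ],
    }

def to_coco_results(captions):
    """Generated captions become a flat list of result records."""
    return [{"image_id": i, "caption": c} for i, c in captions.items()]

gt = to_coco_ground_truth({1: "a vehicle changes lanes"})
res = to_coco_results({1: "a car merges into the lane"})
```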
Then, you can run the testing/evaluation code:
pip install pycocoevalcap -i https://pypi.tuna.tsinghua.edu.cn/simple
# or
pip install pycocoevalcap
python pycocoevaluationmetric.py
Visualization
These are random examples of the generated accident frames in our EMM-AU dataset:

This is a visualization of the understanding ability of our AVD2 system (compared with ChatGPT-4o and the ground truth):
Accident example 1:

<span style="color:black">AVD2 Prediction</span>
<span>Description:</span>
A vehicle changes lanes with the same direction to ego-car; Vehicles don't give way to normal driving vehicles when turning or changing lanes.
<span>Avoidance:</span>
Before turning or changing lanes, vehicles should turn on the turn signal in advance, observe the surrounding vehicles and control the speed. When driving, vehicles should abide by traffic rules, and give the way for the normal running vehicles. Vehicles that will enter the main road should give way to the veh
