
2025 IEEE International Conference on Robotics & Automation (ICRA2025)

AVD2: Accident Video Diffusion for Accident Video Description

The First Work to Generate Accident Videos:

The Teaser

This repository is an official implementation of AVD2: Accident Video Diffusion for Accident Video Description.

Created by:
Cheng Li<sup>[1,2,*]</sup>, Keyuan Zhou<sup>[1,3,*]</sup>, Tong Liu<sup>[1,4,*]</sup>, Yu Wang<sup>[1,5,*]</sup>, Mingqiao Zhuang<sup>[6]</sup>,
Huan-ang Gao<sup>[1]</sup>, Bu Jin<sup>[1]</sup>, and Hao Zhao<sup>[1,7,8,†]</sup>

* Indicates equal contribution.
† The corresponding author.

Affiliations:

  1. Institute for AI Industry Research (AIR), Tsinghua University.
  2. Academy of Interdisciplinary Studies, the Hong Kong University of Science and Technology.
  3. College of Communication Engineering, Jilin University.
  4. School of Cyber Science and Engineering, Nanjing University of Science and Technology.
  5. School of Automation, Beijing Institute of Technology.
  6. College of Foreign Language and Literature, Fudan University.
  7. Beijing Academy of Artificial Intelligence (BAAI).
  8. Lightwheel AI.

Our System Framework:

The Framework Architecture

Our AVD2 Project Video is available at:

AVD2 Project Video: https://youtu.be/iGdSIofB_k8

Introduction

We propose a novel framework, AVD2 (Accident Video Diffusion for Accident Video Description), which enhances transparency and explainability in autonomous driving systems by providing detailed natural-language narrations and reasoning for accident scenarios. AVD2 jointly tackles the accident description and prevention tasks, offering actionable insights through a shared video representation. This repository includes (to be released soon) the full implementation of AVD2, along with the training and evaluation setups, the generated accident dataset EMM-AU, and the conda environment.

Note

- We have uploaded the required environment for our AVD2 system.
- We have released the whole raw EMM-AU dataset (including the raw MM-AU dataset and the raw generated videos).
- We have released the whole processed EMM-AU dataset.
- We have released the instructions and code for data augmentation (including the super-resolution code and the instructions for Open-Sora fine-tuning).
- We have released the checkpoint of our fine-tuned, improved Open-Sora 1.2 model.
- We have released the data preprocessing code ("/root/src/prepro/") and the model evaluation code ("/root/src/evalcap/" & "/root/evaluation/") of the project.

Getting Started

Environment

Create conda environment:

conda create --name AVD2 python=3.8

Install torch:

pip install torch==1.13.1+cu117 torchaudio==0.13.1+cu117 torchvision==0.14.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html

Install apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ..
rm -rf apex

Install mpi4py:

conda install -c conda-forge mpi4py openmpi

Install other dependencies and packages

pip install -r requirements.txt

More Details for our System

Our AVD2 framework is based on the Action-aware Driving Caption Transformer (ADAPT) and Self-Critical Sequence Training (SCST).
The code and more information about ADAPT and SCST can be found here:
ADAPT: https://arxiv.org/pdf/2302.00673
ADAPT codes: https://github.com/jxbbb/ADAPT/tree/main?tab=MIT-1-ov-file
SCST: https://arxiv.org/abs/1612.00563
SCST codes: https://github.com/ruotianluo/self-critical.pytorch
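
SCST optimizes the captioner with a policy-gradient loss whose baseline is the reward of the greedily decoded caption, so only sampled captions that beat greedy decoding receive a positive advantage. The following is a minimal pure-Python sketch of that update for one example; the real reward is CIDEr against reference captions, replaced here by a toy word-overlap stand-in, and `scst_loss` is an illustrative name, not a function from this repo.

```python
# Minimal sketch of the Self-Critical Sequence Training (SCST) update.
# The real reward is CIDEr computed against reference captions; a toy
# unigram-overlap stand-in is used here instead.

def reward(caption, reference):
    """Toy reward: fraction of reference words that appear in the caption."""
    cap, ref = set(caption.split()), reference.split()
    return sum(w in cap for w in ref) / len(ref)

def scst_loss(sampled, greedy, reference, log_prob_sampled):
    """SCST policy-gradient loss for one example.

    The greedily decoded caption acts as the baseline, so the gradient
    pushes up only samples that score better than greedy decoding.
    """
    advantage = reward(sampled, reference) - reward(greedy, reference)
    return -advantage * log_prob_sampled  # minimized by the optimizer

ref = "a car changes lanes and collides with another vehicle"
loss = scst_loss("a car changes lanes", "a car drives", ref, log_prob_sampled=-2.0)
```

Because the baseline is the model's own greedy output, no learned value function is needed, which is what makes SCST attractive for caption-level rewards like CIDEr.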

Dataset

This part includes the Dataset Preprocessing code, the Raw Dataset (including the whole EMM-AU dataset), the codes and steps to do the data augmentation and the Processed Dataset.

Dataset Preprocessing

You need to change the names and locations of the train/val/test datasets before running:

cd src
cd prepro
sh preprocess.sh

Raw Dataset Download

EMM-AU (Enhanced MM-AU) contains the "Raw MM-AU Dataset" and the "Enhanced Generated Videos".

| Parts | Download |
|-------------------|----------------------|
| Raw MM-AU Dataset | Official GitHub Page |
| Our Enhanced Generated Videos | HuggingFace |

Data Augmentation

We utilized Open-Sora 1.2 to generate the "Enhanced" part of EMM-AU. You can refer to the Open-Sora official GitHub page for installation.

Fine-tuning for Open-Sora 1.2

Before fine-tuning, you need to prepare a CSV file.
An example ready for training:

path, text, num_frames, width, height, aspect_ratio
/absolute/path/to/image1.jpg, caption, 1, 720, 1280, 0.5625
/absolute/path/to/video1.mp4, caption, 120, 720, 1280, 0.5625
/absolute/path/to/video2.mp4, caption, 20, 256, 256, 1
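
The rows above can be generated and sanity-checked with a small stdlib helper. This helper is hypothetical (not part of the repo); it only illustrates the column layout and checks that `aspect_ratio` matches the listed dimensions (e.g. 0.5625 == 720/1280).

```python
# Hypothetical helper: write an Open-Sora-style training CSV and verify
# that aspect_ratio is consistent with the listed dimensions.
import csv
import io

ROWS = [
    ("/absolute/path/to/image1.jpg", "caption", 1, 720, 1280, 0.5625),
    ("/absolute/path/to/video1.mp4", "caption", 120, 720, 1280, 0.5625),
    ("/absolute/path/to/video2.mp4", "caption", 20, 256, 256, 1.0),
]

def write_csv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["path", "text", "num_frames", "width", "height", "aspect_ratio"])
    for path, text, num_frames, d1, d2, ratio in rows:
        # Catch mismatched dimensions before a long training run starts.
        assert abs(d1 / d2 - ratio) < 1e-6, f"aspect_ratio mismatch for {path}"
        writer.writerow([path, text, num_frames, d1, d2, ratio])
    return buf.getvalue()

csv_text = write_csv(ROWS)
```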

Then use the commands below to train a new model or fine-tune an existing one (based on YOUR_PRETRAINED_CKPT).
You can also change the training config in "configs/opensora-v1-2/train/stage3.py".

# one node
torchrun --standalone --nproc_per_node 8 scripts/train.py \
    configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
# multiple nodes
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
    configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

Inference with Open-Sora 1.2

You can download our pretrained model for accident video generation.

# text to video
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --prompt "a beautiful waterfall"

# batch generation (needs a txt file with one prompt per line)
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --num-sampling-steps 30 --flow 5 --aes 6.5 \
  --prompt-path YOUR_PROMPT.TXT \
  --batch-size 1 \
  --loop 1 \
  --save-dir YOUR_SAVE_DIR \
  --ckpt-path YOUR_CHECKPOINT
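
The prompt file passed via `--prompt-path` is plain text with one prompt per line. A minimal sketch of preparing it (the file name and example prompts are illustrative, not from the repo):

```python
# Build the prompt .txt file expected by batch generation:
# one accident-scene prompt per line.
from pathlib import Path
import tempfile

prompts = [
    "a vehicle changes lanes and collides with an oncoming car",
    "a car runs a red light at a rainy intersection",
]

# Illustrative location; point this wherever YOUR_PROMPT.TXT should live.
prompt_path = Path(tempfile.gettempdir()) / "emm_au_prompts.txt"
prompt_path.write_text("\n".join(prompts) + "\n")
```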

RRDBNet Super-Resolution

The conda environment for the super-resolution part can be installed as:

conda create --name S_R python=3.8
source activate S_R
cd src/Super_resolution
pip install -r requirements.txt

Also, you may need to install these two codebases:
The first one:

pip install git+https://github.com/XPixelGroup/BasicSR.git

The second one:

pip install git+https://github.com/xinntao/Real-ESRGAN.git

Then run the RRDBNet model within the Real-ESRGAN framework to perform the super-resolution steps on the dataset:

python Super_Resolution.py

Processed Dataset Download

You can download the Processed_EMM-AU_Dataset from our HuggingFace.
The captions (annotations) for the 2000 generated videos have been released in "root/Process_Dataset/generated_2000videos_captions.json".
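
Loading the caption file is a one-liner with the stdlib. The exact JSON layout is not documented here, so a video-id-to-caption mapping is assumed for illustration (fed from an in-memory sample); adapt the access pattern to the real file.

```python
# Sketch of loading the released captions JSON. The mapping layout
# {video_id: caption} is an assumption for illustration only.
import io
import json

sample_file = io.StringIO(json.dumps({
    "video_0001.mp4": "A vehicle changes lanes and collides with the ego-car.",
}))  # stand-in for open("generated_2000videos_captions.json")
captions = json.load(sample_file)

for video, caption in captions.items():
    print(video, "->", caption)
```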

Download Our Fine-tuned Open-Sora 1.2 model for Video Generation

You can download the checkpoint of pretrained_model_for_video_generation from our HuggingFace. This is our improved Open-Sora 1.2 model, obtained by two-stage fine-tuning of the original official pretrained Open-Sora.

Train the Basic Model

conda activate AVD2
sh scripts/BDDX_multitask.sh

Testing/Evaluation

You can download the output from "/root/output/checkpoint".
To evaluate the output, first convert the data format:

cd evaluation
python tsv2coco.py
python json2coco.py
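
The conversion scripts above wrap the captions into the COCO caption schema (`images` and `annotations` lists) that pycocoevalcap expects. A hedged sketch of that transformation; the input layout and function name are assumptions for illustration, not the actual contents of tsv2coco.py / json2coco.py:

```python
# Sketch of a caption -> COCO-format conversion. Field names follow the
# COCO caption schema; the {video_id: caption} input is assumed.
def to_coco(id_to_caption):
    """Wrap a {video_id: caption} mapping into COCO-style lists."""
    images, annotations = [], []
    for i, (vid, cap) in enumerate(sorted(id_to_caption.items())):
        images.append({"id": i, "file_name": vid})
        # Each annotation points back at its image via image_id.
        annotations.append({"id": i, "image_id": i, "caption": cap})
    return {"images": images, "annotations": annotations}

coco = to_coco({"video_0001.mp4": "A vehicle changes lanes."})
```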

Here, we provide the correctly transformed data formats ("/root/evaluation/ground_truth_captions1", "/root/evaluation/ground_truth_captions2", "/root/evaluation/generated_captions1", "/root/evaluation/generated_captions2").
Then you can run the testing/evaluation code:

pip install pycocoevalcap -i https://pypi.tuna.tsinghua.edu.cn/simple
# or
pip install pycocoevalcap
python pycocoevaluationmetric.py

Visualization

Here are random examples of generated accident frames from our EMM-AU dataset:

The example frame

This is a visualization of the understanding ability of our AVD2 system (compared with ChatGPT-4o and the ground truth):

Accident example 1:

Example of EMMAU 1
<span style="color:black">AVD2 Prediction</span>
<span style="color:black">Description:</span> A vehicle changes lanes with the same direction to ego-car; Vehicles don't give way to normal driving vehicles when turning or changing lanes.
<span style="color:black">Avoidance:</span> Before turning or changing lanes, vehicles should turn on the turn signal in advance, observe the surrounding vehicles and control the speed. When driving, vehicles should abide by traffic rules, and give the way for the normal running vehicles. Vehicles that will enter the main road should give way to the veh
