UniAD
[CVPR'25] Official implementation for paper - Contextual AD Narration with Interleaved Multimodal Sequence
Contextual AD Narration with Interleaved Multimodal Sequence
<a href="https://arxiv.org/abs/2403.12922"><img src="https://img.shields.io/badge/arXiv-2403.12922-b31b1b.svg"></a> <a href="https://huggingface.co/hlwang06/UniAD"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>
Hanlin Wang<sup>1,3</sup>, Zhan Tong<sup>2</sup>, Kecheng Zheng<sup>3</sup>, Yujun Shen<sup>3</sup>, Limin Wang<sup>1,4,†</sup><br> <sup>1</sup>State Key Laboratory for Novel Software Technology, Nanjing University<br> <sup>2</sup>ESAT, KU Leuven <sup>3</sup>Ant Group <sup>4</sup>Shanghai Artificial Intelligence Laboratory <sup> <br><sup>†</sup>corresponding author
Setup
Follow the guide below to set up the environment.
Clone the repo:

```shell
git clone https://github.com/ant-research/UniAD
cd UniAD
```
Download and unzip checkpoints:

- Download necessary files from here
- Download `CLIP_L14_frames_features_5fps.h5` from MAD
- Use the method in AutoAD-II to obtain `MAD_examplers.pth.tar`
- Download `LLAMA2-7B`
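To sanity-check the downloaded feature file, a small `h5py` round-trip can help. The layout sketched here (one dataset per movie id, shape `[num_frames, 768]`, features at 5 fps) is an assumption for illustration; the actual keys inside `CLIP_L14_frames_features_5fps.h5` may differ.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical layout: one dataset per movie id, shape [num_frames, 768]
# (CLIP ViT-L/14 frame features extracted at 5 fps). Here we write a dummy
# file so the snippet is self-contained; swap in the real path to inspect
# the downloaded CLIP_L14_frames_features_5fps.h5.
path = os.path.join(tempfile.mkdtemp(), "demo_features.h5")

with h5py.File(path, "w") as f:  # create a dummy movie entry
    f.create_dataset("movie_0001", data=np.zeros((25, 768), dtype=np.float16))

with h5py.File(path, "r") as f:  # read it back the way a loader might
    feats = f["movie_0001"][:]
    seconds = feats.shape[0] / 5.0  # 5 fps -> seconds of video covered
print(feats.shape, seconds)
```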
Create environments and install packages.

Create the environment for MAD:

```shell
conda create -n UniAD_MAD python=3.8 -y
conda activate UniAD_MAD
pip install -r requirements_MAD_clean.txt
```

Create the environment for CMDAD & TVAD:

```shell
conda create -n UniAD_CMDAD python=3.8 -y
conda activate UniAD_CMDAD
pip install -r requirements_CMD_clean.txt
pip install --no-deps torchvision==0.13.1
```

Create the environment for critic evaluation on CMDAD & TVAD:

```shell
conda create -n UniAD_critic python=3.9 -y
conda activate UniAD_critic
pip install -r requirements_critic.txt
```
Running
We train our model on 8 A100 GPUs and evaluate on a single A6000 GPU.
Evaluation
Conduct evaluation on MAD:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py \
    --LLM_path 'LLAMA2-7B path' \
    --batch-size-val 3 \
    --char_feature_path 'MAD_examplers.pth.tar path' \
    --char_prompt_type 0 \
    --resume 'MAD.pt path' \
    --if_finutune_GPT 0 \
    --if_img_only 0 \
    --if_lora 1 \
    --if_only_flamingo 2 \
    --mylogs 'output file directory path' \
    --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' \
    --name MAD_LLAMA2 \
    --previous_video_num 1 \
    --val-data 'MAD_eval_char_refine_final.json path' \
    --workers 4
```
Conduct evaluation on CMDAD & TVAD:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py \
    --if_finutune_GPT 0 \
    --mylogs 'output file directory path' \
    --name CMDAD_LLAMA2 \
    --precision fp32 \
    --if_lora 1 \
    --train-data "" \
    --val-data 'cmdad_char_refine_eval.json path' \
    --log-every-n-steps 1 \
    --dataset-type json \
    --batch-size 1 \
    --batch-size-val 1 \
    --workers 1 \
    --Visual_Loss 0 \
    --LLM_path 'LLAMA2-7B path' \
    --if_only_flamingo 2 \
    --if_special_prompt 0 \
    --num_latents 32 \
    --num_char 32 \
    --if_img_only 0 \
    --movie_feature_eval_path 'VideoLLaMa_CMD_eval_fp16.h5 path' \
    --char_feature_path 'chars_all_videollama.pth.tar path' \
    --previous_video_num 1 \
    --lr 0.0001 \
    --resume 'CMDAD_TVAD.pt path' \
    --eval_data_name CMDAD
```
```shell
CUDA_VISIBLE_DEVICES=0 python main.py \
    --if_finutune_GPT 0 \
    --mylogs 'output file directory path' \
    --name TVAD_LLAMA2 \
    --precision fp32 \
    --if_lora 1 \
    --train-data "" \
    --val-data 'tvad_char_refine_eval.json path' \
    --log-every-n-steps 1 \
    --dataset-type json \
    --batch-size 1 \
    --batch-size-val 1 \
    --workers 1 \
    --Visual_Loss 0 \
    --LLM_path 'LLAMA2-7B path' \
    --if_only_flamingo 2 \
    --if_special_prompt 0 \
    --num_latents 32 \
    --num_char 32 \
    --if_img_only 0 \
    --movie_feature_eval_path 'TV_eval_videollama.h5 path' \
    --char_feature_path 'chars_all_videollama.pth.tar path' \
    --previous_video_num 1 \
    --lr 0.0001 \
    --resume 'CMDAD_TVAD.pt path' \
    --eval_data_name TVAD
```
Note:
During the preparation for open-sourcing, we conducted ablation experiments on CMDAD and TVAD using the latest experimental settings. We found that the effects of the context modeling and character refinement module were minimal after introducing the VideoLLaMA model and the character prediction results from AutoAD-Zero. For more specific details, please refer to our updated arXiv paper.
Train on MAD
Prepare the MAD training data from MAD, and use the character prediction results from AutoAD to organize the data into the following format:
```json
[
    {
        "start": "",
        "end": "",
        "ad": "",
        "char": [],
        "ad_chars": [],
        "ad_chars_in_chars": [],
        "context": [
            {
                "start": "",
                "end": "",
                "ad": "",
                "char": [],
                "ad_chars": [],
                "ad_chars_in_chars": []
            }
        ],
        "movie_id": ""
    },
    ...
]
```
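A quick structural check can catch malformed entries before training. The field names below come from the schema above; the value types (timestamps as strings, character names as lists) are my assumptions for illustration.

```python
import json

# Keys shared by a top-level AD entry and each of its context items,
# per the schema above.
REQUIRED = {"start", "end", "ad", "char", "ad_chars", "ad_chars_in_chars"}

def check_entry(entry):
    """Raise ValueError if one training entry is missing required keys."""
    missing = (REQUIRED | {"context", "movie_id"}) - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for ctx in entry["context"]:  # context items reuse the inner schema
        if REQUIRED - ctx.keys():
            raise ValueError("malformed context item")
    return True

# A hypothetical entry (values are made up, only the structure matters).
sample = json.loads("""[{
  "start": "10.5", "end": "13.2", "ad": "He opens the door.",
  "char": ["JACK"], "ad_chars": ["JACK"], "ad_chars_in_chars": [0],
  "context": [{"start": "5.0", "end": "8.0", "ad": "A hallway.",
               "char": [], "ad_chars": [], "ad_chars_in_chars": []}],
  "movie_id": "movie_0001"
}]""")
print(all(check_entry(e) for e in sample))  # → True
```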
Then run:
```shell
torchrun --nproc_per_node 8 -m main \
    --if_finutune_GPT 0 \
    --accum-freq 4 \
    --if_lora 1 \
    --if_only_flamingo 2 \
    --num_latents 30 \
    --if_special_prompt 0 \
    --if_img_only 0 \
    --num_char 30 \
    --previous_video_num 1 \
    --AD_pretrained 1 \
    --AD_pretrained_checkpoint 'LLaMA_AD_pretrain.pt path' \
    --movie_feature_path 'CLIP_L14_frames_features_5fps.h5 path' \
    --char_feature_path 'MAD_examplers.pth.tar path' \
    --dataset 'MAD' \
    --train-data 'train data path' \
    --val-data '' \
    --LLM_path 'LLAMA2-7B path' \
    --batch-size 3 \
    --batch-size-val 3 \
    --epochs 10 \
    --LLM_name 'LLaMA' \
    --lr 0.00005 \
    --warmup 6000 \
    --save-frequency 1 \
    --val-frequency 1 \
    --precision 'fp32' \
    --mylogs 'output file directory path' \
    --name 'output file name'
```
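As a back-of-envelope check of the training configuration: with a per-GPU batch size of 3 across 8 GPUs, and assuming `--accum-freq 4` denotes gradient-accumulation steps (my reading of the flag, not confirmed by the source), the effective global batch size works out as follows.

```python
# Effective global batch size implied by the training flags above,
# assuming --accum-freq means gradient-accumulation steps.
per_gpu_batch = 3   # --batch-size 3
num_gpus = 8        # torchrun --nproc_per_node 8
accum_freq = 4      # --accum-freq 4

effective_batch = per_gpu_batch * num_gpus * accum_freq
print(effective_batch)  # → 96
```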
Citation
If you find this work useful in your research, please cite:
```bibtex
@article{wang2024uniad,
  title={Contextual AD Narration with Interleaved Multimodal Sequence},
  author={Hanlin Wang and Zhan Tong and Kecheng Zheng and Yujun Shen and Limin Wang},
  year={2025},
  eprint={2403.12922},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
Acknowledgement
Our implementation is based on several open-source projects. Thanks for their remarkable contributions and released code!
Note
Note: This repo is governed by the license of llama. We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.