Video Diffusion Alignment via Reward Gradient

VADER

</div>

This is the official implementation of our paper Video Diffusion Alignment via Reward Gradient by

Mihir Prabhudesai*, Zheyang Qin*, Russell Mendonca*, Katerina Fragkiadaki, Deepak Pathak.

Abstract

We have made significant progress towards building foundational video diffusion models. As these models are trained using large-scale unsupervised data, it has become crucial to adapt these models to specific downstream tasks, such as video-text alignment or ethical video generation. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we instead utilize pre-trained reward models that are learned via preferences on top of powerful discriminative models. These models contain dense gradient information with respect to generated RGB pixels, which is critical for learning efficiently in complex search spaces, such as videos. We show that our approach can enable alignment of video diffusion for aesthetic generations, similarity between text context and video, as well as long-horizon video generations that are 3X longer than the training sequence length. We show that our approach can learn much more efficiently in terms of reward queries and compute than previous gradient-free approaches for video generation.

Features

[x] Adaptation of VideoCrafter2 Text-to-Video Model
[x] Adaptation of Open-Sora V1.2 Text-to-Video Model
[x] Adaptation of ModelScope Text-to-Video Model
[x] DPO and DDPO Baselines
[ ] Adaptation of Stable Video Diffusion Image2Video Model
[ ] Movie generation code

Demo

|     |     |     |
| --- | --- | --- |
| <img src="assets/videos/8.gif" width=""> | <img src="assets/videos/5.gif" width=""> | <img src="assets/videos/7.gif" width=""> |
| <img src="assets/videos/10.gif" width=""> | <img src="assets/videos/3.gif" width=""> | <img src="assets/videos/4.gif" width=""> |
| <img src="assets/videos/9.gif" width=""> | <img src="assets/videos/1.gif" width=""> | <img src="assets/videos/11.gif" width=""> |

🌟 VADER-VideoCrafter

We highly recommend starting with the VADER-VideoCrafter model, which performs better.

⚙️ Installation

Assuming you are in the VADER/ directory, you can create a Conda environment for VADER-VideoCrafter with the following commands:

cd VADER-VideoCrafter
conda create -n vader_videocrafter python=3.10
conda activate vader_videocrafter
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install xformers -c xformers
pip install -r requirements.txt
git clone https://github.com/tgxs002/HPSv2.git
cd HPSv2/
pip install -e .
cd ..
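
After installation, a quick sanity check can confirm that the environment matches the versions pinned above. The snippet below is a minimal sketch and not part of the repository: `check_torch` is a hypothetical helper name, and it assumes that `python` resolves to the interpreter of the active Conda environment.

```shell
# Hypothetical helper (not part of the repository): report the PyTorch version,
# the CUDA version it was built against, and whether a GPU is visible.
check_torch() {
  python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())' 2>/dev/null \
    || echo "PyTorch is not importable; is the vader_videocrafter environment active?"
}

check_torch
```

If the reported versions differ from 2.3.0 and 12.1, the scripts may still work, but that combination is not the one pinned by the commands above.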

We use the pretrained Text-to-Video VideoCrafter2 model via Hugging Face. If the model is not downloaded automatically when you run the inference or training script, you can download it manually and place it at VADER/VADER-VideoCrafter/checkpoints/base_512_v2/model.ckpt.
We provide pretrained LoRA weights on Hugging Face. vader_videocrafter_pickscore.pt was fine-tuned using the PickScore reward function on chatgpt_custom_animal.txt with a LoRA rank of 16, while vader_videocrafter_hps_aesthetic.pt was fine-tuned using a combination of the HPSv2.1 and Aesthetic reward functions on chatgpt_custom_instruments.txt with a LoRA rank of 8.
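
If you need to place the base checkpoint manually, a small check like the following can confirm that it sits where the scripts are expected to look. This is a sketch that assumes you run it from VADER/VADER-VideoCrafter; the variable names are illustrative, and only the path itself comes from the note above.

```shell
# Run from VADER/VADER-VideoCrafter. The path is the one given in the note above.
CKPT_DIR="checkpoints/base_512_v2"
CKPT_FILE="$CKPT_DIR/model.ckpt"

# Create the directory so a manually downloaded checkpoint has a place to go.
mkdir -p "$CKPT_DIR"

if [ -f "$CKPT_FILE" ]; then
  echo "Found $CKPT_FILE"
else
  echo "Missing $CKPT_FILE: download model.ckpt manually and place it here."
fi
```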

📺 Inference

Please run accelerate config first to configure the accelerator settings. If you are not familiar with the accelerator configuration, refer to the VADER-VideoCrafter documentation.

Assuming you are in the VADER/ directory, you can run inference with the following commands:

cd VADER-VideoCrafter
sh scripts/run_text2video_inference.sh

We have tested on PyTorch 2.3.0 and CUDA 12.1. The inference script works on a single GPU with 16 GB of VRAM when val_batch_size=1 and fp16 mixed precision is used. It should also work with recent PyTorch and CUDA versions.
VADER/VADER-VideoCrafter/scripts/main/train_t2v_lora.py is the script for running inference with VideoCrafter2 using VADER via LoRA.
- Most of the arguments are the same as for training. The main difference is that --inference_only should be set to True.
- --lora_ckpt_path should be set to the path of the pretrained LoRA model. Specifically, if --lora_ckpt_path is set to 'huggingface-pickscore' or 'huggingface-hps-aesthetic', the pretrained LoRA model is downloaded from the respective Hugging Face model hub, VADER_VideoCrafter_PickScore or VADER_VideoCrafter_HPS_Aesthetic. Otherwise, the LoRA model is loaded from the path you provide. If you do not provide --lora_ckpt_path, the original VideoCrafter2 model is used for inference. Note that 'huggingface-pickscore' requires --lora_rank 16, whereas 'huggingface-hps-aesthetic' requires --lora_rank 8.
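
To keep the checkpoint alias and the LoRA rank consistent, the pairing described above can be captured in a small helper. This is a hedged sketch: `lora_rank_for` is a hypothetical function, launching through `accelerate launch` is an assumption, and the command is only printed because the other arguments the script needs are omitted here.

```shell
# Hypothetical helper (not part of the repository): map a pretrained-weights
# alias to the LoRA rank it was trained with, as stated above.
lora_rank_for() {
  case "$1" in
    huggingface-pickscore)     echo 16 ;;
    huggingface-hps-aesthetic) echo 8 ;;
    *) echo "unknown alias: $1" >&2; return 1 ;;
  esac
}

LORA_CKPT="huggingface-pickscore"
LORA_RANK="$(lora_rank_for "$LORA_CKPT")" || exit 1

# Assumption: the script is launched through `accelerate launch`; the other
# arguments it needs are omitted. The command is printed, not executed.
echo accelerate launch scripts/main/train_t2v_lora.py \
  --inference_only True \
  --lora_ckpt_path "$LORA_CKPT" \
  --lora_rank "$LORA_RANK"
```

For a complete, working invocation, scripts/run_text2video_inference.sh remains the reference.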

🔧 Training

Please run accelerate config first to configure the accelerator settings. If you are not familiar with the accelerator configuration, refer to the VADER-VideoCrafter documentation.

Assuming you are in the VADER/ directory, you can train the model with the following commands:

cd VADER-VideoCrafter
sh scripts/run_text2video_train.sh

Our experiments were conducted on PyTorch 2.3.0 and CUDA 12.1 using 4 A6000 GPUs (48 GB of VRAM each). It should also work with recent PyTorch and CUDA versions. The training script has been tested on a single GPU with 16 GB of VRAM with train_batch_size=1, val_batch_size=1, and fp16 mixed precision.
VADER/VADER-VideoCrafter/scripts/main/train_t2v_lora.py is also the script for fine-tuning VideoCrafter2 using VADER via LoRA.
- You can read the VADER-VideoCrafter documentation to understand the usage of arguments.

🎬 VADER-Open-Sora

⚙️ Installation

Assuming you are in the VADER/ directory, you can create a Conda environment for VADER-Open-Sora with the following commands:

cd VADER-Open-Sora
conda create -n vader_opensora python=3.10
conda activate vader_opensora
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install xformers -c xformers
pip install -v -e .
git clone https://github.com/tgxs002/HPSv2.git
cd HPSv2/
pip install -e .
cd ..

📺 Inference

Please run accelerate config first to configure the accelerator settings. If you are not familiar with the accelerator configuration, refer to the VADER-Open-Sora documentation.

Assuming you are in the VADER/ directory, you can run inference with the following commands:

cd VADER-Open-Sora
sh scripts/run_text2video_inference.sh

We have tested on PyTorch 2.3.0 and CUDA 12.1. At 360p resolution, a GPU with 40 GB of VRAM is required when val_batch_size=1 and bf16 mixed precision is used. It should also work with recent PyTorch and CUDA versions. Please refer to the original Open-Sora repository for more details about the GPU requirements and the model settings.
VADER/VADER-Open-Sora/scripts/train_t2v_lora.py is the script for running inference with Open-Sora 1.2 using VADER.
- '--num-frames', '--resolution', 'fps' and 'aspect-ratio' are inherited from the original Open-Sora model. In short, '--num-frames' can be set to '2s', '4s', '8s', or '16s'. Available values for '--resolution' are '240p', '360p', '480p', and '720p'. The default value of 'fps' is 24 and the default 'aspect-ratio' is 3:4. Please refer to the original Open-Sora repository for more details. One thing to keep in mind: if you set '--num-frames' to '2s' and '--resolution' to '240p', for instance, it is better to use bf16 mixed precision instead of fp16. Otherwise, the model may generate noisy videos.
- --prompt-path is the path to the prompt file. Unlike VideoCrafter, we do not provide a prompt function for Open-Sora. Instead, you provide a prompt file containing a list of prompts.
- --num-processes is the number of processes for Accelerator. We recommend setting it to the number of GPUs.
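
The value constraints listed above can be checked before launching a run. The snippet below is a minimal sketch: `valid_num_frames` and `valid_resolution` are hypothetical helpers that only encode the values named in this README, and Open-Sora itself may accept others.

```shell
# Hypothetical helpers (not part of the repository): check values against the
# options listed above before launching a run.
valid_num_frames() {
  case "$1" in 2s|4s|8s|16s) return 0 ;; *) return 1 ;; esac
}
valid_resolution() {
  case "$1" in 240p|360p|480p|720p) return 0 ;; *) return 1 ;; esac
}

NUM_FRAMES="2s"
RESOLUTION="240p"

if valid_num_frames "$NUM_FRAMES" && valid_resolution "$RESOLUTION"; then
  echo "OK: --num-frames $NUM_FRAMES --resolution $RESOLUTION"
else
  echo "Unsupported --num-frames or --resolution value" >&2
fi

# The note above recommends bf16 over fp16 for this combination.
if [ "$NUM_FRAMES" = "2s" ] && [ "$RESOLUTION" = "240p" ]; then
  echo "Hint: prefer bf16 mixed precision for this setting"
fi
```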
VADER/VADER-Open-Sora/configs/opensora-v1-2/vader/vader_inferece.py is the configuration file for inference. You can modify the configuration file to change
