# CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
<a href='https://cococozibojia.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/pdf/2403.12035'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a>
Bojia Zi<sup>1</sup>, Shihao Zhao<sup>2</sup>, Xianbiao Qi<sup>*5</sup>, Jianan Wang<sup>4</sup>, Yukai Shi<sup>3</sup>, Qianyu Chen<sup>1</sup>, Bin Liang<sup>1</sup>, Rong Xiao<sup>5</sup>, Kam-Fai Wong<sup>1</sup>, Lei Zhang<sup>4</sup>
\* denotes the corresponding author.
This is the inference code for our paper CoCoCo.
<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/COCOCO.PNG" alt="COCOCO" style="width: 100%;"/> </p> <table> <tr> <td><img src="__asset__/sea_org.gif"></td> <td><img src="__asset__/sea1.gif"></td> <td><img src="__asset__/sea2.gif"></td> </tr> <tr> <td> Original </td> <td> The ocean, the waves ... </td> <td> The ocean, the waves ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/river_org.gif"></td> <td><img src="__asset__/river1.gif"></td> <td><img src="__asset__/river2.gif"></td> </tr> <tr> <td> Original </td> <td> The river with ice ... </td> <td> The river with ice ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/sky_org.gif"></td> <td><img src="__asset__/sky1.gif"></td> <td><img src="__asset__/sky2.gif"></td> </tr> <tr> <td> Original </td> <td> Meteor streaking in the sky ... </td> <td> Meteor streaking in the sky ... </td> </tr> </table>
## Features
- Consistent text-guided video inpainting
  - Damped attention yields temporally consistent, high-quality inpainted content.
- Higher text controllability
  - Inpainted content follows the text prompt more faithfully.
- Personalized video inpainting
  - A training-free method composes personalized T2Is into a personalized video inpainting model.
- Gradio demo using SAM2
  - We use SAM2 to build a Video Inpaint Anything demo.
- Infinite video inpainting
  - A sliding window lets you inpaint videos of any length.
- Controllable video inpainting
  - Composed with a ControlNet, the model can inpaint controllable content into the given masked region.
- More inpainting tricks will be released soon...
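To make the sliding-window feature concrete, here is a small sketch of how a long video can be split into overlapping windows that each fit the model; the window and stride sizes are illustrative assumptions, not the repo's actual values:

```python
# Sketch of the sliding-window idea for arbitrarily long videos: split
# N frames into overlapping chunks; overlapping frames can later be
# blended for temporal consistency. Window/stride values are assumptions.
def sliding_windows(num_frames, window=16, stride=8):
    """Return (start, end) frame ranges covering the whole video."""
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    # Append a final window if the regular stride leaves a tail uncovered.
    if not starts or starts[-1] + window < num_frames:
        starts.append(max(num_frames - window, 0))
    return [(s, min(s + window, num_frames)) for s in starts]
```

For example, a 20-frame video with this scheme yields the windows `(0, 16)` and `(4, 20)`, which overlap on frames 4-15.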
## Installation
### Step 1. Installation Checklist
Before installing the dependencies, check the following requirements to avoid installation failures.
- [x] You have a GPU with at least 24 GB of memory.
- [x] Your CUDA (nvcc) version is 12.0 or higher.
- [x] Your PyTorch version is 2.4 or higher.
- [x] Your gcc version is 9.4 or higher.
- [x] Your diffusers version is exactly 0.11.1.
- [x] Your gradio version is exactly 3.40.0.
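A small script can verify the package items on this checklist before you install anything else. The helper below is our own sketch (not part of the repo); it compares installed versions against the pinned or minimum versions listed above:

```python
# Sketch: check installed package versions against the checklist.
from importlib import metadata

def check_versions(requirements):
    """requirements: list of (package, op, version) with op in {'>=', '=='}.
    Returns a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, op, want in requirements:
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Compare numeric version components only (ignores local tags
        # such as the "+cu121" suffix on some PyTorch builds).
        have_t = tuple(int(p) for p in have.split(".") if p.isdigit())
        want_t = tuple(int(p) for p in want.split("."))
        ok = have_t >= want_t if op == ">=" else have_t[:len(want_t)] == want_t
        if not ok:
            problems.append(f"{name}: found {have}, need {op} {want}")
    return problems

# The package items from the checklist above.
REQUIRED = [("torch", ">=", "2.4"),
            ("diffusers", "==", "0.11.1"),
            ("gradio", "==", "3.40.0")]
```

Running `check_versions(REQUIRED)` and fixing any reported problems before moving on surfaces version mismatches early.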
### Step 2. Install the requirements
Once your environment meets the checklist, install the dependencies with pip:
```
# Install the CoCoCo dependencies
pip3 install -r requirements.txt

# Compile SAM2
pip3 install -e .
```
If everything goes well, you can move on to the next steps.
## Usage
### 1. Download pretrained models
Note that our method requires the weights of both the Stable Diffusion 1.5 inpainting model and CoCoCo.
- The pretrained image inpainting model (Stable Diffusion Inpainting).
- The CoCoCo checkpoints.
- Warning: runwayml has removed its models and weights, so the image inpainting model must be downloaded from another URL.
- After downloading, put the two models into two folders: the image inpainting folder should contain `scheduler`, `tokenizer`, `text_encoder`, `vae`, and `unet`; the CoCoCo folder should contain `model_0.pth` to `model_3.pth`.
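A quick way to catch path mistakes before running anything is to verify the two folders against the layout described above. The helper below is our own sketch for illustration, not part of the repo:

```python
import os

# Sketch: check the two weight folders against the expected layout.
# The SD inpainting folder needs scheduler/tokenizer/text_encoder/vae/unet
# subfolders; the CoCoCo folder needs model_0.pth ... model_3.pth.
def check_layout(sd_dir, cococo_dir):
    missing = [s for s in ("scheduler", "tokenizer", "text_encoder", "vae", "unet")
               if not os.path.isdir(os.path.join(sd_dir, s))]
    missing += [f"model_{i}.pth" for i in range(4)
                if not os.path.isfile(os.path.join(cococo_dir, f"model_{i}.pth"))]
    return missing  # an empty list means the layout looks correct
```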
### 2. Prepare the mask
~~You can obtain masks with GroundingDINO or Track-Anything, or draw them yourself.~~

We release a Gradio demo that uses SAM2 to implement Video Inpainting Anything. Try our demo!
<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/DEMO.PNG" alt="DEMO" style="width: 95%;"/> </p>

### 3. Run our validation script
Running this script produces the video inpainting results:
```
python3 valid_code_release.py --config ./configs/code_release.yaml \
--prompt "Trees. Snow mountains. best quality." \
--negative_prompt "worst quality. bad quality." \
--guidance_scale 10 \ # the CFG scale; higher means stronger text controllability
--video_path ./images/ \ # the path storing the video and masks as images.npy and masks.npy
--model_path [cococo_folder_name] \ # the path to the CoCoCo weights, e.g. ./cococo_weights
--pretrain_model_path [sd_folder_name] \ # the path storing the pretrained stable inpainting model, e.g. ./stable-diffusion-v1-5-inpainting
--sub_folder unet # the subfolder of the pretrained stable inpainting model containing the UNet checkpoint
```
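The script reads its input from the `images.npy` / `masks.npy` pair under `--video_path`. A sketch for writing that pair from a list of frames is below; the exact shapes, dtypes, and mask polarity are assumptions (frames as uint8 RGB, masks as binary maps with 1 marking the region to inpaint), so check the repo's data-loading code before relying on them:

```python
import numpy as np

# Sketch: save video frames and per-frame masks in the assumed
# images.npy / masks.npy format consumed by valid_code_release.py.
def save_video_inputs(frames, masks, out_dir="."):
    frames = np.stack(frames).astype(np.uint8)      # (T, H, W, 3), uint8 RGB
    masks = (np.stack(masks) > 0).astype(np.uint8)  # (T, H, W), values in {0, 1}
    assert frames.shape[:3] == masks.shape, "frames and masks must align per pixel"
    np.save(f"{out_dir}/images.npy", frames)
    np.save(f"{out_dir}/masks.npy", masks)
    return frames.shape, masks.shape
```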
### 4. Personalized Video Inpainting (Optional)
We provide a way for users to compose their own personalized video inpainting model from personalized T2Is WITHOUT TRAINING. There are three steps in total:
1. Convert the open-source model to PyTorch weights.
2. Transform the personalized image diffusion model into a personalized inpainting diffusion model: subtract the SD1.5 weights from the personalized image diffusion weights, then add the difference to the inpainting model. Surprisingly, this yields a personalized image inpainting model that works well :)
3. Add the weights of the personalized inpainting model to our CoCoCo.
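The steps above boil down to simple "task vector" weight arithmetic: the personalization delta is the difference between the personalized T2I and vanilla SD1.5, and that delta is added onto the inpainting weights. A minimal sketch, with plain dicts of numbers standing in for PyTorch state dicts (key names and the `scale` knob are illustrative):

```python
# Sketch of the weight arithmetic behind steps 2 and 3:
#   new_inpaint = inpaint + scale * (personalized - base_SD1.5)
# Only keys present in all three models are merged; the rest pass through.
def apply_task_vector(inpaint_sd, personalized_sd, base_sd, scale=1.0):
    merged = dict(inpaint_sd)
    for k in inpaint_sd:
        if k in personalized_sd and k in base_sd:
            merged[k] = inpaint_sd[k] + scale * (personalized_sd[k] - base_sd[k])
    return merged
```

The same arithmetic applies per-tensor to real state dicts; `scale` below 1.0 would dilute the personalization.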
#### Convert safetensors to PyTorch weights
- For models that use different keys, we use the following script to process the open-source T2I model. For example, epiCRealism uses keys that differ from Stable Diffusion's:

  ```
  model.diffusion_model.input_blocks.1.1.norm.bias
  model.diffusion_model.input_blocks.1.1.norm.weight
  ```

  Therefore, we developed a tool to convert this type of model into a weight delta:

  ```
  cd task_vector
  python3 convert.py \
  --tensor_path [safetensor_path] \ # set the safetensor path
  --unet_path [unet_path] \ # set the path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
  --text_encoder_path [text_encoder_path] \ # set the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
  --vae_path [vae_path] \ # set the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
  --source_path ./resources \ # the path holding preliminary files, e.g. ./resources
  --target_path ./resources \ # the path holding preliminary files, e.g. ./resources
  --target_prefix [prefix] # set the filename prefix for the converted weights
  ```
- For models that use the same keys but are trained with LoRA, e.g. the Ghibli LoRA, the keys look like:

  ```
  lora_unet_up_blocks_3_resnets_0_conv1.lora_down.weight
  lora_unet_up_blocks_3_resnets_0_conv1.lora_up.weight
  ```

  ```
  python3 convert_lora.py \
  --tensor_path [tensor_path] \ # the safetensor path
  --unet_path [unet_path] \ # set the path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
  --text_encoder_path [text_encoder_path] \ # set the text encoder path,
  ```
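The `lora_down` / `lora_up` key pair above reflects LoRA's low-rank structure: `lora_down` projects into a small rank and `lora_up` projects back, so folding a pair into the base weight is plain matrix arithmetic. A NumPy sketch (the `alpha` scaling convention is an assumption; many LoRA files carry a separate alpha that is divided by the rank):

```python
import numpy as np

# Sketch: fold one LoRA pair back into its base weight matrix,
#   W' = W + alpha * (lora_up @ lora_down)
# where lora_down is (rank, in_dim) and lora_up is (out_dim, rank).
def merge_lora(weight, lora_down, lora_up, alpha=1.0):
    return weight + alpha * (lora_up @ lora_down)
```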
