# CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
<a href='https://cococozibojia.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/pdf/2403.12035'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a>
Bojia Zi<sup>1</sup>, Shihao Zhao<sup>2</sup>, Xianbiao Qi<sup>*5</sup>, Jianan Wang<sup>4</sup>, Yukai Shi<sup>3</sup>, Qianyu Chen<sup>1</sup>, Bin Liang<sup>1</sup>, Rong Xiao<sup>5</sup>, Kam-Fai Wong<sup>1</sup>, Lei Zhang<sup>4</sup>
\* denotes the corresponding author.
This is the inference code for our paper CoCoCo.
<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/COCOCO.PNG" alt="COCOCO" style="width: 100%;"/> </p> <table> <tr> <td><img src="__asset__/sea_org.gif"></td> <td><img src="__asset__/sea1.gif"></td> <td><img src="__asset__/sea2.gif"></td> </tr> <tr> <td> Original </td> <td> The ocean, the waves ... </td> <td> The ocean, the waves ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/river_org.gif"></td> <td><img src="__asset__/river1.gif"></td> <td><img src="__asset__/river2.gif"></td> </tr> <tr> <td> Original </td> <td> The river with ice ... </td> <td> The river with ice ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/sky_org.gif"></td> <td><img src="__asset__/sky1.gif"></td> <td><img src="__asset__/sky2.gif"></td> </tr> <tr> <td> Original </td> <td> Meteor streaking in the sky ... </td> <td> Meteor streaking in the sky ... </td> </tr> </table>
## Features
- Consistent text-guided video inpainting
  - Damped attention yields temporally consistent, high-quality inpainted content.
- Higher text controllability
  - Inpainted content follows the text prompt more faithfully.
- Personalized video inpainting
  - A training-free method composes personalized T2Is into a personalized video inpainting model.
- Gradio demo using SAM2
  - We use SAM2 to build a Video Inpaint Anything demo.
- Infinite video inpainting
  - A sliding window lets you inpaint videos of any length.
- Controllable video inpainting
  - Composed with a ControlNet, the model can inpaint controllable content into the given masked region.
- More inpainting tricks will be released soon...
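To make the sliding-window feature concrete, here is a small sketch of how a long video can be split into overlapping windows that each fit the model; the window and stride sizes are illustrative assumptions, not the repo's actual values:

```python
# Sketch of the sliding-window idea for arbitrarily long videos: split
# N frames into overlapping chunks; overlapping frames can later be
# blended for temporal consistency. Window/stride values are assumptions.
def sliding_windows(num_frames, window=16, stride=8):
    """Return (start, end) frame ranges covering the whole video."""
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    # Append a final window if the regular stride leaves a tail uncovered.
    if not starts or starts[-1] + window < num_frames:
        starts.append(max(num_frames - window, 0))
    return [(s, min(s + window, num_frames)) for s in starts]
```

For example, a 20-frame video with this scheme yields the windows `(0, 16)` and `(4, 20)`, which overlap on frames 4-15.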
## Installation
### Step 1. Installation Checklist
Before installing the dependencies, check the following requirements to avoid installation failures.
- [x] You have a GPU with at least 24 GB of memory.
- [x] Your CUDA (nvcc) version is 12.0 or higher.
- [x] Your PyTorch version is 2.4 or higher.
- [x] Your gcc version is 9.4 or higher.
- [x] Your diffusers version is exactly 0.11.1.
- [x] Your gradio version is exactly 3.40.0.
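A small script can verify the package items on this checklist before you install anything else. The helper below is our own sketch (not part of the repo); it compares installed versions against the pinned or minimum versions listed above:

```python
# Sketch: check installed package versions against the checklist.
from importlib import metadata

def check_versions(requirements):
    """requirements: list of (package, op, version) with op in {'>=', '=='}.
    Returns a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, op, want in requirements:
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Compare numeric version components only (ignores local tags
        # such as the "+cu121" suffix on some PyTorch builds).
        have_t = tuple(int(p) for p in have.split(".") if p.isdigit())
        want_t = tuple(int(p) for p in want.split("."))
        ok = have_t >= want_t if op == ">=" else have_t[:len(want_t)] == want_t
        if not ok:
            problems.append(f"{name}: found {have}, need {op} {want}")
    return problems

# The package items from the checklist above.
REQUIRED = [("torch", ">=", "2.4"),
            ("diffusers", "==", "0.11.1"),
            ("gradio", "==", "3.40.0")]
```

Running `check_versions(REQUIRED)` and fixing any reported problems before moving on surfaces version mismatches early.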
### Step 2. Install the requirements
Once your environment meets the checklist, install the dependencies with pip:
```
# Install the CoCoCo dependencies
pip3 install -r requirements.txt

# Compile SAM2
pip3 install -e .
```
If everything goes well, you can move on to the next steps.
## Usage
### 1. Download pretrained models
Note that our method requires the weights of both the Stable Diffusion 1.5 inpainting model and CoCoCo.
- The pretrained image inpainting model (Stable Diffusion Inpainting).
- The CoCoCo checkpoints.
- Warning: runwayml has removed its models and weights, so the image inpainting model must be downloaded from another URL.
- After downloading, put the two models into two folders: the image inpainting folder should contain `scheduler`, `tokenizer`, `text_encoder`, `vae`, and `unet`; the CoCoCo folder should contain `model_0.pth` to `model_3.pth`.
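A quick way to catch path mistakes before running anything is to verify the two folders against the layout described above. The helper below is our own sketch for illustration, not part of the repo:

```python
import os

# Sketch: check the two weight folders against the expected layout.
# The SD inpainting folder needs scheduler/tokenizer/text_encoder/vae/unet
# subfolders; the CoCoCo folder needs model_0.pth ... model_3.pth.
def check_layout(sd_dir, cococo_dir):
    missing = [s for s in ("scheduler", "tokenizer", "text_encoder", "vae", "unet")
               if not os.path.isdir(os.path.join(sd_dir, s))]
    missing += [f"model_{i}.pth" for i in range(4)
                if not os.path.isfile(os.path.join(cococo_dir, f"model_{i}.pth"))]
    return missing  # an empty list means the layout looks correct
```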
### 2. Prepare the mask
~~You can obtain masks with GroundingDINO or Track-Anything, or draw them yourself.~~

We release a Gradio demo that uses SAM2 to implement Video Inpainting Anything. Try our demo!
<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/DEMO.PNG" alt="DEMO" style="width: 95%;"/> </p>

### 3. Run our validation script
Running this script produces the video inpainting results:
```
python3 valid_code_release.py --config ./configs/code_release.yaml \
--prompt "Trees. Snow mountains. best quality." \
--negative_prompt "worst quality. bad quality." \
--guidance_scale 10 \ # the CFG scale; higher means stronger text controllability
--video_path ./images/ \ # the path storing the video and masks as images.npy and masks.npy
--model_path [cococo_folder_name] \ # the path to the CoCoCo weights, e.g. ./cococo_weights
--pretrain_model_path [sd_folder_name] \ # the path storing the pretrained stable inpainting model, e.g. ./stable-diffusion-v1-5-inpainting
--sub_folder unet # the subfolder of the pretrained stable inpainting model containing the UNet checkpoint
```
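The script reads its input from the `images.npy` / `masks.npy` pair under `--video_path`. A sketch for writing that pair from a list of frames is below; the exact shapes, dtypes, and mask polarity are assumptions (frames as uint8 RGB, masks as binary maps with 1 marking the region to inpaint), so check the repo's data-loading code before relying on them:

```python
import numpy as np

# Sketch: save video frames and per-frame masks in the assumed
# images.npy / masks.npy format consumed by valid_code_release.py.
def save_video_inputs(frames, masks, out_dir="."):
    frames = np.stack(frames).astype(np.uint8)      # (T, H, W, 3), uint8 RGB
    masks = (np.stack(masks) > 0).astype(np.uint8)  # (T, H, W), values in {0, 1}
    assert frames.shape[:3] == masks.shape, "frames and masks must align per pixel"
    np.save(f"{out_dir}/images.npy", frames)
    np.save(f"{out_dir}/masks.npy", masks)
    return frames.shape, masks.shape
```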
### 4. Personalized Video Inpainting (Optional)
We provide a way for users to compose their own personalized video inpainting model from personalized T2Is WITHOUT TRAINING. There are three steps in total:
1. Convert the open-source model to PyTorch weights.
2. Transform the personalized image diffusion model into a personalized inpainting diffusion model: subtract the SD1.5 weights from the personalized image diffusion weights, then add the difference to the inpainting model. Surprisingly, this yields a personalized image inpainting model that works well :)
3. Add the weights of the personalized inpainting model to our CoCoCo.
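The steps above boil down to simple "task vector" weight arithmetic: the personalization delta is the difference between the personalized T2I and vanilla SD1.5, and that delta is added onto the inpainting weights. A minimal sketch, with plain dicts of numbers standing in for PyTorch state dicts (key names and the `scale` knob are illustrative):

```python
# Sketch of the weight arithmetic behind steps 2 and 3:
#   new_inpaint = inpaint + scale * (personalized - base_SD1.5)
# Only keys present in all three models are merged; the rest pass through.
def apply_task_vector(inpaint_sd, personalized_sd, base_sd, scale=1.0):
    merged = dict(inpaint_sd)
    for k in inpaint_sd:
        if k in personalized_sd and k in base_sd:
            merged[k] = inpaint_sd[k] + scale * (personalized_sd[k] - base_sd[k])
    return merged
```

The same arithmetic applies per-tensor to real state dicts; `scale` below 1.0 would dilute the personalization.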
#### Convert safetensors to PyTorch weights
- For models that use different keys, we use the following script to process the open-source T2I model. For example, epiCRealism uses keys that differ from Stable Diffusion's:

  ```
  model.diffusion_model.input_blocks.1.1.norm.bias
  model.diffusion_model.input_blocks.1.1.norm.weight
  ```

  Therefore, we developed a tool to convert this type of model into a weight delta:

  ```
  cd task_vector
  python3 convert.py \
  --tensor_path [safetensor_path] \ # set the safetensor path
  --unet_path [unet_path] \ # set the path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
  --text_encoder_path [text_encoder_path] \ # set the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
  --vae_path [vae_path] \ # set the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
  --source_path ./resources \ # the path holding preliminary files, e.g. ./resources
  --target_path ./resources \ # the path holding preliminary files, e.g. ./resources
  --target_prefix [prefix] # set the filename prefix for the converted weights
  ```
- For models that use the same keys but are trained with LoRA, e.g. the Ghibli LoRA, the keys look like:

  ```
  lora_unet_up_blocks_3_resnets_0_conv1.lora_down.weight
  lora_unet_up_blocks_3_resnets_0_conv1.lora_up.weight
  ```

  ```
  python3 convert_lora.py \
  --tensor_path [tensor_path] \ # the safetensor path
  --unet_path [unet_path] \ # set the path to the SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
  --text_encoder_path [text_encoder_path] \ # set the text encoder path,
  ```
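The `lora_down` / `lora_up` key pair above reflects LoRA's low-rank structure: `lora_down` projects into a small rank and `lora_up` projects back, so folding a pair into the base weight is plain matrix arithmetic. A NumPy sketch (the `alpha` scaling convention is an assumption; many LoRA files carry a separate alpha that is divided by the rank):

```python
import numpy as np

# Sketch: fold one LoRA pair back into its base weight matrix,
#   W' = W + alpha * (lora_up @ lora_down)
# where lora_down is (rank, in_dim) and lora_up is (out_dim, rank).
def merge_lora(weight, lora_down, lora_up, alpha=1.0):
    return weight + alpha * (lora_up @ lora_down)
```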
