
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

<a href='https://cococozibojia.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/pdf/2403.12035'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a>

Bojia Zi<sup>1</sup>, Shihao Zhao<sup>2</sup>, Xianbiao Qi<sup>*5</sup>, Jianan Wang<sup>4</sup>, Yukai Shi<sup>3</sup>, Qianyu Chen<sup>1</sup>, Bin Liang<sup>1</sup>, Rong Xiao<sup>5</sup>, Kam-Fai Wong<sup>1</sup>, Lei Zhang<sup>4</sup>

<sup>*</sup> denotes the corresponding author.

This is the inference code for our paper CoCoCo.

<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/COCOCO.PNG" alt="COCOCO" style="width: 100%;"/> </p> <table> <tr> <td><img src="__asset__/sea_org.gif"></td> <td><img src="__asset__/sea1.gif"></td> <td><img src="__asset__/sea2.gif"></td> </tr> <tr> <td> Original </td> <td> The ocean, the waves ... </td> <td> The ocean, the waves ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/river_org.gif"></td> <td><img src="__asset__/river1.gif"></td> <td><img src="__asset__/river2.gif"></td> </tr> <tr> <td> Original </td> <td> The river with ice ... </td> <td> The river with ice ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/sky_org.gif"></td> <td><img src="__asset__/sky1.gif"></td> <td><img src="__asset__/sky2.gif"></td> </tr> <tr> <td> Original </td> <td> Meteor streaking in the sky ... </td> <td> Meteor streaking in the sky ... </td> </tr> </table>

Features

  • Consistent text-guided video inpainting
    • Damped attention produces coherent inpainted visual content
  • Higher text controllability
    • The inpainted content follows the text prompt more faithfully
  • Personalized video inpainting
    • A training-free method that builds a personalized video inpainting model from personalized T2Is
  • Gradio demo using SAM2
    • We use SAM2 to build Video Inpaint Anything
  • Infinite video inpainting
    • A sliding window lets you inpaint videos of any length
  • Controllable video inpainting
    • By composing with ControlNet, we can inpaint controllable content in the given masked region
  • More inpainting tricks will be released soon...
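
The sliding-window idea above can be sketched as follows; the window and stride values here are illustrative, not the ones CoCoCo actually uses:

```python
def sliding_windows(num_frames, window=16, stride=8):
    """Yield (start, end) frame ranges that cover a video of any length
    with overlapping windows; overlapping regions can then be blended."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Make sure the final window reaches the last frame.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

print(sliding_windows(40))  # [(0, 16), (8, 24), (16, 32), (24, 40)]
```

Each window is inpainted independently, and the overlap between consecutive windows keeps the result temporally consistent at the seams.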

Installation

Step 1. Installation checklist

Before installing the dependencies, check the following requirements to avoid installation failures.

  • [x] You have a GPU with at least 24 GB of memory.
  • [x] Your CUDA (nvcc) version is 12.0 or higher.
  • [x] Your PyTorch version is 2.4 or higher.
  • [x] Your gcc version is 9.4 or higher.
  • [x] Your diffusers version is exactly 0.11.1.
  • [x] Your gradio version is exactly 3.40.0.
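
A quick way to verify the versions above before installing; the comparison helper below is a generic sketch, not part of the repo:

```python
def version_at_least(installed: str, required: str) -> bool:
    """Numerically compare dotted version strings (pre-release tags ignored)."""
    def parse(v):
        # Drop local suffixes like "+cu121" and non-numeric parts like "0a0".
        return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())
    return parse(installed) >= parse(required)

# Example checks mirroring the checklist (run inside your environment):
# import torch, diffusers, gradio
# assert version_at_least(torch.__version__, "2.4")
# assert diffusers.__version__ == "0.11.1"
# assert gradio.__version__ == "3.40.0"
print(version_at_least("2.4.1", "2.4"))  # True
```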

Step 2. Install the requirements

Once your environment meets the checklist, install the dependencies with pip.

# Install the CoCoCo dependencies
pip3 install -r requirements.txt
# Compile the SAM2
pip3 install -e .

If everything goes well, you can proceed to the next steps.

Usage

1. Download pretrained models.

Note that our method requires both the Stable Diffusion 1.5 inpainting weights and the CoCoCo weights.

  • The pretrained image inpainting model (Stable Diffusion Inpainting).

  • The CoCoCo checkpoints.

  • Warning: runwayml has removed its models and weights, so the image inpainting model must be downloaded from another source.

  • After downloading, put the two models in separate folders: the image inpainting folder should contain scheduler, tokenizer, text_encoder, vae, and unet; the CoCoCo folder should contain model_0.pth to model_3.pth.
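
A small sanity check for the layout described above can save a failed run later; this helper is a sketch, assuming the CoCoCo weight files are named model_0.pth through model_3.pth:

```python
import os

def check_layout(sd_dir: str, cococo_dir: str) -> list[str]:
    """Return a list of missing paths; an empty list means the layout looks right."""
    missing = []
    # The SD inpainting folder needs these diffusers-style subfolders.
    for sub in ["scheduler", "tokenizer", "text_encoder", "vae", "unet"]:
        path = os.path.join(sd_dir, sub)
        if not os.path.isdir(path):
            missing.append(path)
    # The CoCoCo folder needs the four checkpoint shards.
    for i in range(4):
        path = os.path.join(cococo_dir, f"model_{i}.pth")
        if not os.path.isfile(path):
            missing.append(path)
    return missing
```

Call it as `check_layout("./stable-diffusion-v1-5-inpainting", "./cococo_weights")` (paths are examples) and fix anything it reports before running inference.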

2. Prepare the mask

~~You can obtain mask by GroundingDINO or Track-Anything, or draw masks by yourself.~~

We release a Gradio demo that uses SAM2 to implement Video Inpaint Anything. Try our demo!

<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/DEMO.PNG" alt="DEMO" style="width: 95%;"/> </p>

3. Run our validation script.

Running this script produces the video inpainting results.

# --guidance_scale: CFG scale; higher means stronger text controllability
# --video_path: directory containing the video and masks as images.npy and masks.npy
# --model_path: path to the CoCoCo weights, e.g. ./cococo_weights
# --pretrain_model_path: path to the pretrained Stable Diffusion inpainting model,
#                        e.g. ./stable-diffusion-v1-5-inpainting
# --sub_folder: subfolder of the pretrained model that holds the UNet checkpoint
python3 valid_code_release.py --config ./configs/code_release.yaml \
  --prompt "Trees. Snow mountains. best quality." \
  --negative_prompt "worst quality. bad quality." \
  --guidance_scale 10 \
  --video_path ./images/ \
  --model_path [cococo_folder_name] \
  --pretrain_model_path [sd_folder_name] \
  --sub_folder unet
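
Since `--video_path` expects `images.npy` and `masks.npy`, packing your own frames might look like the sketch below; the array layout (`(T, H, W, 3)` uint8 images, `(T, H, W, 1)` binary masks) is an assumption, so verify it against the shipped example in `./images/`:

```python
import numpy as np

def pack_video(frames, masks, out_dir="."):
    """frames: list of (H, W, 3) uint8 arrays; masks: list of (H, W) binary arrays.
    Saves images.npy with shape (T, H, W, 3) and masks.npy with shape (T, H, W, 1)."""
    images = np.stack(frames).astype(np.uint8)
    packed_masks = np.stack([np.asarray(m)[..., None] for m in masks]).astype(np.uint8)
    assert images.shape[0] == packed_masks.shape[0], "frame/mask count mismatch"
    np.save(f"{out_dir}/images.npy", images)
    np.save(f"{out_dir}/masks.npy", packed_masks)
    return images.shape, packed_masks.shape
```

Load the frames with any image library (e.g. Pillow), pass them in temporal order, and point `--video_path` at `out_dir`.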

4. Personalized Video Inpainting (Optional)

We provide a method that lets users compose their own personalized video inpainting model from personalized T2Is WITHOUT TRAINING. There are three steps in total:

  • Convert the open-source model to PyTorch weights.

  • Transform the personalized image diffusion model into a personalized inpainting model: compute the weight delta between the personalized image diffusion model and SD1.5, then add that delta to the inpainting model. Surprisingly, this yields a personalized image inpainting model that works well :)

  • Add the weights of the personalized inpainting model to our CoCoCo.
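
The delta-and-add step reduces to simple arithmetic over matching checkpoint keys. A minimal sketch of this task-vector style merge, with numpy arrays standing in for checkpoint tensors (the real script is task_vector/convert.py):

```python
import numpy as np

def merge_personalized_inpainting(personalized, base_sd, inpainting):
    """For every key shared by all three checkpoints:
    delta = personalized - base SD1.5, then merged = inpainting + delta.
    Keys present in only one checkpoint pass through unchanged."""
    merged = dict(inpainting)
    for key in personalized:
        if key in base_sd and key in inpainting:
            merged[key] = inpainting[key] + (personalized[key] - base_sd[key])
    return merged
```

Because the personalization lives entirely in the delta, no retraining is needed; the inpainting-specific layers stay untouched.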

<table> <tr> <td><img src="__asset__/gibuli_lora_org.gif"></td> <td><img src="__asset__/gibuli_merged1.gif"></td> <td><img src="__asset__/gibuli_merged2.gif"></td> </tr> </table> <table> <tr> <td><img src="__asset__/unmbrella_org.gif"></td> <td><img src="__asset__/unmbrella1.gif"></td> <td><img src="__asset__/unmbrella2.gif"></td> </tr> </table> <table> <tr> <td><img src="__asset__/gibuli.gif"></td> <td><img src="__asset__/bocchi1.gif"></td> <td><img src="__asset__/bocchi2.gif"></td> </tr> </table>

Convert safetensors to PyTorch weights

  • For models whose checkpoints use a different key naming scheme, use the following script to process the open-source T2I model.

    For example, epiCRealism uses keys that differ from Stable Diffusion's:

    model.diffusion_model.input_blocks.1.1.norm.bias
    model.diffusion_model.input_blocks.1.1.norm.weight
    

    Therefore, we developed a tool to convert this type of model into a weight delta.

    cd task_vector;
    # --tensor_path: the safetensors file to convert
    # --unet_path: path to the SD1.5 inpainting UNet weights,
    #   e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
    # --text_encoder_path: e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
    # --vae_path: e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
    # --source_path / --target_path: folders holding the preliminary files, e.g. ./resources
    # --target_prefix: filename prefix for the converted output
    python3 convert.py \
      --tensor_path [safetensor_path] \
      --unet_path [unet_path] \
      --text_encoder_path [text_encoder_path] \
      --vae_path [vae_path] \
      --source_path ./resources \
      --target_path ./resources \
      --target_prefix [prefix];
    
    
  • For models that use the same keys but are trained with LoRA.

    For example, the Ghibli LoRA:

    lora_unet_up_blocks_3_resnets_0_conv1.lora_down.weight
    lora_unet_up_blocks_3_resnets_0_conv1.lora_up.weight
    
    # --tensor_path: the safetensors file to convert
    # --unet_path: path to the SD1.5 inpainting UNet weights,
    #   e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
    # --text_encoder_path: the text encoder path
    python3 convert_lora.py \
      --tensor_path [tensor_path] \
      --unet_path [unet_path] \
      --text_encoder_path [text_encoder_path] \
    
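
For reference, lora_down / lora_up pairs like the ones above are usually folded into the base weight as W + scale · (up @ down). A minimal numpy sketch; the alpha/rank scaling convention is an assumption, so check convert_lora.py for the exact rule:

```python
import numpy as np

def fold_lora(weight, lora_down, lora_up, alpha=None):
    """Merge one LoRA pair into a base weight matrix.
    weight: (out, in); lora_down: (rank, in); lora_up: (out, rank)."""
    rank = lora_down.shape[0]
    # Common convention: scale the low-rank update by alpha / rank.
    scale = (alpha / rank) if alpha is not None else 1.0
    return weight + scale * (lora_up @ lora_down)
```

Convolution LoRAs store 4-D kernels, so a real converter reshapes them to 2-D before this product and back afterwards; the sketch covers only the linear case.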